SASDb – An integrated resource for South Asian genome variation

The South Asian population, drawn predominantly from the Indian subcontinent, represents a unique blend of genetically homogeneous but socially and geographically stratified groups that can be leveraged to investigate complex diseases and disease phenotypes. Although some studies have been published recently, no attempt has been made to develop a resource by systematically analyzing South Asian genome datasets. In this study, we analyzed 215 whole-genome sequences (WGS), 121 whole-exome sequences (WES) and 324 whole-genome array (WGA) datasets from individuals of five South Asian countries: India, Pakistan, Sri Lanka, Bangladesh and Afghanistan. Additionally, we integrated South Asian variants from the 1000 Genomes Project phase 3 (1KGP) and ExAC. Overall, the integrated database contains 49.7 million variants, of which 44.5 million are SNVs and 5.2 million are short InDels. Interestingly, 5.6 million of these variants were not present in the 1000 Genomes and ExAC databases, both of which include South Asian genomes. Of the total variants in the database, 21.8 million (43.8%) fall inside known genes, and 1 million (0.3%) of these variants were found to be of the coding non-synonymous type. Copy number analysis of the 215 whole-genome samples revealed 50,829 duplications and 67,984 deletions. Analysis of the mitochondrial genome revealed that M was the most common haplogroup (45% of samples), followed by U (20%) and H (13%). The database is searchable by rs-id, position and gene name, and shows allele frequency statistics at three levels: super-population.
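As an illustration of what a per-group allele frequency statistic means, the following minimal Python sketch computes the alternate-allele frequency of a single biallelic variant within each group of samples. It is not the SASDb implementation: the function name `allele_frequencies`, the sample identifiers and the group labels are hypothetical, and diploid genotypes encoded as alternate-allele counts (0, 1 or 2) are assumed.

```python
from collections import defaultdict


def allele_frequencies(genotypes, groups):
    """Alternate-allele frequency per group for one biallelic variant.

    genotypes: sample id -> alternate-allele count (0, 1 or 2),
               or None when the genotype call is missing.
    groups:    sample id -> group label (hypothetical labels).
    """
    alt = defaultdict(int)    # alternate alleles observed per group
    total = defaultdict(int)  # total called alleles per group
    for sample, count in genotypes.items():
        if count is None:
            continue  # missing calls contribute to neither count
        label = groups[sample]
        alt[label] += count
        total[label] += 2  # assumption: diploid sample, two alleles per call
    return {label: alt[label] / total[label] for label in total}


# Hypothetical toy data: four samples in two groups.
genotypes = {"s1": 0, "s2": 1, "s3": 2, "s4": None}
groups = {"s1": "group_A", "s2": "group_A", "s3": "group_B", "s4": "group_B"}
freqs = allele_frequencies(genotypes, groups)
```

Grouping the same genotypes under coarser or finer labels yields frequencies at different levels of population structure. Missing calls are excluded from both the numerator and the denominator, so each frequency reflects only the alleles that were actually called.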