ANN ARBOR—Ancestral background has much to do with our likelihood of developing or staving off disease. But separating genetic variations that merely reflect who we are and where we come from from those that cause disease can be difficult, and confusing the two often produces false leads in genetic studies.
A new statistical method, developed by researchers at the University of Michigan School of Public Health, can help those who study the human genome better identify ancestry as they go about isolating the genes that cause disease.
The LASER (Locating Ancestry from SEquence Reads) software can establish ancestry using very small amounts of sequence data scattered across 1-10 percent of the genome, adding only a few dollars to the cost of a genetic analysis.
"You can use our method to describe the ancestry of an individual very precisely, even separating individuals from different parts of Finland," said Goncalo Abecasis, the Felix E. Moore Collegiate Professor of Biostatistics at U-M. "In studies of genetic diseases, this information helps separate changes that cause disease from more numerous changes that specify ancestry."
A study explaining how the new software program was developed and tested is published online in Nature Genetics.
"Estimation of ancestry was previously challenging in many disease sequencing studies where only a small proportion of the genome is sequenced," said Chaolong Wang, who holds a doctorate in bioinformatics from Michigan and is now a research fellow at the Harvard School of Public Health.
"A major advantage of our approach is that it can use information from sequencing reads in the 'off-target' regions of the genome, which are byproducts of the sequencing experiments and were previously discarded."
To test their method, the team used two reference groups with known ancestry and compared those known ancestries against the software's results. One group was a random sample of 238 individuals from 53 populations worldwide, drawn from the Human Genome Diversity Panel. Researchers used array genotypes at 632,958 loci as templates to simulate sequencing data.
The other group of 385 individuals was derived from the Population Reference Sample, consisting of 37 European populations. For this group, the researchers simulated sequencing data based on genotypes at 318,682 loci.
The team also evaluated the tool with data from the 1000 Genomes Project, using all of the individuals from the Human Genome Diversity Panel as the reference set, and they compared data from 3,159 samples previously sequenced for macular degeneration.
"The accurate ancestry estimates derived from our method allow us to correct for population stratification in genetic disease studies based on sequencing, as well as to carefully match ancestral background when combining genetic data from different sources, increasing our ability to find disease genes," said Wang, who is first author of the paper.
Wang's work was supported by the Howard Hughes Medical Institute International Student Research Fellowship. The study was supported by the U.S. National Institutes of Health and by the National Eye Institute Intramural Research Program.
- Goncalo Abecasis: www.sph.umich.edu/iscr/faculty/profile.cfm?uniqname=goncalo
- Chaolong Wang: www.hsph.harvard.edu/chaolong-wang
- Nature Genetics: www.nature.com/ng/index.html