Correcting model misspecification in relationship estimates

Ethan M Jewett
DOI: https://doi.org/10.1101/2024.05.13.594005
2024-09-04
Abstract: The datasets of large genotyping biobanks and direct-to-consumer genetic testing companies contain many related individuals. Until now, it has been widely accepted that the most distant relationships that can be detected are around fifteen degrees (approximately 8th cousins) and that practical relationship estimates have a ceiling around ten degrees (approximately 5th cousins). However, we show that these assumptions are incorrect and that they are due to a misapplication of relationship estimators. In particular, relationship estimators are applied almost exclusively to putative relatives who have been identified because they share detectable tracts of DNA identically by descent (IBD). However, no existing relationship estimator conditions on the event that two individuals share at least one detectable segment of IBD anywhere in the genome. As a result, the relationship estimates obtained using existing estimators are dramatically biased for distant relationships, inferring all sufficiently distant relationships to be around ten degrees regardless of the depth of the true relationship. Existing relationship estimators are also derived under a model that assumes that each pair of related individuals shares a single common ancestor (or mating pair of ancestors). This model breaks down for relationships beyond 10 generations in the past because individuals share many thousands of cryptic common ancestors due to pedigree collapse. We first derive a corrected likelihood that conditions on the event that at least one segment is observed between a pair of putative relatives, and we demonstrate that the corrected likelihood largely eliminates the bias in estimates of pairwise relationships and provides a more accurate characterization of the uncertainty in these estimates. We then reformulate the relationship inference problem to account for the fact that individuals share many common ancestors, not just one.
We demonstrate that the most distant relationship that can be inferred using IBD may be 100 degrees or more, rather than ten, extending the time-to-common ancestor from approximately 200 years in the past to approximately 1,500 years in the past or more. This dramatic increase in the range of relationship estimators makes it possible to infer relationships whose common ancestors lived before historical events such as European settlement of the Americas and the Transatlantic Slave Trade, and possibly earlier.
Genetics
What problem does this paper attempt to address?
This paper addresses the bias in existing genetic relationship estimators when they are applied to distant relationships. Two modeling assumptions are at fault. First, existing estimators do not condition on the fact that a pair of putative relatives was identified because it shares at least one detectable IBD segment. Second, they assume that each pair of related individuals shares a single common ancestor or a single mating pair of ancestors, which breaks down for relationships beyond about 10 generations because pedigree collapse gives individuals many thousands of cryptic common ancestors.

### Main problem points:

1. **Limitations of the existing model**: Existing relationship estimators are strongly biased for distant relationships because they are not conditioned on the event that the two putative relatives share at least one detectable identical-by-descent (IBD) segment. As a result, they infer all sufficiently distant relationships to be approximately 10 degrees, regardless of the true relationship depth.
2. **Impact of pedigree collapse**: For relationships beyond about 10 generations, individuals share many thousands of cryptic common ancestors rather than one. Existing estimators assume a single common ancestor (or mating pair) and do not account for this.

### Solutions:

1. **Corrected likelihood**: The author derives a corrected likelihood that conditions on the event that at least one IBD segment is observed between the two putative relatives. This largely eliminates the estimation bias and gives a more accurate characterization of the uncertainty in the estimates.
2. **Multi-ancestor model**: The author also reformulates the relationship inference problem to account for the fact that individuals share many common ancestors, not just one.

### Results:

- **Corrected estimator**: The corrected likelihood largely eliminates the bias in pairwise relationship estimates for distant relationships.
- **Extension of distant relationships**: The most distant relationship that can be inferred using IBD may be 100 degrees or more, rather than ten. This extends the time to the common ancestor from approximately 200 years in the past to approximately 1,500 years or more, making it possible to infer relationships whose common ancestors lived before historical events such as European settlement of the Americas and the Transatlantic Slave Trade, and possibly earlier.

### Summary:

The main contribution of this paper is a corrected approach to relationship estimation that handles distant relationships more accurately, extending the range of relationships and the time depth that can be inferred from IBD. This is relevant to genomics research, family tree reconstruction, and medical genetics.
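
The effect of conditioning on at least one detected segment can be illustrated with a toy calculation. The sketch below is not the paper's likelihood. It assumes that the number of detectable IBD segments is Poisson distributed, and it uses a simple approximation for the expected count (a genome length of 35 Morgans, 22 autosomes, a 5 cM detection threshold, and one shared ancestor). All function names and constants are hypothetical choices made for illustration.

```python
import math

# Assumed constants for the toy model (not taken from the paper).
GENOME_MORGANS = 35.0        # approximate autosomal genetic length
N_CHROMOSOMES = 22           # number of autosomes
MIN_SEGMENT_MORGANS = 0.05   # assumed detection threshold (5 cM)


def expected_segments(meioses, ancestors=1):
    """Expected number of detectable IBD segments for a relationship
    separated by `meioses` meioses (toy approximation)."""
    total = ancestors * (GENOME_MORGANS * meioses + N_CHROMOSOMES) / 2 ** (meioses - 1)
    # Segment lengths are assumed exponential with rate `meioses` per Morgan,
    # so this is the fraction of segments longer than the threshold.
    return total * math.exp(-meioses * MIN_SEGMENT_MORGANS)


def log_lik(n, meioses):
    """Unconditional Poisson log-likelihood of observing n segments."""
    lam = expected_segments(meioses)
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def log_lik_conditional(n, meioses):
    """Log-likelihood of n segments given that at least one was observed."""
    if n < 1:
        raise ValueError("conditioning requires at least one observed segment")
    lam = expected_segments(meioses)
    # Divide by P(N >= 1) = 1 - exp(-lam); expm1 keeps this stable for small lam.
    return log_lik(n, meioses) - math.log(-math.expm1(-lam))


def mle(loglik, n, candidates):
    """Candidate number of meioses that maximizes the given log-likelihood."""
    return max(candidates, key=lambda m: loglik(n, m))


candidates = range(2, 41)
naive = mle(log_lik, 1, candidates)                  # ignores ascertainment
corrected = mle(log_lik_conditional, 1, candidates)  # conditions on N >= 1
print("naive estimate (meioses):", naive)
print("corrected estimate (meioses):", corrected)
```

Under these assumptions, with a single observed segment the uncorrected estimate settles at the depth where the expected segment count is close to one, however deep the true relationship is. The conditioned likelihood instead keeps increasing with depth, so its maximum sits at the edge of the candidate grid. A segment count alone therefore cannot pin down a distant relationship in this toy model; the example only shows the direction in which conditioning moves the estimate.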