METHODS AND SYSTEMS FOR DETERMINING AND DISPLAYING PEDIGREES
The disclosed embodiments concern methods, apparatus, systems and computer program products for determining and displaying pedigrees based on IBD data. Some implementations use a probabilistic relationship model to obtain various likelihoods of various potential relationships based on pairwise IBD data, and pairwise age data. Some implementations build large pedigrees by combining smaller pedigrees. Some implementations display pedigree graphs with various features that are informative and easy to understand.
An Application Data Sheet is filed concurrently with this specification as part of the present application. Each application that the present application claims benefit of or priority to as identified in the concurrently filed Application Data Sheet is incorporated by reference herein in its entirety and for all purposes.
BACKGROUND
A pedigree refers to the genetic relationships among a group of genetically related individuals. Pedigrees can be used to produce family trees for consumers or genealogists. They can also be used to determine the heritability and genetic models for traits and disorders. Pedigree structure can be used to enable or improve genetic-analysis tools such as linkage, family-based association, pedigree-aware imputation, and pedigree-aware phasing.
However, there are many technical challenges in determining pedigrees using genetic data. Manually reconstructing an unknown pedigree with pairwise relationship comparisons requires arduous, error-prone labor. For example, Pemberton et al. manually reconstructed cryptic HapMap3 pedigrees, but the authors encountered inconsistencies they could not resolve by hand. Pemberton, et al. (2010). Inference of unexpected genetic relatedness among individuals in HapMap Phase III. Am. J. Hum. Genet. 87, 457-464. These problems become even more impractical or impossible to solve when the pedigrees are large and numerous.
Computer tools using identity-by-descent (IBD) genetic data to construct pedigrees have been developed to address some of these problems. However, the accuracy, qualities, and efficiencies of available computer tools have many limitations. In various implementations, methods and systems disclosed herein for determining, constructing and visualizing pedigrees provide various advantages and improvements over conventional approaches.
SUMMARY
The disclosed implementations concern methods, apparatus, systems, and computer program products for determining and displaying pedigrees among genetically related individuals based on IBD data.
A first aspect of the disclosure provides computer-implemented methods for determining pedigree relationships among a plurality of genetically related individuals.
Another aspect of the disclosure provides systems for determining pedigree relationships among a plurality of genetically related individuals. In some implementations, the system involves a processor and one or more computer-readable storage media having stored thereon instructions for execution on said processor to determine pedigree relationships among a plurality of genetically related individuals.
Another aspect of the disclosure provides a computer program product including a non-transitory machine-readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement the methods above for determining pedigree relationships among a plurality of genetically related individuals.
Although the examples herein concern humans and the language is primarily directed to human concerns, the concepts described herein are applicable to genomes from any plant or animal. These and other objects and features of the present disclosure will become more fully apparent from the following description and appended claims, or may be learned by the practice of the disclosure as set forth hereinafter.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
The disclosure concerns methods, apparatus, systems, and computer program products for determining pedigree relationships among a plurality of genetically related individuals. Various implementations operate on IBD data to perform the disclosed functions. IBD data may be provided in different formats or obtained by various methods. For example, U.S. patent application Ser. No. 16/947,107, entitled: PHASE-AWARE DETERMINATION OF IDENTITY-BY-DESCENT DNA SEGMENTS, filed on Jul. 1, 2020, which is incorporated by reference in its entirety, discloses suitable methods for determining IBD using genotype data.
Numeric ranges are inclusive of the numbers defining the range. It is intended that every maximum numerical limitation given throughout this specification includes every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimal numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.
The headings provided herein are not intended to limit the disclosure.
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Various scientific dictionaries that include the terms included herein are well known and available to those in the art. Although any methods and materials similar or equivalent to those described herein find use in the practice or testing of the embodiments disclosed herein, some methods and materials are described.
The terms defined immediately below are more fully described by reference to the Specification as a whole. It is to be understood that this disclosure is not limited to the particular methodology, protocols, and reagents described, as these may vary, depending upon the context in which they are used by those of skill in the art.
DEFINITIONS
As used herein, the singular terms “a,” “an,” and “the” include the plural reference unless the context clearly indicates otherwise.
The term “plurality” refers to more than one element. For example, the term is used herein in reference to a number of nucleic acid molecules or sequence reads that is sufficient to identify significant differences in repeat expansions in test samples and control samples using the methods disclosed herein.
A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. DNA segments that are IBD are IBS by definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.
The terms “nucleic acid” and “nucleic acid molecules” are used interchangeably and refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the pentose of the next. The nucleotides include sequences of any form of nucleic acid, including, but not limited to, RNA and DNA molecules such as cell-free DNA (cfDNA) molecules.
The term ‘parameter’ herein refers to a numerical value that characterizes a physical property. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, the maximum degree of genetic distance between two genotyped individuals in a pedigree is a parameter of a genetic pedigree model.
The term “based on,” when used in the context of obtaining a specific quantitative value, herein refers to using another quantity as input to calculate the specific quantitative value as an output.
As used herein, the term “chromosome” refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed herein.
Introduction and Overview
Some existing computer-implemented methods use identity-by-descent (IBD) data to estimate pedigrees. One such method for determining a pedigree is PRIMUS. Staples et al. (2014), PRIMUS: Rapid Reconstruction of Pedigrees from Genome-wide Estimates of Identity by Descent. The American Journal of Human Genetics 95, 553-564. PRIMUS uses the total lengths of half and full IBD between pairs of individuals to obtain likelihoods of different relationship types. It then attempts to construct a pedigree for which the product of all pairwise likelihoods induced by the pedigree is greatest. PRIMUS does not use the count of IBD fragments for determining a pedigree, and it uses age information only to resolve apparent discrepancies, such as an inferred grandparent being younger than a grandchild or an inferred nephew-uncle pair having an age difference greater than a specified threshold. In general, previous methods do not use age information as part of the likelihood of relationship in constructing pedigrees. In contrast, some implementations disclosed herein use the count of IBD segments and pairwise age differences directly in modeling the likelihoods of relationships. These aspects of the implementations improve the accuracy of relationship estimates.
PRIMUS uses a kernel density estimation to estimate IBD-length distributions. Kernel density estimation is a nonparametric technique that can result in overfitting data. In contrast, some implementations disclosed herein model distributions of IBD data (e.g., IBD length, number of IBD segments) and age difference data as parametric probability distributions. In various implementations, the probability distributions are modeled as Gaussian distributions, exponential distributions, or Poisson distributions. The parametric approach can provide more reliable estimates by avoiding overfitting the data.
PRIMUS computes likelihoods for only six general categories corresponding to different degrees of relationship:

 Parent-child
 Full-sibling
 Half-sibling, avuncular, grandparental
 First-cousin, great-grandparental, great-avuncular, half-avuncular
 Distantly related
 Unrelated
PRIMUS does not compute separate estimates for different relationships within each category (e.g., half siblings, avuncular, and grandparental), because relationships within a category can be difficult to distinguish from one another based on genetic data alone. In contrast, methods disclosed herein estimate relationship likelihood for each specific relationship, including each relationship in the same category above. Methods disclosed herein use age data to determine likelihoods of relationships, which is especially helpful for distinguishing relationships with the same coefficient of relationship (e.g., half siblings vs. grandparents).
The coefficient of relationship is a measure of the degree of consanguinity (or biological relationship) between two individuals. With a simplifying assumption of nonconsanguineous common ancestors, the coefficient of relationship r_{BC} between two individuals B and C can be calculated as:
r_{BC} = \sum_{p} 2^{-L(p)}
where p enumerates all paths connecting B and C to unique common ancestors, and L(p) is the length of a common-ancestor path p, which may be expressed in generations or meioses through the path. The coefficient of relationship sometimes is also referred to as “average fraction of DNA shared.” Table 1 lists various relationships and corresponding coefficients of relationships.
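By way of illustration only, the coefficient of relationship can be computed by summing 2 raised to the negative path length over the common-ancestor paths, with path lengths counted in meioses. The function name and the example path lists below are hypothetical and are not part of the disclosure; the sketch assumes nonconsanguineous common ancestors.

```python
def coefficient_of_relationship(path_lengths):
    """Sum 2**(-L(p)) over common-ancestor paths p, with L(p) in meioses.

    Assumes the common ancestors are themselves nonconsanguineous.
    """
    return sum(0.5 ** length for length in path_lengths)


# Parent-child: a single path of one meiosis.
parent_child = coefficient_of_relationship([1])
# Full siblings: two common ancestors (both parents), two meioses per path.
full_siblings = coefficient_of_relationship([2, 2])
# Half siblings: one shared parent, one path of two meioses.
half_siblings = coefficient_of_relationship([2])
# First cousins: two shared grandparents, four meioses per path.
first_cousins = coefficient_of_relationship([4, 4])
```

Under these assumptions, half siblings and a grandparent-grandchild pair both have one path of two meioses and so receive the same coefficient, which is why genetic sharing alone may not separate them.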
PRIMUS lumps all relationships having the same coefficient of relationship into one category, and relationships with coefficients smaller than 25% cannot be used to generate a pedigree. The disclosed methods herein provide pairwise relationship estimates at many levels beyond a 25% coefficient of relationship. In many applications, relationships up to 15th degree are estimated. This makes it possible to build very large pedigrees with many degrees of relationship.
Conventional methods for determining pedigrees using IBD data do not properly address noise caused by background IBD. For example, some individuals from Ashkenazi Jewish, Mexican, Puerto Rican, and other populations share IBD due to historical bottlenecks, rather than true recent relationships. Such shared IBD constitutes noise in the IBD data for determining recent relationships in pedigrees. Some implementations herein estimate the level of background IBD by computing the amount of IBD that each person in a group shares with himself or herself between two chromosomes. Their IBD data can then be adjusted to remove the background IBD noise. This approach can help to improve the accuracy of pedigree estimates in various populations.
Some implementations described herein also statistically infer whether the IBD carried by an individual in a pedigree is due simply to background IBD. These approaches leverage previously inferred close relatives of such individuals to make these inferences. The methods then exclude such individuals from consideration when inferring degrees of relationship.
When adding a person to a pedigree, PRIMUS checks that the person's maximum likelihood estimated relationship with any person in the pedigree exceeds an initial threshold of 0.3, although this threshold can be adjusted downward over time if a pedigree fails to build properly on the first attempt. The methods disclosed herein do not have such a restriction. Being free of this restriction makes it possible to build larger pedigrees without having to progressively reduce this threshold, saving considerable computational time.
In the process of building pedigrees, PRIMUS does not distinguish among individuals within the same category of relationship. The PRIMUS method builds in stages, first combining all siblings and parent-child pairs, then second degree relatives (half-sibling, avuncular, and grandparent), then third degree relationships (first cousin, half-avuncular, great-avuncular, and great-grandparent). The goal is to combine high-confidence classes of relatives first, although confidence in an estimate can vary within a degree class. In contrast, methods disclosed herein first include a person that is most closely related to any individuals in the pedigree. By giving priority to the close relationship with the highest confidence, the disclosed methods can improve the accuracy of the pedigree estimate.
In adding a person to a pedigree, PRIMUS adds the person in all possible relationships regardless of the likelihoods of the relationships. In contrast, the methods disclosed herein add a person to a pedigree in potential relationships that are highly likely. Moreover, the methods disclosed herein exclude low likelihood pedigrees in the process of building pedigrees. These likelihood-based techniques can greatly reduce the pedigree space that needs to be explored without significantly sacrificing accuracy. As a result, they can greatly improve computational speed and efficiency, and reduce memory and CPU loads.
All previous pedigree inference methods, including PRIMUS, attempt to search the full pedigree space. The full search of all possible pedigrees quickly becomes computationally intractable when the number of individuals in the pedigree is moderate or large. This is so even using modern computers. In contrast, the methods disclosed herein use a two-stage approach. In the first stage, small pedigrees are inferred using approaches that thoroughly search the space of possible pedigrees. In the second stage, small pedigrees are combined into large pedigrees using heuristic methods that greatly reduce the number of pedigrees that must be searched. Without this heuristic second step, it is computationally intractable to build large pedigrees. The methods disclosed herein are the only known methods to use such heuristic approaches and are the only computer-implemented methods capable of building very large pedigrees.
Process for Determining a Pedigree
The illustrated process is implemented using a computer system that includes one or more processors and system memory. In many real-world applications, it is not practical or possible to implement these methods in a person's mind or using pen and paper. For pedigrees including a large number of individuals, many types or levels of relationship, or ambiguous data, the computation involved in the process would be too complex to be performed in the human mind.
The methods illustrated here apply IBD data and age data to a probabilistic relationship model to obtain likelihoods of many potential relationships. In each iteration of adding an individual to a pedigree, the number of possible pedigrees grows exponentially. The computational task of determining the likelihood of a single pairwise relationship given IBD data and age data is time-consuming. This computation needs to be performed for tens of relationships in each iteration of growing pedigrees. As the pedigrees grow larger, the computation of a pedigree likelihood becomes impractical to perform by hand.
Process 150 starts by identifying, among the plurality of genetically related individuals, the closest relative of a starting individual. See block 152. In some applications, the starting individual is an individual of interest, such as a consumer who wants to obtain a pedigree or pedigree graph with herself as a focal person. In some implementations, the starting individual is an individual meeting certain genetic relationship criteria, such as a person who has a high average degree of relationship with other individuals being considered. Various methods may be used to determine how closely two individuals are related or the relationship distance between them. For example, IBD data may be used to calculate a coefficient of relationship as explained above.
In various implementations, the plurality of genetically related individuals includes at least 20, 50, 100, 200, 300, 400, or 500 individuals. In some implementations, the pedigree can include both genetically related individuals that have been genotyped and those individuals where genotype data is unknown or not available. In some implementations, every pair of individuals in the plurality of genetically related individuals has a total IBD length larger than an IBD threshold. In various implementations, the IBD threshold is 1 centimorgan (cM), 2 cM, 3 cM, 4 cM, 5 cM, 6 cM, 7 cM, 8 cM, 9 cM, 10 cM, 15 cM, 20 cM, 25 cM, 50 cM, 75 cM, 100 cM, 200 cM, or 500 cM. In some implementations, the total IBD length is adjusted by subtracting background IBD from the pairwise IBD data.
Various methods may be used to determine background IBD for a group of individuals or a population of individuals. In some implementations, two chromosomes in each pair of one or more pairs of the 22 pairs of somatic chromosomes of the same individual can be compared to identify IBD regions. Two corresponding fragments on a pair of chromosomes are respectively inherited from two parents. Assuming that an individual's two parents are not more consanguineous than unrelated individuals in the population, the IBD amount between the two chromosomes of a pair in the individual provides a good estimate of population background IBD.
In some implementations, the level of background IBD can be inferred by estimating IBD between pairs of individuals assumed to be nonconsanguineous.
In some implementations, IBD lengths are adjusted for the background IBD before being used to model or determine the relationship likelihood or pedigree likelihood. In some implementations, IBD lengths are adjusted before being compared to an IBD threshold to determine whether individuals should be included for consideration in a pedigree. In other implementations, pairs of individuals whose IBD sharing levels are inferred to be significantly lower than expected by chance are removed from consideration.
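By way of illustration only, the background adjustment and thresholding described above may be sketched as follows. The function names, the use of a simple mean as the background estimator, and the flooring of adjusted values at zero are assumptions made for this sketch and are not taken from the disclosure.

```python
from statistics import mean


def estimate_background_ibd(within_individual_ibd_cm):
    """Estimate background IBD (cM) as the mean IBD found between the two
    chromosome copies of each individual (assumed estimator: simple mean)."""
    return mean(within_individual_ibd_cm)


def adjust_pairwise_ibd(pairwise_ibd_cm, background_cm):
    """Subtract background IBD from each pair's total, flooring at zero."""
    return {pair: max(0.0, total - background_cm)
            for pair, total in pairwise_ibd_cm.items()}


def pairs_above_threshold(adjusted_ibd_cm, threshold_cm):
    """Keep only pairs whose adjusted total IBD exceeds the IBD threshold."""
    return {pair for pair, total in adjusted_ibd_cm.items()
            if total > threshold_cm}


# Hypothetical inputs, in centimorgans.
background = estimate_background_ibd([10.0, 14.0, 12.0])
adjusted = adjust_pairwise_ibd(
    {("a", "b"): 862.0, ("a", "c"): 15.0, ("b", "c"): 9.0}, background)
kept = pairs_above_threshold(adjusted, threshold_cm=5.0)
```

In this sketch the adjustment is applied before the threshold comparison, matching the implementations in which IBD lengths are adjusted before deciding which individuals to consider.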
When selecting a next individual to be added to a pedigree, the process considers how closely individuals already included in the pedigrees are related to individuals not yet included. In some implementations, pairwise IBD data between two individuals are used to determine how closely related the two individuals are, or the relationship distance between the two individuals. In some implementations, the relatedness or relationship distance between individuals may be inferred from IBD data using a likelihood expression for the degree of relationship derived using a probabilistic recombination model. Other genetic information and methods may also be used to determine relatedness or relationship distance. In some implementations, relatedness or relationship distance may be measured by meioses on a common ancestor path. In some implementations, relatedness or relationship distance may be expressed as or measured by coefficient of relationship.
Process 150 proceeds to apply pairwise identity by descent (IBD) data and pairwise age data of the starting individual and the closest relative to the probabilistic relationship model to obtain various likelihoods of various potential relationships between the starting individual and the closest relative. In various implementations, the pairwise age data reflect the age difference between two individuals. In some implementations, the pairwise age data are obtained by simple subtraction. In other implementations, other operations may be performed on ages of two individuals, such as division (e.g., to obtain a ratio of two ages) or normalization (e.g., to obtain a z-score).
It is also possible to extrapolate from empirical distributions of age differences between different types of relatives to obtain distributions for relationships that are unobserved empirically. In particular, pairs of relatives sharing third great-grandparental relationships (5 generations) may be unobserved, and therefore, it is not possible to obtain the distribution of the age difference of a third great-grandparental pair empirically. However, the age difference distribution for third great-grandparents can be estimated by computing the mean (μ_{PC}) and variance (σ_{PC}^{2}) of the age differences among observed parent-child pairs. Then, noting that a third great-grandparental relationship is a string of five statistically independent parent-child relationships, we find that the mean and variance of the age difference distribution for third great-grandparental relationships are μ_{5GGP}=5 μ_{PC} and σ_{5GGP}^{2}=5 σ_{PC}^{2}, respectively. This result is obtained by using properties of the means and variances of sums of independently distributed random variables.
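By way of illustration only, the extrapolation above may be sketched as follows: the mean and variance of the parent-child age difference are estimated from observed pairs and then scaled by the number of generations, relying on the stated assumption that the parent-child links are statistically independent. The function name and the sample age differences are hypothetical.

```python
from statistics import mean, variance


def chain_age_difference_distribution(parent_child_age_diffs, generations):
    """Return (mean, variance) of the age difference across a chain of
    `generations` independent parent-child links.

    For a sum of independent random variables, means add and variances add.
    """
    mu_pc = mean(parent_child_age_diffs)
    var_pc = variance(parent_child_age_diffs)  # sample variance
    return generations * mu_pc, generations * var_pc


# Hypothetical observed parent-child age differences, in years.
observed = [25.0, 30.0, 35.0]
mu_5, var_5 = chain_age_difference_distribution(observed, generations=5)
```

The resulting mean and variance could then parameterize, for example, a Gaussian age-difference distribution for the unobserved relationship.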
Given the IBD data and the pairwise age data of the two individuals, the probabilities or likelihoods of different relationships between the two individuals can be determined using the probabilistic relationship model. Various probabilistic relationship models are further described herein after. See block 154. In some implementations, the pairwise IBD data include the lengths of IBD segments, such as the total or summed length of the IBD segments. In some implementations, the lengths of IBD segments include the length of full IBD segments (IBD2) and/or length of half IBD segments (IBD1). In some implementations, the two types of IBD lengths may be combined. In other implementations, the two IBD segment lengths are kept separate and are modeled by the probabilistic relationship model to have different probability distributions. In some implementations the lengths of half IBD segments (IBD1) are summed and the sum is used to compute the likelihood. Similarly, in some implementations the lengths of full IBD segments (IBD2) are summed and the sum is used to compute the likelihood. In some implementations, the pairwise IBD data also include numbers or counts of IBD segments. Similar to lengths of the two types of IBD segments, the numbers of the two types of IBD segments may be combined or modeled separately.
In some implementations, the probabilistic relationship model is a machine learning model obtained by training the model using training data to determine a plurality of parameters of the model, including parameters of probability distributions for various independent/input variables and various relationships. In some implementations, the probabilistic relationship model models the probability distribution of the pairwise IBD as a Gaussian distribution, a Poisson distribution, an exponential distribution, a binomial distribution, a beta-binomial distribution, or other distributions suitably determined from prior information. In some implementations, the probabilistic relationship model also models the probability distribution of the pairwise age data for each relationship using one or more of said forms of distributions.
In some implementations, the various potential relationships include more than 10, 20, 30, 40, or 50 different relationships. In various implementations, the various relationships include relationships of 0^{th}, 1^{st}, 2^{nd}, 3^{rd}, 4^{th}, 5^{th}, 6^{th}, 7^{th}, 8^{th}, 9^{th}, 10^{th}, 11^{th}, 12^{th}, 13^{th}, 14^{th}, or 15^{th} degree or further. In some implementations, the various relationships include relationships of at least 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 or more meioses on a common-ancestor path between the two individuals through a common ancestor. In some implementations, the various relationships include two or more different relationships of the same degree or of the same coefficient of relationship (e.g., half sibling, grandparent, and avuncular relationships have a coefficient of relationship of 0.25). So in some implementations, the various relationships include these three relationships as different relationships instead of as a same category of relationship.
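By way of illustration only, one possible parametric form of such a model is sketched below: a Gaussian for total IBD length, a Poisson for the number of IBD segments, and a Gaussian for the age difference, combined by summing log likelihoods. Treating the three terms as independent is an assumption of this sketch. All names and all parameter values are hypothetical placeholders, not trained values and not values from the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class RelationshipParams:
    ibd_mean_cm: float   # Gaussian mean of total IBD length
    ibd_sd_cm: float     # Gaussian standard deviation of total IBD length
    segment_rate: float  # Poisson rate for the number of IBD segments
    age_mean: float      # Gaussian mean of the age difference (years)
    age_sd: float        # Gaussian standard deviation of the age difference


def gaussian_logpdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


def poisson_logpmf(k, rate):
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def relationship_log_likelihood(total_ibd_cm, n_segments, age_diff, p):
    """Log likelihood of one relationship, assuming independent terms."""
    return (gaussian_logpdf(total_ibd_cm, p.ibd_mean_cm, p.ibd_sd_cm)
            + poisson_logpmf(n_segments, p.segment_rate)
            + gaussian_logpdf(age_diff, p.age_mean, p.age_sd))


# Placeholder parameters for three relationships with the same coefficient.
PARAMS = {
    "half_sibling": RelationshipParams(1750.0, 200.0, 40.0, 0.0, 8.0),
    "grandparent": RelationshipParams(1750.0, 250.0, 25.0, 55.0, 10.0),
    "avuncular": RelationshipParams(1750.0, 220.0, 45.0, 27.0, 12.0),
}

scores = {name: relationship_log_likelihood(1700.0, 26, 58.0, p)
          for name, p in PARAMS.items()}
most_likely = max(scores, key=scores.get)
```

With these placeholder values, the three relationships have the same expected IBD length, so the segment count and the age difference are what separate them in this example.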
Process 150 involves selecting the one or more potential relationships between the starting individual and the closest relative that have relationship likelihoods meeting a relationship criterion, and forming a pedigree from each of the one or more potential relationships. See block 156. In various implementations, different relationship criteria may be used. For example, the relationship criterion may be determined by likelihood ranks or percentile. The criterion may simply be a number of the most likely relationships, e.g., the top 1, 2, 3, 5, 6, 7, 8, 9, 10, or 15 most likely relationships. In other implementations, the relationship criterion is based on a ratio of the candidate relationship likelihood over the maximum relationship likelihood. In some implementations, the ratio is a log likelihood ratio, and the criterion is for the ratio to be larger than a threshold c. In general, the larger the parameter c, the fewer potential relationships are included. By reducing the potential relationships to be used to construct different pedigrees, the process can reduce the number of relationships to be processed. This can increase computational speed and reduce computational load.
Process 150 proceeds to identify, among genetically related individuals not yet included in pedigrees already formed, a closest relative of an individual in the formed pedigrees. See block 158.
Process 150 further involves applying pairwise IBD data and pairwise age data of the closest relative and the individual already in the formed pedigrees to the probabilistic relationship model to obtain various likelihoods of various potential relationships between the closest relative and the individual already in the pedigrees. See block 160.
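By way of illustration only, two of the relationship criteria described above may be sketched as follows: a log likelihood ratio test against the most likely relationship, and a top-k selection by rank. The function names and input values are hypothetical. The sketch assumes that the threshold c is expressed on the log scale and is negative, since the log ratio against the maximum is never positive.

```python
def select_by_log_ratio(log_likelihoods, c):
    """Keep relationships whose log likelihood ratio against the most
    likely relationship is larger than c (c assumed negative)."""
    best = max(log_likelihoods.values())
    return {name for name, ll in log_likelihoods.items() if ll - best > c}


def select_top_k(log_likelihoods, k):
    """Keep the k most likely relationships, ordered from most to least likely."""
    ranked = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
    return ranked[:k]


# Hypothetical log likelihoods for three candidate relationships.
candidates = {"grandparent": -10.0, "avuncular": -12.0, "half_sibling": -30.0}
loose = select_by_log_ratio(candidates, c=-5.0)
strict = select_by_log_ratio(candidates, c=-1.0)
top_two = select_top_k(candidates, k=2)
```

Raising c toward zero retains fewer relationships, which reduces the number of pedigrees formed in the following steps.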
Process 150 then proceeds to select one or more potential relationships between the closest relative and the individual already in the pedigrees that have relationship likelihoods meeting the relationship criterion. The process also adds each of the one or more potential relationships with the individual already in the pedigrees to grow each pedigree into one or more growing pedigrees. See block 162.
Process 150 further involves selecting growing pedigrees that have pedigree likelihoods meeting a pedigree criterion. In some implementations, a pedigree likelihood can be obtained by aggregating the likelihood of all the relationships in a pedigree, such as summing the log likelihoods of the pairwise relationships in a pedigree. In some implementations, the pedigree criterion is met when a ratio of the candidate pedigree likelihood over a maximum pedigree likelihood is larger than or equal to a threshold value d. In various implementations, d=0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0. In other implementations, d=1/100,000, 1/500,000, 1/1,000,000, 1/2,000,000, 1/4,000,000, or the like. In some implementations, the pedigree criterion may also be determined by pedigree likelihood ranks or percentile. Similar to the parameter c above, as d gets larger, fewer pedigrees are included for pedigree building. By increasing the value of d, one can increase computational speed and reduce CPU or memory load for exploring potential pedigrees.
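By way of illustration only, the pedigree criterion may be sketched as follows. Each candidate pedigree is scored by summing its pairwise log likelihoods, and a candidate is kept when the ratio of its likelihood to the maximum pedigree likelihood is at least d, which on the log scale means the difference in log likelihood is at least log(d). The function names and input values are hypothetical.

```python
import math


def pedigree_log_likelihood(pairwise_log_likelihoods):
    """Aggregate a pedigree's pairwise log likelihoods by summation."""
    return sum(pairwise_log_likelihoods)


def prune_pedigrees(candidates, d):
    """Keep pedigrees whose likelihood ratio against the best is >= d."""
    scores = {name: pedigree_log_likelihood(lls)
              for name, lls in candidates.items()}
    best = max(scores.values())
    cutoff = math.log(d)
    return {name for name, score in scores.items() if score - best >= cutoff}


# Hypothetical candidate pedigrees with their pairwise log likelihoods.
growing = {"A": [-1.0, -2.0], "B": [-1.5, -3.0], "C": [-4.0, -6.0]}
kept_loose = prune_pedigrees(growing, d=0.1)
kept_strict = prune_pedigrees(growing, d=0.5)
```

Working on the log scale avoids numerical underflow when a pedigree contains many pairwise relationships.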
Process 150 then decides whether there are more individuals to be considered for adding to the pedigrees. See block 166. If so, the process loops back to block 158 to identify another closest relative of any individual already in the pedigree. In some implementations, the process continues the loop until all individuals of the plurality of genetically related individuals have been identified as a closest relative or excluded from the pedigrees according to particular exclusion criteria.
Pedigrees created using process 150 can be combined into even larger pedigrees.
Combining smaller pedigrees into larger pedigrees makes it possible to leverage all IBD segments observed between the smaller pedigrees when inferring the degree of relationship between individuals. This is because it is more likely that some individual in a small pedigree shares IBD with some individual in a related pedigree, even if not all crosspedigree pairs of individuals share IBD.
Methods that combine pedigrees are also computationally much faster than methods that add one individual at a time. The reason for this is that the amount of IBD shared between a single unplaced individual and a set of genotyped individuals in a pedigree is often consistent with many possible relationships. Consequently, it is necessary to consider many ways of placing the new individual, which is computationally slow. In contrast, when combining two pedigrees, the IBD shared between individuals in the first pedigree and individuals in the second pedigree is larger than the amount shared with any single individual, providing additional information about the way in which the two pedigrees are related. As a result, there are fewer highly likely ways in which the pedigrees can be related, reducing the number of combinations that must be explored and considerably increasing the speed of computation. Combining pedigrees makes it computationally possible to infer pedigrees that are much larger than pedigrees that are computationally tractable for PRIMUS or methods that must search many possible pedigree configurations.
Another reason to combine pedigrees is that background IBD can be detected more effectively. When comparing a single pair of individuals, it is difficult to detect whether the IBD they share is due to background or to a recent relationship. However, by examining the amount of IBD shared among all genotyped or sequenced close relatives of a pair of individuals, it becomes easier to determine when observed IBD is background IBD.
Referring to
Process 170 proceeds by identifying the two smaller pedigrees, P_{1} and P_{2}, that share the greatest amount of IBD. See box 1. These will be the next pair of pedigrees to be combined. To combine pedigrees P_{1} and P_{2}, the set S_{1} of individuals in P_{1} who share IBD with individuals in P_{2} is identified. Conversely, the set S_{2} of individuals in P_{2} who share IBD with individuals in P_{1} is identified. See box 2. Process 170 then proceeds by identifying a common ancestor A_{1} of the set S_{1} (box 3 and box 4) and a common ancestor A_{2} of the set S_{2} (box 5 and box 6). These common ancestors are identified using the small pedigree structures that were previously inferred using methods such as shown in process 150.
The degree of relationship between common ancestors A_{1} and A_{2} is then inferred. See box 7. In some implementations of the method, the degree of relationship between A_{1} and A_{2} is inferred by considering a candidate degree of relationship between A_{1} and A_{2} and attaching A_{1} and A_{2} by a chain of dummy nodes reflecting this degree to create a combined pedigree P comprising P_{1}, P_{2}, and the newly formed chain of dummy nodes. The log likelihood of this pedigree can then be computed as the sum over all pairwise log likelihoods among genotyped individuals in P. Process 170 considers many different possible degrees between A_{1} and A_{2} and forms a new pedigree P for each degree. The degree between A_{1} and A_{2} is then inferred as the degree that yields the pedigree P with the highest sum of pairwise log likelihoods.
In other implementations, the degree of relationship between A_{1} and A_{2} is inferred using a version of the DRUID estimator (M. D. Ramstetter, S. A. Shenoy, T. D. Dyer, D. M. Lehman, J. E. Curran, R. Duggirala, J. Blangero, J. G. Mezey, and A. L. Williams. Inferring identical-by-descent sharing of sample ancestors promotes high-resolution relative detection. Am. J. Hum. Genet., 103:30-44, 2018) that we generalize to the case of pedigrees with arbitrary outbred topologies. This generalized estimator of the degree between A_{1} and A_{2} is discussed in the Distant Relatives Likelihood section.
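The degree search described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name infer_degree and the callback pair_log_lik are hypothetical, and the callback is assumed to return the pairwise log likelihood of two genotyped individuals in the combined pedigree formed for a given candidate degree.

```python
from typing import Callable, Dict, Iterable, Tuple


def infer_degree(
    candidate_degrees: Iterable[int],
    pairs: Iterable[Tuple[str, str]],
    pair_log_lik: Callable[[str, str, int], float],
) -> Tuple[int, Dict[int, float]]:
    """Pick the candidate degree whose combined pedigree scores highest.

    pair_log_lik(i, j, d) is a hypothetical callback: the log likelihood of
    genotyped individuals i and j in the pedigree obtained by joining A1 and
    A2 with a chain of dummy nodes reflecting degree d.
    """
    pairs = list(pairs)
    # Score each candidate pedigree as the sum of its pairwise log likelihoods.
    scores = {
        d: sum(pair_log_lik(i, j, d) for i, j in pairs)
        for d in candidate_degrees
    }
    # The inferred degree is the one with the highest summed log likelihood.
    best = max(scores, key=scores.get)
    return best, scores
```

The function returns both the selected degree and the per-degree scores, so that near-ties between candidate degrees can be inspected.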
In some implementations, process 170 involves identifying individuals in the sets S_{1 }and S_{2 }whose observed IBD is likely due to background IBD. The IBD observed in these individuals can lead to biased estimates of the degree of relatedness between ancestors A_{1 }and A_{2 }and, more importantly, it can lead to the incorrect identification of A_{1 }and A_{2}, themselves.
The way in which background IBD can contribute to the misidentification of A_{1 }and A_{2 }is shown in
In
Ignoring the IBD in individuals 1 and 2 will not only lead to a different inferred degree between P_{1} and P_{2}, but will also affect the choice of A_{1}. In particular, if individuals 1 and 2 are unrelated to individuals 5 and 6, then the correct common ancestor in pedigree P_{1} to whom one will connect A_{2} is individual 8. In some implementations, process 170 cycles over nodes descended from A_{1} and A_{2} and identifies nodes whose descendants share significantly less IBD than expected, conditional on the current estimate of the pedigrees P_{1} and P_{2}, the choice of ancestors A_{1} and A_{2}, and the degree of relationship between A_{1} and A_{2}. See box 8. In other implementations, the processing of these nodes is optional.
To determine whether the amount of observed IBD in the descendants of node n_{1} below A_{1} is statistically significantly lower than that expected by chance, some implementations consider the set N_{1} of genotyped descendants of n_{1} and compute the total merged amount of IBD shared between individuals in N_{1} and all nodes in S_{2}. The process then uses the likelihood described hereinafter in the Distant Relatives Likelihood section to evaluate whether the total merged length of IBD is significantly lower than that expected by chance. If the amount of IBD is lower than expected, the descendants N_{1} of node n_{1} are removed from S_{1}. The same approach is then used to identify and remove nodes below A_{2} whose descendants have an amount of IBD that is significantly lower than that expected by chance. Some implementations cycle through nodes descended from A_{1} and A_{2}, repeating the consideration of shared IBD described above, until the amount of IBD in all the remaining descendant nodes of A_{1} and A_{2} is not significantly different from that expected by chance.
In some implementations, process 170 identifies a new pair of common ancestors A′_{1} and A′_{2} of the reduced sets S′_{1} and S′_{2}, where S′_{1} consists of the individuals in S_{1} who remain after removing individuals whose IBD is inferred to be due to background IBD. Similarly, S′_{2} consists of the individuals in S_{2} who remain after removing individuals whose IBD is inferred to be due to background IBD. See box 9. In some implementations, the operation shown in box 9 is optional, and downstream processes are performed on A_{1} and A_{2}. Process 170 then computes the degree of relatedness between A′_{1} and A′_{2} using the likelihood described in the Distant Relatives Likelihood section. See box 10. The process then attaches A′_{1} to A′_{2} by a string of dummy ancestral nodes. This step yields a new pedigree P composed of small pedigrees P_{1} and P_{2}, and the dummy nodes connecting A′_{1} and A′_{2}. See box 11.
Process 170 can repeat operations covered by box 1 through box 11 and interim boxes until all small pedigrees have been combined into a single pedigree, or until no two small pedigrees share an amount of merged IBD greater than a predetermined threshold.
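The outer loop of process 170 can be sketched as a greedy merge, as below. This is an illustration under stated assumptions, not the disclosed implementation: the names combine_all, shared_ibd, merge, and min_ibd are hypothetical, shared_ibd is assumed to return the merged IBD shared between two pedigrees, and merge stands in for the operations of boxes 2 through 11.

```python
from itertools import combinations
from typing import Callable, List, TypeVar

P = TypeVar("P")  # any hashable, comparable pedigree representation


def combine_all(
    pedigrees: List[P],
    shared_ibd: Callable[[P, P], float],
    merge: Callable[[P, P], P],
    min_ibd: float,
) -> List[P]:
    """Greedily merge the two pedigrees sharing the most IBD (box 1).

    Stops when one pedigree remains or when no pair shares more than
    min_ibd (a hypothetical threshold parameter).
    """
    pool = list(pedigrees)
    while len(pool) > 1:
        # Box 1: the pair of pedigrees that shares the greatest amount of IBD.
        p1, p2 = max(combinations(pool, 2), key=lambda pq: shared_ibd(*pq))
        if shared_ibd(p1, p2) <= min_ibd:
            break
        pool.remove(p1)
        pool.remove(p2)
        # Boxes 2-11 are abstracted into the caller-supplied merge step.
        pool.append(merge(p1, p2))
    return pool
```

Recomputing the best pair from scratch on every pass keeps the sketch short; a practical implementation would likely cache pairwise totals and update only those involving the newly merged pedigree.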
Distant Relatives Likelihood
Likelihoods computed among pairs of individuals provide high accuracy for inferring the degree of relatedness when the degree is relatively small. However, the amount of IBD shared between two individuals decreases exponentially in their degree of relatedness, resulting in very little information for inferring degrees between distant relatives. In fact, there can be a sizable probability that distant relatives will share no IBD segments at all, especially if IBD segments below a threshold are discarded to reduce the rate of observation of false positives.
When inferring the degree of relatedness between two distant relatives, it is helpful to leverage information from IBD segments shared among close relatives of these two individuals.
Practitioners have developed a likelihood estimator of the pairwise degree of relatedness between the common ancestors A_{1} and A_{2} of two sets of genotyped individuals. To do this, practitioners derive the probability of the observed pattern of IBD shared among descendants of A_{1} and A_{2}, given the degree d=d_{A1,G}+d_{A2,G} separating A_{1} and A_{2} from their most recent common ancestor, G. Note that there can be more than one most recent common ancestor, G. There are two such individuals, G, if A_{1} and A_{2} are descended from a single ancestral couple, and there is one most recent common ancestor if A_{1} and A_{2} are descended from a pair of half siblings.
Consider one of ancestor G's two alleles at a single locus and let O_{i} be a random variable describing the event that a copy of the allele is transmitted to descendant i and is observed. One sets O_{i}=1 if the allele is observed in individual i and O_{i}=0 if it is not observed. The probability Pr(O_{i}=1) can be computed by conditioning on whether G's allele was observed in a recent ancestor of individual i.
Consider the tree relating a set of genotyped individuals with no genotyped direct ancestors and their respective most recent common ancestors (dashed orange lines and red dots in
where d_{i,a(i) }is the number of meioses separating individual i from their ancestor a(i). In the final lines of Equations (1) and (2), one has used the fact that the probability that an allelic copy is transmitted in one meiosis is ½.
Equations (1) and (2) establish a recursion for computing the probability of an observed presence and absence pattern from a given ancestral allelic copy at a single base of the genome. Defining
one can express the recursion compactly as
with the base conditions p_{g,0}=0 and p_{g,1}=1 for g∈G. The probability of an observed IBD sharing pattern {O_{1}, . . . , O_{k}} across k leaf nodes can then be computed recursively using Equation (4).
Equation (4) allows one to compute the expected total length T_{1,2 }of the genome that is covered by an IBD segment between some descendant of A_{1 }and some descendant of A_{2}. In other words, this is the length of IBD one would obtain if one merged all observed IBD segments between descendants of A_{1 }and A_{2}. The expected fraction of the genome that is observed IBD between descendants of A_{1 }and A_{2 }is given by the probability that an ancestral allele copy at a locus in X is passed down to at least one descendant of A_{1 }and at least one descendant of A_{2}.
Let be a set of nodes descended from A_{1} and let be a set of nodes descended from A_{2}. In some implementations, and are the sets of genotyped nodes below A_{1} and A_{2}. Let D_{1} denote the event that a copy of the allele from G is observed in at least one descendant in . Then, given that the ancestral allele copy was passed to A_{1}, the probability of the event D_{1}^{c} that no copy was passed to any node in is
which can be computed using the recursion in Equation (4) with the base conditions p_{A_{1},0}=0 and p_{A_{1},1}=1. The equivalent probability Pr(D_{2}^{c} | O_{A_{2}}=1) that a particular allelic copy from G is not observed in any node in , given that A_{2} inherited the copy, is computed in the same way.
The probability that the allelic copy was observed in some member of and in some member of is then
where Pr(D_{1}^{c} | O_{A_{1}}=1) and Pr(D_{2}^{c} | O_{A_{2}}=1) are computed using the recursion and Equation (5).
If A_{1} and A_{2} had exactly one common ancestor with one allele to transmit, then Equation (6) would be the fraction of the genome in which we expect to find some IBD segment shared between some member of and some member of . However, we must now account for the fact that each common ancestor of A_{1} and A_{2} in G carries two allelic copies and that there can be either one or two such common ancestors.
Let G denote the number of common ancestors of A_{1} and A_{2}, each of which carries two alleles at the locus of interest. The probability that a specific one of these 2G alleles is not observed IBD between the descendants of A_{1} and A_{2} is 1−Pr(D_{1}, D_{2}) and the probability that none of them results in an observed IBD segment is [1−Pr(D_{1}, D_{2})]^{2G}. Therefore, the probability Pr(_{1,2}) that one of the 2G ancestral alleles results in an observed IBD segment between some descendant of A_{1} and some descendant of A_{2} is
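As an illustration of how a probability of the form Pr(D^{c} | O_{A}=1) might be evaluated on a tree, the Python sketch below applies the one-half-per-meiosis transmission rule recursively. The data layout (a mapping from each node to a list of child and meiosis-count pairs), the function names, and the product form used for Pr(D_{1}, D_{2}) are assumptions made for this example; they are a reconstruction consistent with the surrounding description rather than the disclosed equations. Transmissions to different children are treated as independent.

```python
from typing import Dict, List, Set, Tuple

Tree = Dict[str, List[Tuple[str, int]]]  # node -> [(child, meioses to child)]


def prob_none_observed(node: str, children: Tree, genotyped: Set[str]) -> float:
    """Pr(no genotyped node at or below `node` carries the copy | node carries it)."""
    if node in genotyped:
        return 0.0  # the copy is observed in this node itself
    prob = 1.0
    for child, meioses in children.get(node, []):
        transmit = 0.5 ** meioses  # one half per meiosis
        # Either the copy is lost on the way to the child, or it arrives and
        # still goes unobserved everywhere below the child.
        prob *= (1.0 - transmit) + transmit * prob_none_observed(
            child, children, genotyped
        )
    return prob


def prob_both_observed(d1: int, d2: int, none1: float, none2: float) -> float:
    """Assumed form: the copy reaches A1 and A2 and is observed below both."""
    return 0.5 ** (d1 + d2) * (1.0 - none1) * (1.0 - none2)
```

Here d1 and d2 play the role of the meiosis counts separating A_{1} and A_{2} from the common ancestor, and none1 and none2 are the values returned by prob_none_observed for A_{1} and A_{2}.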
One can use the probability Pr(_{1,2}) to obtain an approximate likelihood of the total length T_{1,2 }of IBD observed between descendants of A_{1 }and A_{2}. The mean of this distribution is simply the expected length of the genome in a state of IBD between the two clades, which is
An approximation of the variance of T_{1,2 }is derived by noting that the length of a patch of IBD can be approximated as the maximum length of × different IBD segments, where is the set of genotyped nodes below ancestor A_{i }at locus m in which the IBD segment is observed. This approximation comes from conceptualizing IBD shared among the IBD segment carrying descendants of A_{1 }and the  IBD segment carrying descendants of A_{2 }as × independent segments with a single point at which all segments overlap. The length of the merged segment to one side of this focal point then has a distribution given by the maximum of × exponential random variables Whose means depend on the degree of separation between the corresponding pairs of leaf individuals.
This approximation is a simplification of the IBD sharing pattern because the segments are not truly independent and need not overlap at a single Joint. Moreover, under this approximation, the length of the merged segment would actually be the maximum over sums of identically distributed random variables, representing the sum of the length of a segment to the right of the center point and the length of the segment to the left. However, one need not be overly concerned with these drawbacks of the conceptualization because the goal is to obtain an accurate, yet simple approximation of the variance of the distribution. One may also assume that no member of is the direct ancestor of another member of the set, which holds in practice if we drop all individuals front are descended from others.
The length, , of an IBD segment between leaf nodes i and is can be modeled as an exponentially distributed random variable with mean length μ_{i,j}=L_{genome}/R, where is the degree of relationship between them and R is the expected number of recombination events, genome wide, in one meiosis. This approximation is due to Huff, C. D., Witherspoon; D. J. Simonson, T. S., Xing, J., Watkins, W. S., Zhang, T., Tuohy, T. M., Neklason, D. W., Burt, R. W., Guthery, S. L., Woodward, S. R., and Jorde, L. B. (2011). Maximumlikelihood estimation of recent shared ancestry (ERSA). Genome Research, 21, 768774. When the length of the genome is expressed in centimorgans (cM), the expected number of recombination events in the genome is L_{genome}/100. Thus, the expected length in cM of an IBD segment between individuals i and separated by meioses is =100/.
Let L_{1,2 }denote a random variable describing the length of the segment formed by merging all segments at a given locus m between descendants of A_{1 }and A_{2}. If the lengths of all segments at this locus were independent, their merged length would have a distribution given (approximately) by the maximum over independent exponentially distributed random variables with means {
Then we have L_{1,2}=max(). Under this condition, the cumulative distribution function (CDF) F_{L}() of L is
where δ_{(a,b),(c,d) }is the Kronecker delta between tuples (a, b) and (c, d), which is equal to one when (a, b)=(c, d) and zero, otherwise.
The sets and are, themselves, random variables. Summing over all sets and , one obtains
where the probabilities Pr() and Pr() are probabilities of observing IBD in the sets of leaf nodes below A_{1 }and A_{2 }and can be computed using the recursion in Equation (4).
Over the length of the genome, the number N_{1,2} of IBD segments between the descendants of A_{1} and A_{2} is approximately Poisson distributed with mean Pr(_{1,2})L_{genome}/E[L_{1,2}], where _{1,2} is the event that IBD is observed between some individual in and some individual in . This rate comes from the fact that the average total amount of the genome in a patch of IBD is Pr(_{1,2})L_{genome} while the average length of any given segment is E[L_{1,2}]. Thus, there are approximately Pr(_{1,2})L_{genome}/E[L_{1,2}] patches of IBD in the genome, on average. When the lengths of IBD are relatively short and far apart, which they are when the degree between A_{1} and A_{2} is large, this is a reasonable approximation. This is precisely the regime in which the distribution in Equation (20) is most useful.
The total length T_{1,2 }of merged IBD among the descendants of A_{1 }and A_{2 }is then
One can derive the variance of T_{1,2 }using the law of total variance as
Note that because N_{1,2}˜Poisson(Pr(_{1,2})L_{genome}/E[L_{1,2}]), one obtains
So Equation (12) simplifies to
where the fact that Var(X)=E[X^{2}]−E[X]^{2 }has been used.
It remains to find E[L_{1,2}] and E[L_{1,2}^{2}]. Using the cumulative distribution function (CDF) of L_{1,2} in Equation (10) and the fact that, for a nonnegative random variable X, E[X^{m}]=∫_{0}^{∞} m χ^{m−1}[1−F_{X}(χ)]dχ, one obtains
where the integrals in Equation (15) can be evaluated by noting that they are essentially expressions for the moments of exponential random variables with parameters λ_{i}, (λ_{i}+λ_{j}), (λ_{i}+λ_{j}+λ_{k}), etc.
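For the idealized case of fully independent segments, the moments of the maximum of exponential random variables have a closed form obtained by inclusion-exclusion over subsets of rates, which is the pattern of summed parameters noted above. The Python sketch below is a minimal illustration of that textbook identity; the function name is hypothetical, and it does not reproduce any additional terms that the disclosed equations may contain.

```python
from itertools import combinations
from math import factorial
from typing import Sequence


def max_exponential_moment(rates: Sequence[float], m: int) -> float:
    """E[max(X_1, ..., X_n)^m] for independent X_i ~ Exponential(rate_i).

    Uses 1 - F(x) = sum over nonempty subsets S of
    (-1)^(|S|+1) * exp(-(sum of rates in S) * x), and the fact that the m-th
    moment of an exponential with rate r is m! / r^m.
    """
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign * factorial(m) / sum(subset) ** m
    return total
```

With segment means of 100/d cM, the rate for a pair separated by d meioses would be d/100. The number of subsets grows as 2^n, so this direct evaluation is practical only for small numbers of segment-carrying pairs.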
Thus, one can use Equation (15) to compute
where Pr() is the probability of observing IBD segments at the leaves and , and is obtained using the recursion in Equation (4). Equation (16) is then substituted into Equation (14) to obtain the variance of T_{1,2}.
In practice, it is too computationally demanding to compute the sums in Equation (16) because the terms E[L_{1,2}], E[L_{1,2}^{2}], and Pr() are not fast to compute in large quantities. However, the probabilities Pr() can be computed quickly, making it possible to find the most likely sets of leaf nodes, and with observed IBD. Thus, in some implementations one can use an approximation in which it is assumed that the most likely IBD pattern has been observed and one computes
The assumption used in this approximation is that most patterns of observed IBD at the leaves are unlikely compared with the most likely pattern and that most high-likelihood patterns of IBD will yield similar moments E[L_{1,2}^{m}].
Equation (17) can then be used to obtain an approximation of the variance of T_{1,2 }as
where L_{1,2 }is the length of any given IBD segment between A_{1 }and A_{2 }formed by merging all IBD segments between leaf nodes in A_{1 }and A_{2 }that overlap one another.
If the segments L_{1,2} were each exponentially distributed, then T_{1,2} would have a gamma distribution. In practice, a gamma distribution is an accurate approximation for the distribution of T_{1,2}, given that the length T_{1,2} is greater than zero, so one can approximate the distribution of T_{1,2} by
where k_{1,2 }and θ_{1,2 }are found by matching the mean and variance of the gamma distribution with E[T_{1,2}] and Var(T_{1,2}). Thus, one obtains
where E[L_{1,2}] and E[L_{1,2}^{2}] are given by Equation (15).
If every IBD segment has some length, one can assume that T_{1,2} is only identically zero when there are no IBD segments. The distribution of the number of segments can be modeled as a Poisson random variable with mean E[N_{1,2}] equal to the expected number N_{1,2} of merged segments shared between and . The probability that there are no segments is then e^{−E[N_{1,2}]}. Thus, one has the approximation
A maximum likelihood estimator of the degree between A_{1} and A_{2} can be obtained by determining the degree d_{L}(A_{1},A_{2}) between A_{1} and A_{2} for which the value of the distribution in Equation (20) is maximized. This gives the likelihood estimator
One can also use Equation (5) to obtain a generalized version of the DRUID estimator of Ramstetter, M. D., Shenoy, S. A., Dyer, T. D., Lehman, D. M., Curran, J. E., Duggirala, R., Blangero, J., Mezey, J. G., and Williams, A. L. (2018). Inferring identical-by-descent sharing of sample ancestors promotes high-resolution relative detection. Am. J. Hum. Genet. 103, 30-44. The generalized estimator can provide fast estimates of the degree without the need to evaluate Equation (21). Using the approach of Ramstetter et al., the total length of IBD shared between A_{1} and A_{2} can be estimated as the total length of IBD shared between and , divided by the total fraction of genetic material A_{1} and A_{2} are expected to pass to their descendants. The fraction f_{i} of the genome of A_{i} passed to their descendants is given by
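The moment matching described above can be illustrated numerically. The Python sketch below assumes a Poisson number of merged segments with independent lengths, so that the total length is a compound Poisson sum with mean E[N]E[L] and variance E[N]E[L^{2}]; the function and argument names are hypothetical, and the sketch matches unconditional moments without treating the zero-length case separately.

```python
from typing import Tuple


def merged_ibd_gamma_params(
    p_ibd: float, genome_len: float, mean_len: float, mean_sq_len: float
) -> Tuple[float, float, float, float]:
    """Match a gamma distribution to the mean and variance of total merged IBD.

    p_ibd       : probability that a locus lies in an observed IBD patch
    genome_len  : genome length (same units as the segment lengths, e.g. cM)
    mean_len    : E[L], mean merged segment length
    mean_sq_len : E[L^2], second moment of merged segment length
    """
    expected_n = p_ibd * genome_len / mean_len  # Poisson mean of segment count
    mean_t = expected_n * mean_len              # equals p_ibd * genome_len
    var_t = expected_n * mean_sq_len            # compound Poisson variance
    shape = mean_t ** 2 / var_t                 # gamma shape k
    scale = var_t / mean_t                      # gamma scale theta
    return mean_t, var_t, shape, scale
```

For example, with p_ibd=0.01, a 3000 cM genome, and exponentially distributed segments of mean 10 cM (second moment 200), the matched gamma has shape 1.5 and scale 20.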
Thus, an estimate (A_{1}, A_{2}) of the amount of IBD shared between A_{1 }and A_{2 }is
Using the expression \hat{ϕ}=(A_{1}, A_{2})/4L_{genome} for the kinship coefficient when all IBD is of type 1, one obtains the generalized DRUID estimator
are the ones used for the DRUID estimator presented in Ramstetter et al. Thus, one obtains a version of the DRUID estimator that can be applied to general outbred pedigrees.
Likelihood for Identifying Background IBD
Individuals with no recent relationship can share small segments of IBD by chance, especially in populations with recent or severe bottlenecks. This kind of IBD is referred to as background IBD and it poses a considerable challenge to accurate pedigree inference. Previous methods have addressed background IBD by various approaches. For example, the authors of the ERSA method present an approach for modeling the distribution of background IBD among unrelated individuals and then performing a likelihood ratio test to determine whether the IBD shared between a new pair of individuals is significantly different from background. Huff, C. D., Witherspoon, D. J., Simonson, T. S., Xing, J., Watkins, W. S., Zhang, Y., Tuohy, T. M., Neklason, D. W., Burt, R. W., Guthery, S. L., Woodward, S. R., and Jorde, L. B. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Research, 21, 768-774. This approach requires a background distribution of IBD and it requires testing each pair of individuals separately. The difficulty with detecting background IBD between each pair of individuals separately is that it can result in throwing out many pairs of individuals whose levels of IBD sharing are near background, even when those pairs are truly related. Improved power for detecting background IBD can be obtained by leveraging the information inherent in previously inferred pedigree structures to infer background IBD for sets of multiple individuals at the same time.
Practitioners take an approach to identifying background IBD in which they consider the information contained in IBD sharing patterns across multiple individuals to determine when IBD is background and when it is due to true recent ancestry. In particular, we consider the problem in which all of the IBD observed in an individual is either background IBD, or true IBD due to a recent relationship.
To illustrate the approach, consider the IBD sharing pattern shown in
One can test for background IBD through a series of hypothesis tests. Given that IBD is observed between two sets of nodes, and , suppose that the putative common ancestors A_{1 }and A_{2 }through which the IBD was inherited are the most recent common ancestors of and , respectively. One can then consider each of the descendant nodes immediately below A_{1 }in turn (e.g., 7 and 8 in
All nodes that reject the null hypothesis of this test are dropped and the ancestral node is reset to be the common ancestor of all remaining IBD-carrying nodes. For example, if one detected that the clade below node 7 in
Let C_{n }denote the set of children of node n. To test whether the IBD observed below a child node c∈C_{A1 }is background IBD, consider the null hypothesis H_{0 }that the observed IBD below the node is real, and ask whether this hypothesis is rejected in favor of the alternative hypothesis H_{1 }that the IBD is background. Background IBD can either be lower than the expected true amount of IBD (as in the example in
Under the null hypothesis H_{0}, the observed IBD is assumed to be real and the degree d_{H_{0}}(A_{1}, A_{2}) between A_{1} and A_{2} is taken to be the maximum likelihood estimate, d_{H_{0}}(A_{1}, A_{2})=d_{L}(A_{1}, A_{2}), or the generalized DRUID estimate, d_{H_{0}}(A_{1}, A_{2})=d_{D}(A_{1}, A_{2}). One can then perform the following test:

 Reject H_{0 }at level α if:
where T_{c,A2} is the random variable describing the observed amount of IBD between descendants of c and descendants of A_{2} with observed value t_{c,A2}. The distribution of T_{c,A2} is given by Equation (20). It is reasonable to be conservative when dropping background IBD so that true relationships are called as background IBD only a small fraction of the time. Thus, in practice, we take α to be small, such as α=10^{−4}.
Determining When Ancestral Branches are Unrelated
One difficulty in constructing large pedigrees is determining the ancestors through which two sets of genotyped individuals are related. A simple fundamental question is whether two lineages are both on the maternal side of an individual, both on the paternal side, or on opposite parental sides. Without genotyped parents, the side through which a lineage passes can be difficult to determine, although sex chromosomes and mitochondrial haplotypes can be used to resolve the parent of origin in some cases.
Practitioners consider the problem of inferring whether two distant sets of relatives are related through the same parent of a focal individual, or through different parents. The scenario is shown in
The amount of IBD shared among pedigrees 1 and 2 is uninformative about whether they are related through the same parent. However, if pedigrees 1 and 2 are related to the focal individual 1 through the same parent, the IBD segments pedigree 1 shares with individual 1 cannot spatially overlap with the segments pedigree 2 shares with individual 1. This is because two overlapping segments would have undergone recombination in the parent (e.g., 10). The result will either be a spliced segment (
In the Bonsai method, when there are multiple possible grandparents through which we can connect two pedigrees and to a focal set of nodes in a focal pedigree we examine whether the IBD segments between and overlap the IBD segments between and .
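The overlap check described above can be sketched as a simple interval comparison. In the Python example below, the segment representation (chromosome, start, end), the function name, and the min_overlap tolerance are assumptions introduced for illustration; the sketch ignores phasing and segment-detection error.

```python
from typing import Iterable, Tuple

Segment = Tuple[str, float, float]  # (chromosome, start, end), e.g. in cM


def segments_overlap(
    segs_a: Iterable[Segment],
    segs_b: Iterable[Segment],
    min_overlap: float = 0.0,
) -> bool:
    """Return True if any segment in segs_a overlaps any segment in segs_b.

    Two segments overlap when they lie on the same chromosome and their
    shared interval is longer than min_overlap (a hypothetical tolerance).
    """
    segs_b = list(segs_b)
    for chrom_a, start_a, end_a in segs_a:
        for chrom_b, start_b, end_b in segs_b:
            if chrom_a != chrom_b:
                continue
            if min(end_a, end_b) - max(start_a, start_b) > min_overlap:
                return True
    return False
```

Here segs_a would hold the segments one pedigree shares with the focal nodes and segs_b the segments the other pedigree shares with the same focal nodes. A positive min_overlap makes the check tolerant of imprecise segment endpoints.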
Training the Probabilistic Relationship Model
Process 200 starts by receiving IBD data of a plurality of training sets for a plurality of relationships. See block 202. All individuals in each training set are genetically related with the same relationship in a pairwise manner. Each training set is associated with a unique relationship. The IBD data of each training set include pairwise IBD data of individuals in the training set.
Process 200 further involves obtaining age data of the plurality of training sets for the plurality of relationships. The age data of each training set includes age data for individuals in the training set. See block 204.
Process 200 further involves training the probabilistic relationship model using the IBD data of the plurality of training sets and the age data of the plurality of training sets. See block 206. The trained probabilistic relationship model is configured to take as input pairwise IBD data and age data for two test individuals and provide as output various likelihoods of various potential relationships for the two test individuals.
As shown here, the number of IBD segments is also used to train the probabilistic relationship model. In some implementations, the numbers of the two types of IBD segments may be modeled separately. In other implementations, the two numbers may be combined and modeled to have one probability distribution. Age difference between the two individuals in a pair is also used to train the probabilistic relationship model.
Three pairs of individuals are shown for parent-offspring training set 302. Each individual pair provides a data point of the length of half IBD, a data point of the number of IBD segments, and a data point of age difference. Although only three pairs of individuals are illustrated, the training set may include hundreds, thousands, or tens of thousands of pairs or more.
The data points from the individuals in the training set for each variable (IBD1, number of IBD segments, age difference) are used to train a probabilistic relationship model. The probabilistic relationship model models the probability distribution of each variable as a Gaussian distribution in this example. In other implementations, the probability distribution of each variable may be modeled as an exponential distribution, a Poisson distribution, a binomial distribution, a beta binomial distribution, and other suitable distributions based on prior knowledge of the variable.
The data from the training set 302 for the three variables (IBD1 length, number of IBD segments, and age difference) are used to train the probabilistic model. In some implementations, training involves using various techniques to fit the probability distribution to the training data. In some implementations, method-of-moments techniques are used to fit the Gaussian distribution of each variable to the training data. The probability distributions for the three variables are shown in box 312 for the parent-offspring relationship. The data and the distributions in the figure are for illustrative purposes only, and they do not reflect biological or mathematical reality. In the same manner, full siblings training set data 304 are used to train the probabilistic relationship model to obtain the probability distributions for the three variables as shown in box 314. The training data of the avuncular relationship in box 306 are used to train the probabilistic relationship model to obtain the probability distributions for the three variables for the avuncular relationship.
After the model is trained, it can be applied to estimate relationship likelihoods between individuals based on IBD data and age differences between the individuals. To apply the probabilistic relationship model, test data of two test individuals are provided to the trained model. The model provides as output the probability of each variable for each relationship. Multiple probabilities for multiple independent variables are aggregated for a relationship to provide a likelihood of the relationship. For example, likelihoods for a relationship as functions of each of the three variables may be summed to provide a composite likelihood indicating how likely the relationship is given the two test individuals' IBD and age data.
Regarding training, various techniques may be used to fit the model to the data. In some implementations, maximum likelihood methods may be used to obtain parameters that maximize the likelihood of the model given the data. In some implementations, method-of-moments techniques may be used to calculate distribution parameters from the training data. Other model fitting techniques such as kernel density estimation may also be employed, although such techniques tend to provide less accurate estimates in some applications.
Generating Pedigree Graphs
Another aspect of the disclosure provides methods for generating pedigree graphs that are informative, user-friendly, easy to understand or intuitive.
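A minimal Python sketch of the Gaussian, method-of-moments variant of this training and scoring procedure follows. The variable names, the dictionary layout, and the function names are hypothetical. The sketch aggregates the three variables by summing per-variable log likelihoods, which corresponds to multiplying per-variable likelihoods under an independence assumption; this is one possible aggregation and is an assumption of the example.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List, Tuple

VARIABLES = ("ibd1_length", "n_segments", "age_diff")  # hypothetical names
Model = Dict[str, Dict[str, Tuple[float, float]]]


def train(training_sets: Dict[str, List[Dict[str, float]]]) -> Model:
    """Fit one Gaussian per relationship and variable by method of moments."""
    model: Model = {}
    for relationship, rows in training_sets.items():
        model[relationship] = {}
        for var in VARIABLES:
            values = [row[var] for row in rows]
            # Sample mean and population variance are the moment estimates.
            # The variance must be positive for the density to be defined.
            model[relationship][var] = (mean(values), pvariance(values))
    return model


def log_gaussian(x: float, mu: float, var: float) -> float:
    """Log density of a Gaussian with mean mu and variance var at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def relationship_log_likelihoods(
    model: Model, pair: Dict[str, float]
) -> Dict[str, float]:
    """Composite log likelihood of each relationship for one test pair."""
    return {
        relationship: sum(log_gaussian(pair[v], *params[v]) for v in VARIABLES)
        for relationship, params in model.items()
    }
```

The relationship with the largest composite log likelihood is the most likely one under this toy model. Working in log space avoids numerical underflow when individual densities are very small.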
In some implementations not shown in
Process 400 then involves determining a minimal set of root entries from which all of the plurality of entries are reachable by starting from the minimal set of root entries and traversing from parent entries to child entries and traversing between partner entries. See block 404. As mentioned above, a root entry is an entry having no parent entries. Partner entries are two parent entries of the same child entry. In some implementations, two partner entries may be generated based on other information, such as nongenetic information indicating marriage or partnership that does not yield children. One can travel from one partner to another partner through the partner's children.
Process 400 further involves forming, a subtree, for each root entry of the minimal set of root entries, using the root entry as a root node and entries reachable from the root entry as additional nodes. This operation obtains a plurality of subtrees. See block 406. Root nodes do not have parent nodes, and they tend to be at the top of a subtree. On the contrary, leaf nodes do not have child nodes, and they tend to be at the bottom of a subtree.
Process 400 further involves positioning nodes in each of the plurality of subtrees. See block 408. In some implementations, positioning the nodes involves starting from the root node and recursively going through nodes in a defined order to reach nodes to be positioned. In some implementations, the defined order is as follows:

 1. children of partners to the left;
 2. children of partners to the right;
 3. partners to the left;
 4. self; and
 5. partners to the right.
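One way to read the defined order above is as a recursive, children-first traversal. The sketch below is an interpretation under stated assumptions, not a definitive implementation: the mappings `left_partners`, `right_partners`, and `children_of_pair` are hypothetical inputs, every child is assumed to belong to a pair of partners, and partners themselves are appended without further recursion.

```python
def positioning_order(node, left_partners, right_partners, children_of_pair, seen=None):
    """Return nodes in the order they would be positioned, starting at `node`."""
    seen = set() if seen is None else seen
    if node in seen:
        return []
    seen.add(node)
    lefts = left_partners.get(node, [])
    rights = right_partners.get(node, [])
    order = []
    # Items 1 and 2: children shared with partners to the left, then to the right.
    for partner in lefts + rights:
        for child in children_of_pair.get(frozenset((node, partner)), []):
            order += positioning_order(
                child, left_partners, right_partners, children_of_pair, seen
            )
    # Item 3: partners to the left.
    for partner in lefts:
        if partner not in seen:
            seen.add(partner)
            order.append(partner)
    # Item 4: the node itself.
    order.append(node)
    # Item 5: partners to the right.
    for partner in rights:
        if partner not in seen:
            seen.add(partner)
            order.append(partner)
    return order
```

Under this reading, leaf nodes are reached first and parents are positioned only after their children, which is consistent with positioning parent nodes relative to their child nodes.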
In some implementations, positioning nodes in each of the plurality of subtrees involves placing a leaf node without any siblings on the left at an origin. It also involves placing a leaf node with a sibling immediately to its left at a position immediately to the right of said sibling. In some implementations, it also involves positioning parent nodes relative to their child nodes so that they are further from the horizontal center of a row of nodes than their child nodes are. In some implementations, positioning nodes also involves positioning parent nodes relative to their child nodes' partner nodes, so that they are further from the horizontal center of a row of nodes than their child nodes' partner nodes are.
Similarly, the right branch of the subtree may be placed in positions as shown in
In some implementations, there are more than two parent nodes in a row. A first parent node is not at either end of the row. A second parent node is immediately to the left of the first parent node. These implementations include a rule of positioning the first parent node relative to the child nodes of the second parent node, so that the first parent node is to the right of the child nodes of the second parent node.
Parent node 706 is not on either end of the row. Parent node 704 is immediately to the left of parent node 706. In this case, the rule described above positions the parent node 706 relative to child node 716, which is the child of parent node 704. Node 702 is not a partner of node 704. Nonetheless, the same rule described above applies to nodes 704 and 702. Node 704 is not on either end of the row. It is placed relative to the two child nodes of node 702 so that it is to the right of node 714, a child of node 702.
In some implementations, after positioning the first parent node relative to the child nodes of the second parent node, the process shifts the child nodes of the first parent node to maintain the previous relative positions between the first parent node and its child nodes.
Returning to
In some implementations, merging the subtrees to form the pedigree graph involves shifting one or more of the subtrees so that non-corresponding nodes of different subtrees do not overlap. Non-corresponding nodes on different trees are nodes that do not represent the same individual.
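The shifting operation can be sketched as below. The sketch assumes that node positions are stored as integer grid coordinates in hypothetical mappings `placed` and `subtree` from node identifier to `(x, y)`, and that two nodes correspond when they share the same identifier. It moves the second subtree to the right in fixed steps until no non-corresponding node lands on an occupied cell; merging of corresponding nodes is a separate step not shown here.

```python
def shift_to_avoid_overlap(placed, subtree, step=1):
    """Shift `subtree` horizontally until it no longer overlaps `placed`.

    Nodes with the same identifier in both mappings are treated as
    corresponding nodes and are allowed to coincide.
    """
    occupied = {pos: node for node, pos in placed.items()}

    def collides(offset):
        return any(
            occupied.get((x + offset, y), node) != node
            for node, (x, y) in subtree.items()
        )

    dx = 0
    while collides(dx):
        dx += step
    return {node: (x + dx, y) for node, (x, y) in subtree.items()}
```

For example, with `placed = {"A": (0, 0), "B": (1, 0)}` and `subtree = {"C": (0, 0), "D": (1, 0)}`, the subtree is shifted two columns to the right.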
In some implementations, merging the subtrees to form the pedigree graph includes identifying core nodes that include a focal node representing a focal individual, any sibling nodes representing siblings of the focal individual, any parent nodes representing parents of the focal individual, any descendant nodes representing descendants of the focal individual, and any partner nodes representing partners of the descendants.
For example, for the pedigree graph in
In some implementations, merging the plurality of subtrees involves merging each pair of four pairs of subtrees to form four grandparent subtrees. See
Here, merging each pair of subtrees includes horizontally shifting one subtree so that non-corresponding nodes of the two subtrees do not overlap, and merging two corresponding nodes on the pair of subtrees representing the same grandparent into one node. See
In some implementations, the process further involves merging the four grandparent subtrees to form the pedigree graph. The merging involves horizontally shifting one or more of the grandparent subtrees so that non-corresponding nodes of different grandparent subtrees do not overlap and merging corresponding nodes on different grandparent subtrees representing the same individual into one node. In the pedigree tree graph in
In some implementations, the subtree corresponding to a great-grandparent is obtained by merging two or more subtrees. See, e.g.,
In some implementations, the two grandparent subtrees in the middle are smaller than the two grandparent subtrees on the outside. See
Some implementations further involve removing empty spaces in the pedigree graph. In some implementations, this involves removing a column of empty spaces in the pedigree graph when the removal of the empty spaces does not cause any non-corresponding nodes to overlap.
In some implementations, the process further includes applying force-directed graph drawing techniques to redraw one or more nodes and lines connecting them. In some implementations, the one or more nodes include leaf nodes and their parent nodes.
In some implementations, each pair of two or more pairs of parent nodes in the pedigree graph includes two nodes rendered in different colors. In some implementations, the lines connecting each child node to its parent nodes include curved lines.
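A minimal sketch of removing empty columns is given below. It assumes integer grid coordinates stored in a hypothetical mapping `positions` from node identifier to `(x, y)`, and it removes only columns that contain no node in any row. Because every node keeps its left-to-right order and its row, this compaction cannot make two nodes overlap.

```python
def remove_empty_columns(positions):
    """Close up columns that hold no node, keeping the leftmost column fixed."""
    if not positions:
        return {}
    used = sorted({x for x, _ in positions.values()})
    # Map each occupied column to consecutive columns starting at the leftmost one.
    new_x = {x: used[0] + i for i, x in enumerate(used)}
    return {node: (new_x[x], y) for node, (x, y) in positions.items()}
```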
Graphical User Interface for Pedigree Graphs
Another aspect of the disclosure relates to methods for displaying the pedigree graph for a plurality of genetically related individuals on a graphical user interface (GUI). The method is implemented using a computer system including a processor, system memory, and a display device. The method includes using the display device to display a pedigree graph including a plurality of nodes representing a plurality of genetically related individuals and lines connecting each child node to its pair of parent nodes. The child node and its pair of parent nodes represent a child and its pair of parents. Each pair of two or more pairs of parent nodes includes two nodes rendered in different colors. The lines connecting each child node to its pair of parent nodes include curved lines.
In some implementations,
In some implementations, the two or more pairs of parent nodes are direct ancestors of a focal node. In the example shown in
In some implementations, the relative nodes that are not direct ancestors of the focal node have the same coloring as a direct ancestor that is on the same family side and at the same generational level as the relative nodes.
In some implementations, generation levels at and above great-grandparents are rendered in the same color on the same side of the family.
In some implementations, nodes of the pedigree graph include core nodes that include the focal node representing a focal individual, any sibling nodes representing siblings of the focal individual, any parent nodes representing parents of the focal individual, any descendant nodes representing descendants of the focal individual, and any nodes representing partners of the descendants. In the example in
In some implementations, the pedigree graph includes lines indicating direct ancestry of the focal individual. The lines indicating the direct ancestry of the focal individual are rendered in a color or shade that is different from lines not indicating the direct ancestry of the focal individual. In the example in
In some implementations, at least one pair of parent nodes has an off-center alignment relative to their child nodes. See, e.g., parent nodes 827 and 828 relative to their child node 818.
In some implementations, two subtrees having the same topology in the pedigree graph are represented in different forms.
In some implementations, at least one pair of parent nodes has an inter-pair physical distance larger than the smallest possible distance. See, e.g., parent nodes 814 and 812.
In some implementations, one or more nodes and lines connecting them are drawn using force-directed graph drawing techniques. In some implementations, the one or more nodes include leaf nodes and their parent nodes. See, e.g., leaf nodes 836, 840, and 838, and parent nodes 832 and 834.
In some implementations, each child node is connected through a curved line to a straight line connecting the child node's parent nodes. See most of the direct ancestry lines on pedigree graph in
In some implementations, the GUI can interactively display and update the pedigree graph. A user may provide user input relating to genealogy information on any individuals represented by the nodes of the pedigree graph. In some implementations, the user input is provided in a way that involves an interaction with an input text field of the GUI and/or an interaction with a graphical element of the pedigree graph. For example, the user may point and click at a node on the pedigree graph, which activates an editing mode of the node. Then the user may provide genealogical or other information about the individual represented by the node. The information may include age, gender, partnership, ethnicity, nationality, relative relationship, name, photos, etc. A computer processor can then use data provided by or derived from the user input data to update the pedigree information underlying the pedigree graph or update the pedigree graph directly. Some implementations store and/or propagate the updated pedigree information to generate other pedigree graphs involving said information.
In some implementations, two or more different users may provide input to two or more different pedigree graphs. In some implementations, a computer system uses the input from one user to update the pedigree graph of another user, and vice versa. In such implementations, two or more different users can collaboratively update their pedigree graphs in real time.
In various embodiments, the pedigree graph is interactive. Namely, the pedigree graph is designed or configured to receive user input, modify information associated with the pedigree graph, and update the pedigree graph using the modified information. In some implementations, the user input is received via a user interaction with the pedigree graph in a GUI. In some implementations, the user interaction includes clicking an interactive node in the pedigree. An interactive node is one that is configured to receive user input and present information, sometimes the presented information being updated by the user input.
In some implementations, updating the pedigree graph automatically updates one or more display elements of the pedigree graph or in the GUI. In some implementations, the user interaction includes entering data using a window activated by clicking the interactive node. In some implementations, updating one or more elements of the pedigree graph includes changing a relationship between the interactive node and at least one other node in the pedigree graph. In some implementations, at least one interactive node in the pedigree graph is associated with information of health, traits, diseases, physical conditions, or phenotypes of an individual represented by the at least one interactive node.
In some implementations, the at least one interactive node is associated with a graphical element representing the information of health, traits, diseases, physical conditions, or phenotypes of the individual. In some implementations, the user input includes clicking the interactive node. The user provides input by clicking the interactive node. In some implementations, the user input includes entering information of individuals in a window activated by clicking the interactive node. In some implementations, updating the one or more elements of the pedigree graph includes changing at least one relationship between two nodes. In some implementations, updating the one or more elements of the pedigree graph includes changing the graphical element representing information of traits, diseases, physical conditions, or phenotypes of the individual.
In some implementations, the nodes of the pedigree graph are associated with information relating to diseases, physical conditions, traits, or phenotypes. In some implementations, the information of diseases, traits, physical conditions, or phenotypes is represented by a graphical icon or by graphical elements positioned next to the node. The graphical elements reflect conditions of the individual represented by the node.
Graphical icons 1608 and 1609 respectively represent a heart condition and a lung condition of a focal individual represented by node 1604. Graphical icons 1608 and 1609 are positioned with respect to node 1604. Similarly, graphical icons 1620 and 1621 respectively represent an eye color and a heart condition of the mother represented by node 1614. Graphical icons 1620 and 1621 are positioned with respect to mother node 1614. Similarly, graphical icons 1622 and 1623 respectively represent an eye color and a heart condition of the grandmother represented by node 1618. Graphical icons 1622 and 1623 are positioned with respect to the node 1618. The juxtaposition of the nodes and icons in the pedigree graph visualizes the heritability of diseases and traits.
In this example, the user interacts with the pedigree graph by inputting information related to the pedigree or individuals in the pedigree. In some implementations, the information comprises traits, diseases, physical conditions, or phenotypes. In some implementations, the user may provide input regarding a node, such as node 1606, by clicking the node 1606, which brings up a graphical window or GUI 1610 for displaying and receiving information about Jodi Neville represented by node 1606. In GUI 1610, the user may interact with elements in the window to display further information and/or input information related to Jodi. In this example, the user may further view Jodi's medical history information by clicking a link or button 1612, which brings up another GUI 1702 in
In many implementations and applications, a pedigree graph includes nodes for both genotyped individuals and ungenotyped individuals. Pedigree graphs can also be referred to as family trees. A genotyped node represents an individual whose genetic data have been used to determine the pedigree relationships depicted by the pedigree graph. An ungenotyped node represents an individual whose genetic data have not been used to determine the pedigree relationships depicted by the pedigree graph. Since ungenotyped nodes are inferred from the pedigree relationships, information about the individual is limited to the inference from the pedigree relationships.
For example,
ungenotyped nodes. Each genotyped node is labeled with two letters. For example, the focal node (node 1402) labeled as FZ has a genotyped sibling labeled as LZ. It also has two genotyped parents 1403 and 1404. It can be inferred that each of the parents has two parents and four grandparents. For instance, the parent node 1404 is inferred to have two parents 1405 and 1406. These inferred individuals are also shown in the pedigree graph. They are not associated with data beyond the inferred relationships among them.
A user may annotate ungenotyped nodes using annotation information such as name, gender, date of birth, etc. Such information and annotation are helpful for understanding individuals and the relationships represented by the pedigree graph, making the graph more informative. When a pedigree graph is updated with new relationships or additional individuals based on genotyped data, the identities of genotyped nodes are known, and their matching between an old graph and a new graph is straightforward. However, because the identities of the unannotated, ungenotyped nodes in a new graph are unknown, matching them to annotated, ungenotyped nodes in an old graph is not as straightforward. The matching requires using relationships between ungenotyped nodes and genotyped nodes. But such relationships in the old graph and the new graph may not be the same, making it difficult to reannotate ungenotyped nodes in the new graph using annotation data of corresponding nodes in the old graph. Some implementations provide methods and systems for reannotating ungenotyped nodes in pedigree graphs using annotation data of prior graphs.
Process 1300 also involves receiving annotation data to annotate one or more ungenotyped nodes of the first pedigree graph. See box 1304. Process 1300 also involves displaying the first pedigree graph with the one or more ungenotyped nodes annotated. See box 1306.
After annotation data are provided by the user, the first pedigree graph is displayed with grandparent node 1406 annotated as shown in
Returning to
Process 1300 further involves matching one or more annotated, ungenotyped nodes of the first pedigree graph respectively with one or more corresponding nodes of the second pedigree graph. See box 1310. Process 1300 involves annotating one or more corresponding nodes of the second pedigree graph respectively using annotation data of their matching nodes of the first pedigree graph. This annotation can be referred to as reannotating nodes in a second pedigree graph using annotation data of corresponding nodes in the first pedigree graph. See box 1312.
Process 1300 further involves displaying the second pedigree graph with the annotated one or more corresponding nodes. See box 1314.
Process 1350 for reannotating a pedigree graph involves receiving annotation data to annotate one or more ungenotyped nodes of the first pedigree graph that depicts relationships among a first plurality of individuals. See box 1352. In some implementations, the annotation data include name, maiden name, gender, date of birth, year of birth, place of birth, ethnicity, living or deceased state, date of death, place of death, and/or photographic data.
The first pedigree graph includes a plurality of genotyped nodes and one or more ungenotyped nodes, each node representing an individual. A genotyped node represents an individual whose genetic data have been used to determine the pedigree relationships depicted by the pedigree graph. An ungenotyped node represents an individual whose genetic data have not been used to determine the pedigree relationships depicted by the pedigree graph.
In some implementations, the first pedigree graph is displayed before annotation data is received. The annotation data may be provided by the user with reference to the displayed first pedigree graph.
An example of the first pedigree graph is shown in
Process 1350 also involves generating a second pedigree graph that depicts relationships among a second plurality of individuals. The second pedigree graph includes a plurality of genotyped nodes and one or more ungenotyped nodes, each node representing an individual. See box 1354. In some implementations, the second plurality of individuals includes at least one individual who is not among the first plurality of individuals. In other implementations, the first plurality of individuals includes at least one individual who is not among the second plurality of individuals. In various implementations, the first plurality of individuals and the second plurality of individuals overlap. In some implementations, the second plurality of individuals is identical to the first plurality of individuals, but the relationships among the second plurality of individuals are not identical to the relationships among the first plurality of individuals.
Process 1350 also involves matching one or more annotated, ungenotyped nodes of the first pedigree graph respectively with one or more corresponding nodes of the second pedigree graph. See box 1356.
Process 1350 further involves annotating the one or more corresponding nodes of the second pedigree graph respectively using annotation data of their matching nodes of the first pedigree graph. See box 1358. In some implementations, the process further involves displaying the annotated second pedigree graph using a display device.
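Once matching has produced a correspondence between nodes of the two graphs, the annotating operation amounts to carrying annotation data across that correspondence. The sketch below assumes hypothetical inputs: `annotations_first` maps node identifiers of the first graph to annotation fields, and `matches` maps node identifiers of the first graph to their matched identifiers in the second graph. Nodes without a match are left unannotated in the second graph.

```python
def reannotate(annotations_first, matches):
    """Copy annotation data from matched nodes of the first graph
    to their corresponding nodes of the second graph."""
    return {
        matches[node]: dict(fields)  # copy so later edits do not alias the old graph
        for node, fields in annotations_first.items()
        if node in matches
    }
```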
In some implementations, matching nodes of the first pedigree graph with nodes of the second pedigree graph includes the steps in the following pedigree matching procedure:
Procedure (Pedigree Matching).
1. determine that an individual N in the first pedigree graph matches an individual N in the second pedigree graph;
2. identify, among individuals represented by genotyped nodes in the first pedigree graph, relatives of P(1, 1, N) and relatives of P(1, 2, N), wherein P(1, 1, N) is in the first pedigree graph a first parent of N, P(1, 2, N) is in the first pedigree graph a second parent of N, and the relatives are biologically related and exclude any common direct descendants;
3. identify, among individuals represented by genotyped nodes in the second pedigree graph, relatives of P(2, 1, N) and relatives of P(2, 2, N), wherein P(2, 1, N) is in the second pedigree graph a first parent of N, P(2, 2, N) is in the second pedigree graph a second parent of N, and the relatives are biologically related and exclude any common direct descendants; and
4. a) match node P(1, 1, N) with node P(2, 1, N) or P(2, 2, N) when matching conditions are met, wherein the matching conditions comprise: any identified relatives of P(1, 1, N) are also identified relatives of P(2, 1, N) or P(2, 2, N) respectively, or

 b) match node P(1, 1, N) with either node P(2, 1, N) or node P(2, 2, N) when P(1, 1, N), P(2, 1, N) and P(2, 2, N) all have zero identified relatives.
In some implementations, one or more of the following must be met for matching a parent P(1, 1, N) of N on Tree 1 to a parent P(2, 1, N) of N on Tree 2. In some implementations, all of the following must be met to match.
Conditions (Pedigree Matching).
1. Any identified relative of P(1, 1, N) appearing on both trees is also an identified relative of P(2, 1, N);
2. any identified relative of P(1, 2, N) appearing on both trees is also an identified relative of P(2, 2, N);
3. no identified relative of P(1, 1, N) is also an identified relative of P(2, 2, N);
4. no identified relative of P(1, 2, N) is also an identified relative of P(2, 1, N);
5. all shared identified descendants of P(1, 1, N) and P(2, 1, N) have the same degrees of relationship to P(1, 1, N) and P(2, 1, N);
6. all shared identified descendants of P(1, 2, N) and P(2, 2, N) have the same degrees of relationship to P(1, 2, N) and P(2, 2, N);
7. P(1, 1, N) and P(2, 1, N) have at least one common identified relative, or P(1, 2, N) and P(2, 2, N) have at least one common identified relative, or P(1, 1, N), P(1, 2, N), P(2, 1, N) and P(2, 2, N) have no identified relatives.
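The seven conditions above can be expressed as set operations. The following is a minimal sketch assuming that all seven conditions are required; the `ParentInfo` container, the function names, and the `on_both_trees` argument (the set of genotyped individuals appearing on both trees) are hypothetical names introduced for illustration. Each parent carries a set of identified relatives (excluding common direct descendants) and a mapping from identified descendants to their degree of relationship.

```python
from dataclasses import dataclass, field


@dataclass
class ParentInfo:
    relatives: set = field(default_factory=set)      # identified relatives
    descendants: dict = field(default_factory=dict)  # descendant -> degree


def same_degrees(a, b):
    """Shared identified descendants have the same degree to both parents."""
    shared = a.descendants.keys() & b.descendants.keys()
    return all(a.descendants[d] == b.descendants[d] for d in shared)


def parents_match(p11, p12, p21, p22, on_both_trees):
    """Check conditions 1-7 for matching P(1,1,N) to P(2,1,N)."""
    all_empty = not (p11.relatives or p12.relatives or p21.relatives or p22.relatives)
    conditions = [
        (p11.relatives & on_both_trees) <= p21.relatives,  # condition 1
        (p12.relatives & on_both_trees) <= p22.relatives,  # condition 2
        not (p11.relatives & p22.relatives),               # condition 3
        not (p12.relatives & p21.relatives),               # condition 4
        same_degrees(p11, p21),                            # condition 5
        same_degrees(p12, p22),                            # condition 6
        bool(p11.relatives & p21.relatives)                # condition 7
        or bool(p12.relatives & p22.relatives)
        or all_empty,
    ]
    return all(conditions)
```

As a usage illustration, take a first parent with relatives {1, 4} on both trees and a second parent with relatives {3} on Tree 1 and {3, 5} on Tree 2, with node 2 as the only identified descendant of each parent. Treating node 2 as a first-degree descendant, all seven conditions hold and `parents_match` returns `True`.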
In some implementations, the matching conditions further include: each identified relative of P(1, 1, N) who is also an identified relative of P(2, 1, N) or P(2, 2, N) has the same category of relationship with P(1, 1, N) and P(2, 1, N) or with P(1, 1, N) and P(2, 2, N), respectively.
In some implementations, each category of relationship is selected from: direct ancestor relationships, direct descendant relationships, and other relationships. In some implementations, each category of relationships corresponds to a degree of relationship or similar degrees of relationships.
In some implementations, a data structure with a relationship dictionary is used to store the relatives of each node of interest, as well as relationships, relationship types, relationship degrees, or relationship categories of the relatives. By querying the dictionary, all relatives of an individual of interest can be determined. In some implementations, the relationship dictionary groups relationships of relatives into three categories: ancestors, descendants, and other relatives. In some implementations, relatives who are related through only marriage are excluded.
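A relationship dictionary of this kind can be sketched as a nested mapping. The sketch below is one possible layout, not a definitive one: it assumes a hypothetical input `parents_of` mapping each individual to a list of known parents, records the smallest number of generations as the degree for ancestors and descendants, and places in the "other" category anyone who shares at least one ancestor with the individual. Individuals connected only through marriage share no ancestor and are therefore excluded.

```python
def ancestor_degrees(node, parents_of):
    """Map each ancestor of `node` to the smallest number of generations up."""
    found = {}

    def walk(current, depth):
        for parent in parents_of.get(current, ()):
            if depth < found.get(parent, float("inf")):
                found[parent] = depth
                walk(parent, depth + 1)

    walk(node, 1)
    return found


def build_relationship_dictionary(parents_of):
    nodes = set(parents_of) | {p for ps in parents_of.values() for p in ps}
    anc = {n: ancestor_degrees(n, parents_of) for n in nodes}
    table = {}
    for n in nodes:
        descendants = {m: anc[m][n] for m in nodes if n in anc[m]}
        others = {
            m
            for m in nodes
            if m != n
            and m not in anc[n]
            and m not in descendants
            and anc[n].keys() & anc[m].keys()  # share at least one ancestor
        }
        table[n] = {"ancestors": dict(anc[n]), "descendants": descendants, "other": others}
    return table
```

Querying `table[individual]` then returns all relatives of that individual grouped into the three categories.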
In some implementations, both the first pedigree graph and the second pedigree graph include a genotyped node representing a same focal individual, and the individual N is the focal individual.
In some implementations, matching nodes includes repeating steps 1-4 of the pedigree matching procedure one or more times using the matched P(1, 1, N) and P(2, 1, N) in step 4 as the individual N in step 1.
In some implementations, matching nodes further includes, when matching conditions are not met in step 4 of the pedigree matching procedure, repeating steps 1-4 of the pedigree matching procedure using an individual represented by a genotyped node whose parents have not been matched.
In some implementations, matching nodes further includes matching a first node on the first pedigree graph with a second node on the second pedigree graph when the partner node of the first node and the partner node of the second node are matched. The partner node of the first node and the partner node of the second node can be matched based on the matching conditions. They can also be matched because they correspond to a same genotyped individual in the database.
A matching process can start by selecting a focal node 2 on both Tree 1 and Tree 2 corresponding to a same focal person that was genotyped. The same genotyped data of the person for the two nodes indicate that the two nodes represent the same individual. The matching process identifies all relatives of node A (a first parent of node 2) of Tree 1, which include node 1 and node 4. The relatives are biologically related and exclude any common direct descendants. The matching process also identifies all relatives of node A (a first parent of node 2) of Tree 2, which include node 1 and node 4. Therefore, pedigree matching conditions 1 and 7 are satisfied.
Moreover, the relative of node E in Tree 1 is node 3, and the relatives of node E in Tree 2 are node 3 and node 5. No relative of node A in Tree 1 (1, 4) is also a relative of node E in Tree 2 (3, 5). Similarly, no relative of node E in Tree 1 (3) is also a relative of node A in Tree 2 (1, 4). As such, pedigree matching conditions 2, 3, and 4 are satisfied.
Finally, on both Tree 1 and Tree 2, the only descendant of A and E is node 2. As such, pedigree matching conditions 5 and 6 are satisfied. Therefore, because all pedigree matching conditions are satisfied, node A of Tree 1 is matched with node A of Tree 2.
In some implementations, the process matches a first node on the first pedigree graph with a second node on the second pedigree graph when the partner nodes of the first and second nodes are matched. In this example, when the partner of node E in Tree 1 (node A) is matched with the partner of node E in Tree 2 (node A), node E in Tree 1 is also matched with node E of Tree 2.
The matching process in some implementations proceeds to use node A as the focal node and identify the relatives of node B and its partner node as two parent nodes of node A. Because relatives of node B in Tree 1 (1, 4) are also relatives of node B in Tree 2 (1, 4), pedigree matching conditions 1 and 7 are satisfied. Because the partner of B has no relatives in either Tree 1 or Tree 2, pedigree matching condition 2 is satisfied. Because no relatives of node B in Tree 1 (1, 4) are also relatives of node B's partner in Tree 2 (none), pedigree matching condition 3 is satisfied. Because no relatives of node B's partner in Tree 1 (none) are also relatives of node B in Tree 2 (1, 4), pedigree matching condition 4 is satisfied. Finally, because B and their partner have the same descendants (A and 2) in both Tree 1 and Tree 2, and because the relationships between these descendants and B and B's partner are the same in both trees, pedigree matching conditions 5 and 6 are satisfied. Therefore, node B of Tree 1 is matched with node B of Tree 2.
The matching process in some implementations proceeds to use node B as the focal node and identify the relatives of node C and node D as two parent nodes of node B in Tree 1 and node 1502 and node 1504 as two parent nodes of node B in Tree 2. Because a relative of node C in Tree 1 (1) is one of the relatives of node 1502 in Tree 2 (1, 4), pedigree matching condition 7 is satisfied. Because no relative of node C in Tree 1 (1) is also a relative of node 1504 in Tree 2 (none), pedigree matching condition 3 is satisfied. However, because a relative of node D in Tree 1 is also one of the relatives of node 1502 in Tree 2 (1, 4), pedigree matching condition 4 is not satisfied. Therefore, node C of Tree 1 cannot be matched with node 1502 of Tree 2. Similarly, node D of Tree 1 cannot be matched with node 1502 of Tree 2. Also, neither node C nor node D of Tree 1 can be matched with node 1504 of Tree 2. Therefore, node C and node D of Tree 1 cannot be matched with any nodes of Tree 2.
The matching process in some implementations proceeds to use node E as the focal node and identify node F and node G as two parent nodes of node E. Because the relatives of node G in Tree 1 (none) and the relatives of node G in Tree 2 (none) are the same, pedigree matching condition 1 is satisfied. Because the relative of node F in Tree 1 (3) is also one of the relatives of node F in Tree 2 (3, 5), pedigree matching conditions 2 and 7 are satisfied. Because no relatives of node F in Tree 1 (3) are also relatives of node G in Tree 2 (none), pedigree matching condition 3 is satisfied. Moreover, because no relatives of node G in Tree 1 (none) are also relatives of node F in Tree 2 (3, 5), pedigree matching condition 4 is satisfied. Finally, because E and 2 are the only descendants of F and G in both Tree 1 and Tree 2, and because these descendants have the same degrees of relationship to F and G in both trees, pedigree matching conditions 5 and 6 are satisfied. Therefore, node F of Tree 1 is matched with node F of Tree 2. Also, their partner nodes, node G on Tree 1 and node G on Tree 2, are matched.
In some implementations, when two parent nodes in Tree 1 and two parent nodes in Tree 2 all have zero relatives, either parent node in Tree 1 can be matched to either parent node of Tree 2. In these implementations, node J and node K in Tree 1 and node J and node K in Tree 2 all have zero relatives. Note that node 2 is not a relative, because it is a common descendant that is excluded. Moreover, the identified descendants of J and K are 2, E, and C and these descendants have the same degrees of relationship to J and K in both Tree 1 and Tree 2. Thus, all criteria are satisfied and either node J or node K in Tree 1 can be matched to either node J or node K in Tree 2.
However, node I and node H in Tree 1 cannot be matched to node 1506 or node 1508 in Tree 2, because node I and node H in Tree 1 have no relatives, but 1509 in Tree 2 has relative node 5. Therefore, pedigree matching condition 7 is not satisfied.
Pseudocode for matching nodes on two pedigree graphs for reannotation is provided below.
Apparatus and Systems
Processor 102 is coupled bidirectionally with memory 110, which can include a first primary storage, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM). As is well known in the art, primary storage can be used as a general storage area and as scratchpad memory, and can also be used to store input data and processed data. Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 102. Also as is well known in the art, primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 102 to perform its functions (e.g., programmed instructions). For example, memory 110 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bidirectional or unidirectional. For example, processor 102 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).
A removable mass storage device 112 provides additional data storage capacity for the computer system 100, and is coupled either bidirectionally (read/write) or unidirectionally (read only) to processor 102. For example, storage 112 can also include computerreadable media such as magnetic tape, flash memory, PCCARDS, portable mass storage devices, holographic storage devices, and other storage devices. A fixed mass storage 120 can also, for example, provide additional data storage capacity. The most common example of mass storage 120 is a hard disk drive. Mass storage 112, 120 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 102. It will be appreciated that the information retained within mass storage 112 and 120 can be incorporated, if needed in standard fashion as part of memory 110 (e.g., RAM) as virtual memory.
In addition to providing processor 102 access to storage subsystems, bus 114 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 118, a network interface 116, a keyboard 104, and a pointing device 106, as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed. For example, the pointing device 106 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.
The network interface 116 allows processor 102 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown. For example, through the network interface 116, the processor 102 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps. Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network. An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 102 can be used to connect the computer system 100 to an external network and transfer data according to standard protocols. For example, various process implementations disclosed herein can be executed on processor 102, or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing. Additional mass storage devices (not shown) can also be connected to processor 102 through network interface 116.
An auxiliary I/O device interface (not shown) can be used in conjunction with computer system 100. The auxiliary I/O device interface can include general and customized interfaces that allow the processor 102 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.
In addition, various implementations disclosed herein further relate to computer storage products with a computer-readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, and files containing higher level code (e.g., script) that can be executed using an interpreter.
The computer system shown in
In some implementations, DNA samples (e.g., saliva, blood, etc.) are collected from genotyped individuals and analyzed using DNA microarray or other appropriate techniques. The genotype information is obtained (e.g., from genotyping chips directly or from genotyping services that provide assayed results), stored in database 208, and used by system 206 to make ancestry predictions or pedigree determinations. Reference data including genotype data of reference individuals, simulated data (e.g., results of machine-based processes that simulate biological processes such as recombination of parents' DNA), precomputed data (e.g., precomputed reference haplotype data used in phasing and model training) and the like can also be stored in database 208 or any other appropriate storage unit.
EXPERIMENTAL
Accuracy of the Likelihood and Generalized DRUID Estimators
This experiment shows the accuracy of the likelihood and generalized DRUID estimators (Equations 21 and 24) for inferring the degree of relationship between two distantly related pedigrees.
Practitioners applied these estimators to infer the degree between the common ancestors A_1 and A_2 of two small pedigrees. For this analysis, two identical small pedigrees were simulated. Each small pedigree had the same topology, comprising the common ancestor A_1 or A_2, their spouse, their two children, and four grandchildren (two children for each child of A_1 or A_2). The ancestors A_1 and A_2 were then connected by degree d through a pair of common ancestors, where the degree d varied from 1 to 10.
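The simulated topology described above can be written down concretely. The following is a minimal sketch only: the function name `build_small_pedigree`, the identifier scheme, and the parent-map representation are illustrative assumptions and are not taken from the disclosure. The other parent of each grandchild is not described in the text, so the sketch leaves that parent unrecorded.

```python
from typing import Dict, List


def build_small_pedigree(prefix: str) -> Dict[str, List[str]]:
    """Return a map from each individual to that individual's known parents.

    Hypothetical helper: one common ancestor, a spouse, two children,
    and two grandchildren per child (eight individuals in total).
    """
    ancestor = f"{prefix}_ancestor"
    spouse = f"{prefix}_spouse"
    parents: Dict[str, List[str]] = {ancestor: [], spouse: []}
    for c in (1, 2):
        child = f"{prefix}_child{c}"
        parents[child] = [ancestor, spouse]
        for g in (1, 2):
            # The other parent of each grandchild is not described in the
            # text, so only the parent descended from the ancestor is listed.
            parents[f"{prefix}_grandchild{c}{g}"] = [child]
    return parents


# Two pedigrees with identical topology, as in the simulation setup.
pedigree_1 = build_small_pedigree("P1")
pedigree_2 = build_small_pedigree("P2")
```

Connecting the two ancestors by a degree d would add further individuals above `P1_ancestor` and `P2_ancestor`; that step is omitted here because the text does not specify the connecting lineage in detail.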
This experiment compares the accuracy and computer runtime of a computer-implemented method, referred to as Bonsai, according to some implementations with those of a method based on PRIMUS.
The experiment is conducted to evaluate 204 pedigrees of 23andMe customers. Pedigrees were chosen in which each nuclear family had at least two genotyped offspring and two genotyped parents. Pedigrees spanned at least two generations.
To evaluate the accuracy of pedigree inference, practitioners subsampled these pedigrees, sampling 10, 20, 30, 40, or 50% of their members uniformly at random without replacement. The subsampled individuals were then used to reconstruct the pedigrees using PRIMUS and Bonsai.
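The subsampling step can be illustrated as follows. This is a sketch under stated assumptions: the helper name `subsample_members`, the rounding of the sample size to the nearest integer, the floor of one member, and the 20-member example pedigree are all hypothetical and are not specified in the text.

```python
import random
from typing import List, Sequence


def subsample_members(members: Sequence[str], fraction: float,
                      rng: random.Random) -> List[str]:
    """Draw round(fraction * n) members uniformly at random without replacement."""
    # Assumption: the sample size is rounded and never drops below one member.
    k = max(1, round(fraction * len(members)))
    # random.Random.sample draws without replacement.
    return rng.sample(list(members), k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
members = [f"person_{i}" for i in range(20)]  # hypothetical 20-member pedigree
subsets = {f: subsample_members(members, f, rng)
           for f in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Passing the same subsets to both reconstruction methods keeps the comparison on identical inputs.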
The Bonsai and PRIMUS methods were applied to exactly the same pedigree subsets. The x-axis in
Bonsai and PRIMUS have similar runtimes when few or many individuals are sampled because there are many fewer possible pedigrees to explore. This suggests that Bonsai gains considerable computational efficiency by ignoring very low likelihood pedigrees. The experiment illustrates that due to the computational efficiency of Bonsai, it runs faster on computers and conserves computer resources compared to prior art methods.
Similarly,
Claims
1. A method, implemented using a computer system that includes one or more processors and system memory, for determining pedigree relationships among a plurality of genetically related individuals, the method comprising:
 a) identifying, among the plurality of genetically related individuals, a closest relative of a starting individual using genetic data of the plurality of genetically related individuals;
 b) applying pairwise Identity-by-Descent (IBD) data and pairwise age data of the starting individual and the closest relative to a probabilistic relationship model to obtain various likelihoods of various potential relationships between the starting individual and the closest relative;
 c) selecting one or more potential relationships between the starting individual and the closest relative that have relationship likelihoods meeting a relationship criterion, and forming a pedigree from each of the one or more potential relationships;
 d) identifying, among genetically related individuals not included in pedigrees already formed, a closest relative of any individual already in a pedigree;
 e) applying pairwise IBD data and pairwise age data of the closest relative and the individual already in the pedigree to the probabilistic relationship model to obtain various likelihoods of various potential relationships between the closest relative and the individual already in the pedigree;
 f) selecting one or more potential relationships between the closest relative and the individual already in the pedigree that have relationship likelihoods meeting the relationship criterion, and adding each of the one or more potential relationships to each pedigree already formed to grow each pedigree into one or more growing pedigrees;
 g) selecting growing pedigrees that have pedigree likelihoods meeting a pedigree criterion as the pedigrees already formed; and
 h) repeating (1d)-(1g) one or more times,
wherein (1a)-(1h) are performed by the computer system.
2. The method of claim 1, wherein the pairwise IBD data comprise a length of IBD segments.
3. The method of claim 2, wherein the lengths of IBD segments comprise a length of full IBD segments and/or a length of half IBD segments.
4. The method of claim 1, wherein the pairwise IBD data comprise a number of IBD segments.
5. The method of claim 4, wherein the number of IBD segments comprise a number of full IBD segments and/or a number of half IBD segments.
6. The method of claim 1, wherein the probabilistic relationship model is a machine-learning model.
7. The method of claim 1, wherein the probabilistic relationship model models the probability distribution of the pairwise IBD data for each relationship and/or the probability distribution of the pairwise age data for each relationship as a Gaussian distribution, a Poisson distribution, or an exponential distribution.
8. The method of claim 1, further comprising: storing into a database or retrieving from a database relationship data of a pedigree having the highest pedigree likelihood among the growing pedigrees selected in (1g).
9. The method of claim 8, further comprising:
 a) generating a pedigree graph using the relationship data of the pedigree having the highest pedigree likelihood; and
 b) displaying the pedigree graph on a display device.
10-19. (canceled)
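The iterative procedure recited in claim 1 can be sketched in code. The following is an illustrative sketch only, not the claimed implementation. It assumes ratio-based criteria of the kind recited in claims 21 and 22, it treats the pedigree likelihood as a product of the selected relationship likelihoods, and it ignores relationship direction and age data. The names `keep_by_ratio`, `build_pedigrees`, `shared_ibd`, and `rel_likelihoods`, and all numeric values, are hypothetical stand-ins for the probabilistic relationship model and its inputs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Edge = Tuple[str, str, str]  # (individual already placed, added relative, relationship)
Pedigree = Tuple[Tuple[Edge, ...], float]  # (edges, likelihood)


def keep_by_ratio(scored: List[tuple], threshold: float) -> List[tuple]:
    """Keep (item, likelihood) pairs whose likelihood / max likelihood exceeds threshold."""
    best = max(lik for _, lik in scored)  # assumes at least one positive likelihood
    return [(item, lik) for item, lik in scored if lik / best > threshold]


def build_pedigrees(start: str, individuals: Sequence[str],
                    shared_ibd: Callable[[str, str], float],
                    rel_likelihoods: Callable[[str, str], Dict[str, float]],
                    c: float = 0.1, d: float = 0.1) -> List[Pedigree]:
    placed = {start}
    pedigrees: List[Pedigree] = [((), 1.0)]
    while len(placed) < len(individuals):
        # (a)/(d): closest unplaced relative of any individual already placed.
        anchor, new = max(
            ((p, u) for p in placed for u in individuals if u not in placed),
            key=lambda pair: shared_ibd(*pair),
        )
        if shared_ibd(anchor, new) <= 0:
            break  # remaining individuals share no IBD and are excluded
        # (b)/(e) and (c)/(f): score relationships, keep those meeting the criterion.
        candidates = keep_by_ratio(list(rel_likelihoods(anchor, new).items()), c)
        grown = [(edges + ((anchor, new, rel),), lik * rel_lik)
                 for edges, lik in pedigrees for rel, rel_lik in candidates]
        # (g): keep only growing pedigrees meeting the pedigree criterion.
        pedigrees = keep_by_ratio(grown, d)
        placed.add(new)
    return pedigrees


# Toy inputs (made-up numbers): total shared IBD and relationship likelihoods.
ibd = {frozenset("AB"): 3400.0, frozenset("BC"): 1700.0, frozenset("AC"): 850.0}
model = {
    frozenset("AB"): {"parent-child": 0.9, "full siblings": 0.05},
    frozenset("BC"): {"grandparent-grandchild": 0.6, "half siblings": 0.3,
                      "first cousins": 0.02},
}
result = build_pedigrees("A", ["A", "B", "C"],
                         lambda a, b: ibd[frozenset((a, b))],
                         lambda a, b: model[frozenset((a, b))])
```

Pruning at each round keeps the number of candidate pedigrees bounded, which is the design choice that lets the search avoid enumerating very low likelihood pedigrees.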
20. The method of claim 1, wherein pairwise IBD data between two individuals are used to determine how closely related the two individuals are.
21. The method of claim 1, wherein the relationship criterion is a ratio of an instant relationship likelihood over a maximum relationship likelihood being larger than a value c.
22. The method of claim 1, wherein the pedigree criterion is a ratio of an instant pedigree likelihood over a maximum pedigree likelihood being larger than a specific value d.
23. The method of claim 1, wherein (1h) comprises repeating (1d)-(1g) until all individuals of the plurality of genetically related individuals have been identified as a closest relative or excluded from the pedigree.
24. The method of claim 1, further comprising:
 c) identifying two pedigrees from among a plurality of pedigrees constructed using operations (1a)-(1h), the two pedigrees being a genealogically closest pair of pedigrees among all pairs in the plurality of pedigrees.
25. The method of claim 24, wherein in operation (24c) a genealogical similarity between two pedigrees that are the genealogically closest pair of pedigrees is measured as a union over all IBD segments shared between an individual in a first pedigree in the genealogically closest pair of pedigrees and an individual in a second pedigree in the genealogically closest pair of pedigrees.
26. The method of claim 24 or 25, further comprising:
 d) combining the two pedigrees identified in operation (24c) into a combined pedigree.
27. The method of claim 26, further comprising:
 d) repeating operations (24c) and (26d) to agglomerate a plurality of pedigrees into a large pedigree.
28. The method of claim 26, wherein in operation (26d) the two pedigrees are combined by:
 a) identifying a first set of individuals in a first pedigree who share IBD with individuals in a second pedigree;
 b) identifying a second set of individuals in the second pedigree who share IBD with individuals in the first pedigree;
 c) identifying a common ancestor of the first set of individuals;
 d) identifying a common ancestor of the second set of individuals;
 e) inferring a degree of relatedness between the common ancestor of the first set and the common ancestor of the second set; and
 f) connecting the two common ancestors by an inferred degree of relatedness between the common ancestors.
29-35. (canceled)
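The genealogical similarity recited in claim 25, a union over all IBD segments shared between individuals of two pedigrees, can be illustrated with an interval-union computation. This is a sketch under stated assumptions: segments are represented as (chromosome, start, end) tuples in arbitrary length units, and the names `union_length`, `closest_pedigree_pair`, and `pair_segments` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Segment = Tuple[str, float, float]  # (chromosome, start, end), an assumed format


def union_length(segments: Iterable[Segment]) -> float:
    """Total length covered by the segments, counting overlaps only once."""
    by_chrom: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))
    total = 0.0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start > cur_end:          # gap: close the current merged interval
                total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:                        # overlap: extend the merged interval
                cur_end = max(cur_end, end)
        total += cur_end - cur_start
    return total


def closest_pedigree_pair(pedigrees: Dict[str, Set[str]],
                          pair_segments: Dict[frozenset, List[Segment]]
                          ) -> Tuple[str, str]:
    """Return the pair of pedigree names with the largest cross-pedigree IBD union."""
    def similarity(a: str, b: str) -> float:
        shared = [seg for i in pedigrees[a] for j in pedigrees[b]
                  for seg in pair_segments.get(frozenset((i, j)), [])]
        return union_length(shared)
    return max(combinations(sorted(pedigrees), 2), key=lambda ab: similarity(*ab))


# Toy data: individuals a and b (pedigree P) both share IBD with c (pedigree Q).
pedigrees = {"P": {"a", "b"}, "Q": {"c"}, "R": {"d"}}
pair_segments = {
    frozenset(("a", "c")): [("1", 0.0, 10.0)],
    frozenset(("b", "c")): [("1", 5.0, 20.0)],
    frozenset(("a", "d")): [("2", 0.0, 5.0)],
}
closest = closest_pedigree_pair(pedigrees, pair_segments)
```

Taking the union rather than the sum avoids double counting a region that several members of one pedigree share with the other pedigree.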
36. A method, implemented using a computer system that includes one or more processors and system memory, for combining two or more pedigrees into a larger pedigree, the method comprising:
 a) identifying a first pedigree and a second pedigree from among a plurality of pedigrees, the two pedigrees being a genealogically closest pair of pedigrees among all pairs in the plurality of pedigrees;
 b) identifying a first set of individuals in the first pedigree who share IBD with individuals in the second pedigree;
 c) identifying a second set of individuals in the second pedigree who share IBD with individuals in the first pedigree;
 d) identifying a common ancestor of the first set of individuals;
 e) identifying a common ancestor of the second set of individuals;
 f) inferring a degree of relatedness between the common ancestor of the first set and the common ancestor of the second set; and
 g) connecting the two common ancestors by an inferred degree of relatedness between the common ancestors to form a larger pedigree.
37-99. (canceled)
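The combining operations of claim 36 can be sketched as follows. This is an illustration only: pedigrees are assumed to be stored as child-to-parents maps, the common ancestor is taken to be the most recent one, ties between spouses are broken alphabetically, and the degree estimator is passed in as a hypothetical callable `infer_degree` rather than implemented. None of these choices is taken from the disclosure.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

ParentMap = Dict[str, List[str]]  # child -> known parents (assumed representation)


def ancestor_depths(person: str, parents: ParentMap) -> Dict[str, int]:
    """Map each ancestor (including the person, at depth 0) to its generation distance."""
    depth = {person: 0}
    frontier = [person]
    while frontier:  # breadth-first, so the first depth found is the smallest
        next_frontier = []
        for p in frontier:
            for parent in parents.get(p, []):
                if parent not in depth:
                    depth[parent] = depth[p] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return depth


def most_recent_common_ancestor(people: Iterable[str],
                                parents: ParentMap) -> Optional[str]:
    maps = [ancestor_depths(p, parents) for p in people]
    common = set(maps[0]).intersection(*maps[1:])
    if not common:
        return None
    # Most recent = smallest worst-case generation distance; ties go alphabetically.
    return min(sorted(common), key=lambda a: max(m[a] for m in maps))


def combine(parents_1: ParentMap, set_1: Iterable[str],
            parents_2: ParentMap, set_2: Iterable[str],
            infer_degree: Callable[[str, str], int]
            ) -> Tuple[ParentMap, Tuple[str, str, int]]:
    """Steps (d)-(g): find both common ancestors and record the inferred link."""
    a1 = most_recent_common_ancestor(set_1, parents_1)
    a2 = most_recent_common_ancestor(set_2, parents_2)
    link = (a1, a2, infer_degree(a1, a2))
    return {**parents_1, **parents_2}, link


parents_1 = {"c1": ["A1", "S1"], "c2": ["A1", "S1"], "g1": ["c1"], "g2": ["c2"]}
parents_2 = {"d1": ["A2", "S2"], "d2": ["A2", "S2"], "h1": ["d1"]}
merged, link = combine(parents_1, ["g1", "g2"], parents_2, ["h1", "d2"],
                       infer_degree=lambda a, b: 4)  # placeholder estimator
```

In this sketch the link is stored as a labeled edge; a fuller implementation might instead insert a chain of ungenotyped placeholder individuals of the inferred length.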
100. A method, implemented using a computer system comprising a processor and system memory, of generating pedigree graphs, the method comprising:
 a) receiving, by the processor, annotation data to annotate one or more ungenotyped nodes of a first pedigree graph that depicts relationships among a first plurality of individuals, wherein the first pedigree graph comprises a plurality of genotyped nodes and one or more ungenotyped nodes, each node representing an individual, a genotyped node represents an individual whose genetic data have been used to determine the pedigree relationships depicted by the pedigree graph, and an ungenotyped node represents an individual whose genetic data have not been used to determine the pedigree relationships depicted by the pedigree graph;
 b) generating, using the processor, a second pedigree graph that depicts relationships among a second plurality of individuals, wherein the second pedigree graph comprises a plurality of genotyped nodes and one or more ungenotyped nodes, each node representing an individual;
 c) matching, using the processor, one or more annotated, ungenotyped nodes of the first pedigree graph respectively with one or more corresponding nodes of the second pedigree graph; and
 d) annotating, using the processor, the one or more corresponding nodes of the second pedigree graph respectively using annotation data of their matching nodes of the first pedigree graph.
101-119. (canceled)
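The matching and annotating operations of claim 100 can be illustrated with one possible matching rule. The claim does not specify how nodes are matched, so everything about the rule below is an assumption: an ungenotyped node is identified by its sex together with the set of its genotyped descendants, genotyped individuals are assumed to keep the same identifiers in both graphs, and ambiguous matches are skipped. The names `genotyped_descendants` and `transfer_annotations` and the example data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

ChildMap = Dict[str, List[str]]  # node -> children (assumed graph representation)
Signature = Tuple[Optional[str], FrozenSet[str]]


def genotyped_descendants(node: str, children: ChildMap,
                          genotyped: Set[str]) -> FrozenSet[str]:
    """Collect every genotyped individual reachable below the given node."""
    found, seen, stack = set(), {node}, [node]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
                if child in genotyped:
                    found.add(child)
    return frozenset(found)


def transfer_annotations(first_children: ChildMap, first_sex: Dict[str, str],
                         annotations: Dict[str, dict],
                         second_children: ChildMap, second_sex: Dict[str, str],
                         genotyped: Set[str]) -> Dict[str, dict]:
    """Copy annotations from ungenotyped nodes of the first graph to matching
    ungenotyped nodes of the second graph (assumed signature-based matching)."""
    by_signature: Dict[Signature, List[str]] = {}
    for node in second_children:
        if node not in genotyped:
            sig = (second_sex.get(node),
                   genotyped_descendants(node, second_children, genotyped))
            by_signature.setdefault(sig, []).append(node)
    transferred: Dict[str, dict] = {}
    for node, note in annotations.items():
        sig = (first_sex.get(node),
               genotyped_descendants(node, first_children, genotyped))
        matches = by_signature.get(sig, [])
        if sig[1] and len(matches) == 1:  # skip empty or ambiguous signatures
            transferred[matches[0]] = note
    return transferred


# Toy graphs: the second graph was regenerated with new ids for ungenotyped nodes.
genotyped = {"g1", "g2", "g3"}
first_children = {"u1": ["g1", "g2"], "u2": ["g1", "g2"]}
first_sex = {"u1": "F", "u2": "M"}
annotations = {"u1": {"name": "Rose"}}  # made-up user annotation
second_children = {"x7": ["g1", "g2"], "x8": ["g1", "g2"], "x9": ["g3"]}
second_sex = {"x7": "M", "x8": "F", "x9": "F"}
transferred = transfer_annotations(first_children, first_sex, annotations,
                                   second_children, second_sex, genotyped)
```

An exact-signature rule of this kind fails when the second graph adds a genotyped descendant beneath the annotated node; an overlap-based score would be more tolerant and is left out of this sketch.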
Type: Application
Filed: Mar 11, 2022
Publication Date: Jun 23, 2022
Inventors: Ethan Macneil Jewett (San Jose, CA), Andrew C. Seaman (San Jose, CA), Kimberly Faith McManus (San Francisco, CA), William Allen Freyman (Menlo Park, CA), Cordell T. Blakkan (San Francisco, CA), Adam Auton (Menlo Park, CA), Joanna Louise Mountain (Menlo Park, CA), Susan M. Furest (San Francisco, CA), Rachel E. Lopatin (Los Altos, CA), Hang Xu (Sunnyvale, CA), Hilary M. Vance (Palo Alto, CA)
Application Number: 17/693,245