Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes

bioRxiv (2019)
DOI: 10.1101/779132

Abstract

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates “sample weights” within each marginal tree, which are then combined using a “summary function”; different statistics result from different choices of weight and function. Results can be reported in two dual ways: by site, which corresponds to statistics calculated from genome sequence; and by branch, which gives the expected value of the site statistic under the infinite-sites model of mutation. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit) and suggest several new statistics. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.
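The "sample weights" and "summary function" idea described above can be illustrated with a small toy sketch. This is not the authors' implementation and does not use any tree sequence library: the tree, the node times, the mutation placement, and the function names (`subtree_weights`, `branch_statistic`, `site_statistic`, `diversity_summary`) are all hypothetical, chosen only to show how one choice of weight and function yields mean pairwise diversity in both its branch and site forms on a single tree.

```python
# Toy illustration (assumed data, hypothetical names): one marginal tree with
# four samples (0-3) and three internal nodes (4-6). Node 6 is the root.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0}

# Each sample gets weight 1, so a node's accumulated weight is simply the
# number of samples that descend from it.
sample_weight = {0: 1, 1: 1, 2: 1, 3: 1}
n = sum(sample_weight.values())


def subtree_weights(parent, sample_weight):
    """Add each sample's weight to every node on its path to the root."""
    total = {}
    for sample, w in sample_weight.items():
        node = sample
        while node is not None:
            total[node] = total.get(node, 0) + w
            node = parent.get(node)
    return total


def diversity_summary(x):
    """Fraction of sample pairs separated by a branch with x samples below it."""
    return x * (n - x) / (n * (n - 1) / 2)


def branch_statistic(parent, time, sample_weight, summary):
    """Sum of branch length times summary(weight below), over all branches."""
    below = subtree_weights(parent, sample_weight)
    stat = 0.0
    for node, p in parent.items():
        if p is None:
            continue  # the root has no branch above it
        stat += (time[p] - time[node]) * summary(below.get(node, 0))
    return stat


def site_statistic(mutation_nodes, parent, sample_weight, summary):
    """Sum of summary(weight below), over the nodes that carry a mutation."""
    below = subtree_weights(parent, sample_weight)
    return sum(summary(below.get(node, 0)) for node in mutation_nodes)


# Assumed mutations: one on the branch above node 4, one above sample 2.
mutation_nodes = [4, 2]

branch_diversity = branch_statistic(parent, time, sample_weight, diversity_summary)
site_diversity = site_statistic(mutation_nodes, parent, sample_weight, diversity_summary)
print(branch_diversity, site_diversity)
```

On this toy tree the branch value equals the mean path length between pairs of samples, and the site value equals the mean number of pairwise differences implied by the two assumed mutations. The only difference between the two modes is whether the summary function is weighted by branch length or evaluated where mutations fall, which is the duality the abstract refers to. Swapping in a different `summary` function or different sample weights would give a different statistic without changing the traversal.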

Users of this resource

  • @peter.ralph