\documentclass{article}

\usepackage{geometry}
\geometry{a4paper}

\usepackage[superscript,biblabel]{cite}
% \usepackage[super,numbers]{natbib}
\bibliographystyle{naturemag}
% \bibliographystyle{plain}
\usepackage[hidelinks]{hyperref}
\usepackage{authblk}
\usepackage{tabularray}
\usepackage{graphicx}
\usepackage{amssymb}

% Make affiliations run inline
\renewcommand\Affilfont{\small}
\renewcommand\Authfont{\normalsize}

% Key change: make affiliations run inline
\makeatletter
\renewcommand\AB@affilsepx{;\hspace{0.6em}\protect\Affilfont}
\makeatother

\begin{document}

\title{Population-scale Ancestral Recombination Graphs with tskit 1.0}

\input{authors.tex}

\date{}
\maketitle

% ARGs are now practical, and tskit is the infrastructure enabling their use.

\noindent
Ancestral recombination graphs (ARGs) capture the full genetic history of
samples from a recombining species.
Although ARGs have been a central
theoretical object in population genetics for decades, their practical use was
constrained by the lack of scalable inference methods, standard
interchange formats, and software infrastructure. Recent breakthroughs in
simulation and inference have substantially changed this landscape,
leading to renewed interest in ARG-based analyses across population and statistical
genetics\cite{brandt2024promise,lewanski2024era,nielsen2024inference}.
% genetics\cite{lewanski2024era}.
The tskit library has played a key enabling role in this shift
and has become foundational infrastructure for working with ARGs.
This paper marks the release of tskit 1.0, which formalises long-term stability
guarantees for its data formats and APIs.

% The succinct tree sequence is a complete, lossless encoding of ARGs.
At the core of tskit is the succinct tree sequence data model, which
defines a set of nodes (genomes at particular times) and
edges (inheritance relationships between nodes spanning genomic intervals)
in a simple tabular form\cite{kelleher2018efficient}.
This encoding provides a lossless representation of a general class of
ARGs
suitable for large-scale computation\cite{wong2024general}.
The data model also incorporates site, mutation, population, and pedigree
information and supports arbitrary
metadata associated with each of these components.
Provenance information is recorded natively, enhancing reproducibility
and transparency.
These features make the tskit data model a
semantically complete and interoperable
representation of ARGs that serves as a common foundation across diverse
analytical workflows (Figure 1).
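
As a rough illustration of the tabular encoding described above, the following
minimal sketch stores a small ARG as plain Python lists and recovers the local
tree at a given genomic position. It is a toy under stated assumptions, not the
tskit API: the names \texttt{NODES}, \texttt{EDGES}, \texttt{local\_tree}, and
\texttt{breakpoints}, and the five-node example itself, are hypothetical and
chosen only for exposition.

\begin{verbatim}
```python
# Toy tables in plain Python; this is NOT the tskit API.
# Node table: one row per node, giving (is_sample, time).
NODES = [
    (True, 0.0),   # node 0
    (True, 0.0),   # node 1
    (True, 0.0),   # node 2
    (False, 1.0),  # node 3
    (False, 2.0),  # node 4
]

# Edge table: (left, right, parent, child).
# The child inherits from the parent on the half-open interval [left, right).
EDGES = [
    (0.0, 10.0, 3, 1),
    (0.0, 5.0, 3, 0),
    (5.0, 10.0, 3, 2),
    (0.0, 10.0, 4, 3),
    (0.0, 5.0, 4, 2),
    (5.0, 10.0, 4, 0),
]


def local_tree(edges, position):
    """Return the child -> parent map of the tree covering `position`."""
    return {
        child: parent
        for left, right, parent, child in edges
        if left <= position < right
    }


def breakpoints(edges):
    """Return the sorted coordinates at which the local tree can change."""
    return sorted({x for left, right, _, _ in edges for x in (left, right)})


left_tree = local_tree(EDGES, 2.0)    # topology ((0,1)3,2)4
right_tree = local_tree(EDGES, 7.0)   # topology (0,(1,2)3)4
```
\end{verbatim}

In this sketch, the edges joining node 1 to node 3 and node 3 to node 4 span
the whole genome and are stored once, although they appear in both local trees.
Sharing of this kind between adjacent trees is what makes the encoding succinct.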

% This encoding first proved itself by transforming population-genetic
% simulation.

\begin{figure}
\includegraphics[width=\textwidth]{figure}
\caption{Tskit enables an interoperable ARG software ecosystem.
ARGs produced by simulation or inference tools can be analysed by diverse
downstream applications via tskit's well-defined tabular data model, C library, and
Python/Rust/R bindings. Tools shown are representative examples from Table S1
(three per category; ordered by citation count).}
\end{figure}


Simulation is a fundamental tool in population genomics, and
was the first domain in which the tskit data model demonstrated its impact.
Introduced initially as part of the msprime simulator,
the tskit data model enabled performance improvements of
several orders of magnitude
over previous coalescent simulation approaches\cite{kelleher2016efficient}.
The same representation later
enabled efficient forward-time simulation of ARGs
and yielded substantial speedups by avoiding
explicit simulation of neutral mutations\cite{kelleher2018efficient}.
Because these forward-time and coalescent
simulators share this common representation, their complementary
strengths can be combined within a single workflow. This has made it possible
to simulate ARGs under complex demographic scenarios involving geography and
selection that were previously infeasible.
Simulation capabilities have continued to expand,
including whole-autosome ARG simulations for nearly 1.5 million individuals
based on a large human pedigree\cite{andersontrocme2023genes}.

% Inference methods interoperate through tskit, enabling evaluation and reuse
% without imposing design choices.

The lack of scalable inference methods has been a major obstacle
to empirical application of ARGs. Although there
are many inference methods\cite{wong2024general},
tsinfer was the first to scale to hundreds of thousands of samples,
directly leveraging the tskit data model\cite{kelleher2019inferring}.
Many recent ARG inference methods have chosen to support tskit
as an output format in addition to their
own native representations (Table S1).
This shared output layer enables inferred ARGs to
interoperate directly with simulators, facilitating systematic evaluation and
benchmarking against known ground truth. It also shifts the burden of format
conversion away from downstream users, who can instead rely on inference tools
to emit results in a common, well-defined representation. The scalability and
flexibility of this approach are illustrated by the recent inference of an ARG
for 2.48 million SARS-CoV-2 whole genomes, which occupies 32 MiB of storage and
can be loaded into memory in under a second\cite{zhan2025pandemic}.

% Shared representation unlocks fast, correct, and reusable downstream
% analysis.

Efficient storage and analysis of large genetic datasets is a central
design goal of tskit, and the data model has enabled substantial
performance gains in downstream analyses.
For example, single-site
population genetic statistics can be computed orders of magnitude faster than from
genotype matrices while using far less memory by operating
on the underlying ARG structure\cite{ralph2020efficiently}.
Tskit exposes a large API with a
performance-critical core implemented in C and bindings available for Python,
Rust, and R.
Its vectorised, table-first design allows zero-copy access to
underlying arrays, supporting high-performance analysis pipelines.
As a result,
downstream tools inherit performance and correctness properties from a shared,
well-tested core.
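
To give some intuition for why statistics can be computed from the ARG
structure without a genotype matrix, the following toy sketch derives allele
frequencies directly from a single local tree. It is written in plain Python
under stated assumptions and is neither the tskit API nor the algorithm of the
cited work: the function \texttt{samples\_below}, the tree, and the mutation
placements are all hypothetical.

\begin{verbatim}
```python
def samples_below(parent, samples):
    """Count, for every node, how many samples descend from it."""
    counts = {}
    for sample in samples:
        node = sample
        # Walk from the sample to the root, crediting each node on the path.
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = parent.get(node)
    return counts


# Hypothetical local tree ((0,1)3,2)4 stored as a child -> parent map.
parent = {0: 3, 1: 3, 2: 4, 3: 4}
samples = [0, 1, 2]

# Hypothetical mutations, each recorded by the node above which it occurred.
mutation_nodes = [3, 2, 0]

counts = samples_below(parent, samples)

# The derived allele of a mutation above a node is carried by exactly the
# samples below that node, so its frequency needs no genotype matrix.
frequencies = [counts[node] / len(samples) for node in mutation_nodes]
```
\end{verbatim}

Here the frequency of each mutation is read off from the number of samples
below its node, so the genotype of each sample at each site is never written
out explicitly.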

% tskit 1.0 formalises this ecosystem as stable, long-term scientific
% infrastructure.

The goal of tskit is to provide a shared technical foundation, centred on
efficient, well-tested, and thoroughly documented primitive operations on ARGs, rather
than to directly implement end-user workflows.
This design principle has enabled a broad ecosystem of downstream
software---spanning simulation, ARG inference, population and statistical genetic
inference, analysis, and visualisation---with 64 published tools now using tskit
as a core dependency (Table S1).
Building on the initial introduction of the succinct tree sequence
data model\cite{kelleher2016efficient}
and its formalisation as a general ARG representation\cite{wong2024general},
tskit 1.0 marks the maturity of the software library and data model
for scalable ARG analysis (see Supplementary Information).
By focusing on stable
primitives rather than prescribing analytical pipelines, tskit enables
methodological innovation to concentrate on modelling, inference, and
interpretation rather than bespoke data formats and tooling.
In this way, tskit provides
a common and extensible foundation that supports the further expansion of
ARG-based analyses as datasets, methods, and applications grow.
As tskit is applied to a wider range of biological applications, future development
is likely to address additional complexities such as supporting multiple
chromosomes and structural variants.
Extensive documentation, tutorials, and other information are available
at \url{https://tskit.dev}.

\subsection*{Acknowledgements}
We gratefully acknowledge funding from the Robertson Foundation,
the NIH (research grants HG011395 and HG012473),
and the NSF (research grant OAC-2104115),
supporting core tskit development.

\bibliography{paper}
\end{document}