Abstract
Phylogeny estimation (the reconstruction of evolutionary trees) has recently been applied to CRISPR-based cell lineage tracing, allowing the developmental history of an individual tissue or organism to be inferred from a large number of mutated sequences in somatic cells. However, current computational methods are not able to construct phylogenetic trees from extremely large numbers of input sequences. Here, we present a deep distributed computing framework to comprehensively trace accurate large lineages (FRACTAL) that substantially enhances the scalability of current lineage estimation software tools. FRACTAL first reconstructs only an upstream lineage of the input sequences and recursively iterates the same produce for its downstream lineages using independent computing nodes. We demonstrate the utility of FRACTAL by reconstructing lineages from>235 million simulated sequences and from>16 million cells from a simulated experiment with a CRISPR system that accumulates mutations during cell proliferation. We also successfully applied FRACTAL to evolutionary tree reconstructions and to an experiment using error-prone PCR (EP-PCR) for large-scale sequence diversification.
This is a preview of subscription content
Access options
Subscribe to Journal
Get full journal access for 1 year
92,52 €
only 7,71 € per issue
All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.
Buy article
Get time limited or full article access on ReadCube.
$32.00
All prices are NET prices.
Data availability
All of the simulated sequences and their lineages generated by PRESUME are available via the URLs given in Supplementary Table 2. The high-throughput sequencing data produced in this work are available at the NCBI Sequence Read Archive (PRJNA675984). P values obtained in this study are listed in Supplementary Data 5.
Code availability
The source codes and manuals of FRACTAL and PRESUME are available at GitHub (https://github.com/yachielab/FRACTAL and https://github.com/yachielab/PRESUME, respectively). FRACTAL and PRESUME can also be modified and executed on a small scale on web browsers through the Code Ocean, a cloud-based code sharing platform (https://doi.org/10.24433/CO.2433997.v1 and https://doi.org/10.24433/CO.3922773.v1, respectively). Note that the distributed computing mode is disabled in the Code Ocean because the Univa Grid Engine (UGE) is not supported in the Code Ocean. The other codes used in this study are also all available at GitHub (for the list, see Supplementary Table 5).
References
- 1.
Zou, Q., Wan, S., Zeng, X. & Ma, Z. S. Reconstructing evolutionary trees in parallel for massive sequences. BMC Syst. Biol. 11, 100 (2017).
- 2.
Mora, C., Tittensor, D. P., Adl, S., Simpson, A. G. & Worm, B. How many species are there on Earth and in the ocean? PLoS Biol. 9, e1001127 (2011).
- 3.
Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
- 4.
Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823–826 (2013).
- 5.
Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).
- 6.
Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
- 7.
Bowling, S. et al. An engineered CRISPR–Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell 181, 1410–1422 (2020).
- 8.
Salvador-Martinez, I., Grillo, M., Averof, M. & Telford, M. J. Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife 8, e40292 (2019).
- 9.
McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
- 10.
Raj, B. et al. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 36, 442–450 (2018).
- 11.
Spanjaard, B. et al. Simultaneous lineage tracing and cell-type identification using CRISPR–Cas9-induced genetic scars. Nat. Biotechnol. 36, 469–473 (2018).
- 12.
Alemany, A., Florescu, M., Baron, C. S., Peterson-Maduro, J. & van Oudenaarden, A. Whole-organism clone tracing using single-cell sequencing. Nature 556, 108–112 (2018).
- 13.
Quinn, J. J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021).
- 14.
Simeonov, K. P. et al. Single-cell lineage and transcriptome reconstruction of metastatic cancer reveals selection of aggressive hybrid EMT states. Cancer Cell 39, 1150–1162.e9 (2021).
- 15.
Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
- 16.
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
- 17.
Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, eaba7721 (2020).
- 18.
Sender, R., Fuchs, S. & Milo, R. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 14, e1002533 (2016).
- 19.
Barbera, P. et al. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68, 365–369 (2019).
- 20.
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
- 21.
Simonsen, M., Mailund, T. & Pedersen, C. N. S. in International Workshop on Algorithms in Bioinformatics 113–122 (Springer, 2008).
- 22.
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
- 23.
Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
- 24.
Yarza, P. et al. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 31, 241–250 (2008).
- 25.
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
- 26.
Frieda, K. L. et al. Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 107–111 (2017).
- 27.
Jones, M. G. et al. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 21, 92 (2020).
- 28.
Nishida, K. et al. Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353, aaf8729 (2016).
- 29.
Hwang, B. et al. Lineage tracing using a Cas9-deaminase barcoding system targeting endogenous L1 elements. Nat. Commun. 10, 1234 (2019).
- 30.
Grünewald, J. et al. A dual-deaminase CRISPR base editor enables concurrent adenine and cytosine editing. Nat. Biotechnol. 38, 861–864 (2020).
- 31.
Zhang, X. et al. Dual base editor catalyzes both cytosine and adenine base conversions in human cells. Nat. Biotechnol. 38, 856–860 (2020).
- 32.
Sakata, R. C. et al. Base editors for simultaneous introduction of C-to-T and A-to-G mutations. Nat. Biotechnol. 38, 865–869 (2020).
- 33.
Du, Z., Santella, A., He, F., Tiongson, M. & Bao, Z. De novo inference of systems-level mechanistic models of development from live-imaging-based phenotype analysis. Cell 156, 359–372 (2014).
- 34.
Ciccarelli, F. D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 (2006).
- 35.
Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
- 36.
Poe, S. & Swofford, D. L. Taxon sampling revisited. Nature 398, 299–300 (1999).
- 37.
Chow, K. K. et al. Imaging cell lineage with a synthetic digital recording system. Science 372, eabb3099 (2021).
- 38.
Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
- 39.
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
- 40.
Yu, M. K. et al. DDOT: a Swiss army knife for investigating data-driven biological ontologies. Cell Syst. 8, 267–273 (2019).
- 41.
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
- 42.
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
- 43.
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
- 44.
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2010).
- 45.
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
- 46.
Madeira, F. et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res. 47, W636–W641 (2019).
- 47.
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
- 48.
Baum, B. R. PHYLIP: phylogeny inference package. Version 3.2. Quarterly Review of Biology 64, 539–541 (1989).
- 49.
Zhao, L., Liu, Z., Levy, S. F. & Wu, S. Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics 34, 739–747 (2018).
- 50.
Satopaa, V., Albrecht, J., Irwin, D. & Raghavan, B. in 2011 31st International Conference on Distributed Computing Systems Workshops 166–171 (IEEE, 2011).
- 51.
Levenshtein, V. I. in Soviet Physics Doklady, Vol. 10 707–710 (Doklady Akademii Nauk SSSR, 1966).
- 52.
Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).
Acknowledgements
We thank members of the Yachie lab for valuable discussions and critical assessment of the work, especially A. Adel, S. King and S. Okawa for reviewing the manuscript. We also thank C. de Hoog for providing comments on the manuscript. Deep sequencing was performed with the support of H. Aburatani. This study was supported by the Canada Research Chair program (by the Canadian Institutes for Health Research), the Japan Science and Technology Agency (JST) PRESTO program (10814), the Japan Agency for Medical Research and Development (AMED) PRIME program (20gm6110007), the Shimadzu Science and Technology Foundation, the Naito Foundation, the Nakajima Foundation and the Asahi Glass Foundation (all to N.Y.). Y.K., S.I., H.M. and N.M. were supported by Japan Society for the Promotion of Science (JSPS) Research Fellowships. High-performance computing experiments were performed using the SHIROKANE Supercomputer at the University of Tokyo Human Genome Center or the NIG Supercomputer System at the National Institute of Genetics.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 PRESUME.
a, Schematic diagram of PRESUME. b, Various datasets of 32,768 sequences generated by PRESUME with different mutational parameters μ and α. The topological parameter σ was fixed to 0 (perfectly balanced sequence diversification). The diversity of nucleotide letters across different sequence positions is represented by a bit score distribution and by a sequence logo for the first 15 nt of the generated sequences. c, Partial diagrams of representative trees generated by PRESUME with different topological parameters σ.
Extended Data Fig. 2 Runtime simulation of FRACTAL.
a, Conceptual diagram of the runtime simulation of FRACTAL. Under a distributed computing environment with a fixed number of available nodes d, each FRACTAL job of input sequence size n starts if there is one or more available free computing node; otherwise, it is stalled until a node is released from one of the ongoing jobs. Each FRACTAL job process is modeled to occupy one computing node for a runtime f(n) and produce two new child jobs for the next job cycles each with an input sequence size of ⌊n/2⌋. When the input sequence size is under a certain threshold (n ≤ t), the job terminates after occupying one node for a runtime f(n) assuming the terminal lineage reconstruction process. b, Two implementation models of FRACTAL. In FRACTAL, each job cycle contains three steps that require high computing loads: sequence subsampling from the entire input sequences, phylogenetic placement, and sorting of sequences mapped on each of the sample lineage clades for the next job cycles. In model A, none of the three steps is parallelized where a runtime of each cycle is f(n) for the input sequence size of n. In model B, all of the three steps are perfectly parallelized where a runtime of each cycle is linearly reduced by the number of available computing nodes. The denominator of the formula represents the number of available computing nodes, assuming that the smaller the input sequences for a FRACTAL iteration cycle, the higher the likelihood of computing nodes being occupied by the other job processes at the same period. c, Linear regression f(n) using runtime log data of independent FRACTAL iteration cycles any of whose major three steps was not parallelized for the lineage reconstruction of the 235 million sequences.
Extended Data Fig. 3 Robustness of evolutionary lineage reconstruction with sequence data noise.
The scatter plots represent accuracies to reconstruct the SILVA 16S rRNA lineage from the input sequence datasets including different fractions of noise sequences. The noise sequences were generated by shuffling nucleotide positions of randomly selected sequences by keeping their alignment gap positions in the multiple sequence alignment result. The lineage dendrograms represent the ones reconstructed from the sequence dataset with 20% noise using FRACTALized RapidNJ, RAxML and FastTree.
Extended Data Fig. 4 Simulating proliferation of cells with the scalable CRISPR-barcode system.
a, The scGESTALT lineage tracing datasets of zebrafish brain development. BBI scores were obtained for branches that had more than three leaves associated to each of the two descending edges and a total of more than ten leaves associated to both of the two descending edges. b, c, Estimation of the insertion and deletion event probabilities per barcode array per generation in the scGESTALT dataset. The relative insertion and deletion event probabilities across each barcode array position and the probability distributions of insertion and deletion sizes were modeled using those observed in the scGESTALT dataset. The probabilities of the insertion and deletion events per generation for the production of 4,000 cells were fitted to the average total insertion and deletion lengths per barcode array observed in the scGESTALT dataset, respectively. d, Median fractions of base substitution per nucleotide position per generation observed for different secondary scaling factors for base editing in the simulation of producing 4,000 cells. The blue shading represents the 5 to 95 percentile range. e, f, Tree recovery scores for RapidNJ and FRACTALized RapidNJ in reconstructing the lineages of 4,000 cells simulated based on different secondary scaling factors for base editing.
Extended Data Fig. 5 Reconstruction of the BET002 sequence diversification process produced by EP-PCR.
a, Distribution in number of mutations observed in the second-generation sequences (CTNNB1 and BET002). b-g, BET002. b, Distribution of the second-generation sequences across the parental sample wells. The second-generation sequences were assigned based on their best matched parental first-generation sequences. c, The lineage tree of the mutated sequences reconstructed by FRACTAL. The sequences that did not have unique best matched sequences in the expected parental wells were filtered out after the lineage reconstruction. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. d, A zoom-in diagram of the lineage highlighted by yellow in c. e, Distribution of entropies for the clades each with 1,000 or more sequences. The statistical difference between the entropy distribution and the null distribution given by random sequence-well assignment was tested by two-sided Brunner-Munzel test. f, Reconstructed lineage of the sequences in the clade indicated by the arrow in d. g, Proportions of the parental sequences identified in the control PCR wells and normalized distances of the second-generation sequences in the reconstructed lineage for pairs in the same wells and pairs, one of each is from a different well. Orange dots represent significant differences between the intra- and inter-well distributions (two-sided Brunner-Munzel test with Bonferroni correction; adjusted P-value
Extended Data Fig. 6 Evolutionary lineage reconstruction without preliminary MSA.
a, Sample tree reconstruction and phylogenetic placement of FRACTAL with no preliminary multiple sequence alignment (MSA). A given number of sequences are first randomly subsampled from the input sequences (Step 1). The subsampled sequences are aligned with a common root sequence by MSA using MAFFT (Step 2) and a sample tree is reconstructed by a software tool of choice (Step 3). Each of the remaining input sequences are then independently added to the MSA result by ‘plus-one’ alignment using HMMER (Step 4i) and placed on the sample tree (Step 4ii). b, Accuracies of reconstructing various sizes of clades in the reference lineage of 1,000,000 sequences generated by RNASim. c, Accuracies and coverages of reconstructing the entire lineage of 1,000,000 unaligned sequences by FRACTALized RapidNJ, RAxML and FastTree with 100 computing nodes (five trials). d, Time series for the numbers of computing jobs and waiting jobs observed in the reconstruction of the simulated lineage of RNA evolution using 100 computing nodes for FRACTALization.
Extended Data Fig. 7 Lineages of CTNNB1 and BET002 datasets before filtering out the potential artifact sequences.
a, b, Reconstructed lineages. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. a, CTNNB1. b, BET002. c, d, Distribution of entropies for the clades with 1,000 or more sequences. c, CTNNB1. d, BET002. The statistical differences between the entropy distributions and the null distributions given by random sequence-well assignment were tested by two-sided Brunner-Munzel test. e, f, Unique read counts of the second-generation sequences uniquely best-matched to single parental sequences in the expected and unexpected wells and those best-matched to multiple parental sequences. The ones uniquely best-matched to single parental sequences are color-coded according to the parental wells. The second-generation sequences best-matched to single parental sequences of unexpected wells can be assumed to be cross-contaminants derived during the second EP-PCR and the following steps. The second-generation sequences redundantly best-matched to multiple parental sequences can be assumed to have parental sequences that were either cross-contaminated before the second EP-PCR or were conferred an insufficient number of mutations. e, CTNNB1. f, BET002.
Supplementary information
About this article
Cite this article
Konno, N., Kijima, Y., Watano, K. et al. Deep distributed computing to reconstruct extremely large lineage trees.
Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01111-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41587-021-01111-2
Note: This article have been indexed to our site. We do not claim legitimacy, ownership or copyright of any of the content above. To see the article at original source Click Here