Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities

Abstract

Microbial communities might include distinct lineages of closely related organisms that complicate metagenomic assembly and prevent the generation of complete metagenome-assembled genomes (MAGs). Here we show that deep sequencing using long (HiFi) reads combined with Hi-C binning can address this challenge even for complex microbial communities. Using existing methods, we sequenced the sheep fecal metagenome and identified 428 MAGs with more than 90% completeness, including 44 MAGs in single circular contigs. To resolve closely related strains (lineages), we developed MAGPhase, which separates lineages of related organisms by discriminating variant haplotypes across hundreds of kilobases of genomic sequence. MAGPhase identified 220 lineage-resolved MAGs in our dataset. The ability to resolve closely related microbes in complex microbial communities improves the identification of biosynthetic gene clusters and the precision of assigning mobile genetic elements to host genomes. We identified 1,400 complete and 350 partial biosynthetic gene clusters, most of which are novel, as well as 424 (298) potential host–viral (host–plasmid) associations using Hi-C data.
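
As a rough illustration of the idea of separating lineages by variant haplotypes, the Python sketch below groups long reads by the alleles they carry at a set of variant positions. It is not the MAGPhase implementation, whose details are not given in this abstract; the read names, positions and the `group_reads_by_haplotype` helper are hypothetical, and the rule that a read must span every variant site is a simplifying assumption.

```python
from collections import defaultdict


def group_reads_by_haplotype(read_alleles, variant_positions):
    """Group reads by the alleles observed at the given variant positions.

    read_alleles: dict mapping read name -> {position: base}
    variant_positions: ordered list of variant positions
    Only reads that span every position are assigned (a simplification).
    """
    haplotypes = defaultdict(list)
    for read, alleles in sorted(read_alleles.items()):
        if all(pos in alleles for pos in variant_positions):
            # The haplotype label is the string of bases at the variant sites.
            hap = "".join(alleles[pos] for pos in variant_positions)
            haplotypes[hap].append(read)
    return dict(haplotypes)


# Toy input: hypothetical long reads covering three variant sites.
reads = {
    "read1": {100: "A", 5000: "C", 90000: "T"},
    "read2": {100: "A", 5000: "C", 90000: "T"},
    "read3": {100: "G", 5000: "T", 90000: "T"},
    "read4": {100: "G", 5000: "T"},  # does not span all sites, so it is skipped
}
groups = group_reads_by_haplotype(reads, [100, 5000, 90000])
for hap, members in sorted(groups.items()):
    print(hap, members)
```

In this toy case the reads fall into two haplotype groups, which would correspond to two candidate lineages. Long, accurate reads matter here because a single read has to link variant sites that are far apart.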

Data availability

The HiFi sheep dataset, Hi-C reads and WGS short reads are available under National Center for Biotechnology Information BioProject PRJNA595610 at accession IDs SRX7628648, SRX10704191 and SRX7649993, respectively. Whole-metagenome assemblies and MAG bins for the pCLR and HiFi datasets are available at https://doi.org/10.5281/zenodo.4729049. The ‘kaiju_db_nr_euk_2021-02-24’ database was used for Kaiju classification (https://kaiju.binf.ku.dk/server). The ‘2017-07’ version of the UniProt database was used for BlobTools classification (https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2017_07/).

Code availability

The MAGPhase script and codebase are part of the https://github.com/Magdoll/cDNA_Cupcake GitHub repository. Scripts to replicate the analysis of the manuscript and to implement the MAGPhase workflow are located at this centralized repository: https://github.com/njdbickhart/SheepHiFiManuscript (ref. 61). A listing of all analysis software packages used in this study can be found in Supplementary Table 10.

References

  1. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).

  2. Chen, L.-X., Anantharaman, K., Shaiber, A., Eren, A. M. & Banfield, J. F. Accurate and complete genomes from metagenomes. Genome Res. 30, 315–333 (2020).

  3. Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).

  4. Singleton, C. M. et al. Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nat. Commun. 12, 2009 (2021).

  5. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

  6. Bickhart, D. M. et al. Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biol. 20, 153 (2019).

  7. Moss, E. L., Maghini, D. G. & Bhatt, A. S. Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat. Biotechnol. 38, 701–707 (2020).

  8. Zhang, L. et al. A comprehensive investigation of metagenome assembly by linked-read sequencing. Microbiome 8, 156 (2020).

  9. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).

  10. Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).

  11. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).

  12. Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J. & Vilanova, C. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Sci. Rep. 10, 13588 (2020).

  13. Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).

  14. Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).

  15. Kang, D. D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).

  16. Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. Species-level deconvolution of metagenome assemblies with Hi-C–based contact probability maps. G3 (Bethesda) 4, 1339–1346 (2014).

  17. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).

  18. Lapierre, P. & Gogarten, J. P. Estimating the size of the bacterial pan-genome. Trends Genet. 25, 107–110 (2009).

  19. Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).

  20. O’Brien, J. D. et al. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data. Genetics 197, 925–937 (2014).

  21. Quince, C. et al. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biol. 18, 181 (2017).

  22. Nicholls, S. M. et al. On the complexity of haplotyping a microbial community. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa977 (2020).

  23. Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).

  24. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  25. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  26. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).

  27. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).

  28. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

  29. Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).

  30. Kolmogorov, M. Supporting data for the manuscript ‘Generation of lineage-resolved complete metagenome-assembled genomes in complex microbial communities’. https://doi.org/10.5281/zenodo.5138306 (2021).

  31. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).

  32. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).

  33. Wang, B. et al. Variant phasing and haplotypic expression from long-read sequencing in maize. Commun. Biol. 3, 1–11 (2020).

  34. Tseng, E. cDNA_cupcake v24.0.0. https://github.com/Magdoll/cDNA_Cupcake

  35. Nei, M. & Rooney, A. P. Concerted and birth-and-death evolution of multigene families. Annu. Rev. Genet. 39, 121–152 (2005).

  36. Meleshko, D. et al. BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs. Genome Res. 29, 1352–1362 (2019).

  37. Blin, K. et al. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res. 47, W81–W87 (2019).

  38. Pellow, D. et al. SCAPP: an algorithm for improved plasmid assembly in metagenomes. Microbiome 9, 144 (2021).

  39. He, C. et al. Genome-resolved metagenomics reveals site-specific diversity of episymbiotic CPR bacteria and DPANN archaea in groundwater ecosystems. Nat. Microbiol. 6, 354–365 (2021).

  40. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).

  41. Guo, C.-J. et al. Discovery of reactive microbiota-derived metabolites that inhibit host proteases. Cell 168, 517–526 (2017).

  42. Press, M. O. et al. Hi-C deconvolution of a human gut microbiome yields high-quality draft genomes and reveals plasmid-genome interactions. Preprint at https://www.biorxiv.org/content/10.1101/198713v1 (2017).

  43. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).

  44. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).

  45. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  46. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  47. DeMaere, M. Z. & Darling, A. E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 20, 46 (2019).

  48. Laetsch, D. R. & Blaxter, M. L. BlobTools: interrogation of genome assemblies. F1000Research 6, 1287 (2017).

  49. Chan, P. P. & Lowe, T. M. tRNAscan-SE: searching for tRNA genes in genomic sequences. Methods Mol. Biol. 1962, 1–14 (2019).

  50. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  51. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at https://www.biorxiv.org/content/10.1101/705616v1 (2019).

  52. Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2020).

  53. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 12, 385 (2011).

  54. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  55. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289–300 (1995).

  56. Robinson, J. T. et al. Integrative Genomics Viewer. Nat. Biotechnol. 29, 24–26 (2011).

  57. Chen, Z., Erickson, D. L. & Meng, J. Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses. Genomics 113, 1366–1377 (2021).

  58. Stewart, R. D. et al. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat. Biotechnol. 37, 953–961 (2019).

  59. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

  60. Kautsar, S. A. et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 48, D454–D458 (2020).

  61. Bickhart, D. M. SheepHiFiManuscript. https://doi.org/10.5281/zenodo.5120910 (2021).

Acknowledgements

We thank K. McClure, K. Kuhn, B. Lee, J. Carnahan and W. Thompson for technical support. D.M.B. was supported by appropriated USDA CRIS Project 5090-31000-026-00-D. T.P.L.S. and S.B.S. were supported by appropriated USDA CRIS Project 3040-31000-100-00D. I.L., S.T.S. and G.U. were supported, in part, by NIH grants R44AI150008 and R44AI162570 to Phase Genomics. I.M. was supported by grants from the European Research Council (no. 640384) and from the Israel Science Foundation (no. 1947/19). M.K. and P.A.P. were supported by NSF/MCB-BSF grant 1715911. V.P.A. was supported by the US Defense Advanced Research Projects Agency’s Living Foundries program award HR0011-15-C-0084. A.K. and I.T. were supported by St. Petersburg State University (grant ID PURE 73023672). K.P. was supported by appropriated USDA CRIS Project 5090-21000-071-000-D. We thank P. J. Weimer for helpful comments and suggestions on the manuscript. The USDA does not endorse any products or services. Mentioning of trade names is for information purposes only. The USDA is an equal opportunity employer.

Author information

Author notes

  1. These authors contributed equally: D. M. Bickhart, M. Kolmogorov.

Affiliations

  1. USDA Dairy Forage Research Center, Madison, WI, USA

    Derek M. Bickhart & Kevin Panke-Buisse

  2. Department of Computer Science and Engineering, University of California – San Diego, La Jolla, CA, USA

    Mikhail Kolmogorov & Pavel A. Pevzner

  3. Pacific Biosciences, Menlo Park, CA, USA

    Elizabeth Tseng & Daniel M. Portik

  4. Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia

    Anton Korobeynikov & Ivan Tolstoganov

  5. Amazon, Seattle, WA, USA

    Gherman Uritskiy

  6. Phase Genomics, Seattle, WA, USA

    Ivan Liachko & Shawn T. Sullivan

  7. USDA Meat Animal Research Center, Clay Center, NE, USA

    Sung Bong Shin & Timothy P. L. Smith

  8. Department of Life Sciences and the National Institute for Biotechnology in the Negev, Ben Gurion University of the Negev, Beer Sheba, Israel

    Alvah Zorea & Itzhak Mizrahi

  9. Bioinformatics Group, Wageningen University, Wageningen, Netherlands

    Victòria Pascal Andreu & Marnix H. Medema

Contributions

T.P.L.S. and D.M.B. conceived the project, with extensive modifications introduced on the advice of I.L. and P.A.P. S.B.S. and T.P.L.S. were responsible for collecting the sample and generating the sequence data. D.M.B. and M.K. produced the assemblies and conducted a large proportion of the reported analyses. G.U. and S.T.S. conducted analyses related to Hi-C linkage data. V.P.A. and M.H.M. identified biosynthetic gene clusters in the dataset. D.M.B., A.Z. and I.M. identified mobile genetic elements in the sample. E.T. developed the MAGPhase algorithm, with input from D.M.B. D.M.B., T.P.L.S., M.K. and P.A.P. wrote the manuscript. All authors read and contributed to the final manuscript.

Corresponding authors

Correspondence to Pavel A. Pevzner or Timothy P. L. Smith.

Ethics declarations

Competing interests

The authors declare the following competing interests: M.H.M. is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. E.T. and D.M.P. are employees of Pacific Biosciences. G.U. is an employee of Amazon. S.T.S. and I.L. are co-founders and the CTO and CEO, respectively, of Phase Genomics. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks Mads Albertsen, C. Titus Brown and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Contig-level comparison of pCLR and HiFi assemblies.

a. Strategy for generating the read sets for the three pCLR and the HiFi assemblies. b. Comparison of contig length distributions in the four assemblies, demonstrating a tendency for pCLR assembly to create longer contigs. c. Comparison of the total length of each assembly after separation of contigs into predicted superkingdoms, demonstrating an increased length from HiFi assembly among assigned superkingdoms and a reduced length in the unassigned bin. d. Comparison of the completeness of pCLR and HiFi assemblies based on the presence of >90% expected single-copy genes with

Extended Data Fig. 2 Assembled MAG taxonomy.

A circular dendrogram showing the presence (blue) and absence (black) of GTDB-Tk-assigned taxonomy for assembly bins from the HiFi (outermost ring) and CLR (innermost rings, descending) assemblies. Branch nodes were consolidated to genus-level affiliations when possible. Branch colors were assigned based on phylum-level classification, with the exception of the Firmicutes, which were subdivided into separate classes owing to their greater diversity relative to other phyla.

Extended Data Fig. 3 Read depth across orthologous, collapsed pCLR bins.

Each bin from separate, replicate pCLR assemblies corresponds to all three HiFi bins displayed in Supplementary Figure 6. Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars, and the x-axis represents the total length of the entire MAG with contigs placed end to end in random order.

Extended Data Fig. 4 Read depth across three closely related HiFi Complete MAGs.

Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars, and the x-axis represents the total length of the entire MAG with contigs placed end to end in random order.

Extended Data Fig. 5 Biosynthetic Gene Cluster Analysis.

The HiFi assembly revealed approximately 25% more complete biosynthetic gene clusters (BGCs) than the average pCLR assembly (a). This increase was observed in all identified BGC classes (colors in legend) and was not exclusive to one particular class. As found in other metagenome assembly datasets, the majority of identified BGCs were novel in all assemblies (b), but the HiFi assembly had a higher proportion of novel BGCs than the other assemblies. Additionally, the HiFi assembly contained the most partial BGCs (c) of any assembly.

Extended Data Fig. 6 CLR1 viral association network plot.

Viral contigs identified from Blobtools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent associations identified for each connection, with colors representing the identification of partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red), respectively.

Extended Data Fig. 7 CLR2 viral association network plot.

Viral contigs identified from Blobtools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent associations identified for each connection, with colors representing the identification of partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red), respectively.

Extended Data Fig. 8 CLR3 viral association network plot.

Viral contigs identified from Blobtools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent associations identified for each connection, with colors representing the identification of partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red), respectively.
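
The edge coloring shared by the three network plots above can be written as a small lookup. The Python sketch below is illustrative only: the `edge_color` helper and the contig names are hypothetical, and the boolean evidence flags are assumed to have been computed elsewhere.

```python
def edge_color(hifi_overlap, hic_link):
    """Return the edge color for a virus-host association, or None if unsupported."""
    if hifi_overlap and hic_link:
        return "red"    # supported by both data types
    if hifi_overlap:
        return "blue"   # partial HiFi read overlap only
    if hic_link:
        return "green"  # Hi-C read links only
    return None


# Hypothetical associations: (viral contig, host contig, HiFi overlap, Hi-C link)
associations = [
    ("viral_contig_1", "host_contig_A", True, False),
    ("viral_contig_2", "host_contig_B", False, True),
    ("viral_contig_3", "host_contig_A", True, True),
]
colored_edges = [(v, h, edge_color(o, l)) for v, h, o, l in associations]
for edge in colored_edges:
    print(edge)
```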

About this article

Cite this article

Bickhart, D.M., Kolmogorov, M., Tseng, E. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01130-z
