Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities


Microbial communities might include distinct lineages of closely related organisms that complicate metagenomic assembly and prevent the generation of complete metagenome-assembled genomes (MAGs). Here we show that deep sequencing using long (HiFi) reads combined with Hi-C binning can address this challenge even for complex microbial communities. Using existing methods, we sequenced the sheep fecal metagenome and identified 428 MAGs with more than 90% completeness, including 44 MAGs in single circular contigs. To resolve closely related strains (lineages), we developed MAGPhase, which separates lineages of related organisms by discriminating variant haplotypes across hundreds of kilobases of genomic sequence. MAGPhase identified 220 lineage-resolved MAGs in our dataset. The ability to resolve closely related microbes in complex microbial communities improves the identification of biosynthetic gene clusters and the precision of assigning mobile genetic elements to host genomes. We identified 1,400 complete and 350 partial biosynthetic gene clusters, most of which are novel, as well as 424 (298) potential host–viral (host–plasmid) associations using Hi-C data.
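To make the idea of discriminating variant haplotypes concrete, the toy sketch below tallies the allele combinations carried by individual long reads that span a set of variant positions. It is not the MAGPhase implementation: the input format (one dictionary of position-to-base calls per read), the function name `count_haplotypes` and the `min_reads` noise threshold are all assumptions made for illustration.

```python
from collections import Counter

def count_haplotypes(reads, positions, min_reads=2):
    """Tally allele combinations on reads that cover every variant position.

    reads: list of dicts mapping position -> base call (hypothetical format).
    positions: variant coordinates to phase, in order.
    min_reads: haplotypes supported by fewer reads are treated as noise
               (an illustrative threshold, not a published parameter).
    """
    counts = Counter()
    for read in reads:
        # Only reads spanning all positions yield a complete haplotype.
        if all(pos in read for pos in positions):
            counts["".join(read[pos] for pos in positions)] += 1
    return {hap: n for hap, n in counts.items() if n >= min_reads}

# Toy data: two well-supported haplotypes, one singleton, one partial read.
reads = [
    {100: "A", 250: "C", 900: "G"},
    {100: "A", 250: "C", 900: "G"},
    {100: "T", 250: "C", 900: "A"},
    {100: "T", 250: "C", 900: "A"},
    {100: "T", 250: "C", 900: "A"},
    {100: "A", 250: "G", 900: "G"},  # seen once, filtered as likely error
    {100: "A", 250: "C"},            # does not span position 900, skipped
]
haplotypes = count_haplotypes(reads, [100, 250, 900])
print(haplotypes)
```

In this toy input, two haplotypes pass the support threshold, which would suggest two lineages sharing the same reference contig. Long, accurate reads matter here because a haplotype can only be counted when a single read covers all of the variant positions.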

Data availability

The HiFi sheep dataset, Hi-C reads and WGS short reads are available on National Center for Biotechnology Information BioProject PRJNA595610 at accession IDs SRX7628648, SRX10704191 and SRX7649993, respectively. Whole-metagenome assemblies and MAG bins for the pCLR and HiFi datasets are available at The ‘kaiju_db_nr_euk_2021-02-24’ database was used for Kaiju classification ( The ‘2017-07’ version of the UniProt database was used for BlobTools classification (

Code availability

The MAGPhase script and codebase are part of the GitHub repository. Scripts to replicate the analysis of the manuscript and to implement the MAGPhase workflow are located at this centralized repository: (ref. 61). A listing of all analysis software packages used in this study can be found in Supplementary Table 10.


Acknowledgements

We thank K. McClure, K. Kuhn, B. Lee, J. Carnahan and W. Thompson for technical support. D.M.B. was supported by appropriated USDA CRIS Project 5090-31000-026-00-D. T.P.L.S. and S.B.S. were supported by appropriated USDA CRIS Project 3040-31000-100-00D. I.L., S.T.S. and G.U. were supported, in part, by NIH grants R44AI150008 and R44AI162570 to Phase Genomics. I.M. was supported by grants from the European Research Council (no. 640384) and from the Israel Science Foundation (no. 1947/19). M.K. and P.A.P. were supported by NSF/MCB-BSF grant 1715911. V.P.A. was supported by the US Defense Advanced Research Projects Agency’s Living Foundries program award HR0011-15-C-0084. A.K. and I.T. were supported by St. Petersburg State University (grant ID PURE 73023672). K.P. was supported by appropriated USDA CRIS Project 5090-21000-071-000-D. We thank P. J. Weimer for helpful comments and suggestions on the manuscript. The USDA does not endorse any products or services. Mention of trade names is for information purposes only. The USDA is an equal opportunity employer.

Author information

Author notes

  1. These authors contributed equally: D. M. Bickhart, M. Kolmogorov.


Affiliations

  1. USDA Dairy Forage Research Center, Madison, WI, USA

    Derek M. Bickhart & Kevin Panke-Buisse

  2. Department of Computer Science and Engineering, University of California – San Diego, La Jolla, CA, USA

    Mikhail Kolmogorov & Pavel A. Pevzner

  3. Pacific Biosciences, Menlo Park, CA, USA

    Elizabeth Tseng & Daniel M. Portik

  4. Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia

    Anton Korobeynikov & Ivan Tolstoganov

  5. Amazon, Seattle, WA, USA

    Gherman Uritskiy

  6. Phase Genomics, Seattle, WA, USA

    Ivan Liachko & Shawn T. Sullivan

  7. USDA Meat Animal Research Center, Clay Center, NE, USA

    Sung Bong Shin & Timothy P. L. Smith

  8. Department of Life Sciences and the National Institute for Biotechnology in the Negev, Ben Gurion University of the Negev, Beer Sheba, Israel

    Alvah Zorea & Itzhak Mizrahi

  9. Bioinformatics Group, Wageningen University, Wageningen, Netherlands

    Victòria Pascal Andreu & Marnix H. Medema


Contributions

T.P.L.S. and D.M.B. conceived the project, with extensive modifications introduced on the advice of I.L. and P.A.P. S.B.S. and T.P.L.S. were responsible for collecting the sample and generating the sequence data. D.M.B. and M.K. produced the assemblies and conducted a large proportion of the reported analyses. G.U. and S.T.S. conducted analyses related to Hi-C linkage data. V.P.A. and M.H.M. identified biosynthetic gene clusters in the dataset. D.M.B., A.Z. and I.M. identified mobile genetic elements in the sample. E.T. developed the MAGPhase algorithm, with input from D.M.B. D.M.B., T.P.L.S., M.K. and P.A.P. wrote the manuscript. All authors read and contributed to the final manuscript.

Corresponding authors

Correspondence to Pavel A. Pevzner or Timothy P. L. Smith.

Ethics declarations

Competing interests

The authors declare the following competing interests: M.H.M. is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. E.T. and D.M.P. are employees of Pacific Biosciences. G.U. is an employee of Amazon. S.T.S. and I.L. are co-founders and the CTO and CEO, respectively, of Phase Genomics. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks Mads Albertsen, C. Titus Brown and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Contig-level comparison of pCLR and HiFi assemblies.

a. Strategy for generating the read sets for the three pCLR and the HiFi assemblies. b. Comparison of contig length distributions in the four assemblies demonstrating a tendency for pCLR assembly to create longer contigs. c. Comparison of the total length of each assembly after separation of contigs into predicted Superkingdoms demonstrating an increased length from HiFi assembly among assigned Superkingdom and reduced length in unassigned bin. d. Comparison of the completeness of pCLR and HiFi assemblies based on the presence of >90% expected single-copy genes with

Extended Data Fig. 2 Assembled MAG taxonomy.

A circular dendrogram showing the presence (blue) and absence (black) of GTDB-Tk assigned taxonomy to assembly bins for the HiFi (outermost ring) and CLR (innermost rings, descending) assemblies. Branch nodes were consolidated to Genus-level affiliations when possible. Branch colors were assigned based on Phylum-level classification, with the exception of the Firmicutes, which was sub-divided into separate classes due to its increased diversity relative to other Phyla.

Extended Data Fig. 3 Read depth across orthologous, collapsed pCLR bins.

Each bin from separate, replicate pCLR assemblies corresponds to all three HiFi bins displayed in Supplementary Figure 6. Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars and the x-axis represents the total length of the entire MAG with contigs placed end to end in random order.

Extended Data Fig. 4 Read depth across three closely related HiFi Complete MAGs.

Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars and the x-axis represents the total length of the entire MAG with contigs placed end to end in random order.

Extended Data Fig. 5 Biosynthetic Gene Cluster Analysis.

The HiFi assembly revealed approximately 25% more complete Biosynthetic Gene Clusters (BGCs) than the average pCLR assembly (a). This increase was manifested in all identified BGC classes (colors in legend) and was not exclusive to one particular class. As found in other metagenome assembly datasets, the majority of identified BGCs were novel in all assemblies (b), but the HiFi assembly had a higher proportion of novel BGCs than the other assemblies. Additionally, the HiFi assembly contained the most partial BGCs (c) of any assembly.

Extended Data Fig. 6 CLR1 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).

Extended Data Fig. 7 CLR2 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).

Extended Data Fig. 8 CLR3 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).

Supplementary information

