A Python library for probabilistic analysis of single-cell omics data

To the Editor: Methods for analyzing single-cell data¹,²,³,⁴ perform a core set of computational tasks. These tasks include dimensionality reduction, cell clustering, cell-state annotation, removal of unwanted variation, analysis of differential expression, identification of spatial patterns of gene expression, and joint analysis of multi-modal omics data. Many of these methods rely on likelihood-based models to represent variation in the data; we refer to these as ‘probabilistic models’. Probabilistic models provide principled ways to capture uncertainty in biological systems and are convenient for decomposing the many sources of variation that give rise to omics data⁵.

Despite the appeal of probabilistic models, several obstacles impede their community-wide adoption. The first obstacle, coming from the perspective of the end user, relates to the difficulty of implementing and running such models. Because probabilistic models are often implemented using Python machine-learning libraries, users are often required to interact with interfaces and objects that are lower level in nature than those used in popular environments for single-cell data analysis like Bioconductor⁶, Seurat⁷ or Scanpy⁸.

A second obstacle relates to the development of new probabilistic models. From the perspective of developers, there are many necessary routines to implement in support of a probabilistic model, including data handling, tensor computations, training routines that handle device management (for example, GPU (graphics processing unit) computing), and the underlying optimization, sampling and numerical procedures. Although higher-level machine-learning packages that automate some of these routines (for example, PyTorch Lightning⁹ or Keras¹⁰) are becoming popular, they do not work seamlessly with single-cell omics data.

To address these limitations, we present scvi-tools (https://scvi-tools.org/), a Python library for deep probabilistic analysis of single-cell omics data. From the end user’s perspective (Supplementary Note 1), scvi-tools offers standardized access to methods for many single-cell data analysis tasks, such as integration of single-cell RNA sequencing (scRNA-seq) data (scVI¹¹ or scArches¹²), annotation of single-cell profiles (CellAssign¹³ or scANVI¹⁴), deconvolution of bulk spatial transcriptomics profiles (Stereoscope¹⁵ or DestVI¹⁶), doublet detection (Solo¹⁷) and multi-modal analysis of CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) data (totalVI¹⁸).

In the broader analysis pipeline, scvi-tools sits downstream of initial quality control (QC)-driven preprocessing and generates outputs that may be further interpreted via general single-cell analysis tools (Fig. 1a). At its core, scvi-tools implements several key functionalities that are accessible across data modalities, such as differential analysis and dataset integration. All 14 models (Supplementary Table 1) currently implemented in scvi-tools interact with Scanpy through the annotated dataset (AnnData¹⁹) format, and the models share a consistent user interface (Fig. 1b). The scvi-tools library also has an interface with R such that each model may be used in Seurat or Bioconductor pipelines.
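The shared interface can be illustrated with a short sketch. The helper below is hypothetical (`fit_and_embed` is not part of scvi-tools); it assumes only that a model class exposes `setup_anndata`, a constructor taking the AnnData object, `train` and `get_latent_representation`, which is the pattern followed by models such as `scvi.model.SCVI` in recent releases. Method names and arguments may differ between versions, so the library documentation remains the reference.

```python
from typing import Any, Optional


def fit_and_embed(
    model_cls: Any,
    adata: Any,
    max_epochs: Optional[int] = None,
    **setup_kwargs: Any,
) -> Any:
    """Register the data, fit the model and return a per-cell latent embedding.

    Hypothetical helper: it relies only on the four calls below, so any model
    class that follows the same interface can be passed in.
    """
    # Tell the model class which AnnData fields to use (for example, a batch column).
    model_cls.setup_anndata(adata, **setup_kwargs)
    # Construct the model around the registered AnnData object.
    model = model_cls(adata)
    # Fit the model; max_epochs=None leaves the choice to the model's default.
    model.train(max_epochs=max_epochs)
    # One low-dimensional vector per cell, typically stored in adata.obsm.
    return model.get_latent_representation()
```

With scvi-tools installed, a call might look like `fit_and_embed(scvi.model.SCVI, adata, batch_key="batch")`, where `"batch"` is an assumed column name in `adata.obs`. Because the helper never names a specific model, swapping in another model class would only change the keyword arguments passed through to `setup_anndata`.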

**Fig. 1: User perspective of scvi-tools.**

We also illustrate two new features of scvi-tools applicable to several types of omics data. The first feature offers the ability to remove unwanted variation due to multiple nuisance factors simultaneously, including both discrete (for example, batch category) and continuous (for example, percent mitochondrial reads) factors. In Supplementary Note 2, we apply this in the context of an scRNA-seq dataset of Drosophila wing development that suffered from nuisance variation due to cell cycle, sex and replicate. The second feature extends several scvi-tools integration methods to iteratively integrate new ‘query’ data into a pretrained ‘reference’ model via the recently proposed scArches neural network architecture surgery¹². This feature is particularly useful for incorporating new samples into an analysis without having to reprocess the entire set of samples. Supplementary Note 3 presents a case study of applying this approach with totalVI by projecting data from patients with COVID-19 into an atlas of immune cells.
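To make the first feature concrete, the sketch below shows one common way to condition a model on several nuisance factors at once: each discrete factor is one-hot encoded and the continuous factors are appended, giving a single covariate vector per cell. This is an illustration of the idea under that assumption, not the library's internal code; the function name and the example column names are hypothetical. In scvi-tools itself, such covariates are declared when registering the data (the `categorical_covariate_keys` and `continuous_covariate_keys` arguments of `setup_anndata` in recent releases).

```python
from typing import Dict, List, Sequence


def encode_covariates(
    categorical: Dict[str, Sequence[str]],
    continuous: Dict[str, Sequence[float]],
) -> List[List[float]]:
    """Return one conditioning vector per cell: one-hot blocks, then continuous values."""
    lengths = {len(v) for v in list(categorical.values()) + list(continuous.values())}
    if len(lengths) != 1:
        raise ValueError("every covariate must have exactly one value per cell")
    n_cells = lengths.pop()

    # Fix a sorted level order per discrete factor so the columns are reproducible.
    levels = {name: sorted(set(values)) for name, values in categorical.items()}

    rows: List[List[float]] = []
    for i in range(n_cells):
        row: List[float] = []
        # One-hot block for each discrete factor (for example, replicate or sex).
        for name, values in categorical.items():
            row.extend(1.0 if values[i] == level else 0.0 for level in levels[name])
        # Continuous factors (for example, percent mitochondrial reads) are appended as-is.
        for values in continuous.values():
            row.append(float(values[i]))
        rows.append(row)
    return rows


# Three cells with two discrete factors and one continuous factor (made-up values).
covariates = encode_covariates(
    categorical={"replicate": ["r1", "r2", "r1"], "sex": ["F", "M", "M"]},
    continuous={"pct_mito": [0.02, 0.10, 0.05]},
)
```

Each row has five entries here: two for replicate, two for sex and one for the mitochondrial fraction. A model that receives this vector alongside the latent state can attribute variation to the nuisance factors rather than to the cell-state representation.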

From the perspective of a methods developer, scvi-tools offers a set of building blocks that make it easy to implement new models and modify existing models with minimal code overhead (Fig. 2a,b and Supplementary Note 4). These building blocks use popular libraries, such as AnnData¹⁹, PyTorch²⁰, PyTorch Lightning⁹ and Pyro²¹, and facilitate probabilistic model design with neural network components and GPU acceleration. This allows method developers to focus primarily on developing probabilistic models instead of on data management, model training and user-interface code. We show how these building blocks can be used for efficient model development through a reimplementation of Stereoscope, which substantially reduces code complexity (Fig. 2c–e and Supplementary Note 5). This example illustrates the broad scope of analyses that may be powered by scvi-tools.
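Once data handling and training are delegated to shared building blocks, the model-specific code that remains is largely the likelihood. As a minimal sketch, assuming a negative binomial observation model (a common choice for count data in this setting), the log-likelihood of a single count can be written with the standard library alone. The function name and the mean and inverse-dispersion parameterization are illustrative choices, not code taken from scvi-tools.

```python
import math


def nb_log_likelihood(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial distribution.

    Parameterized by mean mu and inverse dispersion theta, so that the
    variance is mu + mu**2 / theta.
    """
    if x < 0 or mu <= 0 or theta <= 0:
        raise ValueError("x must be >= 0 and mu, theta must be > 0")
    log_theta_mu = math.log(theta + mu)
    return (
        math.lgamma(x + theta)      # log Gamma(x + theta)
        - math.lgamma(theta)        # - log Gamma(theta)
        - math.lgamma(x + 1)        # - log x!
        + theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
    )


# Log-likelihood of observing 3 counts for a gene with mean 5 and inverse dispersion 2.
example = nb_log_likelihood(3, mu=5.0, theta=2.0)
```

In a deep generative model, `mu` would typically be produced by a neural network from the latent state and covariates, and the training objective would sum this term over genes and cells.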

**Fig. 2: The scvi-tools API for developers and reimplementation of Stereoscope.**

On the scvi-tools documentation website, we feature the application programming interface (API) reference of each model, as well as tutorials describing the functionality of each model and its interaction with other single-cell tools. We also make these tutorials available via Google Colab, which provides a free computing environment with GPU access and can support even large-scale analyses.

In the development of scvi-tools, we aimed to bridge the gap that exists between the single-cell software ecosystem and the contemporary machine-learning frameworks for constructing and deploying this class of models. Thus, developers can now expect to build models that are immediately accessible to end users in the single-cell community while continuing to rely on popular machine-learning libraries. On our documentation website, we provide a series of tutorials on building a model with scvi-tools, walking through the steps of data management, module construction and model development. We also built a template repository on GitHub that enables developers to quickly create a Python package that uses unit testing, automated documentation and popular code styling libraries. This repository demonstrates how the scvi-tools building blocks can be used for external model deployment. We anticipate that most models built with scvi-tools will be deployed in this way as independent packages while adhering to standard API and coding conventions, which will make them more readily accessible for new users.

As scvi-tools remains under active development, end users can expect that scvi-tools will continually evolve, adding support for new models, new workflows and new features. We anticipate that these resources will serve the single-cell community by facilitating the prototyping of new models, creating a standard for the deployment of probabilistic analysis software and enhancing the scientific discovery pipeline.

References

1. Svensson, V., da Veiga Beltrame, E. & Pachter, L. Database 2020, baaa073 (2020).
2. Lee, J., Hyeon, D. Y. & Hwang, D. Exp. Mol. Med. 52, 1428–1442 (2020).
3. Wagner, A., Regev, A. & Yosef, N. Nat. Biotechnol. 34, 1145–1160 (2016).
4. Zappia, L., Phipson, B. & Oshlack, A. PLOS Comput. Biol. 14 (2018).
5. Lopez, R., Gayoso, A. & Yosef, N. Mol. Syst. Biol. 16, e9198 (2020).
6. Gentleman, R. C. et al. Genome Biol. 5, R80 (2004).
7. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
8. Wolf, F. A., Angerer, P. & Theis, F. J. Genome Biol. 19, 15 (2018).
9. Falcon, W. & The PyTorch Lightning team. PyTorch Lightning (Version 1.4). (2019); https://doi.org/10.5281/zenodo.3828935
10. Chollet, F. et al. Keras. https://keras.io (2015).
11. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Nat. Methods 15, 1053–1058 (2018).
12. Lotfollahi, M. et al. Nat. Biotechnol. 40, 121–130 (2022).
13. Zhang, A. W. et al. Nat. Methods 16, 1007–1015 (2019).
14. Xu, C. et al. Mol. Syst. Biol. 17, e9620 (2021).
15. Andersson, A. et al. Commun. Biol. 3, 565 (2020).
16. Lopez, R. et al. Preprint at bioRxiv https://doi.org/10.1101/2021.05.10.443517 (2021).
17. Bernstein, N. J. et al. Cell Syst. 11, 95–101.e5 (2020).
18. Gayoso, A. et al. Nat. Methods 18, 272–282 (2021).
19. Angerer, P., Wolf, A., Virshup, I. & Rybakov, S. AnnData. GitHub https://github.com/theislab/anndata (2019).
20. Paszke, A. et al. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
21. Bingham, E. et al. J. Mach. Learn. Res. 20, 1–6 (2019).

Acknowledgements

We acknowledge members of the Streets and Yosef laboratories for general feedback. We thank all the GitHub users who contributed code to scvi-tools over the years. We thank Nicholas Everetts for help with the analysis of the Drosophila data. We thank David Kelley and Nick Bernstein for help implementing Solo. We thank Marco Wagenstetter and Sergei Rybakov for help with the transition of the scGen package to use scvi-tools, as well as feedback on the scArches implementation. We thank Hector Roux de Bézieux for insightful discussions about the R ecosystem. We thank Kieran Campbell and Allen Zhang for clarifying aspects of the original CellAssign implementation. We thank the Pyro team, including Eli Bingham, Martin Jankowiak and Fritz Obermeyer, for help integrating Pyro in scvi-tools. Research reported in this manuscript was supported by the NIGMS of the National Institutes of Health under award number R35GM124916 and by the Chan-Zuckerberg Foundation Network under grant number 2019-02452. O.C. is supported by the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning (EP/S023151/1, studentship 2420649). A.G. is supported by NIH Training Grant 5T32HG000047-19. A.S. and N.Y. are Chan Zuckerberg Biohub investigators.

Author information

Author notes

Maxime Langevin
Present address: Pasteur, Department of Chemistry, École Normale Supérieure, PSL University, Paris, France
Maxime Langevin
Present address: Molecular Design Sciences – Integrated Drug Discovery, Sanofi R&D, Vitry-sur-Seine, France
Yining Liu & Achille Nazaret
Present address: Department of Computer Science, Columbia University, New York, NY, USA
Gabriel Misrachi
Present address: Gleamer, Paris, France
Oscar Clivio
Present address: Department of Statistics, University of Oxford, Oxford, UK
These authors contributed equally: Adam Gayoso, Romain Lopez and Galen Xing

Affiliations

Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
Adam Gayoso, Galen Xing, Valeh Valiollah Pour Amiri, Justin Hong, Chenling Xu, Tal Ashuach, Mariano Gabitto, Aaron Streets, Michael I. Jordan & Nir Yosef
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
Romain Lopez, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Yining Liu, Mariano Gabitto, Michael I. Jordan & Nir Yosef
Chan Zuckerberg Biohub, San Francisco, CA, USA
Galen Xing, Aaron Streets & Nir Yosef
École Normale Supérieure Paris-Saclay, Gif-sur-Yvette, France
Pierre Boyeau, Edouard Mehlman & Oscar Clivio
Centre de Mathématiques Appliquées, École polytechnique, Palaiseau, France
Edouard Mehlman, Maxime Langevin, Gabriel Misrachi & Achille Nazaret
Mines Paristech, PSL University, Paris, France
Jules Samaran
Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
Mohammad Lotfollahi & Fabian J. Theis
School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
Mohammad Lotfollahi & Fabian J. Theis
Serqet Therapeutics, Cambridge, MA, USA
Valentine Svensson
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
Eduardo da Veiga Beltrame & Lior Pachter
Cellular Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
Vitalii Kleshchevnikov & Carlos Talavera-López
EMBL-European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, UK
Carlos Talavera-López
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
Lior Pachter
Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
Aaron Streets
Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
Michael I. Jordan
Department of Statistics, University of Michigan, Ann Arbor, MI, USA
Jeffrey Regier
Ragon Institute of Massachusetts General Hospital, MIT and Harvard, Cambridge, MA, USA
Nir Yosef

Contributions

A.G., R.L. and G.X. contributed equally. A.G. designed the scvi-tools application programming interface with input from G.X. and R.L. G.X. and A.G. led development of scvi-tools with input from R.L. G.X. reimplemented scVI, totalVI, AutoZI and scANVI with input from A.G. R.L. implemented Stereoscope with input from A.G. Data analysis in this manuscript was led by A.G., R.L. and G.X., with input from N.Y. A.G., R.L., P.B., E.M., M. Langevin, Y.L., J.S., G.M., A.N. and O.C. worked on the initial version of the codebase (scvi package), with input from M.I.J., J.R. and N.Y. R.L., E.M. and C.X. contributed the scANVI model, with input from J.R. and N.Y. A.G. implemented totalVI with input from A.S. and N.Y. T.A. implemented peakVI with input from A.G. A.G. implemented scArches with input from M. Lotfollahi, F.J.T. and N.Y. V.S. made several contributions to the codebase, including the LDVAE model. P.B. contributed the differential expression programming interface. E.d.V.B. and C.T.-L. provided tutorials on differential expression and deconvolution of spatial transcriptomics, with input from L.P. K.W. implemented CellAssign in the codebase with input from A.G. V.V.P.A., J.H. and M.J. made general code contributions and helped maintain scvi-tools. J.H. implemented LDA. T.A. and M.G. implemented MultiVI. V.K. improved Pyro support in scvi-tools and ported Cell2Location to use scvi-tools. N.Y. supervised all research. A.G., R.L., G.X., J.R. and N.Y. wrote the manuscript.

Corresponding author

Correspondence to Nir Yosef.

Ethics declarations

Competing interests

V.S. is a full-time employee of Serqet Therapeutics and has ownership interest in Serqet Therapeutics. F.J.T. reports consulting fees from Roche Diagnostics GmbH and Cellarity Inc., and ownership interest in Cellarity, Inc. N.Y. is an advisor to and/or has equity in Cellarity, Celsius Therapeutics and Rheos Medicines. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Martin Hemberg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

About this article

Cite this article

Gayoso, A., Lopez, R., Xing, G. et al. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01206-w

Published: 07 February 2022
DOI: https://doi.org/10.1038/s41587-021-01206-w
