bento-sc documentation
Single-cell language modeling
This repository accompanies the study “A systematic assessment of single-cell language model configurations” (preprint paper link).
The package contains routines and definitions for pre-training single-cell (transcriptomic) language models.
Package features:
Memory-efficient scRNA-seq dataloading from h5torch-compatible HDF5 files.
yaml-configurable language model training scripts.
Modular and extendable data preprocessing pipelines.
A diverse set of downstream tasks to evaluate scLM performance.
Full instructions for reproducing our study results via bento-sc-reproducibility.
Install
bento-sc is distributed on PyPI.
pip install bento-sc
Note: The package has been tested with torch==2.2.2 and pytorch-lightning==2.2.5. If you encounter errors with bento-sc when using more recent versions of these two packages, consider downgrading them to the tested versions.
You may need to install PyTorch before running this command in order to ensure the right CUDA kernels for your system are installed.
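To see whether an environment matches the tested versions noted above, a small check along the following lines may help. This is a minimal sketch that uses only the Python standard library; it is not part of bento-sc, and the helper names `release_tuple` and `check_versions` are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Versions the README reports as tested.
TESTED = {"torch": "2.2.2", "pytorch-lightning": "2.2.5"}


def release_tuple(v):
    """Turn a version string such as '2.2.2+cu121' into (2, 2, 2).

    Local build tags (after '+') and pre-release suffixes are ignored,
    so this is a rough comparison, not a full PEP 440 parser.
    """
    parts = []
    for piece in v.split("+")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def check_versions(tested=TESTED, get_version=version):
    """Return a {package: status} report comparing installed vs. tested versions."""
    report = {}
    for name, tested_version in tested.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        if release_tuple(installed) > release_tuple(tested_version):
            report[name] = f"{installed} is newer than tested {tested_version}"
        else:
            report[name] = f"{installed} ok"
    return report


if __name__ == "__main__":
    for name, status in check_versions().items():
        print(f"{name}: {status}")
```

A "newer than tested" status does not mean bento-sc will fail. It only flags the packages to downgrade first if errors do occur.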
Package usage and structure
Please refer to our documentation page.
Academic reproducibility
All config files and scripts that were used to pre-train models and fine-tune them towards downstream tasks are included in a separate GitHub repository: bento-sc-reproducibility.
In addition, all scripts to reproduce the “baselines” in our study are located in the bento-sc-reproducibility repository.
Citation
If you end up using this code in your research, please cite:
@article {dewaele2025systematic,
author = {De Waele, Gaetan and Menschaert, Gerben and Waegeman, Willem},
title = {A systematic assessment of single-cell language model configurations},
year = {2025},
doi = {10.1101/2025.04.02.646825},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/04/08/2025.04.02.646825},
journal = {bioRxiv}
}