Getting started
Install
bento-sc is distribution on PyPI.
pip install bento-sc
Note: The package has been tested with torch==2.2.2 and pytorch-lightning==2.2.5. If you encounter errors with bento-sc using more recent version of these two packages, consider downgrading.
You may need to install PyTorch before running this command in order to ensure the right CUDA kernels for your system are installed.
Download and prepare data
In order to get up and running, you will need to download the relevant data and process them to h5torch-compatible HDF5 files.
All data downloading and processing routines are made available through a single CLI script bentosc_data:
usage: bentosc_data [-h] datafile
Data downloading script launching pad. Choose a datafile to download and process to h5torch format.
positional arguments:
datafile Datafile to download, choices: {scTab, scTab_upscaling, scTab_grn, neurips_citeseq, replogle_perturb, batchcorr_embryolimb, batchcorr_greatapes, batchcorr_circimm}
options:
-h, --help show this help message and exit
On each of the subcommands, you can also call -h. E.g.: bentosc_data scTab -h.
To download and preprocess the pre-training data:
bentosc_data scTab ./data_tmp/ ./scTab.h5t
To perform downstream task evaluations, additionally process any or all task-specific data:
bentosc_data scTab_upscaling ./scTab.h5t ./scTab_upsc_val.h5t val
bentosc_data scTab_upscaling ./scTab.h5t ./scTab_upsc_test.h5t test
bentosc_data scTab_grn ./scTab.h5t ./scTab_grn_val.h5t ./ext_pertdata.h5ad ./scenicdb.feather val
bentosc_data scTab_grn ./scTab.h5t ./scTab_grn_test.h5t /ext_pertdata.h5ad ./scenicdb.feather test
bentosc_data neurips_citeseq ./data_tmp/ ./scTab.h5t ./citeseq.h5t
bentosc_data replogle_perturb ./data_tmp/ ./scTab.h5t ./perturb.h5t
bentosc_data batchcorr_embryolimb ./data_tmp/ ./scTab.h5t ./batchcorr_el.h5t
bentosc_data batchcorr_greatapes ./data_tmp/ ./scTab.h5t ./batchcorr_ga.h5t
bentosc_data batchcorr_circimm ./data_tmp/ ./scTab.h5t ./batchcorr_ci.h5t
Note: downloading and processing all of this data will cumulatively take up quite a bit of time and storage space. Allow for at least 400Gb storage space.
Pre-training a model
Pre-training can be performed through the CLI script bentosc_pretrain:
usage: bentosc_pretrain [-h] [--data_path str] [--lr float] [--ckpt_path str] [--tune_mode boolean] config_path logs_path
Pre-training script.
positional arguments:
config_path .yaml config file controlling most of the pre-training parameters.
logs_path Where to save tensorboard logs and model weight checkpoints.
options:
-h, --help show this help message and exit
--data_path str Data file. Overrides value in config file if specified (default: None)
--lr float Learning rate. Overrides value in config file if specified (default: None)
--ckpt_path str Continue from checkpoint (default: None)
--tune_mode boolean Don't pre-train whole model but run small experiment. (default: False)
The first input to bentosc_pretrain, a config.yaml file, controls most of the pre-training logic.
A minimal example of a working config file is:
#DataModule args:
batch_size: 192 # per-gpu batch size
devices: [0, 1] # devices to use. here, the first and second GPU are used (total batch size: 384)
n_workers: 12 # number of CPUs to use in dataloading
in_memory: False # don't load scTab.h5t into memory fully, but read from disk during training.
val_sub: True # use the subsetted validation set of scTab.
# Data processing args:
return_zeros: False # if False, load in the cell profiles without zeros
allow_padding: False # if False, cut to min size in batch
input_processing: # cell-wise preprocessing functions
- type: FilterTopGenes # get the top genes
affected_keys: ["gene_counts", "gene_index", "gene_counts_true"]
number: 1024
- type: Bin # Bin input counts
key: "gene_counts"
- type: Mask # Mask 15% of input counts
p: 0.15
key: "gene_counts"
- type: Bin # Bin output counts
key: "gene_counts_true"
# Model args:
discrete_input: True # Set to True if using binned or rank input encodings
n_discrete_tokens: 29 # the number of bins
gate_input: False # use the gating mechanism on input embeddings
pseudoquant_input: False # use the pseudoquantization mechanism on input embeddings
dim: 512 # hidden dim of the transformer
depth: 10 # number of transformer encoder layers
dropout: 0.2 # dropout rate in attention matrix and FF
n_genes: 19331 # number of genes in dataset to initialize gene index vocabulary
# General learning args
lr: 3e-4 # Pre-training learning rate
train_on_all: False # Only compute loss on masked/noised positions
loss: # type of loss and parameters
type: BinCE
n_bins: 29
# Pre-training args:
nce_loss: False # Use contrastive learning or not
nce_dim: 64 # contrastive embedding dim
nce_temp: 1 # Temp in contrastive loss func
# Fine-tuning args:
celltype_clf_loss: False # Use celltype-clf, can be used during both pre-training and fine-tuning
modality_prediction_loss: False # Fine-tune for citeseq task
cls_finetune_dim: 164 # final linear layer projecting the CLS embedding. Should be 164 for scTab Celltype ID, and 134 for NeurIPS citeseq
perturb_mode: False # Fine-tune for perturbation task
Using this base design, one can train the “base” scLM configuration in our study. Many more examples are available in our reproducibility GitHub repository.
Evaluating performance on a downstream task
Once a model is pre-trained, its performance on downstream tasks can be evaluated.
Currently, bento-sc defines six downstream tasks.
Each one is implemented through a CLI script:
Batch Correction:
bentosc_task_batchcorrCelltype Identification:
bentosc_task_celltypeidGRN Inference:
bentosc_task_grninferPost Perturbation Expression Prediction:
bentosc_task_perturbProtein concentration Prediction:
bentosc_task_protconcGene Expression upscaling:
bentosc_task_upscale
The command line inputs for these scripts can be inspected via their -h flag, e.g.:
usage: bentosc_task_celltypeid [-h] [--data_path str] [--lr float] [--batch_size int] [--n_workers int] [--prefetch_factor int] [--tune_mode boolean] config_path checkpoint logs_path
Fine-tuning script for cell-type identification evaluation.
positional arguments:
config_path config_path
checkpoint checkpoint path
logs_path logs_path
options:
-h, --help show this help message and exit
--data_path str Data file. Overrides value in config file if specified (default: None)
--lr float Learning rate. Overrides value in config file if specified (default: None)
--batch_size int Batch size. Overrides value in config file if specified (default: None)
--n_workers int Num workers. Overrides value in config file if specified (default: None)
--prefetch_factor int
Prefetch Factor of dataloader. Overrides value in config file if specified (default: None)
--tune_mode boolean Don't pre-train whole model but run small experiment. (default: False)
Note: many tasks require different config file values from their pre-training one. For examples, refer to our reproducibility GitHub repository.