# Getting started

## Install

`bento-sc` is distributed on PyPI.

```bash
pip install bento-sc
```

Note: The package has been tested with `torch==2.2.2` and `pytorch-lightning==2.2.5`. If you encounter errors with `bento-sc` using more recent versions of these two packages, consider downgrading.

You may need to [install PyTorch](https://pytorch.org/get-started/locally/) before running this command to ensure that the right CUDA kernels for your system are installed.

## Download and prepare data

To get up and running, you will need to download the relevant data and process it into [`h5torch`-compatible HDF5 files](https://github.com/gdewael/h5torch). All data downloading and processing routines are available through a single CLI script, `bentosc_data`:

```bash
usage: bentosc_data [-h] datafile

Data downloading script launching pad. Choose a datafile to download and process to h5torch format.

positional arguments:
  datafile    Datafile to download, choices: {scTab, scTab_upscaling, scTab_grn, neurips_citeseq, replogle_perturb, batchcorr_embryolimb, batchcorr_greatapes, batchcorr_circimm}

options:
  -h, --help  show this help message and exit
```

Each subcommand also accepts `-h`, e.g. `bentosc_data scTab -h`.

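
Because the full set of datasets needs a lot of disk space (this page recommends at least 400 GB), a quick pre-flight check before starting the downloads can save a failed run. The following is a minimal sketch using only the Python standard library. The helper names `free_space_gb` and `has_enough_space` are hypothetical and not part of `bento-sc`, and 1 GB is assumed to mean 10**9 bytes.

```python
import shutil

# Storage figure taken from the note on this page; adjust if you only
# download a subset of the datasets.
REQUIRED_GB = 400


def free_space_gb(path="."):
    """Return the free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb


# Hypothetical usage: check the directory you intend to download into.
print(f"Free space: {free_space_gb('.'):.1f} GB, enough: {has_enough_space('.')}")
```

Point `path` at the directory that will hold both the temporary downloads and the processed `.h5t` files, since both count towards the total.
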
To download and preprocess the pre-training data:

```bash
bentosc_data scTab ./data_tmp/ ./scTab.h5t
```

To perform downstream task evaluations, additionally process any or all of the task-specific datasets:

```bash
bentosc_data scTab_upscaling ./scTab.h5t ./scTab_upsc_val.h5t val
bentosc_data scTab_upscaling ./scTab.h5t ./scTab_upsc_test.h5t test
bentosc_data scTab_grn ./scTab.h5t ./scTab_grn_val.h5t ./ext_pertdata.h5ad ./scenicdb.feather val
bentosc_data scTab_grn ./scTab.h5t ./scTab_grn_test.h5t ./ext_pertdata.h5ad ./scenicdb.feather test
bentosc_data neurips_citeseq ./data_tmp/ ./scTab.h5t ./citeseq.h5t
bentosc_data replogle_perturb ./data_tmp/ ./scTab.h5t ./perturb.h5t
bentosc_data batchcorr_embryolimb ./data_tmp/ ./scTab.h5t ./batchcorr_el.h5t
bentosc_data batchcorr_greatapes ./data_tmp/ ./scTab.h5t ./batchcorr_ga.h5t
bentosc_data batchcorr_circimm ./data_tmp/ ./scTab.h5t ./batchcorr_ci.h5t
```

**Note:** Downloading and processing all of this data takes considerable time and storage space. Allow for at least 400 GB of storage.

## Pre-training a model

Pre-training can be performed through the CLI script `bentosc_pretrain`:

```bash
usage: bentosc_pretrain [-h] [--data_path str] [--lr float] [--ckpt_path str] [--tune_mode boolean] config_path logs_path

Pre-training script.

positional arguments:
  config_path          .yaml config file controlling most of the pre-training parameters.
  logs_path            Where to save tensorboard logs and model weight checkpoints.

options:
  -h, --help           show this help message and exit
  --data_path str      Data file. Overrides value in config file if specified (default: None)
  --lr float           Learning rate. Overrides value in config file if specified (default: None)
  --ckpt_path str      Continue from checkpoint (default: None)
  --tune_mode boolean  Don't pre-train whole model but run small experiment. (default: False)
```

The first input to `bentosc_pretrain`, a `config.yaml` file, controls most of the pre-training logic.
A minimal example of a working config file is:

```yaml
# DataModule args:
batch_size: 192 # per-gpu batch size
devices: [0, 1] # devices to use. here, the first and second GPU are used (total batch size: 384)
n_workers: 12 # number of CPUs to use in dataloading
in_memory: False # don't load scTab.h5t into memory fully, but read from disk during training.
val_sub: True # use the subsetted validation set of scTab.

# Data processing args:
return_zeros: False # if False, load in the cell profiles without zeros
allow_padding: False # if False, cut to min size in batch
input_processing: # cell-wise preprocessing functions
  - type: FilterTopGenes # get the top genes
    affected_keys: ["gene_counts", "gene_index", "gene_counts_true"]
    number: 1024
  - type: Bin # Bin input counts
    key: "gene_counts"
  - type: Mask # Mask 15% of input counts
    p: 0.15
    key: "gene_counts"
  - type: Bin # Bin output counts
    key: "gene_counts_true"

# Model args:
discrete_input: True # Set to True if using binned or rank input encodings
n_discrete_tokens: 29 # the number of bins
gate_input: False # use the gating mechanism on input embeddings
pseudoquant_input: False # use the pseudoquantization mechanism on input embeddings
dim: 512 # hidden dim of the transformer
depth: 10 # number of transformer encoder layers
dropout: 0.2 # dropout rate in attention matrix and FF
n_genes: 19331 # number of genes in dataset to initialize gene index vocabulary

# General learning args:
lr: 3e-4 # Pre-training learning rate
train_on_all: False # Only compute loss on masked/noised positions
loss: # type of loss and parameters
  type: BinCE
  n_bins: 29

# Pre-training args:
nce_loss: False # Use contrastive learning or not
nce_dim: 64 # contrastive embedding dim
nce_temp: 1 # Temp in contrastive loss func

# Fine-tuning args:
celltype_clf_loss: False # Use celltype-clf, can be used during both pre-training and fine-tuning
modality_prediction_loss: False # Fine-tune for citeseq task
cls_finetune_dim: 164 # final linear layer projecting the CLS embedding. Should be 164 for scTab Celltype ID, and 134 for NeurIPS citeseq
perturb_mode: False # Fine-tune for perturbation task
```

With this config, one can train the "base" scLM configuration from our study. Many more examples are available in our reproducibility GitHub repository.

## Evaluating performance on a downstream task

Once a model is pre-trained, its performance on downstream tasks can be evaluated. Currently, `bento-sc` defines six downstream tasks. Each one is implemented through a CLI script:

- Batch Correction: `bentosc_task_batchcorr`
- Celltype Identification: `bentosc_task_celltypeid`
- GRN Inference: `bentosc_task_grninfer`
- Post Perturbation Expression Prediction: `bentosc_task_perturb`
- Protein Concentration Prediction: `bentosc_task_protconc`
- Gene Expression Upscaling: `bentosc_task_upscale`

The command line inputs for these scripts can be inspected via their `-h` flag, e.g. for `bentosc_task_celltypeid`:

```bash
usage: bentosc_task_celltypeid [-h] [--data_path str] [--lr float] [--batch_size int] [--n_workers int] [--prefetch_factor int] [--tune_mode boolean] config_path checkpoint logs_path

Fine-tuning script for cell-type identification evaluation.

positional arguments:
  config_path            config_path
  checkpoint             checkpoint path
  logs_path              logs_path

options:
  -h, --help             show this help message and exit
  --data_path str        Data file. Overrides value in config file if specified (default: None)
  --lr float             Learning rate. Overrides value in config file if specified (default: None)
  --batch_size int       Batch size. Overrides value in config file if specified (default: None)
  --n_workers int        Num workers. Overrides value in config file if specified (default: None)
  --prefetch_factor int  Prefetch Factor of dataloader. Overrides value in config file if specified (default: None)
  --tune_mode boolean    Don't pre-train whole model but run small experiment. (default: False)
```

**Note:** Many tasks require config file values that differ from those used for pre-training.
For examples, refer to our reproducibility GitHub repository.
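
As a rough illustration of the note above, a task-specific config can be derived from the pre-training config by overriding only the keys that change. The sketch below works on plain Python dictionaries. The helper `derive_config` is hypothetical and not part of `bento-sc`, and the overrides shown are assumptions inferred from the comments in the example config on this page, not confirmed task settings. The reproducibility repository remains the authoritative source for the values each task needs.

```python
import copy

# Subset of the pre-training config from this page (values copied as-is).
pretrain_config = {
    "lr": 3e-4,
    "celltype_clf_loss": False,
    "modality_prediction_loss": False,
    "cls_finetune_dim": 164,
    "perturb_mode": False,
}


def derive_config(base, overrides):
    """Return a deep copy of `base` with `overrides` applied.

    Unknown keys raise a KeyError so that typos do not silently
    create new, ignored config entries.
    """
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    derived = copy.deepcopy(base)
    derived.update(overrides)
    return derived


# ASSUMPTION: illustrative overrides guessed from the config comments
# ("164 for scTab Celltype ID, and 134 for NeurIPS citeseq").
celltypeid_config = derive_config(
    pretrain_config, {"celltype_clf_loss": True, "cls_finetune_dim": 164}
)
citeseq_config = derive_config(
    pretrain_config, {"modality_prediction_loss": True, "cls_finetune_dim": 134}
)
```

To work with actual files, the dictionaries can be read and written with PyYAML's `yaml.safe_load` and `yaml.safe_dump`, and the resulting file passed as `config_path` to the relevant task script.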