API Reference

This section provides detailed documentation for all classes and functions in scDataset.

Main Dataset Class

scDataset

Iterable PyTorch Dataset for on-disk data collections with flexible sampling strategies.

Multi-Modal Data Support

MultiIndexable

Container for multiple indexable objects that should be indexed together.
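
To make "indexed together" concrete, here is a minimal sketch of the idea in plain Python. It is not the library's MultiIndexable: the class name PairedIndexable, its keyword-argument constructor, and the columns() helper are hypothetical and exist only for this illustration.

```python
class PairedIndexable:
    """Hypothetical stand-in: several equal-length sequences indexed as one."""

    def __init__(self, **named):
        lengths = {len(seq) for seq in named.values()}
        if len(lengths) > 1:
            raise ValueError("all sequences must have the same length")
        self._named = named
        self._length = lengths.pop() if lengths else 0

    def __len__(self):
        return self._length

    def columns(self):
        # Expose the underlying sequences by name.
        return dict(self._named)

    def __getitem__(self, index):
        # An int selects one aligned row across every sequence.
        if isinstance(index, int):
            return {name: seq[index] for name, seq in self._named.items()}
        # A slice or a list of ints selects the same positions in every sequence.
        if isinstance(index, slice):
            return PairedIndexable(**{n: s[index] for n, s in self._named.items()})
        return PairedIndexable(
            **{n: [s[i] for i in index] for n, s in self._named.items()}
        )


pairs = PairedIndexable(
    expression=[[1, 0], [0, 2], [3, 3]],
    label=["a", "b", "c"],
)
subset = pairs[[2, 0]]  # rows 2 and 0, in that order, from both sequences
```

The point of the pattern is that one index operation keeps modalities (for example expression values and labels) aligned, so a sampler only has to produce a single list of indices.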

Transform Functions

adata_to_mindex

Transform an AnnData/AnnCollection batch into a MultiIndexable, optionally including obs columns.

hf_tahoe_to_tensor

Transform Hugging Face Tahoe-100M sparse gene expression data into dense tensors.
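
The sparse-to-dense step can be illustrated without any dependencies. The sketch below is not hf_tahoe_to_tensor itself: it assumes each record stores parallel lists of gene indices and expression values under the hypothetical keys "genes" and "expressions", and it returns nested Python lists rather than tensors.

```python
def densify(gene_ids, values, n_genes):
    """Scatter sparse (index, value) pairs into a dense row of length n_genes."""
    row = [0.0] * n_genes
    for gene, value in zip(gene_ids, values):
        row[gene] = float(value)
    return row


def densify_batch(records, n_genes):
    # "genes" and "expressions" are assumed field names, used here for illustration.
    return [densify(r["genes"], r["expressions"], n_genes) for r in records]


batch = [
    {"genes": [0, 3], "expressions": [1.5, 2]},
    {"genes": [4], "expressions": [7]},
]
dense = densify_batch(batch, n_genes=5)
```

A real transform would typically write into a preallocated tensor instead of building lists, but the scatter logic is the same.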

bionemo_to_tensor

Fetch callback for BioNeMo SingleCellMemMapDataset.

Sampling Strategies

SamplingStrategy

Abstract base class for sampling strategies.

Streaming

Sequential streaming sampling strategy with optional buffer-level shuffling.
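
As a rough illustration of buffer-level shuffling, the sketch below reads indices sequentially in fixed-size buffers and optionally shuffles within each buffer. The function name stream_indices and its parameters are assumptions for this example and do not mirror the Streaming class signature.

```python
import random


def stream_indices(n, buffer_size, shuffle=False, seed=None):
    """Yield indices 0..n-1 buffer by buffer, shuffling only inside each buffer."""
    rng = random.Random(seed)
    for start in range(0, n, buffer_size):
        buffer = list(range(start, min(start + buffer_size, n)))
        if shuffle:
            rng.shuffle(buffer)  # order changes within the buffer only
        yield from buffer


sequential = list(stream_indices(7, buffer_size=3))
shuffled = list(stream_indices(7, buffer_size=3, shuffle=True, seed=1))
```

Because buffers are still visited in order, disk reads stay sequential; the shuffle only changes the order in which samples from the same buffer are emitted.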

BlockShuffling

Block-based shuffling sampling strategy.
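
One common reading of block-based shuffling is to split the index range into contiguous blocks and shuffle the order of the blocks while keeping each block intact. The sketch below follows that reading; the function name and parameters are hypothetical, and the library's BlockShuffling may differ in its details.

```python
import random


def block_shuffled_indices(n, block_size, seed=None):
    """Shuffle contiguous blocks of indices; indices inside a block stay in order."""
    rng = random.Random(seed)
    blocks = [
        list(range(start, min(start + block_size, n)))
        for start in range(0, n, block_size)
    ]
    rng.shuffle(blocks)  # randomize block order, not block contents
    return [index for block in blocks for index in block]


order = block_shuffled_indices(12, block_size=4, seed=0)
```

Larger blocks favor contiguous reads from disk, while smaller blocks bring the order closer to a full random permutation.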

BlockWeightedSampling

Weighted sampling with block-based shuffling.

ClassBalancedSampling

Class-balanced sampling with automatic weight computation.
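
A standard way to derive class-balanced weights is inverse class frequency, so that every class carries the same total sampling weight. The sketch below shows that scheme as an assumption; it is not confirmed to be the formula ClassBalancedSampling uses, and the function name is hypothetical.

```python
import random
from collections import Counter


def class_balanced_weights(labels):
    """Return one weight per sample, equal to 1 / (size of that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["a", "a", "a", "b"]
weights = class_balanced_weights(labels)

# Draw indices with replacement; both classes are now equally likely per draw.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=8)
```

With these weights the rare class "b" is drawn about as often as the three "a" samples combined.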

Experimental Features

Warning

Features in the experimental module may change significantly or be removed entirely in future releases.

suggest_parameters

Suggest optimal parameters for scDataset based on system resources.

estimate_sample_size

Estimate the memory size of a single sample from the data collection.