scDataset Documentation#
Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
scDataset is a flexible and efficient PyTorch IterableDataset for large-scale single-cell omics data.
It supports a variety of data formats (e.g., AnnData, HuggingFace Datasets, NumPy arrays) and is designed for
high-throughput deep learning workflows. While optimized for single-cell data, it is general-purpose and can be
used with any dataset.
Key Features#
- ✨ Flexible Data Source Support: Integrates seamlessly with AnnData, HuggingFace Datasets, NumPy arrays, PyTorch Datasets, and more.
- 🚀 Scalable: Handles datasets with billions of samples without loading everything into memory.
- ⚡ Efficient Data Loading: Block sampling and batched fetching optimize random access for large datasets (sketched after the Quick Start below).
- 🔄 Dynamic Splitting: Split datasets into train/validation/test dynamically, without duplicating data or rewriting files (also sketched after the Quick Start).
- 🎯 Custom Hooks: Apply transformations at fetch or batch time via user-defined callbacks, as sketched below.
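As a minimal sketch of the custom hooks feature: the snippet below wires a fetch-time callback and a batch-time callback into the dataset. The keyword names fetch_transform and batch_transform are illustrative assumptions, not confirmed API; see the API Reference for the exact parameter names and signatures.
import numpy as np
from scdataset import scDataset, Streaming
def to_float32(block):
    # Fetch-time hook: cast each fetched block before it is batched
    return block.astype(np.float32)
def add_noise(batch):
    # Batch-time hook: light Gaussian augmentation applied per mini-batch
    return batch + 0.01 * np.random.randn(*batch.shape).astype(batch.dtype)
data = np.random.randn(10000, 2000)
# NOTE: the keyword names below are assumptions for illustration only;
# consult the API Reference for the actual hook parameters
dataset = scDataset(data, Streaming(), batch_size=64,
                    fetch_transform=to_float32,
                    batch_transform=add_noise)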
Quick Start#
Install from PyPI:
pip install scDataset
Basic usage:
from scdataset import scDataset, Streaming
from torch.utils.data import DataLoader
import numpy as np
# Create sample data
data = np.random.randn(10000, 2000) # 10k cells, 2k genes
# Create dataset with streaming strategy
dataset = scDataset(data, Streaming(), batch_size=64)
# Use with PyTorch DataLoader (note: batch_size=None)
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for batch in loader:
print(f"Batch shape: {batch.shape}")
break
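Building on the Quick Start, the sketch below shows how block-based shuffling and dynamic train/validation splitting might look together. Both the BlockShuffling strategy (including its block_size argument) and the indices keyword are assumptions here; the Quick Start only demonstrates Streaming, so consult the API Reference for the actual names.
import numpy as np
from scdataset import scDataset, Streaming
from torch.utils.data import DataLoader
data = np.random.randn(10000, 2000)
# Dynamic split: permute indices once, then slice; the underlying data is
# neither duplicated nor rewritten
rng = np.random.default_rng(0)
perm = rng.permutation(len(data))
train_idx, val_idx = perm[:8000], perm[8000:]
# Hypothetical: a block-shuffling strategy plus an `indices` argument that
# restricts the dataset to a subset (names assumed for illustration)
from scdataset import BlockShuffling  # assumed export
train_ds = scDataset(data, BlockShuffling(block_size=256), batch_size=64,
                     indices=train_idx)
val_ds = scDataset(data, Streaming(), batch_size=64, indices=val_idx)
train_loader = DataLoader(train_ds, batch_size=None, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=None)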
Contents:
- Installation
- Quick Start Guide
- Examples
- Data Transforms and Callbacks
- Distributed & Parallel Training
  - How DDP Works
  - Basic DDP Setup
  - Manual Rank Configuration
  - Data Partitioning
  - Automatic Epoch Handling
  - Launching Distributed Training
  - Complete Training Example
  - Weighted Sampling with DDP
  - DDP with Any Strategy
  - Best Practices
  - DataLoader Multiprocessing (num_workers)
  - DataParallel (DP) Support
  - Further Reading
- API Reference