scDataset Documentation#
Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
scDataset is a flexible and efficient PyTorch IterableDataset for large-scale single-cell omics data.
It supports a variety of data formats (e.g., AnnData, HuggingFace Datasets, NumPy arrays) and is designed for
high-throughput deep learning workflows. While optimized for single-cell data, it is general-purpose and can be
used with any dataset.
Key Features#
- ✨ Flexible Data Source Support: Integrates seamlessly with AnnData, HuggingFace Datasets, NumPy arrays, PyTorch Datasets, and more.
- 🚀 Scalable: Handles datasets with billions of samples without loading everything into memory.
- ⚡ Efficient Data Loading: Block sampling and batched fetching optimize random access for large datasets (sketched after the Quick Start below).
- 🔄 Dynamic Splitting: Split datasets into train/validation/test dynamically, without duplicating data or rewriting files (also sketched after the Quick Start).
- 🎯 Custom Hooks: Apply transformations at fetch or batch time via user-defined callbacks, as sketched below.
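As a minimal sketch of the custom hooks feature: the snippet below wires a fetch-time callback and a batch-time callback into the dataset. The keyword names fetch_transform and batch_transform are illustrative assumptions, not confirmed API; see the API Reference for the exact parameter names and signatures.
import numpy as np
from scdataset import scDataset, Streaming
def to_float32(block):
    # Fetch-time hook: cast each fetched block before it is batched
    return block.astype(np.float32)
def add_noise(batch):
    # Batch-time hook: light Gaussian augmentation applied per mini-batch
    return batch + 0.01 * np.random.randn(*batch.shape).astype(batch.dtype)
data = np.random.randn(10000, 2000)
# NOTE: the keyword names below are assumptions for illustration only;
# consult the API Reference for the actual hook parameters
dataset = scDataset(data, Streaming(), batch_size=64,
                    fetch_transform=to_float32,
                    batch_transform=add_noise)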
Quick Start#
Install from PyPI:
pip install scDataset
Basic usage:
from scdataset import scDataset, Streaming
from torch.utils.data import DataLoader
import numpy as np
# Create sample data
data = np.random.randn(10000, 2000) # 10k cells, 2k genes
# Create dataset with streaming strategy
dataset = scDataset(data, Streaming(), batch_size=64)
# Use with PyTorch DataLoader (note: batch_size=None)
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for batch in loader:
print(f"Batch shape: {batch.shape}")
break
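Building on the Quick Start, the sketch below shows how block-based shuffling and dynamic train/validation splitting might look together. Both the BlockShuffling strategy (including its block_size argument) and the indices keyword are assumptions here; the Quick Start only demonstrates Streaming, so consult the API Reference for the actual names.
import numpy as np
from scdataset import scDataset, Streaming
from torch.utils.data import DataLoader
data = np.random.randn(10000, 2000)
# Dynamic split: permute indices once, then slice; the underlying data is
# neither duplicated nor rewritten
rng = np.random.default_rng(0)
perm = rng.permutation(len(data))
train_idx, val_idx = perm[:8000], perm[8000:]
# Hypothetical: a block-shuffling strategy plus an `indices` argument that
# restricts the dataset to a subset (names assumed for illustration)
from scdataset import BlockShuffling  # assumed export
train_ds = scDataset(data, BlockShuffling(block_size=256), batch_size=64,
                     indices=train_idx)
val_ds = scDataset(data, Streaming(), batch_size=64, indices=val_idx)
train_loader = DataLoader(train_ds, batch_size=None, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=None)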
Contents:
- Installation
- Quick Start Guide
- Examples
- Data Transforms and Callbacks
- Distributed & Parallel Training
  - How DDP Works
  - Basic DDP Setup
  - Manual Rank Configuration
  - Data Partitioning
  - Automatic Epoch Handling
  - Launching Distributed Training
  - Complete Training Example
  - Weighted Sampling with DDP
  - DDP with Any Strategy
  - Best Practices
  - DataLoader Multiprocessing (num_workers)
  - DataParallel (DP) Support
  - Further Reading
- API Reference