Quick Start Guide#
This guide will help you get started with scDataset quickly.
Basic Concepts#
scDataset is built around two main concepts:
Data Collections: Any object that supports indexing (__getitem__) and length (__len__); see the sketch below
Sampling Strategies: Define how data is sampled and batched
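For example, a minimal custom collection only needs those two methods. The sketch below is illustrative (the class name MyCollection is made up); it assumes, as with NumPy arrays and AnnData objects, that indexing with an array of positions returns the corresponding rows.
class MyCollection:
    """Hypothetical wrapper: scDataset only requires __getitem__ and __len__."""

    def __init__(self, array):
        self.array = array

    def __getitem__(self, indices):
        # indices may be a single position or an array of positions
        return self.array[indices]

    def __len__(self):
        return len(self.array)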
Minimal Example#
The simplest way to use scDataset is as a drop-in replacement for your existing dataset:
from scdataset import scDataset, Streaming
from torch.utils.data import DataLoader
import numpy as np
# Your existing data (numpy array, AnnData, HuggingFace Dataset, etc.)
data = np.random.randn(1000, 100) # 1000 samples, 100 features
# Create scDataset with streaming strategy
dataset = scDataset(data, Streaming(), batch_size=64)
# Use with DataLoader (note: batch_size=None)
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for batch in loader:
    print(f"Batch shape: {batch.shape}")  # (64, 100)
    # Your training code here
    break
Note
Always set batch_size=None in the DataLoader when using scDataset,
as batching is handled internally by the dataset.
Sampling Strategies#
scDataset supports several sampling strategies:
Streaming (Sequential)#
from scdataset import Streaming
# Sequential access without shuffling
strategy = Streaming()
dataset = scDataset(data, strategy, batch_size=64)
# Sequential access with buffer-level shuffling (like Ray Dataset/WebDataset)
strategy = Streaming(shuffle=True)
dataset = scDataset(data, strategy, batch_size=64)
# This shuffles batches within each fetch buffer while maintaining
# sequential order between buffers
Block Shuffling#
from scdataset import BlockShuffling
# Shuffle in blocks for better I/O while maintaining some randomness
strategy = BlockShuffling(block_size=8)
dataset = scDataset(data, strategy, batch_size=64)
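The intuition behind block shuffling: indices are drawn in contiguous blocks and the order of the blocks is shuffled, so disk reads stay mostly sequential. The snippet below only illustrates the idea with plain NumPy; it is not the library's implementation.
import numpy as np

rng = np.random.default_rng(0)
blocks = np.arange(32).reshape(-1, 8)  # contiguous blocks of 8 indices
rng.shuffle(blocks)                    # shuffle the block order, not individual samples
print(blocks.ravel())                  # blocks stay contiguous, e.g. [24 25 ... 31 0 1 ... 7 ...]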
Weighted Sampling#
from scdataset import BlockWeightedSampling
# Sample with custom weights (e.g., higher weight for rare samples)
weights = np.random.rand(len(data)) # Custom weights per sample
strategy = BlockWeightedSampling(
    weights=weights,
    total_size=10000,  # Generate 10000 samples per epoch
    block_size=8,
)
dataset = scDataset(data, strategy, batch_size=64)
Class Balanced Sampling#
from scdataset import ClassBalancedSampling
# Automatically balance classes
labels = np.random.choice(['A', 'B', 'C'], size=len(data))
strategy = ClassBalancedSampling(labels, total_size=10000)
dataset = scDataset(data, strategy, batch_size=64)
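Conceptually, class-balanced sampling is weighted sampling with weights inversely proportional to class frequency, so each class contributes roughly equally per epoch. A rough sketch of that idea using BlockWeightedSampling (not necessarily how the library implements it):
import numpy as np
from scdataset import BlockWeightedSampling

classes, counts = np.unique(labels, return_counts=True)
class_weight = dict(zip(classes, 1.0 / counts))  # rare classes get larger weights
weights = np.array([class_weight[label] for label in labels])
strategy = BlockWeightedSampling(weights=weights, total_size=10000, block_size=8)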
Working with Different Data Formats#
NumPy Arrays#
import numpy as np
data = np.random.randn(5000, 2000)
dataset = scDataset(data, Streaming(), batch_size=64)
AnnData Objects#
import anndata as ad
import scanpy as sc
# Load your single-cell data
adata = sc.datasets.pbmc3k()
# Use the expression matrix
dataset = scDataset(adata.X, Streaming(), batch_size=64)
# Or create a custom fetch callback for more complex data
def fetch_adata(collection, indices):
    return collection[indices].X.toarray()
dataset = scDataset(adata, Streaming(), batch_size=64, fetch_callback=fetch_adata)
HuggingFace Datasets#
from datasets import load_dataset
dataset_hf = load_dataset("your/dataset", split="train")
dataset = scDataset(dataset_hf, Streaming(), batch_size=64)
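Hugging Face datasets return a dictionary of columns when indexed, so you may want a fetch_callback that extracts just the field your model consumes. The sketch below assumes a hypothetical numeric column named "expression":
import numpy as np

def fetch_expression(collection, indices):
    # Indexing with a list of positions returns a dict of columns
    rows = collection[list(indices)]
    return np.asarray(rows["expression"], dtype=np.float32)

dataset = scDataset(dataset_hf, Streaming(), batch_size=64, fetch_callback=fetch_expression)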
Performance Optimization#
For large datasets, you can optimize performance using these parameters:
dataset = scDataset(
    data,
    BlockShuffling(block_size=4),
    batch_size=64,
    fetch_factor=16,  # Fetch 16 batches at once
)
loader = DataLoader(
    dataset,
    batch_size=None,     # Batching is handled by scDataset
    num_workers=12,      # Multiple workers for parallel loading
    prefetch_factor=17,  # fetch_factor + 1
)
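When tuning block_size, fetch_factor, and num_workers, a quick throughput check makes it easy to compare settings:
import time

start = time.perf_counter()
n_samples = 0
for batch in loader:
    n_samples += len(batch)
print(f"{n_samples / (time.perf_counter() - start):.0f} samples/s")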
Data Transforms#
You can apply transforms at different stages:
def normalize_batch(batch):
    # Apply per-batch normalization
    return (batch - batch.mean()) / batch.std()

def preprocess_fetch(data):
    # Apply to fetched data before batching
    return data.astype(np.float32)
dataset = scDataset(
    data,
    Streaming(),
    batch_size=64,
    fetch_transform=preprocess_fetch,
    batch_transform=normalize_batch,
)
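A common batch_transform pattern (shown here only as a sketch) is to hand the training loop ready-to-use torch tensors:
import torch

def to_tensor_batch(batch):
    # Normalize, then convert the final batch to a float32 torch tensor
    return torch.as_tensor(normalize_batch(batch), dtype=torch.float32)

dataset = scDataset(
    data,
    Streaming(),
    batch_size=64,
    fetch_transform=preprocess_fetch,
    batch_transform=to_tensor_batch,
)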
Next Steps#
See Examples for more detailed use cases
Check the API Reference for complete documentation of all classes and parameters
Read about advanced features in the full examples