Changelog
[0.3.0] - 2025-01-30
Major Features
Native DDP support: Full Distributed Data Parallel support with round-robin fetch distribution across ranks.
Auto-detects torch.distributed settings
All sampling strategies (Streaming, BlockShuffling, BlockWeightedSampling, ClassBalancedSampling) work with DDP
No DistributedSampler needed - partitioning is handled internally
Built-in transform functions (transforms.py):
adata_to_mindex() - Transform AnnData/AnnCollection to MultiIndexable
hf_tahoe_to_tensor() - Convert HuggingFace sparse data to dense tensors
bionemo_to_tensor() - Convert BioNemo data format to tensors
Auto-configuration module (scdataset.experimental.auto_config):
suggest_parameters() - Automatically suggest optimal num_workers, fetch_factor, and block_size based on data and system resources
Training experiments module (training_experiments/): Comprehensive framework for benchmarking data loading strategies on the Tahoe-100M dataset
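To illustrate what round-robin fetch distribution across ranks can look like, here is a minimal sketch. It is not scDataset's internal code: the helper name fetches_for_rank is hypothetical, and in a real DDP job the rank and world size would typically come from torch.distributed.get_rank() and torch.distributed.get_world_size().

```python
def fetches_for_rank(num_fetches: int, rank: int, world_size: int) -> list:
    """Return the fetch ids a given rank would process under round-robin.

    Hypothetical helper for illustration only; not part of scDataset.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Rank r takes fetches r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, num_fetches, world_size))


# Example: 10 fetches spread over 4 ranks.
assignment = {rank: fetches_for_rank(10, rank, 4) for rank in range(4)}
for rank, fetch_ids in assignment.items():
    print(f"rank {rank}: {fetch_ids}")
```

Each fetch lands on exactly one rank, which is why no separate DistributedSampler is required in a scheme like this.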
Bug Fixes
Fixed unsorted indices issue: Sampling strategies now sort indices automatically to ensure efficient I/O. When unsorted indices are provided, a warning is issued before they are sorted. This addresses disk access pattern issues that could occur when users passed indices in arbitrary order. (Thanks to @deto for reporting this issue)
Fixed ClassBalancedSampling with indices: Class-balanced sampling now correctly handles subset indices by computing weights only for the specified subset. Previously, when indices was provided, weights could mismatch the subset size.
Fixed BlockWeightedSampling weights handling with indices: When both weights and indices are provided, weights are now properly aligned with the subset. Supports both full weights (matching data_collection) that get subsetted, and pre-subsetted weights (matching indices length).
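The behaviors described in these fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: sort_indices and align_weights are hypothetical helper names, and the rule for telling full weights from pre-subsetted weights (comparing lengths) is an assumption based on the description above.

```python
import warnings

import numpy as np


def sort_indices(indices):
    """Return indices in ascending order, warning if they arrived unsorted."""
    indices = np.asarray(indices)
    if indices.size > 1 and np.any(np.diff(indices) < 0):
        warnings.warn("Indices are not sorted; sorting them for better I/O performance.")
        indices = np.sort(indices)
    return indices


def align_weights(weights, indices, n_total):
    """Return one weight per entry of `indices`.

    Full-length weights (len == n_total) are subsetted with `indices`;
    pre-subsetted weights (len == len(indices)) are used as given.
    """
    weights = np.asarray(weights, dtype=float)
    indices = np.asarray(indices)
    if len(weights) == n_total:
        # Assumption: a length match with the full collection means full weights.
        return weights[indices]
    if len(weights) == len(indices):
        return weights
    raise ValueError("weights must match either the full data or the indices")


idx = sort_indices([5, 2, 9])                           # warns, then sorts
from_full = align_weights(np.arange(10.0), idx, n_total=10)
from_subset = align_weights([0.1, 0.2, 0.7], idx, n_total=10)
```

Note that pre-subsetted weights are assumed here to already follow the order of the indices they are paired with.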
Added
Unstructured data support in MultiIndexable:
New unstructured parameter to store non-indexable metadata
Useful for storing gene names, dataset info, or other metadata
Unstructured data is preserved through subsetting operations
New unstructured_keys property to list available keys
Jupyter notebook integration in documentation:
Added nbsphinx extension for including Jupyter notebooks in docs
Tutorial notebook (tahoe_tutorial.ipynb) now available in docs
Notebooks are rendered without execution for faster builds
Comprehensive test suite:
Tests for all strategies, MultiIndexable, scDataset, and auto_config
Tests for dict-like interface (items, keys, values) in MultiIndexable
Tests for error handling and edge cases
Tests for doc code snippets from quickstart guide
Documentation improvements:
New transforms guide (transforms.rst) documenting fetch_callback, fetch_transform, batch_callback, and batch_transform
Comprehensive AnnCollection example in examples
Documentation badge added to README and docs
Updated benchmarks README with utility documentation
Dependencies
Added optional [auto] extras for auto-configuration: pip install scDataset[auto]
Added optional [docs] extras for documentation building: pip install scDataset[docs]
Added [dev] extras for development: pip install scDataset[dev]
[0.2.0] - 2025-08-28
Breaking Changes
Completely redesigned API: scDataset now uses a strategy-based sampling approach instead of modes
Constructor changes: scDataset(data_collection, strategy, batch_size, ...) replaces old scDataset(data_collection, batch_size, ...)
New required parameter: strategy - must provide a SamplingStrategy instance
block_size parameter moved to strategies
Removed methods: subset(), set_mode()
Added
Strategy-based sampling system:
SamplingStrategy - Abstract base class for all sampling strategies
Streaming - Sequential sampling with optional buffer-level shuffling
BlockShuffling - Block-based shuffling for data locality while maintaining randomization
BlockWeightedSampling - Weighted sampling with configurable block sizes and replacement options
ClassBalancedSampling - Automatic class balancing for imbalanced datasets
MultiIndexable class - Container for multi-modal data with synchronized indexing:
Supports multiple indexable objects (arrays, lists, etc.) that are indexed together
Named and positional access to contained indexables
Useful for gene expression + protein data, features + labels, etc.
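The idea of synchronized indexing can be shown with a small stand-in class. This sketch is not the library's MultiIndexable: SyncedContainer is a hypothetical name, it assumes NumPy arrays as members, and it covers only named access plus row indexing with a slice or index array.

```python
import numpy as np


class SyncedContainer:
    """Hypothetical stand-in that indexes several arrays together."""

    def __init__(self, **indexables):
        # All members must have the same length so one index fits every member.
        lengths = {len(v) for v in indexables.values()}
        if len(lengths) > 1:
            raise ValueError("all indexables must have the same length")
        self._data = dict(indexables)

    def __len__(self):
        return len(next(iter(self._data.values()))) if self._data else 0

    def __getitem__(self, key):
        if isinstance(key, str):
            # Named access returns the underlying indexable.
            return self._data[key]
        # A slice or index array is applied to every member at once.
        return SyncedContainer(**{name: v[key] for name, v in self._data.items()})


expression = np.arange(12).reshape(6, 2)        # 6 cells x 2 features
labels = np.array(["a", "b", "c", "a", "b", "c"])
data = SyncedContainer(X=expression, y=labels)

batch = data[np.array([0, 2, 5])]               # same rows from X and y
print(batch["X"])
print(batch["y"])
```

Because both members are indexed with the same rows, features and labels cannot drift out of alignment when a batch is drawn.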
Migration Guide
Old v0.1.x syntax:
# v0.1.x - No longer supported
dataset = scDataset(data, batch_size=64, block_size=8, fetch_factor=4)
dataset.subset(train_indices)
dataset.set_mode('train')
New v0.2.0 syntax:
# v0.2.0 - Strategy-based approach
from scdataset import scDataset, BlockShuffling
strategy = BlockShuffling(block_size=8, indices=train_indices)
dataset = scDataset(data, strategy, batch_size=64, fetch_factor=4)
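For intuition about what block_size controls in the strategy above, the following sketch shows one plausible reading of block-based shuffling: indices are grouped into contiguous blocks, and the blocks (not the individual indices) are permuted. This is an assumption for illustration, not scDataset's implementation; block_shuffle and its seed argument are hypothetical.

```python
import numpy as np


def block_shuffle(indices, block_size, seed=None):
    """Shuffle the order of contiguous blocks while keeping each block intact.

    Hypothetical illustration of block-based shuffling; not scDataset's code.
    """
    rng = np.random.default_rng(seed)
    indices = np.sort(np.asarray(indices))
    # Split into consecutive blocks of `block_size` (the last may be shorter).
    blocks = [indices[i:i + block_size] for i in range(0, len(indices), block_size)]
    if not blocks:
        return indices
    # Permute the blocks; indices inside a block stay adjacent for data locality.
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])


shuffled = block_shuffle(np.arange(20), block_size=8, seed=0)
print(shuffled)
```

Under this reading, a larger block_size favors contiguous reads, while a smaller one yields a more thoroughly mixed ordering.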