Changelog

[0.3.0] - 2025-01-30

Major Features

Native DDP support: Distributed Data Parallel (DDP) training is supported natively, with fetches distributed round-robin across ranks. torch.distributed settings are detected automatically.
- All sampling strategies (Streaming, BlockShuffling, BlockWeightedSampling, ClassBalancedSampling) work seamlessly with DDP
- No DistributedSampler needed - partitioning is handled internally
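
The round-robin idea can be illustrated with a minimal sketch. This is not scDataset's internal code: it assumes each fetch is identified by an integer position and that rank `r` out of `world_size` ranks takes every `world_size`-th fetch starting at position `r`. The helper name `fetches_for_rank` is hypothetical.

```python
def fetches_for_rank(num_fetches: int, rank: int, world_size: int) -> list[int]:
    """Return the fetch positions one rank would process under round-robin assignment."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Rank r takes fetches r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, num_fetches, world_size))


# Example: 10 fetches spread over 4 ranks (illustrative numbers only).
assignment = {rank: fetches_for_rank(10, rank, 4) for rank in range(4)}
```

In a real DDP job the rank and world size would typically come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`. Under this scheme every fetch is processed by exactly one rank, which is why no separate sampler is needed to partition the data.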
Built-in transform functions (transforms.py):
- adata_to_mindex() - Transform AnnData/AnnCollection to MultiIndexable
- hf_tahoe_to_tensor() - Convert HuggingFace sparse data to dense tensors
- bionemo_to_tensor() - Convert BioNemo data format to tensors
Auto-configuration module (scdataset.experimental.auto_config):
- suggest_parameters() - Automatically suggest optimal num_workers, fetch_factor, and block_size based on data and system resources
Training experiments module (training_experiments/): Comprehensive framework for benchmarking data loading strategies on the Tahoe-100M dataset

Bug Fixes

Fixed unsorted indices issue: Sampling strategies now sort indices automatically to ensure optimal I/O performance. When unsorted indices are provided, a warning is issued before they are sorted. This addresses disk access pattern issues that could occur when users passed indices in arbitrary order. (Thanks to @deto for reporting this issue)
Fixed ClassBalancedSampling with indices: Class-balanced sampling now correctly handles subset indices by computing weights only for the specified subset. Previously, when indices was provided, the number of weights could differ from the subset size.
Fixed BlockWeightedSampling weights handling with indices: When both weights and indices are provided, weights are now properly aligned with the subset. Two forms are supported: full-length weights (matching data_collection), which are subsetted automatically, and pre-subsetted weights (matching the length of indices).
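
The behavior described in the three fixes above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper names (`ensure_sorted`, `balanced_weights`, `align_weights`) are hypothetical, and inverse class frequency is assumed as the balancing scheme.

```python
import warnings
from collections import Counter


def ensure_sorted(indices):
    """Return indices in ascending order, warning if the input was unsorted."""
    indices = list(indices)
    if any(a > b for a, b in zip(indices, indices[1:])):
        warnings.warn("indices are unsorted; sorting them", UserWarning)
        indices = sorted(indices)
    return indices


def balanced_weights(labels, indices):
    """Inverse-frequency weights computed only over the selected subset."""
    subset_labels = [labels[i] for i in indices]
    counts = Counter(subset_labels)
    return [1.0 / counts[label] for label in subset_labels]


def align_weights(weights, indices, n_total):
    """Return exactly one weight per entry of `indices`."""
    if len(weights) == n_total:
        # Full-length weights: pick out the entries for the subset.
        return [weights[i] for i in indices]
    if len(weights) == len(indices):
        # Already pre-subsetted: use as given.
        return list(weights)
    raise ValueError("weights must match the full data length or the subset length")


labels = ["a", "a", "a", "b", "b", "c"]
subset = ensure_sorted([3, 0, 1])  # emits a warning, then sorts
class_weights = balanced_weights(labels, subset)
full = align_weights([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], subset, n_total=len(labels))
pre = align_weights([0.7, 0.2, 0.1], subset, n_total=len(labels))
```

In each case the result has one weight per selected sample, which is the invariant the fixes restore.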

Added

Unstructured data support in MultiIndexable:
- New unstructured parameter to store non-indexable metadata
- Useful for storing gene names, dataset info, or other metadata
- Unstructured data is preserved through subsetting operations
- New unstructured_keys property to list available keys
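
To make the idea concrete, here is a toy stand-in for a container with synchronized indexing and pass-through metadata. It is a sketch only: `ToyMultiIndexable` is a hypothetical class, and its constructor signature is an assumption rather than the real MultiIndexable API. Only the `unstructured` parameter and `unstructured_keys` property names are taken from the notes above.

```python
class ToyMultiIndexable:
    """Toy container: indexes several sequences together and carries metadata along."""

    def __init__(self, unstructured=None, **indexables):
        lengths = {len(values) for values in indexables.values()}
        if len(lengths) > 1:
            raise ValueError("all indexables must have the same length")
        self._data = indexables
        # Metadata that is stored as-is and never indexed by row.
        self.unstructured = dict(unstructured or {})

    @property
    def unstructured_keys(self):
        return list(self.unstructured)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Named access to one of the contained indexables.
            return self._data[key]
        # A list of row positions selects the same rows from every indexable.
        subset = {name: [values[i] for i in key] for name, values in self._data.items()}
        return ToyMultiIndexable(unstructured=self.unstructured, **subset)


batch = ToyMultiIndexable(
    X=[[1, 0], [0, 2], [3, 3]],
    y=[0, 1, 0],
    unstructured={"gene_names": ["GeneA", "GeneB"]},
)
sub = batch[[0, 2]]  # rows 0 and 2 of X and y; metadata carried over unchanged
```

The point of the design is that per-row data is subsetted while per-dataset metadata such as gene names stays attached to the result.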
Jupyter notebook integration in documentation:
- Added nbsphinx extension for including Jupyter notebooks in docs
- Tutorial notebook (tahoe_tutorial.ipynb) now available in docs
- Notebooks are rendered without execution for faster builds
Comprehensive test suite:
- Tests for all strategies, MultiIndexable, scDataset, and auto_config
- Tests for dict-like interface (items, keys, values) in MultiIndexable
- Tests for error handling and edge cases
- Tests for doc code snippets from quickstart guide
Documentation improvements:
- New transforms guide (transforms.rst) documenting fetch_callback, fetch_transform, batch_callback, and batch_transform
- Comprehensive AnnCollection example in examples
- Documentation badge added to README and docs
- Updated benchmarks README with utility documentation

Dependencies

Added optional [auto] extras for auto-configuration: pip install scDataset[auto]
Added optional [docs] extras for documentation building: pip install scDataset[docs]
Added [dev] extras for development: pip install scDataset[dev]

[0.2.0] - 2025-08-28

Breaking Changes

Completely redesigned API: scDataset now uses a strategy-based sampling approach instead of modes
Constructor changes: scDataset(data_collection, strategy, batch_size, ...) replaces the old scDataset(data_collection, batch_size, ...)
New required parameter: strategy - must provide a SamplingStrategy instance
block_size parameter moved to strategies
Removed methods: subset(), set_mode()

Added

Strategy-based sampling system:
- SamplingStrategy - Abstract base class for all sampling strategies
- Streaming - Sequential sampling with optional buffer-level shuffling
- BlockShuffling - Block-based shuffling for data locality while maintaining randomization
- BlockWeightedSampling - Weighted sampling with configurable block sizes and replacement options
- ClassBalancedSampling - Automatic class balancing for imbalanced datasets
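
As a rough illustration of block-based shuffling, the sketch below splits sorted indices into contiguous blocks and shuffles the order of the blocks, so reads within a block stay local while the overall order is randomized. It shows the general technique only and makes no claim about how BlockShuffling is implemented; the function name `block_shuffle` and its `seed` parameter are hypothetical.

```python
import random


def block_shuffle(indices, block_size, seed=None):
    """Shuffle the order of contiguous blocks while keeping each block intact."""
    rng = random.Random(seed)
    indices = sorted(indices)
    # Cut the sorted indices into consecutive blocks of at most block_size items.
    blocks = [indices[i:i + block_size] for i in range(0, len(indices), block_size)]
    rng.shuffle(blocks)
    # Flatten the shuffled blocks back into a single index order.
    return [i for block in blocks for i in block]


order = block_shuffle(range(10), block_size=4, seed=0)
```

A larger `block_size` favors sequential access, and a smaller one gives a more thorough shuffle. With `block_size=1` this reduces to an ordinary full shuffle.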
MultiIndexable class - Container for multi-modal data with synchronized indexing:
- Supports multiple indexable objects (arrays, lists, etc.) that are indexed together
- Named and positional access to contained indexables
- Useful for gene expression + protein data, features + labels, etc.

Migration Guide

Old v0.1.x syntax:

# v0.1.x - No longer supported
dataset = scDataset(data, batch_size=64, block_size=8, fetch_factor=4)
dataset.subset(train_indices)
dataset.set_mode('train')

New v0.2.0 syntax:

# v0.2.0 - Strategy-based approach
from scdataset import scDataset, BlockShuffling

strategy = BlockShuffling(block_size=8, indices=train_indices)
dataset = scDataset(data, strategy, batch_size=64, fetch_factor=4)
