scdataset.strategy.BlockShuffling
- class scdataset.strategy.BlockShuffling(block_size: int = 8, indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None, drop_last: bool = False)
Bases: SamplingStrategy
Block-based shuffling sampling strategy.
This strategy divides the data into blocks of fixed size and shuffles the order of blocks while maintaining the original order within each block. This provides a balance between randomization and maintaining some locality of data access patterns.
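The idea described above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the helper name `block_shuffle` is hypothetical, and the standard-library `random.Random` stands in for whatever generator the library uses internally.

```python
import random


def block_shuffle(indices, block_size, seed=None):
    """Shuffle the order of fixed-size blocks, keeping order inside each block.

    Hypothetical helper for illustration only; not part of scdataset.
    """
    rng = random.Random(seed)
    indices = list(indices)
    # Split into consecutive blocks; the last block may be shorter.
    blocks = [indices[i:i + block_size] for i in range(0, len(indices), block_size)]
    # Only the order of the blocks changes, never the order within a block.
    rng.shuffle(blocks)
    # Flatten the shuffled blocks back into one index list.
    return [idx for block in blocks for idx in block]


shuffled = block_shuffle(range(12), block_size=4, seed=0)
print(shuffled)
```

Every run of four consecutive entries in `shuffled` is one of the original blocks (0-3, 4-7, or 8-11) in its original internal order, which is what preserves locality of access.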
- Parameters:
block_size (int, default=8) – Size of each block for shuffling. Larger blocks maintain more locality but provide less randomization.
indices (array-like, optional) – Subset of indices to use for sampling. If None, uses all indices from 0 to len(data_collection)-1.
drop_last (bool, default=False) – Whether to drop the last incomplete block if the total number of indices is not divisible by block_size.
- Variables:
_shuffle_before_yield (bool) – Always True for block shuffling strategy.
_indices (numpy.ndarray or None) – Stored subset of indices if provided.
block_size (int) – Size of blocks for shuffling.
drop_last (bool) – Whether to drop incomplete blocks.
Notes
When drop_last=False and there’s a remainder block smaller than block_size, it’s inserted at a random position among the shuffled complete blocks.
Examples
>>> # Basic block shuffling
>>> import numpy as np
>>> from scdataset.strategy import BlockShuffling
>>> strategy = BlockShuffling(block_size=3)
>>> np.random.seed(42)  # For reproducible example
>>> indices = strategy.get_indices(range(10), seed=42)
>>> len(indices)
10

>>> # Drop incomplete blocks
>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> indices = strategy.get_indices(range(10), seed=42)
>>> len(indices)  # 10 // 3 * 3 = 9
9
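The remainder handling described in the Notes can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `shuffle_with_remainder` is hypothetical, the remainder is assumed to be the trailing indices, and the standard-library `random.Random` is used in place of a NumPy generator.

```python
import random


def shuffle_with_remainder(indices, block_size, seed=None):
    """Shuffle complete blocks, then insert the short remainder block at a
    random position among them.

    Hypothetical helper; assumes the remainder is the trailing indices.
    """
    rng = random.Random(seed)
    indices = list(indices)
    cut = (len(indices) // block_size) * block_size
    complete = [indices[i:i + block_size] for i in range(0, cut, block_size)]
    remainder = indices[cut:]
    rng.shuffle(complete)
    if remainder:
        # randint is inclusive on both ends, so the remainder can land
        # before the first block, between any two blocks, or after the last.
        position = rng.randint(0, len(complete))
        complete.insert(position, remainder)
    return [idx for block in complete for idx in block]


result = shuffle_with_remainder(range(10), block_size=3, seed=42)
print(len(result))  # 10: nothing is dropped
```

With ten indices and `block_size=3`, the single leftover index `9` forms the remainder block and always starts at an offset that is a multiple of 3 in the output.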
See also
Streaming – For sequential sampling without shuffling
BlockWeightedSampling – For weighted block-based sampling
Methods
__init__([block_size, indices, drop_last]) – Initialize block shuffling strategy.
get_indices(data_collection[, seed, rng]) – Generate indices with block-based shuffling.
get_len(data_collection) – Get the effective length of the data collection for this strategy.
- __init__(block_size: int = 8, indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None, drop_last: bool = False)
Initialize block shuffling strategy.
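- Parameters:
block_size (int, default=8) – Size of each block for shuffling.
indices (array-like, optional) – Subset of indices to use for sampling.
drop_last (bool, default=False) – Whether to drop the last incomplete block.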
- Raises:
ValueError – If block_size is not positive.
- get_len(data_collection) → int
Get the effective length of the data collection for this strategy.
Takes into account the drop_last setting when calculating the effective length.
- Parameters:
data_collection (object) – The data collection to get length from. Must support len().
- Returns:
Number of samples that will be yielded by this strategy.
- Return type:
int
Examples
>>> strategy = BlockShuffling(block_size=3, drop_last=False)
>>> strategy.get_len(range(10))
10

>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> strategy.get_len(range(10))  # 10 - (10 % 3) = 9
9
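The arithmetic in the comments above can be written out as a small standalone function. This is a sketch only: `effective_length` is a hypothetical name, and it works from a plain count, ignoring the optional `indices` subset the strategy may have been constructed with.

```python
def effective_length(n, block_size, drop_last):
    """Number of samples yielded for n items (hypothetical helper).

    With drop_last=True the n % block_size leftover items are excluded;
    otherwise every item is counted.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return n - (n % block_size) if drop_last else n


print(effective_length(10, 3, drop_last=False))  # 10
print(effective_length(10, 3, drop_last=True))   # 9
```

When `n` is already a multiple of `block_size`, both settings give the same result.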
- get_indices(data_collection, seed: int | None = None, rng: Generator | None = None) → ndarray[tuple[int, ...], dtype[int64]]
Generate indices with block-based shuffling.
Divides indices into blocks and shuffles the order of complete blocks. Incomplete blocks are either dropped or inserted at a random position, depending on the drop_last setting.
- Parameters:
data_collection (object) – The data collection to sample from. Must support len().
seed (int, optional) – Random seed for reproducible shuffling. Ignored if rng is provided.
rng (numpy.random.Generator, optional) – Random number generator to use for shuffling. If provided, seed is ignored.
- Returns:
Array of indices with blocks shuffled.
- Return type:
numpy.ndarray
Notes
When drop_last=True and there are remainder indices that don’t form a complete block, they are randomly removed from the dataset.
When drop_last=False, remainder indices are inserted at a random position among the shuffled complete blocks.
Examples
>>> strategy = BlockShuffling(block_size=2, drop_last=False)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)
5

>>> strategy = BlockShuffling(block_size=2, drop_last=True)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)  # Drops the last incomplete block
4
- Raises:
ValueError – If the random number generator cannot sample the required number of indices for removal when drop_last=True.
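The precedence between `seed` and `rng` documented for get_indices follows a common pattern, sketched below. The helper `resolve_rng` is hypothetical, and the standard-library `random.Random` stands in for `numpy.random.Generator` to keep the example dependency-free. How the library resolves the two arguments internally is an assumption here.

```python
import random


def resolve_rng(seed=None, rng=None):
    """Pick the generator to use: an explicit rng wins, otherwise build one
    from seed (hypothetical helper, stdlib stand-in for a NumPy Generator).
    """
    if rng is not None:
        # seed is ignored whenever a generator is supplied.
        return rng
    return random.Random(seed)


# Same seed, fresh generator each time: identical draws.
first = resolve_rng(seed=42).random()
second = resolve_rng(seed=42).random()
print(first == second)  # True

# A supplied generator is used as-is, regardless of seed.
shared = random.Random(7)
print(resolve_rng(seed=42, rng=shared) is shared)  # True
```

Passing a shared generator lets successive calls draw different shuffles from one reproducible stream, whereas passing the same `seed` each time repeats the same shuffle.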