scdataset.strategy.BlockShuffling

Bases: SamplingStrategy

Block-based shuffling sampling strategy.

This strategy divides the data into fixed-size blocks and shuffles the order of the blocks while preserving the original order within each block. This balances randomization against locality of data access, since the indices within a block stay together in their original order.
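The idea can be illustrated with a few lines of NumPy. This is a minimal sketch, not scdataset's implementation: the helper name `block_shuffle_sketch` is hypothetical, and it assumes the number of indices is an exact multiple of the block size.

```python
import numpy as np


def block_shuffle_sketch(n, block_size, seed=None):
    """Shuffle the order of fixed-size blocks, keeping order inside each block.

    Hypothetical helper for illustration; assumes n % block_size == 0.
    """
    if block_size <= 0 or n % block_size != 0:
        raise ValueError("this sketch only handles complete blocks")
    rng = np.random.default_rng(seed)
    # One row per block: row i holds indices [i*block_size, (i+1)*block_size).
    blocks = np.arange(n).reshape(n // block_size, block_size)
    # Permute rows (blocks) only; the within-block order is untouched.
    shuffled = blocks[rng.permutation(len(blocks))]
    return shuffled.ravel()


result = block_shuffle_sketch(12, block_size=4, seed=0)
```

Every run of `block_size` consecutive outputs is one intact source block, so a reader that fetches data in that order still sees contiguous ranges.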

Parameters:

block_size (int, default=8) – Size of each block for shuffling. Larger blocks maintain more locality but provide less randomization.
indices (array-like, optional) – Subset of indices to use for sampling. If None, uses all indices from 0 to len(data_collection)-1.
drop_last (bool, default=False) – Whether to drop the last incomplete block if the total number of indices is not divisible by block_size.

Variables:

_shuffle_before_yield (bool) – Always True for block shuffling strategy.
_indices (numpy.ndarray or None) – Stored subset of indices if provided.
block_size (int) – Size of blocks for shuffling.
drop_last (bool) – Whether to drop incomplete blocks.

Notes

When drop_last=False and there’s a remainder block smaller than block_size, it’s inserted at a random position among the shuffled complete blocks.
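The remainder handling described in this note could look roughly like the following. The function `shuffle_with_remainder_sketch` is a hypothetical illustration written for this page, not the library's code, and it ignores the `indices` argument.

```python
import numpy as np


def shuffle_with_remainder_sketch(n, block_size, seed=None):
    """Shuffle complete blocks, then place the short remainder block at random.

    Hypothetical helper; a sketch of the documented behaviour only.
    """
    rng = np.random.default_rng(seed)
    n_blocks = n // block_size
    n_complete = n_blocks * block_size
    # Complete blocks as a list of 1-D arrays, in shuffled order.
    rows = np.arange(n_complete).reshape(n_blocks, block_size)
    blocks = [rows[i] for i in rng.permutation(n_blocks)]
    remainder = np.arange(n_complete, n)
    if remainder.size > 0:
        # Any of the n_blocks + 1 gaps around the complete blocks is valid.
        position = int(rng.integers(0, n_blocks + 1))
        blocks.insert(position, remainder)
    if not blocks:
        return np.empty(0, dtype=np.int64)
    return np.concatenate(blocks)


result = shuffle_with_remainder_sketch(10, block_size=3, seed=0)
```

With 10 indices and `block_size=3`, the single leftover index 9 always lands on a block boundary, never inside a complete block.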

Examples

>>> import numpy as np
>>> from scdataset.strategy import BlockShuffling
>>> # Basic block shuffling
>>> strategy = BlockShuffling(block_size=3)
>>> np.random.seed(42)  # For reproducible example
>>> indices = strategy.get_indices(range(10), seed=42)
>>> len(indices)
10

>>> # Drop incomplete blocks
>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> indices = strategy.get_indices(range(10), seed=42)
>>> len(indices)  # 10 // 3 * 3 = 9
9

See also

Streaming: For sequential sampling without shuffling
BlockWeightedSampling: For weighted block-based sampling

Methods

`__init__`([block_size, indices, drop_last])	Initialize block shuffling strategy.
`get_indices`(data_collection[, seed, rng])	Generate indices with block-based shuffling.
`get_len`(data_collection)	Get the effective length of the data collection for this strategy.

__init__(block_size=8, indices=None, drop_last=False)

Initialize block shuffling strategy.

Parameters:

block_size (int, default=8) – Size of blocks for shuffling. Must be positive.
indices (array-like, optional) – Subset of indices to sample from.
drop_last (bool, default=False) – Whether to drop the last incomplete block.

Raises:

ValueError – If block_size is not positive.

get_len(data_collection) → int[source]

Get the effective length of the data collection for this strategy.

Takes into account the drop_last setting when calculating the effective length.

Parameters:

data_collection (object) – The data collection to get length from. Must support len().

Returns:

Number of samples that will be yielded by this strategy.

Return type:

int

Examples

>>> strategy = BlockShuffling(block_size=3, drop_last=False)
>>> strategy.get_len(range(10))
10

>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> strategy.get_len(range(10))  # 10 - (10 % 3) = 9
9
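The arithmetic behind these two results can be written out directly. The helper below is a hypothetical sketch (`effective_len_sketch` is not part of scdataset) and assumes no `indices` subset was supplied, so the length comes straight from the data collection.

```python
def effective_len_sketch(n, block_size, drop_last):
    """Number of indices left after optionally discarding the remainder.

    Hypothetical helper; assumes the strategy samples from all n items.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if drop_last:
        # Keep only whole blocks: subtract the leftover n % block_size.
        return n - (n % block_size)
    return n


full = effective_len_sketch(10, 3, drop_last=False)
trimmed = effective_len_sketch(10, 3, drop_last=True)
```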

get_indices(data_collection, seed: int | None = None, rng: Generator | None = None) → ndarray[tuple[int, ...], dtype[int64]][source]

Generate indices with block-based shuffling.

Divides indices into blocks and shuffles the order of complete blocks. Incomplete blocks are either dropped or inserted at random positions depending on the drop_last setting.

Parameters:

data_collection (object) – The data collection to sample from. Must support len().
seed (int, optional) – Random seed for reproducible shuffling. Ignored if rng is provided.
rng (numpy.random.Generator, optional) – Random number generator to use for shuffling. If provided, seed is ignored.
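The precedence between `seed` and `rng` described above can be sketched as follows. `resolve_rng_sketch` is a hypothetical name; how the library resolves the two arguments internally is an assumption here.

```python
import numpy as np


def resolve_rng_sketch(seed=None, rng=None):
    """Return rng if given, otherwise a new Generator built from seed.

    Hypothetical helper illustrating "rng wins over seed".
    """
    if rng is not None:
        return rng  # seed is ignored in this branch
    return np.random.default_rng(seed)


# Same seed, fresh generators: identical permutations.
a = resolve_rng_sketch(seed=42).permutation(5)
b = resolve_rng_sketch(seed=42).permutation(5)

# An explicit generator is returned as-is, whatever seed says.
shared = np.random.default_rng(7)
chosen = resolve_rng_sketch(seed=42, rng=shared)
```

Passing a shared `Generator` is useful when several components should draw from one random stream, while `seed` is the simpler choice for a one-off reproducible call.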

Returns:

Array of indices with blocks shuffled.

Return type:

numpy.ndarray

Notes

When drop_last=True and the number of indices is not a multiple of block_size, the surplus indices are removed at random so that only complete blocks remain.

When drop_last=False, remainder indices are inserted at a random position among the shuffled complete blocks.
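One possible reading of the drop_last=True note is sketched below, under the assumption that "randomly removed" means sampling as many positions as the remainder, without replacement, and discarding them. `drop_surplus_sketch` is a hypothetical helper, and the library may select the dropped indices differently.

```python
import numpy as np


def drop_surplus_sketch(indices, block_size, seed=None):
    """Discard len(indices) % block_size randomly chosen entries.

    Assumption: the surplus is sampled without replacement; this is an
    illustration, not scdataset's code.
    """
    rng = np.random.default_rng(seed)
    indices = np.asarray(indices)
    surplus = len(indices) % block_size
    if surplus == 0:
        return indices
    # Sample positions (not values) so repeated index values are handled too.
    drop_positions = rng.choice(len(indices), size=surplus, replace=False)
    keep = np.ones(len(indices), dtype=bool)
    keep[drop_positions] = False
    # Boolean masking preserves the original relative order.
    return indices[keep]


kept = drop_surplus_sketch(np.arange(5), block_size=2, seed=42)
```

The surviving indices divide evenly into blocks, which can then be shuffled as complete blocks.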

Examples

>>> strategy = BlockShuffling(block_size=2, drop_last=False)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)
5

>>> strategy = BlockShuffling(block_size=2, drop_last=True)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)  # Drops the last incomplete block
4

Raises:

ValueError – If the random number generator cannot sample the required number of indices for removal when drop_last=True.