scdataset.strategy.BlockShuffling


class scdataset.strategy.BlockShuffling(block_size: int = 8, indices: ArrayLike | None = None, drop_last: bool = False)

Bases: SamplingStrategy

Block-based shuffling sampling strategy.

This strategy divides the data into blocks of fixed size and shuffles the order of blocks while maintaining the original order within each block. This provides a balance between randomization and maintaining some locality of data access patterns.
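
The idea can be sketched with plain NumPy. This is only an illustration of the technique, not scdataset's implementation: block_shuffle is a hypothetical helper, and it assumes the number of items is an exact multiple of block_size so that every block is complete.

```python
import numpy as np


def block_shuffle(n, block_size, seed=None):
    """Shuffle the order of fixed-size blocks, keeping order inside each block.

    Hypothetical helper for illustration; not part of scdataset.
    Assumes n is divisible by block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if n % block_size != 0:
        raise ValueError("this sketch only handles complete blocks")
    rng = np.random.default_rng(seed)
    blocks = np.arange(n).reshape(-1, block_size)  # one row per block
    order = rng.permutation(len(blocks))           # new order of the blocks
    return blocks[order].ravel()                   # rows stay internally sorted


shuffled = block_shuffle(12, block_size=3, seed=0)
print(shuffled.reshape(-1, 3))  # each row is one contiguous block
```

Every row printed is a run of consecutive indices such as 3, 4, 5. Only the order of the rows changes between seeds, which is what preserves locality of access within a block.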

Parameters:
  • block_size (int, default=8) – Size of each block for shuffling. Larger blocks maintain more locality but provide less randomization.

  • indices (array-like, optional) – Subset of indices to use for sampling. If None, uses all indices from 0 to len(data_collection)-1.

  • drop_last (bool, default=False) – Whether to drop the last incomplete block if the total number of indices is not divisible by block_size.

Variables:
  • _shuffle_before_yield (bool) – Always True for block shuffling strategy.

  • _indices (numpy.ndarray or None) – Stored subset of indices if provided.

  • block_size (int) – Size of blocks for shuffling.

  • drop_last (bool) – Whether to drop incomplete blocks.

Notes

When drop_last=False and there’s a remainder block smaller than block_size, it’s inserted at a random position among the shuffled complete blocks.
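
A minimal sketch of the behaviour described in this note, written against NumPy only. block_shuffle_with_remainder is a hypothetical name, and the way the insertion position is drawn is an assumption; the library may choose it differently.

```python
import numpy as np


def block_shuffle_with_remainder(indices, block_size, seed=None):
    """Shuffle complete blocks, then insert the short remainder block.

    Hypothetical helper for illustration; not scdataset's implementation.
    """
    indices = np.asarray(indices)
    rng = np.random.default_rng(seed)
    n_complete = (len(indices) // block_size) * block_size

    # Split the complete part into blocks and shuffle their order.
    blocks = list(indices[:n_complete].reshape(-1, block_size))
    blocks = [blocks[i] for i in rng.permutation(len(blocks))]

    remainder = indices[n_complete:]
    if len(remainder) > 0:
        # Any of the len(blocks) + 1 gaps is allowed, including both ends.
        position = int(rng.integers(0, len(blocks) + 1))
        blocks.insert(position, remainder)

    if not blocks:
        return indices[:0]
    return np.concatenate(blocks)


print(block_shuffle_with_remainder(range(10), block_size=3, seed=1))
```

With 10 indices and block_size=3, the single leftover index 9 always starts at a block boundary and never splits a complete block.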

Examples

>>> from scdataset.strategy import BlockShuffling
>>> # Basic block shuffling
>>> strategy = BlockShuffling(block_size=3)
>>> indices = strategy.get_indices(range(10), seed=42)  # seed makes the shuffle reproducible
>>> len(indices)
10
>>> # Drop incomplete blocks
>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> indices = strategy.get_indices(range(10), seed=42)
>>> len(indices)  # 10 // 3 * 3 = 9
9

See also
  • Streaming – For sequential sampling without shuffling.

  • BlockWeightedSampling – For weighted block-based sampling.

Methods
  • __init__([block_size, indices, drop_last]) – Initialize block shuffling strategy.

  • get_indices(data_collection[, seed, rng]) – Generate indices with block-based shuffling.

  • get_len(data_collection) – Get the effective length of the data collection for this strategy.

__init__(block_size: int = 8, indices: ArrayLike | None = None, drop_last: bool = False)

Initialize block shuffling strategy.

Parameters:
  • block_size (int, default=8) – Size of blocks for shuffling. Must be positive.

  • indices (array-like, optional) – Subset of indices to sample from.

  • drop_last (bool, default=False) – Whether to drop the last incomplete block.

Raises:

ValueError – If block_size is not positive.

get_len(data_collection) → int

Get the effective length of the data collection for this strategy.

Takes into account the drop_last setting when calculating the effective length.

Parameters:

data_collection (object) – The data collection to get length from. Must support len().

Returns:

Number of samples that will be yielded by this strategy.

Return type:

int
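
The arithmetic behind the effective length can be written out directly. effective_len is a hypothetical stand-alone function, not a scdataset API, and n stands for the number of candidate indices.

```python
def effective_len(n, block_size, drop_last):
    """Number of samples left once an incomplete trailing block is dropped.

    Hypothetical helper mirroring the documented arithmetic; not part of
    scdataset.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if drop_last:
        return n - (n % block_size)  # same as (n // block_size) * block_size
    return n


print(effective_len(10, 3, drop_last=False))  # 10
print(effective_len(10, 3, drop_last=True))   # 9
```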

Examples

>>> strategy = BlockShuffling(block_size=3, drop_last=False)
>>> strategy.get_len(range(10))
10
>>> strategy = BlockShuffling(block_size=3, drop_last=True)
>>> strategy.get_len(range(10))  # 10 - (10 % 3) = 9
9

get_indices(data_collection, seed: int | None = None, rng: Generator | None = None) → ndarray[tuple[int, ...], dtype[int64]]

Generate indices with block-based shuffling.

Divides the indices into blocks of block_size and shuffles the order of the complete blocks. An incomplete final block, if there is one, is either dropped or inserted at a random position, depending on the drop_last setting.

Parameters:
  • data_collection (object) – The data collection to sample from. Must support len().

  • seed (int, optional) – Random seed for reproducible shuffling. Ignored if rng is provided.

  • rng (numpy.random.Generator, optional) – Random number generator to use for shuffling. If provided, seed is ignored.
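
The precedence between seed and rng described above can be sketched as follows. resolve_rng is a hypothetical helper used only for illustration; how scdataset resolves the two arguments internally is not shown here.

```python
import numpy as np


def resolve_rng(seed=None, rng=None):
    """Return the generator to shuffle with: an explicit rng wins over seed.

    Hypothetical helper; not part of scdataset.
    """
    if rng is not None:
        return rng                     # seed is ignored in this case
    return np.random.default_rng(seed)  # seed=None gives fresh entropy


# Same seed, same permutation.
a = resolve_rng(seed=42).permutation(5)
b = resolve_rng(seed=42).permutation(5)
print(np.array_equal(a, b))  # True

# A shared generator advances its state, so repeated calls differ over time.
shared = np.random.default_rng(0)
print(resolve_rng(seed=42, rng=shared) is shared)  # True
```

Passing a shared Generator is useful when several sampling steps should draw from one random stream, while a seed suits one-off reproducible calls.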

Returns:

Array of indices with blocks shuffled.

Return type:

numpy.ndarray

Notes

When drop_last=True and there are remainder indices that don’t form a complete block, they are randomly removed from the dataset.

When drop_last=False, remainder indices are inserted at a random position among the shuffled complete blocks.
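
The first note can be read in more than one way. One possible reading, which is an assumption and not confirmed by this page, is that as many indices as the remainder are chosen at random and discarded before the blocks are formed. The sketch below illustrates that reading only; drop_random_remainder is a hypothetical helper.

```python
import numpy as np


def drop_random_remainder(indices, block_size, seed=None):
    """Discard len(indices) % block_size randomly chosen entries.

    Hypothetical helper showing one reading of the note; not scdataset's
    implementation.
    """
    indices = np.asarray(indices)
    rng = np.random.default_rng(seed)
    remainder = len(indices) % block_size
    if remainder == 0:
        return indices
    # Pick distinct positions to discard; the survivors keep their order.
    drop = rng.choice(len(indices), size=remainder, replace=False)
    keep = np.ones(len(indices), dtype=bool)
    keep[drop] = False
    return indices[keep]


kept = drop_random_remainder(range(5), block_size=2, seed=42)
print(len(kept))  # 4
```

Under this reading the surviving indices still divide evenly into blocks, but which index is missing depends on the seed.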

Examples

>>> strategy = BlockShuffling(block_size=2, drop_last=False)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)
5
>>> strategy = BlockShuffling(block_size=2, drop_last=True)
>>> indices = strategy.get_indices(range(5), seed=42)
>>> len(indices)  # Drops the last incomplete block
4

Raises:

ValueError – If the random number generator cannot sample the required number of indices for removal when drop_last=True.