scdataset.strategy.ClassBalancedSampling
- class scdataset.strategy.ClassBalancedSampling(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], block_size: int = 8, indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None, total_size: int | None = None, replace: bool = True, sampling_size: int | None = None)
Bases: BlockWeightedSampling
Class-balanced sampling with automatic weight computation.
This strategy extends BlockWeightedSampling by automatically computing balanced weights from provided labels, making each class equally likely to be sampled regardless of the original class distribution in the dataset.
The weights are computed as the inverse of class frequencies, ensuring that underrepresented classes get higher sampling probability and overrepresented classes get lower sampling probability.
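The inverse-frequency rule described above can be sketched with plain NumPy. This is an illustration only, not the library's implementation: `inverse_frequency_weights` is a hypothetical helper name, and the exact normalization used inside ClassBalancedSampling is not stated on this page.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Hypothetical helper: weight of a sample = 1 / (size of its class)."""
    labels = np.asarray(labels).ravel()
    # inverse[i] is the position of labels[i] in the sorted unique classes,
    # so counts[inverse] gives each sample the size of its own class.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    return 1.0 / counts[inverse]

labels = np.array([0] * 90 + [1] * 10)   # 90% class 0, 10% class 1
weights = inverse_frequency_weights(labels)
probs = weights / weights.sum()          # normalize to sampling probabilities

# Total probability mass per class: equal once weights are inverse frequencies.
class0_mass = probs[labels == 0].sum()
class1_mass = probs[labels == 1].sum()
print(class0_mass, class1_mass)
```

Each class ends up with the same total probability mass, even though an individual class 1 sample is far more likely to be drawn than an individual class 0 sample.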
Dual Behavior for Labels:
The strategy supports two modes based on the labels array length:
Global class balancing (labels length = full dataset): Weights are computed from the full dataset’s class distribution. When sampling from a subset (via indices), samples are weighted according to their importance in the global distribution, not the subset.
Subset class balancing (labels length = indices length): Weights are computed only from the labels of the subset indices. This balances classes within the subset, ignoring the global distribution.
- Parameters:
labels (array-like) –
Class labels for each sample. The length determines the balancing mode:
If len(labels) == len(data_collection): Global balancing mode. Weights are computed from the full dataset, then applied to the subset.
If len(labels) == len(indices): Subset balancing mode. Weights are computed only from the subset’s labels.
block_size (int, default=8) – Size of blocks for block shuffling after sampling.
indices (array-like, optional) – Subset of indices to sample from. If None, uses all indices.
total_size (int, optional) – Total number of samples to draw. If None, uses the length of indices or data_collection.
replace (bool, default=True) – Whether to sample with replacement.
sampling_size (int, optional) – Size of each sampling round when replace=False. Required when replace=False.
- Variables:
labels (numpy.ndarray) – Array of class labels for each sample.
- Raises:
ValueError – If labels array is empty.
Examples
Global balancing - balance for full dataset distribution:
>>> from scdataset.strategy import ClassBalancedSampling
>>> # Full dataset: 90% class 0, 10% class 1
>>> full_labels = [0]*90 + [1]*10  # 100 samples total
>>> subset_indices = [0, 1, 90, 91, 92, 93, 94, 95, 96, 97]  # 2 of class 0, 8 of class 1
>>>
>>> # Global balancing: uses full dataset weights
>>> strategy = ClassBalancedSampling(full_labels, indices=subset_indices, total_size=20)
>>> # Class 1 samples get ~9x higher weight (because 1/10 vs 1/90 in global dist)
>>> # Even though subset is 80% class 1, global weights still favor class 1
Subset balancing - balance within the subset only:
>>> # Only provide labels for the subset indices
>>> subset_labels = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # Labels for subset: 20% class 0, 80% class 1
>>> strategy = ClassBalancedSampling(subset_labels, indices=subset_indices, total_size=20)
>>> # Now class 0 samples get 4x higher weight (because 1/2 vs 1/8 in subset dist)
>>> # This balances within the subset, ignoring global distribution
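The 9x and 4x ratios quoted in the two examples can be checked by hand with NumPy. The sketch below assumes raw weights of 1 / class count and uses a hypothetical helper, `inverse_frequency_weights`; it does not call the library.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Hypothetical helper: weight of a sample = 1 / (size of its class)."""
    labels = np.asarray(labels).ravel()
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    return 1.0 / counts[inverse]

full_labels = np.array([0] * 90 + [1] * 10)
subset_indices = np.array([0, 1, 90, 91, 92, 93, 94, 95, 96, 97])

# Global mode: weights come from all 100 labels, then only the subset is kept.
global_w = inverse_frequency_weights(full_labels)[subset_indices]
global_ratio = global_w[2] / global_w[0]     # class 1 sample vs class 0 sample

# Subset mode: weights come from the 10 subset labels alone.
subset_labels = full_labels[subset_indices]
subset_w = inverse_frequency_weights(subset_labels)
subset_ratio = subset_w[0] / subset_w[2]     # class 0 sample vs class 1 sample

print(global_ratio, subset_ratio)
```

The same ten samples are favored in opposite directions depending on which label array defines the class frequencies.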
See also
BlockWeightedSampling – For manual weight specification
BlockShuffling – For unweighted sampling
Notes
The computed weights ensure that each class has equal probability of being sampled, not that each class appears equally often in the final sample. The actual class distribution in samples will depend on the random sampling process and may vary between different runs.
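A small simulation illustrates the distinction between equal sampling probability and equal counts. It draws directly with NumPy's `Generator.choice` under assumed inverse-frequency probabilities; the library's own sampling and block shuffling are not reproduced here, and `class_counts` is a hypothetical helper.

```python
import numpy as np

labels = np.array([0] * 90 + [1] * 10)

# Assumed inverse-frequency probabilities (each class gets total mass 0.5).
_, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
probs = 1.0 / counts[inverse]
probs /= probs.sum()

def class_counts(seed, n_draws=20):
    """Draw n_draws samples with replacement and count how many land in each class."""
    rng = np.random.default_rng(seed)
    drawn = rng.choice(len(labels), size=n_draws, replace=True, p=probs)
    return np.bincount(labels[drawn], minlength=2)

# Per-class counts for a few seeds: they always sum to n_draws,
# but the split between classes is not fixed at exactly 10/10.
counts_by_seed = [class_counts(seed) for seed in range(5)]
for seed, c in enumerate(counts_by_seed):
    print(seed, c)
```

With only 20 draws the split fluctuates from seed to seed; with many draws the class fractions approach one half each.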
When using global balancing with a subset whose class proportions differ from those of the full dataset, the output may appear imbalanced relative to the subset. This is intentional: the weights reflect global importance.
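To make the size of this effect concrete, the sketch below computes the expected class shares for the subset used in the Examples section under global weights. It assumes integer labels and raw weights of 1 / global class count, and is not taken from the library's source.

```python
import numpy as np

full_labels = np.array([0] * 90 + [1] * 10)
subset_indices = np.array([0, 1, 90, 91, 92, 93, 94, 95, 96, 97])
subset_labels = full_labels[subset_indices]      # 2 of class 0, 8 of class 1

# Global class counts (np.bincount assumes non-negative integer labels).
global_counts = np.bincount(full_labels)         # [90, 10]

# Weight of each subset sample under global balancing, then normalize.
weights = 1.0 / global_counts[subset_labels]
probs = weights / weights.sum()

# Expected share of draws per class = summed probability of its samples.
expected_share = np.bincount(subset_labels, weights=probs)
print(expected_share)
```

Under these assumptions class 1 receives 36/37 of the probability mass (about 97%), so draws from this subset are dominated by class 1 even more strongly than its 80% share of the subset would suggest.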
Methods
__init__(labels[, block_size, indices, ...]) – Initialize class-balanced sampling strategy.
get_indices(data_collection[, seed, rng]) – Generate indices using weighted sampling followed by block shuffling.
get_len(data_collection) – Get the effective length of the data collection for this strategy.
- __init__(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], block_size: int = 8, indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None, total_size: int | None = None, replace: bool = True, sampling_size: int | None = None)
Initialize class-balanced sampling strategy.
- Parameters:
labels (array-like) –
Class labels for samples. The length of labels determines the balancing mode (see class docstring for details):
If len(labels) == len(indices): subset balancing mode. Labels correspond to the subset samples only.
If len(labels) > len(indices): global balancing mode. Labels correspond to the full dataset.
block_size (int, default=8) – Size of blocks for shuffling. Must be positive.
indices (array-like, optional) – Subset of indices to sample from.
total_size (int, optional) – Total number of samples to generate.
replace (bool, default=True) – Whether to sample with replacement.
sampling_size (int, optional) – Required when replace=False. Size of each sampling round.
- Raises:
ValueError – If the labels array is empty, block_size is not positive, or the labels length neither matches the indices length (subset mode) nor exceeds it (global mode).
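The length rules and error conditions listed for `__init__` can be summarized in a short sketch. `infer_balancing_mode` is a hypothetical function written only to mirror the documented rules; it is not part of scdataset, and it covers only the case where indices is provided (the behavior for indices=None is not specified in the parameter list above).

```python
import numpy as np

def infer_balancing_mode(labels, indices, block_size=8):
    """Hypothetical mirror of the documented checks; not the library's code."""
    labels = np.asarray(labels)
    indices = np.asarray(indices)
    if labels.size == 0:
        raise ValueError("labels array is empty")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(labels) == len(indices):
        return "subset"   # labels describe the subset samples only
    if len(labels) > len(indices):
        return "global"   # labels describe the full dataset
    raise ValueError(
        "labels must have as many entries as indices (subset mode) "
        "or more (global mode)"
    )

subset_indices = [0, 1, 90, 91, 92, 93, 94, 95, 96, 97]
print(infer_balancing_mode([0] * 90 + [1] * 10, subset_indices))  # full-dataset labels
print(infer_balancing_mode([0, 0] + [1] * 8, subset_indices))     # subset labels
```

Note that a labels array covering the full dataset is indistinguishable from a subset labels array when the subset happens to span every sample; in that case both modes would compute the same weights.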