scdataset.experimental.suggest_parameters#
- scdataset.experimental.suggest_parameters(data_collection, batch_size: int, target_ram_fraction: float = 0.2, max_workers: int = 16, min_workers: int = 1, verbose: bool = True, fetch_callback: Callable | None = None, fetch_transform: Callable | None = None, batch_callback: Callable | None = None, batch_transform: Callable | None = None) → Dict[str, Any][source]#
Suggest optimal parameters for scDataset based on system resources.
This function analyzes the data collection and available system resources to suggest optimal values for
the num_workers, fetch_factor, and block_size parameters.
- Parameters:
data_collection (object) – The data collection to be used with scDataset.
batch_size (int) – The batch size you plan to use.
target_ram_fraction (float, default=0.20) – Maximum fraction of available RAM to use for data loading. The default of 20% leaves room for the model and other processes.
max_workers (int, default=16) – Maximum number of workers to suggest. Using more than 16 workers typically yields diminishing returns.
min_workers (int, default=1) – Minimum number of workers to suggest.
verbose (bool, default=True) – If True, print detailed suggestions and explanations.
fetch_callback (Callable, optional) – Custom fetch function. Pass the same function you will use with scDataset for accurate memory estimation.
fetch_transform (Callable, optional) – Transform to apply after fetching data. Pass the same function you will use with scDataset for accurate memory estimation.
batch_callback (Callable, optional) – Custom batch extraction function. Pass the same function you will use with scDataset for accurate memory estimation.
batch_transform (Callable, optional) – Transform to apply to batches. Pass the same function you will use with scDataset for accurate memory estimation.
- Returns:
Dictionary containing suggested parameters:
- num_workers: Suggested number of DataLoader workers
- fetch_factor: Suggested fetch factor for scDataset
- block_size_conservative: Block size for more randomness (fetch_factor // 2)
- block_size_balanced: Block size balancing randomness and throughput
- block_size_aggressive: Block size for maximum throughput (fetch_factor * 2)
- prefetch_factor: Suggested prefetch_factor for DataLoader
- estimated_memory_per_fetch_mb: Estimated memory per fetch operation in MB
- system_info: Dictionary with system information used for calculation
- Return type:
Dict[str, Any]
Examples
>>> import numpy as np
>>> from scdataset import scDataset, BlockShuffling
>>> from scdataset.experimental import suggest_parameters
>>> from torch.utils.data import DataLoader
>>>
>>> data = np.random.randn(10000, 200)
>>> params = suggest_parameters(data, batch_size=64, verbose=False)
>>>
>>> # Use suggested parameters
>>> strategy = BlockShuffling(block_size=params['block_size_balanced'])
>>> dataset = scDataset(
...     data, strategy,
...     batch_size=64,
...     fetch_factor=params['fetch_factor']
... )
>>> loader = DataLoader(
...     dataset, batch_size=None,
...     num_workers=min(params['num_workers'], 2),  # Limit for example
...     prefetch_factor=params['prefetch_factor']
... )
Notes
Worker selection logic:
The number of workers is set to min(available_cores // 2, max_workers). Using half the cores leaves resources for the main process and model training.
Fetch factor selection logic:
The fetch factor is chosen such that the total data loaded by all workers does not exceed target_ram_fraction of available RAM. The calculation accounts for prefetching (prefetch_factor = fetch_factor + 1), which effectively doubles memory usage since both the current and prefetched data are in memory simultaneously:
\[2 \times batch\_size \times fetch\_factor \times num\_workers \times sample\_size < target\_ram\_fraction \times RAM\]
The factor of 2 accounts for the prefetch buffer in the DataLoader.
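The two heuristics above can be sketched in plain Python. This is a simplified sketch, not the library's actual implementation: in suggest_parameters the available RAM comes from psutil and the per-sample size from a probe fetch, whereas here both are passed in explicitly.

```python
def sketch_suggest(batch_size, sample_size_bytes, available_ram_bytes,
                   available_cores, target_ram_fraction=0.2,
                   max_workers=16, min_workers=1):
    """Simplified sketch of the worker and fetch-factor heuristics."""
    # Workers: half the available cores, clamped to [min_workers, max_workers].
    num_workers = min(max(available_cores // 2, min_workers), max_workers)

    # Fetch factor: the largest integer satisfying
    #   2 * batch_size * fetch_factor * num_workers * sample_size
    #       < target_ram_fraction * RAM
    # (the factor of 2 models the DataLoader prefetch buffer).
    budget = target_ram_fraction * available_ram_bytes
    per_unit = 2 * batch_size * num_workers * sample_size_bytes
    fetch_factor = max(1, int(budget // per_unit))
    return num_workers, fetch_factor

# Example: 64-sample batches of 1 KiB samples, 16 GiB free RAM, 8 cores.
nw, ff = sketch_suggest(64, 1024, 16 * 2**30, 8)
```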
Block size recommendations:
- block_size_conservative (fetch_factor // 2): More randomness, slightly lower throughput. Good for training where randomization is important.
- block_size_balanced (fetch_factor): Balanced randomness and throughput.
- block_size_aggressive (fetch_factor * 2): Higher throughput, less randomness.
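The three options are simple functions of the suggested fetch factor; a minimal sketch of the arithmetic (block_size_options is a hypothetical helper, not part of the scdataset API):

```python
def block_size_options(fetch_factor: int) -> dict:
    """Derive the three documented block-size options from a fetch factor."""
    return {
        "block_size_conservative": max(1, fetch_factor // 2),  # more randomness
        "block_size_balanced": fetch_factor,                   # balanced
        "block_size_aggressive": fetch_factor * 2,             # more throughput
    }

opts = block_size_options(8)
```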
Block sizes smaller than fetch_factor // 2 or larger than fetch_factor * 2 have diminishing returns.
- Raises:
ImportError – If psutil is not installed (optional dependency).
- Warns:
UserWarning – If psutil is not available, uses conservative defaults.