scdataset.experimental.suggest_parameters

scdataset.experimental.suggest_parameters(data_collection, batch_size: int, target_ram_fraction: float = 0.2, max_workers: int = 16, min_workers: int = 1, verbose: bool = True, fetch_callback: Callable | None = None, fetch_transform: Callable | None = None, batch_callback: Callable | None = None, batch_transform: Callable | None = None) → Dict[str, Any]

Suggest optimal parameters for scDataset based on system resources.

This function analyzes the data collection and available system resources to suggest values for the num_workers, fetch_factor, and block_size parameters.

Parameters:
  • data_collection (object) – The data collection to be used with scDataset.

  • batch_size (int) – The batch size you plan to use.

  • target_ram_fraction (float, default=0.20) – Maximum fraction of available RAM to use for data loading. The default of 20% leaves room for the model and other processes.

  • max_workers (int, default=16) – Maximum number of workers to suggest. Using more than 16 workers typically yields diminishing returns.

  • min_workers (int, default=1) – Minimum number of workers to suggest.

  • verbose (bool, default=True) – If True, print detailed suggestions and explanations.

  • fetch_callback (Callable, optional) – Custom fetch function. Pass the same function you will use with scDataset for accurate memory estimation.

  • fetch_transform (Callable, optional) – Transform to apply after fetching data. Pass the same function you will use with scDataset for accurate memory estimation.

  • batch_callback (Callable, optional) – Custom batch extraction function. Pass the same function you will use with scDataset for accurate memory estimation.

  • batch_transform (Callable, optional) – Transform to apply to batches. Pass the same function you will use with scDataset for accurate memory estimation.
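
For example, if your pipeline densifies sparse fetches, pass the same transform here so the memory estimate reflects the densified size. A minimal sketch (the to_dense helper is illustrative, not part of scdataset):

>>> import numpy as np
>>> from scdataset.experimental import suggest_parameters
>>>
>>> def to_dense(x):  # illustrative: densify sparse matrices, pass arrays through
...     return x.toarray() if hasattr(x, "toarray") else x
>>>
>>> data = np.random.randn(1000, 200)
>>> params = suggest_parameters(data, batch_size=64, verbose=False,
...                             fetch_transform=to_dense)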

Returns:

Dictionary containing suggested parameters:

  • num_workers: Suggested number of DataLoader workers

  • fetch_factor: Suggested fetch factor for scDataset

  • block_size_conservative: Block size for more randomness (fetch_factor // 2)

  • block_size_balanced: Block size balancing randomness and throughput

  • block_size_aggressive: Block size for maximum throughput (fetch_factor * 2)

  • prefetch_factor: Suggested prefetch_factor for DataLoader

  • estimated_memory_per_fetch_mb: Estimated memory per fetch operation in MB

  • system_info: Dictionary with system information used for calculation

Return type:

dict
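
For illustration, assuming the returned dictionary contains exactly the keys documented above, a call can be inspected like this:

>>> import numpy as np
>>> from scdataset.experimental import suggest_parameters
>>> data = np.random.randn(1000, 200)
>>> params = suggest_parameters(data, batch_size=64, verbose=False)
>>> sorted(params)
['block_size_aggressive', 'block_size_balanced', 'block_size_conservative',
 'estimated_memory_per_fetch_mb', 'fetch_factor', 'num_workers',
 'prefetch_factor', 'system_info']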

Examples

>>> import numpy as np
>>> from scdataset import scDataset, BlockShuffling
>>> from scdataset.experimental import suggest_parameters
>>> from torch.utils.data import DataLoader
>>>
>>> data = np.random.randn(10000, 200)
>>> params = suggest_parameters(data, batch_size=64, verbose=False)
>>>
>>> # Use suggested parameters
>>> strategy = BlockShuffling(block_size=params['block_size_balanced'])
>>> dataset = scDataset(
...     data, strategy,
...     batch_size=64,
...     fetch_factor=params['fetch_factor']
... )
>>> loader = DataLoader(
...     dataset, batch_size=None,
...     num_workers=min(params['num_workers'], 2),  # Limit for example
...     prefetch_factor=params['prefetch_factor']
... )

Notes

Worker selection logic:

The number of workers is set to min(available_cores // 2, max_workers). Using half the cores leaves resources for the main process and model training.
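
A minimal sketch of this rule, assuming equivalent logic rather than the library's actual implementation:

>>> import os
>>> max_workers, min_workers = 16, 1
>>> available_cores = os.cpu_count() or 1
>>> # Half the cores go to data loading; the rest stay free for the main
>>> # process and model training.
>>> num_workers = max(min_workers, min(available_cores // 2, max_workers))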

Fetch factor selection logic:

The fetch factor is chosen such that the total data loaded by all workers does not exceed target_ram_fraction of available RAM. The calculation accounts for prefetching (prefetch_factor = fetch_factor + 1), which effectively doubles memory usage since both the current and prefetched data are in memory simultaneously:

\[2 \times \text{batch\_size} \times \text{fetch\_factor} \times \text{num\_workers} \times \text{sample\_size} < \text{target\_ram\_fraction} \times \text{RAM}\]

The factor of 2 accounts for the prefetch buffer in the DataLoader.
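
Solving the inequality for fetch_factor gives a simple closed form. A hedged sketch with illustrative numbers (the sample size and RAM figures are assumptions, not measured values):

>>> batch_size, num_workers = 64, 4
>>> sample_size = 200 * 8                  # bytes per sample: 200 float64 features
>>> available_ram = 16 * 1024**3           # assume 16 GiB of available RAM
>>> target_ram_fraction = 0.2
>>> per_unit = 2 * batch_size * num_workers * sample_size  # left side per unit of fetch_factor
>>> fetch_factor = max(1, int(target_ram_fraction * available_ram // per_unit))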

Block size recommendations:

  • block_size_conservative (fetch_factor // 2): More randomness, slightly lower throughput. Good for training where randomization is important.

  • block_size_balanced (fetch_factor): Balanced randomness and throughput.

  • block_size_aggressive (fetch_factor * 2): Higher throughput, less randomness.

Block sizes smaller than fetch_factor // 2 or larger than fetch_factor * 2 have diminishing returns.
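
Concretely, the three recommendations derive from fetch_factor as follows (a sketch with an illustrative fetch_factor):

>>> fetch_factor = 16                                    # illustrative value
>>> block_size_conservative = fetch_factor // 2          # more randomness
>>> block_size_balanced = fetch_factor                   # balanced trade-off
>>> block_size_aggressive = fetch_factor * 2             # maximum throughput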

Raises:

ImportError – If psutil is not installed (optional dependency).

Warns:

UserWarning – Issued if psutil is not available; conservative defaults are used instead.