scdataset.transforms.hf_tahoe_to_tensor


scdataset.transforms.hf_tahoe_to_tensor(batch, num_genes: int = 62713)

Transform HuggingFace Tahoe-100M sparse gene expression data to dense tensors.

This transform converts sparse gene expression data stored in HuggingFace format (with separate ‘genes’ and ‘expressions’ arrays) into dense PyTorch tensors suitable for model training.

Parameters:
  • batch (dict or list) –

    Batch of data from HuggingFace dataset. Can be:

    • dict with ‘genes’ and ‘expressions’ keys (list of arrays)

    • list of dicts, each with ‘genes’ and ‘expressions’ keys

  • num_genes (int, default=62713) – Total number of genes (dimension of output tensor). Default is the Tahoe-100M gene count.
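The two accepted batch layouts can be illustrated with a small sketch. The gene indices and expression values below are made up for illustration, and the variable names (`batch_as_dict`, `batch_as_list`, `rows_from_dict`) are hypothetical, not part of the library.

```python
# Two hypothetical cells with made-up gene indices and expression values.

# Layout 1: a dict of columns, each a list with one array per sample.
batch_as_dict = {
    "genes": [[0, 3], [1]],
    "expressions": [[2.0, 5.0], [7.0]],
}

# Layout 2: a list with one dict per sample.
batch_as_list = [
    {"genes": [0, 3], "expressions": [2.0, 5.0]},
    {"genes": [1], "expressions": [7.0]},
]

# Both layouts carry the same information: zipping the columns of the
# dict layout recovers the per-sample dicts of the list layout.
rows_from_dict = [
    {"genes": g, "expressions": e}
    for g, e in zip(batch_as_dict["genes"], batch_as_dict["expressions"])
]
```

In both layouts, the i-th `genes` array and the i-th `expressions` array describe the same sample and have the same length.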

Returns:

Dense tensor of shape (batch_size, num_genes) with gene expression values.

Return type:

torch.Tensor

Examples

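>>> # With scDataset; `hf_dataset` is assumed to be an already-loaded
>>> # HuggingFace dataset with 'genes' and 'expressions' columns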
>>> from scdataset import scDataset, BlockShuffling
>>> from scdataset.transforms import hf_tahoe_to_tensor
>>>
>>> dataset = scDataset(
...     hf_dataset,
...     BlockShuffling(),
...     batch_size=64,
...     fetch_transform=hf_tahoe_to_tensor
... )

Notes

This transform is specifically designed for datasets like Tahoe-100M that store sparse gene expression data in HuggingFace Datasets format, where each sample has variable-length arrays of gene indices and their expression values.

The transform builds the dense array with NumPy operations and only then converts it to a PyTorch tensor, which is faster than building sparse PyTorch tensors directly.
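The sparse-to-dense step described above might look roughly like the following. This is a minimal sketch, not the library's actual implementation: the function name `sparse_to_dense_sketch` is hypothetical, and it assumes every entry in `genes` is a valid zero-based column index in the range [0, num_genes). It stops at the NumPy array; the final conversion to a tensor could be done with `torch.from_numpy`.

```python
import numpy as np


def sparse_to_dense_sketch(batch, num_genes=62713):
    """Illustrative only: scatter per-sample sparse values into a dense array.

    Assumes each 'genes' entry is a zero-based column index < num_genes.
    """
    # Accept either a dict of columns or a list of per-sample dicts.
    if isinstance(batch, dict):
        genes_list = batch["genes"]
        expr_list = batch["expressions"]
    else:
        genes_list = [sample["genes"] for sample in batch]
        expr_list = [sample["expressions"] for sample in batch]

    # Start from zeros so that genes absent from a sample stay at 0.
    dense = np.zeros((len(genes_list), num_genes), dtype=np.float32)
    for row, (genes, expr) in enumerate(zip(genes_list, expr_list)):
        cols = np.asarray(genes, dtype=np.int64)
        vals = np.asarray(expr, dtype=np.float32)
        # Integer-array indexing writes all values of one sample at once.
        dense[row, cols] = vals
    return dense


# Made-up batch of two samples over a toy vocabulary of five genes.
batch = {"genes": [[0, 3], [1]], "expressions": [[2.0, 5.0], [7.0]]}
dense = sparse_to_dense_sketch(batch, num_genes=5)
# A final torch.from_numpy(dense) would yield a tensor of the same shape.
```

The output has one row per sample and `num_genes` columns, matching the documented return shape of (batch_size, num_genes).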