scdataset.MultiIndexable


class scdataset.MultiIndexable(*indexables, names: List[str] | None = None, unstructured: Dict[str, Any] | None = None, **named_indexables)

Bases: object

Container for multiple indexable objects that should be indexed together.

This class allows you to group multiple indexable objects (arrays, lists, etc.) and index them synchronously. It’s particularly useful for scenarios like:

  • Multi-modal single-cell data (gene expression + protein data)

  • Features and labels (X, y) that need to stay aligned

  • Multiple data modalities that share the same sample dimension

The class supports both positional and named access to the contained indexables, and ensures all indexables have the same length along the first dimension.

Additionally, it supports storing unstructured metadata that is not indexed but remains accessible after indexing operations. This is useful for keeping metadata like gene names, dataset info, or other non-sample-aligned data.

Parameters:
  • *indexables (indexable objects or dict) – Variable number of indexable objects that should be indexed together, OR a single dictionary where keys become names and values are indexables. All indexables must have the same length in the first dimension.

  • names (list of str, optional) – Names for the indexables when using positional arguments. Must contain exactly one name per indexable. Cannot be used with dictionary input.

  • unstructured (dict, optional) – Dictionary of non-indexable metadata. This data is preserved unchanged when the MultiIndexable is indexed/subsetted. Useful for storing metadata like gene names, dataset descriptions, or configuration.

  • **named_indexables (dict, optional) – Named indexable objects passed as keyword arguments. Cannot be used together with positional indexables.

Variables:
  • names (list of str or None) – Names of the indexables if provided, None otherwise.

  • count (int) – Number of indexables contained in this object.

  • unstructured (dict) – Dictionary of non-indexable metadata (empty dict if none provided).

Raises:
  • ValueError – If indexables have different lengths along the first dimension, or if the number of names doesn’t match the number of indexables.

  • TypeError – If both positional and keyword indexables are provided, or if unstructured is not a dictionary.
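The length and name checks described above can be pictured with a small pure-Python sketch. The helper `check_aligned` is hypothetical and is not part of scdataset; it only mirrors the documented conditions under which a ValueError is raised.

```python
from typing import Any, List, Optional, Sequence


def check_aligned(indexables: Sequence[Any], names: Optional[List[str]] = None) -> int:
    """Hypothetical helper (not the scdataset API): validate inputs and
    return the sample count shared by all indexables."""
    # One name per indexable is required when names are given.
    if names is not None and len(names) != len(indexables):
        raise ValueError(f"got {len(names)} names for {len(indexables)} indexables")
    # All indexables must agree on the length of their first dimension.
    lengths = {len(obj) for obj in indexables}
    if len(lengths) > 1:
        raise ValueError(f"indexables differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 samples, 2 values each
labels = [0, 1, 1]                                # 3 samples

n_samples = check_aligned([features, labels], names=["features", "labels"])

# Dropping one label breaks the alignment and triggers the error path.
try:
    check_aligned([features, labels[:2]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

Here `n_samples` is 3, and `mismatch_error` holds the message from the failed check. The wording of the real library's error messages may differ.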

Examples

Create with positional arguments:

>>> import numpy as np
>>> from scdataset import MultiIndexable
>>> x = np.random.randn(100, 50)
>>> y = np.random.randint(0, 3, 100)
>>> multi = MultiIndexable(x, y, names=['features', 'labels'])
>>> len(multi)
100
>>> multi.count
2

Create with dictionary as positional argument:

>>> data_dict = {
...     'genes': np.random.randn(100, 2000),
...     'proteins': np.random.randn(100, 100)
... }
>>> multi = MultiIndexable(data_dict)
>>> subset = multi[10:20]  # Get samples 10-19 from both modalities
>>> subset['genes'].shape
(10, 2000)

Create with keyword arguments:

>>> multi = MultiIndexable(
...     genes=np.random.randn(100, 2000),
...     proteins=np.random.randn(100, 100)
... )
>>> multi.names
['genes', 'proteins']

Create with unstructured metadata:

>>> gene_names = ['Gene_' + str(i) for i in range(2000)]
>>> multi = MultiIndexable(
...     X=np.random.randn(100, 2000),
...     unstructured={'gene_names': gene_names, 'dataset_name': 'MyDataset'}
... )
>>> multi.unstructured['gene_names'][:3]
['Gene_0', 'Gene_1', 'Gene_2']
>>> subset = multi[10:20]  # Unstructured data is preserved
>>> subset.unstructured['dataset_name']
'MyDataset'

Access by name or position:

>>> multi = MultiIndexable(x, y, names=['x', 'y'])
>>> same_x1 = multi[0]      # Access by position
>>> same_x2 = multi['x']    # Access by name
>>> np.array_equal(same_x1, same_x2)
True

Use with scDataset:

>>> from scdataset import scDataset, Streaming
>>> dataset = scDataset(multi, Streaming(), batch_size=32)
>>> for batch in dataset:
...     x_batch, y_batch = batch[0], batch[1]  # or batch['x'], batch['y']
...     break

See also

scdataset.scDataset

Main dataset class that can use MultiIndexable objects

Methods

  • __init__(*indexables[, names, unstructured]) – Initialize MultiIndexable with indexable objects.

  • items() – Iterate over (name, indexable) pairs.

  • keys() – Get the names or indices of indexables.

  • values() – Get the indexable objects.

Attributes

  • count – Number of indexables contained in this object.

  • names – Names of the indexables, if provided.

  • unstructured – Dictionary of non-indexable metadata.

  • unstructured_keys – List of keys in the unstructured metadata dictionary.

__init__(*indexables, names: List[str] | None = None, unstructured: Dict[str, Any] | None = None, **named_indexables)

Initialize MultiIndexable with indexable objects.

Can be initialized in four ways:

  1. Positional: MultiIndexable(x, y, z)

  2. Positional with names: MultiIndexable(x, y, names=['x', 'y'])

  3. Dictionary as positional: MultiIndexable({'x': x_data, 'y': y_data})

  4. Named keywords: MultiIndexable(x=x_data, y=y_data)

All variants support the optional unstructured parameter for non-indexable metadata.
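To make the four call styles concrete, the following sketch reduces each of them to a list of values plus an optional list of names. The function `normalize_inputs` is hypothetical and not taken from scdataset. The exception type raised for names combined with dictionary input is an assumption, since the documentation only states that the combination is not allowed.

```python
from typing import Any, List, Optional, Tuple


def normalize_inputs(
    *indexables: Any,
    names: Optional[List[str]] = None,
    **named_indexables: Any,
) -> Tuple[List[Any], Optional[List[str]]]:
    """Hypothetical helper (not the scdataset API): map every supported
    call style to a (values, names) pair."""
    if indexables and named_indexables:
        # Documented behaviour: mixing both styles is a TypeError.
        raise TypeError("pass indexables positionally or by keyword, not both")
    if named_indexables:
        # Style 4: keyword arguments supply names and values together.
        return list(named_indexables.values()), list(named_indexables.keys())
    if len(indexables) == 1 and isinstance(indexables[0], dict):
        # Style 3: a single dict; keys become names.
        if names is not None:
            # Assumption: the real exception type is not documented.
            raise ValueError("names cannot be combined with dictionary input")
        mapping = indexables[0]
        return list(mapping.values()), list(mapping.keys())
    # Styles 1 and 2: plain positional arguments, optionally with names.
    return list(indexables), (list(names) if names is not None else None)


x = [1, 2, 3]
y = [4, 5, 6]

positional = normalize_inputs(x, y)
with_names = normalize_inputs(x, y, names=["x", "y"])
from_dict = normalize_inputs({"x": x, "y": y})
from_kwargs = normalize_inputs(x=x, y=y)
```

In this sketch the last three calls produce the same result, and only the purely positional call leaves the names as None.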

property names: List[str] | None

Names of the indexables, if provided.

property count: int

Number of indexables contained in this object.

property unstructured: Dict[str, Any]

Dictionary of non-indexable metadata.

This data is preserved unchanged when the MultiIndexable is indexed or subsetted. The property returns the internal dictionary itself rather than a copy, so in-place changes modify the metadata stored on this object; copy the dictionary first if the original must stay intact.

Returns:

Dictionary containing unstructured metadata.

Return type:

dict

Examples

>>> multi = MultiIndexable(
...     X=np.zeros((10, 5)),
...     unstructured={'gene_names': ['A', 'B', 'C', 'D', 'E']}
... )
>>> multi.unstructured['gene_names']
['A', 'B', 'C', 'D', 'E']
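Because the property hands back the internal dictionary, edits made through it are visible on the object itself. The sketch below illustrates that aliasing with a hypothetical stand-in class, `MetadataHolder`, which is not part of scdataset, and shows `copy.deepcopy` as one way to work on an independent snapshot.

```python
import copy
from typing import Any, Dict


class MetadataHolder:
    """Hypothetical stand-in (not the scdataset class) whose property
    returns its internal dictionary directly."""

    def __init__(self, unstructured: Dict[str, Any]) -> None:
        self._unstructured = unstructured

    @property
    def unstructured(self) -> Dict[str, Any]:
        return self._unstructured  # no copy is made


holder = MetadataHolder({"gene_names": ["A", "B"]})

# Mutating through the property changes the stored metadata.
holder.unstructured["gene_names"].append("C")

# A deep copy can be edited without touching the original.
snapshot = copy.deepcopy(holder.unstructured)
snapshot["gene_names"].append("D")
```

After these statements the holder's list contains 'A', 'B', 'C', while only the snapshot also contains 'D'.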

property unstructured_keys: List[str]

List of keys in the unstructured metadata dictionary.

Returns:

Keys present in the unstructured dictionary.

Return type:

list of str

Examples

>>> multi = MultiIndexable(
...     X=np.zeros((10, 5)),
...     unstructured={'gene_names': ['A', 'B'], 'dataset': 'test'}
... )
>>> multi.unstructured_keys
['gene_names', 'dataset']

__getitem__(key: int | str | slice | Sequence[int] | ndarray)

Index the MultiIndexable object.

Parameters:

key (int, str, slice, or array-like) –

  • int: Return the indexable at that position

  • str: Return the indexable with that name (if names provided)

  • slice/array: Return new MultiIndexable with subsets at those sample indices

Returns:

  • Single indexable if key is int or str

  • New MultiIndexable with subsets if key represents sample indices

Return type:

object or MultiIndexable

Notes

When subsetting with slices or arrays, the unstructured metadata is preserved unchanged in the resulting MultiIndexable.
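The dispatch on key type can be sketched in pure Python as follows. `MiniMultiIndexable` is a hypothetical stand-in written for illustration, not the scdataset implementation: it works on plain lists, does not handle NumPy integer or array keys, and passes the same metadata dictionary to each subset, which is one possible reading of "preserved unchanged".

```python
from typing import Any, Dict, List, Optional


class MiniMultiIndexable:
    """Hypothetical stand-in (not the scdataset class) showing how one
    __getitem__ can serve positional, named, and sample-wise access."""

    def __init__(
        self,
        values: List[Any],
        names: Optional[List[str]] = None,
        unstructured: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.values = values
        self.names = names
        self.unstructured = unstructured if unstructured is not None else {}

    def __getitem__(self, key):
        if isinstance(key, str):
            # Name lookup: return a single indexable.
            if self.names is None or key not in self.names:
                raise KeyError(key)
            return self.values[self.names.index(key)]
        if isinstance(key, int):
            # Positional lookup: return a single indexable.
            return self.values[key]
        if isinstance(key, slice):
            # Sample slice: apply the same slice to every indexable.
            subsets = [v[key] for v in self.values]
        else:
            # Sequence of sample indices: gather them from every indexable.
            subsets = [[v[i] for i in key] for v in self.values]
        # Metadata is carried over without being indexed.
        return MiniMultiIndexable(subsets, self.names, self.unstructured)


multi = MiniMultiIndexable(
    [[10, 11, 12, 13], ["a", "b", "c", "d"]],
    names=["x", "y"],
    unstructured={"source": "demo"},
)

first = multi[0]         # positional access to one indexable
by_name = multi["y"]     # named access to one indexable
sliced = multi[1:3]      # samples 1 and 2 from both indexables
picked = multi[[3, 0]]   # samples 3 and 0, in that order
```

Note that an integer key selects an indexable rather than a sample; selecting a single sample from every indexable requires a slice or a one-element sequence such as `multi[[2]]`.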

__len__() → int

Return the number of samples (length of first dimension).

__repr__() → str

Return string representation of the MultiIndexable.

__iter__()

Iterate over indexables.

items()

Iterate over (name, indexable) pairs.

Yields:

tuple – (name, indexable) pairs if names are available, (index, indexable) pairs otherwise.

keys()

Get the names or indices of indexables.

Returns:

List of names if available, list of indices otherwise.

Return type:

list

values()

Get the indexable objects.

Returns:

List of indexable objects.

Return type:

list
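The fallback from names to integer indices in keys() and items() can be illustrated with a short sketch. The functions `sketch_keys` and `sketch_items` are hypothetical free functions written for this example; in the library these are methods on MultiIndexable.

```python
from typing import Any, Iterator, List, Optional, Tuple, Union

Key = Union[str, int]


def sketch_keys(values: List[Any], names: Optional[List[str]] = None) -> List[Key]:
    """Hypothetical: names if available, positional indices otherwise."""
    return list(names) if names is not None else list(range(len(values)))


def sketch_items(
    values: List[Any], names: Optional[List[str]] = None
) -> Iterator[Tuple[Key, Any]]:
    """Hypothetical: yield (key, indexable) pairs using the same fallback."""
    yield from zip(sketch_keys(values, names), values)


x = [1.0, 2.0]
y = [0, 1]

named_pairs = list(sketch_items([x, y], names=["x", "y"]))
unnamed_pairs = list(sketch_items([x, y]))
```

With names, the pairs are keyed by 'x' and 'y'; without names, they are keyed by 0 and 1, so the same loop works for both kinds of container.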