Analyze the sharded dataset in memory#

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.3)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.54.3 lnschema_bionty==0.31.2 scanpy==1.9.5
💡 Transform(id='mfWKm8OtAzp8z8', name='Analyze the sharded dataset in memory', short_name='scrna3', version='0', type=notebook, updated_at=2023-09-29 14:46:41, created_by_id='DzTjkKse')
💡 Run(id='J8w0qH8ZGFn49EehwWHM', run_at=2023-09-29 14:46:41, transform_id='mfWKm8OtAzp8z8', created_by_id='DzTjkKse')
ln.Dataset.filter().df()
name description version hash reference reference_type transform_id run_id file_id initial_version_id updated_at created_by_id
id
nV0w72HVEfJeK6lgb7BO My versioned scRNA-seq dataset None 1 WEFcMZxJNmMiUOFrcSTaig None None Nv48yAceNSh8z8 Pd5UweAXC3cCH1aOMdr1 nV0w72HVEfJeK6lgb7BO None 2023-09-29 14:45:51 DzTjkKse
nV0w72HVEfJeK6lgb70T My versioned scRNA-seq dataset None 2 0Uq1qU7xX7R6pyWN3oOT None None ManDYgmftZ8Cz8 0eCAXKvC5AGTwRu42M0X None nV0w72HVEfJeK6lgb7BO 2023-09-29 14:46:30 DzTjkKse
dataset = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()
dataset.files.df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
Vf6s6oe8cTQ8oqMCCjeT 975nKuX0 None .h5ad AnnData 10x reference adata None 660792 a2V0IgOjMRHsCeZH169UOQ md5 ManDYgmftZ8Cz8 0eCAXKvC5AGTwRu42M0X None 2023-09-29 14:46:24 DzTjkKse
nV0w72HVEfJeK6lgb7BO 975nKuX0 None .h5ad AnnData Conde22 None 28049505 WEFcMZxJNmMiUOFrcSTaig md5 Nv48yAceNSh8z8 Pd5UweAXC3cCH1aOMdr1 None 2023-09-29 14:45:51 DzTjkKse

If the dataset doesn’t consist of too many files, we can now load it into memory.

Under-the-hood, the AnnData objects are concatenated during loading.

The amount of time this takes depends on a variety of factors.

If it occurs often, one might consider storing a concatenated version of the dataset, rather than the individual pieces.

adata = dataset.load()

The default is an outer join during concatenation as in pandas:

adata
AnnData object with n_obs × n_vars = 1718 × 36508
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain', 'donor', 'tissue', 'assay', 'file_id'
    obsm: 'X_pca', 'X_umap'

The AnnData has the reference to the individual files in the .obs annotations:

adata.obs.file_id.cat.categories
Index(['Vf6s6oe8cTQ8oqMCCjeT', 'nV0w72HVEfJeK6lgb7BO'], dtype='object')

We can easily obtain ensemble IDs for gene symbols using the look up object:

genes = lb.Gene.lookup(field="symbol")
genes.itm2a.ensembl_gene_id
'ENSG00000078596'

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)
2023-09-29 14:46:45,341:INFO - Failed to extract font properties from /usr/share/fonts/truetype/noto/NotoColorEmoji.ttf: In FT2Font: Can not load face (unknown file format; error code 0x2)
2023-09-29 14:46:45,367:INFO - generated new fontManager
sc.pl.pca(
    adata,
    color=genes.itm2a.ensembl_gene_id,
    title=(
        f"{genes.itm2a.symbol} / {genes.itm2a.ensembl_gene_id} /"
        f" {genes.itm2a.description}"
    ),
    save="_itm2a",
)
WARNING: saving figure to file figures/pca_itm2a.pdf
https://d33wubrfki0l68.cloudfront.net/a16b94d52965bac89711015a9f479cef8f8ce793/6e514/_images/ce5748da790bc9e7ae953d5579b9db31f24f18e7140bef36910844e66a896392.png
file = ln.File("./figures/pca_itm2a.pdf", description="My result on ITM2A")
file.save()
file.view_flow()
https://d33wubrfki0l68.cloudfront.net/a1567d8a581ab7bfbbd2429b9a12764de4b5b66a/c3ffa/_images/b58be580a92db002f0fd28eeca0a2112dd8d49261d87815f02036ef397180222.svg
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna