Tutorial for spatial mapping using Tangram

What is Tangram?

Tangram is a method for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. Tangram takes as input a single-cell dataset and a spatial dataset, collected from the same anatomical region/tissue type. Via integration, Tangram creates new spatial data by aligning the scRNAseq profiles in space. This allows to project every annotation in the scRNAseq (e.g. cell types, program usage) on space.

What do I use Tangram for?

The most common application of Tangram is to resolve cell types in space. Another usage is to correct gene expression from spatial data: as scRNA-seq data are less prone to dropout than (e.g.) Visium or Slide-seq, the “new” spatial data generated by Tangram resolve many more genes. As a result, we can visualize program usage in space, which can be used for ligand-receptor pair discovery or, more generally, cell-cell communication mechanisms. If cell segmentation is available, Tangram can be also used for deconvolution of spatial data. If your single cell are multimodal, Tangram can be used to spatially resolve other modalities, such as chromatin accessibility.


Frequently Asked Questions about Tangram

How is Tangram different, than all the other deconvolution/mapping method?

  • Validation. Most methods “validate” mappings by looking at known patterns or proportion of cell types. These are good sanity checks, but are hardly useful when mapping is used for discovery. In Tangram, mappings are validated by inspective the predictions of holdout genes (test transcriptome).

My scRNAseq/spatial data come from different samples. Can I still use Tangram?

  • Yes. There is a clever variation invented by Sten Linnarsson, which consists of mapping average cells of a certain cell type, rather than single cells. This method is much faster, and smooths out variation in biological signal from different samples via averaging. However, it requires annotated scRNA-seq, sacrifices resolving biological variability at single-cell level. To map this way, pass mode=cluster.

Does Tangram only work on mouse brain data?

  • No. The original manuscript focused on mouse brain data b/c was funded by BICCN. We subsequently used Tangram for mapping lung, kidney and cancer tissue. If mapping doesn’t work for your case, that is hardly due to the complexity of the tissue.

Why doesn’t Tangram have hypotheses on the underlying model?

  • Most models used in biology are probabilistic: they assume that data are generated according to a certain probability distribution, hence the hypothesis. But Tangram doesn’t work that way: the hypothesis is that scRNA-seq and spatial data are generated with the same process (i.e. same biology) regardless of the process.

Where do I learn more about Tangram?


Setting up

Tangram is based on pytorch, scanpy and (optionally but highly-recommended) squidpy - this tutorial is designed to work with squidy. You can also check this tutorial, prior to integration with squidpy.

  • To run the notebook locally, create a conda environment as conda env create -f tangram_environment.yml using our YAML file.

  • This notebook is based on squidpy v1.1.0.

[1]:
import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
from anndata import AnnData
import pathlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import skimage
import seaborn as sns
import tangram as tg

sc.logging.print_header()
print(f"squidpy=={sq.__version__}")

%load_ext autoreload
%autoreload 2
%matplotlib inline
scanpy==1.8.1 anndata==0.7.6 umap==0.5.1 numpy==1.19.1 scipy==1.5.2 pandas==1.3.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.8 pynndescent==0.5.4
squidpy==1.1.2

Loading datasets

Load public data available in Squidpy, from mouse brain cortex. Single cell data are stored in adata_sc. Spatial data, in adata_st.

[2]:
adata_st = sq.datasets.visium_fluo_adata_crop()
adata_st = adata_st[
    adata_st.obs.cluster.isin([f"Cortex_{i}" for i in np.arange(1, 5)])
].copy()
img = sq.datasets.visium_fluo_image_crop()

adata_sc = sq.datasets.sc_mouse_cortex()

We subset the crop of the mouse brain to only contain clusters of the brain cortex. The pre-processed single cell dataset was taken from :cite:`tasic2018shared` and pre-processed with standard scanpy functions.

Let’s visualize both spatial and single-cell datasets.

[3]:
adata_st.obs
[3]:
in_tissue array_row array_col n_genes_by_counts log1p_n_genes_by_counts total_counts log1p_total_counts pct_counts_in_top_50_genes pct_counts_in_top_100_genes pct_counts_in_top_200_genes pct_counts_in_top_500_genes total_counts_MT log1p_total_counts_MT pct_counts_MT n_counts leiden cluster
AAATGGCATGTCTTGT-1 1 13 69 5191 8.554874 15977.0 9.678968 20.629655 26.757213 34.743694 48.889028 0.0 0.0 0.0 15977.0 0 Cortex_1
AACAACTGGTAGTTGC-1 1 28 42 5252 8.566555 16649.0 9.720165 20.481711 26.277855 34.092138 48.201093 0.0 0.0 0.0 16649.0 0 Cortex_1
AACAGGAAATCGAATA-1 1 15 67 6320 8.751633 23375.0 10.059465 17.929412 23.850267 32.077005 45.963636 0.0 0.0 0.0 23375.0 0 Cortex_1
AACCCAGAGACGGAGA-1 1 15 39 3659 8.205218 9229.0 9.130215 25.939972 31.964460 39.885145 53.212699 0.0 0.0 0.0 9229.0 1 Cortex_2
AACCGTTGTGTTTGCT-1 1 12 64 4512 8.414717 12679.0 9.447782 21.839262 28.038489 36.209480 50.540263 0.0 0.0 0.0 12679.0 0 Cortex_1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTGGATTGGGTACCAC-1 1 17 55 4980 8.513386 15381.0 9.640953 21.038944 27.059359 35.114752 49.197061 0.0 0.0 0.0 15381.0 0 Cortex_1
TTGGCTCGCATGAGAC-1 1 23 37 4620 8.438366 13193.0 9.487517 20.609414 26.445842 34.063519 48.442356 0.0 0.0 0.0 13193.0 5 Cortex_3
TTGTATCACACAGAAT-1 1 12 74 6120 8.719481 21951.0 9.996614 18.199626 24.235798 32.440436 46.663022 0.0 0.0 0.0 21951.0 0 Cortex_1
TTGTGGCCCTGACAGT-1 1 18 60 4971 8.511577 14779.0 9.601030 21.381690 27.924758 36.213546 49.780093 0.0 0.0 0.0 14779.0 0 Cortex_1
TTGTTAGCAAATTCGA-1 1 22 42 4820 8.480737 14396.0 9.574775 20.595999 26.674076 34.655460 48.624618 0.0 0.0 0.0 14396.0 5 Cortex_3

324 rows × 17 columns

[4]:
fig, axs = plt.subplots(1, 2, figsize=(20, 5))
sc.pl.spatial(
    adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0]
)
sc.pl.umap(
    adata_sc, color="cell_subclass", size=10, frameon=False, show=False, ax=axs[1]
)
plt.tight_layout()
_images/tutorial_sq_link_8_0.png

Tangram learns a spatial alignment of the single cell data by looking at a subset of genes, specified by the user, called the training genes. Training genes need to bear interesting signal and to be measured with high quality. Typically, we choose the training genes are 100-1000 differentially expressedx genes, stratified across cell types. Sometimes, we also use the entire transcriptome, or perform different mappings using different set of training genes to see how much the result change.

Tangram fits the scRNA-seq profiles on space using a custom loss function based on cosine similarity. The method is summarized in the sketch below:

title

Pre-processing

For this case, we use 1401 marker genes as training genes.

[5]:
sc.tl.rank_genes_groups(adata_sc, groupby="cell_subclass", use_raw=False)
markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).iloc[0:100, :]
markers = list(np.unique(markers_df.melt().value.values))
len(markers)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
[5]:
1401

We prepares the data using pp_adatas, which does the following: - Takes a list of genes from user via the genes argument. These genes are used as training genes. - Annotates training genes under the training_genes field, in uns dictionary, of each AnnData. - Ensure consistent gene order in the datasets (Tangram requires that the the \(j\)-th column in each matrix correspond to the same gene). - If the counts for a gene are all zeros in one of the datasets, the gene is removed from the training genes. - If a gene is not present in both datasets, the gene is removed from the training genes.

[6]:
tg.pp_adatas(adata_sc, adata_st, genes=markers)
INFO:root:1280 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:14785 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.

Two datasets contain 1280 training genes of the 1401 originally provided, as some training genes have been removed.

Find alignment

To find the optimal spatial alignment for scRNA-seq profiles, we use the map_cells_to_space function: - The function maps iteratively as specified by num_epochs. We typically interrupt mapping after the score plateaus. - The score measures the similarity between the gene expression of the mapped cells vs spatial data on the training genes. - The default mapping mode is mode='cells', which is recommended to run on a GPU. - Alternatively, one can specify mode='clusters' which averages the single cells beloning to the same cluster (pass annotations via cluster_label). This is faster, and is our chioce when scRNAseq and spatial data come from different specimens. - If you wish to run Tangram with a GPU, set device=cuda:0 otherwise use the set device=cpu. - density_prior specifies the cell density within each spatial voxel. Use uniform if the spatial voxels are at single cell resolution (ie MERFISH). The default value, rna_count_based, assumes that cell density is proportional to the number of RNA molecules.

[7]:
ad_map = tg.map_cells_to_space(adata_sc, adata_st,
    mode="cells",
#     mode="clusters",
#     cluster_label='cell_subclass',  # .obs field w cell types
    density_prior='rna_count_based',
    num_epochs=500,
    # device="cuda:0",
    device='cpu',
)
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 1280 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.
Score: 0.613, KL reg: 0.001
Score: 0.733, KL reg: 0.000
Score: 0.736, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
INFO:root:Saving results..

The mapping results are stored in the returned AnnData structure, saved as ad_map, structured as following: - The cell-by-spot matrix X contains the probability of cell i to be in spot j. - The obs dataframe contains the metadata of the single cells. - The var dataframe contains the metadata of the spatial data. - The uns dictionary contains a dataframe with various information about the training genes (saved as train_genes_df).

Cell type maps

To visualize cell types in space, we invoke project_cell_annotation to transfer the annotation from the mapping to space. We can then call plot_cell_annotation to visualize it. You can set the perc argument to set the range to the colormap, which would help remove outliers.

[8]:
ad_map
[8]:
AnnData object with n_obs × n_vars = 21697 × 324
    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
    uns: 'train_genes_df', 'training_history'
[9]:
tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list,perc=0.02)
INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.
_images/tutorial_sq_link_23_1.png

The first way to get a sense if mapping was successful is to look for known cell type patterns. To get a deeper sense, we can use the helper plot_training_scores which gives us four panels:

[10]:
tg.plot_training_scores(ad_map, bins=20, alpha=.5)
_images/tutorial_sq_link_25_0.png
  • The first panel is a histogram of the simlarity scores for each training gene.

  • In the second panel, each dot is a training gene and we can observe the training score (y-axis) and the sparsity in the scRNA-seq data (x-axis) of each gene.

  • The third panel is similar to the second one, but contains the gene sparsity of the spatial data. Spatial data are usually more sparse than single cell data, a discrepancy which is often responsible for low quality mapping.

  • In the last panel, we show the training scores as a function of the difference in sparsity between the dataset. For genes with comparable sparsity, the mapped gene expression is very similar to that in the spatial data. However, if a gene is quite sparse in one dataset (typically, the spatial data) but not in other, the mapping score is lower. This occurs as Tangram cannot properly matched the gene pattern because of inconsistent amount of dropouts between the datasets.

Although the above plots give us a summary of scores at single-gene level, we would need to know which are the genes are mapped with low scores. These information are stored in the dataframe .uns['train_genes_df']; this is the dataframe used to build the four plots above.

[11]:
ad_map.uns['train_genes_df']
[11]:
train_score sparsity_sc sparsity_sp sparsity_diff
ppia 0.998200 0.000092 0.000000 -0.000092
ubb 0.997364 0.000092 0.000000 -0.000092
atp1b1 0.997066 0.014334 0.000000 -0.014334
tmsb4x 0.996971 0.002811 0.000000 -0.002811
ckb 0.996360 0.002765 0.000000 -0.002765
... ... ... ... ...
trpc5 0.194716 0.569203 0.981481 0.412278
cdyl2 0.189953 0.425911 0.981481 0.555570
cntnap5c 0.157349 0.608241 0.993827 0.385586
dlx1as 0.142076 0.587777 0.990741 0.402964
kcnh6 0.133051 0.379131 0.996914 0.617783

1280 rows × 4 columns

New spatial data via aligned single cells

If the mapping mode is 'cells', we can now generate the “new spatial data” using the mapped single cell: this is done via project_genes. The function accepts as input a mapping (adata_map) and corresponding single cell data (adata_sc). The result is a voxel-by-gene AnnData, formally similar to adata_st, but containing gene expression from the mapped single cell data rather than Visium. For downstream analysis, we always replace adata_st with the corresponding ad_ge.

[14]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
ad_ge
[14]:
AnnData object with n_obs × n_vars = 324 × 36826
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'sparsity', 'is_training'
    uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'rank_genes_groups', 'training_genes', 'overlap_genes'

Let’s choose a few training genes mapped with low score, to try to understand why.

[15]:
genes = ['rragb', 'trim17', 'eno1b']
ad_map.uns['train_genes_df'].loc[genes]
[15]:
train_score sparsity_sc sparsity_sp sparsity_diff
rragb 0.358785 0.079919 0.867284 0.787365
trim17 0.201789 0.069641 0.959877 0.890236
eno1b 0.342446 0.022492 0.885802 0.863311

To visualize gene patterns, we use the helper plot_genes. This function accepts two voxel-by-gene AnnData: the actual spatial data (adata_measured), and a Tangram spatial prediction (adata_predicted). The function returns gene expression maps from the two spatial AnnData on the genes genes.

[16]:
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)
_images/tutorial_sq_link_35_0.png

The above pictures explain the low training scores. Some genes are detected with very different levels of sparsity - typically they are much more sparse in the scRNAseq than in the spatial data. This is due to the fact that technologies like Visium are more prone to technical dropouts. Therefore, Tangram cannot find a good spatial alignment for these genes as the baseline signal is missing. However, so long as most training genes are measured with high quality, we can trust mapping and use Tangram prediction to correct gene expression. This is an imputation method which relies on entirely different premises than those in probabilistic models.

Another application is found by inspecting genes that are not detected in the spatial data, but are detected in the single cell data. They are removed before training with pp_adatas function, but Tangram can predict their expression.

[17]:
genes=['loc102633833', 'gm5700', 'gm8292']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)
_images/tutorial_sq_link_37_0.png
  • So far, we only inspected genes used to align the data (training genes), but the mapped single cell data, ad_ge contains the whole transcriptome. That includes more than 35k test genes.

[18]:
(ad_ge.var.is_training == False).sum()
[18]:
35546

We can use plot_genes to inspect gene expression of test genes as well. Inspecting the test transcriptome is an essential to validate mapping. At the same time, we need to be careful that some prediction might disagree with spatial data because of the technical droputs.

It is convenient to compute the similarity scores of all genes, which can be done by compare_spatial_geneexp. This function accepts two spatial AnnDatas (ie voxel-by-gene), and returns a dataframe with simlarity scores for all genes. Training genes are flagged by the boolean field is_training. If we also pass single cell AnnData to compare_spatial_geneexp function like below, a dataframe with additional sparsity columns - sparsity_sc (single cell data sparsity) and sparsity_diff (spatial data sparsity - single cell data sparsity) will return. This is required if we want to call plot_test_scores function later with the returned datafrme from compare_spatial_geneexp function.

[19]:
df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
df_all_genes
[19]:
score is_training sparsity_sp sparsity_sc sparsity_diff
snap25 0.998238 False 0.000000 0.014610 -0.014610
ppia 0.998200 True 0.000000 0.000092 -0.000092
gapdh 0.998200 False 0.000000 0.000968 -0.000968
calm1 0.997942 False 0.000000 0.000369 -0.000369
calm2 0.997779 False 0.000000 0.001751 -0.001751
... ... ... ... ... ...
6330420h09rik 0.000014 False 0.996914 0.998894 -0.001980
1810010k12rik 0.000014 False 0.996914 0.999585 -0.002672
cckar 0.000013 False 0.996914 0.999309 -0.002395
chil3 0.000013 False 0.996914 0.998894 -0.001980
cyp3a13 0.000013 False 0.996914 0.998479 -0.001565

14785 rows × 5 columns

The prediction on test genes can be graphically visualized using plot_auc:

[20]:
# sns.scatterplot(data=df_all_genes, x='score', y='sparsity_sp', hue='is_training', alpha=.5);  # for legacy
tg.plot_auc(df_all_genes);
<Figure size 432x288 with 0 Axes>
_images/tutorial_sq_link_43_1.png

This above figure is the most important validation plot in *Tangram*. Each dot represents a gene; the x-axis indicates the score, and the y-axis the sparsity of that gene in the spatial data. Unsurprisingly, the genes predicted with low score represents very sparse genes in the spatial data, suggesting that the Tangram predictions correct expression in those genes. Note that curve observed above is typical of Tangram mappings: the area under that curve is the most reliable metric we use to evaluate mapping.

Let’s inspect a few predictions. Some of these genes are biologically sparse, but well predicted:

[21]:
genes=['tfap2b', 'zic4']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)
_images/tutorial_sq_link_45_0.png

Some non-sparse genes present petterns, that Tangram accentuates:

[22]:
genes = ['cd34', 'rasal1']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)
_images/tutorial_sq_link_47_0.png

Finally, some unannotated genes have unknown function. These genes are often hardly detected in spatial data but Tangram provides prediction:

[23]:
genes = ['gm33027', 'gm5431']
tg.plot_genes_sc(genes[:5], adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)
_images/tutorial_sq_link_49_0.png

For untargeted spatial technologies, like Visium and Slide-seq, a spatial voxel may contain more than one cells. In these cases, it might be useful to disentangle gene expression into single cells - a process called deconvolution. Deconvolution is a requested feature, and also hard to obtain accurately with computational methods. If your goal is to study co-localization of cell types, we recommend you work with the spatial cell type maps instead. If your aim is discovery of cell-cell communication mechanisms, we suggest you compute gene programs, then use project_cell_annotations to spatially visualize program usage. To proceed with deconvolution anyways, see below.

In order to deconvolve cells, Tangram needs to know how many cells are present in each voxel. This is achieved by segmenting the cells on the corresponding histology, which squidpy makes possible with two lines of code: - squidpy.im.process applies smoothing as a pre-processing step. - squidpy.im.segment computes segmentation masks with watershed algorithm.

Note that some technologies, like Slide-seq, currently do not allow staining the same slide of tissue on which genes are profiled. For these data, you can still attempt a deconvolution by estimating cell density in a rough way - often we just pass a uniform prior. Finally, note that deconvolutions are hard to validate, as we do not have ground truth spatially-resolved single cells.

[24]:
sq.im.process(img=img, layer="image", method="smooth")
sq.im.segment(
    img=img,
    layer="image_smooth",
    method="watershed",
    channel=0,
)

Let’s visualize the segmentation results for an inset

[25]:
inset_y = 1500
inset_x = 1700
inset_sy = 400
inset_sx = 500

fig, axs = plt.subplots(1, 3, figsize=(30, 10))
sc.pl.spatial(
    adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0], title=""
)
axs[0].set_title("Clusters", fontdict={"fontsize": 20})
sf = adata_st.uns["spatial"]["V1_Adult_Mouse_Brain_Coronal_Section_2"]["scalefactors"][
    "tissue_hires_scalef"
]
rect = mpl.patches.Rectangle(
    (inset_y * sf, inset_x * sf),
    width=inset_sx * sf,
    height=inset_sy * sf,
    ec="yellow",
    lw=4,
    fill=False,
)
axs[0].add_patch(rect)

axs[0].axes.xaxis.label.set_visible(False)
axs[0].axes.yaxis.label.set_visible(False)

axs[1].imshow(
    img["image"][inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx, 0, 0]
    / 65536,
    interpolation="none",
)
axs[1].grid(False)
axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_title("DAPI", fontdict={"fontsize": 20})

crop = img["segmented_watershed"][
    inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx
].values.squeeze(-1)
crop = skimage.segmentation.relabel_sequential(crop)[0]
cmap = plt.cm.plasma
cmap.set_under(color="black")
axs[2].imshow(crop, interpolation="none", cmap=cmap, vmin=0.001)
axs[2].grid(False)
axs[2].set_xticks([])
axs[2].set_yticks([])
axs[2].set_title("Nucleous segmentation", fontdict={"fontsize": 20});
_images/tutorial_sq_link_53_0.png

Comparison between DAPI and mask confirms the quality of the segmentation. We then need to extract some image features useful for the deconvolution task downstream. Specifically: - the number of unique segmentation objects (i.e. nuclei) under each spot. - the coordinates of the centroids of the segmentation object.

[26]:
# define image layer to use for segmentation
features_kwargs = {
    "segmentation": {
        "label_layer": "segmented_watershed",
        "props": ["label", "centroid"],
        "channels": [1, 2],
    }
}
# calculate segmentation features
sq.im.calculate_image_features(
    adata_st,
    img,
    layer="image",
    key_added="image_features",
    features_kwargs=features_kwargs,
    features="segmentation",
    mask_circle=True,
)
100%|██████████| 324/324 [01:19<00:00,  4.09/s]

We can visualize the total number of objects under each spot with scanpy.

[27]:
adata_st.obs["cell_count"] = adata_st.obsm["image_features"]["segmentation_label"]
sc.pl.spatial(adata_st, color=["cluster", "cell_count"], frameon=False)
_images/tutorial_sq_link_57_0.png

Deconvolution via alignment

The rationale for deconvolving with Tangram, is to constrain the number of mapped single cell profiles. This is different that most deconvolution method. Specifically, we set them equal to the number of segmented cells in the histology, in the following way: - We pass mode='constrained'. This adds a filter term to the loss function, and a boolean regularizer. - We set target_count equal to the total number of segmented cells. Tangram will look for the best target_count cells to align in space. - We pass a density_prior, containing the fraction of cells per voxel.

[28]:
ad_map = tg.map_cells_to_space(
    adata_sc,
    adata_st,
    mode="constrained",
    target_count=adata_st.obs.cell_count.sum(),
    density_prior=np.array(adata_st.obs.cell_count) / adata_st.obs.cell_count.sum(),
    num_epochs=1000,
#     device="cuda:0",
    device='cpu',
)
Score: 0.613, KL reg: 0.125, Count reg: 5724.304, Lambda f reg: 4490.422
Score: 0.698, KL reg: 0.012, Count reg: 1.051, Lambda f reg: 734.217
Score: 0.700, KL reg: 0.012, Count reg: 1.661, Lambda f reg: 243.458
Score: 0.701, KL reg: 0.012, Count reg: 0.286, Lambda f reg: 172.023
Score: 0.701, KL reg: 0.012, Count reg: 0.325, Lambda f reg: 143.205
Score: 0.701, KL reg: 0.012, Count reg: 0.129, Lambda f reg: 123.143
Score: 0.701, KL reg: 0.012, Count reg: 0.029, Lambda f reg: 107.319
Score: 0.701, KL reg: 0.012, Count reg: 0.530, Lambda f reg: 96.239
Score: 0.701, KL reg: 0.012, Count reg: 0.375, Lambda f reg: 90.205
Score: 0.701, KL reg: 0.012, Count reg: 0.081, Lambda f reg: 83.204

In the same way as before, we can plot cell type maps:

[29]:
tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list, perc=0.02)
_images/tutorial_sq_link_62_0.png

We validate mapping by inspecting the test transcriptome:

[30]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
tg.plot_auc(df_all_genes);
<Figure size 432x288 with 0 Axes>
_images/tutorial_sq_link_64_1.png

And here comes the key part, where we will use the results of the previous deconvolution steps. Previously, we computed the absolute numbers of unique segmentation objects under each spot, together with their centroids. Let’s extract them in the right format useful for Tangram. In the resulting dataframe, each row represents a single segmentation object (a cell). We also have the image coordinates as well as the unique centroid ID, which is a string that contains both the spot ID and a numerical index. Tangram provides a convenient function to export the mapping between spot ID and segmentation ID to adata.uns.

[31]:
tg.create_segment_cell_df(adata_st)
adata_st.uns["tangram_cell_segmentation"].head()
[31]:
spot_idx y x centroids
0 AAATGGCATGTCTTGT-1 5304.000000 731.000000 AAATGGCATGTCTTGT-1_0
1 AAATGGCATGTCTTGT-1 5320.947519 721.331554 AAATGGCATGTCTTGT-1_1
2 AAATGGCATGTCTTGT-1 5332.942342 717.447904 AAATGGCATGTCTTGT-1_2
3 AAATGGCATGTCTTGT-1 5348.865384 558.924248 AAATGGCATGTCTTGT-1_3
4 AAATGGCATGTCTTGT-1 5342.124989 567.208502 AAATGGCATGTCTTGT-1_4

We can use tangram.count_cell_annotation() to map cell types as result of the deconvolution step to putative segmentation ID.

[32]:
tg.count_cell_annotations(
    ad_map,
    adata_sc,
    adata_st,
    annotation="cell_subclass",
)
adata_st.obsm["tangram_ct_count"].head()
[32]:
x y cell_n centroids Pvalb L4 Vip L2/3 IT Lamp5 NP ... L5 PT Astro L6b Endo Peri Meis2 Macrophage CR VLMC SMC
AAATGGCATGTCTTGT-1 641 5393 13 [AAATGGCATGTCTTGT-1_0, AAATGGCATGTCTTGT-1_1, A... 1 0 1 0 0 0 ... 2 0 0 0 0 0 0 0 0 0
AACAACTGGTAGTTGC-1 4208 1672 16 [AACAACTGGTAGTTGC-1_0, AACAACTGGTAGTTGC-1_1, A... 1 0 4 0 2 1 ... 1 0 0 0 0 0 0 0 0 0
AACAGGAAATCGAATA-1 1117 5117 28 [AACAGGAAATCGAATA-1_0, AACAGGAAATCGAATA-1_1, A... 1 1 3 0 2 0 ... 0 0 1 0 1 1 0 0 0 0
AACCCAGAGACGGAGA-1 1101 1274 5 [AACCCAGAGACGGAGA-1_0, AACCCAGAGACGGAGA-1_1, A... 2 0 0 0 0 0 ... 0 1 0 0 0 0 1 0 0 0
AACCGTTGTGTTTGCT-1 399 4708 7 [AACCGTTGTGTTTGCT-1_0, AACCGTTGTGTTTGCT-1_1, A... 2 1 0 0 0 0 ... 1 0 0 2 0 0 0 0 0 1

5 rows × 27 columns

And finally export the results in a new AnnData object.

[33]:
adata_segment = tg.deconvolve_cell_annotations(adata_st)
adata_segment.obs.head()
[33]:
y x centroids cluster
0 5304.000000 731.000000 AAATGGCATGTCTTGT-1_0 Pvalb
1 5320.947519 721.331554 AAATGGCATGTCTTGT-1_1 Vip
2 5332.942342 717.447904 AAATGGCATGTCTTGT-1_2 Sst
3 5348.865384 558.924248 AAATGGCATGTCTTGT-1_3 L5 IT
4 5342.124989 567.208502 AAATGGCATGTCTTGT-1_4 L6 CT

Note that the AnnData object does not contain counts, but only cell type annotations, as results of the Tangram mapping. Nevertheless, it’s convenient to create such AnnData object for visualization purposes. Below you can appreciate how each dot is now not a Visium spot anymore, but a single unique segmentation object, with the mapped cell type.

[34]:
fig, ax = plt.subplots(1, 1, figsize=(20, 20))
sc.pl.spatial(
    adata_segment,
    color="cluster",
    size=0.4,
    show=False,
    frameon=False,
    alpha_img=0.2,
    legend_fontsize=20,
    ax=ax,
)
[34]:
[<AxesSubplot:title={'center':'cluster'}, xlabel='spatial1', ylabel='spatial2'>]
_images/tutorial_sq_link_72_1.png