Tutorial for spatial mapping using Tangram

by Ziqing Lu luz21@gene.com and Tommaso Biancalani biancalt@gene.com.
Last update: August 16th 2021

What is Tangram?

Tangram is a method for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. Tangram takes as input a single-cell dataset and a spatial dataset, collected from the same anatomical region/tissue type. Via integration, Tangram creates new spatial data by aligning the scRNAseq profiles in space. This allows to project every annotation in the scRNAseq (e.g. cell types, program usage) on space.

What do I use Tangram for?

The most common application of Tangram is to resolve cell types in space. Another usage is to correct gene expression from spatial data: as scRNA-seq data are less prone to dropout than (e.g.) Visium or Slide-seq, the “new” spatial data generated by Tangram resolve many more genes. As a result, we can visualize program usage in space, which can be used for ligand-receptor pair discovery or, more generally, cell-cell communication mechanisms. If cell segmentation is available, Tangram can be also used for deconvolution of spatial data. If your single cell are multimodal, Tangram can be used to spatially resolve other modalities, such as chromatin accessibility.

Frequently Asked Questions about Tangram

How is Tangram different, than all the other deconvolution/mapping method?

Validation. Most methods “validate” mappings by looking at known patterns or proportion of cell types. These are good sanity checks, but are hardly useful when mapping is used for discovery. In Tangram, mappings are validated by inspective the predictions of holdout genes (test transcriptome).

My scRNAseq/spatial data come from different samples. Can I still use Tangram?

Yes. There is a clever variation invented by Sten Linnarsson, which consists of mapping average cells of a certain cell type, rather than single cells. This method is much faster, and smooths out variation in biological signal from different samples via averaging. However, it requires annotated scRNA-seq, sacrifices resolving biological variability at single-cell level. To map this way, pass mode=cluster.

Does Tangram only work on mouse brain data?

No. The original manuscript focused on mouse brain data b/c was funded by BICCN. We subsequently used Tangram for mapping lung, kidney and cancer tissue. If mapping doesn’t work for your case, that is hardly due to the complexity of the tissue.

Why doesn’t Tangram have hypotheses on the underlying model?

Most models used in biology are probabilistic: they assume that data are generated according to a certain probability distribution, hence the hypothesis. But Tangram doesn’t work that way: the hypothesis is that scRNA-seq and spatial data are generated with the same process (i.e. same biology) regardless of the process.

Where do I learn more about Tangram?

Check out our documentation for learning more about the method, or our GitHub repo for the latest version of the code. Tangram has been released in :cite:`tangram`.

Setting up

Tangram is based on pytorch, scanpy and (optionally but highly-recommended) squidpy - this tutorial is designed to work with squidy. You can also check this tutorial, prior to integration with squidpy.

To run the notebook locally, create a conda environment as conda env create -f tangram_environment.yml using our YAML file.
This notebook is based on squidpy v1.1.0.

[1]:

import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
from anndata import AnnData
import pathlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import skimage
import seaborn as sns
import tangram as tg

sc.logging.print_header()
print(f"squidpy=={sq.__version__}")

%load_ext autoreload
%autoreload 2
%matplotlib inline

scanpy==1.8.1 anndata==0.7.6 umap==0.5.1 numpy==1.19.1 scipy==1.5.2 pandas==1.3.4 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.8 pynndescent==0.5.4
squidpy==1.1.2

Loading datasets

Load public data available in Squidpy, from mouse brain cortex. Single cell data are stored in adata_sc. Spatial data, in adata_st.

[2]:

adata_st = sq.datasets.visium_fluo_adata_crop()
adata_st = adata_st[
    adata_st.obs.cluster.isin([f"Cortex_{i}" for i in np.arange(1, 5)])
].copy()
img = sq.datasets.visium_fluo_image_crop()

adata_sc = sq.datasets.sc_mouse_cortex()

We subset the crop of the mouse brain to only contain clusters of the brain cortex. The pre-processed single cell dataset was taken from :cite:`tasic2018shared` and pre-processed with standard scanpy functions.

Let’s visualize both spatial and single-cell datasets.

[3]:

adata_st.obs

[3]:

	in_tissue	array_row	array_col	n_genes_by_counts	log1p_n_genes_by_counts	total_counts	log1p_total_counts	pct_counts_in_top_50_genes	pct_counts_in_top_100_genes	pct_counts_in_top_200_genes	pct_counts_in_top_500_genes	total_counts_MT	log1p_total_counts_MT	pct_counts_MT	n_counts	leiden	cluster
AAATGGCATGTCTTGT-1	1	13	69	5191	8.554874	15977.0	9.678968	20.629655	26.757213	34.743694	48.889028	0.0	0.0	0.0	15977.0	0	Cortex_1
AACAACTGGTAGTTGC-1	1	28	42	5252	8.566555	16649.0	9.720165	20.481711	26.277855	34.092138	48.201093	0.0	0.0	0.0	16649.0	0	Cortex_1
AACAGGAAATCGAATA-1	1	15	67	6320	8.751633	23375.0	10.059465	17.929412	23.850267	32.077005	45.963636	0.0	0.0	0.0	23375.0	0	Cortex_1
AACCCAGAGACGGAGA-1	1	15	39	3659	8.205218	9229.0	9.130215	25.939972	31.964460	39.885145	53.212699	0.0	0.0	0.0	9229.0	1	Cortex_2
AACCGTTGTGTTTGCT-1	1	12	64	4512	8.414717	12679.0	9.447782	21.839262	28.038489	36.209480	50.540263	0.0	0.0	0.0	12679.0	0	Cortex_1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTGGATTGGGTACCAC-1	1	17	55	4980	8.513386	15381.0	9.640953	21.038944	27.059359	35.114752	49.197061	0.0	0.0	0.0	15381.0	0	Cortex_1
TTGGCTCGCATGAGAC-1	1	23	37	4620	8.438366	13193.0	9.487517	20.609414	26.445842	34.063519	48.442356	0.0	0.0	0.0	13193.0	5	Cortex_3
TTGTATCACACAGAAT-1	1	12	74	6120	8.719481	21951.0	9.996614	18.199626	24.235798	32.440436	46.663022	0.0	0.0	0.0	21951.0	0	Cortex_1
TTGTGGCCCTGACAGT-1	1	18	60	4971	8.511577	14779.0	9.601030	21.381690	27.924758	36.213546	49.780093	0.0	0.0	0.0	14779.0	0	Cortex_1
TTGTTAGCAAATTCGA-1	1	22	42	4820	8.480737	14396.0	9.574775	20.595999	26.674076	34.655460	48.624618	0.0	0.0	0.0	14396.0	5	Cortex_3

324 rows × 17 columns

[4]:

fig, axs = plt.subplots(1, 2, figsize=(20, 5))
sc.pl.spatial(
    adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0]
)
sc.pl.umap(
    adata_sc, color="cell_subclass", size=10, frameon=False, show=False, ax=axs[1]
)
plt.tight_layout()

Tangram learns a spatial alignment of the single cell data by looking at a subset of genes, specified by the user, called the training genes. Training genes need to bear interesting signal and to be measured with high quality. Typically, we choose the training genes are 100-1000 differentially expressedx genes, stratified across cell types. Sometimes, we also use the entire transcriptome, or perform different mappings using different set of training genes to see how much the result change.

Tangram fits the scRNA-seq profiles on space using a custom loss function based on cosine similarity. The method is summarized in the sketch below:

Pre-processing

For this case, we use 1401 marker genes as training genes.

[5]:

sc.tl.rank_genes_groups(adata_sc, groupby="cell_subclass", use_raw=False)
markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).iloc[0:100, :]
markers = list(np.unique(markers_df.melt().value.values))
len(markers)

WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'

[5]:

We prepares the data using pp_adatas, which does the following: - Takes a list of genes from user via the genes argument. These genes are used as training genes. - Annotates training genes under the training_genes field, in uns dictionary, of each AnnData. - Ensure consistent gene order in the datasets (Tangram requires that the the \(j\)-th column in each matrix correspond to the same gene). - If the counts for a gene are all zeros in one of the datasets, the gene is removed from the training genes. - If a gene is not present in both datasets, the gene is removed from the training genes.

[6]:

tg.pp_adatas(adata_sc, adata_st, genes=markers)

INFO:root:1280 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:14785 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.

Two datasets contain 1280 training genes of the 1401 originally provided, as some training genes have been removed.

Find alignment

To find the optimal spatial alignment for scRNA-seq profiles, we use the map_cells_to_space function: - The function maps iteratively as specified by num_epochs. We typically interrupt mapping after the score plateaus. - The score measures the similarity between the gene expression of the mapped cells vs spatial data on the training genes. - The default mapping mode is mode='cells', which is recommended to run on a GPU. - Alternatively, one can specify mode='clusters' which averages the single cells beloning to the same cluster (pass annotations via cluster_label). This is faster, and is our chioce when scRNAseq and spatial data come from different specimens. - If you wish to run Tangram with a GPU, set device=cuda:0 otherwise use the set device=cpu. - density_prior specifies the cell density within each spatial voxel. Use uniform if the spatial voxels are at single cell resolution (ie MERFISH). The default value, rna_count_based, assumes that cell density is proportional to the number of RNA molecules.

[7]:

ad_map = tg.map_cells_to_space(adata_sc, adata_st,
    mode="cells",
#     mode="clusters",
#     cluster_label='cell_subclass',  # .obs field w cell types
    density_prior='rna_count_based',
    num_epochs=500,
    # device="cuda:0",
    device='cpu',
)

INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 1280 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.

Score: 0.613, KL reg: 0.001
Score: 0.733, KL reg: 0.000
Score: 0.736, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000

INFO:root:Saving results..

The mapping results are stored in the returned AnnData structure, saved as ad_map, structured as following: - The cell-by-spot matrix X contains the probability of cell i to be in spot j. - The obs dataframe contains the metadata of the single cells. - The var dataframe contains the metadata of the spatial data. - The uns dictionary contains a dataframe with various information about the training genes (saved as train_genes_df).

Cell type maps

To visualize cell types in space, we invoke project_cell_annotation to transfer the annotation from the mapping to space. We can then call plot_cell_annotation to visualize it. You can set the perc argument to set the range to the colormap, which would help remove outliers.

[8]:

ad_map

[8]:

AnnData object with n_obs × n_vars = 21697 × 324
    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
    var: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
    uns: 'train_genes_df', 'training_history'

[9]:

tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list,perc=0.02)

INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.

The first way to get a sense if mapping was successful is to look for known cell type patterns. To get a deeper sense, we can use the helper plot_training_scores which gives us four panels:

[10]:

tg.plot_training_scores(ad_map, bins=20, alpha=.5)

The first panel is a histogram of the simlarity scores for each training gene.
In the second panel, each dot is a training gene and we can observe the training score (y-axis) and the sparsity in the scRNA-seq data (x-axis) of each gene.
The third panel is similar to the second one, but contains the gene sparsity of the spatial data. Spatial data are usually more sparse than single cell data, a discrepancy which is often responsible for low quality mapping.
In the last panel, we show the training scores as a function of the difference in sparsity between the dataset. For genes with comparable sparsity, the mapped gene expression is very similar to that in the spatial data. However, if a gene is quite sparse in one dataset (typically, the spatial data) but not in other, the mapping score is lower. This occurs as Tangram cannot properly matched the gene pattern because of inconsistent amount of dropouts between the datasets.

Although the above plots give us a summary of scores at single-gene level, we would need to know which are the genes are mapped with low scores. These information are stored in the dataframe .uns['train_genes_df']; this is the dataframe used to build the four plots above.

[11]:

ad_map.uns['train_genes_df']

[11]:

	train_score	sparsity_sc	sparsity_sp	sparsity_diff
ppia	0.998200	0.000092	0.000000	-0.000092
ubb	0.997364	0.000092	0.000000	-0.000092
atp1b1	0.997066	0.014334	0.000000	-0.014334
tmsb4x	0.996971	0.002811	0.000000	-0.002811
ckb	0.996360	0.002765	0.000000	-0.002765
...	...	...	...	...
trpc5	0.194716	0.569203	0.981481	0.412278
cdyl2	0.189953	0.425911	0.981481	0.555570
cntnap5c	0.157349	0.608241	0.993827	0.385586
dlx1as	0.142076	0.587777	0.990741	0.402964
kcnh6	0.133051	0.379131	0.996914	0.617783

1280 rows × 4 columns

New spatial data via aligned single cells

If the mapping mode is 'cells', we can now generate the “new spatial data” using the mapped single cell: this is done via project_genes. The function accepts as input a mapping (adata_map) and corresponding single cell data (adata_sc). The result is a voxel-by-gene AnnData, formally similar to adata_st, but containing gene expression from the mapped single cell data rather than Visium. For downstream analysis, we always replace adata_st with the corresponding ad_ge.

[14]:

ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
ad_ge

[14]:

AnnData object with n_obs × n_vars = 324 × 36826
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'sparsity', 'is_training'
    uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'rank_genes_groups', 'training_genes', 'overlap_genes'

Let’s choose a few training genes mapped with low score, to try to understand why.

[15]:

genes = ['rragb', 'trim17', 'eno1b']
ad_map.uns['train_genes_df'].loc[genes]

[15]:

	train_score	sparsity_sc	sparsity_sp	sparsity_diff
rragb	0.358785	0.079919	0.867284	0.787365
trim17	0.201789	0.069641	0.959877	0.890236
eno1b	0.342446	0.022492	0.885802	0.863311

To visualize gene patterns, we use the helper plot_genes. This function accepts two voxel-by-gene AnnData: the actual spatial data (adata_measured), and a Tangram spatial prediction (adata_predicted). The function returns gene expression maps from the two spatial AnnData on the genes genes.

[16]:

tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

The above pictures explain the low training scores. Some genes are detected with very different levels of sparsity - typically they are much more sparse in the scRNAseq than in the spatial data. This is due to the fact that technologies like Visium are more prone to technical dropouts. Therefore, Tangram cannot find a good spatial alignment for these genes as the baseline signal is missing. However, so long as most training genes are measured with high quality, we can trust mapping and use Tangram prediction to correct gene expression. This is an imputation method which relies on entirely different premises than those in probabilistic models.

Another application is found by inspecting genes that are not detected in the spatial data, but are detected in the single cell data. They are removed before training with pp_adatas function, but Tangram can predict their expression.

[17]:

genes=['loc102633833', 'gm5700', 'gm8292']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

So far, we only inspected genes used to align the data (training genes), but the mapped single cell data, ad_ge contains the whole transcriptome. That includes more than 35k test genes.

[18]:

(ad_ge.var.is_training == False).sum()

[18]:

We can use plot_genes to inspect gene expression of test genes as well. Inspecting the test transcriptome is an essential to validate mapping. At the same time, we need to be careful that some prediction might disagree with spatial data because of the technical droputs.

It is convenient to compute the similarity scores of all genes, which can be done by compare_spatial_geneexp. This function accepts two spatial AnnDatas (ie voxel-by-gene), and returns a dataframe with simlarity scores for all genes. Training genes are flagged by the boolean field is_training. If we also pass single cell AnnData to compare_spatial_geneexp function like below, a dataframe with additional sparsity columns - sparsity_sc (single cell data sparsity) and sparsity_diff (spatial data sparsity - single cell data sparsity) will return. This is required if we want to call plot_test_scores function later with the returned datafrme from compare_spatial_geneexp function.

[19]:

df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
df_all_genes

[19]:

	score	is_training	sparsity_sp	sparsity_sc	sparsity_diff
snap25	0.998238	False	0.000000	0.014610	-0.014610
ppia	0.998200	True	0.000000	0.000092	-0.000092
gapdh	0.998200	False	0.000000	0.000968	-0.000968
calm1	0.997942	False	0.000000	0.000369	-0.000369
calm2	0.997779	False	0.000000	0.001751	-0.001751
...	...	...	...	...	...
6330420h09rik	0.000014	False	0.996914	0.998894	-0.001980
1810010k12rik	0.000014	False	0.996914	0.999585	-0.002672
cckar	0.000013	False	0.996914	0.999309	-0.002395
chil3	0.000013	False	0.996914	0.998894	-0.001980
cyp3a13	0.000013	False	0.996914	0.998479	-0.001565

14785 rows × 5 columns

The prediction on test genes can be graphically visualized using plot_auc:

[20]:

# sns.scatterplot(data=df_all_genes, x='score', y='sparsity_sp', hue='is_training', alpha=.5);  # for legacy
tg.plot_auc(df_all_genes);

<Figure size 432x288 with 0 Axes>

This above figure is the most important validation plot in *Tangram*. Each dot represents a gene; the x-axis indicates the score, and the y-axis the sparsity of that gene in the spatial data. Unsurprisingly, the genes predicted with low score represents very sparse genes in the spatial data, suggesting that the Tangram predictions correct expression in those genes. Note that curve observed above is typical of Tangram mappings: the area under that curve is the most reliable metric we use to evaluate mapping.

Let’s inspect a few predictions. Some of these genes are biologically sparse, but well predicted:

[21]:

genes=['tfap2b', 'zic4']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

Some non-sparse genes present petterns, that Tangram accentuates:

[22]:

genes = ['cd34', 'rasal1']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

Finally, some unannotated genes have unknown function. These genes are often hardly detected in spatial data but Tangram provides prediction:

[23]:

genes = ['gm33027', 'gm5431']
tg.plot_genes_sc(genes[:5], adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

For untargeted spatial technologies, like Visium and Slide-seq, a spatial voxel may contain more than one cells. In these cases, it might be useful to disentangle gene expression into single cells - a process called deconvolution. Deconvolution is a requested feature, and also hard to obtain accurately with computational methods. If your goal is to study co-localization of cell types, we recommend you work with the spatial cell type maps instead. If your aim is discovery of cell-cell communication mechanisms, we suggest you compute gene programs, then use project_cell_annotations to spatially visualize program usage. To proceed with deconvolution anyways, see below.

In order to deconvolve cells, Tangram needs to know how many cells are present in each voxel. This is achieved by segmenting the cells on the corresponding histology, which squidpy makes possible with two lines of code: - squidpy.im.process applies smoothing as a pre-processing step. - squidpy.im.segment computes segmentation masks with watershed algorithm.

Note that some technologies, like Slide-seq, currently do not allow staining the same slide of tissue on which genes are profiled. For these data, you can still attempt a deconvolution by estimating cell density in a rough way - often we just pass a uniform prior. Finally, note that deconvolutions are hard to validate, as we do not have ground truth spatially-resolved single cells.

[24]:

sq.im.process(img=img, layer="image", method="smooth")
sq.im.segment(
    img=img,
    layer="image_smooth",
    method="watershed",
    channel=0,
)

Let’s visualize the segmentation results for an inset

[25]:

inset_y = 1500
inset_x = 1700
inset_sy = 400
inset_sx = 500

fig, axs = plt.subplots(1, 3, figsize=(30, 10))
sc.pl.spatial(
    adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0], title=""
)
axs[0].set_title("Clusters", fontdict={"fontsize": 20})
sf = adata_st.uns["spatial"]["V1_Adult_Mouse_Brain_Coronal_Section_2"]["scalefactors"][
    "tissue_hires_scalef"
]
rect = mpl.patches.Rectangle(
    (inset_y * sf, inset_x * sf),
    width=inset_sx * sf,
    height=inset_sy * sf,
    ec="yellow",
    lw=4,
    fill=False,
)
axs[0].add_patch(rect)

axs[0].axes.xaxis.label.set_visible(False)
axs[0].axes.yaxis.label.set_visible(False)

axs[1].imshow(
    img["image"][inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx, 0, 0]
    / 65536,
    interpolation="none",
)
axs[1].grid(False)
axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_title("DAPI", fontdict={"fontsize": 20})

crop = img["segmented_watershed"][
    inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx
].values.squeeze(-1)
crop = skimage.segmentation.relabel_sequential(crop)[0]
cmap = plt.cm.plasma
cmap.set_under(color="black")
axs[2].imshow(crop, interpolation="none", cmap=cmap, vmin=0.001)
axs[2].grid(False)
axs[2].set_xticks([])
axs[2].set_yticks([])
axs[2].set_title("Nucleous segmentation", fontdict={"fontsize": 20});

Comparison between DAPI and mask confirms the quality of the segmentation. We then need to extract some image features useful for the deconvolution task downstream. Specifically: - the number of unique segmentation objects (i.e. nuclei) under each spot. - the coordinates of the centroids of the segmentation object.

[26]:

# define image layer to use for segmentation
features_kwargs = {
    "segmentation": {
        "label_layer": "segmented_watershed",
        "props": ["label", "centroid"],
        "channels": [1, 2],
    }
}
# calculate segmentation features
sq.im.calculate_image_features(
    adata_st,
    img,
    layer="image",
    key_added="image_features",
    features_kwargs=features_kwargs,
    features="segmentation",
    mask_circle=True,
)

100%|██████████| 324/324 [01:19<00:00,  4.09/s]

We can visualize the total number of objects under each spot with scanpy.

[27]:

adata_st.obs["cell_count"] = adata_st.obsm["image_features"]["segmentation_label"]
sc.pl.spatial(adata_st, color=["cluster", "cell_count"], frameon=False)

Deconvolution via alignment

The rationale for deconvolving with Tangram, is to constrain the number of mapped single cell profiles. This is different that most deconvolution method. Specifically, we set them equal to the number of segmented cells in the histology, in the following way: - We pass mode='constrained'. This adds a filter term to the loss function, and a boolean regularizer. - We set target_count equal to the total number of segmented cells. Tangram will look for the best target_count cells to align in space. - We pass a density_prior, containing the fraction of cells per voxel.

[28]:

ad_map = tg.map_cells_to_space(
    adata_sc,
    adata_st,
    mode="constrained",
    target_count=adata_st.obs.cell_count.sum(),
    density_prior=np.array(adata_st.obs.cell_count) / adata_st.obs.cell_count.sum(),
    num_epochs=1000,
#     device="cuda:0",
    device='cpu',
)

Score: 0.613, KL reg: 0.125, Count reg: 5724.304, Lambda f reg: 4490.422
Score: 0.698, KL reg: 0.012, Count reg: 1.051, Lambda f reg: 734.217
Score: 0.700, KL reg: 0.012, Count reg: 1.661, Lambda f reg: 243.458
Score: 0.701, KL reg: 0.012, Count reg: 0.286, Lambda f reg: 172.023
Score: 0.701, KL reg: 0.012, Count reg: 0.325, Lambda f reg: 143.205
Score: 0.701, KL reg: 0.012, Count reg: 0.129, Lambda f reg: 123.143
Score: 0.701, KL reg: 0.012, Count reg: 0.029, Lambda f reg: 107.319
Score: 0.701, KL reg: 0.012, Count reg: 0.530, Lambda f reg: 96.239
Score: 0.701, KL reg: 0.012, Count reg: 0.375, Lambda f reg: 90.205
Score: 0.701, KL reg: 0.012, Count reg: 0.081, Lambda f reg: 83.204

In the same way as before, we can plot cell type maps:

[29]:

tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list, perc=0.02)

We validate mapping by inspecting the test transcriptome:

[30]:

ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
tg.plot_auc(df_all_genes);

<Figure size 432x288 with 0 Axes>

And here comes the key part, where we will use the results of the previous deconvolution steps. Previously, we computed the absolute numbers of unique segmentation objects under each spot, together with their centroids. Let’s extract them in the right format useful for Tangram. In the resulting dataframe, each row represents a single segmentation object (a cell). We also have the image coordinates as well as the unique centroid ID, which is a string that contains both the spot ID and a numerical index. Tangram provides a convenient function to export the mapping between spot ID and segmentation ID to adata.uns.

[31]:

tg.create_segment_cell_df(adata_st)
adata_st.uns["tangram_cell_segmentation"].head()

[31]:

	spot_idx	y	x	centroids
0	AAATGGCATGTCTTGT-1	5304.000000	731.000000	AAATGGCATGTCTTGT-1_0
1	AAATGGCATGTCTTGT-1	5320.947519	721.331554	AAATGGCATGTCTTGT-1_1
2	AAATGGCATGTCTTGT-1	5332.942342	717.447904	AAATGGCATGTCTTGT-1_2
3	AAATGGCATGTCTTGT-1	5348.865384	558.924248	AAATGGCATGTCTTGT-1_3
4	AAATGGCATGTCTTGT-1	5342.124989	567.208502	AAATGGCATGTCTTGT-1_4

We can use tangram.count_cell_annotation() to map cell types as result of the deconvolution step to putative segmentation ID.

[32]:

tg.count_cell_annotations(
    ad_map,
    adata_sc,
    adata_st,
    annotation="cell_subclass",
)
adata_st.obsm["tangram_ct_count"].head()

[32]:

	x	y	cell_n	centroids	Pvalb	L4	Vip	Lamp5	NP	...	L5 PT	Astro	L6b	Endo	Peri	Meis2	Macrophage	SMC
AAATGGCATGTCTTGT-1	641	5393	13	[AAATGGCATGTCTTGT-1_0, AAATGGCATGTCTTGT-1_1, A...	1	0	1	0	0	...	2	0	0	0	0	0	0	0
AACAACTGGTAGTTGC-1	4208	1672	16	[AACAACTGGTAGTTGC-1_0, AACAACTGGTAGTTGC-1_1, A...	1	0	4	2	1	...	1	0	0	0	0	0	0	0
AACAGGAAATCGAATA-1	1117	5117	28	[AACAGGAAATCGAATA-1_0, AACAGGAAATCGAATA-1_1, A...	1	1	3	2	0	...	0	0	1	0	1	1	0	0
AACCCAGAGACGGAGA-1	1101	1274	5	[AACCCAGAGACGGAGA-1_0, AACCCAGAGACGGAGA-1_1, A...	2	0	0	0	0	...	0	1	0	0	0	0	1	0
AACCGTTGTGTTTGCT-1	399	4708	7	[AACCGTTGTGTTTGCT-1_0, AACCGTTGTGTTTGCT-1_1, A...	2	1	0	0	0	...	1	0	0	2	0	0	0	1

5 rows × 27 columns

And finally export the results in a new AnnData object.

[33]:

adata_segment = tg.deconvolve_cell_annotations(adata_st)
adata_segment.obs.head()

[33]:

	y	x	centroids	cluster
0	5304.000000	731.000000	AAATGGCATGTCTTGT-1_0	Pvalb
1	5320.947519	721.331554	AAATGGCATGTCTTGT-1_1	Vip
2	5332.942342	717.447904	AAATGGCATGTCTTGT-1_2	Sst
3	5348.865384	558.924248	AAATGGCATGTCTTGT-1_3	L5 IT
4	5342.124989	567.208502	AAATGGCATGTCTTGT-1_4	L6 CT

Note that the AnnData object does not contain counts, but only cell type annotations, as results of the Tangram mapping. Nevertheless, it’s convenient to create such AnnData object for visualization purposes. Below you can appreciate how each dot is now not a Visium spot anymore, but a single unique segmentation object, with the mapped cell type.

[34]:

fig, ax = plt.subplots(1, 1, figsize=(20, 20))
sc.pl.spatial(
    adata_segment,
    color="cluster",
    size=0.4,
    show=False,
    frameon=False,
    alpha_img=0.2,
    legend_fontsize=20,
    ax=ax,
)

[34]:

[<AxesSubplot:title={'center':'cluster'}, xlabel='spatial1', ylabel='spatial2'>]