TANGRAM¶
Tangram is a Python package, written in PyTorch and based on scanpy, for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. The single-cell dataset and the spatial dataset should be collected from the same anatomical region/tissue type, ideally from a biological replicate, and need to share a set of genes. Tangram aligns the single-cell data in space by fitting gene expression on the shared genes. The best way to familiarize yourself with Tangram is to check out our tutorials.

Tangram News¶
Citing Tangram¶
Tangram has been released in the following publication:
Biancalani* T., Scalia* G. et al. - *Deep learning and alignment of spatially-resolved whole transcriptomes of single cells in the mouse brain with Tangram*, bioRxiv 10.1101/2020.08.29.272831 (2020)
Release Notes¶
1.0.0 2021-08-06 - Initial Release
Getting Started¶
Installing Tangram¶
To install Tangram, make sure you have PyTorch and scanpy installed. If you need more details on the dependencies, look at the environment.yml file.
Install Tangram from shell:
pip install tangram-sc
Running Tangram¶
Cell Level¶
Make sure tangram-sc is installed (see Installing Tangram above).
Import tangram:
import tangram as tg
Then load your spatial data and your single cell data (which should be in AnnData format), and pre-process them using tg.pp_adatas:
ad_sp = sc.read_h5ad(path)
ad_sc = sc.read_h5ad(path)
tg.pp_adatas(ad_sc, ad_sp, genes=None)
The function pp_adatas finds the common genes between adata_sc and adata_sp, and saves them in the uns of both AnnDatas for later mapping and analysis. It also subsets the intersected genes to the set of training genes passed via genes. If genes=None, Tangram maps using all genes shared by the two datasets. Once the datasets are pre-processed, we can map:
ad_map = tg.map_cells_to_space(ad_sc, ad_sp)
The returned AnnData, ad_map, is a cell-by-voxel structure where ad_map.X[i, j] gives the probability for cell i to be in voxel j. This structure can be used to project gene expression from the single-cell data onto space, which is achieved via tg.project_genes:
ad_ge = tg.project_genes(ad_map, ad_sc)
The returned ad_ge is a voxel-by-gene AnnData, similar to the spatial data ad_sp, but where gene expression has been projected from the single cells. This allows extending gene throughput, or correcting for dropouts, when the single-cell data have higher quality (or more genes) than the spatial data. It can also be used to transfer cell types onto space.
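For example, cell type annotations can be transferred onto space from the mapping via tg.project_cell_annotations. A minimal sketch (the column name cell_type is a hypothetical annotation field in ad_sc.obs; as reported by the function's log, the spatial predictions are saved in ad_sp.obsm['tangram_ct_pred']):
# Sketch: transfer a hypothetical 'cell_type' annotation onto space
tg.project_cell_annotations(ad_map, ad_sp, annotation='cell_type')
# the per-voxel predictions end up in ad_sp.obsm['tangram_ct_pred']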
For more details on how to use Tangram check out our tutorial.
Cluster Level¶
To enable faster training and consume less memory, Tangram mapping can be done at the cell cluster level.
Prepare the input data in the same way as for cell-level mapping, then map using the following code:
ad_map = tg.map_cells_to_space(
ad_sc,
ad_sp,
mode='clusters',
cluster_label='subclass_label')
The provided cluster_label must be a column of ad_sc.obs. The example above maps at the subclass_label level, where subclass_label is a field in ad_sc.obs.
To project gene expression to space, use tg.project_genes and be sure to set the cluster_label argument to the same cluster label used for mapping:
ad_ge = tg.project_genes(
ad_map,
ad_sc,
cluster_label='subclass_label')
Tangram Under the Hood¶
Tangram instantiates a Mapper object passing the following arguments:
- S: single-cell matrix with shape cells-by-genes. Note that genes here is the number of training genes.
- G: spatial data matrix with shape voxels-by-genes. A voxel can contain multiple cells.
Then, Tangram searches for a mapping matrix M, with shape cells-by-voxels, where the element M_ij signifies the probability of cell i to be in voxel j. Tangram computes the matrix M by minimizing the following loss:
\[\mathcal{L}(M) = -\sum_{k=1}^{n_{\mathrm{genes}}} \mathrm{cos\_sim}\big((M^{T} S)_{:,k},\; G_{:,k}\big)\]
where cos_sim is the cosine similarity. The meaning of the loss function is that the gene expression of the mapped single cells should be as similar as possible to the spatial data G, in the cosine similarity sense.
The above accounts for basic Tangram usage. In our manuscript, we modified the loss function in several ways so as to add various kinds of prior knowledge, such as the number of cells contained in each voxel.
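As an illustration only (not the library's Mapper implementation), the basic objective can be sketched in NumPy as follows; the function name basic_tangram_loss and the softmax parametrization of M are assumptions of this sketch:
import numpy as np

def basic_tangram_loss(M_logits, S, G, eps=1e-8):
    # Illustrative sketch of the basic objective, not Tangram's actual code.
    # M_logits: (n_cells, n_voxels) unconstrained parameters
    # S: (n_cells, n_genes) single-cell expression on the training genes
    # G: (n_voxels, n_genes) spatial expression on the training genes
    # Row-wise softmax turns M into probabilities: M[i, j] ~ P(cell i in voxel j)
    M = np.exp(M_logits - M_logits.max(axis=1, keepdims=True))
    M = M / M.sum(axis=1, keepdims=True)
    G_pred = M.T @ S  # predicted spatial expression, voxels-by-genes
    num = (G_pred * G).sum(axis=0)  # per-gene dot products
    den = np.linalg.norm(G_pred, axis=0) * np.linalg.norm(G, axis=0) + eps
    return -(num / den).sum()  # negative sum of per-gene cosine similarities
In practice, such a loss is minimized with a gradient-based optimizer (Tangram itself is written in PyTorch).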
Classes¶
Tangram's code is organized in the following modules:
- Library for instantiating and running the optimizer for Tangram.
- Mapping helpers.
- This module includes plotting utility functions.
- Utility functions to pre- and post-process data for Tangram.
Frequently Asked Questions¶
Do I need a GPU for running Tangram?
A GPU is not required but is recommended. We run most of our mappings on a single P100 which maps ~50k cells in a few minutes.
How do I choose a list of training genes?
A good way to start is to use the top 1k unique marker genes, stratified across cell types, as training genes. Alternatively, you can map using the whole transcriptome. Ideally, training genes should contain high quality signals: if most training genes are rich in dropouts or obtained with bad RNA probes your mapping will not be accurate.
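A minimal sketch of this with scanpy, assuming the cell type annotations live in a hypothetical ad_sc.obs['cell_type'] column (the tutorial below does the same with the cell_subclass field):
import numpy as np
import pandas as pd
import scanpy as sc
import tangram as tg

# rank genes per cell type and keep the top 100 of each group as training genes
sc.tl.rank_genes_groups(ad_sc, groupby='cell_type', use_raw=False)
markers_df = pd.DataFrame(ad_sc.uns['rank_genes_groups']['names']).iloc[0:100, :]
markers = list(np.unique(markers_df.melt().value.values))
tg.pp_adatas(ad_sc, ad_sp, genes=markers)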
Do I need cell segmentation for mapping on Visium data?
You do not need to segment cells in your histology for mapping on spatial transcriptomics data (including Visium and Slide-seq). You do, however, need cell segmentation if you wish to deconvolve the data (i.e., deterministically assign a single cell profile to each cell within a spatial voxel).
I run out of memory when I map: what should I do?
Split your spatial data into parts and map each part separately. If that is not sufficient, you will need to downsample your single-cell data as well.
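A rough sketch of this strategy, assuming (as in the Slide-seq tutorial below) that the voxel coordinates are stored in ad_sp.obs['x'] and ad_sp.obs['y']:
import numpy as np
import tangram as tg

# split the spatial data into two halves along x and map each half separately
x_median = np.median(ad_sp.obs['x'])
parts = [ad_sp[ad_sp.obs['x'] <= x_median].copy(),
         ad_sp[ad_sp.obs['x'] > x_median].copy()]

maps = []
for ad_part in parts:
    tg.pp_adatas(ad_sc, ad_part, genes=None)  # recompute shared genes for this part
    maps.append(tg.map_cells_to_space(ad_sc, ad_part))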
Tutorials¶
Tutorial for mapping data with Tangram¶
by Tommaso Biancalani biancalt@gene.com and Ziqing Lu luz21@gene.com
The notebook introduces mapping of single-cell data onto spatial data using the Tangram method.
The notebook uses data from mouse brain cortex (different from those used in the manuscript).
Last changelog¶
June 13th - Tommaso Biancalani biancalt@gene.com
Installation¶
Make sure tangram-sc is installed via pip install tangram-sc. Otherwise, edit and uncomment the line starting with sys.path, specifying the Tangram folder.
The Python environment needs the packages listed in environment.yml.
[1]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import torch
#sys.path.append('/home/exx/git/Tangram/') # uncomment for local import
import tangram as tg
%load_ext autoreload
%autoreload 2
%matplotlib inline
tg.__version__
[1]:
'1.0.0'
Download the data¶
If you have wget installed, you can run the following code to automatically download and unzip the data.
[2]:
# Skip this cell if the data are already downloaded
!wget https://storage.googleapis.com/tommaso-brain-data/tangram_demo/mop_sn_tutorial.h5ad.gz -O data/mop_sn_tutorial.h5ad.gz
!wget https://storage.googleapis.com/tommaso-brain-data/tangram_demo/slideseq_MOp_1217.h5ad.gz -O data/slideseq_MOp_1217.h5ad.gz
!wget https://storage.googleapis.com/tommaso-brain-data/tangram_demo/MOp_markers.csv -O data/MOp_markers.csv
!gunzip -f data/mop_sn_tutorial.h5ad.gz
!gunzip -f data/slideseq_MOp_1217.h5ad.gz
--2021-08-30 14:22:56-- https://storage.googleapis.com/tommaso-brain-data/tangram_demo/mop_sn_tutorial.h5ad.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.14.208, 172.217.14.240, 142.250.69.208, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.14.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 474724402 (453M) [application/x-gzip]
Saving to: ‘data/mop_sn_tutorial.h5ad.gz’
data/mop_sn_tutoria 100%[===================>] 452.73M 136MB/s in 3.3s
2021-08-30 14:22:59 (136 MB/s) - ‘data/mop_sn_tutorial.h5ad.gz’ saved [474724402/474724402]
--2021-08-30 14:23:00-- https://storage.googleapis.com/tommaso-brain-data/tangram_demo/slideseq_MOp_1217.h5ad.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.33.112, 142.251.33.80, 142.250.217.112, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.33.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12614106 (12M) [application/x-gzip]
Saving to: ‘data/slideseq_MOp_1217.h5ad.gz’
data/slideseq_MOp_1 100%[===================>] 12.03M 65.4MB/s in 0.2s
2021-08-30 14:23:01 (65.4 MB/s) - ‘data/slideseq_MOp_1217.h5ad.gz’ saved [12614106/12614106]
--2021-08-30 14:23:01-- https://storage.googleapis.com/tommaso-brain-data/tangram_demo/MOp_markers.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.33.112, 142.251.33.80, 142.250.217.112, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.33.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2510 (2.5K) [text/csv]
Saving to: ‘data/MOp_markers.csv’
data/MOp_markers.cs 100%[===================>] 2.45K --.-KB/s in 0s
2021-08-30 14:23:02 (27.3 MB/s) - ‘data/MOp_markers.csv’ saved [2510/2510]
If you do not have wget installed, manually download the data from the links below:
- snRNA-seq dataset collected from adult mouse cortex: 10Xv3 MOp.
- For spatial data, we will use one coronal slice of Slide-seq2 data (adult mouse brain; MOp area).
- We will map them via a few hundred marker genes, found in the literature.
All datasets need to be unzipped: the resulting h5ad and csv files should be placed in the data folder.
Load spatial data¶
Spatial data need to be organized as a voxel-by-gene matrix. Here, Slide-seq data contains 9852 spatial voxels, in each of which there are 24518 genes measured.
[2]:
path = os.path.join('data', 'slideseq_MOp_1217.h5ad')
ad_sp = sc.read_h5ad(path)
ad_sp
[2]:
AnnData object with n_obs × n_vars = 9852 × 24518
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'x', 'y'
The voxel coordinates are saved in the fields obs.x and obs.y, which we can use to visualize the spatial ROI. Each “dot” is the center of a 10um voxel.
[3]:
xs = ad_sp.obs.x.values
ys = ad_sp.obs.y.values
plt.axis('off')
plt.scatter(xs, ys, s=.7);
plt.gca().invert_yaxis()

Single cell data¶
By single cell data, we generally mean either scRNAseq or snRNAseq.
We start by mapping the MOp 10Xv3 dataset, which contains single nuclei collected from a posterior region of the primary motor cortex.
There are approximately 26k profiled cells with 28k genes.
[4]:
path = os.path.join('data','mop_sn_tutorial.h5ad')
ad_sc = sc.read_h5ad(path)
ad_sc
[4]:
AnnData object with n_obs × n_vars = 26431 × 27742
obs: 'QC', 'batch', 'class_color', 'class_id', 'class_label', 'cluster_color', 'cluster_labels', 'dataset', 'date', 'ident', 'individual', 'nCount_RNA', 'nFeature_RNA', 'nGene', 'nUMI', 'project', 'region', 'species', 'subclass_id', 'subclass_label'
layers: 'logcounts'
Usually, we work with data in raw count form, especially if the spatial data are in raw count form as well.
If the data are in integer format, that probably means they are in raw count.
[5]:
np.unique(ad_sc.X.toarray()[0, :])
[5]:
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 33.,
34., 36., 39., 40., 43., 44., 46., 47., 49., 50., 53.,
56., 57., 58., 62., 68., 69., 73., 77., 80., 85., 86.,
98., 104., 105., 118., 121., 126., 613.], dtype=float32)
Here, we only do some light pre-processing, namely library-size correction (in scanpy, via sc.pp.normalize_total) to normalize the number of counts within each cell to a fixed number. Sometimes we apply more sophisticated pre-processing methods, for example for batch correction, although mapping works well with raw data.
Ideally, the single-cell and spatial datasets should exhibit signals as similar as possible, and the pre-processing pipeline should be designed to harmonize them.
[6]:
sc.pp.normalize_total(ad_sc)
It is a good idea to have annotations in the single-cell data, as they will be projected onto space after we map.
In this case, cell types are annotated in the subclass_label field, for which we plot cell counts. Note that cell type proportions should be similar in the two datasets: for example, if Meis is a rare cell type in the snRNA-seq, it is expected to be rare in the spatial data as well.
[7]:
ad_sc.obs.subclass_label.value_counts()
[7]:
L5 IT 5623
Oligo 4330
L2/3 IT 3555
L6 CT 3118
Astro 2600
Micro-PVM 1121
Pvalb 972
L6 IT 919
L5 ET 903
L5/6 NP 649
Sst 627
Vip 435
L6b 361
Endo 357
Lamp5 332
VLMC 248
Peri 187
Sncg 94
Name: subclass_label, dtype: int64
Prepare to map¶
Tangram learns a spatial alignment of the single cell data so that the gene expression of the aligned single cell data is as similar as possible to that of the spatial data.
In doing this, Tangram only looks at a subset of genes, specified by the user, called the training genes.
The choice of the training genes is a delicate step for mapping: they need to bear interesting signals and to be measured with high quality.
Typically, a good start is to choose 100-1000 top marker genes, evenly stratified across cell types. Sometimes, we also use the entire transcriptome, or perform different mappings using different sets of training genes to see how much the results change.
For this case, we choose 253 marker genes of the MOp area which were curated in a different study.
[8]:
df_genes = pd.read_csv('data/MOp_markers.csv', index_col=0)
markers = np.reshape(df_genes.values, (-1, ))
markers = list(markers)
len(markers)
[8]:
253
We now need to prepare the datasets for mapping by creating a training_genes field in the uns dictionary of the two AnnData structures.
This training_genes field contains the subset of genes used as training genes, and will be used later inside the mapping function to create the training datasets.
Also, the gene order needs to be the same in the two datasets: Tangram maps using only gene expression, so the \(j\)-th column in each matrix must correspond to the same gene.
Finally, if the data entries of a gene are all zero, the gene is removed.
These tasks are performed by the helper pp_adatas.
[9]:
tg.pp_adatas(ad_sc, ad_sp, genes=markers)
INFO:root:249 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:18000 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
You will notice that the two datasets now contain 249 training genes, although 253 markers were provided. This is because the marker genes need to be shared by both datasets: if a gene is missing from one of them, pp_adatas simply takes it out.
Finally, the assert line below is a good way to ensure that the genes in the training_genes field in uns are identically ordered in both AnnDatas.
[10]:
ad_sc
[10]:
AnnData object with n_obs × n_vars = 26431 × 26496
obs: 'QC', 'batch', 'class_color', 'class_id', 'class_label', 'cluster_color', 'cluster_labels', 'dataset', 'date', 'ident', 'individual', 'nCount_RNA', 'nFeature_RNA', 'nGene', 'nUMI', 'project', 'region', 'species', 'subclass_id', 'subclass_label'
var: 'n_cells'
uns: 'training_genes', 'overlap_genes'
layers: 'logcounts'
[11]:
ad_sp
[11]:
AnnData object with n_obs × n_vars = 9852 × 20864
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'x', 'y', 'uniform_density', 'rna_count_based_density'
var: 'n_cells'
uns: 'training_genes', 'overlap_genes'
[12]:
assert ad_sc.uns['training_genes'] == ad_sp.uns['training_genes']
Map¶
We can now train the model (i.e., map the single-cell data onto space).
Mapping should be interrupted after the score plateaus, which can be controlled by passing the num_epochs parameter. The score measures the similarity between the gene expression of the mapped cells and the spatial data: a higher score means a better mapping.
Note that we can obtain excellent mappings even when Tangram converges to a low score (the typical case is when the spatial data are very sparse): we use the score merely to assess convergence.
If you are running Tangram with a GPU, uncomment device=cuda:0 and comment the line device=cpu. On a MacBook Pro 2018, it takes ~1 h to run. On a P100 GPU it should be done in a few minutes.
For this basic mapping, we do not use regularizers. More sophisticated loss functions can be used through the Tangram library (refer to the manuscript or dive into the code). For example, you can pass a density_prior with the hyperparameter lambda_d to regularize the spatial density of cells. Currently uniform, rna_count_based and customized input arrays are supported for the density_prior argument.
Instead of mapping single cells, we can “average” the cells within a cluster and map the averaged cells instead, which dramatically improves performance. This suggestion was proposed by Sten Linnarsson. To activate this mode, select mode='clusters' and pass the annotation field to cluster_label.
[13]:
ad_map = tg.map_cells_to_space(
adata_sc=ad_sc,
adata_sp=ad_sp,
device='cpu',
# device='cuda:0',
)
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 249 genes and None density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.
Score: 0.103
Score: 0.802
Score: 0.819
Score: 0.822
Score: 0.824
Score: 0.825
Score: 0.826
Score: 0.826
Score: 0.827
Score: 0.827
INFO:root:Saving results..
The mapping results are stored in the returned AnnData structure, saved as ad_map, structured as follows:
- The cell-by-spot matrix X contains the probability of cell \(i\) to be in spot \(j\).
- The obs dataframe contains the metadata of the single cells.
- The var dataframe contains the metadata of the spatial data.
- The uns dictionary contains a dataframe with various information about the training genes (saved as train_genes_df).
We can now save the mapping results for post-analysis.
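For instance, a minimal way to do this (the file name is arbitrary):
# ad_map is an AnnData, so it can be written to disk like any other AnnData
ad_map.write_h5ad('ad_map.h5ad')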
Analysis¶
The most common application for mapping single cell data onto space is to transfer the cell type annotations onto space.
This is done via plot_cell_annotation, which visualizes spatial probability maps of the annotation in the obs dataframe (here, the subclass_label field). You can set the robust argument to True and, at the same time, set the perc argument to choose the range of the colormap, which helps remove outliers.
The following plots recover the cortical layers of excitatory neurons and the sparse patterns of glial cells. The boundaries of the cortex are delineated by layer 6b (cell type L6b), and oligodendrocytes are concentrated in the sub-cortical region, as expected.
Yet, the VLMC cell type pattern does not seem correct: VLMC cells should cluster in the first cortical layer, whereas here they are sparse across the ROI. This usually means that either (1) we have not used good marker genes for VLMC cells among our training genes, or (2) the available marker genes are very sparse in the spatial data and therefore do not carry a good mapping signal.
[15]:
tg.plot_cell_annotation(ad_map, ad_sp, annotation='subclass_label', nrows=5, ncols=4, robust=True, perc=0.05)
INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.


Let's try to get a deeper sense of how good this mapping is. A good helper is plot_training_scores, which gives us four panels:
The first panel is a histogram of the similarity scores of the training genes. Most genes are mapped with very high similarity (> .9), although a few of them have scores around .5. We would like to understand why the score is lower for these genes.
The second panel shows that there is a neat correlation between the training score of a gene (y-axis) and the sparsity of that gene in the snRNA-seq data (x-axis). Each dot is a training gene. The trend is that the sparser the gene, the higher the score: this usually happens because very sparse genes are easier to map, as their pattern is matched by placing a few “jackpot cells” in the right spots.
The third panel is similar to the second one, but shows the gene sparsity of the spatial data. Spatial data are usually more sparse than single-cell data, a discrepancy which is often responsible for low-quality mapping.
In the last panel, we show the training scores as a function of the difference in sparsity between the datasets. For genes with comparable sparsity, the mapped gene expression is very similar to that in the spatial data. However, if a gene is quite sparse in one dataset (typically the spatial data) but not in the other, the mapping score is lower. This occurs because Tangram cannot properly match the gene pattern due to the inconsistent amount of dropouts between the datasets.
[16]:
tg.plot_training_scores(ad_map, bins=10, alpha=.5)

Although the above plots give us a summary of scores at the single-gene level, we would like to know which genes are mapped with low scores. This information can be accessed from the dataframe .uns['train_genes_df'] of the mapping results; this is the dataframe used to build the four plots above.
[17]:
ad_map.uns['train_genes_df']
[17]:
gene | train_score | sparsity_sc | sparsity_sp | sparsity_diff
---|---|---|---|---
igf2 | 0.999628 | 0.999924 | 0.994011 | -0.005913 |
chodl | 0.997236 | 0.999016 | 0.999086 | 0.000070 |
5031425f14rik | 0.995950 | 0.998789 | 0.999594 | 0.000805 |
car3 | 0.995266 | 0.999016 | 0.999695 | 0.000679 |
scgn | 0.994697 | 0.999205 | 0.999898 | 0.000693 |
... | ... | ... | ... | ... |
5730522e02rik | 0.514366 | 0.400401 | 0.984572 | 0.584171 |
rgs6 | 0.508304 | 0.305172 | 0.941941 | 0.636769 |
ptprk | 0.507358 | 0.357800 | 0.974218 | 0.616419 |
satb2 | 0.495069 | 0.455904 | 0.969549 | 0.513645 |
cdh12 | 0.464210 | 0.384889 | 0.972594 | 0.587705 |
249 rows × 4 columns
We want to inspect the gene expression of training genes mapped with low scores, to understand the quality of the mapping.
First, we need to generate the “new spatial data” using the mapped single cells: this is done via project_genes.
The function accepts as input a mapping (adata_map) and the corresponding single-cell data (adata_sc).
The result is a voxel-by-gene AnnData, formally similar to ad_sp, but containing gene expression from the mapped single-cell data rather than from Slide-seq.
[18]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=ad_sc)
ad_ge
[18]:
AnnData object with n_obs × n_vars = 9852 × 26496
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'x', 'y', 'uniform_density', 'rna_count_based_density'
var: 'n_cells', 'sparsity', 'is_training'
uns: 'training_genes', 'overlap_genes'
We now choose a few training genes mapped with low scores.
[19]:
genes = ['rgs6', 'satb2', 'cdh12']
ad_map.uns['train_genes_df'].loc[genes]
[19]:
gene | train_score | sparsity_sc | sparsity_sp | sparsity_diff
---|---|---|---|---
rgs6 | 0.508304 | 0.305172 | 0.941941 | 0.636769 |
satb2 | 0.495069 | 0.455904 | 0.969549 | 0.513645 |
cdh12 | 0.464210 | 0.384889 | 0.972594 | 0.587705 |
To visualize gene patterns, we use the helper plot_genes. This function accepts two voxel-by-gene AnnDatas: the actual spatial data (adata_measured) and a Tangram spatial prediction (adata_predicted). The function plots gene expression maps from the two spatial AnnDatas for the genes in genes.
As expected, the predicted gene expression is less sparse, although the main patterns are captured. For these genes, we trust the mapped gene patterns more, as Tangram “corrects” gene expression by aligning less sparse data in space.
[23]:
tg.plot_genes(genes, adata_measured=ad_sp, adata_predicted=ad_ge, robust=True, perc=0.02)


An even stronger example is found in genes that are not detected in the spatial data but are detected in the single-cell data. They are removed before training by the pp_adatas function, but Tangram can still generate insight into what their spatial patterns look like.
[24]:
genes=['mrgprx2', 'muc20', 'chrna2']
tg.plot_genes(genes, adata_measured=ad_sp, adata_predicted=ad_ge, robust=True, perc=0.02)


So far, we have only inspected genes used to align the data (training genes), but the mapped single-cell data ad_ge contain the whole transcriptome. That includes more than 26k test genes.
[25]:
(ad_ge.var.is_training == False).sum()
[25]:
26247
We can use plot_genes to inspect the gene expression of non-training genes. This is an essential step, as prediction of gene expression is how we validate the mapping.
Before doing that, it is convenient to compute the similarity scores of all genes, which can be done with compare_spatial_geneexp. This function accepts two spatial AnnDatas (i.e., voxel-by-gene) and returns a dataframe with similarity scores for all genes. Training genes are flagged by the Boolean field is_training.
If we also pass the single-cell AnnData to compare_spatial_geneexp, as below, the returned dataframe includes additional sparsity columns: sparsity_sc (single-cell data sparsity) and sparsity_diff (spatial data sparsity minus single-cell data sparsity). This is required if we later want to call plot_test_scores with the dataframe returned by compare_spatial_geneexp.
[26]:
df_all_genes = tg.compare_spatial_geneexp(ad_ge, ad_sp, ad_sc)
df_all_genes
[26]:
gene | score | is_training | sparsity_sp | sparsity_sc | sparsity_diff
---|---|---|---|---|---
igf2 | 9.996283e-01 | True | 0.994011 | 0.999924 | -0.005913 |
chodl | 9.972357e-01 | True | 0.999086 | 0.999016 | 0.000070 |
5031425f14rik | 9.959502e-01 | True | 0.999594 | 0.998789 | 0.000805 |
car3 | 9.952664e-01 | True | 0.999695 | 0.999016 | 0.000679 |
scgn | 9.946969e-01 | True | 0.999898 | 0.999205 | 0.000693 |
... | ... | ... | ... | ... | ... |
cyp1a2 | 2.775963e-08 | False | 0.999898 | 0.999962 | -0.000064 |
ccdc185 | 1.200085e-08 | False | 0.999898 | 0.999962 | -0.000064 |
alox12e | 9.068886e-09 | False | 0.999898 | 0.999962 | -0.000064 |
cd69 | 4.023093e-09 | False | 0.999898 | 0.999962 | -0.000064 |
ms4a15 | 1.751993e-09 | False | 0.999898 | 0.999962 | -0.000064 |
18000 rows × 5 columns
The plot below gives us a summary of scores at the single-gene level for the test genes.
[27]:
tg.plot_test_scores(df_all_genes)

Let’s plot the scores of the test genes and see how they compare to the training genes. Following the strategy in the previous plots, we visualize the scores as a function of the sparsity of the spatial data.
(We have not wrapped this call into a function yet).
[28]:
sns.scatterplot(data=df_all_genes, x='score', y='sparsity_sp', hue='is_training', alpha=.5);

Again, sparser genes in the spatial data are predicted with low scores, which is due to the presence of dropouts in the spatial data.
Let's choose a few test genes with varied scores and compare predictions to measured gene expression.
[29]:
genes = ['snap25', 'atp1b1', 'atp1a3', 'ctgf', 'nefh', 'aak1', 'fa2h', ]
df_all_genes.loc[genes]
[29]:
gene | score | is_training | sparsity_sp | sparsity_sc | sparsity_diff
---|---|---|---|---|---
snap25 | 0.812584 | False | 0.402253 | 0.120048 | 0.282205 |
atp1b1 | 0.770782 | False | 0.579984 | 0.175778 | 0.404205 |
atp1a3 | 0.697557 | False | 0.658343 | 0.319587 | 0.338757 |
ctgf | 0.577751 | False | 0.981628 | 0.948243 | 0.033386 |
nefh | 0.522629 | False | 0.909156 | 0.916083 | -0.006928 |
aak1 | 0.492562 | False | 0.868047 | 0.179713 | 0.688334 |
fa2h | 0.342086 | False | 0.972493 | 0.860845 | 0.111648 |
We can use plot_genes again to visualize gene patterns.
Interestingly, the agreement for genes Atp1b1 and Atp1a3 seems less good than that for Ctgf and Nefh, even though the scores are higher for the former genes. This is because, even though the latter gene patterns are localized correctly, their expression values are not as well correlated (for instance, in Ctgf the “bright yellow spot” is in a different part of layer 6b). In contrast, for Atp1b1 the gene expression pattern is largely recovered, even though the overall gene expression in the spatial data is dimmer.
[30]:
tg.plot_genes(genes, adata_measured=ad_sp, adata_predicted=ad_ge, robust=True, perc=0.02)


Leave-One-Out Cross Validation (LOOCV)¶
If the number of genes is small, leave-one-out cross validation (LOOCV) is supported in Tangram to evaluate mapping performance.
LOOCV in Tangram works as follows:
- Assume the dataset contains N genes.
- LOOCV iterates over the genes and maps the dataset N times.
- Each time, it holds out one gene as the test gene and trains on the remaining N-1 genes.
- After all trainings are done, the average test/train score is computed to evaluate the mapping performance.
Here, all the genes are the training genes from the example above, and we demo LOOCV mapping at the cluster level. Restart the kernel, load the single-cell, spatial and gene-marker data, and run pp_adatas to prepare the data for mapping.
[ ]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import torch
import tangram as tg
[ ]:
path = os.path.join('data', 'slideseq_MOp_1217.h5ad')
ad_sp = sc.read_h5ad(path)
path = os.path.join('data','mop_sn_tutorial.h5ad')
ad_sc = sc.read_h5ad(path)
sc.pp.normalize_total(ad_sc)
df_genes = pd.read_csv('data/MOp_markers.csv', index_col=0)
markers = np.reshape(df_genes.values, (-1, ))
markers = list(markers)
tg.pp_adatas(ad_sc, ad_sp, genes=markers)
INFO:root:249 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:18000 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
[ ]:
cv_dict, ad_ge_cv, df = tg.cross_val(ad_sc,
ad_sp,
device='cuda:0',
mode='clusters',
cv_mode='loo',
num_epochs=1000,
cluster_label='subclass_label',
return_gene_pred=True,
verbose=False,
)
100%|██████████| 249/249 [23:21<00:00, 5.63s/it]
cv avg test score 0.185
cv avg train score 0.296
In LOOCV mode, the cross_val function returns cv_dict, ad_ge_cv and df_test_genes. cv_dict contains the average scores for cross validation, ad_ge_cv stores the predicted expression value for each gene, and df_test_genes contains the score and sparsity of each test gene.
[ ]:
cv_dict
{'avg_test_score': 0.18502994, 'avg_train_score': 0.2960305045168084}
We can use plot_test_scores to display an overview of the cross-validation test scores of each gene vs. sparsity.
[ ]:
tg.plot_test_scores(df, bins=10, alpha=.7)

Now, let's compare the ground truth and the cross-validation-predicted spatial patterns of a few genes by calling the function plot_genes.
[ ]:
ad_ge_cv.var.sort_values(by='test_score', ascending=False)
gene | test_score
---|---
gad1 | 0.612824 |
gad2 | 0.538229 |
slc17a7 | 0.507557 |
vtn | 0.503726 |
pvalb | 0.498350 |
... | ... |
5031425f14rik | 0.015658 |
prok2 | 0.008798 |
teddm3 | 0.003808 |
scgn | 0.002912 |
dnase1l3 | 0.000634 |
249 rows × 1 columns
[ ]:
ranked_genes = list(ad_ge_cv.var.sort_values(by='test_score', ascending=False).index.values)
top_genes = ranked_genes[:3]
bottom_genes = ranked_genes[-3:]
[ ]:
tg.plot_genes(genes=top_genes, adata_measured=ad_sp, adata_predicted=ad_ge_cv, s=10, robust=True, perc=0.02)


[ ]:
tg.plot_genes(genes=bottom_genes, adata_measured=ad_sp, adata_predicted=ad_ge_cv, s=10, robust=True, perc=0.02)


Tutorial for spatial mapping using Tangram¶
by Ziqing Lu luz21@gene.com and Tommaso Biancalani biancalt@gene.com.
Last update: August 16th 2021
What is Tangram?¶
Tangram is a method for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. Tangram takes as input a single-cell dataset and a spatial dataset, collected from the same anatomical region/tissue type. Via integration, Tangram creates new spatial data by aligning the scRNA-seq profiles in space. This allows projecting every annotation in the scRNA-seq data (e.g., cell types, program usage) onto space.
What do I use Tangram for?¶
The most common application of Tangram is to resolve cell types in space. Another usage is to correct gene expression from spatial data: as scRNA-seq data are less prone to dropouts than (e.g.) Visium or Slide-seq, the “new” spatial data generated by Tangram resolve many more genes. As a result, we can visualize program usage in space, which can be used for ligand-receptor pair discovery or, more generally, cell-cell communication mechanisms. If cell segmentation is available, Tangram can also be used for deconvolution of spatial data. If your single cells are multimodal, Tangram can be used to spatially resolve other modalities, such as chromatin accessibility.
Frequently Asked Questions about Tangram¶
How is Tangram different from all the other deconvolution/mapping methods?¶
Validation. Most methods “validate” mappings by looking at known patterns or proportions of cell types. These are good sanity checks, but are hardly useful when mapping is used for discovery. In Tangram, mappings are validated by inspecting the predictions on held-out genes (the test transcriptome).
My scRNAseq/spatial data come from different samples. Can I still use Tangram?¶
Yes. There is a clever variation invented by Sten Linnarsson, which consists of mapping average cells of a certain cell type, rather than single cells. This method is much faster, and smooths out variation in biological signal from different samples via averaging. However, it requires annotated scRNA-seq data and sacrifices the ability to resolve biological variability at the single-cell level. To map this way, pass mode='clusters'.
Does Tangram only work on mouse brain data?¶
No. The original manuscript focused on mouse brain data because the work was funded by the BICCN. We subsequently used Tangram for mapping lung, kidney and cancer tissue. If mapping doesn't work for your case, that is hardly due to the complexity of the tissue.
Why doesn’t Tangram have hypotheses on the underlying model?¶
Most models used in biology are probabilistic: they assume that data are generated according to a certain probability distribution, hence the hypothesis. But Tangram doesn't work that way: the only hypothesis is that the scRNA-seq and the spatial data are generated by the same process (i.e., the same biology), whatever that process is.
Where do I learn more about Tangram?¶
Check out our documentation for learning more about the method, or our GitHub repo for the latest version of the code. Tangram has been released in :cite:`tangram`.
Setting up¶
Tangram is based on PyTorch, scanpy and (optionally but highly recommended) squidpy - this tutorial is designed to work with squidpy. You can also check the previous tutorial, written prior to the integration with squidpy.
To run the notebook locally, create a conda environment with conda env create -f tangram_environment.yml using our YAML file. This notebook is based on squidpy v1.1.0.
[2]:
import scanpy as sc
import squidpy as sq
import numpy as np
import pandas as pd
from anndata import AnnData
import pathlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import skimage
import seaborn as sns
import tangram as tg
sc.logging.print_header()
print(f"squidpy=={sq.__version__}")
%load_ext autoreload
%autoreload 2
%matplotlib inline
scanpy==1.8.1 anndata==0.7.6 umap==0.5.1 numpy==1.20.0 scipy==1.5.2 pandas==1.2.0 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.6 pynndescent==0.5.4
squidpy==1.1.0
Loading datasets¶
Load public data available in squidpy, from mouse brain cortex. Single-cell data are stored in adata_sc; spatial data in adata_st.
[3]:
adata_st = sq.datasets.visium_fluo_adata_crop()
adata_st = adata_st[
adata_st.obs.cluster.isin([f"Cortex_{i}" for i in np.arange(1, 5)])
].copy()
img = sq.datasets.visium_fluo_image_crop()
adata_sc = sq.datasets.sc_mouse_cortex()
We subset the crop of the mouse brain to only contain clusters of the brain cortex. The pre-processed single cell dataset was taken from :cite:`tasic2018shared` and pre-processed with standard scanpy functions.
Let’s visualize both spatial and single-cell datasets.
[5]:
fig, axs = plt.subplots(1, 2, figsize=(20, 5))
sc.pl.spatial(
adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0]
)
sc.pl.umap(
adata_sc, color="cell_subclass", size=10, frameon=False, show=False, ax=axs[1]
)
plt.tight_layout()

Tangram learns a spatial alignment of the single-cell data by looking at a subset of genes, specified by the user, called the training genes. Training genes need to bear interesting signal and to be measured with high quality. Typically, we choose as training genes 100-1000 differentially expressed genes, stratified across cell types. Sometimes, we also use the entire transcriptome, or perform different mappings using different sets of training genes to see how much the results change.
Tangram fits the scRNA-seq profiles on space using a custom loss function based on cosine similarity. The method is summarized in the sketch below:
Pre-processing¶
For this case, we use 1401 marker genes as training genes.
[6]:
sc.tl.rank_genes_groups(adata_sc, groupby="cell_subclass", use_raw=False)
markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).iloc[0:100, :]
markers = list(np.unique(markers_df.melt().value.values))
len(markers)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
[6]:
1401
We prepare the data using pp_adatas, which does the following:
- Takes a list of genes from the user via the genes argument. These genes are used as training genes.
- Annotates the training genes under the training_genes field, in the uns dictionary, of each AnnData.
- Ensures consistent gene order in the datasets (Tangram requires that the \(j\)-th column in each matrix corresponds to the same gene).
- If the counts for a gene are all zeros in one of the datasets, the gene is removed from the training genes.
- If a gene is not present in both datasets, the gene is removed from the training genes.
[7]:
tg.pp_adatas(adata_sc, adata_st, genes=markers)
INFO:root:1280 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:14785 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
The two datasets now contain 1280 of the 1401 training genes originally provided, as some training genes have been removed.
Find alignment¶
To find the optimal spatial alignment for the scRNA-seq profiles, we use the map_cells_to_space function:
- The function maps iteratively as specified by num_epochs. We typically interrupt mapping after the score plateaus.
- The score measures the similarity between the gene expression of the mapped cells and the spatial data on the training genes.
- The default mapping mode is mode='cells', which is recommended to run on a GPU.
- Alternatively, one can specify mode='clusters', which averages the single cells belonging to the same cluster (pass annotations via cluster_label). This is faster, and is our choice when scRNA-seq and spatial data come from different specimens.
- If you wish to run Tangram with a GPU, set device=cuda:0, otherwise set device=cpu.
- density_prior specifies the cell density within each spatial voxel. Use uniform if the spatial voxels are at single-cell resolution (e.g., MERFISH). The default value, rna_count_based, assumes that cell density is proportional to the number of RNA molecules.
[8]:
ad_map = tg.map_cells_to_space(adata_sc, adata_st,
mode="cells",
# mode="clusters",
# cluster_label='cell_subclass', # .obs field w cell types
density_prior='rna_count_based',
num_epochs=500,
# device="cuda:0",
device='cpu',
)
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 1280 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.
Score: 0.613, KL reg: 0.001
Score: 0.733, KL reg: 0.000
Score: 0.736, KL reg: 0.000
Score: 0.737, KL reg: 0.000
Score: 0.737, KL reg: 0.000
INFO:root:Saving results..
The mapping results are stored in the returned AnnData structure, saved as ad_map, structured as follows:
- The cell-by-spot matrix X contains the probability of cell i to be in spot j.
- The obs dataframe contains the metadata of the single cells.
- The var dataframe contains the metadata of the spatial data.
- The uns dictionary contains a dataframe with various information about the training genes (saved as train_genes_df).
Cell type maps¶
To visualize cell types in space, we invoke project_cell_annotations to transfer the annotation from the mapping onto space. We can then call plot_cell_annotation_sc to visualize it. You can set the perc argument to choose the range of the colormap, which helps remove outliers.
[10]:
tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list, perc=0.02)
INFO:root:spatial prediction dataframe is saved in `obsm` `tangram_ct_pred` of the spatial AnnData.

The first way to get a sense of whether the mapping was successful is to look for known cell type patterns. To get a deeper sense, we can use the helper plot_training_scores, which gives us four panels:
[11]:
tg.plot_training_scores(ad_map, bins=20, alpha=.5)

The first panel is a histogram of the similarity scores for each training gene.
In the second panel, each dot is a training gene and we can observe the training score (y-axis) and the sparsity in the scRNA-seq data (x-axis) of each gene.
The third panel is similar to the second one, but contains the gene sparsity of the spatial data. Spatial data are usually more sparse than single cell data, a discrepancy which is often responsible for low quality mapping.
In the last panel, we show the training scores as a function of the difference in sparsity between the datasets. For genes with comparable sparsity, the mapped gene expression is very similar to that in the spatial data. However, if a gene is quite sparse in one dataset (typically the spatial data) but not in the other, the mapping score is lower. This occurs because Tangram cannot properly match the gene pattern due to the inconsistent amount of dropouts between the datasets.
Although the above plots give us a summary of scores at the single-gene level, we would like to know which genes are mapped with low scores. This information is stored in the dataframe .uns['train_genes_df']; this is the dataframe used to build the four plots above.
[12]:
ad_map.uns['train_genes_df']
[12]:
gene | train_score | sparsity_sc | sparsity_sp | sparsity_diff
---|---|---|---|---
ppia | 0.998164 | 0.000092 | 0.000000 | -0.000092 |
ubb | 0.997351 | 0.000092 | 0.000000 | -0.000092 |
atp1b1 | 0.997034 | 0.014334 | 0.000000 | -0.014334 |
tmsb4x | 0.996996 | 0.002811 | 0.000000 | -0.002811 |
ckb | 0.996355 | 0.002765 | 0.000000 | -0.002765 |
... | ... | ... | ... | ... |
gabrb2 | 0.197021 | 0.078951 | 0.956790 | 0.877839 |
cdyl2 | 0.186201 | 0.425911 | 0.981481 | 0.555570 |
cntnap5c | 0.159086 | 0.608241 | 0.993827 | 0.385586 |
dlx1as | 0.142598 | 0.587777 | 0.990741 | 0.402964 |
kcnh6 | 0.140526 | 0.379131 | 0.996914 | 0.617783 |
1280 rows × 4 columns
New spatial data via aligned single cells¶
If the mapping mode is 'cells', we can now generate the “new spatial data” using the mapped single cells: this is done via project_genes. The function accepts as input a mapping (adata_map) and the corresponding single-cell data (adata_sc). The result is a voxel-by-gene AnnData, formally similar to adata_st, but containing gene expression from the mapped single-cell data rather than from Visium. For downstream analysis, we always replace adata_st with the corresponding ad_ge.
[13]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
ad_ge
[13]:
AnnData object with n_obs × n_vars = 324 × 36826
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_MT', 'log1p_total_counts_MT', 'pct_counts_MT', 'n_counts', 'leiden', 'cluster', 'uniform_density', 'rna_count_based_density'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'sparsity', 'is_training'
uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'rank_genes_groups', 'training_genes', 'overlap_genes'
Let's choose a few training genes mapped with low scores, to try to understand why.
[14]:
genes = ['rragb', 'trim17', 'eno1b']
ad_map.uns['train_genes_df'].loc[genes]
[14]:
gene | train_score | sparsity_sc | sparsity_sp | sparsity_diff
---|---|---|---|---
rragb | 0.357352 | 0.079919 | 0.867284 | 0.787365 |
trim17 | 0.203551 | 0.069641 | 0.959877 | 0.890236 |
eno1b | 0.341940 | 0.022492 | 0.885802 | 0.863311 |
To visualize gene patterns, we use the helper plot_genes. This function accepts two voxel-by-gene AnnDatas: the actual spatial data (adata_measured) and a Tangram spatial prediction (adata_predicted). The function plots gene expression maps from the two spatial AnnDatas for the genes in genes.
[15]:
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

The above pictures explain the low training scores. Some genes are detected with very different levels of sparsity - typically they are much more sparse in the spatial data than in the scRNA-seq. This is due to the fact that technologies like Visium are more prone to technical dropouts. Therefore, Tangram cannot find a good spatial alignment for these genes as the baseline signal is missing. However, so long as most training genes are measured with high quality, we can trust the mapping and use the Tangram prediction to correct gene expression. This is an imputation method which relies on entirely different premises than those of probabilistic models.
Another application is found by inspecting genes that are not detected in the spatial data but are detected in the single-cell data. They are removed before training by the pp_adatas function, but Tangram can predict their expression.
[16]:
genes=['loc102633833', 'gm5700', 'gm8292']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

So far, we have only inspected genes used to align the data (training genes), but the mapped single-cell data ad_ge contain the whole transcriptome. That includes more than 35k test genes.
[17]:
(ad_ge.var.is_training == False).sum()
[17]:
35546
We can use plot_genes to inspect the gene expression of test genes as well. Inspecting the test transcriptome is essential to validate the mapping. At the same time, we need to be careful that some predictions might disagree with the spatial data because of technical dropouts.
It is convenient to compute the similarity scores of all genes, which can be done with compare_spatial_geneexp. This function accepts two spatial AnnDatas (i.e., voxel-by-gene) and returns a dataframe with similarity scores for all genes. Training genes are flagged by the Boolean field is_training. If we also pass the single-cell AnnData to compare_spatial_geneexp, as below, the returned dataframe includes additional sparsity columns: sparsity_sc (single-cell data sparsity) and sparsity_diff (spatial data sparsity minus single-cell data sparsity). This is required if we later want to call plot_test_scores with the dataframe returned by compare_spatial_geneexp.
[18]:
df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
df_all_genes
[18]:
gene | score | is_training | sparsity_sp | sparsity_sc | sparsity_diff
---|---|---|---|---|---
snap25 | 0.998254 | False | 0.000000 | 0.014610 | -0.014610 |
gapdh | 0.998188 | False | 0.000000 | 0.000968 | -0.000968 |
ppia | 0.998164 | True | 0.000000 | 0.000092 | -0.000092 |
calm1 | 0.997960 | False | 0.000000 | 0.000369 | -0.000369 |
calm2 | 0.997759 | False | 0.000000 | 0.001751 | -0.001751 |
... | ... | ... | ... | ... | ... |
otx2 | 0.000013 | False | 0.996914 | 0.998341 | -0.001427 |
prr32 | 0.000010 | False | 0.993827 | 0.999263 | -0.005435 |
clec12a | 0.000009 | False | 0.996914 | 0.998848 | -0.001934 |
sntn | 0.000008 | False | 0.996914 | 0.999493 | -0.002579 |
cckar | 0.000004 | False | 0.996914 | 0.999309 | -0.002395 |
14785 rows × 5 columns
The predictions on test genes can be visualized graphically using plot_auc:
[19]:
# sns.scatterplot(data=df_all_genes, x='score', y='sparsity_sp', hue='is_training', alpha=.5); # for legacy
tg.plot_auc(df_all_genes);
<Figure size 432x288 with 0 Axes>

The above figure is the most important validation plot in *Tangram*. Each dot represents a gene; the x-axis indicates the score, and the y-axis the sparsity of that gene in the spatial data. Unsurprisingly, the genes predicted with low scores are very sparse genes in the spatial data, suggesting that the Tangram predictions correct expression for those genes. Note that the curve observed above is typical of Tangram mappings: the area under that curve is the most reliable metric we use to evaluate mapping.
Let’s inspect a few predictions. Some of these genes are biologically sparse, but well predicted:
[21]:
genes=['tfap2b', 'zic4']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

Some non-sparse genes present patterns that Tangram accentuates:
[22]:
genes = ['cd34', 'rasal1']
tg.plot_genes_sc(genes, adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

Finally, some unannotated genes have unknown functions. These genes are often hardly detected in the spatial data, but Tangram provides predictions:
[23]:
genes = ['gm33027', 'gm5431']
tg.plot_genes_sc(genes[:5], adata_measured=adata_st, adata_predicted=ad_ge, perc=0.02)

For untargeted spatial technologies, like Visium and Slide-seq, a spatial voxel may contain more than one cell. In these cases, it might be useful to disentangle gene expression into single cells - a process called deconvolution. Deconvolution is a requested feature, but also hard to obtain accurately with computational methods. If your goal is to study co-localization of cell types, we recommend you work with the spatial cell type maps instead. If your aim is the discovery of cell-cell communication mechanisms, we suggest you compute gene programs, then use project_cell_annotations to spatially visualize program usage. To proceed with deconvolution anyway, see below.
In order to deconvolve cells, Tangram needs to know how many cells are present in each voxel. This is achieved by segmenting the cells on the corresponding histology, which squidpy makes possible with two lines of code:
- squidpy.im.process applies smoothing as a pre-processing step.
- squidpy.im.segment computes segmentation masks with the watershed algorithm.
Note that some technologies, like Slide-seq, currently do not allow staining the same slide of tissue on which genes are profiled. For these data, you can still attempt a deconvolution by estimating cell density in a rough way - often we just pass a uniform prior. Finally, note that deconvolutions are hard to validate, as we do not have ground truth spatially-resolved single cells.
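A hedged sketch of such a rough deconvolution, using the constrained mode described in the Deconvolution via alignment section below together with a uniform density prior; n_cells_per_voxel is a hypothetical guess of the average number of cells per voxel, not a measured value:
# constrained mapping without a segmentation: uniform prior and a rough cell count
n_cells_per_voxel = 5  # assumed guess
ad_map = tg.map_cells_to_space(
    adata_sc,
    adata_st,
    mode="constrained",
    target_count=n_cells_per_voxel * adata_st.n_obs,
    density_prior="uniform",
    num_epochs=1000,
    device="cpu",
)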
[24]:
sq.im.process(img=img, layer="image", method="smooth")
sq.im.segment(
img=img,
layer="image_smooth",
method="watershed",
channel=0,
)
Let’s visualize the segmentation results for an inset
[25]:
inset_y = 1500
inset_x = 1700
inset_sy = 400
inset_sx = 500
fig, axs = plt.subplots(1, 3, figsize=(30, 10))
sc.pl.spatial(
adata_st, color="cluster", alpha=0.7, frameon=False, show=False, ax=axs[0], title=""
)
axs[0].set_title("Clusters", fontdict={"fontsize": 20})
sf = adata_st.uns["spatial"]["V1_Adult_Mouse_Brain_Coronal_Section_2"]["scalefactors"][
"tissue_hires_scalef"
]
rect = mpl.patches.Rectangle(
(inset_y * sf, inset_x * sf),
width=inset_sx * sf,
height=inset_sy * sf,
ec="yellow",
lw=4,
fill=False,
)
axs[0].add_patch(rect)
axs[0].axes.xaxis.label.set_visible(False)
axs[0].axes.yaxis.label.set_visible(False)
axs[1].imshow(
img["image"][inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx, 0, 0]
/ 65536,
interpolation="none",
)
axs[1].grid(False)
axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_title("DAPI", fontdict={"fontsize": 20})
crop = img["segmented_watershed"][
inset_y : inset_y + inset_sy, inset_x : inset_x + inset_sx
].values.squeeze(-1)
crop = skimage.segmentation.relabel_sequential(crop)[0]
cmap = plt.cm.plasma
cmap.set_under(color="black")
axs[2].imshow(crop, interpolation="none", cmap=cmap, vmin=0.001)
axs[2].grid(False)
axs[2].set_xticks([])
axs[2].set_yticks([])
axs[2].set_title("Nucleous segmentation", fontdict={"fontsize": 20});

Comparison between the DAPI image and the mask confirms the quality of the segmentation. We then need to extract some image features useful for the downstream deconvolution task. Specifically:
- the number of unique segmentation objects (i.e., nuclei) under each spot;
- the coordinates of the centroids of the segmentation objects.
[26]:
# define image layer to use for segmentation
features_kwargs = {
"segmentation": {
"label_layer": "segmented_watershed",
"props": ["label", "centroid"],
"channels": [1, 2],
}
}
# calculate segmentation features
sq.im.calculate_image_features(
adata_st,
img,
layer="image",
key_added="image_features",
features_kwargs=features_kwargs,
features="segmentation",
mask_circle=True,
)
We can visualize the total number of objects under each spot with scanpy.
[27]:
adata_st.obs["cell_count"] = adata_st.obsm["image_features"]["segmentation_label"]
sc.pl.spatial(adata_st, color=["cluster", "cell_count"], frameon=False)

Deconvolution via alignment¶
The rationale for deconvolving with Tangram is to constrain the number of mapped single-cell profiles. This is different from most deconvolution methods. Specifically, we set the number of mapped profiles equal to the number of segmented cells in the histology, in the following way:
- We pass mode='constrained'. This adds a filter term to the loss function, and a boolean regularizer.
- We set target_count equal to the total number of segmented cells. Tangram will look for the best target_count cells to align in space.
- We pass a density_prior containing the fraction of cells per voxel.
[28]:
ad_map = tg.map_cells_to_space(
adata_sc,
adata_st,
mode="constrained",
target_count=adata_st.obs.cell_count.sum(),
density_prior=np.array(adata_st.obs.cell_count) / adata_st.obs.cell_count.sum(),
num_epochs=1000,
# device="cuda:0",
device='cpu',
)
Score: 0.613, KL reg: 0.125, Count reg: 5644.970, Lambda f reg: 4481.579
Score: 0.698, KL reg: 0.012, Count reg: 12.219, Lambda f reg: 725.094
Score: 0.700, KL reg: 0.012, Count reg: 2.325, Lambda f reg: 249.482
Score: 0.700, KL reg: 0.012, Count reg: 0.521, Lambda f reg: 171.574
Score: 0.701, KL reg: 0.012, Count reg: 1.381, Lambda f reg: 140.555
Score: 0.701, KL reg: 0.012, Count reg: 0.524, Lambda f reg: 125.590
Score: 0.701, KL reg: 0.012, Count reg: 0.990, Lambda f reg: 111.074
Score: 0.701, KL reg: 0.012, Count reg: 0.573, Lambda f reg: 99.697
Score: 0.701, KL reg: 0.012, Count reg: 0.889, Lambda f reg: 90.959
Score: 0.701, KL reg: 0.012, Count reg: 0.854, Lambda f reg: 80.407
In the same way as before, we can plot cell type maps:
[29]:
tg.project_cell_annotations(ad_map, adata_st, annotation="cell_subclass")
annotation_list = list(pd.unique(adata_sc.obs['cell_subclass']))
tg.plot_cell_annotation_sc(adata_st, annotation_list, perc=0.02)

We validate mapping by inspecting the test transcriptome:
[30]:
ad_ge = tg.project_genes(adata_map=ad_map, adata_sc=adata_sc)
df_all_genes = tg.compare_spatial_geneexp(ad_ge, adata_st, adata_sc)
tg.plot_auc(df_all_genes);
<Figure size 432x288 with 0 Axes>

And here comes the key part, where we will use the results of the previous deconvolution steps. Previously, we computed the absolute numbers of unique segmentation objects under each spot, together with their centroids. Let's extract them in a format useful for Tangram. In the resulting dataframe, each row represents a single segmentation object (a cell). We also have the image coordinates, as well as the unique centroid ID, which is a string that contains both the spot ID and a numerical index. Tangram provides a convenient function to export the mapping between spot ID and segmentation ID to adata.uns.
[31]:
tg.create_segment_cell_df(adata_st)
adata_st.uns["tangram_cell_segmentation"].head()
[31]:
 | spot_idx | y | x | centroids
---|---|---|---|---
0 | AAATGGCATGTCTTGT-1 | 5304.000000 | 731.000000 | AAATGGCATGTCTTGT-1_0 |
1 | AAATGGCATGTCTTGT-1 | 5320.947519 | 721.331554 | AAATGGCATGTCTTGT-1_1 |
2 | AAATGGCATGTCTTGT-1 | 5332.942342 | 717.447904 | AAATGGCATGTCTTGT-1_2 |
3 | AAATGGCATGTCTTGT-1 | 5348.865384 | 558.924248 | AAATGGCATGTCTTGT-1_3 |
4 | AAATGGCATGTCTTGT-1 | 5342.124989 | 567.208502 | AAATGGCATGTCTTGT-1_4 |
We can use tangram.count_cell_annotations() to map the cell types resulting from the deconvolution step to the putative segmentation IDs.
[32]:
tg.count_cell_annotations(
ad_map,
adata_sc,
adata_st,
annotation="cell_subclass",
)
adata_st.obsm["tangram_ct_count"].head()
[32]:
 | x | y | cell_n | centroids | Pvalb | L4 | Vip | L2/3 IT | Lamp5 | NP | ... | L5 PT | Astro | L6b | Endo | Peri | Meis2 | Macrophage | CR | VLMC | SMC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AAATGGCATGTCTTGT-1 | 641 | 5393 | 13 | [AAATGGCATGTCTTGT-1_0, AAATGGCATGTCTTGT-1_1, A... | 1 | 0 | 2 | 0 | 2 | 0 | ... | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
AACAACTGGTAGTTGC-1 | 4208 | 1672 | 16 | [AACAACTGGTAGTTGC-1_0, AACAACTGGTAGTTGC-1_1, A... | 1 | 0 | 4 | 0 | 2 | 2 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
AACAGGAAATCGAATA-1 | 1117 | 5117 | 28 | [AACAGGAAATCGAATA-1_0, AACAGGAAATCGAATA-1_1, A... | 2 | 0 | 2 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
AACCCAGAGACGGAGA-1 | 1101 | 1274 | 5 | [AACCCAGAGACGGAGA-1_0, AACCCAGAGACGGAGA-1_1, A... | 3 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
AACCGTTGTGTTTGCT-1 | 399 | 4708 | 7 | [AACCGTTGTGTTTGCT-1_0, AACCGTTGTGTTTGCT-1_1, A... | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
And finally export the results in a new AnnData
object.
[33]:
adata_segment = tg.deconvolve_cell_annotations(adata_st)
adata_segment.obs.head()
[33]:
 | y | x | centroids | cluster
---|---|---|---|---
0 | 5304.000000 | 731.000000 | AAATGGCATGTCTTGT-1_0 | Pvalb |
1 | 5320.947519 | 721.331554 | AAATGGCATGTCTTGT-1_1 | Vip |
2 | 5332.942342 | 717.447904 | AAATGGCATGTCTTGT-1_2 | Vip |
3 | 5348.865384 | 558.924248 | AAATGGCATGTCTTGT-1_3 | Lamp5 |
4 | 5342.124989 | 567.208502 | AAATGGCATGTCTTGT-1_4 | Lamp5 |
Note that the AnnData object does not contain counts, but only cell type annotations, as a result of the Tangram mapping. Nevertheless, it is convenient to create such an AnnData object for visualization purposes. Below, you can appreciate that each dot is no longer a Visium spot, but a single unique segmentation object, with its mapped cell type.
[34]:
fig, ax = plt.subplots(1, 1, figsize=(20, 20))
sc.pl.spatial(
adata_segment,
color="cluster",
size=0.4,
show=False,
frameon=False,
alpha_img=0.2,
legend_fontsize=20,
ax=ax,
)
[34]:
[<AxesSubplot:title={'center':'cluster'}, xlabel='spatial1', ylabel='spatial2'>]
