# import libraries
import scanpy as sc
import warnings
"ignore")
warnings.filterwarnings(
= sc.read_h5ad("GSE162610/milich.h5ad") # load in the data adata
Spinal Cord Data Preparation and Simple Processing Example
This notebook gives a quick example of how you could download, prepare, and process data from a publication. In this example, I used data from Milich et al., 2021, Single-cell analysis of the cellular heterogeneity and interactions in the injured mouse spinal cord, but in theory it could be done using any dataset from any paper.
Obtaining and preparing the data
Downloading data from GEO
The data was downloaded from the GEO repository associated with the publication, which is linked in the “Data Availability” section on the publication’s webpage. All reputable journals in the current era have something similar.
If you scroll down to the bottom of the GEO page, you will see a table containing “Supplementary Files”. I downloaded the files titled:
GSE162610_barcode_metadata.tsv.gz
GSE162610_barcodes.tsv.gz
GSE162610_gene_metadata.tsv.gz
GSE162610_genes.tsv.gz
GSE162610_sci_mat.mtx.gz
Note that to download supplementary files on GEO, you just click the “(http)” in the column titled “Download” for the corresponding entry.
Coercing the data into the proper format
From here, I then used functions from the python packages pandas
and scipy
to import and reformat the data, and functions from the python package anndata
to save the data in a format amenable for single-cell RNA-seq (the “anndata
”/adata
format). This is shown in the file below titled prepare_data.py
.
prepare_data.py
import pandas as pd
from scipy.io import mmread
from anndata import AnnData
= pd.read_csv("GSE162610_barcode_metadata.tsv.gz", delimiter = "\t")
obs_meta = pd.read_csv("GSE162610_gene_metadata.tsv.gz", delimiter = "\t")
var_meta = mmread("GSE162610_sci_mat.mtx.gz").tocsr().T
counts
= AnnData(X=mat, obs=obs_meta, var=var_meta)
adata "milich.h5ad") adata.write(
- 1
- Loads relevant libraries for executing this script.
- 2
- Loads in the file containing cell-level metadata.
- 3
- Loads in the file containing the gene-level metadata.
- 4
- Loads in the counts matrix.
- 5
-
Creates the
anndata
object. - 6
-
Saves the
anndata
object.
Note that the methods I used above to make the final milich.h5ad
file are specific to this dataset and will not work with any others. This is why it’s important to learn the very basics of python so that you can perform such tasks trivially.
(Pre)processing the data
I have saved the original data from the Milich et al. publication in your home directories in a folder called “GSE162610”, along with the milich.h5ad
file. You can load the .h5ad
file into python by importing scanpy
and using its function read_h5ad()
as follows:
You can view the cell- (obs) or gene-level (var) metadata of the adata
object using the adata.obs
or adata.var
methods, respectively:
adata.obs
orig.ident | nCount_RNA | nFeature_RNA | S.Score | G2M.Score | Phase | CC.Difference | nCount_SCT | nFeature_SCT | sample_id | ... | percent_rp | pass_percent_rp | percent_hbb | pass_percent_hbb | doublet_scores | is_doublet | integrated_snn_res.0.8 | seurat_clusters | default_cluster | celltype | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACCTGAGAAGGGTA_1dpi_sample1 | 1dpi_sample1 | 2963 | 1562 | -0.114955 | -0.080684 | G1 | -0.034271 | 4626 | 1582 | 1dpi_sample1 | ... | 0.093455 | True | 0.000000 | True | 0.006201 | False | 3 | 3 | 3 | Endothelial |
AAACCTGAGAATGTTG_1dpi_sample1 | 1dpi_sample1 | 5031 | 2105 | -0.030511 | -0.093308 | G1 | 0.062797 | 5137 | 2105 | 1dpi_sample1 | ... | 0.118418 | True | 0.000000 | True | 0.009843 | False | 3 | 3 | 3 | Endothelial |
AAACCTGAGACGCAAC_1dpi_sample1 | 1dpi_sample1 | 15277 | 3364 | 0.124416 | 0.425777 | G2M | -0.301362 | 5540 | 1899 | 1dpi_sample1 | ... | 0.171096 | True | 0.000000 | True | 0.127497 | False | 29 | 29 | 29 | Div-Myeloid |
AAACCTGAGCAATATG_1dpi_sample1 | 1dpi_sample1 | 4585 | 1019 | -0.062800 | -0.010219 | G1 | -0.052581 | 5057 | 1019 | 1dpi_sample1 | ... | 0.026390 | False | 0.000000 | True | 0.022712 | False | 14 | 14 | 14 | Neutrophil |
AAACCTGAGGACAGCT_1dpi_sample1 | 1dpi_sample1 | 11975 | 3138 | 0.333207 | 0.191388 | S | 0.141819 | 5923 | 2553 | 1dpi_sample1 | ... | 0.353290 | True | 0.000084 | True | 0.089927 | False | 15 | 15 | 15 | Div-Myeloid |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
TTTGTTGAGAATACAC_uninj_sample3 | uninj_sample3 | 4599 | 1895 | -0.013252 | -0.007464 | G1 | -0.005787 | 7337 | 1897 | uninj_sample3 | ... | 0.105847 | True | 0.000000 | True | 0.021030 | False | 6 | 6 | 6 | Endothelial |
TTTGTTGCAGTACTAC_uninj_sample3 | uninj_sample3 | 15700 | 4746 | 0.035824 | 0.033421 | S | 0.002404 | 8794 | 3983 | uninj_sample3 | ... | 0.090759 | True | 0.000000 | True | 0.024152 | False | 7 | 7 | 7 | Ependymal |
TTTGTTGGTAAGCAAT_uninj_sample3 | uninj_sample3 | 9030 | 3795 | 0.041746 | -0.084748 | S | 0.126494 | 8536 | 3794 | uninj_sample3 | ... | 0.069752 | True | 0.000000 | True | 0.021030 | False | 12 | 12 | 12 | OPC |
TTTGTTGTCACCTGTC_uninj_sample3 | uninj_sample3 | 11438 | 3516 | 0.020627 | -0.044879 | S | 0.065507 | 8541 | 3510 | uninj_sample3 | ... | 0.102508 | True | 0.000000 | True | 0.043727 | False | 3 | 3 | 3 | Endothelial |
TTTGTTGTCACTTATC_uninj_sample3 | uninj_sample3 | 7636 | 2636 | 0.007830 | -0.068450 | S | 0.076280 | 7700 | 2636 | uninj_sample3 | ... | 0.083290 | True | 0.000000 | True | 0.026155 | False | 2 | 2 | 2 | Microglia |
66178 rows × 29 columns
General Processing Workflow
We can see that the authors provided important metadata such as their annotated cell type (celltype
), condition (sample_id
), cell-cycle phase (Phase
), etc. We can look further into these later, but we must first process the data. To simplify this, we will subset the data to only contain a single condition.
adata.obs.sample_id.cat.categories
Index(['1dpi_sample1', '1dpi_sample2', '1dpi_sample3', '3dpi_sample1',
'3dpi_sample2', '7dpi_sample1', '7dpi_sample2', 'uninj_sample1',
'uninj_sample2', 'uninj_sample3'],
dtype='object')
= adata[adata.obs['sample_id']=='uninj_sample1', :] adata_uninj_sample1
Single-cell RNA-seq data processing follows a fairly standardized workflow due to its properties, mainly that it is count data and extremely sparse. Here I quickly define a function quick_preprocess()
to run through the standard workflow.
def quick_preprocess(adata):
= adata.copy()
adata_pp 'counts'] = adata.X.copy()
adata_pp.layers[=1e6)
sc.pp.normalize_total(adata_pp, target_sum
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp)
sc.tl.umap(adata_pp)return adata_pp
We can then run this function using the adata_uninjured_sample1
anndata object from before, and plot the results of clustering.
= quick_preprocess(adata_uninj_sample1)
adata_uninj_sample1 =['leiden','celltype'], ncols=2) sc.pl.umap(adata_uninj_sample1, color
The results from our quick preprocessing are similar, but not identical to their cell types. For example, clusters 1
, 2
, and 5
from our preprocessing are all contained within their Endothelial
cluster. Let us consider their annotations as ground truth. If we want our annotations to get closer to theirs, we can recluster using a few different resolution
parameters.
for res in [0.5, 0.35, 0.25, 0.15]:
=res, key_added=f'leiden_{res}')
sc.tl.leiden(adata_uninj_sample1, resolution
sc.pl.umap(
adata_uninj_sample1,=[
color'leiden',
'leiden_0.5',
'leiden_0.35',
'leiden_0.25',
'leiden_0.15',
'celltype'
],=[
title'Original Clustering',
'Modified Clustering (0.5)',
'Modified Clustering (0.35)',
'Modified Clustering (0.25)',
'Modified Clustering (0.15)',
'Milich et al. Annotations (ground truth)'
],=None,
legend_loc=3
ncols )
Using resolution=0.15
recapitulates the ground truth clustering fairly closely, and could be picked as a resolution to use for more downstream analyses.
We compare the datasets side-by-side and rename our clusters accordingly (while collapsing some clusters into one).
=['leiden_0.15','celltype']) sc.pl.umap(adata_uninj_sample1, color
= [
new_clusters 'Endothelial',
'Microglia',
'Ependymal',
'Immune',
'FibroPeri',
'OPC',
]'leiden_0.15'] = adata_uninj_sample1.obs['leiden_0.15'].cat.rename_categories(new_clusters) adata_uninj_sample1.obs[
=['leiden_0.15','celltype'], wspace=0.25) sc.pl.umap(adata_uninj_sample1, color
Now, if we wanted to know which genes were differentially expressed between the clusters, we could perform such tests.
='leiden_0.15', method='wilcoxon')
sc.tl.rank_genes_groups(adata_uninj_sample1, groupby= sc.get.rank_genes_groups_df(adata_uninj_sample1, group=None) df
We can manipulate the dataframe containing the results of the differential expression testing, df
. For example, we could view the top 5 differentially-expressed genes in the OPC cluster, sorted by Benjamini-Hochberg-corrected P-values:
'group']=='OPC'].sort_values('pvals_adj', ascending=True).head(5) df[df[
group | names | scores | logfoldchanges | pvals | pvals_adj | |
---|---|---|---|---|---|---|
165885 | OPC | Rtn1 | 8.548459 | 10.125089 | 1.247426e-17 | 4.138584e-13 |
165886 | OPC | Gpm6b | 8.349356 | 9.845983 | 6.863645e-17 | 9.533413e-13 |
165887 | OPC | Olig1 | 8.279991 | 15.367180 | 1.231844e-16 | 9.533413e-13 |
165888 | OPC | Scrg1 | 8.279263 | 14.186630 | 1.239407e-16 | 9.533413e-13 |
165889 | OPC | Pcsk1n | 8.261648 | 9.420261 | 1.436750e-16 | 9.533413e-13 |
Alternatively, we could view the top 10 differentially-expressed genes (by Z-score) in each cluster:
import pandas as pd
'rank_genes_groups']['names']).head(5) pd.DataFrame(adata_uninj_sample1.uns[
Endothelial | Microglia | Ependymal | Immune | FibroPeri | OPC | |
---|---|---|---|---|---|---|
0 | Bsg | Hexb | Mt3 | Cd52 | Pcolce | Rtn1 |
1 | Cldn5 | Cst3 | Dbi | Lyz2 | Lgals1 | Gpm6b |
2 | Ly6c1 | Ctsd | Nnat | Pfn1 | Gsn | Olig1 |
3 | Itm2a | Ctss | Mt1 | Arhgdib | Nbl1 | Scrg1 |
4 | Flt1 | C1qb | Clu | Rac2 | Nupr1 | Pcsk1n |
Importantly, this shows us that our annotations were at least somewhat accurate, as genes known to be associated with certain cell types are upregulated in their respective clusters. For example, Hexb is known to be a marker for microglia, and has the highest Z-score in the microglial cluster. Similarly, Olig1 is seen to be differentially expressed in the OPC cluster.
We can visualize some of these genes using a few different methods, which further increases our confidence in the cell type annotations:
=['leiden_0.15', 'Hexb']) sc.pl.umap(adata_uninj_sample1, color
=['leiden_0.15', 'Olig1']) sc.pl.umap(adata_uninj_sample1, color
import matplotlib.pyplot as plt
= plt.subplots(1, 2, figsize=(15,4))
fig, ax ='leiden_0.15', keys=['Hexb'], show=False, ax=ax[0])
sc.pl.violin(adata_uninj_sample1, groupby='leiden_0.15', keys=['Olig1'], show=False, ax=ax[1])
sc.pl.violin(adata_uninj_sample1, groupby plt.show()
Next Steps
There are numerous possibilities for analyses from here onwards, but you can take a look at some resources such as scanpy’s tutorials for some basic ideas. Feel free to ask me any theoretical or practical questions related to single-cell or programming as I am happy to answer.