Spinal Cord Data Preparation and Simple Processing Example

Author

Affiliation

Griffen Wakelin

Published

February 4, 2024

Modified

February 5, 2024

This notebook gives a quick example of how you could download, prepare, and process data from a publication. In this example, I used data from Milich et al., 2021, Single-cell analysis of the cellular heterogeneity and interactions in the injured mouse spinal cord, but in theory it could be done using any dataset from any paper.

Obtaining and preparing the data

Downloading data from GEO

The data was downloaded from the GEO repository associated with the publication, which is linked in the “Data Availability” section on the publication’s webpage. All reputable journals in the current era have something similar.

If you scroll down to the bottom of the GEO page, you will see a table containing “Supplementary Files”. I downloaded the files titled:

GSE162610_barcode_metadata.tsv.gz
GSE162610_barcodes.tsv.gz
GSE162610_gene_metadata.tsv.gz
GSE162610_genes.tsv.gz
GSE162610_sci_mat.mtx.gz

Note that to download supplementary files on GEO, you just click the “(http)” in the column titled “Download” for the corresponding entry.

Coercing the data into the proper format

From here, I then used functions from the python packages pandas and scipy to import and reformat the data, and functions from the python package anndata to save the data in a format amenable for single-cell RNA-seq (the “anndata”/adata format). This is shown in the file below titled prepare_data.py.

prepare_data.py

import pandas as pd
from scipy.io import mmread
from anndata import AnnData

obs_meta = pd.read_csv("GSE162610_barcode_metadata.tsv.gz", delimiter = "\t")
var_meta = pd.read_csv("GSE162610_gene_metadata.tsv.gz", delimiter = "\t")
counts = mmread("GSE162610_sci_mat.mtx.gz").tocsr().T

adata = AnnData(X=mat, obs=obs_meta, var=var_meta)
adata.write("milich.h5ad")

1: Loads relevant libraries for executing this script.
2: Loads in the file containing cell-level metadata.
3: Loads in the file containing the gene-level metadata.
4: Loads in the counts matrix.
5: Creates the anndata object.
6: Saves the anndata object.

Note that the methods I used above to make the final milich.h5ad file are specific to this dataset and will not work with any others. This is why it’s important to learn the very basics of python so that you can perform such tasks trivially.

(Pre)processing the data

I have saved the original data from the Milich et al. publication in your home directories in a folder called “GSE162610”, along with the milich.h5ad file. You can load the .h5ad file into python by importing scanpy and using its function read_h5ad() as follows:

# import libraries
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")

adata = sc.read_h5ad("GSE162610/milich.h5ad") # load in the data

You can view the cell- (obs) or gene-level (var) metadata of the adata object using the adata.obs or adata.var methods, respectively:

adata.obs

	orig.ident	nCount_RNA	nFeature_RNA	S.Score	G2M.Score	Phase	CC.Difference	nCount_SCT	nFeature_SCT	sample_id	...	percent_rp	pass_percent_rp	percent_hbb	pass_percent_hbb	doublet_scores	is_doublet	integrated_snn_res.0.8	seurat_clusters	default_cluster	celltype
AAACCTGAGAAGGGTA_1dpi_sample1	1dpi_sample1	2963	1562	-0.114955	-0.080684	G1	-0.034271	4626	1582	1dpi_sample1	...	0.093455	True	0.000000	True	0.006201	False	3	3	3	Endothelial
AAACCTGAGAATGTTG_1dpi_sample1	1dpi_sample1	5031	2105	-0.030511	-0.093308	G1	0.062797	5137	2105	1dpi_sample1	...	0.118418	True	0.000000	True	0.009843	False	3	3	3	Endothelial
AAACCTGAGACGCAAC_1dpi_sample1	1dpi_sample1	15277	3364	0.124416	0.425777	G2M	-0.301362	5540	1899	1dpi_sample1	...	0.171096	True	0.000000	True	0.127497	False	29	29	29	Div-Myeloid
AAACCTGAGCAATATG_1dpi_sample1	1dpi_sample1	4585	1019	-0.062800	-0.010219	G1	-0.052581	5057	1019	1dpi_sample1	...	0.026390	False	0.000000	True	0.022712	False	14	14	14	Neutrophil
AAACCTGAGGACAGCT_1dpi_sample1	1dpi_sample1	11975	3138	0.333207	0.191388	S	0.141819	5923	2553	1dpi_sample1	...	0.353290	True	0.000084	True	0.089927	False	15	15	15	Div-Myeloid
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTTGTTGAGAATACAC_uninj_sample3	uninj_sample3	4599	1895	-0.013252	-0.007464	G1	-0.005787	7337	1897	uninj_sample3	...	0.105847	True	0.000000	True	0.021030	False	6	6	6	Endothelial
TTTGTTGCAGTACTAC_uninj_sample3	uninj_sample3	15700	4746	0.035824	0.033421	S	0.002404	8794	3983	uninj_sample3	...	0.090759	True	0.000000	True	0.024152	False	7	7	7	Ependymal
TTTGTTGGTAAGCAAT_uninj_sample3	uninj_sample3	9030	3795	0.041746	-0.084748	S	0.126494	8536	3794	uninj_sample3	...	0.069752	True	0.000000	True	0.021030	False	12	12	12	OPC
TTTGTTGTCACCTGTC_uninj_sample3	uninj_sample3	11438	3516	0.020627	-0.044879	S	0.065507	8541	3510	uninj_sample3	...	0.102508	True	0.000000	True	0.043727	False	3	3	3	Endothelial
TTTGTTGTCACTTATC_uninj_sample3	uninj_sample3	7636	2636	0.007830	-0.068450	S	0.076280	7700	2636	uninj_sample3	...	0.083290	True	0.000000	True	0.026155	False	2	2	2	Microglia

66178 rows × 29 columns

General Processing Workflow

We can see that the authors provided important metadata such as their annotated cell type (celltype), condition (sample_id), cell-cycle phase (Phase), etc. We can look further into these later, but we must first process the data. To simplify this, we will subset the data to only contain a single condition.

adata.obs.sample_id.cat.categories

Index(['1dpi_sample1', '1dpi_sample2', '1dpi_sample3', '3dpi_sample1',
       '3dpi_sample2', '7dpi_sample1', '7dpi_sample2', 'uninj_sample1',
       'uninj_sample2', 'uninj_sample3'],
      dtype='object')

adata_uninj_sample1 = adata[adata.obs['sample_id']=='uninj_sample1', :]

Single-cell RNA-seq data processing follows a fairly standardized workflow due to its properties, mainly that it is count data and extremely sparse. Here I quickly define a function quick_preprocess() to run through the standard workflow.

def quick_preprocess(adata):
    adata_pp = adata.copy()
    adata_pp.layers['counts'] = adata.X.copy()
    sc.pp.normalize_total(adata_pp, target_sum=1e6)
    sc.pp.log1p(adata_pp)
    sc.pp.pca(adata_pp)
    sc.pp.neighbors(adata_pp)
    sc.tl.leiden(adata_pp)
    sc.tl.umap(adata_pp)
    return adata_pp

We can then run this function using the adata_uninjured_sample1 anndata object from before, and plot the results of clustering.

adata_uninj_sample1 = quick_preprocess(adata_uninj_sample1)
sc.pl.umap(adata_uninj_sample1, color=['leiden','celltype'], ncols=2)

The results from our quick preprocessing are similar, but not identical to their cell types. For example, clusters 1, 2, and 5 from our preprocessing are all contained within their Endothelial cluster. Let us consider their annotations as ground truth. If we want our annotations to get closer to theirs, we can recluster using a few different resolution parameters.

for res in [0.5, 0.35, 0.25, 0.15]:
    sc.tl.leiden(adata_uninj_sample1, resolution=res, key_added=f'leiden_{res}')

sc.pl.umap(
    adata_uninj_sample1,
    color=[
    'leiden',
    'leiden_0.5',
    'leiden_0.35',
    'leiden_0.25',
    'leiden_0.15',
    'celltype'
    ],
    title=[
        'Original Clustering',
        'Modified Clustering (0.5)',
        'Modified Clustering (0.35)',
        'Modified Clustering (0.25)',
        'Modified Clustering (0.15)',
        'Milich et al. Annotations (ground truth)'
    ],
    legend_loc=None,
    ncols=3
)

Using resolution=0.15 recapitulates the ground truth clustering fairly closely, and could be picked as a resolution to use for more downstream analyses.

We compare the datasets side-by-side and rename our clusters accordingly (while collapsing some clusters into one).

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15','celltype'])

new_clusters = [
    'Endothelial',
    'Microglia',
    'Ependymal',
    'Immune',
    'FibroPeri',
    'OPC',
]
adata_uninj_sample1.obs['leiden_0.15'] = adata_uninj_sample1.obs['leiden_0.15'].cat.rename_categories(new_clusters)

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15','celltype'], wspace=0.25)

Now, if we wanted to know which genes were differentially expressed between the clusters, we could perform such tests.

sc.tl.rank_genes_groups(adata_uninj_sample1, groupby='leiden_0.15', method='wilcoxon')
df = sc.get.rank_genes_groups_df(adata_uninj_sample1, group=None)

We can manipulate the dataframe containing the results of the differential expression testing, df. For example, we could view the top 5 differentially-expressed genes in the OPC cluster, sorted by Benjamini-Hochberg-corrected P-values:

df[df['group']=='OPC'].sort_values('pvals_adj', ascending=True).head(5)

	group	names	scores	logfoldchanges	pvals	pvals_adj
165885	OPC	Rtn1	8.548459	10.125089	1.247426e-17	4.138584e-13
165886	OPC	Gpm6b	8.349356	9.845983	6.863645e-17	9.533413e-13
165887	OPC	Olig1	8.279991	15.367180	1.231844e-16	9.533413e-13
165888	OPC	Scrg1	8.279263	14.186630	1.239407e-16	9.533413e-13
165889	OPC	Pcsk1n	8.261648	9.420261	1.436750e-16	9.533413e-13

Alternatively, we could view the top 10 differentially-expressed genes (by Z-score) in each cluster:

import pandas as pd
pd.DataFrame(adata_uninj_sample1.uns['rank_genes_groups']['names']).head(5)

	Endothelial	Microglia	Ependymal	Immune	FibroPeri	OPC
0	Bsg	Hexb	Mt3	Cd52	Pcolce	Rtn1
1	Cldn5	Cst3	Dbi	Lyz2	Lgals1	Gpm6b
2	Ly6c1	Ctsd	Nnat	Pfn1	Gsn	Olig1
3	Itm2a	Ctss	Mt1	Arhgdib	Nbl1	Scrg1
4	Flt1	C1qb	Clu	Rac2	Nupr1	Pcsk1n

Importantly, this shows us that our annotations were at least somewhat accurate, as genes known to be associated with certain cell types are upregulated in their respective clusters. For example, Hexb is known to be a marker for microglia, and has the highest Z-score in the microglial cluster. Similarly, Olig1 is seen to be differentially expressed in the OPC cluster.

We can visualize some of these genes using a few different methods, which further increases our confidence in the cell type annotations:

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15', 'Hexb'])

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15', 'Olig1'])

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(15,4))
sc.pl.violin(adata_uninj_sample1, groupby='leiden_0.15', keys=['Hexb'], show=False, ax=ax[0])
sc.pl.violin(adata_uninj_sample1, groupby='leiden_0.15', keys=['Olig1'], show=False, ax=ax[1])
plt.show()

Next Steps

There are numerous possibilities for analyses from here onwards, but you can take a look at some resources such as scanpy’s tutorials for some basic ideas. Feel free to ask me any theoretical or practical questions related to single-cell or programming as I am happy to answer.