Spinal Cord Data Preparation and Simple Processing Example

Author
Affiliation

Griffen Wakelin

Published

February 4, 2024

Modified

February 5, 2024

This notebook gives a quick example of how you could download, prepare, and process data from a publication. In this example, I used data from Milich et al., 2021, Single-cell analysis of the cellular heterogeneity and interactions in the injured mouse spinal cord, but in theory it could be done using any dataset from any paper.

Obtaining and preparing the data

Downloading data from GEO

The data was downloaded from the GEO repository associated with the publication, which is linked in the “Data Availability” section on the publication’s webpage. All reputable journals in the current era have something similar.

If you scroll down to the bottom of the GEO page, you will see a table containing “Supplementary Files”. I downloaded the files titled:

  • GSE162610_barcode_metadata.tsv.gz
  • GSE162610_barcodes.tsv.gz
  • GSE162610_gene_metadata.tsv.gz
  • GSE162610_genes.tsv.gz
  • GSE162610_sci_mat.mtx.gz

Note that to download supplementary files on GEO, you just click the “(http)” in the column titled “Download” for the corresponding entry.

geo-downloads

Coercing the data into the proper format

From here, I then used functions from the python packages pandas and scipy to import and reformat the data, and functions from the python package anndata to save the data in a format amenable for single-cell RNA-seq (the “anndata”/adata format). This is shown in the file below titled prepare_data.py.

prepare_data.py
import pandas as pd
from scipy.io import mmread
from anndata import AnnData

obs_meta = pd.read_csv("GSE162610_barcode_metadata.tsv.gz", delimiter = "\t")
var_meta = pd.read_csv("GSE162610_gene_metadata.tsv.gz", delimiter = "\t")
counts = mmread("GSE162610_sci_mat.mtx.gz").tocsr().T

adata = AnnData(X=mat, obs=obs_meta, var=var_meta)
adata.write("milich.h5ad")
1
Loads relevant libraries for executing this script.
2
Loads in the file containing cell-level metadata.
3
Loads in the file containing the gene-level metadata.
4
Loads in the counts matrix.
5
Creates the anndata object.
6
Saves the anndata object.

Note that the methods I used above to make the final milich.h5ad file are specific to this dataset and will not work with any others. This is why it’s important to learn the very basics of python so that you can perform such tasks trivially.

(Pre)processing the data

I have saved the original data from the Milich et al. publication in your home directories in a folder called “GSE162610”, along with the milich.h5ad file. You can load the .h5ad file into python by importing scanpy and using its function read_h5ad() as follows:

# import libraries
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")

adata = sc.read_h5ad("GSE162610/milich.h5ad") # load in the data

You can view the cell- (obs) or gene-level (var) metadata of the adata object using the adata.obs or adata.var methods, respectively:

adata.obs
orig.ident nCount_RNA nFeature_RNA S.Score G2M.Score Phase CC.Difference nCount_SCT nFeature_SCT sample_id ... percent_rp pass_percent_rp percent_hbb pass_percent_hbb doublet_scores is_doublet integrated_snn_res.0.8 seurat_clusters default_cluster celltype
AAACCTGAGAAGGGTA_1dpi_sample1 1dpi_sample1 2963 1562 -0.114955 -0.080684 G1 -0.034271 4626 1582 1dpi_sample1 ... 0.093455 True 0.000000 True 0.006201 False 3 3 3 Endothelial
AAACCTGAGAATGTTG_1dpi_sample1 1dpi_sample1 5031 2105 -0.030511 -0.093308 G1 0.062797 5137 2105 1dpi_sample1 ... 0.118418 True 0.000000 True 0.009843 False 3 3 3 Endothelial
AAACCTGAGACGCAAC_1dpi_sample1 1dpi_sample1 15277 3364 0.124416 0.425777 G2M -0.301362 5540 1899 1dpi_sample1 ... 0.171096 True 0.000000 True 0.127497 False 29 29 29 Div-Myeloid
AAACCTGAGCAATATG_1dpi_sample1 1dpi_sample1 4585 1019 -0.062800 -0.010219 G1 -0.052581 5057 1019 1dpi_sample1 ... 0.026390 False 0.000000 True 0.022712 False 14 14 14 Neutrophil
AAACCTGAGGACAGCT_1dpi_sample1 1dpi_sample1 11975 3138 0.333207 0.191388 S 0.141819 5923 2553 1dpi_sample1 ... 0.353290 True 0.000084 True 0.089927 False 15 15 15 Div-Myeloid
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTGTTGAGAATACAC_uninj_sample3 uninj_sample3 4599 1895 -0.013252 -0.007464 G1 -0.005787 7337 1897 uninj_sample3 ... 0.105847 True 0.000000 True 0.021030 False 6 6 6 Endothelial
TTTGTTGCAGTACTAC_uninj_sample3 uninj_sample3 15700 4746 0.035824 0.033421 S 0.002404 8794 3983 uninj_sample3 ... 0.090759 True 0.000000 True 0.024152 False 7 7 7 Ependymal
TTTGTTGGTAAGCAAT_uninj_sample3 uninj_sample3 9030 3795 0.041746 -0.084748 S 0.126494 8536 3794 uninj_sample3 ... 0.069752 True 0.000000 True 0.021030 False 12 12 12 OPC
TTTGTTGTCACCTGTC_uninj_sample3 uninj_sample3 11438 3516 0.020627 -0.044879 S 0.065507 8541 3510 uninj_sample3 ... 0.102508 True 0.000000 True 0.043727 False 3 3 3 Endothelial
TTTGTTGTCACTTATC_uninj_sample3 uninj_sample3 7636 2636 0.007830 -0.068450 S 0.076280 7700 2636 uninj_sample3 ... 0.083290 True 0.000000 True 0.026155 False 2 2 2 Microglia

66178 rows × 29 columns

General Processing Workflow

We can see that the authors provided important metadata such as their annotated cell type (celltype), condition (sample_id), cell-cycle phase (Phase), etc. We can look further into these later, but we must first process the data. To simplify this, we will subset the data to only contain a single condition.

adata.obs.sample_id.cat.categories
Index(['1dpi_sample1', '1dpi_sample2', '1dpi_sample3', '3dpi_sample1',
       '3dpi_sample2', '7dpi_sample1', '7dpi_sample2', 'uninj_sample1',
       'uninj_sample2', 'uninj_sample3'],
      dtype='object')
adata_uninj_sample1 = adata[adata.obs['sample_id']=='uninj_sample1', :]

Single-cell RNA-seq data processing follows a fairly standardized workflow due to its properties, mainly that it is count data and extremely sparse. Here I quickly define a function quick_preprocess() to run through the standard workflow.

def quick_preprocess(adata):
    adata_pp = adata.copy()
    adata_pp.layers['counts'] = adata.X.copy()
    sc.pp.normalize_total(adata_pp, target_sum=1e6)
    sc.pp.log1p(adata_pp)
    sc.pp.pca(adata_pp)
    sc.pp.neighbors(adata_pp)
    sc.tl.leiden(adata_pp)
    sc.tl.umap(adata_pp)
    return adata_pp

We can then run this function using the adata_uninjured_sample1 anndata object from before, and plot the results of clustering.

adata_uninj_sample1 = quick_preprocess(adata_uninj_sample1)
sc.pl.umap(adata_uninj_sample1, color=['leiden','celltype'], ncols=2)

The results from our quick preprocessing are similar, but not identical to their cell types. For example, clusters 1, 2, and 5 from our preprocessing are all contained within their Endothelial cluster. Let us consider their annotations as ground truth. If we want our annotations to get closer to theirs, we can recluster using a few different resolution parameters.

for res in [0.5, 0.35, 0.25, 0.15]:
    sc.tl.leiden(adata_uninj_sample1, resolution=res, key_added=f'leiden_{res}')

sc.pl.umap(
    adata_uninj_sample1,
    color=[
    'leiden',
    'leiden_0.5',
    'leiden_0.35',
    'leiden_0.25',
    'leiden_0.15',
    'celltype'
    ],
    title=[
        'Original Clustering',
        'Modified Clustering (0.5)',
        'Modified Clustering (0.35)',
        'Modified Clustering (0.25)',
        'Modified Clustering (0.15)',
        'Milich et al. Annotations (ground truth)'
    ],
    legend_loc=None,
    ncols=3
)

Using resolution=0.15 recapitulates the ground truth clustering fairly closely, and could be picked as a resolution to use for more downstream analyses.

We compare the datasets side-by-side and rename our clusters accordingly (while collapsing some clusters into one).

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15','celltype'])

new_clusters = [
    'Endothelial',
    'Microglia',
    'Ependymal',
    'Immune',
    'FibroPeri',
    'OPC',
]
adata_uninj_sample1.obs['leiden_0.15'] = adata_uninj_sample1.obs['leiden_0.15'].cat.rename_categories(new_clusters)
sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15','celltype'], wspace=0.25)

Now, if we wanted to know which genes were differentially expressed between the clusters, we could perform such tests.

sc.tl.rank_genes_groups(adata_uninj_sample1, groupby='leiden_0.15', method='wilcoxon')
df = sc.get.rank_genes_groups_df(adata_uninj_sample1, group=None)

We can manipulate the dataframe containing the results of the differential expression testing, df. For example, we could view the top 5 differentially-expressed genes in the OPC cluster, sorted by Benjamini-Hochberg-corrected P-values:

df[df['group']=='OPC'].sort_values('pvals_adj', ascending=True).head(5)
group names scores logfoldchanges pvals pvals_adj
165885 OPC Rtn1 8.548459 10.125089 1.247426e-17 4.138584e-13
165886 OPC Gpm6b 8.349356 9.845983 6.863645e-17 9.533413e-13
165887 OPC Olig1 8.279991 15.367180 1.231844e-16 9.533413e-13
165888 OPC Scrg1 8.279263 14.186630 1.239407e-16 9.533413e-13
165889 OPC Pcsk1n 8.261648 9.420261 1.436750e-16 9.533413e-13

Alternatively, we could view the top 10 differentially-expressed genes (by Z-score) in each cluster:

import pandas as pd
pd.DataFrame(adata_uninj_sample1.uns['rank_genes_groups']['names']).head(5)
Endothelial Microglia Ependymal Immune FibroPeri OPC
0 Bsg Hexb Mt3 Cd52 Pcolce Rtn1
1 Cldn5 Cst3 Dbi Lyz2 Lgals1 Gpm6b
2 Ly6c1 Ctsd Nnat Pfn1 Gsn Olig1
3 Itm2a Ctss Mt1 Arhgdib Nbl1 Scrg1
4 Flt1 C1qb Clu Rac2 Nupr1 Pcsk1n

Importantly, this shows us that our annotations were at least somewhat accurate, as genes known to be associated with certain cell types are upregulated in their respective clusters. For example, Hexb is known to be a marker for microglia, and has the highest Z-score in the microglial cluster. Similarly, Olig1 is seen to be differentially expressed in the OPC cluster.

We can visualize some of these genes using a few different methods, which further increases our confidence in the cell type annotations:

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15', 'Hexb'])

sc.pl.umap(adata_uninj_sample1, color=['leiden_0.15', 'Olig1'])

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(15,4))
sc.pl.violin(adata_uninj_sample1, groupby='leiden_0.15', keys=['Hexb'], show=False, ax=ax[0])
sc.pl.violin(adata_uninj_sample1, groupby='leiden_0.15', keys=['Olig1'], show=False, ax=ax[1])
plt.show()

Next Steps

There are numerous possibilities for analyses from here onwards, but you can take a look at some resources such as scanpy’s tutorials for some basic ideas. Feel free to ask me any theoretical or practical questions related to single-cell or programming as I am happy to answer.