Building an Easy Single-Cell RNA-seq Workflow with Scanpy
Single-cell RNA sequencing measures gene expression in individual cells, which makes it a powerful technique for studying how cells differ from one another. If you’re interested in analyzing PBMCs (peripheral blood mononuclear cells), tools like Scanpy can make the process smoother. This guide walks through how to set up a complete analysis pipeline, from data loading to cell clustering and trajectory exploration, all with accessible code and explanations.
Getting Started with Data Loading and Quality Control
The first step in any single-cell analysis is loading your dataset. In this example, the PBMC-3k benchmark dataset is used, which contains gene expression data from roughly 2,700 cells (the "3k" in the name is a rounded figure). After loading, it’s important to check the structure of the data, such as the number of cells and genes, to understand what you’re working with. Making gene names unique prevents duplicate gene symbols from causing ambiguous lookups later on.
Next, quality control metrics are calculated. These include the number of genes detected per cell, the total counts per cell, and the percentage of counts that come from mitochondrial and ribosomal genes. Visualizations like violin plots and scatter plots help identify low-quality cells or technical artifacts. Filtering out cells with too few genes or high mitochondrial content improves data reliability.
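To make these metrics concrete, here is a minimal NumPy sketch of how they can be computed from a cells-by-genes count matrix. It is not Scanpy's own code. The function name `qc_metrics`, the toy matrix, and the prefix conventions ("MT-" for human mitochondrial genes, "RPS"/"RPL" for ribosomal protein genes) are assumptions made for illustration.

```python
import numpy as np

def qc_metrics(counts, gene_names):
    """Per-cell QC metrics from a cells x genes matrix of raw counts."""
    counts = np.asarray(counts)
    names = np.asarray(gene_names, dtype=str)
    # Assumed naming conventions for human genes:
    # mitochondrial genes start with "MT-", ribosomal ones with "RPS"/"RPL".
    is_mt = np.char.startswith(names, "MT-")
    is_ribo = np.char.startswith(names, "RPS") | np.char.startswith(names, "RPL")
    total = counts.sum(axis=1)
    safe_total = np.maximum(total, 1)  # avoid dividing by zero for empty cells
    return {
        "n_genes": (counts > 0).sum(axis=1),  # genes with at least one count
        "total_counts": total,
        "pct_mt": 100.0 * counts[:, is_mt].sum(axis=1) / safe_total,
        "pct_ribo": 100.0 * counts[:, is_ribo].sum(axis=1) / safe_total,
    }

# Toy data: 2 cells x 4 genes (hypothetical values).
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 9],
])
genes = ["CD3E", "MS4A1", "RPS6", "MT-CO1"]
metrics = qc_metrics(counts, genes)
print(metrics["pct_mt"])  # second cell is dominated by mitochondrial counts
```

In the toy data the second cell gets most of its counts from a mitochondrial gene, which is the kind of pattern the violin and scatter plots are meant to expose.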
Filtering, Doublet Detection, and Normalization
After initial QC, filtering removes cells with fewer than 200 detected genes, which are often empty droplets or dead cells, and cells with more than 2,500 detected genes, which could be doublets. Cells with mitochondrial content above 5% are also excluded. To identify doublets (two cells captured together and sequenced as one), Scrublet is integrated into the workflow. Predicted doublets are then removed to ensure clean data for analysis.
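The filtering rule can be sketched as a single boolean mask. This is an illustrative NumPy version using the thresholds quoted above. The helper `keep_mask` and the example values are hypothetical, and the doublet flags are assumed to come from a separate detection step such as Scrublet.

```python
import numpy as np

MIN_GENES = 200    # drop cells with fewer detected genes
MAX_GENES = 2500   # drop cells with more detected genes
MAX_PCT_MT = 5.0   # drop cells above this mitochondrial percentage

def keep_mask(n_genes, pct_mt, is_doublet):
    """Return True for cells that pass every filter."""
    n_genes = np.asarray(n_genes)
    pct_mt = np.asarray(pct_mt, dtype=float)
    is_doublet = np.asarray(is_doublet, dtype=bool)
    return (
        (n_genes >= MIN_GENES)
        & (n_genes <= MAX_GENES)
        & (pct_mt <= MAX_PCT_MT)
        & ~is_doublet
    )

# Five hypothetical cells; only the second passes all four checks.
n_genes = [150, 800, 3000, 1200, 900]
pct_mt = [2.0, 3.0, 1.0, 8.0, 4.0]
is_doublet = [False, False, False, False, True]
mask = keep_mask(n_genes, pct_mt, is_doublet)
print(mask)
```

Combining the checks into one mask makes it easy to count how many cells each criterion removes before committing to the thresholds.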
Once high-quality cells are selected, normalization is performed. Counts are scaled so each cell has a total count of 10,000, making expression levels comparable across cells. The data is then log-transformed to stabilize variance. To focus on the most informative genes, highly variable genes are identified. This step reduces noise and speeds up downstream analysis.
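The arithmetic behind this step is short enough to write out. The sketch below is a plain NumPy illustration, not the library implementation. The name `normalize_log1p` is hypothetical, and the log transform is assumed to be the natural log of one plus the scaled count.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave all-zero cells as zeros
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Two hypothetical cells with the same composition but a 10x depth difference.
counts = np.array([
    [1, 3, 0],
    [10, 30, 0],
])
norm = normalize_log1p(counts)
print(np.allclose(norm[0], norm[1]))
```

The two toy cells differ only in sequencing depth, so they end up with identical normalized profiles, which is exactly the comparability the scaling is meant to provide.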
In this process, the raw counts are saved before the matrix is subset, so they remain available for future reference. Only the highly variable genes are carried forward, which helps in identifying distinct cell populations and reduces computational load.
Cell Cycle Scoring and Dimensionality Reduction
Understanding cell cycle states can be important, especially in immune cell populations like PBMCs. Lists of marker genes for S phase and G2/M phase are used to score each cell and assign it a cycle phase. The workflow then regresses out technical effects such as total counts and mitochondrial percentage, removing unwanted variation.
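Once each cell has an S score and a G2/M score, a phase label can be derived from the pair. The sketch below assumes a common convention: a cell is called G1 when both scores are negative, and otherwise takes the phase with the higher score. The function `assign_phase` and the example scores are hypothetical, and the scoring itself is not shown.

```python
import numpy as np

def assign_phase(s_score, g2m_score):
    """Assign 'G1', 'S' or 'G2M' from per-cell S and G2/M scores."""
    s = np.asarray(s_score, dtype=float)
    g2m = np.asarray(g2m_score, dtype=float)
    # Start from whichever of the two scores is higher.
    phase = np.where(g2m > s, "G2M", "S")
    # Cells with neither program active are treated as G1 (assumed rule).
    phase = np.where((s < 0) & (g2m < 0), "G1", phase)
    return phase

s_scores = [0.4, -0.2, 0.1]
g2m_scores = [0.1, -0.3, 0.6]
phases = assign_phase(s_scores, g2m_scores)
print(phases)
```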
Data scaling ensures that genes are on a similar scale, preventing highly expressed genes from dominating analysis. Principal Component Analysis (PCA) reduces data complexity while preserving meaningful variation. The variance explained by each principal component is visualized to decide how many PCs to use. Subsequently, algorithms like UMAP and t-SNE create two-dimensional visualizations that reveal the structure of the data and highlight distinct cell clusters.
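For intuition, scaling and PCA can be reproduced in a few lines of NumPy using a singular value decomposition. This is a sketch on random toy data rather than the workflow's actual code. The function `scale_and_pca` and the simulated matrix are assumptions for illustration.

```python
import numpy as np

def scale_and_pca(X, n_comps=2):
    """Standardize each gene, then return PC scores and variance ratios."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at zero after centering
    Z = (X - X.mean(axis=0)) / std
    # SVD of the centered matrix yields the principal components.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    variance = S**2 / (Z.shape[0] - 1)
    ratio = variance / variance.sum()
    scores = U[:, :n_comps] * S[:n_comps]  # cell coordinates in PC space
    return scores, ratio[:n_comps]

# Toy data: 50 cells x 6 genes, with genes 0 and 1 strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)
scores, ratio = scale_and_pca(X, n_comps=3)
print(ratio)  # fraction of variance explained by each of the first 3 PCs
```

Plotting `ratio` against the component index gives the kind of "elbow" curve used to decide how many PCs to keep. Here the first component stands out because two of the six genes move together.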
Clustering and Cell Population Identification
To identify cell types, clustering algorithms such as Leiden are applied to the neighborhood graph built from the PCA space. Different resolutions can be tested to find the most meaningful clusters. These clusters are visualized on UMAP or t-SNE plots, allowing easy identification of unique cell populations.
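The neighborhood graph that Leiden operates on starts from a k-nearest-neighbor search in PCA space. The brute-force NumPy sketch below shows that first step only. The helper `knn_indices` and the toy coordinates are hypothetical, and real pipelines typically use approximate search plus edge weighting, which is not shown.

```python
import numpy as np

def knn_indices(pcs, k):
    """Indices of the k nearest neighbors of each cell (Euclidean, brute force)."""
    pcs = np.asarray(pcs, dtype=float)
    diff = pcs[:, None, :] - pcs[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Six hypothetical cells forming two well-separated groups in 2 PCs.
pcs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
neighbors = knn_indices(pcs, k=2)
print(neighbors)
```

In the toy example every cell's neighbors fall inside its own group, so a graph-based method would recover two clusters. On real data, raising the Leiden resolution splits the graph into more, smaller communities, which is why several values are worth comparing.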
Marker genes specific to each cluster are then identified. By comparing these markers to known cell-type signatures, researchers can annotate clusters as T cells, B cells, monocytes, or other immune cell types. This step is key to understanding the composition of the PBMC sample.
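The comparison against known signatures can be sketched as a simple set-overlap lookup. The signature table below is a small illustrative subset of commonly used PBMC markers, not a complete reference. The function `annotate`, the cluster marker lists, and the placeholder gene "GENE_X" are hypothetical.

```python
# Small illustrative signature table (assumed, far from exhaustive).
KNOWN_MARKERS = {
    "CD4 T cells": {"IL7R", "CD3D"},
    "B cells": {"MS4A1", "CD79A"},
    "CD14+ monocytes": {"CD14", "LYZ"},
    "NK cells": {"GNLY", "NKG7"},
}

def annotate(cluster_markers, known=KNOWN_MARKERS):
    """Label each cluster with the cell type sharing the most marker genes."""
    labels = {}
    for cluster, genes in cluster_markers.items():
        genes = set(genes)
        overlaps = {cell_type: len(genes & sig) for cell_type, sig in known.items()}
        best = max(overlaps, key=overlaps.get)
        # Clusters with no overlap at all are left unlabeled.
        labels[cluster] = best if overlaps[best] > 0 else "unknown"
    return labels

# Hypothetical top markers per cluster.
clusters = {
    "0": ["IL7R", "CD3D", "LTB"],
    "1": ["MS4A1", "CD79A", "CD74"],
    "2": ["GENE_X"],
}
labels = annotate(clusters)
print(labels)
```

A lookup like this is only a starting point. Ambiguous or "unknown" clusters still need a manual look at their marker expression before a label is assigned.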
Finally, annotated data can be explored further. Trajectory analysis tools like PAGA and diffusion pseudotime help uncover how cells transition from one state to another, revealing potential developmental pathways or activation states within the immune system.
By following this workflow, researchers can turn raw single-cell data into meaningful biological insights. The combination of quality checks, filtering, clustering, and trajectory analysis provides a comprehensive view of cellular diversity and function in PBMC samples. Using Scanpy makes this process accessible and reproducible, even for those new to single-cell analysis.











