Single-Cell Cluster Annotator

Track: B - Genomics Researchers
Difficulty: Advanced
Time estimate: 2-4 hours

What You'll Build

A ClawBio skill that takes a Scanpy AnnData object with pre-computed clusters and marker genes, then matches those markers against PanglaoDB and CellMarker databases to suggest cell type labels for each cluster. The output is an annotated cluster table and a dotplot of top markers per predicted cell type.

Why This Matters

Cell type annotation is the interpretive bottleneck in single-cell analysis. Researchers spend hours manually comparing marker genes against databases. Automated annotation standardises this process and reduces subjective bias in cell type assignment.

Inputs and Outputs

Input: A Scanpy AnnData object (H5AD file) with clustering and rank_genes_groups results
Output: TSV mapping each cluster to predicted cell types with confidence scores, plus a dotplot PNG of top markers
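
As a rough illustration of the TSV output, the sketch below writes one row per cluster and candidate cell type using only the standard library. The column names (`cluster`, `rank`, `cell_type`, `score`, `confidence`) and the function name are assumptions made for this example; the brief does not fix an output schema.

```python
import csv
import io

# Hypothetical column names; adjust to whatever schema your SKILL.md defines.
COLUMNS = ["cluster", "rank", "cell_type", "score", "confidence"]


def write_annotation_tsv(rows, handle):
    """Write one row per (cluster, candidate cell type) to a TSV file handle."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Illustrative rows only, not real results.
rows = [
    {"cluster": "0", "rank": 1, "cell_type": "T cells", "score": 7.4, "confidence": 0.25},
    {"cluster": "1", "rank": 1, "cell_type": "Unresolved", "score": 0.0, "confidence": 0.0},
]

buffer = io.StringIO()
write_annotation_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

In a real skill the handle would be an open file rather than an in-memory buffer.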

Key APIs / Data Sources

PanglaoDB - downloadable TSV of cell type markers (no API; bundle the marker file)
CellMarker 2.0 - downloadable marker database
Scanpy for data handling and plotting
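
Because PanglaoDB is distributed as a flat file rather than an API, loading it comes down to parsing a TSV. The sketch below assumes the bundled file has columns named `species`, `official gene symbol` and `cell type`, and that the species field holds space-separated codes such as `Mm Hs`; check both assumptions against the header of the file you actually download. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_panglaodb_markers(handle, species_code="Hs"):
    """Parse a PanglaoDB-style TSV into {cell_type: [gene, ...]}.

    Column names and the species encoding are assumptions; verify them
    against the bundled marker file.
    """
    markers = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Keep only rows annotated for the requested species (e.g. human).
        if species_code not in row["species"].split():
            continue
        gene = row["official gene symbol"].strip().upper()
        cell_type = row["cell type"].strip()
        # Skip blank symbols and avoid duplicate entries per cell type.
        if gene and gene not in markers[cell_type]:
            markers[cell_type].append(gene)
    return dict(markers)


# Tiny inline sample standing in for the bundled file (illustrative rows only).
sample = io.StringIO(
    "species\tofficial gene symbol\tcell type\n"
    "Mm Hs\tCD3E\tT cells\n"
    "Hs\tCD3D\tT cells\n"
    "Mm\tCd19\tB cells\n"
    "Mm Hs\tMS4A1\tB cells\n"
)
marker_db = load_panglaodb_markers(sample)
```

With the real file, replace the in-memory sample with `open(path, newline="")`.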

Getting Started

Create your skill folder: skills/cell-type-annotator/
Download PanglaoDB markers TSV and parse it into a dict: {cell_type: [gene_list]}
For each cluster, get the top 20 marker genes from adata.uns['rank_genes_groups']
Score each cell type by counting overlapping markers (weighted by log fold change)
Assign the top-scoring cell type to each cluster, with a confidence score based on the overlap ratio
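
The scoring steps above can be sketched in plain Python as follows. This is one possible reading of the brief, not a fixed specification: the score is taken as the sum of log fold changes over shared markers (negative values clipped to zero), and the "overlap ratio" is assumed to mean shared markers divided by the number of cluster markers considered. The function name and dictionary keys are hypothetical.

```python
def score_cluster(cluster_markers, marker_db):
    """Score every reference cell type against one cluster's top markers.

    cluster_markers: list of (gene, log_fold_change) for the cluster's top genes.
    marker_db: {cell_type: [gene, ...]} reference marker sets.
    Returns a list of result dicts sorted by descending score.
    """
    lfc_by_gene = {gene.upper(): lfc for gene, lfc in cluster_markers}
    results = []
    for cell_type, genes in marker_db.items():
        overlap = sorted(set(genes) & set(lfc_by_gene))
        # Weighted count: each shared marker contributes its log fold change,
        # clipped at zero so down-regulated genes never add support.
        score = sum(max(lfc_by_gene[g], 0.0) for g in overlap)
        # Assumed definition of overlap ratio: shared / cluster markers considered.
        confidence = len(overlap) / len(lfc_by_gene) if lfc_by_gene else 0.0
        results.append(
            {
                "cell_type": cell_type,
                "score": score,
                "n_overlap": len(overlap),
                "confidence": confidence,
            }
        )
    results.sort(key=lambda r: r["score"], reverse=True)
    return results


# Illustrative inputs: five top markers for one cluster and two reference sets.
marker_db = {
    "T cells": ["CD3D", "CD3E", "IL7R", "CD2"],
    "B cells": ["MS4A1", "CD79A", "CD19"],
}
cluster_markers = [
    ("CD3D", 3.0), ("CD3E", 2.5), ("IL7R", 2.0), ("LTB", 1.5), ("MS4A1", 0.5),
]
ranked = score_cluster(cluster_markers, marker_db)
best = ranked[0]
```

Here the T cell set shares three of the five cluster markers, so it ranks first. In a real skill, `cluster_markers` would hold the top 20 genes per cluster pulled from the rank_genes_groups results.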

Domain Decisions for SKILL.md

Use human markers only (filter PanglaoDB by species)
Require at least 3 overlapping markers for a confident assignment
Report the top 3 candidate cell types per cluster, not just the best
Flag clusters with no confident match as "Unresolved" rather than forcing an assignment
Weight markers by their specificity score (PanglaoDB sensitivity/specificity columns)
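
Three of these rules (the minimum overlap of 3, the top 3 candidates, and the "Unresolved" fallback) translate directly into a small decision function. The sketch below is a minimal illustration under the assumption that each candidate is a dict with `cell_type`, `score` and `n_overlap` keys; those key names and the function name are hypothetical.

```python
MIN_OVERLAP = 3  # from the brief: at least 3 overlapping markers
TOP_N = 3        # from the brief: report the top 3 candidates


def summarise_cluster(candidates, min_overlap=MIN_OVERLAP, top_n=TOP_N):
    """Choose a label and the top candidates for a single cluster.

    candidates: list of dicts with 'cell_type', 'score' and 'n_overlap'.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    top = ranked[:top_n]
    # Only accept the best candidate if enough markers actually overlap.
    confident = bool(top) and top[0]["n_overlap"] >= min_overlap
    label = top[0]["cell_type"] if confident else "Unresolved"
    return {"label": label, "candidates": [c["cell_type"] for c in top]}


# Illustrative candidates for two clusters.
result_a = summarise_cluster(
    [
        {"cell_type": "B cells", "score": 0.5, "n_overlap": 1},
        {"cell_type": "T cells", "score": 7.5, "n_overlap": 3},
        {"cell_type": "NK cells", "score": 2.0, "n_overlap": 1},
        {"cell_type": "Monocytes", "score": 0.1, "n_overlap": 1},
    ]
)
result_b = summarise_cluster(
    [
        {"cell_type": "Monocytes", "score": 1.2, "n_overlap": 2},
        {"cell_type": "Dendritic cells", "score": 0.8, "n_overlap": 1},
    ]
)
```

Keeping the candidate list even for unresolved clusters lets a reviewer see which cell types came closest.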

Demo Data

Use the --demo flag from ClawBio's existing scRNA Orchestrator skill to generate a demo AnnData object. Alternatively, download the PBMC 3k dataset from 10x Genomics (freely available) and run basic Scanpy preprocessing: filtering, normalisation, PCA, neighbour graph construction, Leiden clustering, and rank_genes_groups.
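
For the PBMC 3k route, a minimal Scanpy workflow might look like the sketch below. The thresholds (`min_genes=200`, `min_cells=3`, `target_sum=1e4`, `n_top_genes=2000`) are conventional tutorial values rather than requirements of this brief, the function names and output path are hypothetical, and Leiden clustering needs Scanpy's optional Leiden dependency installed.

```python
def build_pbmc3k_demo(out_path="pbmc3k_demo.h5ad"):
    """Run a basic Scanpy workflow on PBMC 3k and save the result as H5AD."""
    # Imported inside the function so this file can be loaded without Scanpy.
    import scanpy as sc

    adata = sc.datasets.pbmc3k()                  # downloads the raw PBMC 3k counts
    sc.pp.filter_cells(adata, min_genes=200)      # filter: drop near-empty cells
    sc.pp.filter_genes(adata, min_cells=3)        # filter: drop rarely detected genes
    sc.pp.normalize_total(adata, target_sum=1e4)  # normalise per-cell library size
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)                           # writes adata.obs["leiden"]
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    adata.write_h5ad(out_path)
    return adata


def top_markers(adata, cluster, n_genes=20):
    """Return [(gene, log_fold_change), ...] for one cluster's top-ranked genes."""
    import scanpy as sc

    df = sc.get.rank_genes_groups_df(adata, group=cluster).head(n_genes)
    return list(zip(df["names"], df["logfoldchanges"]))
```

Calling `build_pbmc3k_demo()` once produces an H5AD file that can serve as the skill's input; `top_markers(adata, "0")` then yields the marker list for cluster 0.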

Stretch Goals

Add tissue-specific filtering (e.g. only consider brain cell types for brain samples)
Support automated sub-clustering of ambiguous clusters
Cross-reference with the Human Cell Atlas ontology for standardised cell type names
Generate a UMAP coloured by predicted cell type
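
The tissue-specific filtering goal can be prototyped by restricting the reference marker sets before scoring. The sketch below assumes the PanglaoDB file carries an `organ` column alongside `cell type`; confirm this against the bundled file's header. Both function names are hypothetical.

```python
import csv
import io


def cell_types_for_organ(handle, organ):
    """Return the set of cell types whose 'organ' field matches the given tissue."""
    wanted = organ.strip().lower()
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["cell type"].strip()
        for row in reader
        if row["organ"].strip().lower() == wanted
    }


def restrict_markers(marker_db, allowed_cell_types):
    """Keep only the reference cell types permitted for this tissue."""
    return {ct: genes for ct, genes in marker_db.items() if ct in allowed_cell_types}


# Tiny inline sample standing in for the bundled file (illustrative rows only).
sample = io.StringIO(
    "official gene symbol\tcell type\torgan\n"
    "GFAP\tAstrocytes\tBrain\n"
    "AQP4\tAstrocytes\tBrain\n"
    "CD3E\tT cells\tImmune system\n"
)
brain_types = cell_types_for_organ(sample, "brain")
marker_db = {"Astrocytes": ["GFAP", "AQP4"], "T cells": ["CD3E"]}
brain_db = restrict_markers(marker_db, brain_types)
```

Filtering the reference up front means the scoring logic stays unchanged; only the set of candidate cell types shrinks.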