Gene Set Enrichment¶
Track: B - Genomics Researchers Difficulty: Intermediate Time estimate: 2-4 hours
What You'll Build¶
A ClawBio skill that takes a list of gene symbols, submits them to the Enrichr API, and retrieves enrichment results from three major pathway databases: KEGG, Reactome, and Gene Ontology (Biological Process). The output is a ranked table of significant pathways with p-values and a bar chart of the top 15.
Why This Matters¶
A list of differentially expressed genes is only the starting point. Pathway enrichment reveals the biological processes and mechanisms at play, turning a gene list into a biological story. Enrichr makes this accessible without local database installation.
Inputs and Outputs¶
Input: A text file with one gene symbol per line (HGNC symbols) Output: Markdown table of enriched pathways (name, p-value, adjusted p-value, overlapping genes) and a horizontal bar chart PNG of the top 15 pathways
Key APIs / Data Sources¶
- Enrichr API - POST gene list, then GET results per library
- Libraries to query:
KEGG_2021_Human,Reactome_2022,GO_Biological_Process_2023
Getting Started¶
- Create your skill folder:
skills/gene-set-enrichment/ - POST your gene list to
https://maayanlab.cloud/Enrichr/addListwith the genes as a newline-separated string - Use the returned
userListIdto GET results:https://maayanlab.cloud/Enrichr/enrich?userListId=ID&backgroundType=KEGG_2021_Human - Parse the JSON response: each entry contains term name, p-value, adjusted p-value, z-score, and overlapping genes
- Combine results from all three libraries, sort by adjusted p-value, and generate the chart
Domain Decisions for SKILL.md¶
- Report pathways with adjusted p-value < 0.05 only
- Use combined score (Enrichr's log(p) * z-score) as the primary ranking metric
- Show overlapping genes for each pathway to aid interpretation
- Warn the user if fewer than 10 genes are submitted (low statistical power)
Demo Data¶
Use a list of 50 well-known cancer-related genes: TP53, BRCA1, BRCA2, EGFR, KRAS, MYC, APC, RB1, PTEN, PIK3CA, BRAF, NRAS, CDH1, VHL, WT1, NF1, NF2, RET, KIT, PDGFRA, ALK, ERBB2, FGFR3, IDH1, IDH2, NPM1, FLT3, DNMT3A, TET2, JAK2, MPL, CALR, SF3B1, ASXL1, EZH2, NOTCH1, FBXW7, CTNNB1, SMAD4, STK11, CDKN2A, CDK4, MDM2, MET, ROS1, NTRK1, MAP2K1, ARID1A, KMT2A, CREBBP.
Stretch Goals¶
- Add dot plot visualisation (bubble size = gene count, colour = p-value)
- Support ranked gene lists for GSEA-style pre-ranked analysis
- Compare enrichment between two gene lists side by side
- Include disease-specific libraries (e.g. OMIM, DisGeNET)