Mass Spec QC Skill
Track: C - Proteomics
Difficulty: Intermediate
Time estimate: 2-4 hours
What You'll Build
A ClawBio skill that reads proteomics search engine output (MaxQuant proteinGroups.txt or DIA-NN report.tsv) and produces a standardised QC report. The report covers protein identification count, missing value percentage, coefficient of variation distribution, and contaminant fraction.
Why This Matters
Mass spectrometry experiments are expensive and technically demanding. QC failures caught late waste weeks of analysis time. A standardised QC skill catches problems at the data-loading stage, before any downstream biological interpretation.
Inputs and Outputs
Input: MaxQuant proteinGroups.txt or DIA-NN main report TSV
Output: Markdown QC report with total proteins identified, missing value % per sample, CV distribution plot (PNG), contaminant summary, and an overall pass/fail verdict
Key APIs / Data Sources
- No external API needed; this skill parses local output files
- MaxQuant documentation for column definitions
- DIA-NN documentation for report format
Getting Started
- Create your skill folder: skills/mass-spec-qc/
- Auto-detect the input format (MaxQuant vs DIA-NN) from column headers
- For MaxQuant: filter reverse hits (Reverse == "+") and contaminants (Potential contaminant == "+")
- Calculate per-sample metrics: protein count, missing values, intensity distribution
- Compute CVs across replicates and plot the distribution with matplotlib
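The detection, filtering, and per-sample steps above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the marker column sets are assumptions about typical MaxQuant and DIA-NN headers and may need adjusting for your tool versions, and the function names (`detect_format`, `read_table`, `split_maxquant`, `per_sample_metrics`) are hypothetical. It also assumes MaxQuant sample columns are named `Intensity <sample>` and that a value of 0 or an empty cell means "missing".

```python
import csv

# Assumed marker columns; verify against the headers your tool versions emit.
MAXQUANT_MARKERS = {"Protein IDs", "Reverse", "Potential contaminant"}
DIANN_MARKERS = {"Protein.Group", "Precursor.Id"}


def detect_format(columns):
    """Guess the search engine from the set of column headers."""
    cols = set(columns)
    if MAXQUANT_MARKERS <= cols:
        return "maxquant"
    if DIANN_MARKERS <= cols:
        return "diann"
    raise ValueError("Unrecognised column headers: cannot detect input format")


def read_table(lines):
    """Parse tab-separated lines (an open file or a list of strings)."""
    reader = csv.DictReader(lines, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def split_maxquant(rows):
    """Separate target proteins from contaminants and reverse hits."""
    reverse = [r for r in rows if r.get("Reverse") == "+"]
    contaminants = [
        r for r in rows
        if r.get("Potential contaminant") == "+" and r.get("Reverse") != "+"
    ]
    targets = [
        r for r in rows
        if r.get("Reverse") != "+" and r.get("Potential contaminant") != "+"
    ]
    return targets, contaminants, reverse


def per_sample_metrics(rows, columns):
    """Protein count and missing-value percentage for each intensity column."""
    # The trailing space skips MaxQuant's summed "Intensity" column.
    intensity_cols = [c for c in columns if c.startswith("Intensity ")]
    metrics = {}
    for col in intensity_cols:
        values = [float(r[col] or 0) for r in rows]
        detected = sum(v > 0 for v in values)
        missing = 100.0 * (len(values) - detected) / len(values) if values else 0.0
        metrics[col] = {"proteins": detected, "missing_pct": missing}
    return metrics
```

In use, `read_table(open(path, newline=""))` would feed `detect_format`, and only the `targets` list from `split_maxquant` would be passed to `per_sample_metrics`, so that decoys and contaminants do not inflate the identification count.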
Domain Decisions for SKILL.md
- Remove reverse hits and contaminants before counting protein identifications
- Missing value threshold: warn above 20%, fail above 40% per sample
- CV threshold: median CV below 20% is acceptable for label-free quantification
- Contaminant fraction: warn above 2%, fail above 5%
- Report contaminant identities (keratins, BSA, trypsin) individually
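One way to encode these thresholds is a small grading helper plus a CV calculation, sketched below. Several choices here are assumptions rather than part of the decisions above: CVs use the sample standard deviation and skip proteins with a missing (zero) value in any replicate, a median CV of 20% or more is graded as a warning (the list only says that below 20% is acceptable), and the contaminant percentage is taken as an input because the list does not say whether it is measured by entry count or by intensity. The function names are hypothetical.

```python
import statistics


def protein_cvs(intensity_rows):
    """CV in percent for each protein, given one list of replicate intensities per protein.

    Assumption: proteins with fewer than two replicates or any zero value are skipped.
    """
    cvs = []
    for values in intensity_rows:
        if len(values) < 2 or any(v <= 0 for v in values):
            continue
        cvs.append(100.0 * statistics.stdev(values) / statistics.mean(values))
    return cvs


def grade(value, warn, fail):
    """Map a percentage onto pass / warn / fail using strict 'above' comparisons."""
    if value > fail:
        return "fail"
    if value > warn:
        return "warn"
    return "pass"


def qc_verdict(missing_pct_by_sample, median_cv_pct, contaminant_pct):
    """Return the overall verdict and the individual check results."""
    checks = {
        f"missing values ({name})": grade(pct, 20, 40)
        for name, pct in missing_pct_by_sample.items()
    }
    checks["contaminant fraction"] = grade(contaminant_pct, 2, 5)
    # Assumption: a median CV of 20% or more is a warning, not a failure.
    checks["median CV"] = "pass" if median_cv_pct < 20 else "warn"
    severity = ["pass", "warn", "fail"]
    overall = max(checks.values(), key=severity.index)
    return overall, checks
```

The overall verdict is simply the worst individual result, which keeps the report easy to audit: every warning or failure can be traced to a named check. The list returned by `protein_cvs` can be summarised with `statistics.median` and is also the natural input for the CV distribution plot.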
Demo Data
Create a synthetic proteinGroups.txt with 500 protein entries. Include columns: Protein IDs, Gene names, Intensity Sample1 through Sample6, Reverse, Potential contaminant. Set 10% of intensity values to 0 (missing), add 15 contaminant entries (CON__ prefix), and 10 reverse hits (REV__ prefix).
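A generator along these lines could produce the file. It is a sketch under stated assumptions: the 15 contaminants and 10 reverse hits are added on top of the 500 entries (525 rows in total), exactly 10% of all intensity cells are zeroed at random, sample columns are named `Intensity Sample1` to `Intensity Sample6`, and the accessions, gene names, and log-normal intensity parameters are invented placeholders with no biological meaning.

```python
import csv
import random

SAMPLES = [f"Intensity Sample{i}" for i in range(1, 7)]
COLUMNS = ["Protein IDs", "Gene names", *SAMPLES, "Reverse", "Potential contaminant"]


def make_rows(n_targets=500, n_contaminants=15, n_reverse=10,
              missing_fraction=0.10, seed=42):
    """Build synthetic proteinGroups rows; all identifiers are placeholders."""
    rng = random.Random(seed)
    rows = []

    def add(prefix, index, reverse, contaminant):
        base = rng.lognormvariate(16, 1.5)  # arbitrary per-protein abundance
        row = {
            "Protein IDs": f"{prefix}P{index:05d}",
            "Gene names": f"GENE{index}",
            "Reverse": reverse,
            "Potential contaminant": contaminant,
        }
        for sample in SAMPLES:
            # Small multiplicative noise so replicates are similar but not identical.
            row[sample] = round(base * rng.lognormvariate(0, 0.15))
        rows.append(row)

    for i in range(n_targets):
        add("", i, "", "")
    for i in range(n_contaminants):
        add("CON__", n_targets + i, "", "+")
    for i in range(n_reverse):
        add("REV__", n_targets + n_contaminants + i, "+", "")

    # Zero out a fixed fraction of intensity cells to simulate missing values.
    cells = [(r, s) for r in range(len(rows)) for s in SAMPLES]
    for r, s in rng.sample(cells, round(missing_fraction * len(cells))):
        rows[r][s] = 0
    return rows


def write_protein_groups(path, rows):
    """Write rows as a tab-separated file with a MaxQuant-style header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_protein_groups("proteinGroups.txt", make_rows())` writes the demo file. The fixed seed keeps the output reproducible, which helps when the QC report is checked in automated tests.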
Stretch Goals
- Add a sample correlation heatmap (Pearson r across all pairs)
- Support TMT/iTRAQ reporter ion data
- Detect batch effects by PCA of intensity profiles
- Compare QC metrics against a reference dataset of published standards
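As a starting point for the correlation stretch goal, pairwise Pearson r can be computed without extra dependencies. The sketch below makes two assumptions that the goal itself does not specify: intensities are log2-transformed before correlating, and a protein is used for a given pair only when it has a non-zero value in both samples. The function names are hypothetical, and a sample with constant values would cause a division by zero that a real skill should guard against.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(samples):
    """Pairwise Pearson r from a dict of sample name -> intensities in the same protein order."""
    names = list(samples)
    matrix = {(name, name): 1.0 for name in names}
    for a, b in combinations(names, 2):
        # Assumption: keep only proteins quantified in both samples.
        pairs = [(x, y) for x, y in zip(samples[a], samples[b]) if x > 0 and y > 0]
        r = pearson([math.log2(x) for x, _ in pairs],
                    [math.log2(y) for _, y in pairs])
        matrix[(a, b)] = matrix[(b, a)] = r
    return matrix
```

The resulting dictionary can be reshaped into a square grid and passed to a plotting library to draw the heatmap.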