Outbreak Phylogenetic Clusterer
Track: E - Epidemiology
Difficulty: Intermediate
Time estimate: 2-4 hours
What You'll Build
A ClawBio skill that takes a FASTA file of consensus sequences (e.g. from a pathogen outbreak), builds a neighbour-joining phylogenetic tree, identifies transmission clusters based on genetic distance, and produces a tree figure and a cluster assignment table with a timeline.
Why This Matters
During infectious disease outbreaks, phylogenetic analysis reveals which cases are linked by transmission and which are independent introductions. This information guides public health response: contact tracing, quarantine decisions, and intervention targeting.
Inputs and Outputs
Input: A multi-FASTA file of aligned consensus sequences with date metadata in headers
Output: Newick tree file, cluster assignment TSV (sequence ID, cluster number, date), tree visualisation PNG, and a timeline figure showing clusters over time
Key APIs / Data Sources
- BioPython - sequence parsing, alignment, tree construction
- No external API needed; all computation is local
- Optional: Nextstrain datasets for validation
Getting Started
- Create your skill folder: skills/outbreak-clusterer/
- Parse the FASTA file using BioPython's SeqIO
- Compute a pairwise distance matrix using normalised Hamming distance (p-distance: the proportion of differing sites); keep the raw SNP counts as well, since the clustering threshold is expressed in SNPs
- Build a neighbour-joining tree using BioPython's DistanceTreeConstructor
- Define clusters: sequences within a genetic distance threshold (e.g. < 5 SNPs) belong to the same cluster
- Draw the tree using BioPython's Phylo.draw or matplotlib
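The parsing, distance and tree-building steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names snp_distance_matrix, build_nj_tree and to_newick are hypothetical, distances are raw SNP counts (divide by the alignment length to get proportions), and ambiguous bases such as N are counted as ordinary mismatches, which a real skill would probably want to mask.

```python
from io import StringIO

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor


def snp_distance_matrix(records):
    """Build a BioPython DistanceMatrix of pairwise SNP counts.

    `records` is a list of SeqRecord objects (e.g. from SeqIO.parse)
    that are assumed to be pre-aligned and of equal length.
    """
    names = [rec.id for rec in records]
    seqs = [str(rec.seq).upper() for rec in records]
    rows = []
    for i in range(len(seqs)):
        # DistanceMatrix expects a lower-triangular layout, diagonal included.
        row = [sum(a != b for a, b in zip(seqs[i], seqs[j])) for j in range(i)]
        rows.append(row + [0])
    return DistanceMatrix(names, rows)


def build_nj_tree(records):
    """Construct a neighbour-joining tree from the SNP distance matrix."""
    return DistanceTreeConstructor().nj(snp_distance_matrix(records))


def to_newick(tree):
    """Serialise a tree to Newick text."""
    buffer = StringIO()
    Phylo.write(tree, buffer, "newick")
    return buffer.getvalue()
```

With records loaded via list(SeqIO.parse("outbreak.fasta", "fasta")), where the file name is only a placeholder, to_newick(build_nj_tree(records)) gives the Newick text to save, and Phylo.draw(tree) renders the tree when matplotlib is installed.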
Domain Decisions for SKILL.md
- Clustering threshold: 5 SNPs for SARS-CoV-2 (at roughly 2 substitutions per genome per month, about 2 to 3 months of evolution along a single lineage); make the threshold configurable for other pathogens
- Require sequences to be pre-aligned (same length); reject input if lengths differ
- Extract dates from FASTA headers in format >SAMPLE_ID|YYYY-MM-DD
- Label clusters as Cluster_1, Cluster_2, etc., ordered by earliest sample date
- Singleton sequences (not within threshold of any other) are labelled "Sporadic"
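One way these decisions might translate into code is sketched below, using only the standard library. It rests on an assumption the text does not spell out: clusters are treated as single-linkage groups, so a sequence joins a cluster if it is within the threshold of any member. The names parse_header, snp_count and assign_clusters are hypothetical.

```python
from datetime import date


def parse_header(header):
    """Split 'SAMPLE_ID|YYYY-MM-DD' into (sample_id, date)."""
    sample_id, _, day = header.lstrip(">").partition("|")
    return sample_id, date.fromisoformat(day)


def snp_count(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_clusters(samples, threshold=5):
    """Map sample_id -> (label, date) for a dict of header -> sequence.

    Assumption: single-linkage clustering, linking pairs with
    fewer than `threshold` SNP differences.
    """
    if len({len(seq) for seq in samples.values()}) > 1:
        raise ValueError("sequences must be pre-aligned to the same length")

    headers = list(samples)
    parent = list(range(len(headers)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(headers)):
        for j in range(i):
            if snp_count(samples[headers[i]], samples[headers[j]]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, header in enumerate(headers):
        groups.setdefault(find(i), []).append(parse_header(header))

    # Number multi-member clusters by their earliest sample date.
    clusters = sorted(
        (g for g in groups.values() if len(g) > 1),
        key=lambda g: min(day for _, day in g),
    )
    labels = {}
    for number, group in enumerate(clusters, start=1):
        for sample_id, day in group:
            labels[sample_id] = (f"Cluster_{number}", day)
    for group in groups.values():
        if len(group) == 1:
            sample_id, day = group[0]
            labels[sample_id] = ("Sporadic", day)
    return labels
```

Single linkage can chain distant cases together through intermediates; a stricter rule (for example, requiring every pair in a cluster to be within the threshold) is a reasonable alternative to document in SKILL.md.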
Demo Data
Generate 20 synthetic SARS-CoV-2-like sequences (29,903 bp). Create 3 clusters: Cluster A (8 sequences, 0-3 SNP differences, dates in week 1-2), Cluster B (7 sequences, 0-4 SNP differences, dates in week 2-3), Cluster C (3 sequences, 0-2 SNP differences, dates in week 4), plus 2 singletons. Introduce 50+ SNP differences between clusters.
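A possible generator for this demo set is sketched below. Several details are assumptions made for illustration: the start date, the random seed, the singleton dates, the header naming (A_1, B_1, S1_1, ...) and the use of 60 founder mutations per group, which puts any two groups about 120 SNPs apart. Within each cluster, members vary only at a small fixed set of sites, which caps pairwise differences at the stated maximum.

```python
import random
from datetime import date, timedelta

GENOME_LENGTH = 29_903
BASES = "ACGT"
FOUNDER_SNPS = 60  # assumed; gives well over 50 SNPs between groups

# (name, n_sequences, max within-group SNPs, first day offset, last day offset)
GROUPS = [
    ("A", 8, 3, 0, 13),    # weeks 1-2
    ("B", 7, 4, 7, 20),    # weeks 2-3
    ("C", 3, 2, 21, 27),   # week 4
    ("S1", 1, 0, 3, 3),    # singleton, date chosen arbitrarily
    ("S2", 1, 0, 17, 17),  # singleton, date chosen arbitrarily
]


def mutate(seq, sites, rng):
    """Return a copy of seq with a different base at each listed site."""
    out = list(seq)
    for pos in sites:
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def make_demo(seed=42, start=date(2021, 1, 4)):
    """Return a dict of 'ID|YYYY-MM-DD' header -> synthetic sequence."""
    rng = random.Random(seed)
    reference = "".join(rng.choice(BASES) for _ in range(GENOME_LENGTH))
    # Draw all mutated sites up front so no two groups share a site.
    pool = rng.sample(range(GENOME_LENGTH),
                      sum(FOUNDER_SNPS + g[2] for g in GROUPS))
    records = {}
    for name, size, max_snps, first_day, last_day in GROUPS:
        founder_sites = [pool.pop() for _ in range(FOUNDER_SNPS)]
        variable_sites = [pool.pop() for _ in range(max_snps)]
        founder = mutate(reference, founder_sites, rng)
        variant = mutate(founder, variable_sites, rng)
        for i in range(1, size + 1):
            seq = list(founder)
            for pos in variable_sites:
                # Each member carries a random subset of the variable sites.
                if rng.random() < 0.5:
                    seq[pos] = variant[pos]
            day = start + timedelta(days=rng.randint(first_day, last_day))
            records[f"{name}_{i}|{day.isoformat()}"] = "".join(seq)
    return records


def to_fasta(records):
    """Render the records as FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records.items())
```

Writing to_fasta(make_demo()) to a file gives a pre-aligned input with the header format the skill expects; fixing the seed keeps the demo reproducible.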
Stretch Goals
- Add bootstrap support values to internal nodes
- Detect potential super-spreader events (nodes with high branching)
- Integrate epidemiological metadata (location, age group) into the visualisation
- Support maximum likelihood tree building using IQ-TREE (external tool)