ClawBio Workshop: Hands-on with AI for Variant Interpretation and GWAS
Dr Manuel Corpas · University of Westminster · 22 April 2026
Biology has become a data-saturated science. The tools to analyse it have not kept up.
A single human genome produces three billion base pairs. Clinical interpretation requires alignment, variant calling, functional annotation, literature cross-referencing, and synthesis into actionable reports. Each step depends on specialised software with its own dependencies, version histories, and configuration requirements.
Workflow managers (Nextflow, Snakemake, Galaxy) emerged because manual orchestration is unsustainable. Yet even with these frameworks, a biologist who wants to analyse their own sequencing data must either learn to code, hire someone who can, or rely on graphical interfaces that may not support the analysis they need.
Modern LLMs can write, debug, and execute code. They can plan multi-step operations, adapt based on intermediate results, and coordinate dozens of tools.
But a general-purpose LLM generating bioinformatics code from scratch is unreliable. Output varies between sessions. It lacks the specificity that domain experts build into workflows over years. It hallucinates gene-drug associations and invents variant classifications.
The problem is not the model. The problem is the harness: what constrains the model, what tools it can call, what guardrails prevent silent errors. Agentic engineering is the discipline of building that harness.
The first wave of LLMs in the life sciences was information retrieval: summarising papers, answering questions about pathways, extracting structured data from text. Useful but incremental.
The second wave is qualitatively different. When connected to file systems, databases, and command-line tools, LLMs become autonomous agents that plan multi-step operations, execute them, and adapt based on intermediate results.
The shift is from AI that tells you things to AI that does things. The researcher's role moves from constructing the analysis to evaluating it.
Agentic genomics is the use of autonomous AI agents, powered by large language models and operating within domain-constrained skill libraries, to discover, plan, execute, and iteratively refine multi-step genomic analyses, where the agent exercises runtime decision-making over tool selection, parameterisation, error handling, and output evaluation.
Corpas, Fatumo, Guio. "Agentic Genomics: From Pipeline Automation to Autonomous Validation."
Four necessary conditions: autonomy (runtime decisions), domain constraint (skill libraries, not ad hoc code), iterative refinement (error diagnosis and self-repair), and natural language mediation (no programming required).
Human-directed bottleneck: code production
Agent-mediated bottleneck: validation and judgement
| Metric | Human-directed | Agent-mediated |
|---|---|---|
| Setup time | 2-4 hours | 5-15 minutes |
| Monitoring required | Continuous | Minimal (agent handles errors) |
| Reproducibility | Variable | High (skill specification fixed) |
| Error recovery | Manual debugging | Automated diagnosis and repair |
| Prerequisite expertise | Bioinformatics training | Domain knowledge for validation |
| Primary failure mode | Config errors, version conflicts | Silent plausible-looking errors |
Comparative analysis across exome and scRNA-seq workflows (Corpas, Fatumo, Guio).
An open-source toolkit of AI agent skills for genomic analysis.
A skill is a self-contained, versioned unit of bioinformatics functionality: a SKILL.md contract that encapsulates code, configuration, data references, I/O specifications, and test suites. The AI reads the contract; it never overrides it.
VEP, ClinVar, gnomAD, ACMG classification, pharmacogenomics via CPIC. One command, structured report.
Query 9 databases in parallel, compute polygenic risk scores, fine-map loci with SuSiE. All from summary statistics.
HEIM equity scorer measures how well a dataset represents diverse populations. Flags ancestry bias in analyses.
Today you will use two of these. Session 1: annotate a real genome and discover clinical findings. Session 2: run a GWAS lookup, compute PRS, and fine-map a locus. All in Google Colab, for free.
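Before Session 2, it may help to see what a polygenic risk score is at its simplest: a weighted sum of effect-allele dosages, with weights taken from GWAS summary statistics. The sketch below is illustrative only and is not ClawBio's implementation; the function name, variant IDs, and numbers are all hypothetical. It also returns how many variants were actually matched, because a score computed from a small fraction of the intended variants is one way results can look plausible while being wrong.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Additive PRS: sum of (effect size x effect-allele dosage) over shared variants.

    effect_sizes: dict of variant ID -> per-allele effect (e.g. beta or log odds ratio)
    dosages:      dict of variant ID -> effect-allele count for one person (0, 1 or 2)
    Returns (score, number_of_variants_used).
    """
    # Only variants present in both the weights and the genotype can contribute.
    shared = effect_sizes.keys() & dosages.keys()
    score = sum(effect_sizes[v] * dosages[v] for v in shared)
    return score, len(shared)


# Hypothetical toy data: four weighted variants, one of them not genotyped.
weights = {"var1": 0.5, "var2": -0.25, "var3": 0.125, "var4": 1.0}
genotype = {"var1": 2, "var2": 1, "var3": 0}

score, used = polygenic_risk_score(weights, genotype)
print(f"PRS = {score} using {used} of {len(weights)} variants")
# PRS = 0.75 using 3 of 4 variants
```

Real scores also require matching the effect allele and strand between the summary statistics and the genotype, and a raw score is only interpretable against a reference distribution from a comparable population.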
Agentic genomics lowers the barrier to generating analyses. It does not lower the barrier to evaluating them.
The primary failure mode: results that look correct but are not. One skill returned "all normal" pharmacogenomics for an empty input file.
Agents can cite non-existent gene-disease associations, fabricate references, or generate variant annotations that conflate unrelated loci.
Agents default to European-ancestry resources, and 86% of GWAS data comes from participants of European ancestry. Agentic genomics can automate existing biases at scale.
Domain expertise built over years cannot be shortcut by AI. A novice user will not catch the errors an experienced analyst would.
The democratisation is real, but partial. It expands the capacity to produce; it does not expand the capacity to judge.
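The empty-input failure described above is the kind of silent error a small guardrail can catch before any report is written. The following is a minimal sketch, assuming the input arrives as VCF-formatted text; the function names and the toy record are hypothetical and are not part of ClawBio.

```python
def count_variant_records(vcf_text):
    """Count data lines in VCF-formatted text, skipping blank lines and '#' header lines."""
    return sum(
        1
        for line in vcf_text.splitlines()
        if line.strip() and not line.startswith("#")
    )


def require_variants(vcf_text):
    """Return the record count, or raise instead of letting an empty input pass silently."""
    n_records = count_variant_records(vcf_text)
    if n_records == 0:
        raise ValueError("No variant records found; refusing to report a result.")
    return n_records


# Hypothetical toy inputs: a header-only file and the same file with one record.
header_only = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
one_record = header_only + "1\t12345\t.\tG\tA\t.\tPASS\t.\n"

print(require_variants(one_record))  # 1

try:
    require_variants(header_only)
except ValueError as err:
    print(f"Stopped: {err}")
```

The design choice is to fail loudly: "no variants were checked" and "no findings among the variants checked" are different statements, and only the second one supports a report that says "all normal".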
The infrastructure barrier is gone.
Everything runs in Google Colab on a free tier.
Summary statistics are publicly released. No application, no waiting.
ClawBio wraps the full pipeline. One command per analysis.
Google Colab is free. ClawBio is MIT-licensed. All databases are public.
A researcher in Lima, Kampala, or Dhaka can run the same analyses as one at the Broad Institute. Today.
This workshop builds on research developed with Prof Segun Fatumo (Queen Mary University of London / PHURI) around removing the barriers that keep genomics concentrated in wealthy institutions:
Replaced by Google Colab. Zero setup, zero cost, runs anywhere with a browser.
Public summary statistics cover most common analyses. No institutional gatekeeping.
Open workshops, open materials, open code. Researchers anywhere can run publication-quality analyses.
Hands-on
Open Google Colab now
| Session | What you'll do | Time |
|---|---|---|
| Introduction | What is agentic genomics, how ClawBio works, what we will cover today | 10 min |
| Session 1: Variant Interpretation | Annotate a real human genome (the Corpasome). Discover Factor V Leiden, CFTR carrier status, warfarin sensitivity, APOE risk, and haemochromatosis. VEP, ClinVar, gnomAD, CPIC. | 30 min |
| Session 2: GWAS | Query variants across 9 federated databases. Compute polygenic risk scores. Fine-map a GWAS locus with SuSiE. Explore cross-ancestry differences. | 30 min |
| Q&A | Discussion and questions | 20 min |
Requirements: A Google account and a web browser. Nothing to install. No API keys. No payment.
Open the tutorial materials in a new tab; the tutorial links are listed in the resources table below.
Click the Colab link in the tutorial page to open the notebook. Then click "Copy to Drive" and follow along.
| Resource | Link |
|---|---|
| ClawBio GitHub | github.com/ClawBio/ClawBio |
| Documentation | docs.clawbio.ai |
| Variant Interpretation tutorial | docs.clawbio.ai/tutorials/variant-interpretation-workshop |
| GWAS tutorial | docs.clawbio.ai/tutorials/gwas-workshop |
| Corpasome (Zenodo) | doi:10.5281/zenodo.19297389 |
| WhatsApp group | |
| Discord | Discord |