GWAS Workshop¶
No installation. No terminal. No API keys. No cost. Everything runs in your browser via Google Colab. All you need is a Google account.
Launch in Google Colab | View slides
Runs in Google Colab
All code in this workshop runs in Google Colab. Open a new notebook, clone ClawBio (`!git clone https://github.com/ClawBio/ClawBio.git`, then `%cd ClawBio` and `!pip install -q -r requirements.txt`), and follow along.
Builds on the Variant Interpretation Workshop
This is the second workshop in the series. In the Variant Interpretation Workshop you annotated a single genome and found clinically relevant variants. Now we scale up: from one person to population-level analysis using GWAS summary statistics. You will query the same variant (rs429358/APOE) you found before, but this time across thousands of genomes and nine databases.
What's new compared to the Variant Interpretation Workshop¶
| | Variant Interpretation | This workshop (GWAS) |
|---|---|---|
| Scope | One genome | Thousands of genomes |
| Data type | Individual genotypes (23andMe) | Population summary statistics |
| Analysis | VEP, ClinVar, gnomAD, CPIC | GWAS Catalog, PGS Catalog, SuSiE |
| Question | "What do my variants mean?" | "Which variants cause disease in the population?" |
| New skills | vcf-annotator, pharmgx-reporter | gwas-lookup, gwas-prs, fine-mapping |
| Equity angle | Database representation bias | Cross-ancestry transferability of GWAS findings |
Part 1: What is a GWAS?¶
Association testing at scale¶
A genome-wide association study (GWAS) tests every common variant in the genome for statistical association with a trait or disease, across thousands or millions of participants. The output is a set of summary statistics: per-variant effect sizes, standard errors, and p-values.
| Field | Meaning | Example |
|---|---|---|
| rsid | Variant identifier | rs7903146 |
| beta | Effect size (log-odds or per-allele) | 0.31 |
| se | Standard error of beta | 0.02 |
| p | P-value for association | 5.2 × 10^-38 |
| MAF | Minor allele frequency | 0.28 |
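The fields above are linked: the test statistic is `beta / se`, and the p-value follows from it under a normal approximation. The sketch below shows that relationship using only the Python standard library. The function name and the example values are illustrative and not taken from any real study, and the odds ratio is only meaningful if `beta` is a log-odds ratio (a binary trait).

```python
import math

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional GWAS threshold


def describe_association(beta: float, se: float) -> dict:
    """Derive z-score, two-sided p-value, and odds ratio with 95% CI."""
    z = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return {
        "z": z,
        "p": p,
        # Assumes beta is a log-odds ratio (binary trait)
        "odds_ratio": math.exp(beta),
        "ci95": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
        "genome_wide_significant": p < GENOME_WIDE_SIGNIFICANCE,
    }


# Hypothetical record, for illustration only
result = describe_association(beta=0.10, se=0.02)
for key, value in result.items():
    print(key, value)
```

A variant with a z-score of 5 looks strong in isolation, but it does not pass the genome-wide threshold, which corrects for the roughly one million independent tests in a typical GWAS.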
Why summary statistics matter
Summary statistics are public, free, and sufficient to run polygenic risk scores, meta-analyses, and fine-mapping. No HPC infrastructure, no data access agreements, no individual-level data required. A researcher in Lima, Kampala, or Dhaka can run the same analyses as one at the Broad Institute.
The ancestry gap¶
Over 86% of GWAS participants are of European ancestry. Effect sizes, allele frequencies, and linkage disequilibrium patterns differ between populations. Polygenic risk scores trained on European cohorts perform poorly in African and South Asian populations. This is not just a technical limitation; it risks widening health disparities.
ClawBio addresses this by querying multiple biobanks in a single call: UK Biobank (multi-ancestry), FinnGen (Finnish), and Biobank Japan (East Asian), alongside the GWAS Catalog and Open Targets.
Part 2: Three ClawBio skills¶
This workshop uses three ClawBio skills that cover the full GWAS analysis workflow.
Skill 1: GWAS Lookup¶
Give it an rsID. It queries nine databases in parallel and returns a unified report in seconds.
| Database | What it returns | Ancestry coverage |
|---|---|---|
| GWAS Catalog | Published trait associations | Mixed |
| Open Targets | Credible sets, locus-to-gene scores | Mixed |
| UKB-TOPMed PheWeb | PheWAS across 4,500 phenotypes | Multi-ancestry |
| FinnGen r12 | Finnish disease endpoints | Finnish |
| Biobank Japan | East Asian PheWAS | Japanese |
| GTEx v8 | eQTL tissue expression | Mostly European |
| EBI eQTL Catalogue | Multi-tissue eQTL associations | Mixed |
| LocusZoom | Regional association context | Both builds |
| Ensembl | Variant resolution, consequence | Reference |
The skill can also be run as a demo with pre-fetched data.
Skill 2: Polygenic Risk Scores (PRS)¶
A PRS sums the effects of many variants into a single risk estimate:
PRS = sum(dosage_i x effect_weight_i) across all matched variants.
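As a minimal sketch of that sum (not ClawBio's implementation), the function below multiplies each effect-allele dosage (0, 1, or 2) by its weight and skips variants that are missing from either side. The function name, the rsID placeholders, and the weights are all hypothetical.

```python
from typing import Dict, Tuple


def compute_prs(dosages: Dict[str, float], weights: Dict[str, float]) -> Tuple[float, int]:
    """Return the weighted dosage sum and the number of matched variants."""
    # Only variants present in both the genotype data and the scoring file count
    matched = dosages.keys() & weights.keys()
    score = sum(dosages[rsid] * weights[rsid] for rsid in matched)
    return score, len(matched)


# Hypothetical genotype dosages (copies of the effect allele) and weights
dosages = {"rsA": 2, "rsB": 1, "rsC": 0}
weights = {"rsA": 0.12, "rsB": -0.05, "rsD": 0.30}

score, n_matched = compute_prs(dosages, weights)
print(f"PRS = {score:.2f} from {n_matched} of {len(weights)} scoring variants")
```

Reporting the number of matched variants alongside the score matters: a score built from a small fraction of the scoring file is not comparable to one built from all of it.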
ClawBio ships with 6 curated scores from the PGS Catalog for instant demos:
| Trait | PGS ID | Variants |
|---|---|---|
| Type 2 diabetes | PGS000013 | 8 |
| Coronary artery disease | PGS000004 | 46 |
| Breast cancer | PGS000001 | 77 |
| Prostate cancer | PGS000057 | 147 |
| Atrial fibrillation | PGS000011 | 12 |
| BMI | PGS000039 | 97 |
Risk categories: Low (<20th percentile), Average (20-80th), Elevated (80-95th), High (>95th).
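These bands can be written as a small lookup. The sketch below is illustrative only: the ranges above do not say which band the 80th and 95th percentiles themselves belong to, so it assumes 80 and 95 both count as Elevated.

```python
def risk_category(percentile: float) -> str:
    """Map a PRS percentile (0-100) to a risk band."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile < 20:
        return "Low"
    if percentile < 80:
        return "Average"
    if percentile <= 95:  # assumption: exact 80 and 95 fall in Elevated
        return "Elevated"
    return "High"


for pct in (10, 50, 90, 99):
    print(pct, risk_category(pct))
```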
Skill 3: SuSiE Fine-Mapping¶
GWAS finds associated regions. Fine-mapping narrows each region down to the variants most likely to be causal.
A single GWAS signal can contain 10-200 correlated SNPs in high linkage disequilibrium. SuSiE (Sum of Single Effects) applies iterative Bayesian stepwise selection to produce:
- Credible sets: the smallest set of SNPs that contains a causal variant with at least 95% posterior probability
- PIPs: a posterior inclusion probability for each variant, i.e. the probability that it is causal
- Multiple signals: handles multiple independent causal variants per locus
All of this works from summary statistics plus a linkage disequilibrium (LD) matrix, which can be estimated from a reference panel. No individual-level data needed.
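SuSiE itself is too involved to reproduce in a few lines, but PIPs and credible sets can be illustrated with a much simpler technique: Wakefield's approximate Bayes factor under the assumption of exactly one causal variant in the locus. The sketch below is not SuSiE. It ignores LD and cannot separate multiple signals. The prior variance of 0.04 and the four example variants are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple


def log_abf(beta: float, se: float, prior_var: float = 0.04) -> float:
    """Log approximate Bayes factor (association vs. null) for one variant."""
    v = se ** 2
    z2 = (beta / se) ** 2
    shrink = prior_var / (v + prior_var)
    return 0.5 * math.log(v / (v + prior_var)) + 0.5 * z2 * shrink


def credible_set(
    stats: Dict[str, Tuple[float, float]], coverage: float = 0.95
) -> Tuple[Dict[str, float], List[str]]:
    """Return per-variant PIPs and the smallest set reaching the coverage."""
    logs = {rsid: log_abf(beta, se) for rsid, (beta, se) in stats.items()}
    # Subtract the maximum before exponentiating for numerical stability
    top = max(logs.values())
    rel = {rsid: math.exp(value - top) for rsid, value in logs.items()}
    total = sum(rel.values())
    pips = {rsid: value / total for rsid, value in rel.items()}

    selected, cumulative = [], 0.0
    for rsid in sorted(pips, key=pips.get, reverse=True):
        selected.append(rsid)
        cumulative += pips[rsid]
        if cumulative >= coverage:
            break
    return pips, selected


# Hypothetical locus: (beta, se) per variant
stats = {
    "snp1": (0.300, 0.03),
    "snp2": (0.295, 0.03),
    "snp3": (0.050, 0.03),
    "snp4": (0.010, 0.03),
}
pips, cs = credible_set(stats)
print({rsid: round(pip, 3) for rsid, pip in pips.items()})
print("95% credible set:", cs)
```

Here the two top variants have nearly identical evidence, so neither reaches 95% alone and both end up in the credible set. This is the typical situation for SNPs in high LD.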
Part 3: How to run the workshop¶
What you need¶
- A Google account (for Google Colab)
- A web browser
- Nothing else. No terminal, no installation, no API keys, no payment.
Step-by-step guide¶
Step 1: Setup (2 minutes)¶
Install ClawBio in Colab, exactly as in the Variant Interpretation Workshop: click play on the first two cells.
Step 2: GWAS Lookup (5 minutes)¶
Query rs7903146 (the strongest common Type 2 diabetes signal, in TCF7L2) across all nine databases. Examine the unified report: trait associations, PheWAS hits, eQTL data, and credible set membership.
Try additional variants:
| rsID | Gene | Trait | Why it matters |
|---|---|---|---|
| rs7903146 | TCF7L2 | Type 2 diabetes | Strongest common T2D signal. Odds ratio (OR) of about 1.4 per risk allele. |
| rs429358 | APOE | Alzheimer's | You found this in the Variant Interpretation Workshop. Now see the GWAS context. |
| rs3798220 | LPA | Cardiovascular | Lipoprotein(a), an independent risk factor for coronary events. |
| rs1801282 | PPARG | Type 2 diabetes | Drug target for thiazolidinediones (pioglitazone). |
Step 3: Cross-ancestry comparison (3 minutes)¶
Compare allele frequencies and effect sizes for your variant across UK Biobank, FinnGen, and Biobank Japan. Note how both differ between populations.
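One common way to quantify whether effect sizes genuinely differ between cohorts is Cochran's Q heterogeneity statistic with the derived I² value. The sketch below is a generic illustration rather than part of the ClawBio report. The cohort labels and numbers are made up and are not results from UK Biobank, FinnGen, or Biobank Japan.

```python
from typing import List, Tuple


def cochran_q(estimates: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (pooled beta, Cochran's Q, I^2 in percent) for (beta, se) pairs."""
    # Inverse-variance weights: more precise estimates count for more
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / sum(weights)
    q = sum(w * (beta - pooled) ** 2 for w, (beta, _) in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of variation beyond what sampling error alone would produce
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical per-cohort (beta, se) for one variant
cohorts = {
    "cohort_A": (0.30, 0.02),
    "cohort_B": (0.28, 0.03),
    "cohort_C": (0.15, 0.04),
}
pooled, q, i_squared = cochran_q(list(cohorts.values()))
print(f"pooled beta = {pooled:.3f}, Q = {q:.2f}, I^2 = {i_squared:.0f}%")
```

Under the null hypothesis of no heterogeneity, Q follows a chi-square distribution with (number of cohorts minus 1) degrees of freedom. With three cohorts, a Q above about 5.99 indicates heterogeneity at the 5% level.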
Step 4: Polygenic Risk Scores (5 minutes)¶
Compute PRS for 6 traits using the Corpasome (Manuel Corpas's 23andMe data). The tool matches genotyped variants to published PGS Catalog scoring files and estimates percentile rank against reference populations.
Step 5: Fine-Mapping (5 minutes)¶
Run SuSiE on a demo locus containing 200 variants with 2 independent causal signals. Examine the credible sets, PIPs, and the locus plot showing which variants are most likely causal.
Take-home messages¶
- GWAS summary statistics are free and public. You do not need individual-level data to do meaningful population-level research.
- Three ClawBio skills cover the full workflow: lookup, PRS, and fine-mapping.
- Cross-ancestry analysis is not optional. Most GWAS are European-biased. Always query multiple biobanks to check transferability.
- Fine-mapping narrows GWAS hits to likely causal variants. SuSiE credible sets are a widely used way to report them.
- Infrastructure is no longer a barrier. Google Colab + ClawBio = publication-quality GWAS analysis, for free, anywhere in the world.
Resources¶
| Resource | Link |
|---|---|
| GWAS workshop slides | clawbio.ai/workshop-gwas-slides.html |
| Variant Interpretation Workshop | Previous workshop |
| ClawBio GitHub | github.com/ClawBio/ClawBio |
| GWAS Catalog | ebi.ac.uk/gwas |
| PGS Catalog | pgscatalog.org |
| Open Targets Genetics | genetics.opentargets.org |
| Corpasome dataset (Zenodo) | doi:10.5281/zenodo.19297389 |
| Ensembl VEP | ensembl.org/vep |
| gnomAD | gnomad.broadinstitute.org |
| CPIC Guidelines | cpicpgx.org |
What's next¶
You have now analysed variants at the individual level (Workshop 1) and the population level (this workshop). The final workshop goes deeper into the same individual genome:
| Workshop | What it adds |
|---|---|
| 30x WGS Workshop | Analyse Manuel Corpas's genome at full 30x whole-genome sequencing resolution. Discover 8,925 structural variants, 912K indels, and copy number changes that are invisible to both SNP arrays and summary statistics. Compare WGS findings with the SNP chip results from Workshop 1. |