GWAS Workshop¶

No installation. No terminal. No API keys. No cost. Everything runs in your browser via Google Colab. All you need is a Google account.

Launch in Google Colab · View slides

Runs in Google Colab

All code in this workshop runs in Google Colab. Open a new notebook, clone ClawBio, install its dependencies, and follow along:

!git clone https://github.com/ClawBio/ClawBio.git
%cd ClawBio
!pip install -q -r requirements.txt

Builds on the Variant Interpretation Workshop

This is the second workshop in the series. In the Variant Interpretation Workshop you annotated a single genome and found clinically relevant variants. Now we scale up: from one person to population-level analysis using GWAS summary statistics. You will query the same variant (rs429358/APOE) you found before, but this time across thousands of genomes and nine databases.

What's new compared to the Variant Interpretation Workshop¶

	Variant Interpretation	This workshop (GWAS)
Scope	One genome	Thousands of genomes
Data type	Individual genotypes (23andMe)	Population summary statistics
Analysis	VEP, ClinVar, gnomAD, CPIC	GWAS Catalog, PGS Catalog, SuSiE
Question	"What do my variants mean?"	"Which variants are associated with disease in the population, and which are likely causal?"
New skills	vcf-annotator, pharmgx-reporter	gwas-lookup, gwas-prs, fine-mapping
Equity angle	Database representation bias	Cross-ancestry transferability of GWAS findings

Part 1: What is a GWAS?¶

Association testing at scale¶

A genome-wide association study (GWAS) tests every common variant in the genome for statistical association with a trait or disease, across thousands or millions of participants. The output is a set of summary statistics: per-variant effect sizes, standard errors, and p-values.

Field	Meaning	Example
`rsid`	Variant identifier	rs7903146
`beta`	Effect size per effect allele (log odds ratio for a binary trait, trait units for a quantitative trait)	0.31
`se`	Standard error of beta	0.02
`p`	P-value for association	5.2 x 10^-38
`MAF`	Minor allele frequency	0.28
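The fields in this table are linked: the p-value follows from `beta` and `se` through a Wald z-score, and for a binary trait the odds ratio is the exponential of `beta`. The sketch below assumes a normal approximation and the conventional genome-wide significance threshold of 5 x 10^-8. The example values in the table are illustrative, so the p-value computed here need not match the one listed. The function name `association_stats` is made up for this example and is not part of ClawBio.

```python
import math

def association_stats(beta: float, se: float) -> dict:
    """Derive the Wald z-score, two-sided p-value, and odds ratio from beta and se."""
    z = beta / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    # exp(beta) is an odds ratio only when beta is a log odds ratio (binary trait)
    return {"z": z, "p": p, "odds_ratio": math.exp(beta)}

stats = association_stats(beta=0.31, se=0.02)
print(f"z = {stats['z']:.1f}, p = {stats['p']:.1e}, OR = {stats['odds_ratio']:.2f}")
print("genome-wide significant:", stats["p"] < 5e-8)
```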

Why summary statistics matter

Summary statistics for many studies are public and free, and they are sufficient to run polygenic risk scores, meta-analyses, and (with a matching linkage disequilibrium reference) fine-mapping. No HPC infrastructure, no data access agreements, no individual-level data required. A researcher in Lima, Kampala, or Dhaka can run the same analyses as one at the Broad Institute.
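As one illustration of what summary statistics alone allow, here is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis for a single variant. The cohort values are hypothetical, the function name `ivw_meta` is made up for this example, and the method assumes the cohorts are independent and report `beta` for the same effect allele.

```python
import math

def ivw_meta(estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis of (beta, se) pairs."""
    # Each cohort is weighted by the inverse of its variance (1 / se^2)
    weights = [1.0 / se**2 for _, se in estimates]
    total = sum(weights)
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se

# Hypothetical per-cohort (beta, se) estimates for one variant, not real data
cohorts = [(0.30, 0.04), (0.34, 0.05), (0.28, 0.08)]
pooled_beta, pooled_se = ivw_meta(cohorts)
print(f"pooled beta = {pooled_beta:.3f}, pooled se = {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single cohort, which is why combining public summary statistics increases power without access to individual-level data.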

The ancestry gap¶

Over 86% of GWAS participants are of European ancestry. Effect sizes, allele frequencies, and linkage disequilibrium patterns differ between populations. Polygenic risk scores trained on European cohorts perform poorly in African and South Asian populations. This is not just a technical limitation; it risks widening health disparities.

ClawBio addresses this by querying multiple biobanks in a single call: UK Biobank (multi-ancestry), FinnGen (Finnish), and Biobank Japan (East Asian), alongside the GWAS Catalog and Open Targets.

Part 2: Three ClawBio skills¶

This workshop uses three ClawBio skills that cover the full GWAS analysis workflow.

Skill 1: GWAS Lookup¶

Give it an rsID. It queries nine databases in parallel and returns a unified report in seconds.

Database	What it returns	Ancestry coverage
GWAS Catalog	Published trait associations	Mixed
Open Targets	Credible sets, locus-to-gene scores	Mixed
UKB-TOPMed PheWeb	PheWAS across 4,500 phenotypes	Multi-ancestry
FinnGen r12	Finnish disease endpoints	Finnish
Biobank Japan	East Asian PheWAS	Japanese
GTEx v8	eQTL tissue expression	Mostly European
EBI eQTL Catalogue	Multi-tissue eQTL associations	Mixed
LocusZoom	Regional association context	Both builds
Ensembl	Variant resolution, consequence	Reference

python skills/gwas-lookup/gwas_lookup.py --rsid rs7903146 --output /tmp/gwas_demo

Or run the demo with pre-fetched data:

python skills/gwas-lookup/gwas_lookup.py --demo --output /tmp/gwas_demo

Skill 2: Polygenic Risk Scores (PRS)¶

A PRS sums the effects of many variants into a single risk estimate:

PRS = sum(dosage_i x effect_weight_i) across all matched variants.
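The formula above can be sketched in a few lines. This is an illustration only, not ClawBio's implementation: the variant identifiers, dosages, and weights are hypothetical, and the sketch assumes each dosage already counts copies (0, 1, or 2) of the same effect allele that the weight refers to.

```python
def polygenic_score(dosages: dict, weights: dict):
    """Sum dosage x effect weight over variants present in both inputs."""
    matched = [variant for variant in weights if variant in dosages]
    score = sum(dosages[v] * weights[v] for v in matched)
    # Report how many scoring-file variants were found in the genotype data
    return score, len(matched), len(weights)

# Hypothetical genotype dosages (copies of the effect allele) and score weights
dosages = {"var1": 2, "var2": 1, "var3": 0}
weights = {"var1": 0.10, "var2": -0.05, "var4": 0.20}

score, n_matched, n_total = polygenic_score(dosages, weights)
print(f"PRS = {score:.2f} from {n_matched} of {n_total} variants")
```

Tracking the matched fraction matters because a score computed from only a few of its variants is not comparable to the published distribution.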

ClawBio ships with 6 curated scores from the PGS Catalog for instant demos:

Trait	PGS ID	Variants
Type 2 diabetes	PGS000013	8
Coronary artery disease	PGS000004	46
Breast cancer	PGS000001	77
Prostate cancer	PGS000057	147
Atrial fibrillation	PGS000011	12
BMI	PGS000039	97

Risk categories: Low (<20th percentile), Average (20-80th), Elevated (80-95th), High (>95th).
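These thresholds map onto a simple lookup. The sketch below is an assumption about how the boundaries might be handled: the text does not say which category a score at exactly the 20th, 80th, or 95th percentile falls into, so the choice here (boundaries go to the lower-risk band except at 20) is illustrative, and `risk_category` is a hypothetical name.

```python
def risk_category(percentile: float) -> str:
    """Map a PRS percentile (0-100) to the workshop's four risk categories."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile < 20:
        return "Low"
    if percentile <= 80:   # assumption: exactly 20 and 80 count as Average
        return "Average"
    if percentile <= 95:   # assumption: exactly 95 counts as Elevated
        return "Elevated"
    return "High"

for pct in (10, 50, 90, 99):
    print(pct, risk_category(pct))
```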

python skills/gwas-prs/gwas_prs.py --demo --output /tmp/prs_demo

Skill 3: SuSiE Fine-Mapping¶

GWAS finds associated regions. Fine-mapping prioritises the variants within them that are most likely to be causal.

A single GWAS signal can contain 10-200 correlated SNPs in high linkage disequilibrium. SuSiE (Sum of Single Effects) applies iterative Bayesian stepwise selection to produce:

Credible sets: the smallest set of SNPs that contains a causal variant with 95% probability
PIPs: posterior inclusion probability per variant
Multiple signals: handles multiple independent causal variants per locus

All from summary statistics alone. No individual-level data needed.
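To make PIPs and credible sets concrete, here is a simplified sketch that is not SuSiE itself. It assumes exactly one causal variant in the locus and uses Wakefield's approximate Bayes factors, computed from `beta` and `se` alone, to obtain posterior probabilities. SuSiE extends this idea to several independent signals and needs a linkage disequilibrium matrix to do so. The prior variance of 0.04 and the input values are hypothetical.

```python
import math

def single_signal_pips(betas, ses, prior_var=0.04):
    """Posterior inclusion probabilities assuming exactly one causal variant."""
    log_abf = []
    for beta, se in zip(betas, ses):
        z = beta / se
        r = prior_var / (prior_var + se**2)
        # Log of Wakefield's approximate Bayes factor in favour of association
        log_abf.append(0.5 * math.log(1 - r) + 0.5 * r * z**2)
    # Subtract the maximum before exponentiating to avoid overflow
    top = max(log_abf)
    rel = [math.exp(x - top) for x in log_abf]
    total = sum(rel)
    return [x / total for x in rel]

def credible_set(pips, coverage=0.95):
    """Smallest set of variant indices whose PIPs sum to at least `coverage`."""
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pips[i]
        if cumulative >= coverage:
            break
    return chosen

# Hypothetical summary statistics for five variants in one locus
betas = [0.05, 0.30, 0.295, 0.10, 0.02]
ses = [0.03] * 5
pips = single_signal_pips(betas, ses)
print([round(p, 3) for p in pips])
print("95% credible set (indices):", credible_set(pips))
```

Two variants with nearly identical evidence share the posterior probability, so neither reaches 95% alone and both enter the credible set.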

python skills/fine-mapping/fine_mapping.py --demo --output /tmp/finemapping_demo

Part 3: How to run the workshop¶

What you need¶

A Google account (for Google Colab)
A web browser
Nothing else. No terminal, no installation, no API keys, no payment.

Step-by-step guide¶

Step 1: Setup (2 minutes)¶

Install ClawBio in Colab. As in the Variant Interpretation Workshop, click play on the first two cells.

Step 2: GWAS Lookup (5 minutes)¶

Query rs7903146 (the strongest common Type 2 diabetes signal, in TCF7L2) across all nine databases. Examine the unified report: trait associations, PheWAS hits, eQTL data, and credible set membership.

Try additional variants:

rsID	Gene	Trait	Why it matters
rs7903146	TCF7L2	Type 2 diabetes	Strongest common T2D signal. OR 1.4 per allele.
rs429358	APOE	Alzheimer's	You found this in the Variant Interpretation Workshop. Now see the GWAS context.
rs3798220	LPA	Cardiovascular	Lipoprotein(a), an independent risk factor for coronary events.
rs1801282	PPARG	Type 2 diabetes	Drug target for thiazolidinediones (pioglitazone).
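To query all four variants in one go, the command lines can be generated in a loop. This sketch only builds and prints the commands. It reuses the `--rsid` and `--output` flags shown earlier, while the per-variant output subdirectory and the helper name `lookup_command` are assumptions made for this example.

```python
import shlex

SCRIPT = "skills/gwas-lookup/gwas_lookup.py"

def lookup_command(rsid: str, output_root: str = "/tmp/gwas_demo") -> list:
    """Build the gwas-lookup command for one rsID as an argument list."""
    # Assumption: one output subdirectory per variant keeps reports separate
    return ["python", SCRIPT, "--rsid", rsid, "--output", f"{output_root}/{rsid}"]

rsids = ["rs7903146", "rs429358", "rs3798220", "rs1801282"]
commands = [lookup_command(rsid) for rsid in rsids]

for cmd in commands:
    print(shlex.join(cmd))
```

In Colab, each printed line can be pasted into a cell with a leading `!`, or each argument list can be passed to `subprocess.run(cmd, check=True)` from inside the ClawBio directory.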

Step 3: Cross-ancestry comparison (3 minutes)¶

Compare allele frequencies and effect sizes for your variant across UK Biobank, FinnGen, and Biobank Japan. Note how both differ between populations.
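One way to judge whether two effect sizes really differ, rather than just fluctuate with sampling noise, is a z-test on their difference. The sketch below assumes the two cohorts are independent and that both betas refer to the same effect allele. The numbers are hypothetical and `effect_difference` is a made-up helper name.

```python
import math

def effect_difference(beta_a, se_a, beta_b, se_b):
    """Two-sided z-test for a difference between two independent effect estimates."""
    z = (beta_a - beta_b) / math.sqrt(se_a**2 + se_b**2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical estimates for the same variant in two biobanks, not real data
z_diff, p_diff = effect_difference(beta_a=0.31, se_a=0.02, beta_b=0.20, se_b=0.04)
print(f"z = {z_diff:.2f}, p = {p_diff:.3f}")
```

A small p-value here suggests heterogeneity between populations, which is a reason to be cautious about transferring a score or effect estimate from one ancestry to another.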

Step 4: Polygenic Risk Scores (5 minutes)¶

Compute PRS for 6 traits using the Corpasome (Manuel Corpas's 23andMe data). The tool matches genotyped variants to published PGS Catalog scoring files and estimates percentile rank against reference populations.
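A percentile rank can be approximated by placing an individual's score on the distribution of scores in a reference population. The sketch below assumes that distribution is normal with a known mean and standard deviation. How ClawBio derives its percentiles is not described here, so treat this as one plausible approach with hypothetical numbers.

```python
from statistics import NormalDist

def prs_percentile(score: float, ref_mean: float, ref_sd: float) -> float:
    """Percentile of a score under a normal reference distribution."""
    return 100 * NormalDist(mu=ref_mean, sigma=ref_sd).cdf(score)

# Hypothetical score and reference distribution parameters
percentile = prs_percentile(score=0.15, ref_mean=0.10, ref_sd=0.05)
print(f"{percentile:.1f}th percentile")
```

The reference population should match the individual's ancestry as closely as possible, since score distributions shift between populations.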

Step 5: Fine-Mapping (5 minutes)¶

Run SuSiE on a demo locus containing 200 variants with 2 independent causal signals. Examine the credible sets, PIPs, and the locus plot showing which variants are most likely causal.

Take-home messages¶

GWAS summary statistics are free and public. You do not need individual-level data to do meaningful population-level research.
Three ClawBio skills cover the full workflow: lookup, PRS, and fine-mapping.
Cross-ancestry analysis is not optional. Most GWAS are European-biased. Always query multiple biobanks to check transferability.
Fine-mapping narrows GWAS hits to likely causal variants. SuSiE credible sets are a widely used, state-of-the-art approach.
Infrastructure is no longer a barrier. Google Colab + ClawBio = publication-quality GWAS analysis, for free, anywhere in the world.

Resources¶

Resource	Link
GWAS workshop slides	clawbio.ai/workshop-gwas-slides.html
Variant Interpretation Workshop	Previous workshop
ClawBio GitHub	github.com/ClawBio/ClawBio
GWAS Catalog	ebi.ac.uk/gwas
PGS Catalog	pgscatalog.org
Open Targets Genetics	genetics.opentargets.org
Corpasome dataset (Zenodo)	doi:10.5281/zenodo.19297389
Ensembl VEP	ensembl.org/vep
gnomAD	gnomad.broadinstitute.org
CPIC Guidelines	cpicpgx.org

What's next¶

You have now analysed variants at the individual level (Workshop 1) and the population level (this workshop). The final workshop goes deeper into the same individual genome:

Workshop	What it adds
30x WGS Workshop	Analyse Manuel Corpas's genome at full 30x whole-genome sequencing resolution. Discover 8,925 structural variants, 912K indels, and copy number changes that are invisible to both SNP arrays and summary statistics. Compare WGS findings with the SNP chip results from Workshop 1.