30x Whole-Genome Sequencing Workshop
No installation. No terminal. No API keys. No cost. Everything runs in your browser via Google Colab. All you need is a Google account.
Launch the workshop in Google Colab | View introductory slides
Early release: testers welcome
This workshop has not been fully tested end-to-end. If you hit errors, please file an issue on GitHub or message Manuel directly. Your feedback makes this better for everyone.
Builds on Workshops 1 and 2
This is the third and final workshop in the series. In the Variant Interpretation Workshop you analysed a SNP array (~600K positions). In the GWAS Workshop you scaled up to population-level analysis. Now you return to the same individual genome at full 30x whole-genome sequencing resolution, revealing variant types that are invisible to both arrays and summary statistics.
What's new compared to previous workshops
| | Workshop 1 (Variant Interpretation) | Workshop 2 (GWAS) | This workshop (30x WGS) |
|---|---|---|---|
| Data | 23andMe SNP array (~600K) | Population summary statistics | 30x WGS (4.6M variants) |
| Variant types | SNPs only | SNP associations | SNPs + indels + structural variants + CNVs |
| New concepts | ACMG, VEP, ClinVar, CPIC | PRS, fine-mapping, cross-ancestry | QC metrics, Ti/Tv, structural variants, WGS vs chip |
| Key insight | "What do my variants mean?" | "Which variants matter at the population level?" | "What does my genome look like at full resolution?" |
| Skills used | vcf-annotator, pharmgx-reporter | gwas-lookup, gwas-prs, fine-mapping | variant-annotation (WGS), QC baselines |
Part 1: What is this dataset?
A real genome, fully open
In 2013, Manuel Corpas published his 23andMe genotype data under CC0, creating one of the first fully open personal genomes (the Corpasome). In 2026, he published the full 30x whole-genome sequence from Dante Labs, also under CC0.
This workshop uses that 30x WGS dataset. It is hosted on Zenodo and citable:
Corpas, M. (2026). Personal Whole Genome Sequencing Variant Calls (SNPs, Indels, SVs, CNVs) of Manuel Corpas from Dante Labs 30x WGS. Zenodo. doi:10.5281/zenodo.19297389
Research and education only
This dataset is provided for research and educational purposes only. It must not be used for clinical decision-making. ClawBio is not a medical device.
Ethics approval
Use of this personal genome data for research was approved by the UNIR Research Ethics Committee (Comite de Etica de la Investigacion) on 28 January 2021, under protocol PI:029/2020 ("Healthy Genome Project"), with Manuel Corpas as principal investigator.
What WGS captures that SNP arrays miss
Consumer genotyping platforms (23andMe, AncestryDNA) test around 600,000 pre-selected positions using a SNP array. Whole-genome sequencing reads every base. The difference shows up in both the number and the types of variants detected:
| | SNP Array (~600K) | 30x WGS |
|---|---|---|
| SNPs | ~600,000 | 3,716,648 |
| Indels | 0 | 912,009 |
| Structural variants | 0 | 8,925 |
| Copy number variants | 0 | 1,387 |
| Gene coverage | Sparse (pre-selected positions) | Complete (every base) |
| Ti/Tv ratio | N/A | 2.03 |
| Het/Hom ratio | N/A | 1.63 |
SNP arrays answer pre-defined questions. WGS lets you ask questions you did not know to ask.
Pre-built subsets for instant analysis
ClawBio ships lightweight VCF subsets extracted from the full genome, committed to the repository so you do not need to download the full 224 MB dataset:
| Subset | What it contains | Purpose |
|---|---|---|
| chr20 SNPs + indels | Chromosome 20 variants | Tutorial exploration |
| PGx loci | 5 pharmacogenomic variant calls | WGS vs chip comparison |
| NutriGx loci | 11 nutrigenomics variant calls | Dietary genetics |
| SV calls | 8,925 structural variants | SV exploration |
| CNV calls | 1,387 copy number variants | CNV exploration |
Part 2: Background you need for this workshop
You do not need to memorise this before starting. Read through it once so the terms are familiar, then refer back as needed.
QC metrics: how to tell if a genome is good
When you receive whole-genome sequencing data, the first thing you check is quality. Three metrics matter most:
| Metric | Expected range | This genome | What it tells you |
|---|---|---|---|
| Ti/Tv ratio | ~2.0 for WGS | 2.03 | Ratio of transitions (A/G, C/T) to transversions (all other changes). Values well below 2.0 suggest sequencing errors or contamination. |
| Het/Hom ratio | 1.5 to 1.7 | 1.63 | Ratio of heterozygous to homozygous alternate calls. Values outside this range may indicate sample contamination or consanguinity. |
| Variant count | 3.5M to 4.5M SNPs | 3,716,648 | Total number of SNP calls. Significantly fewer suggests low coverage; significantly more suggests contamination or a calling error. |
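As a rough illustration of how the Het/Hom ratio in the table can be derived, the sketch below counts heterozygous and homozygous-alternate calls from VCF `GT` strings. The function name and the example genotypes are hypothetical, and counting multi-allelic calls such as `1/2` as heterozygous is an assumption; the workshop notebook may compute this metric differently.

```python
def het_hom_ratio(genotypes):
    """Return (het, hom_alt, ratio) for an iterable of VCF GT strings."""
    het = hom_alt = 0
    for gt in genotypes:
        # Treat phased (|) and unphased (/) genotypes the same way
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        a, b = alleles
        if a != b:
            het += 1          # e.g. 0/1, or 1/2 (assumption: counted as het)
        elif a != "0":
            hom_alt += 1      # e.g. 1/1; 0/0 is homozygous reference and ignored
    ratio = het / hom_alt if hom_alt else float("nan")
    return het, hom_alt, ratio


# Made-up genotypes for illustration only
calls = ["0/1", "1/1", "0|1", "./.", "0/0", "1/2", "1/1"]
het, hom_alt, ratio = het_hom_ratio(calls)
print(f"het={het} hom_alt={hom_alt} ratio={ratio:.2f}")
```

On a real genome the ratio would be computed over all PASS SNP calls rather than a handful of genotypes.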
Why is Ti/Tv expected to be ~2.0?
Transitions (purine to purine, or pyrimidine to pyrimidine) are chemically more likely than transversions (purine to pyrimidine or vice versa) due to the molecular structure of the bases. If substitutions occurred at random, each base would have one possible transition and two possible transversions, so a random collection of errors would give Ti/Tv ~0.5. A value near 2.0 therefore indicates that most calls are real biological variants.
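The arithmetic behind the ~0.5 baseline can be checked with a short sketch. It classifies each single-base substitution as a transition or a transversion and computes the ratio. The function name is hypothetical and this is not necessarily how the notebook derives its Ti/Tv value.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = "ACGT"


def ti_tv_ratio(snps):
    """Compute Ti/Tv from an iterable of (ref, alt) single-base pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # ignore indels, ambiguous bases, and non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


# All 12 possible substitutions, each counted once: 4 transitions, 8 transversions
uniform = [(r, a) for r in BASES for a in BASES if r != a]
print(ti_tv_ratio(uniform))

# A small made-up call set with two transitions and one transversion
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

The first call reproduces the random-error expectation of 4/8; the second shows how an excess of transitions pushes the ratio upward.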
Structural variants: the hidden layer
Structural variants (SVs) are genomic rearrangements of 50 base pairs or more. They are invisible to SNP arrays but captured by WGS.
| Type | Full name | What happens | Count in this genome |
|---|---|---|---|
| DEL | Deletion | A segment of DNA is missing | 5,854 |
| BND | Breakend | A translocation or complex rearrangement joins distant genomic positions | 1,413 |
| DUP | Duplication | A segment is copied (can alter gene dosage) | 778 |
| INV | Inversion | A segment is flipped in orientation | 673 |
| INS | Insertion | New DNA is inserted (often mobile elements like Alu or LINE) | 207 |
SVs are estimated to cause around 20% of rare genetic disease, but they remain underexplored because older technologies could not detect them reliably.
From SNP chip to WGS: what changes clinically
WGS does not just find "more of the same." It finds qualitatively different types of variation:
- CYP2D6 gene deletions and duplications: CYP2D6, one of the most important pharmacogenes, has common structural variants that change metaboliser status entirely. SNP chips cannot detect these.
- HLA typing: The immune system's HLA genes are too complex and variable to genotype directly on a SNP array. WGS reads cover the full gene sequences, which dedicated HLA-typing tools can use to call alleles.
- Coding indels: Insertions and deletions in protein-coding regions can cause frameshifts. Arrays test only SNPs.
- Novel variants: Arrays can only report positions they were designed to test. WGS discovers variants never seen before.
The Corpasome: 13 years of open genomics
| Year | Event |
|---|---|
| 2013 | Manuel Corpas publishes his 23andMe data under CC0 (one of the first open personal genomes) |
| 2013 | "Crowdsourcing the Corpasome" paper in Source Code for Biology and Medicine |
| 2026 | 30x whole-genome sequence published on Zenodo under CC0 |
| 2026 | Integrated into ClawBio as the default reference genome for demos, tutorials, and CI |
Part 3: How to run the workshop
What you need
- A Google account (any free Gmail account works)
- A web browser (Chrome, Firefox, or Safari)
- That is it. No software to install, no API keys, no payment.
Prerequisite
This workshop assumes you have completed the Variant Interpretation Workshop and ideally the GWAS Workshop. You should be comfortable with VCF files, variant annotation, and basic genomic terminology.
Opening the notebook
Click the button below to open the workshop notebook in Google Colab:
Launch in Google Colab
When it opens, click the play button on each code cell in order, from top to bottom.
First time using Google Colab?
Google Colab is a free service that lets you run Python code in your browser. You do not need to know Python. Just click the play button on each code cell and read the output. If Colab asks you to "connect to a runtime", click Connect in the top right corner.
Step-by-step guide
Step 0: Setup (2 minutes)
Run the first two code cells. They clone ClawBio and install dependencies. You should see:
If something goes wrong
Click Runtime > Restart and run all in the Colab menu bar. This resets everything and runs all cells from the beginning.
Step 1: Explore the genome (5 minutes)
The notebook loads pre-computed QC baselines from the full 30x genome and displays summary statistics:
- 3,716,648 SNPs across all chromosomes
- 912,009 indels
- Ti/Tv ratio: 2.03 (excellent quality)
- Het/Hom ratio: 1.63 (normal for an outbred individual)
- 8,925 structural variants broken down by type
- 1,387 copy number variants
Then it loads the chr20 subset and shows the core columns of the VCF format: CHROM, POS, ID, REF, ALT, QUAL, and FILTER.
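To make the VCF layout concrete, here is a minimal sketch that reads the seven columns named above from tab-separated VCF lines. The record type, function name, and example rows (including the placeholder ID) are invented for illustration; the notebook may use a different parser or a library.

```python
from collections import namedtuple

# One field per VCF column shown in the notebook
VcfRecord = namedtuple("VcfRecord", "chrom pos id ref alt qual filter")


def parse_vcf_lines(lines):
    """Parse VCF data lines, skipping '##' metadata and the '#CHROM' header."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed line: fewer than the seven core columns
        chrom, pos, vid, ref, alt, qual, flt = fields[:7]
        records.append(VcfRecord(chrom, int(pos), vid, ref, alt, qual, flt))
    return records


# Made-up example lines, not taken from the real chr20 subset
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t100000\t.\tA\tG\t50\tPASS\t.",
    "20\t100500\trs_placeholder\tCT\tC\t99\tPASS\t.",
]
records = parse_vcf_lines(example)
for rec in records:
    kind = "SNP" if len(rec.ref) == 1 and len(rec.alt) == 1 else "indel"
    print(rec.chrom, rec.pos, rec.id, rec.ref, rec.alt, rec.filter, kind)
```

A REF and ALT of equal length 1 marks a SNP; differing lengths mark an indel, which is how the same file carries both variant classes.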
Step 2: Explore structural variants (8 minutes)
The notebook loads all 8,925 structural variants and computes:
- Count by type (DEL, DUP, INV, INS, BND)
- Size distribution (from 50 bp to megabases)
- The largest and smallest SVs
This is the section that distinguishes WGS from SNP arrays. None of these variants would be visible on a genotyping chip.
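The three summaries listed above can be sketched as follows, assuming the SV VCF carries the standard `SVTYPE`, `SVLEN`, and `END` keys in its INFO column. The function names and example records are hypothetical, and a real caller's output may need extra handling (for example, breakends usually have no defined length).

```python
from collections import Counter


def parse_info(info):
    """Split a VCF INFO string into a dict; flags map to an empty string."""
    fields = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            fields[key] = value
    return fields


def summarise_svs(records):
    """records: iterable of (pos, info_string). Returns (counts, sizes)."""
    counts = Counter()
    sizes = []
    for pos, info_str in records:
        info = parse_info(info_str)
        svtype = info.get("SVTYPE", "UNKNOWN")
        counts[svtype] += 1
        if info.get("SVLEN"):
            # SVLEN is negative for deletions; take the first value if several
            size = abs(int(info["SVLEN"].split(",")[0]))
        elif info.get("END"):
            size = int(info["END"]) - pos
        else:
            continue  # no usable length (typical for BND records)
        sizes.append((size, svtype, pos))
    return counts, sizes


# Made-up records for illustration only
example = [
    (1000, "SVTYPE=DEL;END=1500;SVLEN=-500"),
    (5000, "SVTYPE=DUP;END=8000"),
    (9000, "SVTYPE=BND;IMPRECISE"),
    (12000, "SVTYPE=DEL;SVLEN=-60;END=12060"),
]
counts, sizes = summarise_svs(example)
largest, smallest = max(sizes), min(sizes)
print(dict(counts))
print("largest:", largest, "smallest:", smallest)
```

Preferring `SVLEN` over `END - POS` matters for insertions, where the inserted sequence length is not reflected in the reference coordinates.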
Step 3: Pharmacogenomic variants from WGS (5 minutes)
The notebook loads PGx variants from both the WGS and the 23andMe data and compares them side by side. The key insight: the SNP chip reports a genotype at each of its 21 pre-selected positions, including reference-homozygous calls, whereas the WGS VCF lists only positions where the individual differs from the reference. A position with no WGS variant call is treated as homozygous reference, which is itself informative.
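The comparison logic can be sketched like this: for every chip position, use the WGS genotype if a variant call exists, otherwise infer homozygous reference from the reference allele. The function, site names, and genotypes are all hypothetical. The sketch also assumes both platforms report alleles on the same strand and that every position was adequately covered by sequencing, which a plain VCF cannot confirm by itself.

```python
def compare_chip_to_wgs(chip_calls, wgs_calls, ref_alleles):
    """Return rows of (site, chip_gt, wgs_gt, source, agree).

    chip_calls:  {site: genotype string such as "AG"} for every chip position
    wgs_calls:   {site: genotype string} only for sites with a WGS variant call
    ref_alleles: {site: single reference base}
    """
    rows = []
    for site, chip_gt in chip_calls.items():
        if site in wgs_calls:
            wgs_gt, source = wgs_calls[site], "variant call"
        else:
            # Assumption: absence from the VCF means homozygous reference
            wgs_gt, source = ref_alleles[site] * 2, "inferred hom-ref"
        # Allele order is not meaningful for unphased genotypes
        agree = sorted(chip_gt) == sorted(wgs_gt)
        rows.append((site, chip_gt, wgs_gt, source, agree))
    return rows


# Made-up data: three chip positions, one of which has a WGS variant call
chip = {"site_1": "AG", "site_2": "CC", "site_3": "TT"}
wgs = {"site_1": "GA"}
ref = {"site_1": "A", "site_2": "C", "site_3": "T"}

rows = compare_chip_to_wgs(chip, wgs, ref)
for row in rows:
    print(row)
```

This is why the chip appears to report more results than the WGS: it lists every tested position, while the VCF is silent wherever the genome matches the reference.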
Step 4: Annotate chr20 variants (5 minutes)
The notebook extracts 20 PASS variants from chromosome 20 and runs ClawBio's variant-annotation skill. This calls the Ensembl VEP REST API (free, public, no API key needed) and returns gene names, consequence types, ClinVar classifications, and gnomAD population frequencies.
No API key needed
The Ensembl VEP REST API is free and public. ClawBio handles all the formatting and submission automatically.
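For readers curious what such a submission might look like, the sketch below builds a request for the Ensembl VEP REST endpoint `POST /vep/human/region`, which accepts variants as VCF-style strings. This is not ClawBio's actual implementation: the helper names and example variants are hypothetical, and the use of the GRCh37 server is an assumption based on the hg19 assembly mentioned in the exercises. The code only constructs the request and does not contact the server.

```python
import json
import urllib.request

# Assumption: GRCh37 coordinates, hence the grch37 REST server
VEP_URL = "https://grch37.rest.ensembl.org/vep/human/region"


def build_vep_payload(records):
    """Format (chrom, pos, id, ref, alt) tuples as a VEP region POST body."""
    variants = [
        f"{chrom} {pos} {vid} {ref} {alt} . . ."
        for chrom, pos, vid, ref, alt in records
    ]
    return json.dumps({"variants": variants})


def build_vep_request(records):
    """Create (but do not send) the HTTP request for a batch of variants."""
    return urllib.request.Request(
        VEP_URL,
        data=build_vep_payload(records).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Made-up variants for illustration only
example = [("20", 100000, ".", "A", "G"), ("20", 100500, ".", "CT", "C")]
payload = build_vep_payload(example)
request = build_vep_request(example)
print(payload)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it and return a JSON list with one annotation object per variant. Sending variants in small batches keeps the request within the public API's usage limits.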
Step 5: Exercises (15 minutes, independent work)
| Exercise | What to do | Required? |
|---|---|---|
| 5a | Compare WGS and SNP chip findings at pharmacogenomic loci. Why does the chip report more "variants" than the WGS? Which platform gives you more confidence? | Yes |
| 5b | Find the largest deletion in the SV calls. What chromosome is it on? Does it overlap any known genes? (Use the UCSC Genome Browser with the GRCh37/hg19 assembly.) | Yes |
| 5c | How many chr20 variants have no rsID (ID column is ".")? What does a missing rsID mean? If you found a novel missense variant in a disease gene, what steps would you take to determine if it is pathogenic? | Yes |
Part 4: Understanding your results
QC interpretation
| Metric | Your value | Verdict |
|---|---|---|
| Ti/Tv ratio | 2.03 | Within expected range (1.9 to 2.2). No evidence of systematic errors. |
| Het/Hom ratio | 1.63 | Normal for a single outbred individual (expected 1.5 to 1.7). |
| Total SNPs | 3,716,648 | Within expected range for a European genome at 30x (3.5M to 4.5M). |
| Total indels | 912,009 | Consistent with WGS calling standards. |
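The verdicts in the table amount to simple range checks. A minimal sketch, using the expected ranges quoted in the table and hypothetical names for the dictionary keys and function:

```python
# Expected ranges as quoted in the QC interpretation table
QC_RANGES = {
    "ti_tv": (1.9, 2.2),
    "het_hom": (1.5, 1.7),
    "total_snps": (3_500_000, 4_500_000),
}


def qc_verdict(metrics, ranges=QC_RANGES):
    """Label each metric as 'below', 'within', or 'above' its expected range."""
    verdicts = {}
    for name, (low, high) in ranges.items():
        value = metrics[name]
        if value < low:
            verdicts[name] = "below"
        elif value > high:
            verdicts[name] = "above"
        else:
            verdicts[name] = "within"
    return verdicts


# Values reported for this genome
observed = {"ti_tv": 2.03, "het_hom": 1.63, "total_snps": 3_716_648}
verdicts = qc_verdict(observed)
print(verdicts)
```

A "below" or "above" label is a prompt to investigate (coverage, contamination, calling errors), not an automatic failure.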
Structural variant breakdown
| SV type | Count | What to look for |
|---|---|---|
| DEL | 5,854 | Large deletions overlapping coding genes. CYP2D6 whole-gene deletions change metaboliser phenotype. |
| BND | 1,413 | Breakends at or near gene boundaries may indicate translocations. |
| DUP | 778 | Gene duplications can increase expression (e.g., CYP2D6 ultrarapid metabolisers carry extra copies). |
| INV | 673 | Inversions at gene breakpoints can disrupt function. |
| INS | 207 | Mobile element insertions (Alu, LINE) are usually benign but can disrupt regulatory regions. |
What WGS found that the SNP chip missed
Key insight
The 23andMe chip tested 21 pharmacogenomic positions and reported genotypes at all of them (including reference-homozygous calls). The WGS produced 5 variant calls at those same positions; the remaining positions are homozygous reference. Both platforms agree on the result, but WGS also captures the surrounding genomic context: nearby indels, structural variants, and novel SNPs that the chip was not designed to test.
In clinical practice, this matters most for genes like CYP2D6, where whole-gene deletions and duplications (detectable by WGS, invisible to arrays) can completely change a patient's metaboliser status and drug response.
Take-home messages
- WGS is the gold standard for variant discovery. It captures SNPs, indels, structural variants, and copy number variants in a single assay. Arrays test only what they were designed for.
- Structural variants are clinically significant but underexplored. They account for an estimated 20% of rare disease, yet most diagnostic pipelines still focus on SNPs and small indels.
- QC metrics are your first line of defence. Ti/Tv, Het/Hom, and total variant count tell you immediately whether the data is trustworthy. Always check before interpreting.
- The same genome tells different stories at different resolution. The Corpasome at 600K positions (SNP chip) and at 30x coverage (WGS) reveal overlapping but complementary biology. The chip is faster and cheaper; the WGS is deeper and more complete.
- Open data accelerates science. This entire workshop runs on a CC0-licensed genome, open-source tools, and free public APIs. Anyone in the world can reproduce it.
- Agent-driven analysis makes WGS accessible. ClawBio reduces a multi-tool, multi-day annotation pipeline to a single command with a structured, reproducible output.
Medical disclaimer
ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. The findings discussed in this workshop are for educational purposes only. Consult a healthcare professional before making any medical decisions based on genetic data.
Resources
| Resource | Link |
|---|---|
| Google Colab notebook | Open |
| Zenodo dataset | doi:10.5281/zenodo.19297389 |
| ClawBio GitHub | github.com/ClawBio/ClawBio |
| Variant Interpretation Workshop | Previous workshop |
| Corpasome paper | doi:10.1186/1751-0473-8-13 |
| Ensembl VEP | ensembl.org/vep |
| ClinVar | ncbi.nlm.nih.gov/clinvar |
| gnomAD | gnomad.broadinstitute.org |
| UCSC Genome Browser | genome.ucsc.edu |
| CPIC Guidelines | cpicpgx.org |