Reference genotype data and summary statistics
Introduction
Reference genotype panels are widely used in GWAS, especially when individual-level genotypes are unavailable or sequencing depth is low. They provide richer variant information and can improve downstream analyses. Below I use 1000 Genomes (1KGP) as an example, then summarize practical preprocessing tips for GWAS summary statistics.
Common reference panels include:
- HapMap3 reference panels: Early reference panels
- 1000 Genomes Phase3 (1KGP, widely used), 2506 individuals from 26 populations, 49,143,605 Sites. 504 individuals for European ancestry.
- The Haplotype Reference Consortium (HRC), the first release consists of 64,976 haplotypes at 39,235,157 SNPs. Mainly european ancestry but also includes 1KGP
- Topmed reference panel. Version r2 of the panel includes 97,256 reference samples and 308,107,085 genetic variants distributed across the 22 autosomes and the X chromosome.
Not all reference panels are public. Some large panels require applications. Here we focus on the 1KGP reference panel.
Work with 1KGP
access to the 1KGP project data
- Raw genotype varaint call files (vcf) could be downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ at each chromosome. However, we remind the loci of variants were determined for version GRCh37.
- Variants on GRCh38 should check the information about the data collection at : https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. It contains 2504 unrelated individuals and additional 698 related samples with high coverage.
To convert them to the plink file, we need to run following commands:
for j in {1..22}; do
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
./plink --make-bed --out chr$j --maf 0.01 --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
- The plink file format of 1KGP could be found in https://www.coggenomics.org/plink/2.0/resources#1kg_phase3 in all_phase3.pgen, .psam, .pvar format
plink2 --pfile all_phase3 --pgen-info # check the information
# or use the following command if all_phase3 is zst compressed
plink2 --pfile all_phase3 vzs --pgen-info # check the information
The output is the information of the genotypes
--pgen-info on all_phase3.pgen:
Variants: 84805772
Samples: 2504
REF alleles are all known
Maximum allele count for a single variant: >2, not explicitly stored
Explicitly phased hardcalls present
No dosages present
Then we extract the European individuals with MAF > 0.01 and remove multi-allelic SNPs, duplicate positions, and duplicate IDs.
plink2 --pfile all_phase3 'vzs' --keep-if SuperPop == EUR --rm-dup exclude-all --snps-only 'just-acgt' --max-alleles 2 --chr 1-22 --maf 0.01 --make-bed --out all_phase3_eur_maf_0.01
2504 samples (1271 females, 1233 males; 2497 founders) loaded from
all_phase3.psam.
77818345 out of 84805772 variants loaded from all_phase3.pvar.zst.
2 categorical phenotypes loaded.
--rm-dup: 191 duplicated IDs, 382 variants removed.
--keep-if: 2001 samples removed.
503 samples (263 females, 240 males; 503 founders) remaining after main
filters.
Calculating allele frequencies... done.
69267844 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
8550119 variants remaining after main filters.
Download and insert genetic distances
plink --bfile all_phase3_eur_maf_0.01 --cm-map ./genetic_map/genetic_map_chr@_combined_b37.txt --make-bed --out all_phase3_eur_maf_0.01_ctm
Replace predictor names with generic names of the form Chr:BP (often not required, but I find this format more convenient). It is useful to create a file containing the original names, generic names and alleles.
genofile='all_phase3_eur_maf_0.01'
awk < ${genofile}_ctm.bim '{$2=$1":"$4;print $0}' > ${genofile}_ref.bim
awk < ${genofile}_ctm.bim '{print $2, $1":"$4, $5, $6}' > ${genofile}_ref.names
cp ${genofile}_ctm.bed ${genofile}_ref.bed
cp ${genofile}_ctm.fam ${genofile}_ref.fam
Then we remove the intermediate results and keep the ${genofile}_ref files as our reference panel.
Summary association statistics
Summary statistics come in many formats across cohorts, so a light but consistent preprocessing step is recommended before analysis.
Minimum recommended fields
- SNP, A1, A2
- Effect size (BETA/OR) and standard error (SE)
- P value
- Sample size (N or N_eff)
Common cleaning steps
- Remove duplicated SNPs, missing values, and obvious outliers
- Harmonize genome build (e.g., GRCh37 vs GRCh38)
- Align alleles and strand with the reference panel
- Filter very low MAF and low-quality variants
Useful links
Reference
-
A helpful tool for analyzing the GWAS data: plinkQC (Discussed later)