Reference genotype data and summary statistics

Introduction

Reference genotype panels are widely used in GWAS, especially when individual-level genotypes are unavailable or sequencing depth is low. They provide richer variant information and can improve downstream analyses. Below I use 1000 Genomes (1KGP) as an example, then summarize practical preprocessing tips for GWAS summary statistics.

Common reference panels include:

  • HapMap3 reference panels: Early reference panels
  • 1000 Genomes Phase3 (1KGP, widely used), 2506 individuals from 26 populations, 49,143,605 Sites. 504 individuals for European ancestry.
  • The Haplotype Reference Consortium (HRC), the first release consists of 64,976 haplotypes at 39,235,157 SNPs. Mainly european ancestry but also includes 1KGP
  • Topmed reference panel. Version r2 of the panel includes 97,256 reference samples and 308,107,085 genetic variants distributed across the 22 autosomes and the X chromosome.

Not all reference panels are public. Some large panels require applications. Here we focus on the 1KGP reference panel.

Work with 1KGP

access to the 1KGP project data

To convert them to the plink file, we need to run following commands:

for j in {1..22}; do
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
./plink --make-bed --out chr$j --maf 0.01 --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
plink2 --pfile all_phase3 --pgen-info # check the information 
# or use the following command if all_phase3 is zst compressed
plink2 --pfile all_phase3 vzs --pgen-info # check the information 

The output is the information of the genotypes

--pgen-info on all_phase3.pgen:
  Variants: 84805772
  Samples: 2504
  REF alleles are all known
  Maximum allele count for a single variant: >2, not explicitly stored
  Explicitly phased hardcalls present
  No dosages present

Then we extract the European individuals with MAF > 0.01 and remove multi-allelic SNPs, duplicate positions, and duplicate IDs.


plink2 --pfile all_phase3 'vzs' --keep-if SuperPop == EUR --rm-dup exclude-all --snps-only  'just-acgt' --max-alleles 2 --chr 1-22 --maf 0.01 --make-bed --out all_phase3_eur_maf_0.01

2504 samples (1271 females, 1233 males; 2497 founders) loaded from
all_phase3.psam.
77818345 out of 84805772 variants loaded from all_phase3.pvar.zst.
2 categorical phenotypes loaded.
--rm-dup: 191 duplicated IDs, 382 variants removed.
--keep-if: 2001 samples removed.
503 samples (263 females, 240 males; 503 founders) remaining after main
filters.
Calculating allele frequencies... done.
69267844 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
8550119 variants remaining after main filters.

Download and insert genetic distances


plink --bfile all_phase3_eur_maf_0.01 --cm-map ./genetic_map/genetic_map_chr@_combined_b37.txt --make-bed --out all_phase3_eur_maf_0.01_ctm

Replace predictor names with generic names of the form Chr:BP (often not required, but I find this format more convenient). It is useful to create a file containing the original names, generic names and alleles.

genofile='all_phase3_eur_maf_0.01'
awk < ${genofile}_ctm.bim '{$2=$1":"$4;print $0}' > ${genofile}_ref.bim
awk < ${genofile}_ctm.bim '{print $2, $1":"$4, $5, $6}' > ${genofile}_ref.names
cp ${genofile}_ctm.bed ${genofile}_ref.bed
cp ${genofile}_ctm.fam ${genofile}_ref.fam

Then we remove the intermediate results and keep the ${genofile}_ref files as our reference panel.

Summary association statistics

Summary statistics come in many formats across cohorts, so a light but consistent preprocessing step is recommended before analysis.

Minimum recommended fields

  • SNP, A1, A2
  • Effect size (BETA/OR) and standard error (SE)
  • P value
  • Sample size (N or N_eff)

Common cleaning steps

  • Remove duplicated SNPs, missing values, and obvious outliers
  • Harmonize genome build (e.g., GRCh37 vs GRCh38)
  • Align alleles and strand with the reference panel
  • Filter very low MAF and low-quality variants

Useful links

Reference

Jianqiao Wang
Jianqiao Wang
Assistant Professor

My research interests include statistical genetics and genomics.