QUILT2 Tutorial
This is the detailed guideline of QUILT2. Also, check out the nifty lcWGS-imputation-workflow for automate and reproducible workflow.
Split the genome into chunks
For many reasons, imputation is usually done for each genomic regions/chunks
independently with some overlapped buffer in between. For example, for
computational reason, we want to distribute our tasks into multiple machines
with limited memory. However, for the accuracy reason, we need to impute with
reasonable big chunks so that the model can work well. QUILT2 offers the
quilt_chunk_map
function for splitting the genome into multiple reasonable
chunks given the genetic map file.
dat <- QUILT::quilt_chunk_map("chr20", "package/CEU-chr20-final.b38.txt.gz")
str(dat)
## 'data.frame': 16 obs. of 3 variables:
## $ chunk : num 0 1 2 3 4 5 6 7 8 9 ...
## $ chr : chr "chr20" "chr20" "chr20" "chr20" ...
## $ region: chr "chr20:1-2998495" "chr20:2965894-5963047" ...
Prepare the reference panel seperately
For large reference panels, and for many jobs involving imputing few samples,
it can be computationally efficient to pre-process the reference panel and
save the output, and use this output for multiple independent runs, which is
done via the which is done via QUILT2_prepare_reference.R
. Here is an example
for how we would do this, for the case of the quick start example.
./QUILT2_prepare_reference.R \
--genetic_map_file=package/CEU-chr20-final.b38.txt.gz \
--reference_vcf_file=package/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.4000000.noNA12878.vcf.gz \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--buffer=500000 \
--outputdir=quilt2_output
Input/Output files:
reference_vcf_panel
. QUILT2 now supports VCF/BCF files with phased genotypes as the reference panel, thanks to vcfpp!.genetic_map_file
. File with genetic map information, with 3 white-space delimited columns giving position (1-based), genetic rate map in cM/Mbp, and genetic map in cM. Find the prepared ones in the maps folder.outputdir
. Directory for saving the output. The prepared reference data is saved asRData
that is in theoutputdir/RData
.
Find the more options via ./QUILT2_prepare_reference.R -h
.
Perform diploid imputation
Now given the saved reference data, we can perform imputation as follows. Note
that QUILT2 has option method=diploid
in default.
./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--method=diploid \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.diploid.vcf.gz
Note that when running multiple versions of QUILT2 against the same reference
data, it is useful to set outputfilename to change the default filename for
each job, and to keep the temporary directories used independent (which is the
behaviour for default tempdir
).
For a full list of options, query ?QUILT::QUILT
in R, or alternatively, type ./QUILT2.R --help
in terminal
Important parameters that influence run time and accuracy
These parameters are most likely to influence run time and accuracy
nGibbsSamples
. Number of Gibbs samples performed. Run time is approximately linear in this, and increasing it will increase accuracy, but with diminishing returns. Generally less important for higher coverage.n_seek_its
. Number of iterations between Gibbs sampling on small reference panel, and imputing using the full reference panel to find good matches for the small reference panel. Run time is approximately linear in this. Setting it higher will increase accuracy but with diminishing returns.Ksubset
. Size of the small reference panel. Higher values might increase accuracy but will increase run time.
Output of diploid imputation
- VCF with both SNP annotation information (see below) and per-sample genotype information. Per-sample genotype information includes the following entries
- GT (Phased genotypes) Phased genotype, where each allele is the rounded per-haplotype posterior probability (HD below)
- GP (Genotype posteriors) Posterior probabilities of the three genotypes given the data
- DS (Diploid dosage) Posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
- HD (Haploid dosages) Per-haplotype posterior probability of an alternate allele
Ligate the results
If one runs QUILT2 with chunks determined above by quilt_chunk_map
, which
will create overlapping region for successive chunks, then the ligation of
multiple VCF files for diploid results can be done using bcftools
. Note that
this does not change the HD entry (haplotype dosages) which thereafter is
not useable. The GP (genotype posterior) and DS (dosage) entries remain
useable as they are phase invariant.
bcftools concat \
--ligate \
--output-type z \
--output quilt2.diploid.chr20.ligate.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk0.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk1.vcf.gz \
quilt2_output/quilt2.diploid.chr20.chunk2.vcf.gz
Perform NIPT imputation
To perform imputation for NIPT data with QUILT2, we need to set method=nipt
and provide a file with estimated fetal fraction for each sample.
./QUILT2.R \
--prepared_reference_filename=quilt2_output/RData/QUILT_prepared_reference.chr20.2000001.4000000.RData \
--bamlist=package/bamlist.1.0.txt \
--fflist=package/fflist.1.0.txt \
--method=nipt \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=4000000 \
--buffer=500000 \
--output_filename=quilt2_output/quilt2.nipt.vcf.gz
Output of NIPT imputation
Since we are imputing both maternal and fetal genotypes, the outputted VCF is totally different from the diploid output!
First of all, the VCF header gives explanation on the FORMAT fields.
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes in order of maternal transmitted, maternal untransmitted, and fetal transmitted">
##FORMAT=<ID=MGP,Number=3,Type=Float,Description="Maternal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=MDS,Number=1,Type=Float,Description="Maternal Diploid dosage">
##FORMAT=<ID=FGP,Number=3,Type=Float,Description="Maternal Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=FDS,Number=1,Type=Float,Description="Maternal Diploid dosage">
The per-sample genotype information includes the following entries:
- GT (Phased Genotypes). Phased genotype for three haplotypes in order of the maternal transmitted, maternal untransmitted, fetal transmitted.
- MGP (Maternal genotype posteriors). Maternal posterior probabilities of the three genotypes given the data and fetal fraction.
- MDS (Maternal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
- FGP (Fetal genotype posteriors). Fetal posterior probabilities of the three genotypes given the data and fetal fraction.
- FDS (Fetal genotype dosage). Maternal posterior expectation of the diploid genotype i.e. the expected number of copies of the alternate allele
Ligate the results
Here we have two genomes that forms four haplotypes, while the QUILT2-nipt
only outputs 3 haplotypes in GT
. If one does not care about the phased
genotypes of both mother and fetus, then we can just concatenate the chunks
and remove one of the duplicated variants from the overlapping region, which
can be done as follows.
bcftools concat \
quilt2_output/quilt2.nipt.chr20.chunk0.vcf.gz \
quilt2_output/quilt2.nipt.chr20.chunk1.vcf.gz \
quilt2_output/quilt2.nipt.chr20.chunk2.vcf.gz | \
bcftools norm -D
--output-type z \
--output quilt2.nipt.chr20.concat.vcf.gz
Note the above will result in meaningless phased GT across chunks. If we want to have proper ligated phased genotypes across the chunks for both mother and fetus, we need to create two separate VCF for mother and fetus with phased genotypes in each chunk. Then we can perform ligation normally as for the diploid output above.
Evaluation and visualization
Assume we have the truth genotypes data, we then can evaluate and visualize the imputation results conveniently via vcfppR.
library(vcfppR)
res <- vcfcomp(test = imputedvcf, truth = truthvcf,
stats = "r2", region = "chr20",
formats = c("DS","GT"))
par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")