It has recently been revealed that thousands of gene variants in the UK Biobank (UKB) exome data have been overlooked because of errors in sequence processing. Two independent UKB sequence processing pipelines contained errors that affect variant calling. These findings are a reminder to remain vigilant to the possibility of sequence processing errors. A workaround for one of the pipeline errors is available.
A variant call is the conclusion that there is a nucleotide difference compared to a reference at a given position in an individual genome or transcriptome. This is usually accompanied by an estimate of variant frequency and an indication of confidence.
Few investigators have the computational infrastructure to identify and curate genetic variants themselves, so most rely on pre-processed variants in variant call format (VCF). The UKB has released VCFs from two different variant analysis pipelines: the Regeneron Seal Point Balinese (SPB) pipeline and the Functionally Equivalent (FE) pipeline developed at five US-based genome centers.
Duplicate reads (DRs) are multiple reads that originate from the same template sequence during library preparation. DRs are a consequence of upstream techniques such as PCR and are usually not an error of sequencing itself. Duplicate reads match exactly.
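As a minimal sketch of the idea (not the SPB or FE implementation), duplicates can be found by grouping reads that share the same alignment coordinates, strand, and sequence, and keeping one representative per group. The `Read` record and the `mark_duplicates` function are hypothetical names introduced for illustration; production tools also take into account mate position, clipping, and base quality.

```python
from collections import namedtuple

# Hypothetical minimal read record; real aligned reads carry many more fields.
Read = namedtuple("Read", ["name", "chrom", "pos", "strand", "seq"])


def mark_duplicates(reads):
    """Split reads into (unique_reads, duplicate_reads).

    Reads that share chromosome, start position, strand, and sequence are
    treated as copies of one template; the first one seen is kept as unique.
    """
    seen = set()
    unique, duplicates = [], []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.seq)
        if key in seen:
            duplicates.append(read)
        else:
            seen.add(key)
            unique.append(read)
    return unique, duplicates


# Illustrative input: r2 is an exact copy of r1; r3 is on the opposite strand.
example_reads = [
    Read("r1", "chr1", 100, "+", "ACGT"),
    Read("r2", "chr1", 100, "+", "ACGT"),
    Read("r3", "chr1", 100, "-", "ACGT"),
]
unique_reads, duplicate_reads = mark_duplicates(example_reads)
```

Here only `r2` ends up in `duplicate_reads`, so the unique-read depth at this position is two rather than three.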
In July 2019 an issue was identified within the SPB pipeline exome data, in which duplicate sequence reads were not marked correctly. The issue is limited to the exome data that have been processed using the SPB pipeline. It is expected that a corrected SPB pipeline will be released in spring 2020.
Because the SPB algorithm was not adapted to a sequencing platform different from the one on which it was created, a proportion of duplicate reads were not marked and removed. This undermarking of duplicate reads inflates the unique-read coverage reported for each sample and can create variant errors. False positive variant calls can arise when unmarked duplicate reads carry a variant allele, and false negative calls can arise when the unmarked duplicates carry the reference allele.
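The effect on coverage and on the apparent variant allele fraction can be illustrated with a small worked example. All read counts below are made up for illustration, and `alt_fraction` is a hypothetical helper; whether a real caller would change its call depends on its statistical model.

```python
def alt_fraction(alleles):
    """Fraction of reads at a site that carry the variant ('alt') allele."""
    return alleles.count("alt") / len(alleles)


# Hypothetical site that is truly homozygous reference:
# one of ten unique reads carries a PCR error that looks like a variant.
fp_unique = ["alt"] * 1 + ["ref"] * 9
fp_observed = fp_unique + ["alt"] * 4  # 4 unmarked copies of the erroneous read

# Hypothetical site that is truly heterozygous:
# three of ten unique reads carry the variant allele.
fn_unique = ["alt"] * 3 + ["ref"] * 7
fn_observed = fn_unique + ["ref"] * 8  # 8 unmarked copies of reference reads

for label, unique, observed in [
    ("false-positive risk", fp_unique, fp_observed),
    ("false-negative risk", fn_unique, fn_observed),
]:
    print(
        label,
        "| depth", len(unique), "->", len(observed),
        "| alt fraction", round(alt_fraction(unique), 2),
        "->", round(alt_fraction(observed), 2),
    )
```

In the first case the apparent depth rises from 10 to 14 and the alt fraction from 1/10 to 5/14, which makes an artifact look like a plausible variant. In the second, depth rises from 10 to 18 and the alt fraction falls from 3/10 to 3/18, which makes a real variant easier to miss.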
In December 2019 an issue with the UKB FE pipeline was reported. Because read alignment errors resulted in mapping quality (MAPQ) scores of zero, 598 genes have a high probability of missing variation. Additional genes with partially duplicated or repetitive sequences may also be missing substantial variation. The authors of the study that first reported the error provide a protocol for read realignment that can be used to correct the variant calling errors. The UKB intends to release a corrected FE pipeline in spring 2020.
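One way to screen for regions affected by this kind of problem is to measure the share of reads with MAPQ zero, since variant callers commonly discard or down-weight such reads. The following is a minimal sketch under that assumption and is not the realignment protocol mentioned above; the function names, the gene labels, and the 0.5 threshold are hypothetical choices for illustration.

```python
def mapq_zero_fraction(mapqs):
    """Fraction of reads in a region whose mapping quality is zero."""
    if not mapqs:
        return 0.0
    return sum(1 for q in mapqs if q == 0) / len(mapqs)


def flag_regions(region_mapqs, threshold=0.5):
    """Return sorted names of regions whose MAPQ-zero fraction meets the threshold.

    region_mapqs maps a region name to a list of per-read MAPQ values.
    The default threshold of 0.5 is an arbitrary illustrative choice.
    """
    return sorted(
        name
        for name, mapqs in region_mapqs.items()
        if mapq_zero_fraction(mapqs) >= threshold
    )


# Made-up per-read MAPQ values for two placeholder genes.
example = {
    "GENE_A": [0, 0, 0, 60],    # mostly ambiguous alignments
    "GENE_B": [60, 60, 0, 60],  # mostly confident alignments
}
flagged = flag_regions(example)
```

With these made-up values only `GENE_A` is flagged, because three of its four reads have MAPQ zero. In practice the per-read MAPQ values would be read from the aligned BAM or CRAM files.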
The first tranche of large-scale UKB exome sequence data, covering 49,960 study participants, was released in March 2019. All subsequent work stemming from this release will need to take account of the errors in the SPB and FE sequencing pipelines. To say the least, genetic variation is not well represented in this release.