Ghost Genes: Who You Gonna Call?

In 2019, researchers studying KCNE1, a gene involved in inherited heart conditions, reported something curious: the most recent version of the human reference genome contained another gene that was remarkably similar to KCNE1 (Pantou et al., 2019). This gene, called KCNE1B, shared over 98% of its sequence with KCNE1. The researchers wondered: did this similar gene also play a role in inherited heart conditions?

To identify whether someone carries a disease-causing mutation, sequenced DNA fragments are mapped onto a reference genome. A reference genome is like a template that helps us identify where in the genome a particular sequence is from. The classic analogy is that the reference genome is like the picture on the front of a jigsaw puzzle. It helps you understand where pieces belong, and similarly, the reference genome helps us understand where a sequenced DNA fragment came from in the genome.

The process of comparing the fragments to the reference and figuring out where in the genome they likely come from is called alignment. Once fragments are aligned, the reference genome acts as a baseline which we can use to describe variation in sequences. If, for example, you have a sequence AAGGA where the reference has the sequence AAA, it would be said that you have an insertion (of GG) relative to the reference genome. The act of identifying differences between a sequenced genome and the reference (variants) is called variant calling.
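
The AAGGA example can be worked through in a few lines of Python. This is only a toy sketch: the function name describe_variant is made up for illustration, it assumes the two sequences are already aligned at both ends, and it handles a single difference. Real variant callers work from many aligned reads and are far more involved.

```python
def describe_variant(reference: str, sample: str) -> str:
    """Describe one difference between two sequences by trimming shared ends.

    Hypothetical helper for illustration; not how production variant callers work.
    """
    shortest = min(len(reference), len(sample))

    # Count matching bases from the left.
    start = 0
    while start < shortest and reference[start] == sample[start]:
        start += 1

    # Count matching bases from the right, without re-using the left-hand matches.
    end = 0
    while (end < shortest - start
           and reference[len(reference) - 1 - end] == sample[len(sample) - 1 - end]):
        end += 1

    ref_part = reference[start:len(reference) - end]
    sample_part = sample[start:len(sample) - end]

    if not ref_part and not sample_part:
        return "no difference"
    if not ref_part:
        return f"insertion of {sample_part} after reference position {start}"
    if not sample_part:
        return f"deletion of {ref_part} starting at reference position {start + 1}"
    return f"substitution of {ref_part} with {sample_part} at reference position {start + 1}"


print(describe_variant("AAA", "AAGGA"))  # insertion of GG after reference position 2
```

Positions are counted from 1, so "after reference position 2" means the GG sits between the second and third A of the reference.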

The human reference genome has evolved over time as genomic technologies have improved. KCNE1B was discovered in human reference genome GRCh38. GRCh38 was first published back in 2013 (although it has continued to receive minor improvements since then). GRCh38 was preceded by GRCh37, published in 2009. GRCh38 boasts a range of improvements over the previous version: it has fewer gaps and is, on the whole, more accurate (Guo et al., 2017; NCBI, 2013).

The interesting thing about KCNE1B is that it doesn’t exist in GRCh37. So did the new and improved human reference genome uncover the existence of a new gene? Genes with high sequence similarity do exist in nature. They’re called paralogues and arise through gene duplication. But there is also another possibility: perhaps KCNE1B doesn’t exist at all.

Reference genomes themselves are created by sequencing DNA fragments and assembling the sequences together by identifying common sequences which suggest fragment overlaps. Errors in assembly can lead to inaccuracies in the reference genome.
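
To make the idea of assembling by overlaps concrete, here is a minimal greedy sketch in Python. The fragments are invented, the function names overlap and greedy_assemble are hypothetical, and the approach (exact matches only, always merging the longest overlap first) is a classroom simplification rather than what modern assemblers actually do.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up fragments that overlap by three bases each.
print(greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"]))  # ['ATGCGTTACGGA']
```

Because this sketch only merges fragments whose ends match exactly, a single wrong base in an overlap is enough to stop two fragments from being joined.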

One documented type of assembly error is the representation of a sequence multiple times in the reference, even though it exists in only one location in the actual genome. This is known as a false duplication. False duplications are thought to arise in one of two ways. The first is sequencing errors. No sequencing technology is perfect: sequences may be incorrectly read as having extra, missing, or changed bases. These errors may cause the assembly software (the assembler) to consider sequences from the same part of the genome to be different from one another, and thus represent the sequence in multiple places in the final assembled genome (Ko et al., 2022).

The other, more common, type of false duplication is called a heterotype duplication. Humans have two copies of most genes, one from each parent (the main exception being genes on the X and Y chromosomes in men). The sequences of these copies can be the same (homozygous), or they can be different (heterozygous). Just as with sequencing errors, sequence differences between the paternal and maternal copies of a gene may cause the assembler to treat them as distinct genomic regions, resulting in a duplication (Ko et al., 2022).
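
Both routes to a false duplication come down to the same decision: are two similar sequences the same region or not? The Python sketch below imitates that decision with a simple mismatch threshold. The sequences, the function names and the threshold values are all invented for illustration, and the sequences are assumed to be the same length; real assemblers make this call in much more sophisticated ways.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def collapse_similar(sequences: list[str], max_mismatches: int) -> list[str]:
    """Keep one representative per group of sequences that are similar enough.

    Each representative stands for one region in the finished assembly.
    """
    representatives = []
    for seq in sequences:
        if not any(hamming(seq, rep) <= max_mismatches for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical maternal and paternal copies of one gene, differing at two bases.
maternal = "ATGCGTTACGGA"
paternal = "ATGCGATACGCA"

# Too strict: the two copies are kept apart, so the gene appears twice.
print(len(collapse_similar([maternal, paternal], max_mismatches=1)))  # 2

# Tolerant enough: the two copies are recognised as one region.
print(len(collapse_similar([maternal, paternal], max_mismatches=2)))  # 1
```

The same thing happens if the second sequence is a copy of the first with sequencing errors in it, rather than the other parent's version.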

KCNE1B was determined to be the result of a false duplication. The consequences of a false gene duplication in the human reference genome are severe. During alignment, a fragment of a duplicated gene is perceived as equally likely to have come from the true gene or the false gene. This lack of certainty leads to substantially reduced variant calling accuracy. This means identifying a disease-causing mutation in a gene like KCNE1 would be highly unlikely, potentially leaving a patient without a genetic diagnosis (Wagner et al., 2022).
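
The alignment problem can be illustrated with a toy reference in Python. Everything here is an assumption made for the example: the sequences are invented, matching is exact, and toy_mapping_quality is a simplified Phred-style score (capped at an arbitrary 60) rather than the formula any particular aligner uses.

```python
import math


def find_hits(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    pos = reference.find(read)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(read, pos + 1)
    return hits


def toy_mapping_quality(n_hits: int, cap: int = 60) -> int:
    """Phred-scale the chance of picking the wrong place among equally good hits."""
    if n_hits == 0:
        return 0
    p_wrong = 1 - 1 / n_hits
    if p_wrong == 0:
        return cap  # a unique hit: as confident as this toy scale allows
    return min(cap, round(-10 * math.log10(p_wrong)))


gene = "ATGCGTTACGGA"                      # made-up gene sequence
correct_ref = "CCCC" + gene + "TTTT"       # gene appears once
duplicated_ref = correct_ref + gene + "AAAA"  # gene appears twice (false copy)
read = "GCGTTACG"                          # a fragment sequenced from the gene

for name, ref in [("correct", correct_ref), ("duplicated", duplicated_ref)]:
    hits = find_hits(ref, read)
    print(name, hits, toy_mapping_quality(len(hits)))
# correct [6] 60
# duplicated [6, 22] 3
```

With two equally good places to put the read, the confidence score collapses, and variant callers generally give little or no weight to reads placed with such low confidence.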

KCNE1 is not the only example of a medically important gene that is falsely duplicated in GRCh38. There are also false duplications of CRYAA, a gene involved in inherited cataracts, and CBS, which is mutated in a genetic metabolic disease (Wagner et al., 2022).

Wagner et al. (2022) proposed a quick fix for these genes in GRCh38: “hiding” the duplicated gene copies during alignment to prevent misalignment. But what we really want is a human reference genome free of false gene duplications. How can we achieve that?
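
One way to picture "hiding" a false copy is to overwrite it with the placeholder base N, which no sequenced fragment will match. The Python sketch below assumes that approach and uses an invented reference; mask_region is a hypothetical helper, not a function from any real toolkit.

```python
def mask_region(reference: str, start: int, end: int) -> str:
    """Replace reference[start:end] with N so that no read can align there.

    The length is unchanged, so positions elsewhere in the reference stay valid.
    """
    return reference[:start] + "N" * (end - start) + reference[end:]


gene = "ATGCGTTACGGA"               # made-up gene sequence
prefix = "CCCC" + gene + "TTTT"     # contains the true copy
reference = prefix + gene + "AAAA"  # the second copy is the false duplication
read = "GCGTTACG"

false_start = len(prefix)
false_end = false_start + len(gene)
masked = mask_region(reference, false_start, false_end)

print(reference.count(read))  # 2: the read could belong to either copy
print(masked.count(read))     # 1: only the true copy remains
print(masked)                 # CCCCATGCGTTACGGATTTTNNNNNNNNNNNNAAAA
```

Masking rather than deleting the false copy keeps every other coordinate in the reference where it was.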

A major advancement in genome assembly that was thought to help eliminate false duplications is the introduction of accurate long read sequencing (Wenger et al., 2019). In long read sequencing, instead of sequencing lots of short DNA fragments, the DNA is sequenced in fewer, much longer pieces. This enabled the creation of the first ever complete human genome in 2022 (Nurk et al., 2022). Although genome assemblies generated from long read sequencing do tend to have fewer errors, false duplications are still possible, particularly in repetitive regions of the genome (Li & Durbin, 2024).

Another way to prevent false duplications is sequencing of a mother, father, and child, known as trio sequencing. Trio sequencing is used to determine whether sequences from the child originated from the father or the mother. This information can be used to remove heterotype duplicates from the genome assembly.
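
A rough sketch of how parental data can sort a child's sequences is shown below in Python. It assumes one possible strategy: look for short subsequences (here four bases long) that occur in only one parent, and assign each of the child's reads to whichever parent it shares more of them with. The sequences and function names are invented, and real trio-based methods work with far longer subsequences and whole-genome data.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All subsequences of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_parent(read: str, mother: str, father: str, k: int = 4) -> str:
    """Label a child's read by which parent-specific k-mers it contains."""
    mother_only = kmers(mother, k) - kmers(father, k)
    father_only = kmers(father, k) - kmers(mother, k)
    read_kmers = kmers(read, k)
    maternal_score = len(read_kmers & mother_only)
    paternal_score = len(read_kmers & father_only)
    if maternal_score > paternal_score:
        return "maternal"
    if paternal_score > maternal_score:
        return "paternal"
    return "unassigned"  # read falls where the parents are identical


# Hypothetical parental versions of the same region, differing at two bases.
mother = "ATGCGTTACGGA"
father = "ATGCGATACGCA"

print(assign_parent("GCGTTACG", mother, father))  # maternal
print(assign_parent("GCGATACG", mother, father))  # paternal
print(assign_parent("ATGCG", mother, father))     # unassigned
```

Once reads are labelled this way, the maternal and paternal versions of a region can be assembled separately instead of being mistaken for two different places in the genome.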

The use of hydatidiform moles in the creation of GRCh38 was also intended to reduce heterotype duplications. A hydatidiform mole forms when an egg lacking maternal DNA is fertilised by a sperm, so it contains only paternal DNA sequences (Guo et al., 2017). Evidently though, this was not enough to prevent all false duplications in GRCh38.

Another opportunity to address false duplications is to improve the software used for assembly. Improving genome assembly algorithms, for example by improving the resolution of maternal and paternal sequences for trio sequencing, may assist in preventing false duplications (Wagner et al., 2022). Additionally, there are a number of tools that can be used on assembled genomes to remove sequences that are suspected to be false duplications (Li & Durbin, 2024).

Apparitions of medically relevant genes like KCNE1B in the human reference genome can prevent variants from being called reliably in the real gene, and therefore prevent a genetic diagnosis from being made. Preventing false duplications in future human reference genomes is imperative. A combined effort of improving sequencing technologies, identifying the best types of sequences for building a reference genome, and improving software used in the assembly process is required to achieve this goal.

References

Guo, Y., Dai, Y., Yu, H., et al. (2017) ‘Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis’, Genomics, 109(2), 83-90, doi: 10.1016/j.ygeno.2017.01.005.

Ko, J. B., Lee, C., Kim, J., et al. (2022) ‘Widespread false gene gains caused by duplication errors in genome assemblies’, Genome Biology, 23, 205, doi: 10.1186/s13059-022-02764-1.

Li, H. & Durbin, R. (2024) ‘Genome assembly in the telomere-to-telomere era’, Nature Reviews Genetics, 25, 658-670, doi: 10.1038/s41576-024-00718-w.

National Center for Biotechnology Information (NCBI) (2013) ‘Introducing the New Human Genome Assembly: GRCh38’, https://ncbiinsights.ncbi.nlm.nih.gov/2013/12/24/introducing-the-new-human-genome-assembly-grch38/, accessed 08.02.2025.

Nurk, S., Koren, S., Rhie, A., et al. (2022) ‘The complete sequence of a human genome’, Science, 376(6588), 44-53, doi: 10.1126/science.abj6987.

Pantou, M. P., Gourzi, P. & Degiannis, D. (2019) ‘The potential presence of the highly similar paralogue gene KCNE1B blurs the genetic basis of KCNE1-LQTS patients’, European Journal of Human Genetics, 27(8), 1175-1177, doi: 10.1038/s41431-019-0389-2.

Wagner, J., Olson, N. D., Harris, L., et al. (2022) ‘Curated variation benchmarks for challenging medically relevant autosomal genes’, Nature Biotechnology, 40(5), 672-680, doi: 10.1038/s41587-021-01158-1.

Wenger, A. M., Peluso, P., Rowell, W. J., et al. (2019) ‘Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome’, Nature Biotechnology, 37, 1155-1162, doi: 10.1038/s41587-019-0217-9.