Lost in translation: the pitfalls of Ensembl Gene annotations between human genome assemblies and their impact on diagnostics

Lay summary of the article published under the DOI: 10.1101/2020.11.12.380295

Published on Apr 06, 2023
Final summary for translation

Geneticists using an old human genome assembly are missing important data.

Genetics researchers rely on reference information about human genes, called “genome assemblies”, when studying gene function in the body. Researchers who compared the most recent human genome assembly with an older version found differences in information about specific genes, which could cause scientists to miss important links between genetics and disease.

Many scientists still use the older assembly because there is more research information for them to compare their results with, and because switching to the newer assembly is time-consuming and costly. The problem is that many of the software tools these scientists use rely on outdated information and simply ignore genes that we now know to be important. As a result, scientists may not have all the relevant information when running their genetic studies.

The researchers in this study wanted to understand how much information was being lost when scientists used the older assembly, whose gene information has not been updated since 2013. In particular, they were interested in how many genes have updated information in the new assembly, and what those genes do.

To find out, they analysed and compared the two genome assemblies, looking for cases where new information had been added about the function of certain genes. They then looked at medical and genetic databases to find any links between those genes and diseases.

The researchers found hundreds of genes that had been recategorised in the new assembly. Many of these are now recognised as genes involved in making proteins in the body. The names used to identify some of these genes have also changed, making it even more difficult for scientists to match up information when working with the older assembly. A few of these genes were also associated with specific diseases, for example the KIZ gene, which is linked to retinitis pigmentosa, a disease that can cause loss of sight.

The study shows that it’s vital for researchers to use up-to-date genetic information to make sure diseases are linked to the genes that cause them, particularly where this information is being used in medical diagnoses. New tools for updating information between the two assemblies may also make it easier for scientists to transfer information between them in the future.

This work was a collaboration between researchers from South Africa, Sudan, and Germany. 


Background: The GRCh37 human genome assembly is still widely used in genomics despite the fact that an updated human genome assembly (GRCh38) has been available for many years. A particular issue with ramifications for clinical genetics is that the GRCh37 Ensembl gene annotations have been archived, and thus not updated, since 2013. These Ensembl GRCh37 gene annotations are just as ubiquitous as the older assembly itself and are the default gene models used and preferred by the majority of genomic projects internationally. In this study, we highlight the issue of genes with discrepant annotations, which have been recognized as protein-coding in the new but not the old assembly. These genes are ignored by all genomic resources that still rely on the archived and outdated gene annotations. Moreover, the majority, if not all, of these discrepant genes (DGs) are automatically discarded and ignored by all variant prioritization tools that rely on the GRCh37 Ensembl gene annotations.

Methods: We performed a bioinformatics analysis to identify Ensembl genes with discrepant annotations between the two most recent human genome assemblies, GRCh37 (hg19) and GRCh38 (hg38). Clinical and phenotype gene curations were obtained and compared for this gene set. Furthermore, matching RefSeq transcripts were collated and analyzed.

Results: We found hundreds of genes (N=267) that were reclassified as “protein-coding” in the new GRCh38 assembly. Notably, 169 of these genes also had a discrepant HGNC gene symbol between the two assemblies. Most genes had RefSeq matches (N=199/267), including all the genes with defined phenotypes in the Ensembl GRCh38 gene annotations (N=10). However, many of these protein-coding genes remain missing from the current RefSeq gene models (N=68).

Conclusion: We found many clinically relevant genes in this group of neglected genes, and we anticipate that many more will be found relevant in the future. For these genes, the inaccurate label of “non-protein-coding” hinders the possibility of identifying any causal sequence variants that overlap them. In addition, important annotations such as evolutionary constraint metrics are not calculated for these genes for the same reason, further relegating them to oblivion.
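
As a rough illustration of the comparison described in the Methods, the sketch below flags genes that are labelled protein-coding in a newer annotation but carry a different label in an older one. It is a minimal sketch and not the authors' pipeline: the function name `find_discrepant_genes`, the gene IDs and the biotype labels are all hypothetical, and the toy dictionaries stand in for real Ensembl annotation tables.

```python
# Toy illustration only: the gene IDs and biotypes below are made up,
# not taken from Ensembl or from the paper.

def find_discrepant_genes(old_biotypes, new_biotypes):
    """Return sorted IDs of genes that are protein-coding in the new
    annotation but present with a different biotype in the old one."""
    discrepant = []
    for gene_id, new_type in new_biotypes.items():
        old_type = old_biotypes.get(gene_id)
        if old_type is None:
            # Gene is absent from the old annotation, so there is
            # no old label to compare against.
            continue
        if new_type == "protein_coding" and old_type != "protein_coding":
            discrepant.append(gene_id)
    return sorted(discrepant)


# Hypothetical stand-ins for the GRCh37 and GRCh38 gene annotations.
grch37 = {
    "GENE_A": "lincRNA",
    "GENE_B": "protein_coding",
    "GENE_C": "pseudogene",
}
grch38 = {
    "GENE_A": "protein_coding",  # reclassified: a discrepant gene
    "GENE_B": "protein_coding",  # unchanged
    "GENE_C": "pseudogene",      # unchanged
    "GENE_D": "protein_coding",  # new gene, not in the old annotation
}

print(find_discrepant_genes(grch37, grch38))  # prints ['GENE_A']
```

A tool that only reads the older labels would treat `GENE_A` as non-coding and skip it, which is the kind of loss the study describes.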


This summary is a free resource intended to make African research, and research that affects Africa, more accessible to non-expert global audiences. It was compiled by ScienceLink's team of professional African science communicators as part of the Masakhane MT: Decolonise Science project. ScienceLink has taken every precaution possible during the writing, editing, and fact-checking process to ensure that this summary is easy to read and understand, while accurately reporting on the facts presented in the original research paper. Note, however, that this summary has not been fact-checked or approved by the authors of the original research paper, so it should be used as a secondary resource. Therefore, before using, citing or republishing this summary, please verify the information presented with the original authors of the research paper, or email [email protected] for more information.

