When gene expression datasets are opened in Excel (Microsoft Corp., Redmond, WA) under default settings, gene names are frequently converted to dates. Similarly, if gene names are copied from another application (e.g. a text processor) and pasted into an Excel spreadsheet without specifying the cell format, the same conversion can occur 1. While Excel is popular and widely used in data analysis, these auto-conversions can affect pathway enrichment analysis, as many pathway enrichment tools, such as Enrichr 2, Gene Set Enrichment Analysis (GSEA) 3,4 and Ingenuity Pathway Analysis 5, rely on gene symbols to query pathway databases such as Gene Ontology 6,7 and Reactome 8.
As dates are not recognized by these pathway databases, this can leave gaps in pathway enrichment analysis. For instance, septins (e.g. SEPT1), which are involved in cell division, are internally converted to SEP-01 in Excel, which cannot be recognized by other databases. This problem is so widespread that approximately one-fifth of published papers with supplementary Excel gene lists contain erroneous gene name conversions 9,10. As many of these datasets are frequently accessed by other data scientists, such errors may be carried over to other scientific publications, further distorting downstream data analysis. To tackle this issue, the HUGO Gene Nomenclature Committee (HGNC) announced in 2017 that it would update the gene names that may be unintentionally converted to dates in Excel files 11.
This move was well received by researchers and data scientists, as adopting the updated gene names allows gene expression data to be shared without concern about the automatic conversion of gene symbols to dates in Excel. However, at present, most published gene expression data, especially microarray datasets, have not been updated to the newly approved gene names. We therefore developed Gene Updater, a web tool that allows researchers to convert previous gene names to the newly approved gene names recommended by HGNC. Moreover, if gene names have been unintentionally converted to dates by Excel, the web tool allows researchers to restore them to the correct gene names. We believe that these efforts will facilitate gene expression data sharing between researchers who may be working on different analytics platforms.
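The two corrections described in this section, restoring date-converted entries and updating previous symbols to the newly approved ones, can be illustrated with a minimal Python sketch. This is not the implementation of the Gene Updater web tool. The function name `update_gene_name` and the lookup tables are hypothetical, and only the septin family (SEPT1 to SEPTIN1) is covered. Other date-like symbols would need a fuller mapping and, where one date could correspond to more than one gene, manual review.

```python
import re

# Illustrative subset only (assumption): previous symbol prefix -> updated prefix.
PREFIX_UPDATES = {"SEPT": "SEPTIN"}

# Illustrative subset only (assumption): month abbreviation -> updated prefix.
MONTH_TO_PREFIX = {"SEP": "SEPTIN"}

DAY_FIRST = re.compile(r"^(\d{1,2})-([A-Za-z]{3})$")    # e.g. "1-Sep"
MONTH_FIRST = re.compile(r"^([A-Za-z]{3})-(\d{1,2})$")  # e.g. "SEP-01"
OLD_SYMBOL = re.compile(r"^([A-Za-z]+)(\d+)$")          # e.g. "SEPT1"


def update_gene_name(value):
    """Return the updated symbol for a date-like or previous gene name.

    Values that match no rule are returned unchanged.
    """
    text = value.strip()

    # Case 1: the entry was converted to a date by the spreadsheet.
    month = number = None
    match = DAY_FIRST.match(text)
    if match:
        number, month = match.group(1), match.group(2)
    else:
        match = MONTH_FIRST.match(text)
        if match:
            month, number = match.group(1), match.group(2)
    if month is not None and month.upper() in MONTH_TO_PREFIX:
        # int() drops the leading zero, so "01" becomes 1.
        return f"{MONTH_TO_PREFIX[month.upper()]}{int(number)}"

    # Case 2: the entry is a previous symbol that has an updated prefix.
    match = OLD_SYMBOL.match(text)
    if match and match.group(1).upper() in PREFIX_UPDATES:
        return f"{PREFIX_UPDATES[match.group(1).upper()]}{int(match.group(2))}"

    # Anything else is left as it is.
    return text


if __name__ == "__main__":
    for entry in ["SEPT1", "SEP-01", "1-Sep", "TP53"]:
        print(entry, "->", update_gene_name(entry))
```

Symbols that match neither pattern, such as TP53, pass through unchanged, so the function can be applied to an entire gene column without altering unaffected entries.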