Unsupervised genome-wide variant effect predictor based on pretraining of DNA language models

In a recent study published in PNAS, researchers introduced the Genomic Pre-trained Network (GPN), a multispecies model developed to learn genome-wide variant effects by self-supervised pretraining on genomic deoxyribonucleic acid (DNA) sequences.



Genetic variation across the genome contributes to complex diseases and agricultural traits, yet interpreting it remains challenging. Although genome-wide association studies (GWAS) provide biological insights, identifying causal variants remains difficult.

Experimental validation is time-consuming and expensive, underscoring the need for accurate, scalable computational approaches to predict the impact of genetic variants across the whole genome.

Unsupervised pretraining on large protein sequence databases has proven effective at extracting complex information about proteins and learning variant effects in coding regions.

About the study

In the present study, researchers proposed a genome-wide variant effect prediction strategy based on unsupervised DNA language models, which achieved state-of-the-art performance in Arabidopsis thaliana, a model organism for plant biology and a source of insight into human disorders.

To pre-train the language model, which is based on a convolutional neural network, the researchers used unaligned genomes from Arabidopsis thaliana and seven related Brassicales species, using the AraGWAS Catalog for reference. The model was trained to predict masked nucleotides from their surrounding sequence context.
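The masked-nucleotide objective can be illustrated with a small Python sketch. This is not the authors' training code: the mask symbol, the 15% default mask rate, and the function name `mask_sequence` are assumptions chosen for illustration. It only shows how a sequence is turned into a masked input plus the targets a model would be asked to recover.

```python
import random

MASK = "?"  # hypothetical mask symbol, not GPN's actual token


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Mask random positions of a DNA string.

    Returns (masked_seq, targets), where targets maps each masked
    index to the original nucleotide the model should predict.
    The 15% default rate is an assumption for this sketch.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    masked = list(seq)
    targets = {}
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            masked[i] = MASK
            targets[i] = base
    return "".join(masked), targets


if __name__ == "__main__":
    masked, targets = mask_sequence("ACGTACGTACGTACGTACGT", mask_rate=0.3)
    print(masked)
    print(targets)
```

A model trained on such pairs sees only the masked string and is scored on how much probability it gives to each hidden nucleotide.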

The scientists averaged GPN's contextual embeddings (512 dimensions) of nucleotides over 100-base-pair (bp) windows from the reference genome and visualized them using Uniform Manifold Approximation and Projection (UMAP) to assess how well the model had learned genomic organization.

A logistic regression classifier was trained using the averaged embeddings as features to measure GPN's capacity to discriminate genomic regions. Each genomic position was then masked individually, and the model's output distribution over nucleotides, given the surrounding context, was recorded.

To make these predicted distributions easier to use, the researchers produced sequence logos that can be viewed in the University of California Santa Cruz (UCSC) Genome Browser.
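The window-averaging step can be sketched in a few lines of Python. The function below is a minimal illustration, assuming non-overlapping windows and that a trailing partial window is dropped; neither detail is stated in the article, and `average_windows` is a hypothetical name.

```python
def average_windows(embeddings, window=100):
    """Average per-nucleotide embedding vectors over consecutive windows.

    embeddings: list of equal-length lists of floats, one per nucleotide.
    Windows are non-overlapping and a trailing partial window is dropped
    (both are assumptions made for this sketch).
    """
    averaged = []
    for start in range(0, len(embeddings) - window + 1, window):
        chunk = embeddings[start:start + window]
        dim = len(chunk[0])
        averaged.append(
            [sum(vec[d] for vec in chunk) / window for d in range(dim)]
        )
    return averaged


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for 250 positions (GPN uses 512 dimensions).
    toy = [[float(i), 1.0] for i in range(250)]
    print(average_windows(toy))
```

Each averaged vector could then serve as one point in a UMAP projection or one feature row for a classifier.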
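As a rough illustration of how a predicted nucleotide distribution becomes a logo column, the sketch below uses the standard information-content formula for DNA logos (2 bits minus the Shannon entropy, with each letter scaled by its probability). Whether the authors applied exactly this formula is an assumption, and `logo_heights` is a hypothetical helper.

```python
import math


def logo_heights(probs):
    """Convert a nucleotide distribution into sequence-logo letter heights.

    probs: dict mapping "A", "C", "G", "T" to probabilities summing to 1.
    Uses the standard information content 2 - H (in bits) with no
    small-sample correction, which is an assumption of this sketch.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    info = 2.0 - entropy
    return {base: p * info for base, p in probs.items()}


if __name__ == "__main__":
    confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    uncertain = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(logo_heights(confident))
    print(logo_heights(uncertain))
```

Positions where the model is confident produce tall letters, while positions with a near-uniform prediction collapse toward zero height.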

GPN scores were computed for in silico mutagenesis of single-nucleotide polymorphisms (SNPs) across a 1-Mb region, and the results were averaged across variant types. Subsequently, the researchers examined over 10 million SNPs from natural accessions in the 1001 Genomes Project to assess GPN's ability to predict the functional impact of genetic variants in A. thaliana.

Code was provided to train a GPN model for any given species based solely on its DNA sequence, allowing unsupervised estimation of variant effects across the whole genome. To assess how well the model identifies putative functional variants, the researchers analyzed the enrichment of rare versus common genetic variants in the tail of the genome-wide score distribution.


The GPN model, trained without supervision, effectively learned gene structure and DNA patterns in Arabidopsis thaliana, a plant model organism that is closely related to several agriculturally relevant species and can provide insights into human disorders.

The approach outperformed established conservation scores such as phastCons and phyloP, which are based on a whole-genome alignment of 18 related Brassicales species. GPN's internal representation of DNA sequences could discriminate genomic regions such as untranslated regions (UTRs), introns, and coding sequences, and its prediction confidence could aid in discovering regulatory grammar, such as transcription factor binding motifs.

GPN achieved its highest accuracy on coding sequences (CDS, 96%) and its lowest on non-coding ribonucleic acid (ncRNA, 51%), the least common class. The model could identify intergenic regions, introns, CDS, UTRs, and ncRNA.

The prediction confidence of the model was associated with the expected functionality of the sites, and start and stop codon motifs were typically accurately predicted.

Using the log-likelihood ratio between the alternative and reference alleles, GPN can assign a pathogenicity or functionality score to each SNP in the genome. The ranking of variant types by the lowest percentile of GPN scores was generally consistent with established notions of deleteriousness.

Eight percent and nine percent of repeat variants were ranked ahead of the first decile of missense variants in models with 0.0 and 0.1 down-weighting, respectively. Putative functional SNPs, defined as those with the lowest 0.1% of GPN scores, were 5.5-fold enriched in rare variants.
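The log-likelihood ratio can be written out directly. In the sketch below, `probs` stands for the model's predicted distribution at a masked position; the function name and the use of the natural logarithm are assumptions for illustration.

```python
import math


def llr_score(probs, ref, alt):
    """Log-likelihood ratio of the alternative versus the reference allele.

    probs: dict mapping nucleotides to predicted probabilities at the
    masked position. Negative scores mean the model finds the alternative
    allele less likely than the reference. The natural log is an
    assumption of this sketch.
    """
    return math.log(probs[alt] / probs[ref])


if __name__ == "__main__":
    probs = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    print(llr_score(probs, ref="A", alt="C"))  # negative: C is disfavored
```

Under this convention, the most negative scores correspond to the variants the model considers most disruptive.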
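One common way to quantify this kind of tail enrichment is an odds ratio over a 2x2 table of rare versus common variants, inside versus outside the lowest-scoring tail. The sketch below assumes that formulation; the function name, the boolean `is_rare` labels, and the handling of the tail size are hypothetical choices, not details taken from the study.

```python
def tail_enrichment(scores, is_rare, tail_fraction=0.001):
    """Odds ratio for rare variants in the lowest-scoring tail.

    scores: list of variant scores (lower means more deleterious).
    is_rare: list of booleans, True when the variant is rare.
    tail_fraction: fraction of variants treated as the tail (0.1% default).
    Raises ZeroDivisionError if the tail has no common variants or the
    remainder has no rare variants.
    """
    n = len(scores)
    n_tail = max(1, int(n * tail_fraction))
    order = sorted(range(n), key=lambda i: scores[i])
    tail = order[:n_tail]

    rare_tail = sum(1 for i in tail if is_rare[i])
    common_tail = n_tail - rare_tail
    rare_rest = sum(1 for flag in is_rare if flag) - rare_tail
    common_rest = (n - n_tail) - rare_rest

    return (rare_tail * common_rest) / (common_tail * rare_rest)


if __name__ == "__main__":
    scores = [-5, -4, 0, 1, 2, 3, 4, 5, 6, 7]
    is_rare = [True, False, True] + [False] * 7
    print(tail_enrichment(scores, is_rare, tail_fraction=0.2))
```

An odds ratio above 1 indicates that rare variants are over-represented among the lowest-scoring variants.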

GPN has the advantage of assigning markedly different scores to genetic variants that are in strong linkage disequilibrium (LD) with one another if their surrounding contexts differ.

The GPN-LD approach effectively separated GWAS hits from non-hits: SNPs with the lowest 1% of GPN-LD scores were 10-fold more enriched in GWAS hits than those in the remaining 99%.

Surprisingly, the model trained with an intermediate weight on repeats performed best. When the entire variant set was evaluated, including positions that do not align with other Brassicales, the GPN-LD approach produced significantly higher odds ratios.


Based on the study findings, the Genomic Pre-trained Network (GPN) reliably predicts genome-wide variant effects from genomic sequence alone. It can be applied to any species and may be used to improve GWAS fine-mapping and polygenic risk scores.

Since GPN is trained only on DNA sequences, it can be used for understudied non-model species that lack comprehensive functional genomics data. The model learns from joint nucleotide distributions across similar contexts in the genome rather than from whole-genome alignments, which can be of lower quality in noncoding regions.

GPN predictions around splice junctions might help identify splicing factor binding sites. Future studies could evaluate the impact of down-weighting repeats based on family or age.

Journal reference:
  • Gonzalo Benegas, Sanjit Singh Batra, and Yun S. Song (2023). DNA language models are powerful predictors of genome-wide variant effects. PNAS, Vol. 120, No. 44, e2311219120. doi: https://doi.org/10.1073/pnas.2311219120

Written by

Pooja Toshniwal Paharia
