MIT decodes SARS-CoV-2 protein genes


Nandkumar M Kamat

The Massachusetts Institute of Technology (MIT), USA, has finally cracked the secrets of the genome of SARS-CoV-2, the novel coronavirus that has caused the current COVID-19 pandemic.

In their paper titled ‘SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes’ (Nat Commun 12, 2642 (2021)), the authors Irwin Jungreis, Rachel Sealfon, and Manolis Kellis mention that “despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology.” They used comparative genomics to provide a high-confidence protein-coding gene set, characterise evolutionary constraint, and prioritise functional mutations, and they quantified protein-coding evolutionary signatures and overlapping constraint.

The team found “strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function.” They also concluded that “no other conserved protein-coding genes remain to be discovered. Mutation analysis suggests ORF8 contributes to within-individual fitness but not person-to-person transmission. Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection. Evolutionary histories of residues disrupted by spike-protein substitutions D614G, N501Y, E484K, and K417N/T provide clues about their biology,” and they catalogued likely-functional co-inherited mutations. Previously reported RNA-modification sites showed no enrichment for conservation.

The MIT press release mentions that their research team had analysed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that gene has evolved in the past with how much it has evolved since the current pandemic began. They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend.
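The gene-by-gene comparison described above can be illustrated with a small Python sketch. It is not the authors' method: it simply assumes that the expected share of pandemic-era mutations in a gene is proportional to the gene's length multiplied by its long-term relative rate, and then reports the ratio of observed to expected counts. The function name, the gene names and all the numbers are hypothetical placeholders chosen for illustration.

```python
def observed_vs_expected(genes):
    """Return observed/expected mutation ratios per gene.

    genes maps a gene name to a tuple of
    (length in nucleotides, relative long-term rate, observed mutations).
    Assumption: expected mutations are proportional to length * rate.
    """
    total_observed = sum(obs for _, _, obs in genes.values())
    weights = {name: length * rate for name, (length, rate, _) in genes.items()}
    total_weight = sum(weights.values())

    ratios = {}
    for name, (_, _, obs) in genes.items():
        expected = total_observed * weights[name] / total_weight
        ratios[name] = obs / expected
    return ratios


# Made-up toy values, NOT data from the paper.
toy_genes = {
    "geneA": (1000, 1.0, 10),
    "geneB": (1000, 2.0, 20),
    "geneC": (2000, 0.5, 30),
}

ratios = observed_vs_expected(toy_genes)
for name, ratio in ratios.items():
    print(f"{name}: observed/expected = {ratio:.2f}")
```

A ratio close to 1 would mean a gene is changing during the pandemic about as fast as its long-term history predicts; ratios well below or above 1 correspond to the "fewer-than-expected" and "more-than-expected" cases the authors describe.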

The researchers also analysed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. This work is important for specific molecular targeting of the dreaded coronavirus because this team has reported “a high-confidence gene set and evolutionary-history annotations providing valuable resources and insights on SARS-CoV-2 biology, mutations, and evolution.”

For this study, they selected and aligned 44 complete Sarbecovirus genomes (SARS-CoV-2, SARS-CoV, and 42 bat-infecting strains) at evolutionary distances well-suited for identifying protein-coding genes and non-coding purifying selection, spanning less than three substitutions per four-fold degenerate site on average, and ranging from 1.2 (E) to 4.8 (O-MT/nsp16) and higher. They found that betacoronaviruses outside Sarbecovirus (including MERS-CoV) are too distant, and SARS-CoV-2/SARS-CoV isolates too proximal, for reliable evolutionary signatures. Evolutionary distances between SARS-CoV-2 and other sarbecoviruses, as measured by nucleotide identity, vary substantially across the genome.
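For readers unfamiliar with the term, a four-fold degenerate site is the third position of a codon whose amino acid stays the same whichever of the four bases occupies it, so changes there are largely invisible to selection on the protein. The Python sketch below, a simplified illustration and not the pipeline used in the paper, counts such sites in two aligned, in-frame coding sequences and how many of them differ. It assumes the standard genetic code; the function name and the example sequences are made up. Note that this raw count is not the same thing as the "substitutions per site" figure quoted above, which comes from a model fitted over many genomes and corrects for repeated changes at the same site.

```python
# First two bases of the eight codon families that are four-fold degenerate
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_differences(seq_a, seq_b):
    """Return (differences, sites) at four-fold degenerate third positions.

    seq_a and seq_b must be aligned, in-frame coding sequences of equal
    length. Codons containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    sites = 0
    differences = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if any(base not in "ACGT" for base in codon_a + codon_b):
            continue
        # Count the site only if both codons share a four-fold prefix.
        if codon_a[:2] == codon_b[:2] and codon_a[:2] in FOURFOLD_PREFIXES:
            sites += 1
            if codon_a[2] != codon_b[2]:
                differences += 1
    return differences, sites


# Made-up example: codons GCT/GCC, CTG/CTG, ATG/ATG, GGA/GGT.
result = fourfold_differences("GCTCTGATGGGA", "GCCCTGATGGGT")
print(result)  # (2, 3)
```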

They used “comparative genomics to determine the conserved functional protein-coding genes of SARS-CoV-2, resulting in a new high-confidence evolutionarily and experimentally supported reference gene set, including ORFs 1a, 1ab, S, 3a, 3c, E, M, 6, 7a, 7b, 8, N, and 9b, but excluding 3d, 3b, 9c, and 10, which lack evidence of translation, and 2b and 3d-2, which lack evidence of function.”

The team showed that the novel ORF3c is functional and conserved, and that no other conserved genes remain to be discovered. The whole work in comparative functional genomics was based on powerful computational tools such as PhyloCSF, CodAlignView and FRESCo, which, they claim, had been applied to a viral genome for the first time. The MIT team has claimed that, overall, their “new reference gene set provides a solid foundation for systematically dissecting the function of SARS-CoV-2 proteins and focusing experimental work on high-confidence uncharacterised ORFs, which can be guided in part by their evolutionary dynamics (such as the rapid evolution of part of ORF6, indicating a possible adaptive role, and the contribution of ORF8 to fitness within an individual but not to transmission).”

The MIT team also claims that “in addition, their gene-level, codon-level, and nucleotide-level Sarbecovirus constraint, and the classification of all existing and potential SNVs and known RNA modification sites into likely-functional vs likely-neutral based on their evolutionary history, provide important foundations for elucidating SARS-CoV-2 biology, understanding its evolutionary dynamics, prioritising candidate driver mutations among co-inherited mutations, and prioritising candidate regions for vaccine design and refinement.” What they mean is that this is pioneering work in understanding the functional genomics and structural proteomics of SARS-CoV-2, and that, using their powerful computational tools (PhyloCSF, CodAlignView and FRESCo), they have provided important leads to molecular virologists, immunologists, and vaccine and drug designers.

MIT says that the data generated by their research team “could help other scientists focus their attention on the mutations that appear most likely to have significant effects on the virus’ infectivity.” They have made the annotated gene set and their mutation classifications available in the University of California at Santa Cruz Genome Browser for other researchers who wish to use them.

This is the web link of the genome browser for anyone curious to find out more about this work and use it