Tag Archives: syntax

Reading a virus like a book

Teaching grammar and syntax to artificial intelligence (AI) algorithms (specifically natural language processing (NLP) algorithms) has helped researchers understand and predict viral mutations more speedily. This facility is especially useful at a time when the Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus seems to be mutating into more easily transmissible variants.

Will Douglas Heaven’s Jan. 14, 2021 article for the Massachusetts Institute of Technology’s MIT Technology Review describes the work that links AI, grammar, and mutating viruses (Note: Links have been removed),

Galileo once observed that nature is written in math. Biology might be written in words. Natural-language processing (NLP) algorithms are now able to generate protein sequences and predict virus mutations, including key changes that help the coronavirus evade the immune system.

The key insight making this possible is that many properties of biological systems can be interpreted in terms of words and sentences. “We’re learning the language of evolution,” says Bonnie Berger, a computational biologist at the Massachusetts Institute of Technology [MIT].

In the last few years, a handful of researchers—including teams from geneticist George Church’s [Professor of Health Sciences and Technology at Harvard University and MIT, etc.] lab and Salesforce [emphasis mine]—have shown that protein sequences and genetic codes can be modeled using NLP techniques.

In a study published in Science today, Berger and her colleagues pull several of these strands together and use NLP to predict mutations that allow viruses to avoid being detected by antibodies in the human immune system, a process known as viral immune escape. The basic idea is that the interpretation of a virus by an immune system is analogous to the interpretation of a sentence by a human.

Berger’s team uses two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus—characteristics such as how good it is at infecting a host—can be interpreted in terms of grammatical correctness. A successful, infectious virus is grammatically correct; an unsuccessful one is not.

Similarly, mutations of a virus can be interpreted in terms of semantics. Mutations that make a virus appear different to things in its environment—such as changes in its surface proteins that make it invisible to certain antibodies—have altered its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.

Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of Sars-Cov-2, the virus that causes covid-19. “There’s less data for the coronavirus because there’s been less surveillance,” says Brian Hie, a graduate student at MIT, who built the models.

The overall aim of the approach is to identify mutations that might let a virus escape an immune system without making it less infectious—that is, mutations that change a virus’s meaning without making it grammatically incorrect.

But it’s also just the beginning. Treating genetic mutations as changes in meaning could be applied in different ways across biology. “A good analogy can go a long way,” says Bryson [Bryan Bryson, a biologist at MIT].

If you have time, I recommend reading Heaven’s Jan. 14, 2021 article in its entirety as it’s well written with clear explanations. As for the article’s mentions of George Church and Salesforce, the former could be expected while the latter is not (by me, I speak for no one else).

I find it fascinating that a company which describes itself (from What is Salesforce?) as providing “… customer relationship management, or CRM. It gives all your departments — including marketing, sales, commerce, and service — a shared view of your customers … ” seems to be conducting investigations into one (or more?) areas of biology.

For those who’d like to dive into the science as described in Heaven’s article, here’s a link to and a citation for the paper,

Learning the language of viral evolution and escape by Brian Hie, Ellen D. Zhong, Bonnie Berger, Bryan Bryson. Science 15 Jan 2021: Vol. 371, Issue 6526, pp. 284-288 DOI: 10.1126/science.abd7331

This paper appears to be open access (or it is, at least for now).

There is also a preprint version available on bioRxiv, which is an open access repository.

Evolution of literature as seen by a classicist, a biologist and a computer scientist

Studying intertextuality shows how books are related in various ways and are reorganized and recombined over time. Image courtesy of Elena Poiata.

I find the image more instructive when I read it from the bottom up. For those who prefer to prefer to read from the top down, there’s this April 5, 2017 University of Texas at Austin news release (also on EurekAlert),

A classicist, biologist and computer scientist all walk into a room — what comes next isn’t the punchline but a new method to analyze relationships among ancient Latin and Greek texts, developed in part by researchers from The University of Texas at Austin.

Their work, referred to as quantitative criticism, is highlighted in a study published in the Proceedings of the National Academy of Sciences. The paper identifies subtle literary patterns in order to map relationships between texts and more broadly to trace the cultural evolution of literature.

“As scholars of the humanities well know, literature is a system within which texts bear a multitude of relationships to one another. Understanding what is distinctive about one text entails knowing how it fits within that system,” said Pramit Chaudhuri, associate professor in the Department of Classics at UT Austin. “Our work seeks to harness the power of quantification and computation to describe those relationships at macro and micro levels not easily achieved by conventional reading alone.”

In the study, the researchers create literary profiles based on stylometric features, such as word usage, punctuation and sentence structure, and use techniques from machine learning to understand these complex datasets. Taking a computational approach enables the discovery of small but important characteristics that distinguish one work from another — a process that could require years using manual counting methods.

“One aspect of the technical novelty of our work lies in the unusual types of literary features studied,” Chaudhuri said. “Much computational text analysis focuses on words, but there are many other important hallmarks of style, such as sound, rhythm and syntax.”

Another component of their work builds on Matthew Jockers’ literary “macroanalysis,” which uses machine learning to identify stylistic signatures of particular genres within a large body of English literature. Implementing related approaches, Chaudhuri and his colleagues have begun to trace the evolution of Latin prose style, providing new, quantitative evidence for the sweeping impact of writers such as Caesar and Livy on the subsequent development of Roman prose literature.

“There is a growing appreciation that culture evolves and that language can be studied as a cultural artifact, but there has been less research focused specifically on the cultural evolution of literature,” said the study’s lead author Joseph Dexter, a Ph.D. candidate in systems biology at Harvard University. “Working in the area of classics offers two advantages: the literary tradition is a long and influential one well served by digital resources, and classical scholarship maintains a strong interest in close linguistic study of literature.”

Unusually for a publication in a science journal, the paper contains several examples of the types of more speculative literary reading enabled by the quantitative methods introduced. The authors discuss the poetic use of rhyming sounds for emphasis and of particular vocabulary to evoke mood, among other literary features.

“Computation has long been employed for attribution and dating of literary works, problems that are unambiguous in scope and invite binary or numerical answers,” Dexter said. “The recent explosion of interest in the digital humanities, however, has led to the key insight that similar computational methods can be repurposed to address questions of literary significance and style, which are often more ambiguous and open ended. For our group, this humanist work of criticism is just as important as quantitative methods and data.”

The paper is the work of the Quantitative Criticism Lab (www.qcrit.org), co-directed by Chaudhuri and Dexter in collaboration with researchers from several other institutions. It is funded in part by a 2016 National Endowment for the Humanities grant and the Andrew W. Mellon Foundation New Directions Fellowship, awarded in 2016 to Chaudhuri to further his education in statistics and biology. Chaudhuri was one of 12 scholars selected for the award, which provides humanities researchers the opportunity to train outside of their own area of special interest with a larger goal of bridging the humanities and social sciences.

Here’s another link to the paper along with a citation,

Quantitative criticism of literary relationships by Joseph P. Dexter, Theodore Katz, Nilesh Tripuraneni, Tathagata Dasgupta, Ajay Kannan, James A. Brofos, Jorge A. Bonilla Lopez, Lea A. Schroeder, Adriana Casarez, Maxim Rabinovich, Ayelet Haimson Lushkov, and Pramit Chaudhuri. PNAS Published online before print April 3, 2017, doi: 10.1073/pnas.1611910114

This paper appears to be open access.

AI (artificial intelligence) and logical dialogue in Japanese

Hitachi Corporation has been exciting some interest with its announcement of the latest iteration of its artificial intelligence programme’s and its new ability to speak Japanese (from a June 5, 2016 news item on Nanotechnology Now),

Today, the social landscape changes rapidly and customer needs are becoming increasingly diversified. Companies are expected to continuously create new services and values. Further, driven by recent advancements in information & telecommunication and analytics technologies, interest is growing in technology that can extract valuable insight from big data which is generated on a daily basis.

Hitachi has been developing a basic AI technology that analyzes huge volumes of English text data and presents opinions in English to help enterprises make business decisions. The original technology required rules of grammar specific to the English language to be programmed, to extract sentences representing reasons and grounds for opinions. This process represented a hurdle in applying system to Japanese or any other language as it required dedicated programs correlated to the linguistic rules of the target language.

By applying deep learning, this issue was eliminated thus enabling the new technology to recognize sentences that have high probability of being reasons and grounds without relying on linguistic rules. More specifically, the AI system is presented with sentences which represent reasons and grounds extracted from thousands of articles. Learning from the rules and patterns, the system becomes discriminating of sentences which represent reasons and grounds in new articles. Hitachi added an attention mechanism” which support deep learning to estimate which words and phrases are worthy of attention in texts like news articles and research reports. The “attention mechanism” helps the system to grasp the points that require attention, including words and phrases related to topics and values. This method enables the system to distinguish sentences which have a high probability of being reasons and grounds from text data in any language.

They have plans for this technology,

The technology developed will be core technology in achieving a multi-lingual AI system capable of offering opinion. Hitachi will pursue further research to realize AI systems supporting business decision making by enterprises worldwide.

The June 2, 2016 Hitachi news release which originated the news item can be found here.