Reading a virus like a book

Teaching grammar and syntax to artificial intelligence (AI) algorithms, specifically natural language processing (NLP) algorithms, has helped researchers understand and predict viral mutations more quickly. That ability is especially useful at a time when severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) seems to be mutating into more easily transmissible variants.

Will Douglas Heaven’s Jan. 14, 2021 article for the Massachusetts Institute of Technology’s MIT Technology Review describes the work that links AI, grammar, and mutating viruses (Note: Links have been removed),

Galileo once observed that nature is written in math. Biology might be written in words. Natural-language processing (NLP) algorithms are now able to generate protein sequences and predict virus mutations, including key changes that help the coronavirus evade the immune system.

The key insight making this possible is that many properties of biological systems can be interpreted in terms of words and sentences. “We’re learning the language of evolution,” says Bonnie Berger, a computational biologist at the Massachusetts Institute of Technology [MIT].

In the last few years, a handful of researchers—including teams from geneticist George Church’s [Professor of Health Sciences and Technology at Harvard University and MIT, etc.] lab and Salesforce [emphasis mine]—have shown that protein sequences and genetic codes can be modeled using NLP techniques.

In a study published in Science today, Berger and her colleagues pull several of these strands together and use NLP to predict mutations that allow viruses to avoid being detected by antibodies in the human immune system, a process known as viral immune escape. The basic idea is that the interpretation of a virus by an immune system is analogous to the interpretation of a sentence by a human.

Berger’s team uses two different linguistic concepts: grammar and semantics (or meaning). The genetic or evolutionary fitness of a virus—characteristics such as how good it is at infecting a host—can be interpreted in terms of grammatical correctness. A successful, infectious virus is grammatically correct; an unsuccessful one is not.

Similarly, mutations of a virus can be interpreted in terms of semantics. Mutations that make a virus appear different to things in its environment—such as changes in its surface proteins that make it invisible to certain antibodies—have altered its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.

Instead of millions of sentences, they trained the NLP model on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences for a strain of influenza, 60,000 for a strain of HIV, and between 3,000 and 4,000 for a strain of Sars-Cov-2, the virus that causes covid-19. “There’s less data for the coronavirus because there’s been less surveillance,” says Brian Hie, a graduate student at MIT, who built the models.

The overall aim of the approach is to identify mutations that might let a virus escape an immune system without making it less infectious—that is, mutations that change a virus’s meaning without making it grammatically incorrect.

But it’s also just the beginning. Treating genetic mutations as changes in meaning could be applied in different ways across biology. “A good analogy can go a long way,” says Bryson [Bryan Bryson, a biologist at MIT].
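To make the grammar-and-semantics analogy in the excerpt a little more concrete, here is a minimal Python sketch of how candidate mutations might be ranked. It is a toy illustration only and not the model from the paper: the mutation labels, the grammaticality and semantic-change numbers, and the function names are all made up for the example, and combining the two scores by adding their ranks is just one simple choice.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters


def single_mutations(seq: str) -> List[Tuple[int, str, str]]:
    """List every single-letter substitution as (position, old, new)."""
    return [
        (i, old, new)
        for i, old in enumerate(seq)
        for new in AMINO_ACIDS
        if new != old
    ]


def rank_escape_candidates(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Order mutations so those high on BOTH scores come first.

    scores maps a mutation label to (grammaticality, semantic_change).
    In a real system both numbers would come from a trained language
    model; here they are hypothetical values typed in by hand.
    """
    def ranks(index: int) -> Dict[str, int]:
        ordered = sorted(scores, key=lambda m: scores[m][index], reverse=True)
        return {m: r for r, m in enumerate(ordered)}

    gram_rank = ranks(0)
    sem_rank = ranks(1)
    # A lower combined rank means "still grammatical" AND "changed meaning".
    return sorted(scores, key=lambda m: gram_rank[m] + sem_rank[m])


# Hypothetical scores for three made-up mutations.
toy_scores = {
    "A2G": (0.9, 0.8),  # viable and looks different: an escape candidate
    "K5W": (0.1, 0.9),  # looks different but probably breaks the virus
    "L7I": (0.8, 0.1),  # viable but looks much the same to antibodies
}

print(rank_escape_candidates(toy_scores))
print(len(single_mutations("MKT")))  # 3 positions x 19 alternatives each
```

In this toy data the mutation that scores well on both measures lands at the top of the list, which is the kind of change the excerpt describes: a different "meaning" in a still "grammatical" virus.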

If you have time, I recommend reading Heaven’s Jan. 14, 2021 article in its entirety as it’s well written with clear explanations. As for the article’s mentions of George Church and Salesforce, the former is to be expected while the latter is not (at least not by me; I speak for no one else).

I find it fascinating that a company which describes itself (from What is Salesforce?) as providing “… customer relationship management, or CRM. It gives all your departments — including marketing, sales, commerce, and service — a shared view of your customers … ” seems to be conducting investigations into one (or more?) areas of biology.

For those who’d like to dive into the science as described in Heaven’s article, here’s a link to and a citation for the paper,

Learning the language of viral evolution and escape by Brian Hie, Ellen D. Zhong, Bonnie Berger, Bryan Bryson. Science 15 Jan 2021: Vol. 371, Issue 6526, pp. 284-288 DOI: 10.1126/science.abd7331

This paper appears to be open access (or it is, at least for now).

There is also a preprint version available on bioRxiv, which is an open access repository.

I hear the proteins singing

Points to anyone who recognized the paraphrase of the title of the well-loved Canadian movie, “I’ve Heard the Mermaids Singing.” In this case, it’s all about protein folding and data sonification (from an Oct. 20, 2016 news item on phys.org),

Transforming data about the structure of proteins into melodies gives scientists a completely new way of analyzing the molecules that could reveal new insights into how they work – by listening to them. A new study published in the journal Heliyon shows how musical sounds can help scientists analyze data using their ears instead of their eyes.

The researchers, from the University of Tampere in Finland, Eastern Washington University in the US and the Francis Crick Institute in the UK, believe their technique could help scientists identify anomalies in proteins more easily.

An Oct. 20, 2016 Elsevier Publishing press release on EurekAlert, which originated the news item, expands on the theme,

“We are confident that people will eventually listen to data and draw important information from the experiences,” commented Dr. Jonathan Middleton, a composer and music scholar who is based at Eastern Washington University and in residence at the University of Tampere. “The ears might detect more than the eyes, and if the ears are doing some of the work, then the eyes will be free to look at other things.”

Proteins are molecules found in living things that have many different functions. Scientists usually study them visually and using data; with modern microscopy it is possible to directly see the structure of some proteins.

Using a technique called sonification, the researchers can now transform data about proteins into musical sounds, or melodies. They wanted to use this approach to ask three related questions: what can protein data sound like? Are there analytical benefits? And can we hear particular elements or anomalies in the data?

They found that a large proportion of people can recognize links between the melodies and more traditional visuals like models, graphs and tables; it seems hearing these visuals is easier than they expected. The melodies are also pleasant to listen to, encouraging scientists to listen to them more than once and therefore repeatedly analyze the proteins.

The sonifications are created using a combination of Dr. Middleton’s composing skills and algorithms, so that others can use a similar process with their own proteins. The multidisciplinary approach – combining bioinformatics and music informatics – provides a completely new perspective on a complex problem in biology.

“Protein fold assignment is a notoriously tricky area of research in molecular biology,” said Dr. Robert Bywater from the Francis Crick Institute. “One not only needs to identify the fold type but to look for clues as to its many functions. It is not a simple matter to unravel these overlapping messages. Music is seen as an aid towards achieving this unraveling.”

The researchers say their molecular melodies can be used almost immediately in teaching protein science, and after some practice, scientists will be able to use them to discriminate between different protein structures and spot irregularities like mutations.

Proteins are the first stop, but our knowledge of other molecules could also benefit from sonification; one day we may be able to listen to our genomes, and perhaps use this to understand the role of junk DNA [emphasis mine].

About 97% of our DNA (deoxyribonucleic acid) has been known for some decades as ‘junk DNA’. In roughly 2012, that notion was challenged, as Stephen S. Hall wrote in an Oct. 1, 2012 article (Hidden Treasures in Junk DNA; What was once known as junk DNA turns out to hold hidden treasures, says computational biologist Ewan Birney) for Scientific American.

Getting back to 2016, here’s a link to and a citation for ‘protein singing’,

Melody discrimination and protein fold classification by Robert P. Bywater, Jonathan N. Middleton. Heliyon 20 Oct 2016, Volume 2, Issue 10 DOI: 10.1016/j.heliyon.2016.e0017

This paper is open access.

Here’s what the proteins sound like,

Supplementary Audio 3 file for Supplementary Figure 2: 1r75 OHEL sonification, full score. [downloaded from the previously cited Heliyon paper]

Joanna Klein has written an Oct. 21, 2016 article for the New York Times providing a slightly different take on this research (Note: Links have been removed),

“It’s used for the concert hall. It’s used for sports. It’s used for worship. Why can’t we use it for our data?” said Jonathan Middleton, the composer at Eastern Washington University and the University of Tampere in Finland who worked with Dr. Bywater.

Proteins have been around for billions of years, but humans still haven’t come up with a good way to visualize them. Right now scientists can shoot a laser at a crystallized protein (which can distort its shape), measure the patterns it spits out and simulate what that protein looks like. These depictions are difficult to sift through and hard to remember.

“There’s no simple equation like e=mc²,” said Dr. Bywater. “You have to do a lot of spade work to predict a protein structure.”

Dr. Bywater had been interested in assigning sounds to proteins since the 1990s. After hearing a song Dr. Middleton had composed called “Redwood Symphony,” which opens with sounds derived from the tree’s DNA, he asked for his help.

Using a process called sonification (which is the same thing used to assign different ringtones to texts, emails or calls on your cellphone) the team took three proteins and turned their folding shapes — a coil, a turn and a strand — into musical melodies. Each shape was represented by a bunch of numbers, and those numbers were converted into a musical code. A combination of musical sounds represented each shape, resulting in a song of simple patterns that changed with the folds of the protein. Later they played those songs to a group of 38 people together with visuals of the proteins, and asked them to identify similarities and differences between them. The two were surprised that people didn’t really need the visuals to detect changes in the proteins.
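For anyone curious about what "shapes to numbers to musical code" might look like in practice, here is a small Python sketch. It rests on assumptions of my own: the one-letter shape codes, the three-note motifs, and the function names are invented for illustration and are not the musical code used in the study.

```python
from typing import List

# Hypothetical mapping from fold shape to a short motif of MIDI note
# numbers (60 = middle C). The study's actual code is not reproduced here.
SHAPE_MOTIFS = {
    "C": [60, 62, 64],  # coil: rising steps
    "T": [67, 65, 67],  # turn: dip and return
    "S": [72, 72, 72],  # strand: repeated note
}


def sonify(shapes: str) -> List[int]:
    """Turn a string of shape codes into a flat list of MIDI note numbers."""
    melody: List[int] = []
    for code in shapes:
        melody.extend(SHAPE_MOTIFS[code])
    return melody


def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to a frequency (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def differing_positions(a: List[int], b: List[int]) -> List[int]:
    """Indices where two equal-length melodies differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


original = sonify("CCTS")
variant = sonify("CSTS")  # one shape swapped, as a mutation might do

print(original)
print(differing_positions(original, variant))
```

Because each shape gets its own motif, a single changed shape shows up as a short run of changed notes, which is the sort of difference a listener would be asked to pick out.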

Plus, I have more about data sonification in a Feb. 7, 2014 posting regarding a duet based on data from Voyager 1 & 2 spacecraft.

Finally, I hope my next Steep project will include sonification of data on gold nanoparticles. I will keep you posted on any developments.