Machine Learning in Molecular Sciences

Synced wrote about the 2017 Summer School on Machine Learning in the Molecular Sciences, held at NYU Shanghai on June 12-16 and sponsored by the NYU-ECNU Center for Computational Chemistry at NYU Shanghai. Summer School committee members Professor Mark Tuckerman, Professor Zhang Yingkai and Professor Zhang Zenghui shared their insights. The original full article was published in Chinese by Synced (机器之心), China’s first media entity that focuses on reporting machine intelligence and related technologies.

Below is the English translation of the article.


Machine Learning in Molecular Sciences: Would this Single Spark Start a Prairie Fire?

After computer vision, speech recognition, and natural language processing, have you ever wondered which field will be the next to ride the wave of deep learning? Will the natural sciences - where the world’s brightest minds gather - be the next to feel the impact? If so, how will those minds apply it? Are they worried about being replaced by neural networks? We discussed these questions with three professors of molecular sciences at the 2017 Summer School on Machine Learning in the Molecular Sciences at NYU Shanghai.

Machine learning has been spreading into the natural sciences like wildfire for a while. “Actually, ‘machine learning’ has become such a hot term this summer, probably second only to ‘physics’, that at physics thesis defenses at various universities in China, a presentation without machine learning stands out from its peers,” commented a secretary of a defense committee at Fudan University in Shanghai.

In addition to applying machine learning to research in the natural sciences, scientists, with an innate spirit of altruism, are also taking it to the next level: could we impart explanatory knowledge to machines, especially deep learning models? That is, using our own knowledge, could we help machines learn and work in a more efficient, expository, and beautiful manner? As an illustration of such efforts, physicists and mathematicians at MIT jointly published a paper last year entitled "Why does deep and cheap learning work so well?", exploring the possibilities of turning physical properties such as symmetry and locality into simple neural networks.

This June, the NYU-ECNU Center for Computational Chemistry at NYU Shanghai broke new ground by hosting the 2017 Summer School on Machine Learning in the Molecular Sciences. The five-day course was rigorous: it introduced basic machine learning methods and the computational research topics in chemistry, biology, and materials science, and showed how machine learning methods could be applied to solve key problems in these disciplines. The lecturers were very diverse in background: they came from a number of areas including theoretical chemistry, statistical physics, computer science, and bioinformatics; in fact, many of their research areas were typically "interdisciplinary", lying at the boundary between two or more of these areas. The Organizing Committee of the Summer School included John Zenghui Zhang, Professor of Chemistry at NYU and NYU Shanghai and Director of the NYU-ECNU Center for Computational Chemistry at NYU Shanghai, and Yingkai Zhang, Professor at the Department of Chemistry at NYU. Their research area - theoretical calculations of large biomolecules and biomolecular processes - is itself an interdisciplinary area combining physics, chemistry, and biology. The other committee member and lead organizer, Mark Tuckerman, is a Professor at both the NYU Department of Chemistry and the Courant Institute of Mathematical Sciences. He is a theoretical chemist and also an applied mathematician.

As a media outlet focusing on machine learning in China, Synced (in Chinese 机器之心) was invited to interview the Organizing Committee of the Summer School and to discuss with the three molecular scientists this single spark against the backdrop of molecular sciences. During the interview we had an in-depth discussion about how machine learning models could help researchers in molecular sciences, how to resolve the lack of interpretability in machine learning, and how to perceive and predict the possible impact of machine learning on the development of molecular sciences.

Machine learning in molecular sciences

Molecular science is a multidisciplinary area that studies the structure and function of molecules. Key topics in the area include molecular interactions and the study and prediction of various physical and chemical properties of molecules, such as the forming and breaking of chemical bonds, the structure and function of biomolecules, molecular recognition, and the formation of complex materials through intermolecular interactions. Molecular science is the basis of chemistry, biology, materials science, medicine, and numerous other disciplines.

The key intersection of molecular science and machine learning lies in the field of computational science. Computational science is a discipline that parallels theoretical science and experimental science. It offers insight into experimental observations by using simulation to help researchers interpret those observations in real time and, subsequently, to construct an intuitive yet predictive model for them.

However, traditional simulation methods cannot be practically applied to many systems possessing a high degree of complexity.  In those cases, machine learning can further help researchers to abstract, simplify, and describe the complexity through characterizing data in lower dimensional representations.

"Machine-readable" molecular structure

The first topic we were interested in is the input of the model.  Specifically, how can one express the molecular structure as a vector or matrix that is understandable to a computer?

Professor Tuckerman told us that creating abstract representations of molecular structures is an active research area.  Scientists seek optimal yet simple molecular representations for machine learning models. "To paraphrase Einstein, a representation should be made as simple as possible, but no simpler," said Professor Tuckerman. It is clear that the simpler the representation is, the more efficient it will be.

Therefore, the answer to this question depends on the object of study. “If it is composed of dozens of atoms comprising a small molecule, a simple matrix representation should suffice, i.e., we might reduce a molecule to a matrix that is defined as a function of the interatomic distances within the molecule. If the molecule is more complex and possibly highly flexible (so-called soft matter such as proteins and other biomolecules), then researchers might need to build a representation from certain coarse-grained characteristics of the molecule in order to capture key aspects of its structure,” said Professor Tuckerman. At present, “descriptor” is a widely used term in academic circles for characterizing the structure of such complex systems. The definition of a “molecular descriptor” was given by chemists Todeschini and Consonni, as a way to “encode the chemical information of the molecule into a set of meaningful numbers.”
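As a minimal sketch of the small-molecule case described above, the following Python function reduces a set of atomic positions to a matrix of pairwise interatomic distances. The function name `distance_matrix` and the toy coordinates are assumptions made for illustration; they are not taken from Professor Tuckerman's work, and published representations typically use more refined functions of these distances.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


# Toy coordinates in arbitrary units, chosen for easy arithmetic.
# They do not describe a real molecule.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]

matrix = distance_matrix(toy_coords)
for row in matrix:
    print(row)
```

Because the matrix depends only on distances between atoms, it does not change when the whole molecule is translated or rotated, which is one reason distance-based representations are convenient inputs for a model.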

“There are thousands of descriptors that can be used to represent molecules,” said Professor Yingkai Zhang. According to the complexity of different systems, descriptors can be classified by dimension: a one-dimensional descriptor is more like a set of statistics, wherein researchers might count the numbers of carbons, hydrogens, and other types of atoms in big molecules, along with the numbers of chemical bonds, and construct the descriptor from these counts. Two-dimensional descriptors characterize graph invariants, while three-dimensional descriptors characterize symmetries.

Researchers can assemble these descriptors together into a computer-readable input. In the case of a one-dimensional descriptor, just as natural language analysis might build a "bag of words" model, molecular science could use a "bag of atoms" or a "bag of bonds."
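The counting idea behind such one-dimensional descriptors can be sketched in a few lines of Python. This is only an illustration of the analogy with "bag of words": the function names and the input format (a list of element symbols plus bonds given as index pairs) are assumptions, and descriptors published under similar names may be defined differently.

```python
from collections import Counter


def bag_of_atoms(atoms):
    """Count how many atoms of each element the molecule contains."""
    return Counter(atoms)


def bag_of_bonds(atoms, bonds):
    """Count bonds by the (sorted) pair of element symbols they connect."""
    return Counter("-".join(sorted((atoms[i], atoms[j]))) for i, j in bonds)


# Ethanol, CH3-CH2-OH: element symbols, then bonds as pairs of atom indices.
ethanol_atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
ethanol_bonds = [(0, 1), (1, 2), (2, 8), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7)]

atom_counts = bag_of_atoms(ethanol_atoms)
bond_counts = bag_of_bonds(ethanol_atoms, ethanol_bonds)

# A fixed element order turns the counts into a plain numeric vector.
vocabulary = ["C", "H", "O"]
atom_vector = [atom_counts[symbol] for symbol in vocabulary]

print(atom_vector)
print(dict(bond_counts))
```

As with a bag of words, the counts discard all information about how the atoms are arranged, so two different molecules with the same composition and bond types would receive the same descriptor.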

The choice of descriptors depends primarily on the structure of the machine learning model to be employed. For example, a neural network is itself complex enough that it is better to use the most basic measures to represent molecules and to let the network infer the potential structural relationships.

The output of the model and optimization functions

Professor Yingkai Zhang pointed out two optimization goals for machine learning that could be used in molecular sciences.

"The study of the binding affinity, or binding strength, between molecules is a critical issue," said Professor Yingkai Zhang. "The output can be a two-dimensional vector, in which one component represents the probability of binding and the other represents the probability of not binding. This constitutes a classification problem well known in machine learning."

Professor John Zenghui Zhang made an interesting analogy: "The binding affinity of molecules is like the degree of intimacy of a couple. After observing how much they have in common, and how well they know each other, you will be able to predict if their relationship will last long."
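The two-component output that Professor Yingkai Zhang describes can be illustrated with a short Python sketch. It assumes that a model has already produced two raw scores, one for "binds" and one for "does not bind", and converts them into probabilities with a softmax, a common choice that the interview does not itself specify. The function name and the example scores are hypothetical.

```python
import math


def binding_probabilities(score_bind, score_no_bind):
    """Convert two raw scores into [P(binding), P(not binding)] via softmax."""
    # Subtracting the larger score first avoids overflow in exp().
    largest = max(score_bind, score_no_bind)
    exps = [math.exp(score_bind - largest), math.exp(score_no_bind - largest)]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores from some trained model.
probs = binding_probabilities(2.0, 0.0)
predicted_label = "binds" if probs[0] > probs[1] else "does not bind"

print(probs, predicted_label)
```

The two components always sum to one, so the vector can be read directly as the model's estimated chance of each outcome.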

Another good example is solubility, which is important in the pharmaceutical industry. Solubility can be measured experimentally, but more often researchers wish to predict the solubility of molecules possessing particular structures before a molecule is produced, and machine learning has made good progress in this aspect.

What are the advantages of machine learning compared to traditional methods?

Professor Tuckerman believes that the advantages of machine learning are efficiency and extensibility. On the one hand, if you wish to predict the formation energy of a new molecule, traditionally scientists need to solve extremely complex quantum mechanical equations at high computational overhead. Now, however, researchers can use machine learning to bypass quantum mechanical calculations for very efficient and reasonably accurate estimates.

On the other hand, research in molecular science is very broad in its scope - from small molecules composed of a few dozen atoms to very large molecules having complex conformational landscapes.  Thus, when one attempts to extend a fine-grained description that works well for small molecules to much larger, flexible molecules using an analogous coarse-grained description, that description often becomes too crude, resulting in a decrease in accuracy. This is where machine learning can step in: machine learning techniques that can model reasonably precise calculations on small molecules can often be extended to large proteins or other big molecules to get equally accurate and effective results.

Professor Tuckerman gave us a concrete example: "We provide the model with some amount of information -- the information does not necessarily need to be high dimensional -- using a few key variables to represent the structure of the molecule in a coarse-grained manner. To be more specific, we can use one collective coordinate to describe whether a subdomain is adjacent to another, and a second to describe whether two subdomains are on an adhesion plaque, and if they are on the plaque, the two domains will approach each other; conversely, if they are in a natural state, the two domains will be farther away from each other.  Whatever variables we choose, this set of variables is used to describe the structure of the molecule as input, and a model used for prediction can be obtained through training. The trained model can predict the impact of a changing environment (such as whether the protein is part of a plaque) on the protein’s structure from a set of new, unexplored structures. This kind of information can be used to make targeted adjustments to the model if needed or to predict new phenomena. In this example, the process does not require a complex, exhaustive descriptor."
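One way to picture a collective coordinate of the kind mentioned in this example is as a single number computed from many atomic positions. The Python sketch below uses the distance between the geometric centers of two subdomains and a cutoff to decide whether they are adjacent. The choice of coordinate, the unweighted centroid, the cutoff values, and the toy positions are all assumptions for illustration, not the variables used in the research described.

```python
import math


def centroid(points):
    """Geometric center of a set of 3D points (no mass weighting)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def subdomain_distance(domain_a, domain_b):
    """Collective coordinate: distance between the two subdomain centers."""
    return math.dist(centroid(domain_a), centroid(domain_b))


def are_adjacent(domain_a, domain_b, cutoff):
    """Treat two subdomains as adjacent if their centers lie within cutoff."""
    return subdomain_distance(domain_a, domain_b) <= cutoff


# Toy positions in arbitrary units; not taken from a real protein.
domain_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_b = [(0.0, 4.0, 0.0), (2.0, 4.0, 0.0)]

separation = subdomain_distance(domain_a, domain_b)
print(separation, are_adjacent(domain_a, domain_b, cutoff=5.0))
```

A handful of such numbers, rather than every atomic position, would then form the low-dimensional input on which a model is trained.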

“Black box” is only a means to solve a bigger mystery

Professor Tuckerman is excited by the concept of neural networks. “I am fascinated by neural networks!  There’s a mysterious quality to the way they work, although clearly some complex mechanism underlies a network’s performance.  As a mathematician, you cannot help but wonder what new theorems about neural networks are waiting to be proved.”

Professor John Zenghui Zhang said that the biggest difference between hard molecular science and machine learning models like neural networks is that molecular science tries to discover logical relationships between objects and the theoretical principles behind different phenomena, while neural networks try to simulate a complex system without much input about the underlying physics.  Of course, any machine learning model must, nevertheless, obey basic physical principles. The simple fact, however, is that it often becomes increasingly difficult to extract simple laws (“as simple as possible but no simpler”) to explain complex behavior.

Prior to the application of neural networks in the natural sciences, researchers made use of closely related tools, e.g., protein interaction networks. Living systems include numerous protein-protein interactions, which form a complicated network. By building protein interaction networks, researchers have been able to understand how changes to one of the protein-protein interactions influence the whole network. If each interaction is translated into an input signal, then based on the model, researchers could discover or elucidate certain biological functions, understand the causes of disease, or identify promising druggable targets.
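A protein interaction network of this kind can be represented as a graph in which proteins are nodes and interactions are edges. The Python sketch below builds such a graph and uses a breadth-first search to list the proteins within a given number of interaction steps of a chosen protein. Reachability is only a crude stand-in for "influence", and the protein labels `P1` to `P6` and both function names are hypothetical.

```python
from collections import deque


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = {}
    for a, b in interactions:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def proteins_within(network, start, max_steps):
    """Breadth-first search: map each protein reachable from start in at
    most max_steps interactions to its step count."""
    steps = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if steps[current] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbor in network[current]:
            if neighbor not in steps:
                steps[neighbor] = steps[current] + 1
                queue.append(neighbor)
    return steps


# Hypothetical interactions; P5 and P6 form a separate component.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]
network = build_network(interactions)
reach = proteins_within(network, "P1", max_steps=2)

print(reach)
```

In this toy network, a change at `P1` could plausibly propagate to `P2` and `P3` within two steps, while `P5` and `P6` sit in a disconnected part of the graph and are never reached.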

Physical models that reveal principles are always preferred and remain goals that scientists continually pursue.  Whether via computer simulations or deep learning, scientists use models to simulate phenomena in extremely complex, non-intuitive (possibly even counter-intuitive) systems that cannot be easily interpreted using simple physical models. The ultimate goal is, nevertheless, to understand the hidden fundamental principles rather than to leave those aspects as “black boxes”.

To help, not to replace

At the end of our interview, we raised a question that is asked at nearly all interviews on deep learning: “How would deep learning affect your field? Do you think that deep learning will overturn traditional research methods, and even replace researchers who apply the traditional methods, just as a single spark can start a prairie fire?”

For example, the natural language processing community has experienced two “subversive” waves. The first one took place in the 1990s, when algorithms based on statistics subverted those based on principles. It was also when Fred Jelinek from IBM, a famous scholar studying statistical algorithms, came up with his well-known quote: “Anytime a linguist leaves the group the recognition rate goes up.” However, the territory of statistics has been significantly impacted by deep learning since 2010. A few well-known researchers in the area recently raised the criticism that some scholars who apply deep learning methods are not equipped with basic knowledge of linguistics, blindly apply deep learning models, and create exaggerated titles for their research and/or publications in order to attract more attention.

With the growing impact of deep learning, will the fields of natural sciences face similar challenges?

The three scientists gave the same answer: the natural sciences will not be “overwhelmed” by machine learning. Machine learning helps scientists, but will not replace them.

Professor Tuckerman said that machine learning has generated so much outstanding research thus far because scientists invested considerable effort before applying it: precisely defining the questions to be addressed, appropriately describing the objects of research, designing physically and chemically motivated descriptors, and clearly understanding which characteristics might be described by a common “parent law”, thus guaranteeing the validity of the machine learning models even when the laws are not very clear. All such work requires scientists to understand their research objectives, as well as related research areas, thoroughly.

Professor John Zenghui Zhang added that machine learning, especially deep learning, could be introduced rapidly in areas such as computer vision and natural language analysis because their research objectives could be straightforwardly digitized. In some other areas, including finance and molecular sciences, there are enormously complex variables, and thus it takes effort to digitize and integrate each piece of information properly.

“We scientists are not likely to lose our jobs,” commented Professor John Zenghui Zhang with a confident smile.

 

Written by Lulu Qiu, Synced | Translated by Office of Research, NYU Shanghai
*The original article was in Chinese and composed by Synced.