Research Topics

Development of "SubMo-GNN", a method of selecting diverse molecules based on graph neural networks and submodular function maximization

Face Photo
Sakaue, Shinsaku Project Assistant Professor
Belongs
Graduate School of Information Science and Technology, the University of Tokyo
Biography
To Website
Keywords
molecular selection graph neural network submodular function greedy algorithm

Molecules play a vital role in our society, serving as the foundation for a wide range of products, including pharmaceuticals, materials, and foods. The development of new molecules is a continuous process with the potential for an enormous number of possible combinations of atoms and bonds. As a result, it is critical to focus on important and diverse molecules to efficiently acquire new chemical knowledge from the vast number of possibilities. In this study, we developed a selection method of significant and diverse molecules based on their properties and structures, utilizing machine learning to learn molecular representation and a molecular selection method based on mathematical knowledge.

Background: Exploring the Large Molecular World

Molecules are composed of multiple atoms and bonds connecting them, and the possible combinations of these components are vast. Some estimates suggest that around 1060 different molecules exist solely for drug discovery. The Chemical Abstracts Service (CAS) of the American Chemical Society currently contains approximately 70 million molecules as of February 8, 2023. With such a massive number of molecules, it is challenging to efficiently discover molecules with novel properties or identify key molecules for new reactions by examining each one individually.

To efficiently explore the vast space of molecules, therefore, it is necessary to develop methods that can select a small number of essential and diverse molecules from a given large list of options, without examining similar types of molecules repeatedly. While many existing methods are designed to select diverse molecular structures, selecting important and diverse molecules for the acquisition of new chemical knowledge requires a broader approach that considers the molecular properties as well. In this study, we aimed to develop a molecular selection method that incorporates both properties and structures of molecules.

Approach: A New Molecular Selection Method Based on Molecular Representation Learning and Mathematical Knowledge

To select essential and diverse molecules from a large list of options using a computer, the first step is to determine how to represent the molecules in a digital format. In recent years, graph neural networks (GNNs) have emerged as a popular method for learning vector representations of molecules. Research has shown that the molecular vectors generated by GNNs capture important information about both the properties and structures of molecules. For this reason, we employed GNNs in our study. However, even with the vector representations of molecules, the process of selecting diverse molecular vectors remains a critical challenge.

In the above method, the significance of molecules is represented by the length of the vectors, and molecules with dissimilar properties and structures have vectors that form large angles. Based on this observation, our study proposes a novel method for selecting corresponding molecules, in which the volume of the parallelotope spanned by the vectors becomes larger. However, it is known to be challenging to efficiently select a limited number of vectors that form the maximum volume parallelotope, particularly when dealing with a large number of vectors. To overcome this obstacle, we utilized the "submodularity" property of the logarithm of the volume, which allows the efficient greedy method to select vectors that approximately maximize the volume of the parallelotope.

Figure 1. A high-level sketch of our method

Research Results: Selection of Diverse Molecules in Terms of Both Properties and Structure

Through the use of “GNN for learning vector representation of molecules” and “the greedy algorithm for approximate maximization of the logarithm of the volume”, we can efficiently select a subset of molecules that are both diverse and significant in terms of both properties and structures from a given list of a large number of molecules. To demonstrate the effectiveness of our approach, we present the distribution of the energy value representing the molecular properties (HOMO) in the figure below. The blue histograms represent the original distribution of a given list of molecules, which is generally centered around -0.25. The red histograms, from left to right, show the results of the proposed method, an existing structure-based method, and a method of selecting molecules uniformly and randomly from the given molecular list. The results show that the proposed method can select molecules with varying HOMO values, outperforming other methods in terms of diversity.

The ability to select molecules with diverse properties is a significant achievement. We believe that this development will contribute to the efficient acquisition of chemical knowledge in the future. The research was conducted by a group led by Mr. Tomohiro Nakamura, who was a fourth-year undergraduate student at the time.

Figure 2. Comparison of HOMO value distributions of selected molecules