Reading notes on《Molecular Graph Representation Learning and Generation for Drug Discovery》
Abstract
- Deep learning models are powerful because they learn the important statistical features of the problem, but only with the correct inductive biases. We tackle this important problem in the context of two molecular problems: representation and generation.
- The canonical success of deep learning is deeply rooted in its ability to map the input domain into a meaningful representation space. This is especially pertinent for molecular problems, where the “right” relations between molecules are nuanced and complex.
1.Introduction
- Within these methods, fingerprint techniques are widely popular, and can be broadly categorized into several types including structure-based [30], topological [1], circular [8] and pharmacophore fingerprints [91].
- However, the problem still lies within the deterministic nature of the generating method: if these predefined rules do not capture the right representation for the task, they will not work well. For instance, property cliffs, a phenomenon in which structurally similar molecules exhibit markedly different properties, remain a challenging problem for many small-molecule tasks.
- While sometimes effective, the simple GNN paradigm may not always incorporate the right kind of biases for molecular tasks. For instance, local neighborhood aggregation can fail to capture long-range dependencies that are important when considering properties of molecules.
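
To make the long-range limitation concrete, here is a minimal sketch in plain Python. It is not the thesis's model: it only tracks which atoms can influence each node after a given number of aggregation rounds. The helper names `message_passing_reach` and `path_graph` are hypothetical, and the 8-atom chain is an assumed stand-in for a linear molecular backbone.

```python
# Toy illustration (an assumption, not the thesis's architecture):
# track the receptive field of each node under neighborhood aggregation.

def message_passing_reach(adjacency, num_layers):
    """Return, for every node, the set of nodes whose input features can
    influence its representation after `num_layers` aggregation rounds."""
    # Before any layer, each node only "knows" about itself.
    reach = {node: {node} for node in adjacency}
    for _ in range(num_layers):
        new_reach = {}
        for node, neighbors in adjacency.items():
            # An update combines the node's own state with its neighbors' states.
            merged = set(reach[node])
            for nb in neighbors:
                merged |= reach[nb]
            new_reach[node] = merged
        reach = new_reach
    return reach


def path_graph(n):
    """Adjacency list of a linear chain 0-1-...-(n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


chain = path_graph(8)
reach_2 = message_passing_reach(chain, num_layers=2)
reach_7 = message_passing_reach(chain, num_layers=7)

print(sorted(reach_2[0]))  # atoms visible to atom 0 after 2 layers
print(7 in reach_2[0])     # whether the far end can influence atom 0 yet
print(7 in reach_7[0])     # whether 7 layers are enough to span the chain
```

In this sketch the receptive field grows by one hop per layer, so two atoms at opposite ends of the chain only exchange information once the number of layers matches the path length between them.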
1.1.Machine Learning Applications for Drug Discovery
- During the discovery phase, high throughput screening (HTS) is conducted on large libraries of molecules, which yields candidate molecules, known as hits. These hit molecules then undergo more screening and optimization to generate a smaller set of lead molecules. The selection of hit and lead compounds is an ideal frontier for machine learning methods to deliver new improvements.
- Prior to machine learning, quantitative structure-activity relationship (QSAR) methods were broadly applied to virtual screening. In their most basic form, QSAR methods use a variety of hand-engineered descriptors: simple features such as atom and bond counts, molecular weight, and ring information, and more complex descriptors such as higher-order topological features and physicochemical properties.
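
The simple descriptors mentioned above (atom and bond counts, molecular weight, ring information) can be sketched in plain Python. This is an illustrative assumption rather than any toolkit's API: molecules are given as a list of element symbols plus a list of bonded index pairs with explicit hydrogens, and the ring count is taken as the cycle rank of the molecular graph (bonds minus atoms plus connected components). The names `descriptors` and `count_components` are hypothetical; real QSAR pipelines typically rely on a cheminformatics toolkit.

```python
# Hypothetical toy descriptors; atomic masses are standard atomic weights
# (g/mol), rounded to three decimals.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def count_components(num_atoms, bonds):
    """Number of connected components, computed with union-find."""
    parent = list(range(num_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(num_atoms)})


def descriptors(atoms, bonds):
    """Compute a small hand-engineered descriptor vector as a dict."""
    weight = sum(ATOMIC_MASS[el] for el in atoms)
    # Cycle rank of the graph: independent rings in the molecule.
    rings = len(bonds) - len(atoms) + count_components(len(atoms), bonds)
    return {
        "num_atoms": len(atoms),
        "num_heavy_atoms": sum(1 for el in atoms if el != "H"),
        "num_bonds": len(bonds),
        "mol_weight": round(weight, 3),
        "num_rings": rings,
    }


# Ethanol, C2H6O: atoms 0-1 are carbons, 2 is oxygen, 3-8 are hydrogens.
ethanol_atoms = ["C", "C", "O"] + ["H"] * 6
ethanol_bonds = [(0, 1), (1, 2), (2, 8), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7)]

# Benzene, C6H6: a six-membered carbon ring, one hydrogen per carbon.
benzene_atoms = ["C"] * 6 + ["H"] * 6
benzene_bonds = [(i, (i + 1) % 6) for i in range(6)] + [(i, i + 6) for i in range(6)]

ethanol = descriptors(ethanol_atoms, ethanol_bonds)
benzene = descriptors(benzene_atoms, benzene_bonds)
print(ethanol)
print(benzene)
```

A classical QSAR model would then fit a regressor or classifier on vectors like these. Because the features are fixed in advance, the model can only be as expressive as the descriptors chosen.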
1.2.Thesis Overview
- In Chapter 2, I’ll introduce the different representations of molecules, and new models for their improvement. In the following chapter (Chapter 3), I will talk about another new graph neural network paradigm that borrows ideas from prototype learning. Chapter 4 will talk about retrosynthesis, and how we can produce accurate and diverse synthesis suggestions. Lastly, Chapter 5 will introduce a new method for molecular optimization.