SMILES strings are sequences of characters; therefore, they can be analyzed by machine-learning methods suitable for text processing, namely convolutional and recurrent neural networks. Within this approach, there is no need to derive a 2D/3D configuration of the molecule and calculate descriptors from it, while the quality of the models stays at the same level as with classical descriptors, or even better.
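The standard preprocessing for such text-style networks can be sketched as follows (a minimal illustration with a toy vocabulary and padding length, not the paper's actual setup): each character becomes a one-hot vector and sequences are zero-padded to a fixed length.

```python
def smiles_to_onehot(smiles, vocab, max_len):
    """One-hot encode a SMILES string character by character.
    Note: two-character atom symbols such as Cl or Br need a real
    tokenizer; plain characters are used here for brevity."""
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

smiles = "CC(=O)Oc1ccccc1C(=O)O"     # illustrative molecule (aspirin)
vocab = sorted(set(smiles))          # toy vocabulary built from the data itself
encoded = smiles_to_onehot(smiles, vocab, max_len=24)
```

The resulting matrix can be fed directly to a 1D convolutional layer or a recurrent cell.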
The rise of deep learning allows us to bypass tiresome expert and domain-specific feature construction by delegating this task to a neural network, which can extract from the raw input data the most valuable traits required for modeling the problem at hand. In this setting, the whole molecule, as a SMILES string (Simplified Molecular Input Line Entry System) or as a graph, serves as the input to the neural network. SMILES notation allows any complex formula of an organic compound to be written as a string, facilitating the storage and retrieval of information about molecules in databases. A SMILES string contains all the information about a compound sufficient to derive its entire configuration (3D structure) and has a direct connection to the nature of fragmental descriptors (Fig. 1), thus making SMILES one of the best representations for QSAR studies. One of the first works exploiting direct SMILES input as descriptors used fragmentation of the strings into groups of overlapping substrings, forming a SMILES-like set, or hologram, of a molecule.
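The overlapping-substring ("SMILES hologram") idea can be sketched in a few lines; substring lengths 1-4 below are an illustrative choice, not the published setting:

```python
# Count every overlapping substring of a few characters, giving a cheap
# text-level analogue of fragment descriptors.
def smiles_fragments(smiles, min_len=1, max_len=4):
    """Count all overlapping substrings of lengths min_len..max_len."""
    frags = {}
    for n in range(min_len, max_len + 1):
        for i in range(len(smiles) - n + 1):
            sub = smiles[i:i + n]
            frags[sub] = frags.get(sub, 0) + 1
    return frags

counts = smiles_fragments("c1ccccc1O")  # phenol
```

Each molecule is thus mapped to a sparse count vector over the substrings observed in the dataset.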
Quantitative Structure–Activity (Property) Relationship (QSAR/QSPR) approaches find a nonlinear function, often modelled as an artificial neural network (ANN), that estimates an activity or property based on the chemical structure. In the past, most QSAR works relied heavily on descriptors that represent, in numerical form, features of the complex graph structure of a compound. Amongst the numerous families of descriptors, fragment descriptors, which count the occurrences of a subgraph in a molecular graph, hold a distinctive status due to their simplicity of calculation. Moreover, there is a theoretical proof that one can successfully build any QSAR model with them. However, even a small database of compounds contains thousands of fragmental descriptors, so a feature-selection algorithm has traditionally been used to find a proper subset of descriptors, both to improve model quality and to speed up the whole modeling process. Thus, feature selection in conjunction with a suitable machine-learning method was key to success. The OCHEM environment ( ) hosts the on-line implementation of the method proposed here. The repository also has a standalone program for QSAR prognosis which calculates individual atom contributions, thus interpreting the model's result.
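The descriptor-matrix-plus-selection workflow can be illustrated with a toy example (two-character substrings stand in for real subgraph counts, and the pruning rule is the simplest possible form of feature selection, chosen here for brevity):

```python
from collections import Counter

def fragment_counts(smiles, n=2):
    """Overlapping length-n substrings as stand-in fragment descriptors."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))

def descriptor_matrix(dataset):
    """One row per molecule, one column per fragment seen in the dataset."""
    keys = sorted({k for s in dataset for k in fragment_counts(s)})
    return keys, [[fragment_counts(s).get(k, 0) for k in keys] for s in dataset]

def drop_constant(keys, X):
    """Remove descriptor columns that take a single value everywhere."""
    varying = [j for j in range(len(keys)) if len({row[j] for row in X}) > 1]
    return [keys[j] for j in varying], [[row[j] for j in varying] for row in X]

dataset = ["CCO", "CCN", "CCS"]       # hypothetical tiny dataset
keys, X = descriptor_matrix(dataset)
sel_keys, sel_X = drop_constant(keys, X)
```

Here the "CC" column is identical for all three molecules and carries no signal, so it is removed before modeling.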
LIVE TRANSFORMER CODE
We present SMILES embeddings derived from the internal encoder state of a Transformer model trained to canonicalize SMILES as a Seq2Seq problem. Using a CharNN architecture on top of these embeddings results in higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets, including both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, and thus the prognosis is based on an internal consensus. Because both the augmentation and the transfer learning are based on the embeddings, the method provides good results even for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available on.
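The internal-consensus idea can be sketched as follows. The model and the variant list are toy stand-ins introduced purely for illustration; real augmentation enumerates randomized SMILES of the same molecule with a cheminformatics toolkit such as RDKit.

```python
def consensus_predict(model, smiles_variants):
    """Average a model's predictions over augmented SMILES variants
    of the same molecule -- the internal consensus at inference time."""
    preds = [model(s) for s in smiles_variants]
    return sum(preds) / len(preds)

toy_model = lambda s: s.count("O") / len(s)  # hypothetical stand-in scorer
variants = ["OCCO", "C(O)CO", "C(CO)O"]      # ethylene glycol, rewritten
value = consensus_predict(toy_model, variants)
```

Averaging over differently written SMILES of one molecule damps the sensitivity of a text-based model to the particular atom ordering chosen in a single string.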