MACHINE LEARNING SYSTEMS FOR AUTOMATED PHARMACEUTICAL MOLECULE IDENTIFICATION

Aspects of the present disclosure provide systems, methods, and computer-readable storage media that leverage artificial intelligence and machine learning to identify molecules or compounds for use in pharmaceuticals. In aspects, one or more machine learning (ML) models may be trained to identify molecules based on pharmaceutical data that indicates properties of previously-identified pharmaceutical molecules, such as physiochemical structure, side effects, toxicity, solubility, and the like. The ML models may include generative models, such as generative adversarial networks or variational autoencoders. The trained ML models may be used to identify new (e.g., previously-unidentified) molecules, or the trained ML models may be provided to client devices for use in molecule identification (e.g., drug discovery).

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to Indian Provisional Patent Application No. 202041041670, filed on Sep. 25, 2020, entitled “DRUG DISCOVERY AND SEARCH USING MACHINE LEARNING,” and the present application is related to co-pending U.S. patent application Ser. No. ______ (Atty. Dkt. No. ACNT.P0028US), entitled “MACHINE LEARNING SYSTEMS FOR AUTOMATED PHARMACEUTICAL MOLECULE SCREENING AND SCORING,” filed Jan. 21, 2021, the contents of each of which are expressly incorporated herein in their entirety.

TECHNICAL FIELD

The present disclosure relates to systems and methods for leveraging machine learning and artificial intelligence to automatically identify molecules and compounds for use in pharmaceutical products.

BACKGROUND

The pharmaceutical industry is one of the largest and most profitable in the world, as illustrated by the worldwide pharmaceutical market being worth approximately 1.3 trillion dollars in 2019 according to some estimates. In addition to researching and manufacturing new drugs (e.g., pharmaceuticals) to cure or treat new diseases or conditions, pharmaceutical companies spend significant resources researching and “discovering” (e.g., identifying) new drugs for known diseases that have increased efficacy, fewer side effects, and fewer harmful drug interactions. For example, a pharmaceutical company may try to optimize and improve an already manufactured drug for a specific disease with the goal of improving efficacy and reducing side effects.

Designing (e.g., discovering or identifying) new drugs is typically a manually-intensive process. To design a new drug for a particular disease, a human drug expert (e.g., a chemist, biochemist, researcher, etc.) may consider a known molecule or compound used in a currently-available drug for treating the particular disease, and the human drug expert may identify multiple candidate molecules based on the known molecule. For example, the human drug expert may decide to add an additional element to, remove an element from, or modify the physiochemical structure of, the known molecule or compound based on their experience and knowledge to design candidate molecules. The candidate molecules may be visually screened by the human drug expert, and a selected subset of candidate molecules that pass the visual screening may be further screened using lab experiments or other testing. Thus, the drug design (also referred to as drug discovery or drug identification) process is limited by the knowledge and experience of the human drug expert. Additionally, the human drug expert may focus their attention on the particular disease to be treated, and as a result may fail to explore or consider molecules or compounds that are not widely known as useful in treating the particular disease, thereby limiting the search space for the candidate molecules.

Conventional drug discovery may be a long and expensive process. For example, each new drug, from discovery to launch, typically takes approximately twelve to fifteen years and costs approximately 1.2 billion dollars. Additionally, the drug discovery process includes many different steps such as discovery, optimization, preclinical trials, phased clinical trials, registration, and eventual launch. During many or all of these steps, a significant portion of the candidate molecules or compounds are filtered out or otherwise rejected. For example, by some estimates, only approximately 1.8% of newly identified molecules or compounds are successfully tested and implemented into pharmaceuticals released to consumers. Thus, the typical drug discovery process is neither efficient nor cost-effective.

SUMMARY

Aspects of the present disclosure provide systems, methods, and computer-readable storage media for automated identification of molecules or compounds using machine learning for use in pharmaceutical products such as drugs, medicine, remedies, and the like. The molecules may be identified by a drug discovery platform with minimal user input as compared to other drug discovery systems. To facilitate automated identification of “new” molecules (e.g., previously unidentified molecules), the drug discovery platform may train and leverage artificial intelligence and machine learning based on pharmaceutical data acquired from a variety of sources, such as publicly available drug information databases, third party drug information databases, proprietary databases, and the like. The pharmaceutical data may include multiple different forms or formats of drug-related data for a large quantity of previously-identified drugs (e.g., previously identified molecules or compounds that make up the drugs). For example, the pharmaceutical data may include physiochemical data that indicates physiochemical properties of the previously-identified molecules, such as the elements included in the molecules, the physiochemical structure of the molecules, the molecular weight of the molecules, the isomerization of the molecules, etc. The pharmaceutical data may also, or in the alternative, include other types of data, such as drug impact data, side effect data, toxicity data, and solubility data, as non-limiting examples. The pharmaceutical data may be processed and transformed into a form that may be used as training data. For example, if the pharmaceutical data includes simplified molecular-input line-entry system (SMILES)-formatted data that represents molecular structure as a string of letters and characters, natural language processing may be performed on the SMILES-formatted data to convert the strings to numerical data for vectorization into training data.
Such training data may be used to train the artificial intelligence or machine learning to automatically identify molecules that are distinct from the previously-identified molecules associated with the pharmaceutical data.
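By way of non-limiting illustration, the conversion of SMILES-formatted strings into numerical training vectors described above may be sketched as follows. The character-level vocabulary, the fixed vector length, and the example molecules are hypothetical; production systems may use richer tokenization.

```python
# Illustrative character-level tokenization of SMILES strings into
# zero-padded integer vectors. Vocabulary and molecules are hypothetical
# examples, not part of the disclosed system.

def build_vocab(smiles_list):
    """Map every character in the SMILES corpus to an integer ID (0 = padding)."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def vectorize(smiles, vocab, max_len):
    """Encode one SMILES string as a zero-padded list of integer IDs."""
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocab(corpus)
vectors = [vectorize(s, vocab, max_len=10) for s in corpus]
```

The resulting fixed-length integer vectors are one simple form in which string-based molecular representations can be supplied to a machine learning model.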

In aspects, a computing device (e.g., a server or other device that implements a drug discovery platform) may acquire pharmaceutical data from one or more databases, such as the publicly available Zinc database (“Zinc15” or “Zinc12”), chEMBL database, PubChem database, and SIDER database (“SIDER Side Effect Resource”), as non-limiting examples. The pharmaceutical data may indicate properties (e.g., physiochemical properties, impact on a human body, side effects, toxicity, solubility, etc.) associated with multiple previously-identified pharmaceutical molecules. The computing device may convert at least a portion of the pharmaceutical data to training data. For example, the computing device may perform natural language processing on text data or SMILES-formatted data to convert the text data or SMILES-formatted data into numerical data. As another example, the computing device may convert categorical values to numerical data, such as binary data or encoded numerical data (e.g., using a one-hot encoding, as a non-limiting example). The converted numerical data may be vectorized or otherwise grouped to generate the training data. In some implementations, the computing device may perform pre-processing, such as filtering, outlier removal, filling in missing entries, dimensionality reduction, or the like on the pharmaceutical data prior to converting the pharmaceutical data to the training data.
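As a non-limiting sketch of the one-hot encoding mentioned above, categorical property values (here, a hypothetical qualitative solubility label) may be converted to numerical rows as follows; the category names are illustrative only.

```python
# Illustrative one-hot encoding of a categorical property column into
# numerical training features. Category names are hypothetical.

def one_hot(values):
    """Return (sorted category list, encoded rows) for categorical values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # a single 1 marks the category of this value
        rows.append(row)
    return categories, rows

labels = ["low", "high", "medium", "high"]
categories, encoded = one_hot(labels)
```

Each row contains exactly one 1, so no artificial ordering is imposed on the categories, which is the usual motivation for one-hot encoding over integer labels.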

After generating the training data, the computing device may train one or more machine learning models based on the training data. Such training may configure the machine learning models to identify new (e.g., previously-unidentified) pharmaceutical molecules. The machine learning models may include generative models that generate new values (e.g., molecules) based on underlying similarities between values indicated by the training data (e.g., the previously-identified molecules). In some implementations, the machine learning models include generative adversarial networks (GANs), variational autoencoders (VAEs), or both, which may be implemented using neural networks or other deep learning structures. In some implementations, the machine learning models are trained to identify molecules having one or more particular properties, such as a particular atomic weight, a particular molecular weight, a particular expected side effect, a particular solubility, or the like. After training the machine learning models, the computing device may use the machine learning models to identify one or more molecules for testing and potential trial. For example, the computing device may initiate display of a graphical user interface (GUI) that includes text, images, graphics, or a combination thereof, that indicate the identified molecules, such as molecule names, names of elements that make up the molecules, two-dimensional graphical representations of the molecules, SMILES representations of the molecules, predicted properties of the molecules, and the like. Additionally or alternatively, the computing device may operate as a training device that trains the machine learning models and provides the machine learning models (or data indicative of the configuration of the machine learning models) to client devices for molecule identification at the client devices.

The present disclosure describes systems that provide improvements compared to other drug discovery systems. For example, the present disclosure describes systems that train machine learning models to automatically identify molecules that have not been previously identified. Using artificial intelligence and machine learning to identify pharmaceutical molecules based on pharmaceutical data associated with large quantities of previously-identified molecules, some of which are not related to the same type of drug, may result in identification of a wider variety of new (e.g., previously-unidentified) pharmaceutical molecules. At least some of these molecules would not be identified by a human drug expert (e.g., a chemist or biochemist) manually designing new molecules. To illustrate, the machine learning models are trained to identify molecules based on underlying similarities between multiple drugs, and many of these underlying similarities may not be apparent to the human drug expert. Thus, the identified molecules may be more similar to successful drugs (even if the drugs are not used to treat the same disease or condition), and therefore are more likely to be useful in producing new drugs than molecules that are manually identified by the human drug expert. Additionally, automated identification of the molecules may be faster than molecule identification by other systems that require substantial user interaction and decision making by the human drug expert. By increasing the likelihood of identifying useful molecules in a shorter period of time, the systems and methods described herein may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs (e.g., pharmaceuticals).

In a particular aspect, a method for pharmaceutical molecule identification using machine learning includes obtaining, by one or more processors, pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The method also includes performing, by the one or more processors, natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The method further includes training, by the one or more processors, one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In another particular aspect, a system for pharmaceutical molecule identification using machine learning includes a memory and one or more processors communicatively coupled to the memory. The one or more processors are configured to obtain pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The one or more processors are also configured to perform NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The one or more processors are further configured to train one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In another particular aspect, a non-transitory computer-readable storage medium stores instructions that, when executed by one or more processors, cause the one or more processors to perform operations for pharmaceutical molecule identification using machine learning. The operations include obtaining pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The operations also include performing NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The operations further include training one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In the context of the present disclosure the terms “molecule” and “compound” can be used interchangeably. Non-limiting examples of molecules and compounds can include small molecules and biologics. In one non-limiting aspect, small molecules can be chemically derived such as by being manufactured through chemical synthesis or isolated from another material having the small molecule. In one non-limiting aspect, biologics can include a material or substance extracted from, synthesized by, or manufactured from living organisms (e.g., microorganisms, plants, animals, cells, etc.). Non-limiting examples of biologics can include sugars, polymers, peptides, proteins, enzymes, or nucleic acids or combinations thereof.

The foregoing has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and specific aspects disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the scope of the disclosure as set forth in the appended claims. The novel features which are disclosed herein, both as to organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of an example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects;

FIG. 2 is a block diagram of another example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects;

FIG. 3 is a flow diagram illustrating an example of a method for identifying pharmaceutical molecules and for identifying uses for pharmaceutical molecules according to one or more aspects; and

FIG. 4 is a flow diagram illustrating an example of a method for pharmaceutical molecule identification using machine learning according to one or more aspects.

It should be understood that the drawings are not necessarily to scale and that the disclosed aspects are sometimes illustrated diagrammatically and in partial views. In certain instances, details which are not necessary for an understanding of the disclosed methods and apparatuses or which render other details difficult to perceive may have been omitted. It should be understood, of course, that this disclosure is not limited to the particular aspects illustrated herein.

DETAILED DESCRIPTION

Aspects of the present disclosure provide systems, methods, and computer-readable storage media for automated identification of molecules or compounds using machine learning for use in pharmaceutical products such as drugs, medicine, remedies, cosmetics, and the like. The techniques described herein support identification, using artificial intelligence and machine learning techniques, of molecules that may not have previously been identified, tested, and/or studied for a given pharmaceutical application (e.g., identification of new molecules for existing or new uses, disease states, or conditions and/or identification of existing molecules for new uses, disease states, or conditions). The artificial intelligence and machine learning techniques described herein may be trained using a variety of pharmaceutical data associated with previously identified drugs, such as physiochemical data (e.g., data indicating the elements and structures of previously-identified molecules), drug impact data, side effect data, toxicity data, solubility data, other drug-related information, and the like, as non-limiting examples. The pharmaceutical data used for training may be obtained from a variety of sources, such as publicly available drug information databases such as the Zinc database, third-party databases (e.g., drug vendor or manufacturer databases, university databases, government agency databases, and the like), proprietary databases, or a combination thereof. Natural language processing may be performed on text data or particularly-formatted data, such as simplified molecular-input line-entry system (SMILES)-formatted data, to generate training data for training generative machine learning model(s) to identify pharmaceutical molecules.
Using artificial intelligence and machine learning to identify pharmaceutical molecules based on pharmaceutical data associated with large quantities of previously-identified molecules, some of which are not related to the same type of drug, may result in identification of pharmaceutical molecules that would not be identified by a human (e.g., a chemist or biochemist) using existing drug discovery processes. To illustrate, because the artificial intelligence and machine learning are able to determine underlying similarities between more drugs, many of which may not be apparent to a human, the identified molecules may be more similar to successful drugs, and thus more likely to be useful in producing new drugs, than molecules that are manually identified by a human. Additionally, automated identification of the molecules may be faster than molecule identification by other systems that require substantial user interaction and decision making. By increasing the likelihood of identifying useful molecules in a shorter period of time, the systems and methods described herein may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs (e.g., pharmaceuticals). Although described in the context of pharmaceutical products (e.g., drugs), the techniques of the present disclosure may be applied to identify molecules for use in other types of products, such as health products and supplements, personal hygiene products, cosmetic products, biotech products, chemical products, and the like.

Referring to FIG. 1, an example of a system for pharmaceutical molecule identification (e.g., drug discovery) using machine learning according to one or more aspects is shown as a system 100. The system 100 may be configured to train machine learning model(s) to identify “new” pharmaceutical molecules or compounds (e.g., previously-unidentified molecules or compounds for use in drugs or other pharmaceutical products) using information associated with previously-identified pharmaceutical molecules or compounds. In some implementations, the system 100 may use the trained machine learning model(s) to identify one or more pharmaceutical molecules for production and testing. Additionally or alternatively, the trained machine learning model(s) may be provided to other devices, such as client device(s), for use in pharmaceutical molecule identification. As shown in FIG. 1, the system 100 includes a computing device 102, a display device 130, one or more databases 150, a client device 162, a drug production system 164, and one or more networks 170. In some implementations, one or more of the display device 130, the client device 162, or the drug production system 164 may be optional, or the system 100 may include additional components, such as a user device, as a non-limiting example.

The computing device 102 (e.g., a pharmaceutical molecule identification device or a drug identification device) may include or correspond to a desktop computing device, a laptop computing device, a personal computing device, a tablet computing device, a mobile device (e.g., a smart phone, a tablet, a personal digital assistant (PDA), a wearable device, and the like), a server, a virtual reality (VR) device, an augmented reality (AR) device, an extended reality (XR) device, a vehicle (or a component thereof), an entertainment system, other computing devices, or a combination thereof, as non-limiting examples. The computing device 102 includes one or more processors 104, a memory 106, one or more communication interfaces 120, a data processing and transformation engine 122, a training engine 124, one or more machine learning (ML) models 126, and an identification engine 128. It is noted that functionalities described with reference to the computing device 102 are provided for purposes of illustration, rather than by way of limitation and that the exemplary functionalities described herein may be provided via other types of computing resource deployments. For example, in some implementations, computing resources and functionality described in connection with the computing device 102 may be provided in a distributed system using multiple servers or other computing devices, or in a cloud-based system using computing resources and functionality provided by a cloud-based environment that is accessible over a network, such as one of the one or more networks 170. To illustrate, one or more operations described herein with reference to the computing device 102 may be performed by one or more servers or a cloud-based system that communicates with one or more client or user devices.

The one or more processors 104 may include one or more microcontrollers, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), central processing units (CPUs) having one or more processing cores, or other circuitry and logic configured to facilitate the operations of the computing device 102 in accordance with aspects of the present disclosure. The memory 106 may include random access memory (RAM) devices, read only memory (ROM) devices, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), one or more hard disk drives (HDDs), one or more solid state drives (SSDs), flash memory devices, network accessible storage (NAS) devices, or other memory devices configured to store data in a persistent or non-persistent state. Software configured to facilitate operations and functionality of the computing device 102 may be stored in the memory 106 as instructions 108 that, when executed by the one or more processors 104, cause the one or more processors 104 to perform the operations described herein with respect to the computing device 102, as described in more detail below. Additionally, the memory 106 may be configured to store data, such as training data 110, one or more identified molecules 112, selected properties 114, and additional training data 116. Exemplary aspects of the training data 110, the identified molecules 112, the selected properties 114, and the additional training data 116 are described in more detail below.

The one or more communication interfaces 120 may be configured to communicatively couple the computing device 102 to the one or more networks 170 via wired or wireless communication links established according to one or more communication protocols or standards (e.g., an Ethernet protocol, a transmission control protocol/internet protocol (TCP/IP), an Institute of Electrical and Electronics Engineers (IEEE) 802.11 protocol, an IEEE 802.16 protocol, a 3rd Generation (3G) communication standard, a 4th Generation (4G)/long term evolution (LTE) communication standard, a 5th Generation (5G) communication standard, and the like). In some implementations, the computing device 102 includes one or more input/output (I/O) devices that include one or more display devices, a keyboard, a stylus, one or more touchscreens, a mouse, a trackpad, a microphone, a camera, one or more speakers, haptic feedback devices, or other types of devices that enable a user to receive information from or provide information to the computing device 102. In some implementations, the computing device 102 is coupled to the display device 130, such as a monitor, a display (e.g., a liquid crystal display (LCD) or the like), a touch screen, a projector, a virtual reality (VR) display, an augmented reality (AR) display, an extended reality (XR) display, or the like. Although shown as external to the computing device 102 in FIG. 1, in some other implementations, the display device 130 is included in or integrated in the computing device 102.

The data processing and transformation engine 122 may be configured to obtain pharmaceutical data 132 from the databases 150 and to process, filter, or otherwise transform the pharmaceutical data 132 for use by other components of the computing device 102. For example, the pharmaceutical data 132 may indicate properties of previously-identified pharmaceutical molecules, and the data processing and transformation engine 122 may process and otherwise convert the pharmaceutical data 132 (or a portion thereof) to a common format that may be used for analysis and training data generation. To illustrate, the data processing and transformation engine 122 may be configured to perform one or more pre-processing operations, one or more formatting operations, one or more conversion operations, one or more filtering operations, or a combination thereof, on the pharmaceutical data 132 to convert the pharmaceutical data 132 to a target format, to reduce a size or complexity of the pharmaceutical data 132, to eliminate particular values that do not provide sufficient information, to add in missing values, or a combination thereof.
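As a non-limiting illustration of two of the pre-processing operations described above (adding in missing values and eliminating outlying values), consider the following sketch. The molecular-weight values and the 2.5-sigma threshold are hypothetical and chosen only for illustration.

```python
import statistics

# Illustrative pre-processing: fill a missing numeric entry with the column
# mean, then drop gross outliers by z-score. Values and threshold are
# hypothetical examples, not part of the disclosed system.

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def drop_outliers(values, z_max):
    """Keep only values within z_max sample standard deviations of the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / stdev <= z_max]

# Hypothetical molecular weights; one entry is missing, one is clearly spurious.
weights = [180.2, None, 151.2, 194.2, 160.0, 175.5, 182.3, 149.9, 168.0, 5000.0]
cleaned = drop_outliers(fill_missing(weights), z_max=2.5)
```

In this toy example the spurious 5000.0 entry falls outside the 2.5-sigma band and is removed, while the filled-in entry and the plausible weights are retained.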

The training engine 124 may be configured to generate the training data 110 based on the processed pharmaceutical data 132. For example, the training engine 124 may extract a particular set of features from the pharmaceutical data 132 and group the extracted features, such as in one or more vectors, to generate the training data 110. In some implementations, the particular set of features is determined based on feature analysis of the pharmaceutical data 132 and is predetermined for all types of molecule identification; alternatively, the particular set of features may be based on a type of molecule to be identified, a particular disease or condition for which molecules are to be identified, particular properties associated with identified molecules, user input, or a combination thereof. To extract the features, the training engine 124 may be configured to extract numerical features from numerical data, to extract categorical features from text or numerical data and convert the categorical features to numerical features, to perform natural language processing (NLP) on text data to convert text features into numerical features, or a combination thereof. In some implementations, the training engine 124 may be configured to scale or otherwise transform extracted features to a format that is useable to train ML models. After extracting the features, the training engine 124 may group or otherwise format the extracted features, such as performing vectorization on the extracted features, to generate the training data 110.
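One non-limiting way to scale extracted features, as mentioned above, is min-max scaling onto [0, 1], which prevents features with large numeric ranges (e.g., molecular weight) from dominating features with small ranges (e.g., ring counts). The values below are hypothetical.

```python
# Illustrative min-max scaling of an extracted numeric feature column.
# The molecular-weight values are hypothetical examples.

def min_max_scale(column):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]

molecular_weights = [151.2, 180.2, 194.2]
scaled = min_max_scale(molecular_weights)
```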

After generating the training data 110, the training engine 124 may be configured to train the one or more ML models 126 that are accessible to the training engine 124 (e.g., via storage at the memory 106 or other storage devices) based on the training data 110. The one or more ML models 126 may be trained to identify “new” molecules (e.g., previously-unidentified molecules) based on properties of previously-identified pharmaceutical molecules indicated by the training data 110. As used herein, new or previously-unidentified pharmaceutical molecules encompass small molecules and/or biologics that may have a therapeutic effect such that the molecule may be used as an ingredient in a drug or other medicinal product (e.g., pharmaceuticals). In some non-limiting aspects, the molecules may include multiple atoms of the same element or compounds (e.g., molecules made of atoms from different elements). Such previously-unidentified molecules may include different combinations of elements than the previously-identified molecules, different structures of known combinations of elements, or different combinations of elements and different structures than the previously-identified molecules. Additionally or alternatively, the previously-unidentified molecules may include molecules that have not previously been identified as having a pharmaceutical effect. In some implementations, the one or more ML models 126 may be trained to identify molecules having particular properties (e.g., physiochemical structures, toxicity, solubility, etc.) based on the training data 110, such as by weighting or labeling training data based on relationships to the particular properties.
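As a non-limiting sketch of labeling training data based on relationships to particular properties, as described above, records may be tagged with a binary flag indicating whether they meet a target property. The logP values shown are approximate literature values, and the 1.0 threshold is a hypothetical proxy for favorable aqueous solubility.

```python
# Illustrative labeling of training records by a target property so that
# model training can be weighted or conditioned on that property.
# Threshold and data are hypothetical examples.

training_records = [
    {"smiles": "CCO", "logp": -0.31},      # ethanol
    {"smiles": "c1ccccc1", "logp": 2.13},  # benzene
    {"smiles": "CC(=O)O", "logp": -0.17},  # acetic acid
]

# Label 1 if the record meets the target property (logP below 1.0), else 0.
labeled = [
    (record["smiles"], 1 if record["logp"] < 1.0 else 0)
    for record in training_records
]
```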

In some implementations, the one or more ML models 126 (referred to herein as the ML models 126) may include a single ML model or multiple ML models configured to identify molecules. In some implementations, the ML models 126 may include or correspond to generative ML models. For example, the ML models 126 may include generative adversarial networks (GANs), such as multi-objective GANs, objective reinforced GANs, conditional deep GANs, and the like, variational autoencoders (VAEs), such as standard VAEs, multi-objective VAEs, and the like, or a combination thereof. Generative modeling is an unsupervised learning task that involves automatically discovering and learning patterns or relationships in input data in such a way that a model can be used to generate or output new examples that plausibly could have been drawn from the input data set. GANs can be used to frame the problem as a supervised learning problem with two sub-models: a generator model that is trained to generate new examples, and a discriminator model that is trained to classify examples as either real (e.g., from the input data set) or fake (e.g., from the generator model). The two models, typically convolutional neural networks, are trained together in a zero-sum game, until the discriminator is fooled by the generator a particular percentage of the time. VAEs may be configured to learn efficient data codings in an unsupervised manner, such as by encoding higher-dimensionality input data as probability distributions of latent variables, and decoding the probability distributions of the latent variables to create slightly different versions of the input data. In some implementations, the ML models 126 (e.g., the GANs, the VAEs, or both) may be implemented as neural networks. 
In other implementations, the ML models 126 may be implemented as other types of ML models or constructs, such as decision trees, random forests, regression models, Bayesian networks (BNs), dynamic Bayesian networks (DBNs), naive Bayes (NB) models, Gaussian processes, hidden Markov models (HMMs), and the like.
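As an illustrative sketch (not the claimed implementation), the VAE encode/sample/decode flow described above can be shown with toy linear layers standing in for trained neural networks; the dimensions and random weights are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions).
INPUT_DIM, LATENT_DIM = 16, 4

# Random linear "encoder" and "decoder" weights stand in for trained networks.
W_mu = rng.normal(size=(INPUT_DIM, LATENT_DIM))
W_logvar = rng.normal(size=(INPUT_DIM, LATENT_DIM))
W_dec = rng.normal(size=(LATENT_DIM, INPUT_DIM))

def encode(x):
    """Map an input vector to the parameters of a latent Gaussian distribution."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, the standard VAE reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent sample back to input space (a 'slightly different version')."""
    return z @ W_dec

x = rng.normal(size=(INPUT_DIM,))          # higher-dimensionality input
mu, logvar = encode(x)                     # probability distribution of latent variables
z = reparameterize(mu, logvar)             # sampled latent code
x_new = decode(z)                          # reconstructed/varied output
```

The same encode-to-a-distribution, sample, and decode structure underlies the standard, multi-objective, and conditional VAE variants mentioned above; a trained model would replace the random linear maps with learned networks.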

The identification engine 128 may be configured to identify the identified molecules 112 that are distinct from the previously-identified molecules. For example, the identification engine 128 may provide input data to the ML models 126 to cause the ML models 126 to identify pharmaceutical molecules that may have an underlying similarity to molecules indicated by the input data, and therefore may be more likely than random molecules to have a pharmaceutical effect or to exhibit particular properties. To generate the input data, the identification engine 128 may sample previously-identified molecules used as ingredients in drugs to remedy a particular disease or condition, previously-identified molecules that exhibit particular properties, a random sampling of previously-identified molecules, or a sampling based on other parameters.

The databases 150 may include one or more databases, or other storage devices, configured to maintain and provide access to stored pharmaceutical data. The databases 150 may include publicly available drug information databases (e.g., databases maintained by information or standards organizations or government agencies such as the Food and Drug Administration (FDA), the Centers for Disease Control and Prevention (CDC), and the like), third-party drug information databases (e.g., databases maintained by pharmaceutical vendors or researchers, universities, and the like), proprietary databases (e.g., databases maintained by an entity that operates the computing device 102), other databases, or a combination thereof. Particular, non-limiting examples of publicly available or accessible databases include the ZINC database (“Zinc15” or “Zinc12”), the chEMBL database, the PubChem database, the SIDER database (“SIDER Side Effect Resource”), the Binding Database (“BindingDB”), the DrugBank database (“DrugBank Online”), and the like.

The databases 150 are configured to store the pharmaceutical data 132 that indicates properties, such as physiochemical properties, efficacy, human interactions, side effects, and the like, associated with multiple previously-identified pharmaceutical molecules. In some implementations, the databases 150 are configured to store (e.g., the pharmaceutical data 132 includes) physiochemical data 152, drug impact data 154, side effect data 156, toxicity data 158, solubility data 160, other pharmaceutical data, or a combination thereof. The physiochemical data 152 may indicate physiochemical structures of the previously-identified molecules, such as the elements, and shape or structure of the elements, that form the previously-identified molecules. In some implementations, at least a portion of the physiochemical data 152 may be formatted in accordance with the simplified molecular-input line-entry system (SMILES). SMILES is a line notation for describing the structure of chemical species using short ASCII strings that include letters and numbers indicating elements (and their respective quantity) and other symbols (e.g., - = # $ : / \) that represent different types of bonds between the elements. As an example, a molecule of carbon dioxide may be represented as O=C=O in the SMILES notation. The SMILES notation is designed such that a molecule represented by a SMILES notation can be easily converted to a two-dimensional or three-dimensional model of the respective molecule.
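To illustrate the notation, a heavily simplified SMILES reader might count element and bond symbols as follows. This is an illustrative sketch only; real SMILES parsing is handled by cheminformatics toolkits and covers rings, branches, aromaticity, and charges, none of which are handled here:

```python
import re
from collections import Counter

def parse_simple_smiles(smiles):
    """Very simplified SMILES reader: counts element symbols and explicit bond
    symbols. Handles only a small subset of the notation (no rings, branches,
    lowercase aromatic atoms, or charges) -- enough to illustrate the format."""
    # Two-letter symbols (Cl, Br) must be matched before single letters.
    atoms = Counter(re.findall(r"Cl|Br|[BCNOPSFI]", smiles))
    bonds = Counter(ch for ch in smiles if ch in "-=#$")
    return atoms, bonds

# Carbon dioxide, O=C=O: two oxygens double-bonded to one carbon.
atoms, bonds = parse_simple_smiles("O=C=O")
# atoms -> {'O': 2, 'C': 1}; bonds -> {'=': 2}
```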

The drug impact data 154 may indicate impacts and effects of the previously-identified molecules on the human body (e.g., on a recipient of the drug), such as changes experienced with respect to symptoms of a disease or condition, effects on functioning of the body, effects on organs or other body parts, and the like. The side effect data 156 may indicate side effects associated with the previously-identified molecules on the human body, such as effects on functioning of the body or parts of the body that are unrelated to treatment of the disease or condition. The toxicity data 158 may indicate measurements of the toxic effects of the previously-identified drugs on the human body, such as the LD50 (e.g., the median lethal dose), as a non-limiting example. The solubility data 160 may indicate measurements of the solubility of the previously-identified drugs in solvents, such as water and organic solvents (methanol, ethanol, propanol, acetone, ethyl acetate, hexane, heptane, dichloromethane, tetrahydrofuran, acetonitrile, dimethylformamide, toluene, dimethylsulfoxide, etc.), or a combination thereof. The above-described types of pharmaceutical data are not intended to be limiting, and in other implementations, other types of pharmaceutical data may be stored by the databases 150, such as affinity data, selectivity data, efficacy data, metabolic stability data, oral bioavailability data, and the like.

The client device 162 may include or correspond to a computer device used by a client of the entity that operates the computing device 102 to perform molecule identification (e.g., drug discovery). For example, the client device 162 may be operated by a pharmaceutical company, a university, a research institution, or the like, that is engaged in drug discovery. The client device 162 may include or correspond to a computing device, such as a desktop computer or a laptop computer, a server, a mobile device (e.g., a smart phone, a tablet computer, a wearable device, a personal digital assistant (PDA), or the like), an audio/visual device, an entertainment device, a control device, a vehicle (or a component thereof), a VR device, an AR device, an XR device, or the like. The client device 162 may be configured to receive the trained ML models 126 (or configuration data associated with the trained ML models 126) for use in the drug discovery process.

The drug production system 164 may include one or more automated or semi-automated devices or pieces of equipment configured to perform operations of drug formulation. For example, the drug production system 164 may include or correspond to agitators, blowers, boilers, centrifuges, chillers, cooling towers, dryers, homogenizers, mixers, ovens, and the like. Components of the drug production system 164 may include processors, memories, interfaces, motors, sensors, and the like that are configured to enable fully or semi-automated performance of one or more operations, in addition to communication with other components of the drug production system 164 or other devices. In some implementations, the drug production system 164 may be configured to receive instructions from the computing device 102 for initiating one or more operations.

During operation of the system 100, the computing device 102 may obtain the pharmaceutical data 132 from the databases 150. For example, the computing device 102 may query the databases 150 and receive the pharmaceutical data 132 (or a portion thereof). As another example, the computing device 102 may manually pull the pharmaceutical data 132 (or a portion thereof) from the databases 150 using one or more pull commands. As another example, the computing device 102 may extract the pharmaceutical data 132 (or a portion thereof) from websites or other publicly accessible information displays that are supported by the databases 150, such as using a crawler or other data mining techniques. As described above, the pharmaceutical data 132 may indicate properties of multiple previously-identified pharmaceutical molecules. In some implementations, the pharmaceutical data 132 may include the physiochemical data 152, the drug impact data 154, the side effect data 156, the toxicity data 158, the solubility data 160, other types of pharmaceutical data, or a combination thereof.

The data processing and transformation engine 122 may process and transform the pharmaceutical data 132, such as transforming different types of data included in the pharmaceutical data 132 to a common format or type. In some implementations, the data processing and transformation engine 122 may perform pre-processing on the pharmaceutical data 132. Performing the pre-processing may reduce complexity of feature extraction to be performed on the pharmaceutical data 132, reduce the memory footprint associated with the pharmaceutical data 132, clean up the pharmaceutical data 132, format the pharmaceutical data 132, or a combination thereof. For example, the pre-processing may include performing statistical analysis on the pharmaceutical data 132 to remove or modify an outlier from the pharmaceutical data 132, removing an entry from the pharmaceutical data 132 that is associated with a variance that fails to satisfy a variance threshold, formatting the pharmaceutical data 132, approximating a missing entry of the pharmaceutical data 132 (e.g., using interpolation or other statistical modeling techniques), other pre-processing operations, or a combination thereof. Additionally or alternatively, the data processing and transformation engine 122 may perform dimensionality reduction on the pharmaceutical data 132 (or extracted features) to reduce a memory footprint associated with the pharmaceutical data 132 and to reduce processing complexity of the feature extraction performed by the training engine 124. The dimensionality reduction may project the pharmaceutical data 132 onto a lower-dimension feature space, such as by principal component analysis, singular value decomposition, or the like.
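The pre-processing and dimensionality-reduction steps described above might be sketched as follows; the thresholds and the choice of mean imputation for missing entries are illustrative assumptions, not the claimed implementation:

```python
import numpy as np

def preprocess(X, var_threshold=1e-6, z_cutoff=3.0):
    """Sketch of the pre-processing described above: fill missing entries,
    drop low-variance features, and clip outliers (thresholds are illustrative)."""
    X = X.astype(float)
    # Approximate missing entries (NaN) with the column mean (simple imputation).
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)
    # Remove features whose variance fails to satisfy the variance threshold.
    X = X[:, X.var(axis=0) > var_threshold]
    # Clip outliers beyond z_cutoff standard deviations from the mean.
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return np.clip(X, mu - z_cutoff * sd, mu + z_cutoff * sd)

def reduce_dims(X, k):
    """Project onto the top-k principal components via singular value
    decomposition (equivalent to PCA on centered data)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy data: a missing entry in column 2 and a constant (zero-variance) column 1.
X = np.array([[1.0, 5.0, np.nan],
              [2.0, 5.0, 3.0],
              [3.0, 5.0, 4.0]])
Xp = preprocess(X)          # constant column dropped, NaN imputed
Z = reduce_dims(Xp, 1)      # projected onto one principal component
```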

The training engine 124 may generate the training data 110 based on the processed pharmaceutical data 132 from the data processing and transformation engine 122. Generating the training data 110 may include extracting a predetermined set of features from the pharmaceutical data 132, which may include performing one or more operations to convert the pharmaceutical data 132 to a different type of data from which features that are acceptable to the ML models 126 may be extracted. In some implementations, the training engine 124 may extract numerical features from the pharmaceutical data 132. For example, the numerical features may include toxicity measurements, solubility measurements, atomic weights, or the like. The training engine 124 may scale or otherwise transform the extracted numerical features, such as performing a normalization transformation, a standardization transformation, a power transformation, a quantile transformation, or a combination thereof, on the extracted numerical features. Additionally or alternatively, the training engine 124 may extract numerical features from non-numerical features in the pharmaceutical data 132. As an example, the training engine 124 may convert categorical features or binary features to integer values, such as ‘1’ or ‘0’ for ‘yes’ and ‘no,’ respectively, or create integer values from multiple different categories, such as using a one-hot encoding. As another example, the training engine 124 may perform NLP on text data of the pharmaceutical data 132 to convert the text data into numerical features. The NLP may include tokenization, removing stop words, stemming, lemmatization, bag of words processing, other NLP, or a combination thereof. In some implementations, at least a portion of the pharmaceutical data 132, such as the physiochemical data 152, may be SMILES-formatted text data. 
For example, physiochemical structures of the previously-identified molecules may be represented by strings of characters according to the SMILES notation, such as O=C=O for carbon dioxide. In such implementations, the training engine 124 may perform NLP on the SMILES-formatted strings to convert the SMILES-formatted strings to numerical features, such as numbers of various elements, numbers of various types of bonds, correspondence between the bonds and the elements, etc. As another example, the training engine 124 may perform NLP on text data included in the pharmaceutical data 132 to extract numerical features corresponding to other textual information, such as drug impact or efficacy information associated with the previously-identified molecules, side effects associated with the previously-identified molecules, and the like. After extracting the features, the training engine 124 may vectorize or otherwise group the extracted features to a format that may be processed by the ML models 126 to generate the training data 110.
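A minimal sketch of the categorical-to-numerical conversion, scaling, and vectorization steps described above; the feature names and values are hypothetical:

```python
import numpy as np

def one_hot(values, categories):
    """Encode categorical features as one-hot numerical vectors."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

def min_max_scale(x):
    """Normalization transformation: scale numerical features to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical per-molecule records: a solubility measurement and a toxicity class.
solubility = min_max_scale([0.2, 1.4, 0.9])
tox_class = one_hot(["low", "high", "low"], ["low", "medium", "high"])

# Group (vectorize) the extracted features into one training vector per molecule.
training_vectors = np.hstack([solubility[:, None], tox_class])
```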

After generating the training data 110, the training engine 124 may train the ML models 126 to identify the identified molecules 112 based on the training data 110. In some implementations, training the ML models 126 may include segmenting the training data 110 into a training set and a test set. The training engine 124 may provide the training set to the ML models 126 to train the ML models 126 to identify molecules (e.g., the identified molecules 112) based on underlying similarities between the previously-identified molecules that are derived from the training set. In addition to, or as part of the training, the training engine 124 may adjust one or more parameters or hyper-parameters associated with the ML models. In some implementations, the training engine 124 may train the ML models 126 to identify the identified molecules 112 that have (or are predicted or likely to have) particular properties. To illustrate, the computing device 102 may obtain the selected properties 114 and generate the training data 110 and train the ML models 126 such that the identified molecules 112 have (or are predicted to have) the selected properties 114. In some implementations, the computing device 102 may receive the selected properties 114 from an I/O device or a user device that receives user input indicating the selected properties 114. Additionally or alternatively, the computing device 102 may determine the selected properties 114 based on a target disease or condition for which molecules are to be identified. As an example, the selected properties 114 may include particular physiochemical structures (e.g., particular elements, particular types of bonds, or the like), a particular solubility, lack of a particular side effect, or the like.
To train the ML models 126 to identify molecules having the selected properties 114, the training engine 124 may assign greater weighting values to portions of the training data 110 that are associated with previously-identified molecules that have the selected properties 114 than to portions of the training data 110 that are associated with previously-identified molecules that do not have the selected properties 114, as a non-limiting example.
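The weighting scheme described above might be sketched as follows, where the property names and the boost factor are illustrative assumptions:

```python
def sample_weights(molecules, selected_properties, boost=2.0):
    """Assign greater weight to training examples whose molecules exhibit all
    of the selected properties (the boost factor of 2.0 is illustrative)."""
    weights = []
    for props in molecules:
        has_all = all(p in props for p in selected_properties)
        weights.append(boost if has_all else 1.0)
    return weights

# Hypothetical property sets for three previously-identified molecules.
mols = [{"water_soluble", "low_toxicity"},
        {"low_toxicity"},
        {"water_soluble"}]
w = sample_weights(mols, {"water_soluble"})
# w -> [2.0, 1.0, 2.0]
```

During training, such per-example weights could scale each example's contribution to the loss, biasing the models toward molecules exhibiting the selected properties 114.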

In some implementations, after the training engine 124 trains the ML models 126, the identification engine 128 may access the ML models 126 to identify one or more previously-unidentified molecules (e.g., the identified molecules 112). The computing device 102 may generate an output 134 that indicates the identified molecules 112. The output 134 may be displayed to a user, provided to another device, or used to initiate performance of one or more operations. As an example, the computing device 102 may provide the output 134 to the display device 130 to cause the display device 130 to display a graphical user interface (GUI). The GUI may include text indicating the identified molecules 112 (e.g., names of the identified molecules 112, SMILES strings indicating the physiochemical structure of the identified molecules 112, and the like), visual representations of the identified molecules (e.g., 2D or 3D representations of the molecular structure), other text or multimedia content representing the identified molecules 112, or a combination thereof. Additionally or alternatively, the GUI may include text, graphical, or multimedia content that indicates properties of the identified molecules 112, such as a list of side effects, solubility measurements, toxicity measurements, likely impacted organs, and the like, and/or comparisons of the properties of the identified molecules 112 to properties of previously-identified molecules, such as graphs, charts, or the like. As another example, the computing device 102 may provide the output 134 to another device, such as the client device 162 or a user device. As another example, the computing device 102 may provide the output 134 to the drug production system 164 to initiate performance of one or more operations at the drug production system 164. 
To illustrate, the output 134 may include or correspond to one or more instructions that cause the drug production system 164 to perform one or more operations to facilitate formation of the identified molecules 112. For example, the one or more instructions may initiate mixing of chemicals in a mixer, activation of a heater or a cooler to change a state of a chemical, retrieval of one or more samples from a vault or other storage location, or the like.

Additionally or alternatively, the computing device 102 may provide the trained ML models 126 to the client device 162. For example, after training the ML models 126, the computing device 102 may generate configuration information that indicates the parameters, the hyper-parameters, and any other configuration of the trained ML models 126, and the computing device 102 may provide the configuration information to the client device 162 to enable the client device 162 to implement the trained ML models 126 at the client device 162 for identifying molecules as part of drug discovery performed at the client device 162. In some implementations, the computing device 102 may be configured to train the ML models 126 but not to perform molecule identification, instead leaving the molecule identification to be performed by the client device 162. In such implementations, the computing device 102 does not include the identification engine 128.

In some implementations, the training engine 124 may further train the ML models 126 based on results associated with the identified molecules 112. To illustrate, the training engine 124 may receive testing data associated with tests of the identified molecules 112, and the training engine 124 may generate the additional training data 116 based on the testing data and the identified molecules 112 using the techniques described above for the training data 110. For example, the testing data may indicate properties of the identified molecules 112, such as solubility or toxicity of the identified molecules 112, as well as observations from clinical testing of drugs formed from the identified molecules, such as observed success (or failure) in treating a particular disease or condition, effects on the patients, side effects experienced by the patients, and the like. In this manner, the ML models 126 may be dynamically updated to improve the utility of molecules identified by the ML models 126 based on new information.

As described above, the system 100 supports training of the ML models 126 to automatically identify the identified molecules 112 (e.g., pharmaceutical molecules that have not been previously identified). Using artificial intelligence and machine learning to identify the identified molecules 112 based on the pharmaceutical data 132, which may be associated with large quantities of previously-identified pharmaceutical molecules used as ingredients in related and unrelated drugs, may result in identification of a wider variety of new pharmaceutical molecules (e.g., previously-unidentified pharmaceutical molecules). At least some of these molecules would not be identified by a human drug expert (e.g., a chemist or biochemist) using existing drug discovery processes. To illustrate, the ML models 126 may be trained to identify the identified molecules 112 based on underlying similarities between multiple previously-identified molecules, and many of these underlying similarities may not be apparent to the human drug expert. Thus, the identified molecules 112 may be more similar to successful drugs (even if the drugs are not used to treat the same disease or condition), and therefore are more likely to be useful in producing new drugs than molecules that are manually identified by the human drug expert. The ML models 126 may also be trained based on testing results associated with the identified molecules 112 to improve the quality of the molecule identification performed by the ML models 126. Additionally, automated identification of the identified molecules 112 by the ML models 126 may be faster than molecule identification by other systems that require substantial user interaction and decision making by the human drug expert. By increasing the likelihood of identifying useful molecules in a shorter period of time, the system 100 may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs.

Referring to FIG. 2, another example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects is shown as a system 200. In some implementations, the system 200 may include or correspond to the system 100 of FIG. 1. As shown in FIG. 2, the system 200 (also referred to as a drug discovery platform) includes data sources 202, a data import layer 210, a data storage layer 220, a data transformation layer 230, an artificial intelligence/machine learning (AI/ML) engine 240, an access layer 250, an application programming interface (API) management layer 260, other devices 270, and a message orchestration and logging layer 280.

The data sources 202 include multiple data sources, such as databases, for accessing pharmaceutical data for use in training ML models to identify pharmaceutical molecules. In the particular implementation illustrated in FIG. 2, the data sources 202 may include a drug bank 204, the ZINC database 206, a binding database 208, and the chEMBL database 209. In other implementations, the data sources 202 may include other data sources, such as other publicly available databases, third party databases, proprietary databases, or a combination thereof, as further described with reference to FIG. 1. The drug bank 204 may include a database of drugs, such as those released by an operator or client of the system 200, or a third party. The drug bank 204 may store information associated with the drugs, such as physiochemical structures of molecules used as ingredients, efficacy data, side effects associated with the drugs, and the like. The ZINC database 206 is a publicly available database that maintains pharmaceutical data for multiple pharmaceutical molecules. For example, the ZINC database 206 may store SMILES-formatted ligand structures, molecular weights, partition coefficients (Log P values), druglikeness metrics (QED values), molecular ring structure data, hydrogen bond (H-bond) donor and acceptor data, target class data, and the like. The binding database 208 may store data that indicates binding information for pharmaceutical molecules to various proteins, which is useful in screening the newly identified molecules for their success in treating different diseases or conditions, as further described herein with reference to FIG. 3. For example, the binding database 208 may store ligand names, SMILES-formatted ligand structures, target names, half maximal inhibitory concentration (IC50) values, and the like. 
The chEMBL database 209 is a publicly available database that maintains pharmaceutical data for multiple pharmaceutical molecules, similar to the ZINC database 206.

The data import layer 210 may be configured to import (e.g., obtain) pharmaceutical data from the data sources 202 for use as training data. The data import layer 210 may be configured to request and receive the pharmaceutical data from the data sources 202, to extract the pharmaceutical data from information supported by the data sources 202, to pull the pharmaceutical data from the data sources 202, or a combination thereof. For example, the data import layer 210 may include Python scripts 212, a crawler 214, and manual pull logic 216. The Python scripts 212 may be executable scripts in Python (or another scripting language) that, when executed by the data import layer 210, cause the data import layer 210 to request and/or query the data sources 202 for various pharmaceutical data. In some implementations, the Python scripts 212 may be configured to interact with one or more application programming interfaces (APIs) of the data sources 202 to receive the pharmaceutical data. The crawler 214 may include or correspond to a web crawler, or other data mining application, that is configured to extract pharmaceutical data from websites (or other sources) that are supported by the data sources 202. The manual pull logic 216 may be configured to perform one or more pull operations with respect to the data sources 202 to retrieve pharmaceutical data.
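A sketch of how one of the Python scripts 212 might compose a query against a data source; the endpoint and field names are hypothetical, since each data source exposes its own API:

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- real data sources each expose their own API and schema.
BASE_URL = "https://example.org/api/molecules"

def build_query(target, fields, limit=100):
    """Compose the request URL a data-import script might send to a
    pharmaceutical database API (parameter names are illustrative)."""
    params = {"target": target, "fields": ",".join(fields), "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"

# Example: request SMILES strings, Log P values, and QED metrics for a target.
url = build_query("kinase", ["smiles", "logp", "qed"])
```

An actual import script would then issue the request (e.g., over HTTP), paginate through results, and hand the responses to the data storage layer 220.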

The data storage layer 220 may be configured to store the imported (e.g., obtained) pharmaceutical data from the data sources 202. For example, the data storage layer 220 may store the pharmaceutical data as one or more datasets, such as a first dataset 222, a second dataset 224, and a third dataset 226, as shown in FIG. 2. In other implementations, the pharmaceutical data may be stored as fewer than three datasets or more than three datasets. The datasets 222-226 may correspond to different types of data (e.g., physiochemical data, side effects data, toxicity data, etc.), different types of drugs or targeted diseases, different properties (e.g., particular molecular structures, particular solubilities, etc.), or may be segregated in other manners. In some implementations, the datasets 222-226 may be stored at one or more cloud storage locations for further analysis and retained in different source folders for downstream component analysis.

The data transformation layer 230 may be configured to pre-process and transform the stored pharmaceutical data (e.g., the datasets 222-226) into a format that can be used as training data for ML models. The data transformation layer 230 may include a first data flow 232, custom Python scripts 234, and a second data flow 236. In other implementations, the data transformation layer 230 may include a single data flow or more than two data flows, different types of scripts for processing and transforming data, or a combination thereof. The first data flow 232 and the second data flow 236 may correspond to particular datasets, such as the first dataset 222 and the second dataset 224, respectively. The custom Python scripts 234 may be configured to perform pre-processing operations, transformation operations, feature extraction operations, training data generation operations, or a combination thereof. For example, the custom Python scripts 234 may be configured to perform statistical analysis on the pharmaceutical data to remove or modify an outlier from the pharmaceutical data, remove an entry from the pharmaceutical data that is associated with a variance that fails to satisfy a variance threshold, format the pharmaceutical data, approximate a missing entry of the pharmaceutical data (e.g., using interpolation or other statistical modeling techniques), perform other pre-processing operations, or a combination thereof. Additionally or alternatively, the custom Python scripts 234 may be configured to perform dimensionality reduction on the pharmaceutical data to reduce a memory footprint associated with the pharmaceutical data and to reduce processing complexity of the feature extraction. The dimensionality reduction may project the pharmaceutical data onto a lower-dimension feature space, such as by principal component analysis, singular value decomposition, or the like. 
The custom Python scripts 234 may be configured to extract numerical features from the processed pharmaceutical data, or perform operations to convert text data to numerical features. For example, the custom Python scripts 234 may perform NLP on text data to convert the text data into numerical features. The NLP may include tokenization, removing stop words, stemming, lemmatization, bag of words processing, other NLP, or a combination thereof. After extracting the features, the custom Python scripts 234 may vectorize or otherwise group the extracted features to a format that may be processed by ML models to generate training data.

The AI/ML engine 240 may be configured to train and support one or more ML models to identify new (e.g., previously-unidentified) pharmaceutical molecules. In some implementations, the AI/ML engine 240 may support one or more multi-objective GANs 242, one or more objective-reinforced GANs 244, one or more conditional deep GANs 246, one or more VAEs 248, and one or more multi-objective VAEs 249. In other implementations, the AI/ML engine 240 may support fewer ML models, more ML models, or different ML models. In some implementations, the one or more multi-objective GANs 242, the one or more objective-reinforced GANs 244, the one or more conditional deep GANs 246, the one or more VAEs 248, the one or more multi-objective VAEs 249, or a combination thereof, may be implemented using neural networks (e.g., convolutional neural networks, deep neural networks, neural networks with hidden layers, and the like). In some other implementations, the one or more multi-objective GANs 242, the one or more objective-reinforced GANs 244, the one or more conditional deep GANs 246, the one or more VAEs 248, the one or more multi-objective VAEs 249, or a combination thereof, may be implemented using other types of ML models or structures, such as decision trees, random forests, regression models, BNs, DBNs, NB models, Gaussian processes, HMMs, and the like.

The AI/ML engine 240 may be configured to receive training data from the data transformation layer 230 and to provide the training data to the ML models 242-249 to train the ML models 242-249 to identify pharmaceutical molecules, as described with reference to FIG. 1. In some implementations, training the ML models 242-249 may include keeping aside a portion of the received data as test data to test performance of the trained ML models (e.g., to identify whether additional training should be performed, or to identify which of multiple ML models performs the best). Additionally or alternatively, the AI/ML engine 240 may be configured to train the ML models 242-249 to identify pharmaceutical molecules having (or predicted to have) one or more selected properties. For example, one or more properties, such as a particular physiochemical structure, a particular molecular weight, a particular toxicity, a particular side effect (or lack thereof), or the like, may be selected, and the training data may be generated to enable training for identification of pharmaceutical molecules having (or predicted to have) the properties, such as by weighting portions of the training data that correspond to the properties or using other techniques. The selected properties may be indicated by user input or determined based on a particular starting molecule, a particular target disease, or other parameters.
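The hold-out step described above (keeping aside a portion of the received data as test data) can be sketched as follows; the 20% test fraction is an illustrative assumption:

```python
import numpy as np

def train_test_split(X, test_fraction=0.2, seed=0):
    """Keep aside a random fraction of the data as a test set; the remainder
    is used to train the ML models."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    return X[idx[n_test:]], X[idx[:n_test]]

# Toy dataset: 50 examples with 2 features each.
X = np.arange(100).reshape(50, 2)
train, test = train_test_split(X)
# train holds 40 examples, test holds the other 10
```

Model performance on the held-out set can then drive decisions such as whether to continue training or which of the ML models 242-249 to deploy.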

In some implementations, different ML models of the ML models 242-249 may be trained differently (e.g., using different training data) than others of the ML models 242-249. For example, the multi-objective GANs 242 may be trained using multiple discriminators, each associated with its own loss function, using multi-objective optimization techniques such as multiple gradient descent (MGD), hypervolume maximization (HVM), or the like. Similar training may be performed for the multi-objective VAEs 249. As another example, the objective-reinforced GANs 244 may be trained using reinforcement learning (RL) to bias the objective-reinforced GANs 244 toward achieving particular metrics (e.g., objectives). As another example, the conditional deep GANs 246 may be trained to identify particular labels of molecules, such as molecules having selected properties.

In some implementations, one or more of the ML models 242-249 may be configured to operate as a language generation model, where the language is a description of molecules and/or molecular properties. For example, molecules may be described using strings according to SMILES notation, and the ML models 242-249 may be configured to perform SMILES to property (e.g., molecular properties) prediction, SMILES to latent space mapping, latent space to SMILES mapping, latent space to property prediction, or a combination thereof. In some implementations, some or all of these operations may be implemented by performing multi-objective, semi-supervised learning using ML models such as VAEs, GANs, or the like. To illustrate, the ML models may perform generative processes that include generating an input variable x from a generative distribution p_θ(x|y, z), which is parameterized by θ and conditioned on an output variable y and a latent variable z. When x is not labeled, y may be treated as an additional latent variable, which may require introducing a distribution over y. The prior distributions over y and z may be assumed to be p(y) = N(y|μ_y, Σ_y) and p(z) = N(z|0, I). Variational inference may be used to address the intractability of exact posterior inference for the model, such that the posterior distributions over y and z may be approximated by q_ϕ(y|x) = N(y|μ_ϕ(x), diag(σ²_ϕ(x))) and q_ϕ(z|x, y) = N(z|μ_ϕ(x, y), diag(σ²_ϕ(x, y))), both of which are parameterized by ϕ. For the semi-supervised learning scenario in which some values of y are missing, the missing values may be predicted by q_ϕ(y|x).
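
As a minimal, illustrative sketch (using only the Python standard library, with hypothetical encoder outputs standing in for learned parameters), the reparameterized sampling and log-density evaluation that such diagonal Gaussian posteriors involve may be written as:

```python
import math
import random

def sample_gaussian(mu, sigma):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def gaussian_log_pdf(value, mu, sigma):
    """log N(value | mu, diag(sigma^2)), summed over dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - (v - m) ** 2 / (2.0 * s * s)
        for v, m, s in zip(value, mu, sigma)
    )

# Hypothetical encoder outputs for one input x (not part of the disclosure):
mu_z, sigma_z = [0.1, -0.2], [0.5, 0.5]     # parameters of q(z | x, y)
z = sample_gaussian(mu_z, sigma_z)          # latent draw fed to p_theta(x | y, z)
log_q = gaussian_log_pdf(z, mu_z, sigma_z)  # term entering the ELBO below
```

In a trained system the means and standard deviations would come from the encoder networks; here they are fixed constants purely so the sketch runs.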

In some implementations, objective functions for training the ML models may be based on the following functions. The conditional loss function, derived from log p(x, y), is given by Equation 1 below.

Equation 1—Example Conditional Loss Function

log p(x, y) ≥ 𝔼_q(z|x,y)[log p_θ(x|y, z) + log p(y) + log p(z) − log q(z|x, y)]
      = 𝔼_q(z|x,y)[log p_θ(x|y, z) + log p(y)] − D_KL(q(z|x, y) ∥ p(z))
      = −ℒ(x, y)

The unconditioned loss function, derived from log p(x), is given by Equation 2 below.

Equation 2—Example Unconditioned Loss Function

log p(x) ≥ 𝔼_q(y,z|x)[log p_θ(x|y, z) + log p(y) + log p(z) − log q(y, z|x)]
      = 𝔼_q(y,z|x)[log p_θ(x|y, z)] − D_KL(q(y|x) ∥ p(y)) − 𝔼_q(y|x)[D_KL(q(z|x, y) ∥ p(z))]
      = −𝒰(x)

The final cost function, τ, is given by Equation 3 below.

τ = Σ_{(x,y)∈p_l} ℒ(x, y) + Σ_{x∈p_u} 𝒰(x) + β · Σ_{(x,y)∈p_l} ‖y − 𝔼_q(y|x)[y]‖²

where p_l denotes the labeled data and p_u denotes the unlabeled data.

Equation 3—Example Final Cost Function

The property prediction model, ŷ, is given by Equation 4 below.


ŷ ~ N(μ(x), diag(σ²(x)))

Equation 4—Example Property Prediction Model

The molecule generation (e.g., identification) functions are given by Equations 5 and 6 below.


x̂ = argmax_x log p_θ(x|y, z)

Equation 5—Example Molecule Generation Function


p_θ(x|y, z) = ∏_j p_θ(x^(j) | x^(1), …, x^(j−1), y, z)

Equation 6—Example Molecule Generation Function
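
For illustration only, the way Equation 3 assembles the conditional loss, the unconditioned loss, and the property-regression penalty may be sketched in Python as follows. The stand-in loss callables, the constant values, and the toy SMILES strings are hypothetical placeholders for trained-model terms (the ELBOs of Equations 1 and 2 and the posterior mean of Equation 4), not part of the disclosed implementation:

```python
def final_cost(labeled, unlabeled, beta, cond_loss, uncond_loss, predict_y):
    """tau = sum over labeled pairs of L(x, y), plus sum over unlabeled x of
    U(x), plus beta times the squared error between labels y and the posterior
    mean prediction E_q(y|x)[y] (the structure of Equation 3)."""
    tau = sum(cond_loss(x, y) for x, y in labeled)
    tau += sum(uncond_loss(x) for x in unlabeled)
    tau += beta * sum(
        sum((yi - pi) ** 2 for yi, pi in zip(y, predict_y(x)))
        for x, y in labeled
    )
    return tau

# Hypothetical stand-ins: constants in place of learned ELBO terms.
labeled = [("CCO", [0.5]), ("CCN", [0.2])]   # (SMILES, property) pairs
unlabeled = ["CCC"]                          # SMILES without labels
cond = lambda x, y: 1.0                      # placeholder for L(x, y)
uncond = lambda x: 2.0                       # placeholder for U(x)
pred = lambda x: [0.4]                       # placeholder for E_q(y|x)[y]

tau = final_cost(labeled, unlabeled, beta=0.5, cond_loss=cond,
                 uncond_loss=uncond, predict_y=pred)
```

The sketch only shows how the three terms of the final cost combine; in practice each callable would be backed by the encoder/decoder networks described above.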

In some implementations, to identify new molecules, one or more of the ML models 242-249 may be configured to perform a beam search or a beam stack search (e.g., an improved beam search). A beam search is a greedy approach for generating new molecules in which the algorithm is initialized with a random array of a particular size. In some implementations, the array includes float values that are normally distributed (e.g., using a Gaussian distribution). The beam search is a heuristic approach that retains only the β most promising nodes (instead of all nodes) at each iteration of the search for further branching, where β is referred to as the beam width. The beam search may be viewed as an optimization of best-first search that reduces memory requirements. In some implementations, the beam search may be performed according to the following pseudocode.

OPEN = {initial state}
while OPEN is not empty do {
    Remove the best node from OPEN, call it n
    If n is the goal state, back trace path to n (through recorded parents)
        and return path
    Create n's successors
    Evaluate each successor, add it to OPEN, and record its parent
    If |OPEN| > β, take the best β nodes (according to heuristic)
        and remove the others from OPEN
} done
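
The pseudocode above may be rendered as a runnable Python sketch; the toy integer search space, the heuristic, and the parameter names are illustrative assumptions, not part of the disclosed molecule-generation models:

```python
import heapq

def beam_search(start, successors, heuristic, is_goal, beam_width, max_steps=100):
    """Beam search per the pseudocode above: keep only the beam_width most
    promising nodes in OPEN at each iteration, recording parents so the path
    to the goal can be back-traced."""
    open_set = [start]
    parents = {start: None}
    for _ in range(max_steps):
        if not open_set:
            break
        open_set.sort(key=heuristic)   # lowest heuristic = most promising
        n = open_set.pop(0)            # remove the best node from OPEN
        if is_goal(n):
            path = []
            while n is not None:       # back-trace through recorded parents
                path.append(n)
                n = parents[n]
            return path[::-1]
        for succ in successors(n):     # create and evaluate successors
            if succ not in parents:
                parents[succ] = n
                open_set.append(succ)
        # Prune: retain only the beam_width best nodes.
        open_set = heapq.nsmallest(beam_width, open_set, key=heuristic)
    return None

# Toy usage: reach 10 from 0 using +1 / +2 steps.
path = beam_search(
    0,
    successors=lambda n: [n + 1, n + 2],
    heuristic=lambda n: abs(10 - n),
    is_goal=lambda n: n == 10,
    beam_width=3,
)
```

In the molecule-generation setting, nodes would instead be partial SMILES sequences and the heuristic would come from the trained generative model's scores.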

The access layer 250 may be configured to support one or more APIs for enabling interaction between the AI/ML engine 240 (or other components of the system 200) and the other devices 270 and/or user devices. The access layer 250 may include one or more generated model APIs 252 and one or more other APIs 254. The generated model APIs 252 may enable interaction between the ML models maintained by the AI/ML engine 240, such as the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, the multi-objective VAEs 249, or a combination thereof, with the other devices 270. The other APIs 254 may enable interaction between other components of the system 200 and external devices, such as user devices. The API management layer 260 may be configured to manage operation of the APIs supported by the access layer 250 (e.g., the generated model APIs 252 and the other APIs 254).

The other devices 270 may include devices that interact with the system 200 (e.g., the drug discovery platform), such as client devices, servers, and the like. For example, the other devices 270 may include a front end client 272, a front end server 274, and a back end server 276. The front end client 272 may be configured to enable client interaction with the ML models maintained by the AI/ML engine 240 to enable pharmaceutical molecule identification at the front end client 272. In some other implementations, the AI/ML engine 240 may train the ML models and provide configuration information associated with the ML models to the front end client 272 such that the front end client 272 stores and operates the ML models to perform pharmaceutical molecule identification. The front end server 274 and the back end server 276 may store data used to support the AI/ML engine 240 (or other components of the system 200), such as training data, processed pharmaceutical data, results data, input data, and the like.

The message orchestration and logging layer 280 may be configured to generate and transmit messages, such as to user devices, and to log the messages. For example, the message orchestration and logging layer 280 may be configured to transmit messages and/or to initiate display of GUIs that enable user interaction with the molecule identification process, such as by providing user input indicating target diseases, target starting molecules, selected properties to be associated with identified molecules, and the like, or by viewing information regarding the identified molecules, such as names, molecular structures, expected properties, results data, or comparisons of the identified molecules to previously-identified molecules associated with the data sources 202. In some implementations, the message orchestration and logging layer 280 may provide a single point of access for users of the system 200.

As described above, the system 200 supports training of ML models (e.g., the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, the multi-objective VAEs 249, or a combination thereof) to automatically identify pharmaceutical molecules. Using artificial intelligence and machine learning to identify pharmaceutical molecules based on the pharmaceutical data from the data sources 202 may result in identification of a wider variety of new (e.g., previously-unidentified) pharmaceutical molecules, as well as molecules that are more likely to be successful (e.g., pharmaceutical molecules that have a higher likelihood of treating target diseases or conditions), than other drug discovery systems that rely substantially on the input and knowledge of a human drug expert.

Referring to FIG. 3, a flow diagram of an example of a method for identifying pharmaceutical molecules and for identifying uses for pharmaceutical molecules according to one or more aspects is shown as a method 300. In some implementations, the operations of the method 300 may be stored as instructions that, when executed by one or more processors (e.g., the one or more processors of a computing device or a server), cause the one or more processors to perform the operations of the method 300. In some implementations, the method 300 may be performed by one or more components of a system configured to perform pharmaceutical molecule identification (e.g., drug discovery), such as one or more components of the system 100 of FIG. 1, one or more components of the system 200 of FIG. 2, one or more components of a system configured to identify uses for pharmaceutical molecules (e.g., to screen and rank pharmaceutical molecules), or a combination thereof.

The method 300 includes collecting and selecting molecule and drug data, at 302. For example, the system may obtain pharmaceutical data from one or more databases or data sources, as described with reference to FIGS. 1 and 2. Additionally, the system may obtain binding data from one or more binding databases, the Delaney dataset (e.g., a standard regression dataset containing structures and water solubility data for multiple compounds), other types of drug-protein relation data, or a combination thereof. The binding data may indicate the likelihood of previously-identified molecules binding to one or more proteins, which may indicate which diseases or other conditions are treatable by the previously-identified molecules.

The method 300 includes training one or more generative ML models, at 304. For example, the system may train one or more generative ML models, such as VAEs or GANs, to identify “new” (e.g., previously-unidentified) pharmaceutical molecules, as described with reference to the ML models 126 of FIG. 1 and the ML models 242-249 of FIG. 2. The method 300 includes identifying one or more previously-unidentified molecules, at 306. For example, the system may access the trained generative ML models to identify pharmaceutical molecules that are not previously-identified based on the obtained pharmaceutical data. In some implementations, identifying the new pharmaceutical molecules may include conditional identification of molecules, at 308. For example, the generative ML models may be trained to identify particular types of molecules, such as molecules having (or expected to have) selected properties or molecules that are to be used to cure or treat particular diseases or conditions, as non-limiting examples. Additionally or alternatively, identifying the new pharmaceutical molecules may include unconditional identification of molecules, at 310. For example, the generative ML models may be trained to identify new pharmaceutical molecules without any constraints, instead based only on the underlying similarities between the previously-identified molecules that are derived from the training data.

The method 300 includes predicting a cluster to which one or more pharmaceutical molecules are assigned, at 312. Each of the clusters may correspond to one or more proteins to which molecules assigned to the cluster are likely to bind (or have been observed to bind). Binding of molecules to particular proteins may indicate which diseases or conditions the molecules may be used to treat. To illustrate, the system may train one or more ML models to perform unsupervised learning to cluster previously-identified molecules into clusters corresponding to binding proteins based on training data generated from the obtained pharmaceutical data, particularly the data obtained from the binding databases. Input data indicating one or more molecules (e.g., feature vectors generated from strings that combine various structural or other properties of the molecules) may be provided to the trained ML models to predict the cluster assignment using sparse subspace clustering (SSC). Clustering molecules in this manner limits the search space for proteins/diseases associated with the molecules, which may be desirable because the chemical search space is large (e.g., on the order of 10^60 molecules). In some implementations, the clustering may include density-based spatial clustering of applications with noise (DBSCAN), K-means clustering, K-means for large-scale clustering (LSC-K), longest common subsequence (LCS) clustering, longest common cyclic subsequence (LCCS) clustering, or the like, in order to cluster large volumes of high-dimensional data. In some implementations, newly-identified molecules from the generative ML models may be used as input data to the ML models that perform the clustering to predict the clusters assigned to the newly-identified molecules.
Additionally or alternatively, previously-identified molecules may be used as input data to the ML models that perform the clustering to predict other possible proteins that could be bound to the previously-identified molecules, thereby predicting other diseases that the previously-identified molecules could be used to treat. Thus, the cluster prediction may identify potential diseases to be treated by newly-identified molecules as well as additional diseases that may be treated by already-released drugs.
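
For illustration, the clustering step may be sketched with a plain K-means implementation over toy two-dimensional vectors; the vectors, the helper names, and the choice of K-means (rather than SSC or DBSCAN) are simplifying assumptions made only for this sketch:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means as one stand-in for the clustering step; each point is a
    numeric feature vector derived from a molecule (e.g., a fingerprint)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each molecule vector to its nearest center.
        assign = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers

# Two obvious groups of toy molecule vectors.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
assign, centers = kmeans(points, k=2)
```

Real feature vectors would be far higher-dimensional, which is why the disclosure mentions methods suited to large volumes of high-dimensional data.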

The method 300 includes generating cluster data, at 314. The cluster data may indicate the members of each cluster, the molecules closest to each cluster for target identification, proteins associated with the clusters, or a combination thereof. Additionally or alternatively, the system may determine scores for each molecule in a cluster to which a particular molecule is assigned; the scores may be used to filter the cluster into a subset of higher-scored molecules, and the cluster data may indicate the scores, the subset, or a combination thereof. To illustrate, each molecule assigned to the cluster may be scored using one or more scoring metrics, and the scores for a respective molecule may be averaged to generate an average score (or other aggregated score) for each molecule. The average scores may be compared to one or more thresholds to identify a subset of higher-scored candidate molecules, to identify one or more particular proteins to which the subset is most likely to bind, or a combination thereof. In some implementations, the scoring may be performed based on Tanimoto indices or coefficients, cosine similarity values, longest common subsequence (LCS) data, Library for the Enumeration of Modular Natural Structures (LEMONS) data, or the like.
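
A minimal sketch of Tanimoto-based scoring and averaging, with hypothetical fingerprints represented as sets of "on" bit indices (the fingerprints and helper names are illustrative assumptions):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints:
    |A ∩ B| / |A ∪ B|, a common molecular-similarity score."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def average_score(candidate, cluster_members):
    """Average the candidate's similarity against every molecule in its cluster."""
    scores = [tanimoto(candidate, m) for m in cluster_members]
    return sum(scores) / len(scores)

# Toy cluster of two fingerprints; a candidate is scored against both.
cluster = [{1, 2, 3, 4}, {2, 3, 4, 5}]
score = average_score({1, 2, 3, 5}, cluster)
```

The averaged score could then be compared against a threshold to keep or drop the candidate, as described above.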

The method 300 includes storing the cluster data in a database, at 316. The cluster data may include data representing members of the clusters, proteins associated with the clusters, scores associated with members of the clusters, other cluster data, or a combination thereof.

The method 300 includes performing conjoint analysis on the subset of molecules, at 318. The conjoint analysis may indicate which properties or characteristics of pharmaceutical molecules are most sought after by one or more clients, such as pharmaceutical companies, universities, private research firms, and the like. To illustrate, the conjoint analysis may include providing users with multiple questions that prompt the user to choose between potential molecules having combinations of different properties (as opposed to simply prompting the user to choose desired properties), and analyzing user input to the questions to calculate preference scores for the properties. Although described as being based on user input, in some other implementations, the conjoint analysis may be performed based on extracted or other data mined information, such as from company press releases indicating new drugs or areas of research, market valuations of particular drugs or potential drugs for curing particular diseases, other information, or the like. In some implementations, one or more ML models may be trained to predict preference scores for input candidate molecules based on training data derived from user responses or other historical information associated with previously-identified molecules or released drugs.

The method 300 includes ranking the subset of molecules based on the conjoint analysis, at 320. For example, based on scores determined during the conjoint analysis, the subset of molecules may be ranked and, optionally, further filtered based on one or more thresholds. The method 300 concludes by outputting recommendations for one or more molecules for use in drug testing and production, at 322. Due to the clustering and ranking, the recommended molecules may be more likely to result in useful or marketable drugs, and therefore more likely to result in shorter testing/development cycles and increased revenue for the clients.
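
For illustration, ranking and threshold-filtering a candidate subset by conjoint-derived preference scores may be sketched as follows; the property names, weights, and molecule records are hypothetical, not values from any actual conjoint analysis:

```python
def rank_and_filter(molecules, preference_score, threshold):
    """Rank candidate molecules by preference score (highest first) and
    drop those scoring below the threshold."""
    ranked = sorted(molecules, key=preference_score, reverse=True)
    return [m for m in ranked if preference_score(m) >= threshold]

# Hypothetical per-property preference weights learned from conjoint responses.
weights = {"low_toxicity": 0.6, "high_solubility": 0.3, "low_weight": 0.1}

def score(molecule):
    # Sum the weights of the preferred properties the molecule exhibits.
    return sum(weights[p] for p in molecule["properties"])

candidates = [
    {"name": "mol-A", "properties": ["low_toxicity"]},
    {"name": "mol-B", "properties": ["high_solubility", "low_weight"]},
    {"name": "mol-C", "properties": ["low_toxicity", "high_solubility"]},
]
recommended = rank_and_filter(candidates, score, threshold=0.5)
```

In the system described above, the weights would come from analyzing user responses to the choice-based questions (or from mined market data) rather than being fixed constants.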

Referring to FIG. 4, a flow diagram of an example of a method for pharmaceutical molecule identification using machine learning according to one or more aspects is shown as a method 400. In some implementations, the operations of the method 400 may be stored as instructions that, when executed by one or more processors (e.g., the one or more processors of a computing device or a server), cause the one or more processors to perform the operations of the method 400. In some implementations, the method 400 may be performed by a computing device, such as the computing device 102 of FIG. 1 (e.g., a computing device configured for pharmaceutical molecule identification or drug discovery), one or more components of the system 200 of FIG. 2, or a combination thereof.

The method 400 includes obtaining pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases, at 402. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. For example, the pharmaceutical data may include or correspond to the pharmaceutical data 132 of FIG. 1, which may include the physiochemical data 152, the drug impact data 154, the side effect data 156, the toxicity data 158, the solubility data 160, or a combination thereof.

The method 400 also includes performing NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, at 404. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. For example, the data processing and transformation engine 122 of FIG. 1 may perform NLP on a portion of the pharmaceutical data 132 to generate the training data 110. The method 400 further includes training, by the one or more processors, one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, at 406. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules. For example, the one or more ML models and the additional pharmaceutical molecules may include or correspond to the ML models 126 and the identified molecules 112, respectively, of FIG. 1.
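
As one illustrative possibility for the vectorization step (a simple character-level one-hot encoding, not necessarily the NLP performed by the data processing and transformation engine 122), SMILES strings may be converted to fixed-size matrices as follows:

```python
def smiles_vocabulary(smiles_list):
    """Build a sorted character vocabulary from a corpus of SMILES strings."""
    return sorted({ch for s in smiles_list for ch in s})

def vectorize(smiles, vocab, max_len):
    """One-hot encode a SMILES string, padded (with zero rows) or truncated
    to max_len positions."""
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for pos in range(max_len):
        row = [0] * len(vocab)
        if pos < len(smiles):
            row[index[smiles[pos]]] = 1
        rows.append(row)
    return rows

# Toy corpus: ethanol, benzene, acetic acid.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = smiles_vocabulary(corpus)
matrix = vectorize("CCO", vocab, max_len=8)
```

A production pipeline would typically tokenize multi-character SMILES atoms (e.g., "Cl", "Br") rather than single characters; the sketch keeps the simpler per-character assumption.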

In some implementations, the method 400 may also include generating an output that indicates one or more molecules identified by the one or more ML models. For example, the output may include or correspond to the output 134 of FIG. 1. In some such implementations, the method 400 may also include initiating, based on the output, display of a GUI that indicates the one or more molecules. For example, the output 134 may be provided to the display device 130 of FIG. 1 to cause display of a GUI that indicates the identified molecules 112. Additionally or alternatively, the method 400 may also include providing the one or more ML models to a client device for pharmaceutical molecule identification by the client device. For example, configuration information (e.g., parameters, hyper-parameters, and the like) associated with the ML models 126 may be provided to the client device 162 of FIG. 1 to enable pharmaceutical molecule identification at the client device 162. Additionally or alternatively, generating the output may include transmitting an instruction to an automated or semi-automated system to cause the automated or semi-automated system to initiate development of samples of the one or more molecules. For example, the output 134 of FIG. 1 may include one or more instructions that are provided to the drug production system 164 to cause performance of one or more operations by the drug production system 164 to develop samples of the identified molecules 112.

In some implementations, the method 400 may further include generating additional training data based on one or more molecules identified by the one or more ML models, testing data associated with the one or more molecules, or a combination thereof, and training the one or more ML models based on the additional training data. For example, the additional training data may include or correspond to the additional training data 116 of FIG. 1. Additionally or alternatively, the pharmaceutical data may include SMILES-formatted data, and the NLP may be performed on the SMILES-formatted data to generate the training data. For example, the data processing and transformation engine 122 or the training engine 124 may perform NLP on at least a portion of the pharmaceutical data 132 of FIG. 1 to generate the training data 110. Additionally or alternatively, a first subset of the pharmaceutical data may be associated with previously-identified pharmaceutical molecules having one or more particular properties, a second subset of the pharmaceutical data may be associated with previously-identified pharmaceutical molecules that do not have the one or more particular properties, and the one or more ML models may be trained to identify the additional pharmaceutical molecules having the one or more particular properties. For example, a first portion of the pharmaceutical data 132 may be associated with previously-identified molecules that have the selected properties 114, a second portion of the pharmaceutical data 132 may be associated with previously-identified molecules that do not have the selected properties 114, and the ML models 126 may be trained to conditionally identify the identified molecules 112 such that the identified molecules 112 have (or are predicted to have) the selected properties 114.

In some implementations, the one or more databases may include the ZINC database, the chEMBL database, the PubChem database, or a combination thereof. For example, the databases 150 may include one or more publicly available molecular information databases, such as the ZINC database 206, the chEMBL database 209, the PubChem database, or a combination thereof. Additionally or alternatively, the method 400 may also include performing pre-processing on the pharmaceutical data prior to performing the NLP, performing dimensionality reduction on the pharmaceutical data prior to performing the NLP, or a combination thereof. For example, the data processing and transformation engine 122 of FIG. 1 may perform pre-processing, such as formatting, outlier removal, missing entry replacement, dimensionality reduction, other pre-processing, or a combination thereof.

In some implementations, the one or more ML models may include one or more GANs, one or more VAEs, or a combination thereof. For example, the ML models 126 may include or correspond to GANs, VAEs, or both, such as the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, and the multi-objective VAEs 249 described with reference to FIG. 2. Additionally or alternatively, obtaining the pharmaceutical data may include receiving a portion of the pharmaceutical data from the one or more databases, pulling a portion of the pharmaceutical data from the one or more databases, extracting a portion of the pharmaceutical data from information presented by the one or more databases, or a combination thereof. For example, the computing device 102 may obtain the pharmaceutical data 132 by querying and receiving at least a portion of the pharmaceutical data 132 (similar to the Python scripts 212 of FIG. 2), extracting at least a portion of the pharmaceutical data 132 from websites or other documents supported by the databases 150 (similar to the crawler 214 of FIG. 2), performing one or more pull operations to retrieve at least a portion of the pharmaceutical data 132 (similar to the manual pull logic 216 of FIG. 2), or a combination thereof.

In some implementations, the one or more databases may include one or more publicly-available databases, one or more proprietary databases, one or more third-party databases, or a combination thereof. For example, the databases 150 of FIG. 1 may include publicly available databases (e.g., the ZINC database, the chEMBL database, the PubChem database, and the like), proprietary databases (e.g., pharmaceutical information databases maintained and operated by an operator of the computing device 102 or the client device 162), third-party databases (e.g., databases maintained and operated by other drug companies, universities, government agencies, and the like), or a combination thereof. Additionally or alternatively, the one or more ML models may include a GAN, a VAE, and a multi-objective VAE. For example, the AI/ML engine 240 of FIG. 2 may train and support one or more GANs (e.g., the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, or a combination thereof), the VAEs 248, and the multi-objective VAEs 249.

In some implementations, the method 400 may further include initiating display of a GUI that indicates one or more molecules identified by the one or more ML models, providing the one or more ML models to a client device for pharmaceutical molecule identification by the client device, or a combination thereof. For example, the output 134 of FIG. 1 may be provided to the display device 130 to cause display of a GUI that indicates the identified molecules 112 at the display device 130, or configuration data associated with the trained ML models 126 may be provided to the client device 162 to enable configuration and use of ML models at the client device 162 for performing molecule identification (e.g., drug discovery). Additionally or alternatively, the method 400 may further include receiving a user input indicating one or more particular properties and training the one or more ML models to identify the additional pharmaceutical molecules having the one or more particular properties. For example, the computing device 102 of FIG. 1 may receive a user input (e.g., from a user device) indicating the selected properties 114 for use in training the ML models 126 such that the identified molecules 112 have (or are predicted to have) the selected properties 114.

It is noted that other types of devices and functionality may be provided according to aspects of the present disclosure and discussion of specific devices and functionality herein have been provided for purposes of illustration, rather than by way of limitation. It is noted that the operations of the method 300 of FIG. 3 and the method 400 of FIG. 4 may be performed in any order, or that operations of one method may be performed during performance of another method, such as the method 400 of FIG. 4 including one or more operations of the method 300 of FIG. 3. It is also noted that the method 300 of FIG. 3 and the method 400 of FIG. 4 may also include other functionality or operations consistent with the description of the operations of the system 100 of FIG. 1 and/or the system 200 of FIG. 2.

Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

The functional blocks and modules described herein (e.g., the functional blocks and modules in FIGS. 1-4) may comprise processors, electronics devices, hardware devices, electronics components, logical circuits, memories, software codes, firmware codes, etc., or any combination thereof. In addition, features discussed herein relating to FIGS. 1-4 may be implemented via specialized processor circuitry, via executable instructions, and/or combinations thereof.

As used herein, various terminology is for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, as used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). The term “coupled” is defined as connected, although not necessarily directly, and not necessarily mechanically; two items that are “coupled” may be unitary with each other. The terms “a” and “an” are defined as one or more unless this disclosure explicitly requires otherwise. The term “substantially” is defined as largely but not necessarily wholly what is specified—and includes what is specified; e.g., substantially 90 degrees includes 90 degrees and substantially parallel includes parallel—as understood by a person of ordinary skill in the art. In any disclosed aspect, the term “substantially” may be substituted with “within [a percentage] of” what is specified, where the percentage includes 0.1, 1, 5, and 10 percent; and the term “approximately” may be substituted with “within 10 percent of” what is specified. The phrase “and/or” means and or. To illustrate, A, B, and/or C includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C. In other words, “and/or” operates as an inclusive or. Additionally, the phrase “A, B, C, or a combination thereof” or “A, B, C, or any combination thereof” includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C.

The terms “comprise” and any form thereof such as “comprises” and “comprising,” “have” and any form thereof such as “has” and “having,” and “include” and any form thereof such as “includes” and “including” are open-ended linking verbs. As a result, an apparatus that “comprises,” “has,” or “includes” one or more elements possesses those one or more elements, but is not limited to possessing only those elements. Likewise, a method that “comprises,” “has,” or “includes” one or more steps possesses those one or more steps, but is not limited to possessing only those one or more steps.

Any implementation of any of the apparatuses, systems, and methods can consist of or consist essentially of—rather than comprise/include/have—any of the described steps, elements, and/or features. Thus, in any of the claims, the term “consisting of” or “consisting essentially of” can be substituted for any of the open-ended linking verbs recited above, in order to change the scope of a given claim from what it would otherwise be using the open-ended linking verb. Additionally, it will be understood that the term “wherein” may be used interchangeably with “where.”

Further, a device or system that is configured in a certain way is configured in at least that way, but it can also be configured in other ways than those specifically described. Aspects of one example may be applied to other examples, even though not described or illustrated, unless expressly prohibited by this disclosure or the nature of a particular example.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps (e.g., the logical blocks in FIGS. 1-4) described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. Skilled artisans will also readily recognize that the order or combination of components, methods, or interactions that are described herein are merely examples and that the components, methods, or interactions of the various aspects of the present disclosure may be combined or performed in ways other than those illustrated and described herein.

The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the disclosure herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CDROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary designs, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Computer-readable storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, a connection may be properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, or digital subscriber line (DSL), then the coaxial cable, fiber optic cable, twisted pair, or DSL are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), hard disk, solid state disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The above specification and examples provide a complete description of the structure and use of illustrative implementations. Although certain examples have been described above with a certain degree of particularity, or with reference to one or more individual examples, those skilled in the art could make numerous alterations to the disclosed implementations without departing from the scope of this disclosure. As such, the various illustrative implementations of the methods and systems are not intended to be limited to the particular forms disclosed. Rather, they include all modifications and alternatives falling within the scope of the claims, and examples other than the one shown may include some or all of the features of the depicted example. For example, elements may be omitted or combined as a unitary structure, and/or connections may be substituted. Further, where appropriate, aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples having comparable or different properties and/or functions, and addressing the same or different problems. Similarly, it will be understood that the benefits and advantages described above may relate to one aspect or may relate to several implementations.

The claims are not intended to include, and should not be interpreted to include, means-plus-function or step-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase(s) “means for” or “step for,” respectively.

Although the aspects of the present disclosure and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit of the disclosure as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular implementations of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

Claims

1. A method for pharmaceutical molecule identification using machine learning, the method comprising:

obtaining, by one or more processors, pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof;
performing, by the one or more processors, natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules;
training, by the one or more processors, one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and
generating, by the one or more processors, an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.

2. The method of claim 1, further comprising:

generating, by the one or more processors and after training the one or more ML models based on the training data, additional training data based on testing data that indicates properties of at least one of the additional pharmaceutical molecules identified by the one or more ML models; and
further training, by the one or more processors, the one or more ML models based on the additional training data.

3. The method of claim 1, further comprising initiating, by the one or more processors and based on the output, display of a graphical user interface (GUI) that indicates the one or more molecules.

4. The method of claim 1, further comprising providing, by the one or more processors, the one or more ML models to a client device for pharmaceutical molecule identification by the client device.

5. The method of claim 1, wherein:

generating the output comprises transmitting an instruction to an automated or semi-automated system; and
the instruction is executable by the automated or semi-automated system to cause formation of samples of the one or more molecules.

6. (canceled)

7. The method of claim 1, wherein:

the pharmaceutical data comprises simplified molecular-input line-entry system (SMILES)-formatted data, and
the NLP is performed on the SMILES-formatted data to generate the training data.

8. The method of claim 1, wherein:

a first subset of the pharmaceutical data is associated with previously-discovered pharmaceutical molecules having one or more particular properties,
a second subset of the pharmaceutical data is associated with previously-discovered pharmaceutical molecules that do not have the one or more particular properties, and
the one or more ML models are trained to identify the additional pharmaceutical molecules having the one or more particular properties.

9. The method of claim 1, wherein the one or more molecules identified by the one or more ML models comprise different combinations of elements than the previously-discovered pharmaceutical molecules, different molecular structures than the previously-discovered pharmaceutical molecules, or different combinations of elements and different molecular structures than the previously-discovered pharmaceutical molecules.

10. The method of claim 1, further comprising:

performing, by the one or more processors, pre-processing on the pharmaceutical data prior to performing the NLP;
performing, by the one or more processors, dimensionality reduction on the pharmaceutical data prior to performing the NLP; or
a combination thereof.

11. The method of claim 1, wherein the one or more ML models comprise one or more generative adversarial networks (GANs), one or more variational autoencoders (VAEs), or a combination thereof.

12. The method of claim 1, wherein:

obtaining the pharmaceutical data comprises receiving a portion of the pharmaceutical data from the one or more databases, pulling a portion of the pharmaceutical data from the one or more databases, extracting a portion of the pharmaceutical data from information presented by the one or more databases, or a combination thereof; and
the one or more databases comprise the ZINC database, the ChEMBL database, the PubChem database, or a combination thereof.

13. A system for pharmaceutical molecule identification using machine learning, the system comprising:

a memory; and
one or more processors communicatively coupled to the memory, the one or more processors configured to: obtain pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof; perform natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules; train one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and generate an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.

14. The system of claim 13, wherein the one or more ML models comprise one or more generative models configured to generate the additional pharmaceutical molecules, the additional pharmaceutical molecules comprising new examples of molecules that share common relationships with the previously-discovered pharmaceutical molecules.

15. The system of claim 13, further comprising one or more interfaces configured to enable communication with the one or more databases, a display device, a client device, a drug production system, or a combination thereof.

16. The system of claim 13, wherein the one or more databases comprise one or more publicly-available databases, one or more proprietary databases, one or more third-party databases, or a combination thereof.

17. The system of claim 13, wherein:

the one or more ML models comprise a generative adversarial network (GAN), a variational autoencoder (VAE), and a multi-objective VAE;
the GAN is configured to be trained using reinforcement learning to bias achievement of particular objectives; and
the multi-objective VAE is configured to be trained using multiple discriminators that are each associated with a respective loss function.

18. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for pharmaceutical molecule identification using machine learning, the operations comprising:

obtaining pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof;
performing natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules;
training one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and
generating an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.

19. The non-transitory computer-readable storage medium of claim 18, wherein the operations further comprise:

initiating display of a graphical user interface (GUI) that indicates one or more molecules identified by the one or more ML models;
providing the one or more ML models to a client device for pharmaceutical molecule identification by the client device; or
a combination thereof.

20. The non-transitory computer-readable storage medium of claim 18, wherein the operations further comprise:

receiving a user input indicating one or more particular properties; and
training the one or more ML models to identify the additional pharmaceutical molecules having the one or more particular properties.

21. The method of claim 5, wherein the instruction is executable by the automated or semi-automated system to initiate mixing of one or more chemicals, to activate a heater or cooler to change a state of a chemical, to cause retrieval of one or more chemicals from a storage location, or a combination thereof.
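As a purely illustrative, non-limiting sketch (not part of the claims or the disclosed implementations), the SMILES-to-vector conversion recited in the claims' NLP step might proceed along the following lines. All function names below are hypothetical and shown only to clarify how SMILES-formatted pharmaceutical data could be tokenized and mapped to fixed-length integer vectors suitable for training a generative ML model.

```python
def tokenize_smiles(smiles):
    """Split a SMILES string into tokens. Two-character organic-subset
    atoms ('Cl', 'Br') are kept whole; every other symbol (atoms, bonds,
    ring-closure digits, branch parentheses) is a single character."""
    tokens = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def vectorize(smiles_list, max_len=None):
    """Map each SMILES string to a fixed-length list of integer token ids.
    Id 0 is reserved for padding so that all vectors share one length."""
    tokenized = [tokenize_smiles(s) for s in smiles_list]
    # Build a deterministic vocabulary over every token observed.
    vocab = {tok: idx + 1
             for idx, tok in enumerate(sorted({t for seq in tokenized for t in seq}))}
    max_len = max_len or max(len(seq) for seq in tokenized)
    vectors = [[vocab[t] for t in seq] + [0] * (max_len - len(seq))
               for seq in tokenized]
    return vectors, vocab


# Example: aspirin and ethanol in SMILES form.
vectors, vocab = vectorize(["CC(=O)Oc1ccccc1C(=O)O", "CCO"])
```

The resulting padded integer vectors are one simple form the claimed "vectorized representations" could take; a production system might instead use one-hot encodings or learned embeddings as input to a GAN or VAE.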

Patent History
Publication number: 20220101972
Type: Application
Filed: Jan 21, 2021
Publication Date: Mar 31, 2022
Inventors: Dhruv Bajpai (Bangalore), Praveen Viswanathan (Jodhpur), Ashish Ambasta (Bangalore), Vighnesh Paramasivam (Coimbatore)
Application Number: 17/154,417
Classifications
International Classification: G16H 20/10 (20060101); G16H 70/40 (20060101); G06N 20/00 (20060101);