ARTIFICIAL INTELLIGENCE ENGINE FOR GENERATING CANDIDATE DRUGS

Info

Publication number: 20210249108
Type: Application
Filed: Jan 29, 2021
Publication Date: Aug 12, 2021
Applicant: PEPTILOGICS, INC. (Pittsburgh, PA)
Inventors: Francis LEE (Cambridge, MA), Jonathan D. STECKBECK (Cranberry Township, PA), Hannes HOLSTE (Los Angeles, CA)
Application Number: 17/162,479

Abstract

An artificial intelligence engine for generating drug compounds is disclosed. In one embodiment, a method may include generating a biological context representation of a set of drug compounds. The biological context representation includes a first data structure having a first format. The method may also include translating, by the artificial intelligence engine, the first data structure having the first format to a second data structure having a second format. The method may also include generating, based on the second data structure having the second format, a set of candidate drug compounds. The method may also include classifying a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application Ser. No. 62/975,470 filed Feb. 12, 2020 titled “Artificial Intelligence Engine for Generating Candidate Drugs.” The provisional application is incorporated by reference herein as if reproduced in full below.

TECHNICAL FIELD

This disclosure relates generally to drug discovery. More specifically, this disclosure relates to an artificial intelligence engine for generating candidate drugs.

BACKGROUND

Therapeutics may refer to a branch of medicine concerned with the treatment of disease and the action of remedial agents (e.g., drugs). Therapeutics includes, but is not limited to, the field of ethical pharmaceuticals. Entities in the therapeutics industry may discover, develop, produce, and market drugs for use as medications to be administered or self-administered to patients. Goals of administering or self-administering the drugs may include curing the patient of a disease, causing an active disease to enter a state of remission, vaccinating the patient by stimulating the immune system to better protect against the disease, and/or alleviating, mitigating or ameliorating a symptom. Existing drug discoveries may be based on any combination of human design, high-throughput screening, synthetic products and natural substances.

SUMMARY

In general, the present disclosure provides an artificial intelligence engine for generating candidate drugs.

In one aspect, a method may include generating a biological context representation of a set of drug compounds. The biological context representation includes a first data structure having a first format. The method may also include translating, by the artificial intelligence engine, the first data structure having the first format to a second data structure having a second format. The method may also include generating, based on the second data structure having the second format, a set of candidate drug compounds. The method may also include classifying a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound.

In another aspect, a method may include receiving, from an artificial intelligence engine, a candidate drug compound generated by the artificial intelligence engine. The method may also include generating a view comprising the candidate drug compound overlaid on a representation of a design space. The view may present a topographical heatmap of the representation of the design space, and the topographical heatmap may include the candidate drug compound overlaid on indicators ranging from an at least one least active property to an at least one most active property. The method may also include presenting the view on a display screen of a computing device.

In another aspect, a system may include a memory device storing instructions and a processing device communicatively coupled to the memory device. The processing device may execute the instructions to perform one or more operations of any method disclosed herein.

In another aspect, a tangible, non-transitory computer-readable medium may store instructions and a processing device may execute the instructions to perform one or more operations of any method disclosed herein.

Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.

Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The term “couple” and its derivatives refer to any direct or indirect communication between two or more elements, independent of whether those elements are in physical contact with one another. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both communication with remote systems and communication within a system, including reading and writing to different portions of a memory device. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like. The term “translate” may refer to any operation performed wherein data is input in one format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation and data is output in a different format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation, wherein the data output has a similar or identical meaning, semantically or otherwise, to the data input. 
Translation as a process includes but is not limited to substitution (including macro substitution), encryption, hashing, encoding, decoding or other mathematical or other operations performed on the input data. The same means of translation performed on the same input data will consistently yield the same output data, while a different means of translation performed on the same input data may yield different output data which nevertheless preserves all or part of the meaning or function of the input data, for a given purpose. Notwithstanding the foregoing, in a mathematically degenerate case, a translation can output data identical to the input data. The term “controller” means any device, system or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.

Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable storage medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable storage medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), solid state drive (SSD), or any other type of memory. A “non-transitory” computer readable storage medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable storage medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.

The terms “candidate drugs” and “candidate drug compounds” may be used interchangeably herein.

Definitions for other certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of this disclosure and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a high-level component diagram of an illustrative system architecture according to certain embodiments of this disclosure;

FIG. 2 illustrates a data structure storing a biological context representation according to certain embodiments of this disclosure;

FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure;

FIG. 4 illustrates example operations of a method for generating and classifying a candidate drug compound according to certain embodiments of this disclosure;

FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation of a plurality of drug compounds according to certain embodiments of this disclosure;

FIG. 6 illustrates example operations of a method for translating the first data structure of FIGS. 5A-5D into a second data structure having a second format according to certain embodiments of this disclosure;

FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5D into the second data structure having the second format according to certain embodiments of this disclosure;

FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure;

FIG. 9 illustrates example operations of a method for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure;

FIG. 10A illustrates example operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure;

FIG. 10B illustrates another example of operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure; and

FIG. 11 illustrates an example computer system according to certain embodiments of this disclosure.

DETAILED DESCRIPTION

Conventional drug discoveries based on human design, high-throughput screening, and/or natural substances may be inefficient, riven with noise, limited in application, not efficacious, dangerous or poisonous, and/or not defensible. Further, there are instances of certain diseases (e.g., instances of prosthetic joint infections) that do not have a corresponding existing therapeutic to treat the certain diseases or which provide temporary results against which the disease is refractory. One reason for the lack of an existing therapeutic may be that conventional drug discovery techniques are incapable of discovering the therapeutic needed to treat the certain diseases. By “treat,” we mean that the disease at hand is cured inter alia, that it is not refractory to treatment. The amount of knowledge, data, assumptions, and queries that is used to discover a therapeutic to treat the certain disease may be unattainable, overwhelming, and/or inefficient, such that conventional drug discovery techniques cannot overcome these obstacles. Improvement is desired in the field of therapeutics.

Accordingly, aspects of the present disclosure generally relate to an artificial intelligence engine for generating candidate drugs. The artificial intelligence (AI) engine may use a combination of rational algorithmic discovery and machine learning models (e.g., generative deep learning methods) to produce enhanced therapeutics that may treat any suitable target disease and/or medical condition. The AI engine may discover, translate, design, generate, create, develop, formulate, classify, and/or test candidate drug compounds that exhibit desired activity (e.g., antimicrobial, immunomodulatory, cytotoxic, neuromodulatory, etc.) in design spaces for target diseases and/or medical conditions. Such candidate drug compounds that exhibit desired activity in a design space may effectively treat the disease and/or medical condition associated with that design space. In some embodiments, a selected candidate drug compound that effectively treats the disease and/or medical condition may be formulated into an actual drug for administration and may be tested in a lab and/or at a clinical stage.

In general, the disclosed embodiments may enable rationally discovering drug compounds for a larger design space at a larger scale, higher accuracy, and/or higher efficiency than conventional techniques. The AI engine may use various machine learning models to discover, translate, design, generate, create, develop, formulate, classify, and/or test candidate drug compounds. Each of the various machine learning models may perform certain specific operations. The types of machine learning models may include various neural networks that perform deep learning, computational biology, and/or algorithmic discovery. Examples of such neural networks may include generative adversarial networks, recurrent neural networks, convolutional neural networks, fully connected neural networks, etc. as described further below; and such networks may also additionally employ methods of or incorporating causal inference, including counterfactuals, in the process of discovery.

In some embodiments, a biological context representation of a set of drug compounds may be generated. The biological context representation may be a continuous representation of a biological setting that is updated as knowledge is acquired and/or data is updated. The biological context representation may be stored in a first data structure having a format (e.g., a knowledge graph) that includes both various nodes pertaining to health artifacts and various relationships connecting the nodes. The nodes and relationships may form logical structures having subjects and predicates. For example, one logical structure between two nodes having a relationship may be “Genes are associated with Diseases” where “Genes” and “Diseases” are the subjects of the logical structure and “are associated with” is the relationship. In such a way, the knowledge graph may encompass actual knowledge, rather than simply statistical inferences, pertaining to a biological setting.
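By way of non-limiting illustration only, the subject-relationship structure described above may be sketched as a small store of triples. The class name `KnowledgeGraph`, its methods, and the second triple ("Compounds treat Diseases") are hypothetical and are introduced here solely for illustration; the disclosure does not prescribe any particular storage layout.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store holding (subject, relationship, object) statements."""

    def __init__(self):
        # Maps each subject node to its outgoing (relationship, object) edges.
        self._out = defaultdict(list)

    def add(self, subject, relationship, obj):
        """Record one logical structure, e.g. Genes -are associated with-> Diseases."""
        self._out[subject].append((relationship, obj))

    def objects(self, subject, relationship):
        """Return every node reached from `subject` through `relationship`."""
        return [o for r, o in self._out.get(subject, []) if r == relationship]

    def triples(self):
        """Return all stored statements as (subject, relationship, object) tuples."""
        return [(s, r, o) for s, edges in self._out.items() for r, o in edges]


kg = KnowledgeGraph()
kg.add("Genes", "are associated with", "Diseases")  # example from the text
kg.add("Compounds", "treat", "Diseases")            # hypothetical extra statement
print(kg.objects("Genes", "are associated with"))
```

Because each statement keeps its relationship label, a query can ask how two nodes are related rather than merely whether they co-occur.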

The information in the knowledge graph may be continuously or periodically updated and the information may be received from various sources curated by the AI engine. The knowledge in the biological context representation goes well beyond “dumb” data that just includes quantities of a value because the knowledge represents the relationships between or among numerous different types of data, as well as any or all of direct, indirect, causal, counterfactual or inferred relationships. In some embodiments, the biological context representation may not be stored, and instead, based on the stream of knowledge included in the biological context representation, may be streamed from data sources into the AI engine that generates the machine learning models.

The biological context representation may be used to generate candidate drug compounds by translating the first data structure having the first format to a second data structure having a second format (e.g., a vector). The second format may be more computationally efficient and/or suitable for generating candidate drug compounds that include sequences of ingredients that provide desired activity in a design space. “Ingredients” as used herein may refer, without limitation, to substances, compounds, elements, activities (such as the application or removal of electrical charge or a magnetic field for a specific maximum, minimum or discrete amount of time), and mixtures. Further, the second format may enable generating views of the levels of activity provided by the sequence of ingredients in a certain design space, as described further below.
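By way of non-limiting illustration only, one possible translation into a vector format is a one-hot encoding of a sequence of ingredients. The sketch below assumes that the ingredients are the twenty standard amino acids written in one-letter codes and that the sequence is read from a hypothetical node record `graph_node`; neither assumption is taken from the disclosure, and other encodings (e.g., learned embeddings) may equally be used.

```python
# Assumed ingredient set: the 20 standard amino acids, one-letter codes.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def translate_to_vector(sequence, max_len):
    """One-hot encode `sequence` into a flat list of max_len * 20 floats.

    Positions beyond the end of the sequence are left as zeros (padding).
    """
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    vector = [0.0] * (max_len * len(ALPHABET))
    for position, aa in enumerate(sequence):
        vector[position * len(ALPHABET) + INDEX[aa]] = 1.0
    return vector


# Hypothetical node taken from a graph-format representation.
graph_node = {"id": "compound-1", "sequence": "KLAK"}
vector = translate_to_vector(graph_node["sequence"], max_len=6)
print(len(vector), sum(vector))
```

The same input always yields the same vector, and the fixed length allows sequences of different lengths to be supplied to a single model.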

At a high level, the AI engine may include at least one machine learning model that is trained to use causal inference to generate candidate drug compounds. One of the challenges with discovering new therapeutics may include determining whether certain ingredients are causal agents with respect to certain activity in a design space. The sheer number of possible sequences of ingredients may be extraordinarily large due to mathematical combinatorics, such that identifying a cause-and-effect relationship between ingredients and activity may be impossible or, at best, extremely unlikely without the disclosed embodiments. (For example, in public-key encryption, it is theoretically possible to discover and unlock a private key, but doing this would presently require all the computing power in the world to work longer than the age of the universe: this is an example of what is mathematically possible, but impossible within human time frames and computing power. Identifying a cause-and-effect relationship between ingredients and activity, while a different problem, may be similarly mathematically possible, but impossible within human time frames and computing power.) Based on advances in computing hardware (e.g., graphics processing unit processing cores) and the AI techniques using causal inference described herein, the disclosed embodiments may enable the efficient solving of the task of generating candidate drug compounds at scale.
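By way of non-limiting illustration only, the combinatorial growth referred to above may be made concrete with a short calculation. The sketch assumes an ingredient set of twenty members (e.g., the standard amino acids) and ordered sequences of a fixed length; both assumptions are illustrative and are not taken from the disclosure.

```python
def count_sequences(alphabet_size, length):
    """Number of distinct ordered sequences of `length` ingredients."""
    return alphabet_size ** length


# Assumed alphabet of 20 ingredients; the lengths are arbitrary examples.
for length in (5, 10, 20, 30):
    print(f"length {length}: {count_sequences(20, length):.3e} sequences")
```

Under these assumptions a sequence of only ten ingredients already admits about 1.0e13 possibilities, which suggests why exhaustive testing of every sequence may be impractical.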

Causal inference may refer to a process, based on conditions of an occurrence of an effect, of drawing a conclusion about a causal connection. Causal inference may analyze a response of an effect variable when a cause is changed. Causation may be defined thusly: a variable X is a cause of Y if Y “listens” to X and determines its response based on what it “hears.” The process of causal inference in the field of AI may be particularly beneficial for generating and testing candidate drug compounds for certain diseases and/or medical conditions because of the use of what are termed counterfactuals. A counterfactual posits and examines conditions contrary to what has actually occurred in reality. For example, if someone takes aspirin for a headache, the headache may go away. The counterfactual asks what would have happened if the person had not taken aspirin. Would the headache still have gone away or would it have remained or even gotten worse? Accordingly, counterfactuals may refer to calculating alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof. A counterfactual may enable determining whether a response should stay the same or instead change if something in a sequence does not occur. For example, one counterfactual may include asking: “Would a certain level of activity be the same if a certain ingredient is not included in a sequence of a candidate drug compound?”
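By way of non-limiting illustration only, the counterfactual question quoted above may be sketched as an ablation: each ingredient is omitted in turn and the resulting activity is compared with the activity of the full sequence. The scoring function `toy_activity` is a hypothetical stand-in that merely counts the residues K and R; it is not the model of the disclosure, and a difference computed in this way is only as meaningful as the scoring function supplied.

```python
def toy_activity(sequence):
    """Hypothetical stand-in for an activity model: counts K and R residues."""
    return sum(1.0 for aa in sequence if aa in "KR")


def counterfactual_effects(sequence, activity_fn):
    """For each position, report factual activity minus activity without it.

    Returns a list of (position, ingredient, effect) tuples. An effect of 0.0
    means the activity would have been the same had the ingredient been omitted.
    """
    factual = activity_fn(sequence)
    effects = []
    for i, aa in enumerate(sequence):
        without = sequence[:i] + sequence[i + 1:]  # the counterfactual sequence
        effects.append((i, aa, factual - activity_fn(without)))
    return effects


effects = counterfactual_effects("KLAR", toy_activity)
for position, ingredient, effect in effects:
    print(position, ingredient, effect)
```

In this toy setting, ingredients whose removal leaves the activity unchanged are candidates for omission, which corresponds to the pruning of candidate drug compounds discussed in this disclosure.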

By simulating numerous alternative scenarios to further optimize and hone the accuracy of a sequence of ingredients in the candidate drug compounds, such techniques may enable reducing the number of viable candidate drug compounds. As a result, the embodiments may provide technical benefits, such as reducing resources consumed (e.g., processing, memory, network bandwidth) by reducing a number of candidate drug compounds that may be considered for classification as a selected candidate drug compound by another machine learning model.

In some embodiments, one application for the AI engine to design, discover, develop, formulate, create, and/or test candidate drug compounds may pertain to peptide therapeutics. A peptide may refer to a compound consisting of two or more amino acids linked in a chain. Example peptides may include dipeptides, tripeptides, tetrapeptides, etc. A polypeptide may refer to a long, continuous, and unbranched peptide chain. Peptides may be simple to manufacture at discovery scale, include drug-like characteristics of small molecules, include safety and high specificity of biologics, and/or provide greater administration flexibility than some other biologics.

The disclosed techniques provide numerous benefits over conventional techniques for designing, developing, and/or testing candidate drug compounds. For example, the AI engine may efficiently use a biological context representation of a set of drug compounds and one or more machine learning models to generate a set of candidate drug compounds and classify one of the set of candidate drug compounds as a selected candidate drug compound. Some embodiments may use causal inference to remove a number of potential candidate drug compounds from classification, thereby reducing the computational complexity and processing burden of classifying a selected candidate drug compound.

Further, additional benefits of the embodiments disclosed herein may include using the AI engine to produce algorithmically designed drug compounds that have been validated in vivo and in vitro and that provide (i) a broad-spectrum activity against greater than 900 multi-drug resistant bacteria, (ii) at least a 2-to-10 times improvement in exposure time required to generate a drug resistance profile, (iii) effectiveness across four key animal infection models (both Gram-positive and Gram-negative bacteria), and/or (iv) effectiveness against biofilms.

It should be noted that the embodiments disclosed herein may apply not only to the anti-infective market (e.g., for prosthetic joint infections, urinary tract infections, intra-abdominal or peritoneal infections, otitis media, cardiac infections, respiratory infections including but not limited to sequelae from diseases such as cystic fibrosis, neurological infections (e.g., meningitis), dental infections (including periodontal), other organ infections, digestive and intestinal infections (e.g., C. difficile), other physiological system infections, wound and soft tissue infections (e.g., cellulitis), etc.), but also to numerous other suitable markets and/or industries. For example, the embodiments may be used in the animal health/veterinary industry, for example, to treat certain animal diseases (e.g., bovine mastitis). Also, the embodiments may be used for industrial applications, such as anti-biofouling, and/or generating optimized control action sequences for machinery. The embodiments may also benefit a market for new therapeutic indications, such as eczema, inflammatory bowel disease, Crohn's Disease, rheumatoid arthritis, asthma, auto-immune diseases and disease processes in general, inflammatory disease progressions or processes, and/or oncology treatments and palliatives. The video game industry may also benefit from the disclosed techniques to improve the AI used for generating sequences of decisions that non-player controlled (NPC) characters make during gameplay. The integrated circuit/chip industry may also benefit from the disclosed techniques to improve the mask works generation and routing processes used for generating the most efficient, highest performance, lowest power, lowest heat generating systems on a chip or solid state devices. Accordingly, it should be understood that the disclosed embodiments may benefit any market and/or industry that is associated with a sequence (e.g., items, objects, decisions, actions, ingredients, etc.) that can be optimized.

FIGS. 1 through 11, discussed below, and the various embodiments used to describe the principles of this disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure.

FIG. 1 illustrates a high-level component diagram of an illustrative system architecture 100 according to certain embodiments of this disclosure. In some embodiments, the system architecture 100 may include a computing device 102 communicatively coupled to a cloud-based computing system 116. Each of the computing device 102 and components included in the cloud-based computing system 116 may include one or more processing devices, memory devices, and/or network interface cards. The network interface cards may enable communication via a wireless protocol for transmitting data over short distances, such as Bluetooth, ZigBee, NFC, etc. Additionally, the network interface cards may enable communicating data over long distances, and in one example, the computing device 102 and the cloud-based computing system 116 may communicate with a network 112. Network 112 may be a public network (e.g., connected to the Internet via wired (Ethernet) or wireless (WiFi)), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof. Network 112 may also comprise a node or nodes on the Internet of Things (IoT).

The computing device 102 may be any suitable computing device, such as a laptop, tablet, smartphone, or computer. The computing device 102 may include a display that is capable of presenting a user interface of an application 118. The application 118 may be implemented in computer instructions stored on the one or more memory devices of the computing device 102 and executable by the one or more processing devices of the computing device 102. The application 118 may present various screens to a user that present various views (e.g., topographical heatmaps) including measures, gradients, or levels of certain types of activity and optimized sequences of selected candidate drug compounds, information pertaining to the selected candidate drug compounds and/or other candidate drug compounds, options to modify the sequence of ingredients in the selected candidate drug compound, and so forth, as described in more detail below. The computing device 102 may also include instructions stored on the one or more memory devices that, when executed by the one or more processing devices of the computing device 102, perform operations of any of the methods described herein.

In some embodiments, the cloud-based computing system 116 may include one or more servers 128 that form a distributed computing architecture. The servers 128 may be a rackmount server, a router computer, a personal computer, a portable digital assistant, a mobile phone, a laptop computer, a tablet computer, a camera, a video camera, a netbook, a desktop computer, a media center, any other device capable of functioning as a server, or any combination of the above. Each of the servers 128 may include one or more processing devices, memory devices, data storage, and/or network interface cards. The servers 128 may be in communication with one another via any suitable communication protocol. The servers 128 may execute an artificial intelligence (AI) engine 140 that uses one or more machine learning models 132 to perform at least one of the embodiments disclosed herein. The cloud-based computing system 116 may also include a database 150 that stores data, knowledge, and data structures used to perform various embodiments. For example, the database 150 may store a knowledge graph containing the biological context representation described further below. Further, the database 150 may store generated candidate drug compounds, selected candidate drug compounds, and information pertaining to the selected candidate drug compounds (e.g., activity for certain types of ingredients, sequences of ingredients, test results, correlations, etc.). Although depicted separately from the server 128, in some embodiments, the database 150 may be hosted on one or more of the servers 128.

In some embodiments, the cloud-based computing system 116 may include a training engine 130 capable of generating the one or more machine learning models 132. The machine learning models 132 may be trained to discover, translate, design, generate, create, develop, classify, and/or test candidate drug compounds, among other things. The one or more machine learning models 132 may be generated by the training engine 130 and may be implemented in computer instructions executable by one or more processing devices of the training engine 130 and/or the servers 128.

The training engine 130 may be a rackmount server, a router computer, a personal computer, a portable digital assistant, a smartphone, a laptop computer, a tablet computer, a netbook, a desktop computer, any other desired computing device, or any combination of the above. The training engine 130 may be cloud-based, be a real-time software platform, include privacy software or protocols, and/or include security software or protocols.

To generate the one or more machine learning models 132, the training engine 130 may train the one or more machine learning models 132. The training engine 130 may use a base data set of biological context representation (e.g., physical properties data, peptide activity data, microbe data, antimicrobial data, anti-neurodegenerative compound data, pro-neuroplasticity compound data, clinical outcome data, etc.) for a set of drug compounds. For example, the biological context representation may include sequences of ingredients for the drug compounds. The results may include information indicating levels of certain types of activity associated with certain design spaces. In one embodiment, the results may include causal inference information pertaining to whether certain ingredients in the drug compounds are correlated with or determined by certain effects (e.g., activity levels) in the design space.

The one or more machine learning models 132 may refer to model artifacts created by the training engine 130 using training data that includes training inputs and corresponding target outputs. The training engine 130 may find patterns in the training data wherein such patterns map the training input to the target output, and generate the machine learning models 132 that capture these patterns. Although depicted separately from the server 128, in some embodiments, the training engine 130 may reside on server 128. Further, in some embodiments, the artificial intelligence engine 140, the database 150, and/or the training engine 130 may reside on the computing device 102.

As described in more detail below, the one or more machine learning models 132 may comprise, e.g., a single level of linear or non-linear operations (e.g., a support vector machine [SVM]) or the machine learning models 132 may be a deep network, i.e., a machine learning model comprising multiple levels of non-linear operations. Examples of deep networks are neural networks including generative adversarial networks, convolutional neural networks, recurrent neural networks with one or more hidden layers, and fully connected neural networks (e.g., each neuron may transmit its output signal to the input of the remaining neurons, as well as to itself). For example, the machine learning model may include numerous layers and/or hidden layers that perform calculations (e.g., dot products) using various neurons. In some embodiments, one or more of the machine learning models 132 may be trained to use causal inference and counterfactuals.

For example, the machine learning model 132 trained to use causal inference may accept one or more inputs, such as (i) assumptions, (ii) queries, and (iii) data. The machine learning model 132 may be trained to output one or more outputs, such as (i) a decision as to whether a query may be answered, (ii) an objective function (also referred to as an estimand) that provides an answer to the query for any received data, and (iii) an estimated answer to the query and an estimated uncertainty of the answer, where the estimated answer is based on the data and the objective function, and the estimated uncertainty reflects the quality of data (i.e., a measure which takes into account the degree and/or salience of incorrect data and/or missing data). The assumptions may also be referred to as constraints and may be simplified into statements used in the machine learning model 132. The queries may refer to scientific questions for which the answers are desired.

The answers estimated using causal inference by the machine learning model may include optimized sequences of ingredients in selected candidate drug compounds. As the machine learning model estimates answers (e.g., candidate drug compounds), certain causal diagrams may be generated, as well as logical statements, and patterns may be detected. For example, one pattern may indicate that “there is no path connecting ingredient D and activity P,” which may translate to a statistical statement “D and P are independent.” If alternative calculations using counterfactuals contradict or do not support that statistical statement, then the machine learning model 132 and/or the biological context representation may be updated. For example, another machine learning model 132 may be used to compute a degree of fitness which represents a degree to which the data is compatible with the assumptions used by the machine learning model that uses causal inference. There are certain techniques that may be employed by the other machine learning model 132 to reduce the uncertainty and increase the degree of compatibility. The techniques may include those for maximum likelihood, propensity scores, confidence indicators, and/or significance tests, among others.
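As one illustration of the pattern described above, the following sketch checks whether any path connects two nodes of a causal diagram. The diagram contents and the function name has_path are hypothetical. The check ignores edge direction: if no path of any kind links D and P, the diagram implies that D and P are independent. When a path does exist, a full d-separation test (not shown) would be needed to decide independence.

```python
from collections import deque

def has_path(edges, source, target):
    """Return True if any path links source and target, ignoring edge direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical causal diagram: ingredients A and B influence activity P;
# ingredient D only influences an unrelated activity Q.
diagram = [("A", "P"), ("B", "P"), ("A", "B"), ("D", "Q")]

d_p_connected = has_path(diagram, "D", "P")  # no path: "D and P are independent"
a_p_connected = has_path(diagram, "A", "P")  # a path exists
```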

Using causal inference, a generative adversarial network (GAN) may be used to generate a set of candidate drug compounds. A GAN refers to a class of deep learning algorithms including two neural networks, a generator and a discriminator, that both compete with one another to achieve a goal. For example, regarding candidate drug compound generation, the generator goal may include generating candidate drug compounds, including compatible/incompatible sequences of ingredients, and effective/ineffective sequences of ingredients, etc. that the discriminator classifies as feasible candidate drug compounds, including compatible and effective sequences of ingredients that may produce desired activity levels for a design space. The generator may use causal inference, including counterfactuals, to calculate numerous alternative scenarios that indicate whether a certain result (e.g., activity level) still follows when any element or aspect of a sequence changes. The discriminator goal may include distinguishing candidate drug compounds which include undesirable sequences of ingredients from candidate drug compounds which include desirable sequences of ingredients.

In some embodiments, the generator initially generates candidate drug compounds and continues to generate better candidate drug compounds after each iteration until the generator eventually begins to generate candidate drug compounds that are valid drug compounds which produce certain levels of activity within a design space. A candidate drug compound may be “valid” when it produces a certain level of effectiveness (e.g., above a threshold activity level as determined by a standard (e.g., regulatory entity)) in a design space. In order to classify the candidate drug compounds as a valid drug compound or invalid candidate drug compound, the discriminator may receive real drug compound information from a dataset and the candidate drug compounds generated by the generator. The generator obtains the results from the discriminator and applies the results in order to generate better (e.g., valid) candidate drug compounds.

General details regarding the GAN are now discussed. The two neural networks, the generator and the discriminator, may be trained simultaneously. The discriminator may receive an input and then output a scalar indicating whether a candidate drug compound is an actual and/or viable drug compound. In some embodiments, the discriminator may resemble an energy function that outputs a low value (e.g., close to 0) when the input is a valid drug compound and a higher positive value when the input is not a valid drug compound (e.g., if it includes an incorrect sequence of ingredients for certain activity levels pertaining to a design space).

There are two functions that may be used: the generator function, G(V), and the discriminator function, D(Y). In the generator function G(V), V is generally a vector randomly sampled from a standard distribution (e.g., Gaussian). The vector may have any suitable dimension and may be referred to as an embedding herein. The role of the generator is to produce candidate drug compounds so as to train the discriminator function D(Y) to output values indicating that a candidate drug compound is valid (e.g., a low value).

During training, the discriminator is presented with a valid drug compound and adjusts its parameters (e.g., weights and biases) to output a value indicative of the validity of the candidate drug compounds that produce real activity levels in certain design spaces. Next, the discriminator may receive a modified candidate drug compound (e.g., modified using counterfactuals) generated by the generator and adjust its parameters to output a value indicative of whether the modified candidate drug compound provides the same or a different activity level in the design space.

The discriminator may use a gradient of an objective function to increase the value of the output. The discriminator may be trained as an unsupervised “density estimator,” i.e., a contrast function produces a low value for desired data (e.g., candidate drug compounds that include sequences producing desired levels of certain types of activity in a design space) and higher output for undesired data (e.g., candidate drug compounds that include sequences producing undesirable levels of certain types of activity in a design space). The generator may receive the gradient of the discriminator with respect to each modified candidate drug compound it produces. The generator uses the gradient to train itself to produce modified candidate drug compounds that the discriminator determines include sequences producing desired levels of certain types of activity in a design space.
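The gradient exchange between the discriminator and the generator can be sketched with a deliberately small toy. This is not the disclosed GAN: the one-parameter quadratic energy, the one-parameter generator G(v) = v + b, the learning rate, and the synthetic "real" data are all assumptions. The toy also omits the contrastive term that would raise the output for generated samples; the quadratic form produces higher values away from its center by construction.

```python
import random

random.seed(0)

# Hypothetical "real" data: activity scores of valid compounds, centered near 3.0.
real = [random.gauss(3.0, 0.1) for _ in range(200)]

c = 0.0    # discriminator parameter: center of the low-energy region
b = -2.0   # generator parameter in G(v) = v + b
lr = 0.05  # learning rate shared by both updates

def energy(y, c):
    """Discriminator output: near 0 for inputs that look valid, larger otherwise."""
    return (y - c) ** 2

def generate(v, b):
    """Generator: shifts a random input v by the learned offset b."""
    return v + b

for _ in range(500):
    y_real = random.sample(real, 32)
    v_batch = [random.gauss(0.0, 0.1) for _ in range(32)]
    # Discriminator step: lower the energy assigned to real samples.
    grad_c = sum(-2.0 * (y - c) for y in y_real) / len(y_real)
    c -= lr * grad_c
    # Generator step: use the discriminator's gradient with respect to each
    # generated sample, d(energy)/dy = 2 * (y - c), to move toward low energy.
    y_fake = [generate(v, b) for v in v_batch]
    grad_b = sum(2.0 * (y - c) for y in y_fake) / len(y_fake)
    b -= lr * grad_b
```

After training in this toy setting, samples produced by the generator receive a low discriminator output, which mirrors the feedback loop described above.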

Recurrent neural networks include the functionality, in the context of a hidden layer, to process information sequences and store information about previous computations. As such, recurrent neural networks may have or exhibit a “memory.” Recurrent neural networks may include connections between nodes that form a directed graph along a temporal sequence. Keeping and analyzing information about previous states enables recurrent neural networks to process sequences of inputs to recognize patterns (e.g., sequences of ingredients and correlations with certain types of activity levels). Recurrent neural networks may be similar to Markov chains. For example, Markov chains may refer to stochastic models describing sequences of possible events in which the probability of any given event depends only on the state information contained in the previous event. Thus, Markov chains also use an internal memory to store at least the state of the previous event. These models may be useful in determining causal inference, such as whether an event at a current node changes as a result of the state of a previous node changing.
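The Markov property described above (the next event depends only on the previous state) can be sketched by estimating transition probabilities from adjacent pairs in observed sequences. The one-letter ingredient codes and the three sequences are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next | current) from observed sequences by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

# Hypothetical ingredient sequences written as strings of one-letter codes.
sequences = ["KLAK", "KLLK", "KAAK"]
probs = transition_probabilities(sequences)
# probs["K"] holds the estimated distribution over the ingredient that follows "K".
```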

The set of candidate drug compounds generated may be input into another machine learning model 132 trained to classify a candidate drug compound of the set of candidate drug compounds as a selected candidate drug compound. The classifier may be trained to rank the set of candidate drug compounds using any suitable ranking technique (e.g., a non-parametric technique). For example, in some embodiments, one or more clustering techniques may be used to cluster the set of candidate drug compounds. To classify the selected candidate drug compound, the machine learning model 132 may also perform objective optimization techniques while clustering. To classify the selected candidate drug compound having desired levels of certain types of activity, the objective optimization may include using a minimization and/or maximization function for each candidate drug compound in the clusters.

A cluster may refer to a group of data objects similar to one another within the same cluster, but dissimilar to the objects in the other clusters. Cluster analysis may be used to classify the data into relative groups (clusters). One example of clustering may include K-means clustering where “K” defines the number of clusters. Performing K-means clustering may comprise specifying the number of clusters, specifying the cluster seeds, assigning each point to a centroid, and adjusting the centroid.
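The K-means steps listed above (specifying the number of clusters, specifying the seeds, assigning each point to a centroid, and adjusting the centroid) can be sketched as follows. The two-dimensional points, the seeds, and the fixed iteration count are assumptions for illustration only.

```python
def kmeans(points, seeds, iterations=10):
    """Plain K-means: K is len(seeds); points and seeds are tuples of floats."""
    centroids = list(seeds)
    clusters = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Adjust each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, clusters

# Hypothetical embeddings of six candidates forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
```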

Additional clustering techniques may include hierarchical clustering and density-based spatial clustering. Hierarchical clustering may be used to identify the groups in the set of candidate drug compounds where there is no set number of clusters to be generated. As a result, a tree-based representation of the objects in the various groups may be generated. Density-based spatial clustering may be used to identify clusters of any shape in a dataset having noise and outliers. This form of clustering also does not require specifying the number of clusters to be generated.
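A minimal sketch of hierarchical clustering follows, assuming single-linkage agglomeration (one of several possible linkage rules, chosen here only for brevity). The merge history it returns is one way to encode the tree-based representation; the function name and the one-dimensional points are hypothetical.

```python
def single_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def dist(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(
            sum((x - y) ** 2 for x, y in zip(points[i], points[j])) ** 0.5
            for i in a
            for j in b
        )

    while len(clusters) > 1:
        pairs = [
            (dist(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        ]
        d, a, b = min(pairs)
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, d))
    return merges

# Hypothetical one-dimensional embeddings; no cluster count is specified up front.
points = [(0.0,), (1.0,), (5.0,), (5.5,)]
merges = single_linkage(points)
```

Cutting the merge history at a chosen distance yields a flat grouping without fixing the number of clusters in advance.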

FIG. 2 illustrates a data structure storing a biological context representation 200 according to certain embodiments of this disclosure. Biology is context-dependent and dynamic. For example, the same molecule can manifest multiple, potentially competing, phenotypes. Further, data on an existing drug labeled as antimicrobial can suggest a null behavior in applications against different microbes or even against the same microbes but in different contexts (e.g., differing in temperature, pressure, environment, or comorbidity). To accurately predict candidate drug compounds that provide desirable activity levels in design spaces, the machine learning models 132 are trained to handle evolving knowledge maps of biology and drug compounds. Further, conventional techniques for discovering and generating drug compounds may be ineffective for biological data because such data is non-Euclidean. For example, machine learning models used for computer vision, image classification, and language models compute on Euclidean data and therefore cannot be applied to make useful inferences about non-Euclidean data in biology.

In some embodiments, the biological context representation 200 generated by the disclosed techniques may be used to graphically model the continually or continuously modifying biological and drug compound knowledge. That is, the biology may be represented as graphs within a comprehensive knowledge graph (e.g., biological context representation 200), where the graphs have complex relationships and interdependencies between nodes.

The biological context representation 200 may be stored in a first data structure having a first format. The first format may be a graph, an array, a linked list, or any suitable data format capable of storing the biological context representation. In particular, FIG. 2 illustrates various types of data received from various sources, including physical properties data 202, peptide activity data 204, microbe data 206, antimicrobial compound data 208, clinical outcome data 210, evidence-based guidelines 212, disease association data 214, pathway data 216, compound data 218, gene interaction data 220, anti-neurodegenerative compound data 222, and/or pro-neuroplasticity compound data 224.

These example data may be curated by the AI engine 140 and/or a person having a certain degree (e.g., a degree in data science, molecular biology, microbiology, etc.), certification, license (e.g., a licensed medical doctor, such as an M.D. or D.O.), and/or credential. Further, the data in the biological context representation 200 may be retrieved from any suitable data source (e.g., digital libraries, websites, databases, files, or the like). These examples are not meant to be limiting. Thus, the example types of data are also not meant to be limiting and other types of data may be stored within the biological context representation without departing from the scope of this disclosure. Further, the various data included in the biological context representation 200 may be linked based on one or more relationships between or among the data, in order to represent knowledge pertaining to the biological context and/or drug compound.
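One way to picture data linked by relationships is a small graph of typed nodes and labeled edges. The class name KnowledgeGraph, the relation labels, and the placeholder node "gene_X" are hypothetical; the Metformin and type 2 diabetes association is the example given elsewhere in this disclosure for the disease association data 214.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Typed nodes joined by labeled edges."""

    def __init__(self):
        self.node_type = {}
        self.edges = defaultdict(list)  # node -> [(relation, neighbor)]

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, source, relation, target):
        # Store both directions so a relationship can be followed from either end.
        self.edges[source].append((relation, target))
        self.edges[target].append((relation, source))

    def neighbors(self, name, relation=None):
        """Neighbors of a node, optionally restricted to one relation label."""
        return [n for r, n in self.edges[name] if relation is None or r == relation]

graph = KnowledgeGraph()
graph.add_node("Metformin", "compound")
graph.add_node("type 2 diabetes", "disease")
graph.add_node("gene_X", "gene")  # placeholder, not a real gene name
graph.add_edge("Metformin", "associated_with", "type 2 diabetes")
graph.add_edge("Metformin", "interacts_with", "gene_X")
```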

The physical properties data 202 includes physical properties exhibited by the drug compound. The physical properties may refer to characteristics that provide a physical description of the drug such as color, particle size, crystalline structure, melting point, and solubility. In some instances, the physical properties data 202 may also include chemical property data, such as the structure, form, and reactivity of a substance. In some embodiments, biological data may also be included (e.g., anti-neurodegenerative compound data, pro-neuroplasticity compound data, anti-cancer data) in the biological context representation 200.

The peptide activity data 204 may include various types of activity exhibited by the drug. For example, the activity may be hormonal, antimicrobial, immunomodulatory, cytotoxic, neurological, and the like. A peptide may refer to a short chain of amino acids linked by peptide bonds.

The microbe data 206 may include information pertaining to cellular structure (e.g., unicellular, multicellular, etc.) of a microscopic organism. The microbes may refer to bacteria, parasites, fungi, viruses, prions, or any combination of these, etc.

The antimicrobial compound data 208 may include information pertaining to agents that kill microbes or stop their growth. This data may include classifications based on the microorganisms against which the antimicrobial compound acts (e.g., antibiotics act against bacteria but not against viruses; anti-virals act against viruses but not against bacteria). The antimicrobial compound may also be classified according to function (e.g., microbicidal, meaning “that which kills, vitiates, inactivates or otherwise impairs the activity of certain microbes”).

The clinical outcome data 210 may include information pertaining to the administration of a drug compound to a subject in a clinical setting. For example, upon or subsequent to administration of the drug compound, the outcome may be a prevented disease, cured disease, treated symptom, etc.

The evidence-based guidelines 212 may include information pertaining to guidelines based upon clinical studies for acceptable treatment and/or therapeutics for certain diseases and/or medical conditions. Evidence-based guidelines data 212 may include data specific to various specialties within healthcare such as, for example, obstetrics, anesthesiology, hepatology, gastroenterology, neurology, pulmonology, orthopaedics, pediatrics, trauma care (including but not limited to burns and post-burn infections), histology, oncology, ophthalmology, endocrinology, rheumatology, internal medicine, surgery, vascular medicine, radiology, psychiatry, cardiology, urology, gynecology, genetics, and dermatology. In the example described herein, the evidence-based guidelines 212 include systematically developed statements to assist practitioner and patient decisions about appropriate health care (e.g., types of drugs to prescribe for treatment) for specific clinical circumstances.

The disease association data 214 may include information about which disease and/or medical condition the drug compounds are associated with. For example, the drug compound Metformin may be associated with the disease type 2 diabetes.

The pathway data 216 may include information pertaining in a design space to the relationships or paths between ingredients (e.g., chemicals) and activity levels.

The compound data 218 may include information pertaining to the compound such as the sequence of ingredients (e.g., type, amount, etc.) in the compound. In the therapeutics industry, for example, the compound data 218 can include data specific to the various types of drug compounds that are designed, defined, developed, and/or distributed.

The gene interaction data 220 may include information pertaining to which gene the drug compound and/or a disease may interact with.

The anti-neurodegenerative compound data 222 may include information pertaining to characteristics of anti-neurodegenerative compounds, such as their physical and chemical properties and activities on portions of tissue. For example, the activity may include anti-inflammatory and/or neuro-protective actions.

The pro-neuroplasticity compound data 224 may include information pertaining to characteristics of pro-neuroplasticity compounds, such as their physical and chemical properties and activities on portions of tissue. For example, the activity may enhance the capacity of motor systems by upregulation of neurotrophins.

FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure. Regarding FIG. 3A, a flow diagram 300 begins with obtaining heterogeneous datasets, such as the biological context representation 200. Heterogeneous datasets may refer to populations or samples of data that are different (e.g., as opposed to homogenous datasets where the data is the same). The heterogeneous datasets may include compound data (e.g., peptide sequence data), clinical outcome data, and/or activity data (in vitro and in vivo activity), as well as any other suitable data depicted in FIG. 2.

The data structure storing the heterogeneous datasets may be translated to a second data structure having a second format (e.g., a 2-dimensional vector) that the AI engine 140 may use to generate the candidate drug compounds. The next step in the flow diagram 300 includes training the one or more machine learning models 132 using the heterogeneous datasets. The one or more machine learning models 132 (e.g., generative models) may generate a set of candidate drug compounds based on the heterogeneous datasets. As described herein, a machine learning model may use causal inference and counterfactuals when generating the set of candidate drug compounds. Further, a GAN may be used in conjunction with causal inference to generate the set of candidate drug compounds. In some embodiments, a certain number (e.g., over 100,000 candidate drug compounds) of novel candidate drug compounds may be generated in a set. That is, each candidate drug compound in the set of candidate drug compounds is intended to be unique.

The next step in the flow diagram 300 includes inputting the set of candidate drug compounds into one or more machine learning models 132 trained to classify the set of candidate drug compounds. The machine learning models 132 may perform supervised and/or unsupervised filtering. In some embodiments, the machine learning models 132 may perform clustering to rank the various candidate drug compounds to classify one candidate drug compound as a selected candidate drug compound. In some embodiments, the machine learning models 132 may output a subset (e.g., 1,000 to 10,000, or more, or fewer) of candidate drug compounds.

The next step in the flow diagram 300 may include performing experimental validation by validating whether each candidate drug compound in the subset of candidate drug compounds provides the desired level of certain types of activity in a design space. The results of the experimental validation may be fed back into the heterogeneous dataset to reinforce and expand the experimental dataset.

The next step in the flow diagram 300 may include performing peptide drug optimization. The optimizations may include performing gradient descent and/or ascent using the sequence of ingredients in the candidate drug compounds to attempt to increase and/or decrease certain activity levels in a design space. The results of the peptide drug optimization may be fed back into the heterogeneous datasets to reinforce and expand the experimental dataset.

FIG. 3B illustrates another high-level flow diagram 310 according to some embodiments. As depicted, a heterogeneous network of biology may be included in a knowledge graph of a biological context representation 200. Various paths or meta-paths may be expressed between nodes in the biological context representation 200. For example, the meta-paths may include indications for compound upregulates, pathway participates, disease associations, gene interactions, and compound data.

The biological context representation 200 may be translated from a first format (e.g., knowledge graph) to a format (e.g., vector) that may be processed by the AI engine 140. The AI engine 140 may use one or more machine learning models to traverse the knowledge graph by performing random walks until a corpus of random walks is generated, wherein such random walks include the indications associated with the meta-paths representing sequences of ingredients. The corpus of random walks may be referred to as a set of candidate drug compounds. A generative adversarial network using causal inference may be used to generate the set of candidate drug compounds. The set of candidate drug compounds may be stored in a higher-dimensional vector.
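The random-walk traversal described above can be sketched as a walk that, at each step, moves to a random neighbor whose type matches the next entry of a meta-path. The node names, node types, adjacency lists, and the meta-path itself are hypothetical, and a seeded random generator is used only to make the sketch repeatable.

```python
import random

def metapath_walk(start, metapath, node_type, neighbors, rng):
    """Walk from start, choosing at each step a random neighbor of the next type in metapath."""
    walk = [start]
    for wanted in metapath[1:]:
        options = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not options:
            break  # dead end: return the partial walk
        walk.append(rng.choice(options))
    return walk

# Hypothetical knowledge graph with compounds (c), genes (g), and a disease (d).
node_type = {"c1": "compound", "c2": "compound", "g1": "gene", "g2": "gene", "d1": "disease"}
neighbors = {
    "c1": ["g1", "g2"],
    "c2": ["g2"],
    "g1": ["c1", "d1"],
    "g2": ["c1", "c2", "d1"],
    "d1": ["g1", "g2"],
}

rng = random.Random(0)
metapath = ["compound", "gene", "disease"]
# Corpus of random walks: five walks from each compound node.
corpus = [
    metapath_walk(c, metapath, node_type, neighbors, rng)
    for c in ("c1", "c2")
    for _ in range(5)
]
```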

The AI engine 140 may compress the higher-dimensional vector of the set of candidate drug compounds into a lower-dimensional vector of the set of candidate drug compounds, depicted as biological embeddings in FIG. 3B. In some embodiments, the lower-dimensional vector may include fewer dimensions (e.g., 2, 3, . . . N) than the higher-dimensional vector (e.g., greater than N). As depicted, the nodes may be organized by the meta-path indicators and by dimension.

To output a subset of candidate drug compounds, the lower-dimensional vector of the set of candidate drug compounds may be input to one or more machine learning models 132 trained to perform classification. The classification techniques may include using clustering to filter out candidate drug compounds that produce undesirable levels of types of activity. In some embodiments, to enable the AI engine 140 to perform the classification, views presenting the levels of types of activity of each candidate drug compound in a design space may be generated using the lower-dimensional vectors. These views may also be presented to a user via the computing device 102. The machine learning models 132 may output a candidate drug compound classified as a selected candidate drug compound based on the clustering. For example, the selected candidate drug compound may include an optimized sequence of ingredients that provides the most desirable levels of a certain type of activity in a design space.

FIG. 4 illustrates example operations of a method 400 for generating and classifying a candidate drug compound according to certain embodiments of this disclosure. The method 400 is performed by processing logic that may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both. The method 400 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In certain implementations, the method 400 may be performed by a single processing thread. Alternatively, the method 400 may be performed by two or more processing threads, each thread implementing one or more individual functions, routines, subroutines, or operations of the method. One or more operations of the method 400 may be performed by the training engine 130 of FIG. 1.

For simplicity of explanation, the method 400 is depicted and described as a series of operations. However, operations in accordance with this disclosure can occur in various orders and/or concurrently, and with other operations not presented and described herein. For example, the operations depicted in the method 400 may occur in combination with any other operation of any other method disclosed herein. Furthermore, not all illustrated operations may be required to implement the method 400 in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the method 400 could alternatively be represented as a series of interrelated states via a state diagram or events.

At 402, the processing device may generate a biological context representation 200 of a set of drug compounds. The biological context representation 200 may include a first data structure having a first format (e.g., a knowledge graph). The biological context representation 200 may include, for each drug compound of the set of drug compounds, one or more relationships between or among, without limitation, (i) physical properties data 202, (ii) peptide activity data 204, (iii) microbe data 206, (iv) antimicrobial compound data 208, (v) clinical outcome data 210, (vi) evidence-based guidelines 212, (vii) disease association data 214, (viii) pathway data 216, (ix) compound data 218, (x) gene interaction data 220, (xi) anti-neurodegenerative compound data 222, (xii) pro-neuroplasticity compound data 224, or some combination thereof.

At 404, the processing device may translate, by the artificial intelligence engine 140, the first data structure having the first format to a second data structure having a second format. The translating may include converting the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector) according to a specific set of rules executed by the artificial intelligence engine 140. In some embodiments, the translating may be performed by one or more of the machine learning models 132. For example, a recurrent neural network may perform at least a portion of the translating.

The translating may include obtaining a higher-dimensional vector and compressing the higher-dimensional vector into a lower-dimensional vector (e.g., two-dimensional, three-dimensional, four-dimensional), referred to as an embedding herein. In some embodiments, one or more embeddings may be created from the first data structure having the first format. There may be any suitable number of dimensions of the embeddings. When used for classifying candidate drug compounds, the number of dimensions may be selected based on a desired performance to process the embeddings. The lower-dimensional vector may have at least one fewer dimension than the higher-dimensional vector.
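The compression of a higher-dimensional vector into a lower-dimensional embedding can be sketched with a projection onto the direction of largest variance, found by power iteration. This is a stand-in technique chosen for illustration (a one-component principal component analysis); the disclosure does not state which compression method is used, and the function names and the 3-dimensional example vectors are hypothetical.

```python
def top_component(rows, iterations=100):
    """Return (mean, unit direction of largest variance) using power iteration."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    # Covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
        for i in range(dim)
    ]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def compress(rows, mean, direction):
    """Project each row onto the direction: dim-dimensional input, 1-dimensional output."""
    return [
        sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
        for r in rows
    ]

# Hypothetical 3-dimensional vectors that vary mostly along (1, 1, 0).
rows = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.1], [2.0, 2.0, 0.9], [3.0, 3.0, 1.0]]
mean, direction = top_component(rows)
embedding = compress(rows, mean, direction)  # one number per input vector
```

Here each 3-dimensional vector is reduced to a single coordinate, which satisfies the condition that the lower-dimensional vector has at least one fewer dimension than the higher-dimensional vector.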

At 406, the processing device may generate, based on the second data structure having the second format, a set of candidate drug compounds. In some embodiments, the generating may be performed by one or more of the machine learning models 132. For example, a generative adversarial network may perform the generating of the set of candidate drug compounds. In some embodiments, the set of candidate drug compounds may be associated with design spaces pertaining to antimicrobial, anti-cancer, or anti-biofilm activity, or the like. A biofilm may include any syntrophic consortium of microorganisms in which cells stick to each other and often also to a surface. These adherent cells may become embedded within an extracellular matrix that is composed of extracellular polymeric substances (EPS). Cancer may refer to a disease caused by or correlated with an uncontrolled division of abnormal cells in a part of the body.

At 408, the processing device may classify a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound. In some embodiments, the classifying may be performed by one or more of the machine learning models 132. For example, a classifier trained using supervised or unsupervised learning may perform the classifying. In some embodiments, the classifier may use clustering techniques to rank and classify the selected candidate drug compound.

In some embodiments, the processing device may generate a set of views including a representation of a design space. The design space may be antimicrobial. The processing device may cause the set of views to be presented on a computing device (e.g., computing device 102). The representation of the design space may pertain to, without limitation, (i) antimicrobial activity, (ii) immunomodulatory activity, (iii) neuromodulatory activity, (iv) cytotoxic activity, or some combination thereof. Each view of the set of views may present an optimized sequence representing the selected candidate drug compound.

The optimized sequence in each view may be generated using any suitable optimization technique. The optimization technique may include maximizing or minimizing an objective function by systematically selecting input values from a domain of values and computing the value using the objective function. The domain of values may include a subset of values from a Euclidean space. The subset of values may satisfy one or more constraints, equalities, and/or inequalities. A value that minimizes or maximizes the objective function may be referred to as an optimal solution. Certain values in the subset may result in a gradient of the objective function being zero. Those certain values may be at stationary points, where the first derivatives of the objective function with respect to its input variables are zero. The gradient of a scalar-valued differentiable function (e.g., the objective function) of several variables may refer to the vector field whose value at a point p is a vector whose components are the partial derivatives of the function at p. If the gradient is not a zero vector at a certain point p, then the direction of the gradient is the direction of fastest increase of the objective function at the certain point p.

Gradients may be used in gradient descent, which refers to a first-order iterative optimization algorithm for finding a local minimum of an objective function. To find the local minimum, gradient descent may proceed by taking steps proportional to the negative of the gradient of the objective function at a current point. In some embodiments, the optimized sequence may be found for a candidate drug compound by performing gradient descent in the design space. Additionally, gradient ascent, which is the counterpart of gradient descent that takes steps in the direction of the gradient, may determine a local maximum of the objective function at various points in the design space.
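Gradient descent and gradient ascent can be sketched with a single routine that steps against or along a numerically estimated gradient. The objective function bowl, the step size, the iteration count, and the central-difference gradient estimate are assumptions for illustration; an actual design space would supply its own objective.

```python
def numerical_gradient(f, point, h=1e-6):
    """Central-difference estimate of the gradient of f at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def follow_gradient(f, start, step=0.1, iterations=200, ascend=False):
    """Gradient descent by default; gradient ascent when ascend is True."""
    sign = 1.0 if ascend else -1.0
    point = list(start)
    for _ in range(iterations):
        grad = numerical_gradient(f, point)
        point = [p + sign * step * g for p, g in zip(point, grad)]
    return point

# Hypothetical objective with a single minimum at (1, -2).
def bowl(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

minimum = follow_gradient(bowl, [0.0, 0.0])
# Ascent on the negated objective reaches the same point, now as a maximum.
maximum = follow_gradient(lambda p: -bowl(p), [0.0, 0.0], ascend=True)
```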

The views generated may include a topographical heatmap, itself including indicators for the least activity at points in the design space and the most activity at points in the design space. The indicator associated with the most activity may represent a local maximum obtained using gradient ascent. The indicator associated with the least activity may represent a local minimum obtained using gradient descent. The optimized sequence may be generated by navigating points between the local minima and local maxima. The optimized sequence may be overlaid on the indicators ranging from at least one least active property to at least one most active property.

In some embodiments, the processing device may cause the selected candidate drug compound to be formulated. In some embodiments, the processing device may cause the selected candidate drug compound to be created, manufactured, developed, synthesized, or the like. For example, the artificial intelligence engine 140 may control one or more pumps, mixers, heaters, and reaction chambers to synthesize a desired candidate drug sequence in an automated flow process. The synthesis may be controlled in the reaction chamber via one or more parameters, such as solvents, temperature, pressure, and the like. In some embodiments, the processing device may cause the selected candidate drug compound to be presented on a computing device (e.g., computing device 102). The selected candidate drug compound may include one or more active ingredients (e.g., chemicals) at a specified amount.

FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation 200 of a plurality of drug compounds according to certain embodiments of this disclosure. The first format may include a knowledge graph. The biological context representation 200 may capture an entire biological context by integrating every known association or relationship for each drug compound into a comprehensive knowledge graph.

FIG. 5A presents the biological context representation 200 including biomedical and domain knowledge on peptide activity, microbes, antimicrobial compounds, clinical outcomes, and any relevant information depicted in FIG. 2. A table 500 may include rows representing various categories (A, B, C, D, and E) pertaining to a biological context for each drug compound and columns representing sub-categories (1, 2, 3, 4, and 5). For example, the table includes the following subcategories for each category:

- Category A: A1 2D fingerprints, A2 3D fingerprints, A3 Scaffolds, A4 Struct. Keys, A5 Physicochem.
- Category B: B1 Mech. of act., B2 Metab. Genes, B3 Crystals, B4 Binding, B5 HTS bioassays
- Category C: C1 S. mol. Roles, C2 S. mol. Path., C3 Signal. Path., C4 Biol. Proc., C5 Interactome
- Category D: D1 Transcript, D2 Can. Cell lines, D3 Ch. Genetics, D4 Morphology, D5 Cell bioassays
- Category E: E1 Therap. Areas, E2 Indications, E3 Side effects, E4 Dis. & Toxicol., E5 Drug-drug inter.

Charts 502, 504, and 506 represent characteristics for each subcategory. The characteristics include the size of molecules for chart 502, the complexity of variables for chart 504, and the correlation with mechanism of action for chart 506. Another chart 508 may represent the various characteristics of the subcategories using an indicator (such as a range of colors from 0 to 1) to express the values of the characteristics in relation to each other.

FIG. 5B illustrates a different representation 520 of characteristics for several subcategories (e.g., A1, B1, C5, D1, and E3) across different subject matter areas (e.g., neurology and psychiatry, infectious disease, gastroenterology, cardiology, ophthalmology, oncology, endocrinology, pulmonary, rheumatology, and malignant hematology). Accordingly, the representation 520 provides an even more granular representation of the biological context representation 200 than does the chart 508. Flowchart 530 represents the process for generating candidate drugs as described further herein.

FIG. 5C illustrates a knowledge graph 540 representing the biological context representation 200. The knowledge graph 540 may refer to a cognitive map. In particular, the knowledge graph 540 represents a graph traversed by the AI engine 140 when generating candidate drug compounds having desired levels of certain types of activity in a design space. Each node in the knowledge graph 540 represents a health artifact (health-related information) or relationship (predicate) gleaned and curated from numerous data sources. Further, the knowledge represented in the knowledge graph 540 may be improved over time as the machine learning models discover new associations, correlations, and/or relationships. The nodes and relationships may form logical structures that represent knowledge (e.g., Genes Participates Pathways). FIG. 5D illustrates another representation of the knowledge graph 540 that more clearly identifies the various relationships among the nodes.
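As a non-limiting illustration, a knowledge graph of the kind described above may be sketched in Python as a collection of subject, predicate, and object triples that can be traversed. The class `KnowledgeGraph`, its methods, and the example entries (`CompoundP`, `GeneA`, `PathwayX`, `OutcomeD`) are hypothetical names chosen for this example. Storing each predicate as an edge label rather than as its own node is a simplifying assumption.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Set, Tuple


class KnowledgeGraph:
    """Directed graph stored as (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Record one relationship, e.g. ("GeneA", "participates", "PathwayX")."""
        self._edges[subject].append((predicate, obj))

    def neighbors(self, subject: str, predicate: Optional[str] = None) -> List[str]:
        """Return objects linked to subject, optionally filtered by predicate."""
        return [obj for pred, obj in self._edges.get(subject, [])
                if predicate is None or pred == predicate]

    def reachable(self, start: str) -> Set[str]:
        """Breadth-first traversal collecting every node reachable from start."""
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for obj in self.neighbors(node):
                if obj not in seen and obj != start:
                    seen.add(obj)
                    queue.append(obj)
        return seen


# Hypothetical entries for illustration only.
graph = KnowledgeGraph()
graph.add("CompoundP", "binds", "GeneA")
graph.add("GeneA", "participates", "PathwayX")
graph.add("PathwayX", "associated_with", "OutcomeD")
```

Traversing from `CompoundP` in this sketch reaches the gene, the pathway, and the outcome, which mirrors how chained relationships may connect a compound to downstream biological context.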

FIG. 6 illustrates example operations of a method 600 for translating the first data structure of FIGS. 5A-5B to a second data structure according to certain embodiments of this disclosure. Method 600 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 600 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 600 may be performed in the same or a similar manner as described above in regards to method 400. The operations of the method 600 may be performed in some combination with any of the operations of any of the methods described herein.

The method 600 may include operation 404 from the previously-described method 400 depicted in FIG. 4. For example, at 404 in the method 600, the processing device may translate, by the artificial intelligence engine 140, the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector). The method 600 in FIG. 6 includes operations 602, 604, and 606.

At 602, the processing device may obtain a higher-dimensional vector from the biological context representation 200. This process is further illustrated in FIG. 7.

At 604, the processing device may compress the higher-dimensional vector to a lower-dimensional vector. The compressing may be performed by a first machine learning model 132 trained to perform deep auto-encoding via a recurrent neural network configured to output the lower-dimensional vector.

At 606, the processing device may train the first machine learning model 132 by using a second machine learning model 132 to recreate the first data structure having the first format. The second machine learning model 132 is trained to perform a decoding operation to recreate the first data structure having the first format. The decoding operation may be performed on the second data structure having the second format (e.g., two-dimensional vector).

FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5B to the second data structure according to certain embodiments of this disclosure. Aggregated biological data may be difficult to model and format correctly for an AI engine to process. Aspects of the present disclosure overcome the hurdle of modeling and formatting the aggregated biological data to enable the AI engine 140 to generate candidate drug compounds accurately and efficiently.

As depicted, a higher-dimensional vector 700 may be obtained from the biological context representation 200. Using a recurrent neural network performing autoencoding, the higher-dimensional vector 700 is compressed to a lower-dimensional vector 702. The recurrent neural network performing autoencoding is trained using another machine learning model 132 that recreates the higher-dimensional vector 704. If the other machine learning model 132 is unable to recreate the higher-dimensional vector 704 from the lower-dimensional vector 702, then the other machine learning model 132 provides feedback to the recurrent neural network performing autoencoding in order to update its weights, biases, or any suitable parameters.
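As a non-limiting illustration, the compress, recreate, and feedback loop described above may be sketched in Python as follows. This sketch substitutes a small linear encoder and decoder for the recurrent neural network described above, purely to keep the example short. The initial weights, the learning rate, the number of epochs, and the example vectors are assumptions made for illustration and are not taken from this disclosure.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def reconstruction_error(encoder: Matrix, decoder: Matrix, data: List[Vector]) -> float:
    """Sum of squared differences between each input and its recreation."""
    total = 0.0
    for x in data:
        recreated = matvec(decoder, matvec(encoder, x))
        total += sum((a - b) ** 2 for a, b in zip(recreated, x))
    return total


def train(encoder: Matrix, decoder: Matrix, data: List[Vector],
          learning_rate: float = 0.05, epochs: int = 500) -> None:
    """Update both models in place using the decoder's reconstruction error."""
    for _ in range(epochs):
        for x in data:
            code = matvec(encoder, x)          # lower-dimensional vector
            recreated = matvec(decoder, code)  # attempted higher-dimensional vector
            error = [a - b for a, b in zip(recreated, x)]
            # Feedback for the encoder: the error propagated back through the decoder.
            code_grad = [sum(decoder[i][j] * error[i] for i in range(len(error)))
                         for j in range(len(code))]
            for i in range(len(decoder)):
                for j in range(len(code)):
                    decoder[i][j] -= learning_rate * error[i] * code[j]
            for j in range(len(encoder)):
                for k in range(len(x)):
                    encoder[j][k] -= learning_rate * code_grad[j] * x[k]


# Hypothetical 4-dimensional inputs compressed to 2 dimensions.
data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
encoder = [[0.3, -0.1, 0.2, 0.1], [-0.1, 0.3, 0.1, 0.2]]
decoder = [[0.3, -0.1], [-0.1, 0.3], [0.2, 0.1], [0.1, 0.2]]

error_before = reconstruction_error(encoder, decoder, data)
train(encoder, decoder, data)
error_after = reconstruction_error(encoder, decoder, data)
```

The reconstruction error plays the role of the feedback described above: when the decoder cannot recreate the input from the compressed vector, the error is used to adjust the encoder's weights as well as the decoder's.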

FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure. As depicted, FIG. 8A illustrates a view 800 including antimicrobial activity, FIG. 8B illustrates a view 802 including immunomodulatory activity, and FIG. 8C illustrates a view 804 including cytotoxic activity. Each view presents a topographical heatmap where one axis is for sequence parameter y and the other axis is for sequence parameter x. Each view includes an indicator ranging from a least active property to a most active property. Further, each view includes an optimized sequence 806 for a selected candidate drug compound classified by the classifier (machine learning model 132). These views may be presented to the user on a computing device 102. Further, the selected candidate drug compound 806 may be formulated, generated, created, manufactured, developed, and/or tested.

FIG. 9 illustrates example operations of a method 900 for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure. Method 900 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102). In some embodiments, one or more operations of the method 900 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 900 may be performed in the same or a similar manner as described above in regards to method 400. The operations of the method 900 may be performed in some combination with any of the operations of any of the methods described herein.

At 902, the processing device may receive, from the artificial intelligence engine 140, a candidate drug compound generated by the artificial intelligence engine 140.

At 904, the processing device may generate a view including the candidate drug compound overlaid on a representation of a design space. The view may present a topographical heatmap of the representation of the design space. The topographical heatmap may include the candidate drug compound overlaid on indicators ranging from an at least one least active property to an at least one most active property.

At 906, the processing device may present the view on a display screen of a computing device (e.g., computing device 102).
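As a non-limiting illustration, operations 904 and 906 may be sketched in Python as follows. The function `toy_activity`, the grid size, and the text-based rendering are assumptions made for this example. An actual view may instead use a graphical heatmap. Activity values are rescaled to indicators between 0 (least active) and 1 (most active), and the candidate is overlaid as an `X`.

```python
from typing import List, Tuple

Grid = List[List[float]]


def toy_activity(x: float, y: float) -> float:
    """Hypothetical activity over two sequence parameters; peaks at (0.6, 0.3)."""
    return -((x - 0.6) ** 2 + (y - 0.3) ** 2)


def build_heatmap(size: int) -> Grid:
    """Sample the design space on a grid and rescale values to the range 0..1."""
    raw = [[toy_activity(col / (size - 1), row / (size - 1)) for col in range(size)]
           for row in range(size)]
    lowest = min(min(row) for row in raw)
    highest = max(max(row) for row in raw)
    span = (highest - lowest) or 1.0
    return [[(value - lowest) / span for value in row] for row in raw]


def most_active_cell(heatmap: Grid) -> Tuple[int, int]:
    """Return the (row, column) of the cell with the highest indicator."""
    best = max((value, row_idx, col_idx)
               for row_idx, row in enumerate(heatmap)
               for col_idx, value in enumerate(row))
    return best[1], best[2]


def render_view(heatmap: Grid, candidate: Tuple[int, int]) -> str:
    """Draw the heatmap as text, overlaying the candidate as 'X'."""
    shades = " .:-=+*#"  # least active on the left, most active on the right
    lines = []
    for row_idx, row in enumerate(heatmap):
        chars = []
        for col_idx, value in enumerate(row):
            if (row_idx, col_idx) == candidate:
                chars.append("X")
            else:
                chars.append(shades[min(int(value * len(shades)), len(shades) - 1)])
        lines.append("".join(chars))
    return "\n".join(lines)


heatmap = build_heatmap(11)
candidate_cell = most_active_cell(heatmap)
view_text = render_view(heatmap, candidate_cell)
```

Calling `print(view_text)` would display the view on a terminal, standing in for the display screen of the computing device in this sketch.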

FIG. 10A illustrates example operations of a method 1000 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure. Method 1000 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 1000 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 1000 may be performed in the same or a similar manner as described above in regards to method 400. The operations of the method 1000 may be performed in some combination with any of the operations of any of the methods described herein.

At 1002, the processing device may perform one or more modifications pertaining to the biological context representation 200, the second data structure having the second format, or some combination thereof.

At 1004, the processing device may use causal inference to determine whether the one or more modifications provide one or more desired performance results. In some embodiments, using causal inference may further include using, at 1006, counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof. The term “calculate” may be used interchangeably with any of the following terms: simulate, emulate, determine, generate, formulate, execute, and/or obtain. A counterfactual may refer to determining whether the desired performance still results if something does not occur during the calculation. For example, in a scenario, a person may improve their health after taking a medication. The counterfactual may be used in causal inference to calculate an alternative scenario to see whether the person's health would have improved without taking the medication. If the person's health still improved without taking the medication, it may be inferred that the medication did not cause the health of the person to improve. However, if the person's health did not improve without taking the medication, it may be inferred that the medication is correlated with causing the health of the person to improve. There may, however, be other factors involved in conjunction with taking the medication that actually cause the health of the person to improve.
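As a non-limiting illustration, the medication example above may be sketched in Python as a toy structural model. The class `Patient`, the field `recovers_naturally`, and the rule in `health_improved` are hypothetical and assume that the background factor is known, which is generally not the case in practice.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Patient:
    took_medication: bool
    recovers_naturally: bool  # assumed-known background factor (hypothetical)


def health_improved(patient: Patient) -> bool:
    """Toy outcome rule: health improves via the medication or natural recovery."""
    return patient.took_medication or patient.recovers_naturally


def counterfactual_without_medication(patient: Patient) -> bool:
    """Recompute the outcome with the medication removed, all else held fixed."""
    return health_improved(replace(patient, took_medication=False))


def medication_made_difference(patient: Patient) -> bool:
    """True when health improved in fact but not in the counterfactual scenario."""
    return health_improved(patient) and not counterfactual_without_medication(patient)


needed_medication = Patient(took_medication=True, recovers_naturally=False)
would_recover_anyway = Patient(took_medication=True, recovers_naturally=True)
```

In this sketch, `medication_made_difference` is true only for the first patient, because the second patient's health improves in the alternative scenario as well.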

FIG. 10B illustrates another example of operations of method 1050 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure. Method 1050 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 1050 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 1050 may be performed in the same or a similar manner as described above in regards to method 400. The operations of the method 1050 may be performed in some combination with any of the operations of any of the methods described herein.

At 1052, the processing device may generate a set of candidate drug compounds by performing a modification using causal inference based on a counterfactual. For example, the counterfactual may include removing an ingredient from a sequence of ingredients to determine whether a candidate drug compound provides the same level and/or type of activity it previously provided when the ingredient was included in the sequence. If the same level and/or type of activity is still provided after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is not correlated with the level and/or type of activity. If the same level and/or type of activity is not present after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is correlated with the level and/or type of activity.

At 1054, the processing device may classify a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound, as previously described herein.
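As a non-limiting illustration, the ingredient-removal counterfactual described at 1052 may be sketched in Python as follows. The scoring function `toy_activity`, which counts occurrences of the letters K and R in a sequence, is a hypothetical stand-in for a measured or predicted level of activity and is not taken from this disclosure.

```python
from typing import Callable, List, Tuple


def toy_activity(sequence: str) -> int:
    """Hypothetical activity score: the number of K and R ingredients."""
    return sum(1 for ingredient in sequence if ingredient in "KR")


def ingredients_linked_to_activity(
    sequence: str, activity_fn: Callable[[str], int]
) -> List[Tuple[int, str]]:
    """Remove each ingredient in turn and report those whose removal changes activity."""
    baseline = activity_fn(sequence)
    linked = []
    for index, ingredient in enumerate(sequence):
        without_ingredient = sequence[:index] + sequence[index + 1:]
        if activity_fn(without_ingredient) != baseline:
            linked.append((index, ingredient))
    return linked


linked = ingredients_linked_to_activity("GKLRA", toy_activity)
```

Ingredients whose removal leaves the score unchanged are treated as not correlated with the activity, consistent with the description above.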

FIG. 11 illustrates an example computer system 1100, which can perform any one or more of the methods described herein, in accordance with one or more aspects of the present disclosure. In one example, computer system 1100 may correspond to the computing device 102 (e.g., user computing device), one or more servers 128 of the cloud-based computing system 116, the training engine 130, or any suitable component of FIG. 1. The computer system 1100 may be capable of executing application 118 and/or the one or more machine learning models 132 of FIG. 1. The computer system may be connected (e.g., networked) to other computer systems in a LAN, an intranet, an extranet, or the Internet. The computer system may operate in the capacity of a server in a client-server network environment. The computer system may be a personal computer (PC), a tablet computer, a wearable (e.g., wristband), a set-top box (STB), a personal digital assistant (PDA), a mobile phone, a camera, a video camera, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Further, while only a single computer system is illustrated, the term “computer” shall also be taken to include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein.

The computer system 1100 includes a processing device 1102, a main memory 1104 (e.g., read-only memory (ROM), flash memory, solid state drives (SSDs), dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM)), a static memory 1106 (e.g., flash memory, solid state drives (SSDs), static random access memory (SRAM)), and a data storage device 1108, which communicate with each other via a bus 1110.

Processing device 1102 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 1102 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processing device 1102 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a system on a chip, a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 1102 is configured to execute instructions for performing any of the operations and steps discussed herein.

The computer system 1100 may further include a network interface device 1112. The computer system 1100 also may include a video display 1114 (e.g., a liquid crystal display (LCD), a light-emitting diode (LED), an organic light-emitting diode (OLED), a quantum LED, a cathode ray tube (CRT), a shadow mask CRT, an aperture grille CRT, a monochrome CRT), one or more input devices 1116 (e.g., a keyboard and/or a mouse), and one or more speakers 1118 (e.g., a speaker). In one illustrative example, the video display 1114 and the input device(s) 1116 may be combined into a single component or device (e.g., an LCD touch screen).

The data storage device 1108 may include a computer-readable medium 1120 on which the instructions 1122 embodying any one or more of the methods, operations, or functions described herein are stored. The instructions 1122 may also reside, completely or at least partially, within the main memory 1104 and/or within the processing device 1102 during execution thereof by the computer system 1100. As such, the main memory 1104 and the processing device 1102 also constitute computer-readable media. The instructions 1122 may further be transmitted or received over a network via the network interface device 1112.

While the computer-readable storage medium 1120 is shown in the illustrative examples to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.

None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claim scope. The scope of patented subject matter is defined only by the claims. Moreover, none of the claims is intended to invoke 35 U.S.C. § 112(f) unless the exact words “means for” are followed by a participle.

Consistent with the above disclosure, the examples of systems and methods enumerated in the following clauses are specifically contemplated and are intended as a non-limiting set of examples.

Clause 1. A method comprising:

- generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
- translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;
- generating, based on the second data structure having the second format, a plurality of candidate drug compounds; and
- classifying a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

Clause 2. The method of any preceding clause, wherein the biological context representation comprises, for each of the plurality of drug compounds, one or more relationships between or among:

- physical properties data,
- chemical data,
- biological data,
- clinical outcome data, or
- some combination thereof.

Clause 3. The method of any preceding clause, wherein the translating the first data structure further comprises:

- converting the first data structure having the first format to the second data structure having the second format according to a specific set of rules executed by the artificial intelligence engine.

Clause 4. The method of any preceding clause, wherein converting the biological context representation to the second data structure having the second format further comprises:

- obtaining a higher-dimensional vector from the biological context representation; and
- compressing the higher-dimensional vector to a lower-dimensional vector, wherein the compressing is performed by a machine learning model trained to perform deep auto-encoding via a recurrent neural network configured to output the lower-dimensional vector.

Clause 5. The method of any preceding clause, further comprising:

- training the machine learning model by using a second machine learning model to recreate the higher-dimensional vector,
- wherein the second machine learning model is trained to perform a decoding operation to recreate the higher-dimensional vector, wherein the decoding operation is performed on the lower-dimensional vector.

Clause 6. The method of any preceding clause, wherein:

- the translating is performed by a recurrent neural network,
- the generating of the plurality of candidate drug compounds is performed by a generative adversarial network, and
- the classifying of the candidate drug compound is performed by a classifier trained using supervised learning.

Clause 7. The method of any preceding clause, further comprising:

- generating a plurality of views comprising a representation of a design space; and
- causing the plurality of views to be presented on a computing device,
- wherein the representation pertains to:
  - antimicrobial activity,
  - immunomodulatory activity,
  - neuromodulatory activity,
  - cytotoxic activity,
  - or some combination thereof.

Clause 8. The method of any preceding clause, wherein the design space is antimicrobial.

Clause 9. The method of any preceding clause, wherein each view of the plurality of views presents an optimized sequence representing the selected candidate drug compound.

Clause 10. The method of any preceding clause, wherein each view presents a topographical heatmap comprising the optimized sequence overlaid on indicators ranging from an at least one least active property to an at least one most active property.

Clause 11. The method of any preceding clause, further comprising:

- performing one or more modifications pertaining to the biological context representation, the second data structure having the second format, or some combination thereof; and
- using causal inference to determine whether the one or more modifications provide one or more desired performance results.

Clause 12. The method of any preceding clause, wherein using causal inference further comprises:

- using counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.

Clause 13. The method of any preceding clause, further comprising:

- causing the selected candidate drug compound to be formulated.

Clause 14. The method of any preceding clause, further comprising:

- causing the selected candidate drug compound to be created.

Clause 15. The method of any preceding clause, further comprising:

- causing the selected candidate drug compound to be presented on a computing device.

Clause 16. A tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to:

- generate a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
- translate the first data structure having the first format to a second data structure having a second format;
- generate, based on the second data structure having the second format, a plurality of candidate drug compounds; and
- classify a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

Clause 17. The computer-readable medium of any preceding clause, wherein the biological context representation comprises, for each of the plurality of drug compounds, one or more relationships between or among:

- physical properties data,
- chemical data,
- biological data,
- clinical outcome data, or
- some combination thereof.

Clause 18. The computer-readable medium of any preceding clause, wherein translating the biological context representation to the second data structure having the second format further comprises:

- obtaining a higher-dimensional vector from the biological context representation; and
- compressing the higher-dimensional vector to a lower-dimensional vector, wherein the compressing is performed by a machine learning model trained to perform deep auto-encoding via a recurrent neural network configured to output the lower-dimensional vector.

Clause 19. The computer-readable medium of any preceding clause, wherein the processing device further:

- trains the machine learning model by using a second machine learning model to recreate the higher-dimensional vector,
- wherein the second machine learning model is trained to perform a decoding operation to recreate the higher-dimensional vector, wherein the decoding operation is performed on the lower-dimensional vector.

Clause 20. The computer-readable medium of any preceding clause, wherein:

- the translating is performed by a recurrent neural network,
- the generating of the plurality of candidate drug compounds is performed by a generative adversarial network, and
- the classifying of the candidate drug compound is performed by a classifier trained using supervised learning.

Clause 21. The computer-readable medium of any preceding clause, wherein the processing device further:

- generates a plurality of views comprising a representation of a design space,
- wherein the representation pertains to:
  - antimicrobial activity,
  - immunomodulatory activity,
  - neuromodulatory activity,
  - cytotoxic activity,
  - or some combination thereof.

Clause 22. The computer-readable medium of any preceding clause, wherein the processing device further:

- causes each view of the plurality of views to be presented, wherein each such view comprises an optimized sequence representing the selected candidate drug compound, and
- causes each view of the plurality of views to be presented, wherein each such view comprises the optimized sequence overlaid on indicators ranging from an at least one least active property to an at least one most active property.

Clause 23. The computer-readable medium of any preceding clause, wherein the processing device further:

- performs one or more modifications pertaining to the biological context representation, the second data structure having the second format, or some combination thereof; and
- uses causal inference to determine whether the one or more modifications provide a desired performance result.

Clause 24. The computer-readable medium of any preceding clause, wherein the processing device employs causal inference to further:

- use counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.

Clause 25. A system comprising:

- a memory device storing instructions; and
- a processing device communicatively coupled to the memory device, the processing device executes the instructions to:
  - generate a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
  - translate the first data structure having the first format to a second data structure having a second format;
  - generate, based on the second data structure having the second format, a plurality of candidate drug compounds; and
  - classify a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

Clause 26. A computing device comprising:

- a display screen;
- a memory device storing instructions; and
- a processing device communicatively coupled to the memory device and the display screen, the processing device executes the instructions to:
  - receive, from an artificial intelligence engine, a candidate drug compound generated by the artificial intelligence engine;
  - generate a view comprising the candidate drug compound overlaid on a representation of a design space, wherein the view presents a topographical heatmap of the representation of the design space, and the topographical heatmap includes the candidate drug compound overlaid on indicators ranging from an at least one least active property to an at least one most active property; and
  - present the view on the display screen.

Clause 27. A method comprising:

- generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
- translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;
- generating, based on the second data structure having the second format, a plurality of candidate anti-biofilm compounds; and
- classifying a candidate anti-biofilm compound from the plurality of candidate anti-biofilm compounds as a selected candidate anti-biofilm compound.

Clause 28. A method comprising:

- generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
- translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;
- generating, based on the second data structure having the second format, a plurality of candidate anti-cancer compounds; and
- classifying a candidate anti-cancer compound from the plurality of candidate anti-cancer compounds as a selected candidate anti-cancer compound.

Clause 29. A method comprising:

- generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;
- translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;
- generating, based on the second data structure having the second format, a plurality of candidate antimicrobial compounds; and
- classifying a candidate antimicrobial compound from the plurality of candidate antimicrobial compounds as a selected candidate antimicrobial compound.

Claims

1. A method comprising:

generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;

generating, based on the second data structure having the second format, a plurality of candidate drug compounds; and

classifying a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

2. The method of claim 1, wherein the biological context representation comprises, for each of the plurality of drug compounds, one or more relationships between or among:

physical properties data,

chemical data,

biological data,

clinical outcome data, or

some combination thereof.

3. The method of claim 1, wherein the translating the first data structure further comprises:

converting the first data structure having the first format to the second data structure having the second format according to a specific set of rules executed by the artificial intelligence engine.

4. The method of claim 1, wherein converting the biological context representation to the second data structure having the second format further comprises:

obtaining a higher-dimensional vector from the biological context representation; and

compressing the higher-dimensional vector to a lower-dimensional vector, wherein the compressing is performed by a machine learning model trained to perform deep auto-encoding via a recurrent neural network configured to output the lower-dimensional vector.

5. The method of claim 4, further comprising:

training the machine learning model by using a second machine learning model to recreate the higher-dimensional vector,

wherein the second machine learning model is trained to perform a decoding operation to recreate the higher-dimensional vector, wherein the decoding operation is performed on the lower-dimensional vector.
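The encoder and decoder arrangement of claims 4 and 5 can be illustrated with a small sketch. Claim 4 recites deep auto-encoding via a recurrent neural network. The sketch below swaps in a plain linear autoencoder trained by stochastic gradient descent, purely to show a first model compressing a higher-dimensional vector while a second model is trained to recreate it. All function names, the toy data, and the hyperparameters are assumptions.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def reconstruction_error(enc, dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat = matvec(dec, matvec(enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train_autoencoder(data, low_dim, epochs=200, lr=0.05, seed=0):
    """Jointly train an encoder (compress) and a decoder (recreate)."""
    rng = random.Random(seed)
    high_dim = len(data[0])
    enc = make_matrix(low_dim, high_dim, rng)   # first model: encoder
    dec = make_matrix(high_dim, low_dim, rng)   # second model: decoder
    for _ in range(epochs):
        for x in data:
            z = matvec(enc, x)                  # lower-dimensional vector
            x_hat = matvec(dec, z)              # recreated vector
            err = [a - b for a, b in zip(x_hat, x)]
            # Gradient with respect to z, computed before dec is updated.
            grad_z = [sum(dec[i][j] * err[i] for i in range(high_dim))
                      for j in range(low_dim)]
            for i in range(high_dim):
                for j in range(low_dim):
                    dec[i][j] -= lr * err[i] * z[j]
            for j in range(low_dim):
                for i in range(high_dim):
                    enc[j][i] -= lr * grad_z[j] * x[i]
    return enc, dec


# Toy 4-dimensional vectors that lie on a 2-dimensional subspace.
data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0, 1.0], [0.5, 0.2, 0.5, 0.2]]

enc0, dec0 = train_autoencoder(data, low_dim=2, epochs=0)  # untrained weights
before = reconstruction_error(enc0, dec0, data)
enc, dec = train_autoencoder(data, low_dim=2)
after = reconstruction_error(enc, dec, data)
compressed = matvec(enc, data[0])  # lower-dimensional vector for one input
```

Comparing `before` and `after` shows whether training reduced the error of recreating the higher-dimensional vectors from their two-dimensional codes.
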

6. The method of claim 1, wherein:

the translating is performed by a recurrent neural network,

the generating of the plurality of candidate drug compounds is performed by a generative adversarial network, and

the classifying of the candidate drug compound is performed by a classifier trained using supervised learning.

7. The method of claim 1, further comprising:

generating a plurality of views comprising a representation of a design space; and

causing the plurality of views to be presented on a computing device,

wherein the representation pertains to: antimicrobial activity, immunomodulatory activity, neuromodulatory activity, cytotoxic activity, or some combination thereof.

8. The method of claim 7, wherein the design space is antimicrobial for prosthetic joint infections.

9. The method of claim 7, wherein each view of the plurality of views presents an optimized sequence representing the selected candidate drug compound.

10. The method of claim 9, wherein each view presents a topographical heatmap comprising the optimized sequence overlaid on indicators ranging from at least one least active property to at least one most active property.

11. The method of claim 1, further comprising:

performing one or more modifications pertaining to the biological context representation, the second data structure having the second format, or some combination thereof; and

using causal inference to determine whether the one or more modifications provide one or more desired performance results.

12. The method of claim 11, wherein using causal inference further comprises:

using counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.
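As a hedged illustration of the counterfactual reasoning recited in claim 12, the sketch below uses the standard three-step recipe of abduction, action, and prediction on a toy linear structural model. The model form, the class name `LinearSCM`, and the numeric coefficients are assumptions chosen for the example; the claim does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearSCM:
    """Toy structural model: outcome = base + effect * modified + noise."""
    base: float
    effect: float

    def outcome(self, modified: int, noise: float) -> float:
        return self.base + self.effect * modified + noise

    def counterfactual(self, observed_modified: int,
                       observed_outcome: float,
                       alt_modified: int) -> float:
        # Abduction: recover the noise term consistent with the past result.
        noise = observed_outcome - self.outcome(observed_modified, 0.0)
        # Action and prediction: change the modification, keep the same noise.
        return self.outcome(alt_modified, noise)


# Hypothetical coefficients; in practice these would be estimated from data.
model = LinearSCM(base=0.30, effect=0.25)

# Past occurrence: no modification was applied and activity 0.40 was observed.
observed_activity = 0.40
alternative_activity = model.counterfactual(0, observed_activity, 1)
gain = alternative_activity - observed_activity
```

Here `gain` estimates how much the same compound's activity would have changed had the modification been applied, which is the kind of alternative scenario that could then be compared against a desired performance result.
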

13. The method of claim 1, further comprising:

causing the selected candidate drug compound to be formulated.

14. The method of claim 1, further comprising:

causing the selected candidate drug compound to be created.

15. The method of claim 1, further comprising:

causing the selected candidate drug compound to be presented on a computing device.

16. A tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to:

generate a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translate the first data structure having the first format to a second data structure having a second format;

generate, based on the second data structure having the second format, a plurality of candidate drug compounds; and

classify a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

17. The computer-readable medium of claim 16, wherein the biological context representation comprises, for each of the plurality of drug compounds, one or more relationships between or among:

physical properties data,

chemical data,

biological data,

clinical outcome data, or

some combination thereof.

18. The computer-readable medium of claim 16, wherein translating the biological context representation to the second data structure having the second format further comprises:

obtaining a higher-dimensional vector from the biological context representation; and

compressing the higher-dimensional vector to a lower-dimensional vector, wherein the compressing is performed by a machine learning model trained to perform deep auto-encoding via a recurrent neural network configured to output the lower-dimensional vector.

19. The computer-readable medium of claim 18, wherein the processing device further:

trains the machine learning model by using a second machine learning model to recreate the higher-dimensional vector,

wherein the second machine learning model is trained to perform a decoding operation to recreate the higher-dimensional vector, wherein the decoding operation is performed on the lower-dimensional vector.

20. The computer-readable medium of claim 16, wherein:

the translating is performed by a recurrent neural network,

the generating of the plurality of candidate drug compounds is performed by a generative adversarial network, and

the classifying of the candidate drug compound is performed by a classifier trained using supervised learning.

21. The computer-readable medium of claim 16, wherein the processing device further:

generates a plurality of views comprising a representation of a design space,

wherein the representation pertains to: antimicrobial activity, immunomodulatory activity, neuromodulatory activity, cytotoxic activity, or some combination thereof.

22. The computer-readable medium of claim 21, wherein the processing device further:

causes each view of the plurality of views to be presented, wherein each such view comprises an optimized sequence representing the selected candidate drug compound, and

causes each view of the plurality of views to be presented, wherein each such view comprises the optimized sequence overlaid on indicators ranging from at least one least active property to at least one most active property.

23. The computer-readable medium of claim 16, wherein the processing device further:

performs one or more modifications pertaining to the biological context representation, the second data structure having the second format, or some combination thereof; and

uses causal inference to determine whether the one or more modifications provide a desired performance result.

24. The computer-readable medium of claim 23, wherein the processing device employs causal inference to further:

use counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.

25. A system comprising:

a memory device storing instructions; and

a processing device communicatively coupled to the memory device, wherein the processing device executes the instructions to:

generate a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translate the first data structure having the first format to a second data structure having a second format;

generate, based on the second data structure having the second format, a plurality of candidate drug compounds; and

classify a candidate drug compound from the plurality of candidate drug compounds as a selected candidate drug compound.

26. A computing device comprising:

a display screen;

a memory device storing instructions; and

a processing device communicatively coupled to the memory device and the display screen, wherein the processing device executes the instructions to:

receive, from an artificial intelligence engine, a candidate drug compound generated by the artificial intelligence engine;

generate a view comprising the candidate drug compound overlaid on a representation of a design space, wherein the view presents a topographical heatmap of the representation of the design space, and the topographical heatmap includes the candidate drug compound overlaid on indicators ranging from at least one least active property to at least one most active property; and

present the view on the display screen.

27. A method comprising:

generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;

generating, based on the second data structure having the second format, a plurality of candidate anti-biofilm compounds; and

classifying a candidate anti-biofilm compound from the plurality of candidate anti-biofilm compounds as a selected candidate anti-biofilm compound.

28. A method comprising:

generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;

generating, based on the second data structure having the second format, a plurality of candidate anti-cancer compounds; and

classifying a candidate anti-cancer compound from the plurality of candidate anti-cancer compounds as a selected candidate anti-cancer compound.

29. A method comprising:

generating a biological context representation of a plurality of drug compounds, wherein the biological context representation comprises a first data structure having a first format;

translating, by an artificial intelligence engine, the first data structure having the first format to a second data structure having a second format;

generating, based on the second data structure having the second format, a plurality of candidate antimicrobial compounds; and

classifying a candidate antimicrobial compound from the plurality of candidate antimicrobial compounds as a selected candidate antimicrobial compound.