FIELD-BASED SIMILARITY SEARCH SYSTEM AND METHOD
A field-based similarity search system includes an input device which inputs a query molecule, and a processor which partitions a conformational space of the query molecule into a fragment graph including an acyclic graph including plural fragment nodes connected by rotatable bond edges, computes a property field on fragment pairs of fragments of the query molecule from the fragment graph, the property field including a local approximation of a property field of the query molecule, constructs a set of features of the fragment pairs based on the property field, the features including a set of local, rotationally invariant, and moment-based descriptors generated from all conformations of the fragment graph of the query molecule, and weights the descriptors according to importance as perceived from a training set of descriptors to generate a context-adapted descriptor-to-key mapping which maps the set of descriptors to a set of feature keys.
The present application is a Divisional application of U.S. patent application Ser. No. 13/597,385, filed on Aug. 29, 2012, which was a Continuation application of U.S. patent application Ser. No. 13/113,256, filed on May 23, 2011, which was a Divisional application of U.S. patent application Ser. No. 12/544,889 filed on Aug. 20, 2009, which was a Divisional application of U.S. patent application Ser. No. 10/102,902 filed on Mar. 22, 2002, which claims the benefit of Provisional Application No. 60/278,260 which was filed on Mar. 23, 2001, and which are incorporated herein by reference.
This application is related to U.S. Pat. No. 6,349,265 assigned to International Business Machines Corporation, and U.S. patent application Ser. No. 09/275,568 filed on Mar. 24, 1999 (assigned to International Business Machines Corporation and having assignee's Docket No. YO998-0112), which are also incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a field-based similarity search system and method, and more particularly, a field-based similarity search system and method which identifies similar molecules based on fragment pair feature similarities.
2. Description of the Related Art
An important problem in drug discovery efforts is finding molecules that have a similar function to a molecule that is known to be active towards a biological target or that elicits a biological response of interest. Commonly, detailed structural information about the mechanism of action is unknown. Although several conventional methods exist to search for or create such compounds, the most common involve a conjectured conformation of the molecule of interest (e.g., query molecule), or the repeated application of a method to a series of query conformers.
More specifically, there exist several conventional methods of aligning a group of flexible molecules to a particular query conformation (e.g., molecular superposition) using computer-assisted drug design (see, e.g., C. Lemmen, T. Lengauer, and G. Klebe, FLEXS: A method for fast flexible ligand superposition. Journal of Medicinal Chemistry, 41:4502-4520, 1998; Michael D. Miller, Robert P. Sheridan, and Simon K. Kearsley, Sq: A program for rapidly producing pharmacophorically relevant molecular superpositions, J. Med. Chem., 42:1505-1514, 1999; Christian Lemmen, Claus Hiller, and Thomas Lengauer, Rigfit: a new approach to superimposing ligand molecules. Journal of Computer-Aided Molecular Design, 11:357 368, 1997; Gerhard Klebe, Thomas Mietzner, and Frank Weber, Different approaches toward an automatic structural alignment of drug molecules: Applications to sterol mimics, thrombin and thermolysin inhibitors, Journal of Computer-Aided Molecular Design, 8:751778, 1994; Simon K. Kearsley and Graham M. Smith, An alternative method for the alignment of molecular structures: maximizing electrostatic and steric overlap, Journal of Computer Aided Molecular Design, 8:565 582, 1994; Colin McMartin and Regine S. Bohacek, Flexible matching of test ligands to a 3d pharmacophore using a molecular superposition force field: Comparison of predicted and experimental conformations of inhibitors of three enzymes. J. Med. Chem., 42:1505 1514, 1999; S. Handschuh, M. Wagnener, and J. Gasteiger, Superposition of three-dimensional chemical structures allowing for conformational flexibility by a hybrid method, J. Chem. Inf. Comput. Sci., 38:220-232, 1998; J. Mestres. D. C. Rohrer, and G. M. Maggiora, A molecular field-based similarity approach to pharmacophoric pattern recognition, J. Mol. Graph. Modeling, 15:114-21, 1997; Y. C. Martin, M. G. Bures, E. A. Danaher, J. DeLazzer, J. Lico, and P. A. 
Pavlik, A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists, Journal of Computer Aided Molecular Design, 7:83-102, 1993; Peter Willett, Searching for pharmacophoric patterns in databases of three dimensional chemical structures, Journal of Molecular Recognition, 8:290-303, 1995; Gareth Jones, Peter Willett, and Robin C. Glen, A genetic algorithm for flexible molecular overlay and pharmacophore elucidation, J. Comput.-Aided Mol. Des., 9:532-549, 1995; D. A. Thorner, D. J. Wild, P. Willett, and P. M. Wright, Similarity searching in files of three-dimensional chemical structures: flexible field-based searching of molecular electrostatic potentials, Journal of Chemical Information and Computer Sciences, 36:900 908, 1996; D. J. Wild and P. Willett. Similarity searching in files of 3-dimensional chemical structures—alignment of molecular electrostatic potential fields with a genetic algorithm, Journal of Chemical Information and Computer Sciences, 36:159-167, 1996; David A. Thorner, Peter Willett, Robert C. Glen, P. M. Wright, and Robin Taylor, Similarity searching in files of three-dimensional chemical structures” representation and searching of molecular electrostatic potentials using field-graphs, J. Comput.-Aided Mol. Des., 1:163 174, 1997; and Gerhard Klebe, Structural alignment of molecules, In Hugo Kubinyi, editor, 3D QSAR in Drug Design, pages 173-199. ESCOM, Leiden, 1993).
In the absence of structural information regarding the ligand-receptor or ligand-enzyme complex, structural alignment is both a way of elucidating important features responsible for activity (Ki Hwan Kim, List of comfa references 1993-1997, In Hugo Kubinyi, Gerd Folkers, and Yvonne C. Martin, editors, 3D QSAR in Drug Design, volume 3, pages 317-338, Kluwer, Dordrecht/Boston/London, 1998; and Gerhard Klebe, Comparative molecular similarity indices analysis, In Hugo Kubinyi, Gerd Folkers, and Yvonne C. Martin, editors, 3D QSAR in Drug Design, volume 3, pages 87-104, Kluwer, Dordrecht/Boston/London, 1998) and a means of finding new molecules with similar or better activity (Michael D. Miller, Robert P. Sheridan, and Simon K. Kearsley, Sq: A program for rapidly producing pharmacophorically relevant molecular superpositions, J. Med. Chem., 42:1505-1514, 1999; Andrew C. Good and Jonathan S. Mason, Three dimensional structure database searches, In Kenny B. Lipkowitz and Donald B. Boyd, editors, Reviews in Computational Chemistry, volume 7, chapter 2, pages 67-117. VCH Publishers, Inc., New York, 1996; and Peter Willett, Searching for pharmacophoric patterns in databases of three dimensional chemical structures, Journal of Molecular Recognition, 8:290-303, 1995).
Generally, when one is attempting to elucidate spatial and chemical information about the nature of the host-ligand interaction, one often begins with the alignment of a series of active compounds based on some kind of alignment rule. Unfortunately, this process is riddled with difficulties and assumptions about the relevant conformations, relevant features, importance of internal strain, the role of hydrogen bonds, electrostatics, solvation and hydrophobicity, as well as more profound concerns such as whether compounds in a data set even bind at the receptor site via the same mechanism. It is clear that no single method for alignment will settle these issues across widely varying contexts.
Several conventional superposition methods reported are field-based (see, e.g., Michael D. Miller, Robert P. Sheridan, and Simon K. Kearsley, Sq: A program for rapidly producing pharmacophorically relevant molecular superpositions, J. Med. Chem., 42:1505-1514, 1999; Christian Lemmen, Claus Hiller, and Thomas Lengauer, Rigfit: a new approach to superimposing ligand molecules, Journal of Computer-Aided Molecular Design, 11:357 368, 1997; J. Mestres. D. C. Rohrer, and G. M. Maggiora, A molecular field-based similarity approach to pharmacophoric pattern recognition, J. Mol. Graph. Modeling, 15:114-21, 1997; and D. A. Thorner, D. J. Wild, P. Willett, and P. M. Wright, Similarity searching in files of three-dimensional chemical structures: flexible field-based searching of molecular electrostatic potentials, Journal of Chemical Information and Computer Sciences, 36:900 908, 1996). An attractive aspect of field-based approaches is the potential for incorporating high levels of theory into the field. Apart from the difficulties and expense of deploying high level quantum mechanical calculations, the design of a system that can utilize the results of such calculations for use in similarity analysis is considered forward looking.
However, such conventional field-based approaches are confined to a particular field definition. For example, a conventional system designed for simplistic, phenomenological fields cannot be applied to fields derived from quantum mechanical calculations. This severely limits the versatility of these conventional systems.
SUMMARY OF THE INVENTION
In view of the foregoing, and other problems, disadvantages, and drawbacks of conventional systems and methods, the present invention has been devised having as its objective, to provide a fast and efficient field-based similarity search system and method.
The present invention includes a field-based similarity search system which includes a database for storing at least one candidate molecule, an input device for inputting a query molecule, and a processor for identifying a candidate molecule which is similar to the query molecule based on a similarity of fragment pair features.
Specifically, the processor may identify a feature of a query molecule fragment pair. The processor may also match a query molecule fragment pair feature with a candidate molecule fragment pair feature, and retrieve a candidate molecule fragment pair having at least a predetermined number of matching features.
Further, the processor may align retrieved fragment pairs using a pose clustering method. The processor may also construct a fully aligned candidate molecule by assembling retrieved fragment pairs. The processor may also generate the fragment pair for the query molecule.
In addition, the inventive system may include a display device for displaying an output from the processor. The system may also include a memory device for storing a query molecule fragment pair feature. Further, the memory device may store data and instructions to be executed by the processor.
More specifically, the fragment pair may include two neighboring fragments connected by a rotatable bond at a specific dihedral angle. Further, the feature may include a generalization of a CoMMA descriptor.
In another aspect, the present invention includes an inventive field-based similarity search method which includes generating a conformational space representation and an arbitrary description of a three-dimensional property field for a flexible molecule, characterizing parts of the flexible molecule and representing a query molecule using the arbitrary description for a comparison with the conformational space representation, and aligning the parts to the query molecule and assembling the parts to form an alignment onto the query molecule.
Alternatively, the inventive field-based similarity search method may include identifying a feature of a query molecule fragment pair, matching a query molecule fragment pair feature with a candidate molecule fragment pair feature; and retrieving a candidate molecule fragment pair having at least a predetermined number of matching features.
Alternatively, the inventive field-based similarity search method may include a method for finding alignments of a flexible molecule to a query molecule. The method may include, representing a conformation space of the flexible molecule, generating an arbitrary description of a three-dimensional property field of the flexible molecule, characterizing parts of the flexible molecule using the arbitrary description, representing a query molecule using the arbitrary description for a comparison with molecules represented by the conformation space, aligning the parts of the flexible molecule to the query molecule, and assembling the parts of the flexible molecule to form an alignment thereof onto the query molecule.
In another aspect, the present invention includes a programmable storage medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a similarity search method, the method including generating a conformational space representation and an arbitrary description of a three-dimensional property field for a flexible molecule, characterizing parts of the flexible molecule and representing a query molecule using the arbitrary description for a comparison with the conformational space representation, and aligning the parts to the query molecule and assembling the parts to form an alignment onto the query molecule.
With its unique and novel features, the present invention provides a fast and efficient field-based similarity search system and method which are not confined to a particular field definition. For example, the inventive system and method may be designed for simplistic, phenomenological fields, as well as for fields derived from quantum mechanical calculations. Therefore, the claimed system and method provide improved versatility over conventional systems and methods.
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
Referring now to the drawings,
As shown in
Specifically, the processor 130 may be a central processing unit (CPU) or a plurality of processors. Further, the processor 130 may identify a feature of a query molecule fragment pair. The processor 130 may also match a query molecule fragment pair feature with a candidate molecule fragment pair feature, and retrieve a candidate molecule fragment pair having at least a predetermined number of matching features. Further, the inventive system 100 may also include a memory device (e.g., RAM, ROM, etc.) for storing instructions for performing an operation using the processor 130. The memory device may also be used to store features, fragment pairs, matches of molecules, or any other data used or generated by the inventive system.
Further, the inventive system 100 may also store specific algorithms in the memory device for performing a particular operation. The processor 130 may, therefore, access the memory device and/or database in order to process data according to a particular algorithm stored therein. For instance, the processor 130 may access data stored in the memory device or database (e.g., plurality of databases) which pertains to a query molecule or candidate molecule, in order to process such data and, therefore, determine whether the candidate molecule is similar to the query molecule.
Further, the processor 130 may align retrieved fragment pairs using a pose clustering method. The processor 130 may also construct a fully aligned candidate molecule by assembling retrieved fragment pairs. The processor may also generate the fragment pair for the query molecule.
Further, the inventive system 100 may include a display device (e.g., video display unit) for displaying, for example, the query molecule, a candidate molecule, or other data used or generated by the inventive system 100. Further, a user may utilize the display device in order to help the user direct an operation in the inventive system 100. For instance, the display device may be used with the input device 120 so that the user may input instructions, data, etc. into the inventive system for storage in the database or memory device, and/or for processing by the processor 130.
An important feature of the inventive system 100 is that it allows the incorporation of context-specific information to balance considerations in a manner suitable to the problem at hand. Because one-time calibrations of similarity measures are inappropriate across differing contexts, the basis of a similarity search is tunable to the particular context being considered.
In short, the present invention provides a systematic way of treating field-based similarity searching, without being confined to a particular field definition. In other words, the present invention allows for fields ranging from simplistic, phenomenological fields, like the examples discussed below, to fields derived from quantum mechanical calculations.
Overview
The inventive system 100 includes a database 110 which may include, for example, a set of molecules over which to conduct a search. This set of molecules may be an explicitly defined set of molecules, or may be a “virtual” set of molecules (e.g., a set which is built from all or some combination of smaller molecules according to specified rules).
For example, the database 110 may include a database of molecules, each with a single conformation, which is defined by a specification of its atoms, bonds and atomic positions. Atom specification may be by element, a chemical tag denoting an atom in a certain environment, and/or an atom that is to be treated in a certain way. Bond specification may involve a reference to two atoms, and a bond type. Bond types can be a valence bond order, a chemical tag denoting the bond in a particular environment, and/or a bond that is to be treated in a certain way.
For a single conformation of the molecule to be specified, Cartesian coordinates may be assigned to each atom. The coordinates can originate from x-ray crystal structure, nuclear magnetic resonance (NMR) solution structure, molecular mechanics, molecular dynamics, any form of quantum mechanical or hybrid quantum/classical calculations, or an entirely heuristic procedure.
Further, additional rules regarding how the molecules could combine to make larger molecules may be added. For example, given two molecules A and B, a rule may specify that A and B can combine to make C by a series of operations, such as alignment of atoms of A and B, addition or deletion of atoms and/or bonds in A and/or B, and conformation adjustments to A, B, and/or C. In this case, the database 110 would include all or some of the possible C's that result from the application of a rule set comprised of any of the operations mentioned above.
Specifically, the inventive system 100 represents the conformational space for each molecule by a set of rules, rather than by explicit enumeration of whole-molecule conformers. This eliminates the intractable storage requirements that explicit enumeration imposes for large sets of flexible molecules.
Specifically, each molecule may be partitioned into smaller fragments by the selection (either manual or algorithmic) of a set of bonds to cleave. The allowed angles for a cleavage bond are enumerated explicitly for each pair of conformers in the corresponding node sets. Conformers of the fragments isolated by cleavage bonds are also explicitly enumerated. A data representation in the form of a graph is constructed such that each node of this graph is a set of fragment conformers and each edge of this graph is a set of allowed cleavage bond angles. Each conformer of the molecule can then be expressed as a selection of one member in each node set (fragment conformer) and edge set (cleavage bond angle).
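By way of illustration only, the graph representation described above may be sketched in Python as follows. The class name `FragmentGraph`, its fields, and the conformer labels are hypothetical and do not appear in the source. As a simplifying assumption, the allowed angles on an edge are taken to be independent of which fragment conformers are selected, whereas the description enumerates them per pair of conformers.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, Iterator, List, Tuple


@dataclass
class FragmentGraph:
    """Rule-based conformational space (hypothetical layout, for illustration)."""
    # node name -> explicitly enumerated fragment conformer labels
    nodes: Dict[str, List[str]] = field(default_factory=dict)
    # (node_a, node_b) -> allowed cleavage-bond dihedral angles, in degrees
    edges: Dict[Tuple[str, str], List[float]] = field(default_factory=dict)

    def conformer_count(self) -> int:
        """Number of whole-molecule conformers implied by the graph."""
        total = 1
        for conformers in self.nodes.values():
            total *= len(conformers)
        for angles in self.edges.values():
            total *= len(angles)
        return total

    def conformers(self) -> Iterator[Tuple[Tuple[str, ...], Tuple[float, ...]]]:
        """Lazily yield each conformer as one choice per node set and per edge set."""
        for fragments in product(*self.nodes.values()):
            for angles in product(*self.edges.values()):
                yield fragments, angles


# Two fragment nodes joined by one cleavage bond sampled at three angles.
graph = FragmentGraph(
    nodes={"A": ["a0", "a1"], "B": ["b0", "b1", "b2"]},
    edges={("A", "B"): [60.0, 180.0, 300.0]},
)
all_conformers = list(graph.conformers())
```

In this sketch only 2 + 3 fragment conformers and 3 angles are stored, while the 18 implied whole-molecule conformers are generated on demand.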
Alternatively, the fragment sets of molecules may be left disconnected in terms of a graph describing a larger molecule. A set of rules, likened to reaction descriptions, embodied as transformations such as adding or deleting atoms and/or bonds, and/or translating and/or rotating positions, can then be expressed in terms of matching topology of a fragment, and thus need not refer to particular fragment sets, but any fragment set possessing one or more instances of the topology on which the transformations operate. The rules may then denote the resulting conformations of their application. The result of any such application can itself be included in the pool of available fragment sets for further application.
A key to the system 100 is that it avoids the need to search every conformation of every molecule and apply a similarity metric to the query at each possible alignment. This is done by using selected features from fragment pairs to align them to the query. Thus, unlike previous works (e.g., see U.S. Pat. No. 6,349,265 and U.S. Prov. patent application Ser. No. 09/275,568) the inventive system 100 uses fragment pairs in place of the stored molecule. With said technology, features from the fragment pairs in the database may be matched with features of the query molecule.
Once features have been matched, the transformations used to align them may be classified by said methodology. The transformations may be grouped together, for example, using a clustering procedure or a binning procedure. A score may be assigned to the transformation groups based on group membership and/or other criteria based on the chemical and/or conformational similarity of the implied fragment pair alignment. Based on these scoring criteria, one may select the best alignments of fragment pairs to the query.
The user, therefore, has the option of combining the fragment pairs that have been aligned to the query to create larger or complete molecules. For example, a user may be restricted to only suggesting molecules in a given database. In this case, combining rules ensure that fragment pairs are not combined with others that do not share a common fragment conformation. If partial structures are allowed, then after combinations of fragment pairs are made, each product may be evaluated by applying scoring criteria that qualify candidates. Otherwise, candidate qualification criteria may be applied only to completely constructed structures.
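The binning procedure mentioned above may be sketched, for illustration only, as follows. The sketch assumes that each candidate transformation is given as three translation components and three rotation angles in degrees, and it scores a group simply by its membership count. The pose layout, the bin sizes, and the function names are assumptions rather than details taken from the source, and binning Euler-like angles is a simplification of a proper rotation metric.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical pose layout: (tx, ty, tz, rx, ry, rz), angles in degrees.
Pose = Tuple[float, float, float, float, float, float]
BinKey = Tuple[int, ...]


def bin_key(pose: Pose, trans_step: float, rot_step: float) -> BinKey:
    """Quantize a pose so that similar transformations share a key."""
    n_rot_bins = round(360.0 / rot_step)
    trans = tuple(round(t / trans_step) for t in pose[:3])
    # Wrap angles so that 358 degrees and 2 degrees land in the same bin.
    rot = tuple(round((a % 360.0) / rot_step) % n_rot_bins for a in pose[3:])
    return trans + rot


def group_poses(poses: Sequence[Pose], trans_step: float = 1.0,
                rot_step: float = 30.0) -> Dict[BinKey, List[int]]:
    """Group pose indices by their quantized key."""
    groups: Dict[BinKey, List[int]] = defaultdict(list)
    for index, pose in enumerate(poses):
        groups[bin_key(pose, trans_step, rot_step)].append(index)
    return dict(groups)


def best_group(groups: Dict[BinKey, List[int]]) -> List[int]:
    """Score each group by membership count and return the largest."""
    return max(groups.values(), key=len)


poses = [
    (1.1, 0.0, 2.0, 10.0, 0.0, 90.0),
    (0.9, 0.2, 2.1, 12.0, 358.0, 92.0),
    (1.2, -0.1, 1.9, 8.0, 2.0, 88.0),
    (5.0, 5.0, 5.0, 180.0, 90.0, 0.0),  # outlier correspondence
]
winner = best_group(group_poses(poses))
```

Three of the four feature correspondences imply nearly the same transformation and fall into one bin, so that group would be selected as the best alignment under this membership-only score.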
Alternatively, combinations of fragments according to reaction rules as described above may be allowed. These fragments need not belong to the same molecule. Combinations of fragment pairs may be directed by reaction rules, rather than the restriction of a common conformation. Each time a combination is made, the result may be checked for candidate qualification. Candidates may be stored, for example, in a database. The resulting set of molecule conformers may also be ranked according to a similarity metric.
Flexible Molecules
The inventive system 100 may represent the conformational space of a flexible molecule in terms of fragments and torsional angles of allowed conformations. A user definable property field may be used to compute features of fragment pairs. Features may be considered generalizations of CoMMA descriptors that characterize local regions of the property field by its local moments (B. D. Silverman and D. E. Platt, Comparative molecular moment analysis (comma): 3d-qsar without molecular superposition, J. Med. Chem., 39:2129-2140, 1996). The features are invariant under coordinate system transformations.
Features taken from a query molecule are used to form alignments with fragment pairs in the database. An assembly algorithm may then be used to merge the fragment pairs into full structures, aligned to the query. Important to the system 100 is the use of a context adaptive descriptor scaling procedure as the basis for similarity. This helps to allow the user to tune the weights of the various feature components based on examples relevant to the particular context under investigation.
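To illustrate why moment-based features can be invariant under coordinate system transformations, the following Python sketch computes a few invariants of a sampled scalar field. It is not the descriptor set of the inventive system: the choice of invariants (total weight, and the trace, squared Frobenius norm and determinant of the second-moment tensor about the weighted centroid) and all names are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def moment_invariants(points: Sequence[Point],
                      weights: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (m0, trace, frobenius_sq, det) of the weighted second-moment tensor.

    The tensor is taken about the weighted centroid, so the values do not
    change under translation; trace, sum of squared entries and determinant
    are unchanged by rotation.
    """
    m0 = sum(weights)
    centroid = [sum(w * p[a] for p, w in zip(points, weights)) / m0 for a in range(3)]
    s = [[0.0] * 3 for _ in range(3)]
    for p, w in zip(points, weights):
        d = [p[a] - centroid[a] for a in range(3)]
        for a in range(3):
            for b in range(3):
                s[a][b] += w * d[a] * d[b]
    trace = s[0][0] + s[1][1] + s[2][2]
    frobenius_sq = sum(s[a][b] ** 2 for a in range(3) for b in range(3))
    det = (s[0][0] * (s[1][1] * s[2][2] - s[1][2] * s[2][1])
           - s[0][1] * (s[1][0] * s[2][2] - s[1][2] * s[2][0])
           + s[0][2] * (s[1][0] * s[2][1] - s[1][1] * s[2][0]))
    return m0, trace, frobenius_sq, det


def rotate_z_and_shift(p: Point, degrees: float, shift: Point) -> Point:
    """Rotate a point about the z axis, then translate it."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return (c * p[0] - s * p[1] + shift[0], s * p[0] + c * p[1] + shift[1], p[2] + shift[2])


# Field samples (positions and positive field values) in one frame ...
points: List[Point] = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
weights = [1.0, 2.0, 0.5, 1.5]
# ... and the same samples seen from a rotated, translated frame.
moved = [rotate_z_and_shift(p, 40.0, (3.0, -1.0, 2.0)) for p in points]

original = moment_invariants(points, weights)
transformed = moment_invariants(moved, weights)
```

Because the two tuples agree to within rounding, such quantities can be compared between a query and a database entry without first aligning the two.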
The property fields may range from simple, phenomenological fields, to fields derived from quantum mechanical calculations. For instance, the inventors have applied the inventive system 100 to a dihydrofolate/methotrexate benchmark system, to show that when one injects relevant contextual information into the descriptor scaling procedure, better results are obtained more efficiently.
Thus, the inventors have shown how the inventive system 100 provides for an effective and efficient search. For instance, the inventors have provided computer times for a query from a database that represents approximately 23 million conformers of nine flexible molecules.
Conformational Space and Quantum Mechanical Calculations
Generally, the conformational space for drug-like molecules can become quite appreciable. Some conventional methods represent the conformational space of a molecule as a collection of rigid fragments with preselected torsions (Gerhard Klebe and Thomas Mietzner, A fast and efficient method for generating biologically relevant conformations, Journal of Computer Aided Molecular Design, 8:583-606, 1994). Other approaches prepare a database of representative conformations (Simon K. Kearsley, Dennis J. Underwood, Robert P. Sheridan, and Michael D. Miller, Flexibases: A way to enhance the use of molecular docking methods, Journal of Computer Aided Molecular Design, 8:565-582, 1994; Mathew Hahn, Three-dimensional shape-based searching of conformationally flexible molecules, J. Chem. Inf. Comput. Sci., 37:80-86, 1996), or compute conformations on the fly (Gareth Jones, Peter Willett, and Robin C. Glen, A genetic algorithm for flexible molecular overlay and pharmacophore elucidation, J. Comput.-Aided Mol. Des., 9:532-549, 1995).
The inventive system 100, on the other hand, is based on fragmenting molecules into more manageable partitions. In particular, the inventive system 100 chooses the fragment pair, rather than the fragment, as the smallest irreducible unit of characterization. This treatment makes conformational space more manageable than conventional forms of treatment.
Setting out to address flexible superposition via potentially sophisticated property fields, one must give special attention to reducing the number of similarity evaluations that are to be performed, while maintaining some degree of confidence that the space has been covered. The inventive system 100 attempts to pre-process as much as possible, while still leaving enough tunability to adapt to the context of the investigation. The system 100 seeks a practical tradeoff between the size of conformational space and the need for fragment pairs as large as possible to maximize the relevance of the computed property fields.
More specifically, the inventive system 100 decomposes the conformational space of molecules to fragments. Then, to minimize boundary effects, the system 100 computes the property field on pairs of fragments. From the computed property fields of the fragment pairs, several features may be sampled and stored.
As noted briefly above, features may be considered to be generalizations of CoMMA descriptors that characterize local regions of the property field by its local moments. They are invariant under coordinate system transformations. To query the database for molecules that are similar to a particular molecule (e.g., the query molecule) features are calculated for the query molecule, and fragment pairs that contain a sufficient number of similar features are retrieved.
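The retrieval step described above may be sketched in Python as follows, under the assumption that every feature has already been reduced to a hashable key (for example, a quantized descriptor). The inverted index, the identifiers such as `fp1`, and the function names are hypothetical and serve only to illustrate retrieval of fragment pairs that share at least a predetermined number of features with the query.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Set


def build_index(database: Dict[str, Set[Hashable]]) -> Dict[Hashable, List[str]]:
    """Map each feature key to the fragment pairs that contain it."""
    index: Dict[Hashable, List[str]] = {}
    for pair_id, keys in database.items():
        for key in keys:
            index.setdefault(key, []).append(pair_id)
    return index


def retrieve(index: Dict[Hashable, List[str]],
             query_keys: Iterable[Hashable],
             min_matches: int) -> List[str]:
    """Return fragment pairs sharing at least `min_matches` keys with the query."""
    hits: Counter = Counter()
    for key in set(query_keys):
        for pair_id in index.get(key, []):
            hits[pair_id] += 1
    return sorted(pair_id for pair_id, count in hits.items() if count >= min_matches)


# Hypothetical feature keys for three stored fragment pairs.
database = {"fp1": {1, 2, 3, 4}, "fp2": {3, 4, 5}, "fp3": {9}}
index = build_index(database)
query = [2, 3, 4, 7]

loose = retrieve(index, query, min_matches=2)
strict = retrieve(index, query, min_matches=3)
```

No rotation or translation appears anywhere in the lookup, which reflects the point that invariant features allow retrieval before any alignment is attempted.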
An important point is that, due to the coordinate system invariance of the features, the retrieval can happen without any alignment, or optimization over rotational and translational degrees of freedom. The alignment of retrieved fragment pairs on the query may be determined by a pose clustering procedure from the individual feature correspondences. Finally, to construct fully aligned candidate molecules, the retrieved fragment pairs may be assembled by an incremental buildup procedure, similar in principle to ones used in docking (Matthias Rarey, Bernd Kramer, Thomas Lengauer, and Gerhard Klebe, A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol., 261:470-489, 1996) and de novo design (Hans J. Bohm, The computer program ludi: A new method for the de novo design of enzyme inhibitors, J. Comput.-Aided Mol. Des., 6:61-78, 1992).
Fragment Pairs
There are two types of complexity in a database of three dimensional molecular structures: first, the conformational variety of individual molecules, and second, in the case of a virtual combinatorial library, the combinatorial variety that results from the possibility of synthesizing a large number of different molecules from a small number of reagents. Generally, the total number of three-dimensional structures grows exponentially both with the number of rotatable bonds and the number of reagents. Therefore, such a database should be efficiently represented and stored.
A molecule can be partitioned into a fragment graph. This is an acyclic graph that consists of fragment nodes connected by rotatable bond edges. Within a fragment node, there may be zero, one or several rotatable bonds, as well as other degrees of freedom such as ring conformations. Given a molecule, there are in general multiple possible ways of partitioning it into a fragment graph. Typically, a fragment node may consist of about 10 heavy atoms, for example an aromatic ring plus some substituents.
Further, the substructure represented by a fragment node can in general assume several different conformations. A specific conformation of a fragment node may be referred to as a fragment. A fragment pair consists of two neighboring fragments connected by a rotatable bond at a specific dihedral angle. A schematic representation of a fragment graph is depicted in
Fragment pairs are important entities in the inventive system 100. Property fields are defined and calculated on fragment pairs, and the similarity search is based on rotationally invariant features that are calculated from those property fields. The assumption that underlies the use of fragment pairs is that a property field calculated in the interior of an isolated fragment pair is a good local approximation of the property field of the composite molecule.
Conceptually, the use of fragment pairs may be considered as equivalent to using overlapping fragments, with the overlap being about half their size. This has advantages both for the recognition and for the assembly steps. First, the fragmentation locally distorts the property field in those places where the molecule is cut. By using fragment pairs, the regions around the fragment joints are always in the interior of at least one fragment pair, such that meaningful local descriptors can be calculated for them. Further, an aligned database molecule is constructed by assembling fragment pairs that have one fragment in common, and which both locally match the query with compatible orientations. Thus, the relevant dihedral range of the connecting rotatable bonds determined (e.g. from steric and energetic criteria) is already available in the pre-computed fragment pairs.
Obviously, this approach is suitable both for conventional molecule libraries, as well as for virtual libraries supporting combinatorial chemistry approaches. The efficiency of the fragment pair representation is best discussed by way of an example. As indicated in
$C_{MOL} = 6^{5} \cdot 2 = 15552$. (1)
In practice, fewer conformations have to be considered because some are sterically forbidden. Similar to equation (1), $C_{FP1}$, the number of fragment pairs from the lower and middle fragment node, and $C_{FP2}$, the number of fragment pairs from the middle and upper fragment node, are
$C_{FP1} = 6 \cdot 6 \cdot 2 = 72$, $C_{FP2} = 2 \cdot 6 \cdot 36 = 432$. (2)
The total number of fragment pairs is therefore 72+432=504. Furthermore, the size of a fragment pair is in this example only about ⅔ of the size of the whole molecule, with a corresponding smaller number of local descriptors, so that the fragment pair representation needs about 50 times less storage than the brute force enumeration.
More generally, for a molecule consisting of n fragments, each of which has CFRAG conformations, and which are connected by rotatable bond edges that are sampled in CRBE steps, the total number of conformations is
$C_{MOL} = C_{FRAG}^{n} \cdot C_{RBE}^{n-1},$ (3)
where n−1 is the number of rotatable bond edges. In comparison, the total number of fragment pairs is
$C_{FP} = (n-1) \cdot C_{FRAG}^{2} \cdot C_{RBE}.$ (4)
Note that equation (3) grows exponentially with n, whereas equation (4) only depends linearly on n.
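The counts in equations (1) through (4) can be checked with a short sketch. It assumes a linear chain of fragment nodes, and it reads the worked example as nodes with 6, 2, and 36 conformations joined by bonds sampled in 6 steps each. That reading is inferred from equation (2), and the function names are illustrative only.

```python
from math import prod


def count_conformations(frag_confs, bond_steps):
    """Brute-force conformer count: product over all fragment nodes and bond edges."""
    if len(bond_steps) != len(frag_confs) - 1:
        raise ValueError("a linear chain of n fragments has n - 1 rotatable bond edges")
    return prod(frag_confs) * prod(bond_steps)


def count_fragment_pairs(frag_confs, bond_steps):
    """Fragment-pair count: one term per rotatable bond edge of the chain."""
    return sum(
        frag_confs[i] * bond_steps[i] * frag_confs[i + 1]
        for i in range(len(bond_steps))
    )


# Assumed reading of the worked example: 6, 2 and 36 conformations, 6 steps per bond.
frag_confs, bond_steps = [6, 2, 36], [6, 6]
c_mol = count_conformations(frag_confs, bond_steps)   # 6 * 2 * 36 * 6 * 6 = 15552
c_fp = count_fragment_pairs(frag_confs, bond_steps)   # 72 + 432 = 504
```

For uniform chains the first count grows as a power of n while the second grows in proportion to n − 1, which is the contrast drawn between equations (3) and (4).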
The Property Field
The three-dimensional similarity searching and alignment are based on two property fields μ({right arrow over (r)}) and ρ({right arrow over (r)}). The inventive system 100 makes no assumptions about these fields, except that μ({right arrow over (r)}) is scalar and positive. It may also be assumed that ρ({right arrow over (r)}) is a single scalar, but this can be straightforwardly extended to multiple scalars, vectors or tensors.
Both fields μ({right arrow over (r)}) and ρ({right arrow over (r)}) are used to identify similar regions in query and database molecules. Their geometrical alignment however, is performed solely on the basis of field μ({right arrow over (r)}).
A simple-minded property field can be defined as:
Here the j-th atom is located at {right arrow over (a)}j, and its electronegativity Aj is given according to the Allred scale (Bodie E. Douglas, Darl H. McDaniel, and John Alexander, Concepts and Models of Inorganic Chemistry. John Wiley & Sons, 1983). The inventors used the following values: AC=2.6, AO=3.4, AN=3.0, AH=2.2, AP=2.2, AF=4.0. σ is a parameter that controls the range of the Gaussian smearing function. In their work, the inventors used σ=0.5 Å. The rationale is to choose a value as big as possible, but small enough for the property field not to be too uniform and unspecific. μ̄ is the average of μ({right arrow over (r)}) over all space. μ({right arrow over (r)}) is positive and ρ({right arrow over (r)}) is analogous to a neutral charge distribution.
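A Gaussian-smeared field of this kind can be sketched as follows. Equation (5) itself is not reproduced in this text, so the functional form used here (a sum of atom-centered Gaussians, each weighted by the electronegativity Aj and normalized to unit integral) is an assumption, as are the function and variable names. Only the electronegativity values and σ=0.5 Å come from the text.

```python
import math

# Electronegativity values quoted in the text.
ELECTRONEGATIVITY = {"C": 2.6, "O": 3.4, "N": 3.0, "H": 2.2, "P": 2.2, "F": 4.0}


def mu_field(point, atoms, sigma=0.5):
    """Evaluate an assumed Gaussian-smeared field at `point` (coordinates in Angstrom).

    atoms: list of (element symbol, (x, y, z)) tuples.
    """
    norm = (2.0 * math.pi * sigma ** 2) ** -1.5   # assumed unit-integral normalization
    total = 0.0
    for element, position in atoms:
        d2 = sum((p - a) ** 2 for p, a in zip(point, position))
        total += ELECTRONEGATIVITY[element] * norm * math.exp(-d2 / (2.0 * sigma ** 2))
    return total


# A carbon and an oxygen atom 1.2 Angstrom apart (hypothetical coordinates).
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0))]
value_at_carbon = mu_field((0.0, 0.0, 0.0), atoms)
```

Because every term is positive, a field built this way satisfies the positivity requirement placed on μ.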
Another possible choice of a property field is
where Mj is the atomic mass and Qj is the atomic charge computed by considering the “fraction of ionic character” of each bond in the molecule (M. Karplus and R. N. Porter, Atoms and Molecules, W. A. Benjamin, Inc., Menlo Park, Calif., 1971).
These example fields are by no means intended to offer new insight into processes that underlie the chemistry of the present invention. To exemplify and prototype the inventive system 100, the inventors selected the property field defined by equation (5) because of its simplicity. In spite of its simplicity, however, when this property field is used, the inventive system 100 performs adequately, and thus may serve in the future as a base level, against which more elaborate fields may be benchmarked.
Obviously, the choice of the property fields will have a great effect both on the selectivity and on the efficiency of the search. The point is that the preferred choice depends on the application and on the questions asked. The exploration of different alternative fields is intended to be part of the process of adapting the inventive system 100 to a certain domain.
Possible fields are by no means restricted to “smeared out” atomic properties. For example, the present invention may make use of fields derived from quantum mechanical calculations.
Descriptors and Feature Generation
Given the property fields μ({right arrow over (r)}) and ρ({right arrow over (r)}), the inventive system 100 may construct a set of local, rotationally invariant, moment-based descriptors. If the property fields of the query molecule and a database fragment pair are similar, these descriptors will have similar values. Since the descriptors are rotationally invariant, no alignment is necessary, and the comparison can be performed very quickly.
The similarity of the descriptors alone is an important but not a sufficient criterion for the similarity of the fields. However, together with the descriptors the inventive system 100 stores information on their relative positions and orientations within the query and database structures. When a database structure has enough descriptors similar to the query, the relative positions and orientations of the descriptors are compared. If these are also consistent, the two structures may be considered similar, and an approximate alignment is deduced from this information.
Note that in order to obtain the alignment, no explicit (and costly) optimization of a property field overlap function with respect to translation and rotation operators needs to be performed in the inventive system 100. However, if desired, such an optimization can afterwards be applied to a small set of promising candidates, starting from near-optimal initial conditions.
The first step in the construction of the descriptors is the partitioning of the volume occupied by the structure into overlapping scoops. If the property fields are defined by smearing out atomic properties Aj, (as in the two examples set forth above), this may be performed as follows: Let {{right arrow over (s)}k} be a set of points within or around the structure, such that the spheres with radius R around these points provide a highly overlapping covering of the relevant regions of the property fields. Each of these spheres is called a “scoop”. Further, define a ramping function
and a window function
The “attenuated atomic properties” that contribute to the k-th scoop may, therefore, be given by
Aj(k)=Δ(|{right arrow over (a)}j−{right arrow over (s)}k|)·Aj (9)
and the k-th scoop's property field is
and correspondingly for ρk({right arrow over (r)}), as in equation (5). The intention of the ramping function (7) is that the property fields μk and ρk (and therefore the descriptors) are continuous functions of the location of the scoop center {right arrow over (s)}k.
For general property fields μ({right arrow over (r)}) and ρ({right arrow over (r)}) that are not obtained by smearing out atomic properties, scoop property fields can simply be obtained by setting
μk({right arrow over (r)})=h(|{right arrow over (r)}−{right arrow over (s)}k|)·μ({right arrow over (r)}), (11)
and correspondingly for ρ({right arrow over (r)}).
For example, the set of points {{right arrow over (s)}k} may be the set of all atom positions, and R=3 Å. With this, the scoops may be objects of intermediate size, larger than a functional group, but smaller than a fragment pair. Typically, a scoop may contain 6 to 8 non-hydrogen atoms.
Having defined a set of scoops and their associated local property fields μk and ρk, rotationally invariant descriptors may be constructed. There will be a descriptor consisting of 16 real numbers for each scoop. To simplify the notation, in the following the index k that numbers the scoops is dropped, and the continuum fields are replaced by the discretized versions μi and ρi defined on a grid of points {{right arrow over (r)}i|i=1, . . . , N}. For the grid, a face-centered cubic lattice (other types of lattices can also be used) of unit cell length ΔR=R/18 within each scoop of radius R has been used. Further, the grid spacing ΔR has been determined by varying the grid orientation with respect to the atoms in the scoop, and making sure that the resulting descriptors, as described below, do not significantly depend on the orientation.
The zeroth moments of the fields are
In the case of the property field defined by equation (6), M and Q correspond to total mass and total charge within the scoop.
The center of the μ-field is defined by
and the center of the ρ field by
where {right arrow over (b)} and B are dipole and quadrupole moment of the ρ-field with respect to the origin of the laboratory coordinate system,
and the superscript t stands for transposition. In the case of the property field defined by equation (6), {right arrow over (C)}μ is the center of mass, and the first line of equation (14) is the center of charge. The center of charge may be defined only if the scoop has a net charge (e.g., if the charge is larger than the threshold εQ). Otherwise, the second line of equation (14) calculates the center of dipole.
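The zeroth moment and the center of the μ-field on the discretized grid can be sketched as below. The defining equations are not reproduced in this text, so the sketch assumes the usual definitions: the zeroth moment is the sum of the grid values, and the center is the first moment divided by the zeroth moment. This is consistent with the statement that {right arrow over (C)}μ is the center of mass when μ is built from atomic masses. The function names are illustrative.

```python
def zeroth_moment(values):
    """Sum of the discretized field values over the scoop grid."""
    return sum(values)


def field_center(values, points):
    """First moment of the field divided by its zeroth moment (assumed definition)."""
    total = zeroth_moment(values)
    if total == 0:
        raise ValueError("center is undefined for a field with zero total weight")
    return tuple(
        sum(v * p[axis] for v, p in zip(values, points)) / total
        for axis in range(3)
    )


# Two grid points carrying field values 1 and 3 (hypothetical numbers).
grid = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
mu = [1.0, 3.0]
center = field_center(mu, grid)   # (3.0, 0.0, 0.0)
```

The explicit zero check mirrors the reason the text gives for switching to a center of dipole when the ρ-field has no net charge.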
The inertial tensor J and a cubic vector {right arrow over (J)} with respect to {right arrow over (C)}μ, the center of μ, are defined as
where {right arrow over (r)}i′={right arrow over (r)}i−{right arrow over (C)}μ. Similarly, dipole moment {right arrow over (p)} and quadrupole moment Q of the ρ-field with respect to {right arrow over (C)}ρ, the center of ρ, may be defined as follows:
where {right arrow over (r)}i″={right arrow over (r)}i−{right arrow over (C)}ρ.
It is important to note that the quantities (e.g., equations) (17)-(19) are expressed in a uniquely defined scoop-internal coordinate system so that they no longer depend on the arbitrary choice of the laboratory frame. Therefore, the descriptors of different scoops can be compared without prior alignment. The axes of the scoop internal coordinate system are given by the eigenvectors of the inertial tensor J:
J{right arrow over (v)}n=Jn{right arrow over (v)}n, n=1,2,3. (20)
The positive numbers Jn are the inertial moments. The vectors {right arrow over (v)}n can be arranged into the columns of an orthonormal matrix V. An arbitrary vector may be then transformed from the laboratory frame to the internal frame by left-multiplying it with Vt.
In order to uniquely define the internal coordinate system, (i) the ordering, and (ii) the sensing (i.e. the signs) of the coordinate axes should be fixed.
The ordering is defined by J1≦J2≦J3. If two or three of the eigenvalues of J are degenerate, then there are no unique eigenvectors, but rather two- or three-dimensional eigenspaces, there is no well-defined inertial coordinate system, and the corresponding scoop is not used to generate a descriptor. Degeneracy in this context means that two eigenvalues are equal to within some threshold that may depend on the choice of property field. The degeneracy condition is
J2/J1<1+εJ or J3/J2<1+εJ. (21)
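The degeneracy test of equation (21) is a direct comparison of adjacent inertial moments. A minimal sketch follows; the threshold value used in the example is hypothetical, since the text only says that it may depend on the choice of property field.

```python
def is_degenerate(moments, eps_j):
    """Return True if two adjacent inertial moments are equal to within eps_j.

    moments: the three positive eigenvalues of the inertial tensor, in any order.
    """
    j1, j2, j3 = sorted(moments)          # enforce J1 <= J2 <= J3
    return (j2 / j1 < 1.0 + eps_j) or (j3 / j2 < 1.0 + eps_j)


# With a hypothetical threshold of 0.1, this scoop would be kept.
keep_scoop = not is_degenerate((4.0, 1.0, 2.0), eps_j=0.1)
```

A scoop that fails this test has no well-defined internal coordinate system and is skipped when descriptors are generated.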
To fix the sensing, the dimensionless asymmetry vector was defined by
where {right arrow over (J)} is the cubic vector from equation (17), and the signs of the axes were selected according to Table I illustrated in
The sign s∈{−1, 1} is determined by requiring right-handedness, i.e., det V=1. Table I represents a unique choice of the axes sensing depending only on the cubic vector {right arrow over (J)}.
If two or three components of α are within some chosen threshold of zero, one is left with two or even four different, equally admissible choices for the axes sensing. For descriptors stored in the database, an arbitrary choice may simply be taken. For the query features that are to be matched against the database, a descriptor is generated for each admissible choice of axes sensing. For the present data, the inventors used the duplication condition αn/RJn<0.02.
As a result, the COMMA descriptor Xk for the k-th scoop, which may be called a feature, may consist of the d=16 real numbers which are shown in Table II in
If the ρ field is more complex than a scalar field (e.g., a combination of several scalars) or a vector or a tensor field, then the first and second moments {right arrow over (p)} and Q will have a correspondingly greater number of components, and d, the number of components of the descriptor Xk is larger than 16.
Since the vector and tensor quantities in the descriptor are expressed in the internal coordinate system, which only depends on local properties of the molecule, the descriptor is completely independent of the laboratory frame. This has two important implications: first, two scoop descriptors can be compared against each other without prior alignment, and second, if two descriptors are found to be similar, the two corresponding molecules can be locally aligned by simply overlaying the internal coordinate systems of the two scoops.
Descriptor Scaling and Quantization
The scoop descriptors Xk=(Xk1, . . . , Xkd) cannot be used straightforwardly to define similarity between scoops because the components Xk1, Xk2, . . . have different physical units, and because the different components may be more or less important for the similarity search at hand. For example, it may be important in a particular application of this method to recognize and distinguish between some specific set of structural motifs or chemical functional groups in identifying fragment pairs from a database to align on a query molecule. In a different application of the inventive system 100, it may be more important instead to recognize and distinguish a different set of motifs, or regions of electrical polarity. Clearly, a useful system should be able to adapt to what is meaningful to the user in assessing similarity.
The inventors, therefore, defined a distance measure that weights the different descriptors according to their importance as perceived from a training set of descriptors. The training data consists of a number of sample descriptors that are categorized into groups. From this, the inventive system 100 may “learn” two types of descriptor variations: Important variations may be considered those that occur systematically between descriptors from different groups. Such variations may be used to define the distance measure. The descriptors within a group are considered similar, and the distance measure is made to ignore these types of variations.
While there is a vast repertoire of methods from the disciplines of classification and pattern recognition to address this task, the inventive system 100 may simply use Fisher's linear discriminant analysis. It provides a linear mapping from the d-dimensional descriptor space into a lower-dimensional space:
Yk=WtXk (23)
where W is a d×p matrix with p≦d.
The discriminant matrix W is calculated from the user-supplied classification of a sample of descriptors Xk, into p+1 groups. The within-group scatter matrix Sw is the average covariance matrix of descriptors that are in the same group, and the between-group scatter matrix Sb is the covariance of the group centroids (Richard O. Duda and Peter E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., New York, 1973). The discriminant matrix W is defined through the maximization of the criterion function:
This optimization criterion selects W such that the distances between the Yk within the same group are minimized, whereas the distances between the groups' centroids are maximized. Numerically, W may be calculated using a generalized eigenvalue routine (IBM, Poughkeepsie, N.Y. Engineering and Scientific Subroutine Library for AIX Version 3, Guide and Reference, 1997).
To allow for a fast database lookup of stored descriptors that are similar to a query, descriptor space may be quantized. Descriptors that fall within the same compartment may be considered similar, and are matched. Those that fall into different compartments do not match.
For example, the compartments may be defined by a rectilinear grid in discriminant space (i.e. in the space of the Yk). The grid spacings sj are chosen such that a grid cell can accommodate the typical scatter within a group. This is done by a heuristic that sets the bin width sj to four times the largest standard deviation of the j-th component (j=1, . . . , p) within an individual group, and the right end of the leftmost bin to the minimum over all groups.
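The bin-width heuristic and the resulting key computation can be sketched as below. Two points are assumptions rather than statements from the text: the standard deviation is taken as the population standard deviation, and the "minimum over all groups" is read as the smallest value of the j-th component over all training points, which then marks the right end of the leftmost bin. The function names are illustrative.

```python
import math
from statistics import pstdev


def fit_grid(groups):
    """Derive per-dimension bin widths and origins from grouped training points.

    groups: list of groups, each a list of p-dimensional points in discriminant space.
    """
    p = len(groups[0][0])
    widths, origins = [], []
    for j in range(p):
        largest_sd = max(pstdev([y[j] for y in group]) for group in groups)
        if largest_sd == 0.0:
            raise ValueError(f"component {j} has no within-group scatter")
        widths.append(4.0 * largest_sd)                     # heuristic from the text
        origins.append(min(y[j] for group in groups for y in group))
    return widths, origins


def feature_key(y, widths, origins):
    """Integer cell indices of a point; equal keys mean the descriptors match."""
    return tuple(math.floor((yj - o) / s) for yj, s, o in zip(y, widths, origins))


# Two hypothetical training groups in a two-dimensional discriminant space.
groups = [[(0.0, 0.0), (1.0, 2.0)], [(10.0, 5.0), (10.5, 5.0)]]
widths, origins = fit_grid(groups)
key = feature_key((10.5, 5.0), widths, origins)   # (5, 1)
```

With this indexing, points below the origin receive index −1, which corresponds to the leftmost bin.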
For example,
Although the scaling and quantization in the inventive system 100 may be fairly simple, it is robust. Moreover, the inventive system 100 does accomplish a context-adapted descriptor calibration and similarity measure, utilizing a user-defined training set of descriptors. Missed matches as a consequence of cell boundaries are not as likely in the low dimension in which the grid is defined as they would be for higher-dimensional situations. Since there is redundancy in the set of scoops characterizing a molecule, a meaningful alignment will be recovered if just a fraction of them are actually matched.
A virtue of linear mappings (e.g., equation (23)), in comparison to, for example, neural networks, is that linear mappings have a limited number of parameters and require only a relatively small training set, which is important for practical applications. Furthermore, the mapping (e.g., equation (23)) and the subsequent discretization are exactly invariant under overall linear transformations of the descriptors. This means that the calibration and quantization scheme does not depend on whether, for example, length is measured in meters or Angstroms, and mass in grams or amu.
The training set may be chosen according to the following criteria: the different groups should be selected to contain examples of structural or functional units that are relevant for the similarity search. Within the individual groups, members should represent the kinds of variations that occur in the descriptor database, like experimental uncertainty in bond lengths and angles, different environments of functional groups, and different conformations deemed irrelevant for the problem at hand.
The Feature Database
The previous section generated a context-adapted mapping from descriptor space to an integer set of feature keys. These keys are simply the indices that label the cells in discriminant space. The use of a hash table and fast integer key lookup methods supports efficient queries against large databases, with access times largely independent of database size.
In principle, compared to a method that would use a detailed, fully continuous distance metric, the quantization of the descriptors to produce the keys causes a loss of sensitivity (more false negatives), and a loss of selectivity (more false positives) in the similarity search. However, the set of descriptors that describe a molecule may be highly redundant, so that an incomplete set of scoop matches still leads to a complete alignment of the molecules. Furthermore, false positive matches are reduced in a subsequent clustering step since false positive matches do not occur consistently.
The feature database may be the product of two inputs: a set of descriptors, generated from all conformations of a selection of fragment graphs, and the context-adapted descriptor-to-key mapping. The feature database is a hash table, whose keys are given by evaluating the mapping on the descriptors. An entry in the hash table consists of a reference back to a fragment pair, and a description of the internal frame associated with the descriptor. The explicit values of the descriptors are not stored in the feature database. The thrifty use of memory by the feature database is important for the scalability of the inventive system 100.
Generally, the calculation of the descriptor set for all the fragment pairs of the fragment graphs may be an expensive part of the inventive system 100. However, the descriptor set may be computed only one time, in a preprocessing step, and then the set may be stored. Application of the descriptor-to-key mapping is fairly cheap and quick. As the domain context is varied, the descriptor-to-key mapping will change, and different feature databases can be created to query against. Finally, a given feature database can be queried very rapidly from the keys of query features, with any number of queries, including differing conformations of the same molecule, as well as differing molecules.
Feature Correspondence
The feature database may be used to align fragment pairs to the query molecule. The basic idea is to calculate scoops, descriptors, and keys for the query in the same way as for the fragment pairs stored in the database. Each query key then accesses the matching entries in the feature database. Each pair of query and database scoops that have the same key may be referred to as a correspondence.
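The hash-table lookup and the construction of correspondences can be sketched as follows. The class and function names are illustrative, and the frame objects are left opaque. In practice a frame would hold the origin and the eigenvector matrix of the internal coordinate system.

```python
from collections import defaultdict


class FeatureDatabase:
    """Hash table from feature key to (fragment pair id, internal frame) entries."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, key, fragment_pair_id, frame):
        # Only the back-reference and the frame are stored, not the descriptor values.
        self._table[key].append((fragment_pair_id, frame))

    def lookup(self, key):
        # .get avoids creating empty entries for keys that were never stored.
        return self._table.get(key, [])


def correspondences(query_features, database):
    """Pair every query feature with all stored entries that share its key."""
    pairs = []
    for key, query_frame in query_features:
        for fragment_pair_id, stored_frame in database.lookup(key):
            pairs.append((query_frame, fragment_pair_id, stored_frame))
    return pairs


db = FeatureDatabase()
db.add((1, 2), "fp-7", "frame-a")
db.add((1, 2), "fp-9", "frame-b")
db.add((0, 0), "fp-7", "frame-c")
found = correspondences([((1, 2), "q-frame-1"), ((5, 5), "q-frame-2")], db)
```

Each lookup is a constant-time dictionary access on average, which is why access times stay largely independent of database size.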
By overlaying the internal coordinate axes of such a scoop pair, each correspondence implies a certain alignment of a stored fragment pair onto the query. Whereas such a single correspondence might be coincidental, a significant alignment (what may be referred to as a hypothesis) is inferred when several independent correspondences from different regions of the query and fragment pair support the same relative orientation of the two. This assumption is a basis of pose-clustering methods (G. Stockman, Object recognition and localization via pose clustering, Computer Vision, Graphics, and Image Processing, 40:361-387, 1987). A selected number of the strongest hypotheses may be passed on to the assembly.
The feature correspondence process may, therefore, include construction of all correspondences by keyed access from the query to the feature database, clustering of the correspondences, and construction of the hypotheses as average alignments of the significant clusters.
The internal coordinate system that is associated with a feature may be specified by its origin {right arrow over (C)}μ, and an orthonormal system of inertial eigenvectors V (e.g., see equations (13) and (20)). The transformation of laboratory frame coordinates {right arrow over (x)} into internal coordinates may be given by
$T: \vec{x} \mapsto V^{t}(\vec{x} - \vec{c}_{\mu}).$ (25)
Given a correspondence between a query feature Xq and a stored feature Xs, and the two associated transformations Tq and Ts, the transformation that aligns the stored fragment pair coordinates onto the query is
$T_{qs} = T_{q}^{-1} \circ T_{s}: \vec{x} \mapsto \vec{c}_{\mu,q} + V_{q} V_{s}^{t}(\vec{x} - \vec{c}_{\mu,s}).$ (26)
This is a putative alignment of the fragment pair represented in the database onto the query molecule, based on a single feature correspondence.
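The alignment implied by a single correspondence can be sketched with plain 3×3 list-of-lists matrices. The sketch writes the transformation of equation (26) as a rotation R = Vq·Vs^t followed by a translation t = cq − R·cs, which is the same map rearranged. The helper names and the example frames are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def alignment(c_q, v_q, c_s, v_s):
    """Return (R, t) such that x -> t + R x overlays the stored frame on the query frame."""
    rotation = matmul(v_q, transpose(v_s))                       # R = V_q V_s^t
    translation = [cq - rc for cq, rc in zip(c_q, matvec(rotation, c_s))]
    return rotation, translation


def apply_transform(rotation, translation, x):
    return [t + rx for t, rx in zip(translation, matvec(rotation, x))]


# Hypothetical frames: the query axes are the stored axes rotated by 90 degrees about z.
c_q, v_q = [1.0, 2.0, 3.0], [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
c_s, v_s = [5.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R, t = alignment(c_q, v_q, c_s, v_s)
origin_image = apply_transform(R, t, c_s)   # the stored origin lands on the query origin
```

The stored scoop center always maps onto the query scoop center, and the stored internal axes map onto the query internal axes.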
However, it would be impractical and prohibitively expensive to evaluate and score each such alignment separately. In the inventive system 100, therefore, the set of all putative alignments may be divided into clusters. In order to perform clustering, a metric in the space of transformations (e.g., see equation (26)) may be required. For example, the inventors have used the following metric:
if s and s′ are from the same fragment pair, and d(Tqs, Tq′s′)=∞ if they are from different fragment pairs. {right arrow over (x)}0 is the center of geometry of the fragment pair coordinates, and the first term on the right hand side of equation (27) is simply the Euclidean distance between the transformed fragment pair centers. $V_{q'} V_{s'}^{t} V_{s} V_{q}^{t}$ is the rotation part of the transformation $T_{q's'} \circ T_{qs}^{-1}$ that maps a set of coordinates transformed by Tqs onto the one transformed by Tq′s′, and drot is the magnitude of the angle of that rotation.
Further, the function
is close to the identity for small angles, but becomes large when x goes towards π. This function may be used to make sure that transformations with very different rotations are considered very far apart. The parameter α provides a measure of the relative weighting of orientation and translation for the transformation distance. For example, α=3 Å may be selected for all fragment pairs.
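The rotational part of the metric, drot, can be sketched on its own. The sketch uses the standard identity that a rotation matrix R turns through an angle θ with trace(R) = 1 + 2 cos θ. The full distance of equation (27) and the function of equation (28) are not reproduced in this text and are therefore not implemented. The function names are illustrative.

```python
import math


def relative_rotation(r_a, r_b):
    """Return R_b R_a^t, the rotation that carries orientation a onto orientation b."""
    return [[sum(r_b[i][k] * r_a[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_angle(r):
    """Magnitude of the rotation angle in radians, in the range [0, pi]."""
    trace = r[0][0] + r[1][1] + r[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))   # clamp against round-off
    return math.acos(cos_angle)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
d_rot = rotation_angle(relative_rotation(identity, quarter_turn))   # pi / 2
```

Applied to the rotation parts of two putative alignments, this gives the angle between their orientations.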
Having defined a metric, the clustering may be performed. For example, the hierarchical clustering method G03ECF implemented in the NAG library (NAG Ltd., Oxford, UK. The NAG Fortran Library Manual, Mark 16, 1993), with the complete-linkage distance updating method, may be used. Hierarchical clustering may be selected over partitional clustering because the latter is cumbersome for non-Euclidean metrics like equation (27), and the complete-linkage method may be selected because it produces the most compact clusters, which is desirable for the subsequent averaging (Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988).
The only parameter of the clustering step may be the distance level dclust at which the dendrogram is cut. For instance, the inventors have used dclust=3 Å. This choice was made based on a number of experiments with the DHF-MTX system; however, results were relatively insensitive to values in the range of 2.5 to 4.0 Å. The value is probably appropriate for studies using the same clustering algorithm and transformation distance metric, with fragments that are approximately 10 heavy atoms in size, and with nuclear placement of scoops (i.e., scoop centers at atom positions). Note that if dclust is too large, what should be distinct clusters will be merged, and the average transformation will not be representative of any transformation of these distinct clusters. On the other hand, if dclust is too small, the number of members in each cluster is small and no clusters emerge with sufficient signal.
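Cutting a complete-linkage dendrogram at dclust can be sketched with a naive agglomerative loop. This is a plain stand-in written for illustration, not the NAG routine named in the text, and it is far too slow for large correspondence sets. Because complete-linkage merge heights never decrease, stopping at the first merge above dclust gives the same clusters as cutting the full dendrogram at that level.

```python
import math


def complete_linkage(items, distance, d_clust):
    """Agglomerative clustering; returns clusters as lists of indices into `items`."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance between the two clusters.
        return max(distance(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > d_clust:
            break                      # every remaining merge lies above the cut
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# One-dimensional toy data with an absolute-difference metric (hypothetical values).
points = [0.0, 1.0, 2.0, 10.0, 11.0]
result = complete_linkage(points, lambda a, b: abs(a - b), d_clust=3.0)
```

An infinite distance between items, such as the one assigned to correspondences from different fragment pairs, keeps those items in separate clusters for any finite dclust.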
For hypothesis building, the largest and therefore most significant clusters may be selected, and average transformations calculated. For instance, consider a cluster of n transformations T1, . . . Tn, each one represented by
$T_{i}: \vec{x} \mapsto \vec{t}_{i} + R_{i}\vec{x}, \quad i = 1, \ldots, n.$ (29)
Representations given by equations (29) and (26) are related by $R = V_{q} V_{s}^{t}$ and $\vec{t} = \vec{c}_{\mu,q} - R\,\vec{c}_{\mu,s}$. The average rotation may be calculated as
is a singular value decomposition (SVD) of ΣiRi (W. Dan Curtis, Adam L Janin, and Karel Zikan, A note on averaging rotations, In IEEE Virtual Reality Annual International Symposium, pages 377-385, IEEE, 1993).
This is well-defined as long as the rotations in the cluster are not too different. In the situation where rotations vary widely, the notion itself of an average rotation becomes disputable. The average translation vector {right arrow over (t)}av is calculated by requiring that the fragment pair center {right arrow over (x)}0 under the average transformation is the arithmetic average of the fragment pair center under the individual transformations in the cluster,
The average transformation for the cluster may therefore be $T: \vec{x} \mapsto \vec{t}_{av} + R_{av}\vec{x}$. For example, all feature correspondences may carry an equal weight both in the clustering and averaging steps. Alternatively, it may be considered that some correspondences are more significant than others.
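The requirement on the average translation can be turned into a short sketch. It takes the average rotation as an input (its SVD-based computation is not shown) and solves the stated condition, namely that the fragment pair center under the average transformation equals the mean of its images under the individual transformations. The function names are illustrative.

```python
def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def average_translation(transforms, r_av, x0):
    """Solve t_av + R_av x0 = mean_i (t_i + R_i x0) for t_av.

    transforms: list of (R_i, t_i) pairs; r_av: average rotation; x0: fragment pair center.
    """
    n = len(transforms)
    mean_image = [0.0, 0.0, 0.0]
    for rotation, translation in transforms:
        rotated = matvec(rotation, x0)
        for axis in range(3):
            mean_image[axis] += (translation[axis] + rotated[axis]) / n
    rotated_center = matvec(r_av, x0)
    return [mean_image[axis] - rotated_center[axis] for axis in range(3)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cluster = [(identity, [0.0, 0.0, 0.0]), (identity, [2.0, 4.0, 6.0])]
t_av = average_translation(cluster, identity, [1.0, 1.0, 1.0])   # [1.0, 2.0, 3.0]
```

When all rotations in the cluster coincide, the result reduces to the plain mean of the individual translation vectors, as in the example.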
Summing up, for each query feature, the descriptor-to-key mapping may form a correspondence with all stored features that have the same key value. The set of all such pairs may be clustered according to the metric in transformation space. There might be correspondences involving many different stored fragment pairs, but the correspondences within one cluster may only refer to the same fragment pair.
Further, the largest clusters, according to a user-defined minimum cluster size nclust, may be selected. For each cluster, the average transformation may be calculated and applied to the coordinates of the fragment pair from whose features it was derived. The result is a set of hypotheses, that is, fragment pairs aligned on the query.
Assembly
The assembly process in the inventive system 100 may begin with this set of hypotheses: fragment pairs positioned in the query frame according to the transformations derived in the clustering procedure. These fragment pairs may belong to any molecule in the database. An objective is to merge these fragment pairs into complete, aligned structures. The assembly algorithm should be able to process as much in parallel as possible.
If only fragment pairs belonging to the same fragment graph are allowed to be merged, the structures that result from the assembly process will be conformers of those molecules that were used to build the database. However, if fragment pairs from different fragment graphs are allowed to merge, new structures may be created through the assembly. Furthermore, pieces of larger structures may qualify as matches to the query.
The assembly may be an iterative process, where each iteration step begins with a set of fragment pairs and/or partially assembled structures, and produces a set of partially or fully assembled structures that have increased in size by one fragment. Whereas the overall iteration loop runs sequentially, within a step, many fragment pairs or partial structures can be processed in parallel.
At the beginning of each step, fragment pairs and any partial structures from previous steps may be grouped together and referred to as bases. In other words, a base may be a group of adjacent fragments, that have been aligned to the query, and connected with specific torsional angles of the rotatable bonds. All atom coordinates may be expressed in the query frame. In addition, a base carries a record of which fragment nodes it contains, and which fragment pairs could be used to grow it.
The first phase of an assembly step may be termed the base expansion phase. In this phase, the list of potential merges of fragment pairs with bases may be determined. It results from consideration of the fronts of the growing bases. The front of a base is defined as the set of rotatable bond edges between a matched and an unmatched fragment node on the fragment graph. Any fragment pair from the set of hypotheses that includes both a frontal rotatable bond edge and matches the fragment conformer present in the base can potentially merge with the base. A merge between a base and fragment pair may occur if the root mean square distance computed between corresponding atoms in the fragments that are common to the base and the fragment pair is less than a threshold. Note that during the assembly process a hypothesis fragment pair can be used for merges many times, onto multiple growing bases.
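The merge criterion can be sketched as a root mean square distance test over the atoms of the shared fragment. The atom lists are assumed to be in corresponding order, and the threshold values in the example are hypothetical, since the text does not state one.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between corresponding atoms of two coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def can_merge(base_fragment, pair_fragment, threshold):
    """True if the common fragment sits at nearly the same place in both structures."""
    return rmsd(base_fragment, pair_fragment) < threshold


# The common fragment as placed by the base and by the hypothesis (hypothetical values).
in_base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
in_pair = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(in_base, in_pair)   # 1.0
```

No superposition is performed before the comparison, because both structures are already expressed in the query frame.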
There are different ways to merge a base and a fragment pair that have close, but not identical atom positions of the common fragment. First, one could leave the base fixed and superpose the fragment pair onto the base's fragment. In the other extreme, one could leave the fragment pair fixed and superpose the base onto the fragment in the fragment pair.
However, the inventors have used an intermediate option, which is constructed by joining the base and the fragment pair, and then determining the least squares alignment (K. S. Arun, T. S. Huang, and S. D. Blostein, Least-squares fitting of two 3-D point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):698-700, 1987) of the unique corresponding atoms in the original base and the grown one, and in the fragment pair and the grown base. The atoms of the common fragment are not necessarily included in this calculation. Note that due to the merging, the orientation and position of a base depends not only on the fragment pairs that went into it, but also on the order in which they were merged. Therefore, multiple identical bases can be produced with slightly different positions and orientations relative to the query molecule.
After each merge, a bump check may be performed, which checks whether atoms from the newly added fragment are too close to those that were in the base before. For instance, the inventors have used a threshold of 1.7 Å for the non-bonded atom-atom distance. This value is rather small, so some high energy conformations are used as fragment pairs and some high energy candidates are produced during assembly. The value was chosen to screen out only the worst cases of steric overlap in order to assess the method with a larger work load. The performance of the method improves if this parameter is increased.
After a bump check, a shape screen may be applied, which prevents the base from growing too far outside the query's volume. The screen may be performed simply by verifying that no atom in the base is more than a threshold distance from the closest atom in the query. For instance, this threshold may default to 3.5 Å.
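The bump check and the shape screen are both simple distance tests and can be sketched together. The default thresholds are the values quoted in the text. The `bonded` argument, which exempts covalently bonded atom pairs from the bump check, is an assumption about how "non-bonded" would be handled, and the function names are illustrative.

```python
import math


def passes_bump_check(new_atoms, old_atoms, bonded=frozenset(), min_distance=1.7):
    """False if any non-bonded new/old atom pair is closer than min_distance (Angstrom).

    bonded: set of (new_index, old_index) pairs that are exempt (assumed convention).
    """
    for i, a in enumerate(new_atoms):
        for j, b in enumerate(old_atoms):
            if (i, j) not in bonded and math.dist(a, b) < min_distance:
                return False
    return True


def passes_shape_screen(base_atoms, query_atoms, max_distance=3.5):
    """False if any base atom is farther than max_distance from every query atom."""
    return all(
        min(math.dist(a, q) for q in query_atoms) <= max_distance
        for a in base_atoms
    )


# Hypothetical coordinates: the new atom is 1.5 Angstrom from atom 0 of the old base.
new_fragment = [(0.0, 0.0, 0.0)]
old_base = [(1.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
clash = not passes_bump_check(new_fragment, old_base)
allowed_if_bonded = passes_bump_check(new_fragment, old_base, bonded={(0, 0)})
```

Both tests are quadratic in the number of atoms, which is acceptable for structures of fragment-pair size.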
Bases that pass the shape screen may then be scored against the query. For example, the default scoring function may be a simple Carbo function, as used in other work (R. Carbo, L. Leyda, and M. Arnau, How similar is a molecule to another? An electron density measure of similarity between two molecular structures, Int. J. Quant. Chem., 17:1185-1189, 1980).
where $\vec{a}_j$ is the atom coordinate of the jth query atom, $\vec{b}_j$ is the atom coordinate of the jth atom in the base, and $h(\vec{x}) = \pi^{-3/4}\,\tau^{-3/2}\exp(-x^2/2\tau^2)$ is a normalized three dimensional Gaussian density with τ=1 Å. $N_q$ and $N_b$ are the numbers of atoms in the query molecule and the base, respectively. Scoring functions can vary widely in their sophistication (Andrew C. Good and W. Graham Richards, Explicit calculation of 3d molecular similarity, In Hugo Kubinyi, Gerd Folkers, and Yvonne C. Martin, editors, 3D QSAR in Drug Design, volume 2, pages 321-338, Kluwer Academic Publishers, Great Britain, 1996). The inventors have chosen equation (32) for its simplicity.
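A Carbo-type score built from the Gaussian h defined above can be sketched as follows. The sketch assumes that each molecule's density is the sum of one Gaussian h per atom and that the score is the overlap of the two densities divided by the geometric mean of their self-overlaps; this is the usual Carbo form, but its normalization may differ in detail from equation (32). The function names are hypothetical.

```python
import math

TAU = 1.0  # Å, Gaussian width quoted in the text


def overlap(atoms_p, atoms_q, tau=TAU):
    """Sum of pairwise overlap integrals of the Gaussians h centered on atoms.

    For h(x) = pi^(-3/4) tau^(-3/2) exp(-x^2 / (2 tau^2)), the overlap integral
    of two such Gaussians whose centers are a distance d apart is
    exp(-d^2 / (4 tau^2)).
    """
    return sum(
        math.exp(-math.dist(p, q) ** 2 / (4.0 * tau ** 2))
        for p in atoms_p
        for q in atoms_q
    )


def carbo_score(query_atoms, base_atoms, tau=TAU):
    """Assumed Carbo form: cross overlap over the geometric mean of self-overlaps."""
    cross = overlap(query_atoms, base_atoms, tau)
    norm = math.sqrt(
        overlap(query_atoms, query_atoms, tau)
        * overlap(base_atoms, base_atoms, tau)
    )
    return cross / norm
```

Under these assumptions the score is 1 for identical atom sets and falls toward 0 as the two sets move apart, so any unmatched volume in either molecule lowers the score.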
A base that passes certain criteria may qualify as a candidate, meaning that it may be considered a successful match to the query. For instance, a base may be considered a candidate if it has some minimum number of fragments, passes all applicable checks and/or has a sufficient score with respect to the scoring function in use.
Bases that have no opportunities for growth may be removed from consideration in subsequent iterations of the assembly process. The bases remaining may be used as input to the next iteration of the assembly phase. The assembly process may be terminated when there are no bases left for consideration (e.g., when there are no bases left to grow).
Note that the procedure may be considered exhaustive in that all candidate alignments may be produced if they qualify with respect to the bump check, shape screen and scoring criteria. The inventive system 100 may, therefore, attempt to produce the better candidates early. For example, in their application of the inventive system 100 discussed below, the inventors produced all possible qualifying candidate alignments that could be constructed from the aligned fragment pairs submitted to the assembly phase.
Example Application of the Present Invention to Dihydrofolate and Methotrexate
For example, the inventors have applied the inventive system 100 to the case of dihydrofolate (DHF) and methotrexate (MTX). This case has been the subject of study for numerous methods of superposition (Michael D. Miller, Robert P. Sheridan, and Simon K. Kearsley, Sq: A program for rapidly producing pharmacophorically relevant molecular superpositions, J. Med. Chem., 42:1505-1514, 1999; Christian Lemmen and Thomas Lengauer, Time-efficient flexible superposition of medium-sized molecules, Journal of Computer-Aided Molecular Design, 11:357-368, 1997; Simon K. Kearsley and Graham M. Smith, An alternative method for the alignment of molecular structures: maximizing electrostatic and steric overlap, Journal of Computer Aided Molecular Design, 8:565-582, 1994) as well as docking (Matthias Rarey, Bernd Kramer, Thomas Lengauer, and Gerhard Klebe, A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol., 261:470-489, 1996; David M. Lorber and Brian K. Shoichet, Flexible ligand docking using conformational ensembles, Protein Science, 7:938-950, 1998). In fact, the case has become a benchmark for such methods, as it exhibits an interesting aspect: the most reasonable superposition from the perspective of topology is different from the superposition that can be deduced from aligning the enzyme parts of the crystal structures of ligand-enzyme complexes of DHF and MTX bound to dihydrofolate reductase (DHFR) (Gerhard Klebe, Structural alignment of molecules, In Hugo Kubinyi, editor, 3D QSAR in Drug Design, pages 173-199, ESCOM, Leiden, 1993).
The latter superposition can be understood in terms of electrostatics and hydrogen bonding sites (Hugo Kubinyi, Similarity and dissimilarity: A medicinal chemist's view, In Hugo Kubinyi, Gerd Folkers, and Yvonne C. Martin, editors, 3D QSAR in Drug Design, volume 3, pages 317-338, Kluwer, Dordrecht/Boston/London, 1998). The MTX-DHF system may thus serve as a probe to characterize how such methods weight the importance of topological similarity, electrostatic similarity, and potential non-bonded interactions. The inventors chose this case as their case study to characterize how the inventive system 100 performs, using the very simple property field detailed above.
To this end, the inventors described two experiments, which illustrate the importance of feature classification. The first considers the alignment of a rigid conformation of MTX on a rigid conformation of DHF. Both conformations are extracted from the crystal structures of the molecules bound to DHFR. Features are computed at atom centers and grouped according to a classification scheme derived with knowledge of the respective binding modes observed in the crystal structures. This functional-based grouping (e.g., as shown in
For example,
Taking DHF as the query, MTX may be aligned, for example, under the two conditions of classification, and the results compared. It should be noted that knowledge of the particular binding mode is not a requirement for application of the inventive system 100. If, however, one has such information, one should be able to use it to better characterize the meaning of similarity in a similarity search.
As a second experiment the inventors used DHF as a query molecule on a database that represents the full conformational space of MTX. The features were classified using two schemes. The first scheme is the functional-based grouping described above. The second scheme assumes no knowledge of the binding modes and simply groups features by the element type (e.g., C, N, or O) of the central atom.
Comparison of the results of using these two schemes illustrates the value of injecting relevant context into the feature classification scheme. The inventors compared the workload consequent from these two feature classification schemes, and concluded with the top scoring alignments assembled from the fragment pairs of MTX. As noted below, reasonable results are obtained with both grouping schemes. However, when features are classified with the functional-based (i.e., context-derived) scheme, the results are obtained more efficiently and are of higher quality, as assessed by the Carbo scoring function or by visual inspection.
Rigid Body Superposition
Structures of DHF and MTX were extracted from the crystal structures with PDB identifiers 1dhf and 4dfr, respectively. 1dhf provides the coordinates of DHF bound to the human form of the enzyme DHFR (N. J. Prendergast, T. J. Delcamp, P. L. Smith, and J. H. Freisheim, Expression and site-directed mutagenesis of human dihydrofolate reductase, Biochemistry, 27:3664, 1988). 4dfr provides the coordinates of MTX bound to an E. coli strain (J. T. Bolin, D. J. Filman, D. A. Matthews, R. C. Hamlin, and J. Kraut, Structures of Escherichia coli and Lactobacillus casei dihydrofolate reductase refined at 1.7 angstroms resolution. I. General features and binding of methotrexate, J. Biol. Chem., 257:13650, 1982) of DHFR. No attempts were made to optimize the structures with quantum or classical methods. Thus, the ring distortion and out-of-plane bending as observed in the crystal structures were left intact. Hydrogen atoms were added using Cerius2 (Program Cerius2, distributed by Molecular Simulations Inc., 9685 Scranton Rd., San Diego, Calif. 92121-3752).
Features were computed according to the parameters in Table III, as shown in
The loadings of the two Fisher discriminants derived from the functional grouping are shown in
Further, the first discriminant has dominant loadings in Qxx and Qyy, as well as the inertial Jy and Jz components of the feature. Large Q components with the present property field arise when the differences in electronegativity with respect to the average lie further from their center of dipole. The second discriminant is most sensitive to M, the integral of the electronegativity property field over the scoop, and differences in Jz from Jx to Jy, which may be seen as a crude measure of planarity of field within the scoop. Differing training sets will lead to a different feature weighting in discriminant space.
The atom-centered features of both DHF and MTX, once projected onto the Fisher discriminants resulting from the functional grouping, are partitioned into bins. The size of the bins in each dimension is set to four times the standard deviation of the largest group.
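The binning step described above can be sketched as a mapping from a point in discriminant space to a tuple of integer grid-cell indices. This is a minimal sketch under stated assumptions: "largest group" is read as the group with the most members, the grid origin is placed at zero, and the names `bin_widths` and `feature_key` are hypothetical.

```python
import math
import statistics


def bin_widths(groups, factor=4.0):
    """Per-dimension bin width: `factor` times the std. dev. of the largest group.

    `groups` maps a group label to a list of points in discriminant space.
    Assumption: "largest" means the group with the most members.
    """
    largest = max(groups.values(), key=len)
    dims = len(largest[0])
    return [
        factor * statistics.pstdev([point[d] for point in largest])
        for d in range(dims)
    ]


def feature_key(point, widths):
    """Integer grid-cell indices labelling the bin that contains `point`.

    Assumption: the grid origin sits at zero in every dimension.
    """
    return tuple(math.floor(x / w) for x, w in zip(point, widths))
```

Two features then correspond only if their keys are equal, which is what keeps the number of correspondences well below the full product of query and database features.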
The resulting partitioning scheme is shown in
Specifically,
DHF has 53 atoms, so with scoops located on atomic centers, this results in 53 features. However, 6 features fail to pass the threshold for sufficient non-degeneracy and are removed. For the construction of correspondences between the query and database features, multiple features are generated for scoops in the query molecule (DHF) that lack sufficient asymmetry to define the sense of their internal axes. This results in the addition of 25 more features for DHF, for a total of 72. Of the 56 scoops for MTX, 10 fail to pass the threshold for sufficient non-degeneracy, leaving 46 features. The total number of correspondences without grouping is therefore the product of the number of features in DHF and MTX, or 72*46=3312, as shown in Table V.
Specifically, Table V in
Therefore, it can be seen that, in spite of its problems, a partitioning of discriminant space may be necessary. A brute force comparison of all query features with all candidate features will likely become prohibitively expensive from a computational perspective when applied to larger problems. Furthermore, examination of the large clusters in the full correspondence case (e.g., see Table V in
In both groupings, when one passes the clusters to the assembler (which in this case may only apply shape screens and scoring since the molecules are already assembled), essentially the same alignments result. This will not be the case when considering the flexibility of MTX. An important point to emphasize here, is that by injecting relevant knowledge into the feature classification scheme, one arrives at the answer with less work. This point becomes important when larger databases of molecules with more conformational degrees of freedom are considered. This is more than a timing issue: in practice, one often must apply cutoffs and limits to the search. Arriving at an answer efficiently at a small scale may sometimes translate into whether one sees it at all at a larger scale.
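The saving in work can be illustrated by counting correspondences with and without feature keys. In this sketch the key lists are hypothetical stand-ins for the bin labels of the query and database features; only the feature counts 72 and 46 come from the text.

```python
from collections import Counter


def count_correspondences(query_keys, database_keys):
    """Number of query/database feature pairs that share a feature key."""
    db_counts = Counter(database_keys)
    # A Counter returns 0 for keys it has never seen, so unmatched query
    # features contribute nothing.
    return sum(db_counts[key] for key in query_keys)


# Without grouping, every query feature pairs with every database feature.
ungrouped = 72 * 46  # DHF and MTX feature counts quoted in the text
```

With grouping, the count depends on how the features are spread over the bins: the more evenly the features occupy distinct bins, the fewer correspondences have to be constructed and clustered.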
A result that is fairly robust with respect to choice of clustering parameters is the production of two alignments, one where the benzamide rings are aligned, and the other with the observed crystallographic alignment of the pteridine rings indicating superposition of the regions of the molecules that bind to DHFR. The presence of these two alignments stems from two origins in conflict. One alignment arises from the strength of an exact substructural match, namely the p-amino-benzamide. The other, which is the experimentally observed alignment of the pteridine rings, is due to the similarity in locations of chemical functional groups.
Balancing how exact substructural matches should be weighted with respect to similarity in functional group definition is an issue that must be dealt with in any superposition algorithm or similarity scoring function. Exact substructure matches are sometimes relevant, even if trivial. One certainly could not fault an algorithm for scoring such a correspondence high. Rather, one must balance the relative importance exact substructural matches have with respect to similar substructures. Since there is no universal answer, this should be addressed by the context. In this case (e.g., DHF and MTX), the two form separate and distinct clusters in transformation space, and are thus presented as alternative alignments.
Flexible Superposition
To query DHF against a database that represents the full conformational space of MTX, the MTX structure was fragmented, and the conformations of each fragment and fragment pair were tabulated. The fragmentation is illustrated in
The non-bonded atom threshold reduces the number of conformers of fragment C from 36 to 27, the number of A-B fragment pairs from 72 to 48, and the number of B-C fragment pairs from 324 to 234. The original angle present before fragmentation is appended to the list of inter-fragment torsion angles. The results are shown in Table VI in
Specifically, Table VI shows the statistics for the fragment and fragment pair partitioning of MTX. The 35 fragment conformers and 284 rotatable bond edge records allow for the representation of more than 15,500 full molecular conformations (e.g., see equation (1)). Thus, one of the 49 A-B fragment pairs and one of the 235 B-C fragment pairs have the original angle found in the crystal structure.
The cluster distributions in Table VII in
Thus, as can be seen from the total number of correspondences constructed in each case, the work load of the correspondence step is markedly higher for the elemental grouping scheme than in the functional grouping scheme. Inspection of the partitioning of the elemental grouping revealed a decidedly less rational distribution. Bins were occupied with more random groupings of features compared to the functional scheme.
Furthermore, there is a difference in the types of alignments of the fragment pairs containing the p-amino-benzamide ring and the pteridine ring that are produced by the two schemes. In the functional grouping scheme we see three types of alignments: one with the p-amino-benzamide ring of MTX tightly aligned onto that of the DHF; one with the pteridine ring of MTX tightly aligned onto that of DHF in a way compatible with the DHFR binding implied by the crystal structures; and one with the pteridine ring of MTX tightly aligned onto that of DHF in a way compatible with good steric overlap of the rings.
In contrast, with the element grouping scheme we see only alignments where the p-amino-benzamide ring of MTX is tightly aligned onto that of the DHF. With the set of parameters shown in Table III of
It should be added, however, that even with the element grouping scheme, although alignments based on superposition of the pteridine ring of MTX onto that of DHF are not observed, both the binding mode and steric overlap orientations of the pteridine ring can be seen. However, these alignments are produced by correspondences from regions of the molecules away from the pteridine ring and happen to survive shape screening in the assembly phase.
All clusters with votes of five or more are used to produce the starting bases of the assembly step. Since, for this example, completed molecules of MTX are produced when two fragment pairs are merged, the assembly phase completes after one iteration.
Table VIII in
The assembly phase produced 428 candidate alignments with the element grouping scheme and 436 with the functional grouping scheme. All alignments were scored using the function of equation (32) and clustered into sets based on similarity of score and visual appearance. The scores ranged from 0.40 to 0.75.
Specifically,
Further, as shown in
While the alignments represented in
Specifically,
Further, due to the strong effect the benzamide has on the average transformation that aligned the respective fragment pairs, the correct hydrogen bonding groups on the pteridine ring are not as incident to the corresponding groups on DHF as they are in the alignment shown in
Specifically,
The alignments produced by the element grouping scheme were similar to those shown in
Selectivity
It is important to consider how well the inventive system 100 works using these operating parameters when other molecules are added to the database. Two groups of molecules were added to the database: a group of eight DHFR inactives and a second group of eight DHFR inhibitors. The inventors first examined selectivity in a broad context by comparing MTX to the eight DHFR inactives, and then examined selectivity in a finer context by comparing MTX to the eight DHFR inhibitors.
Table IX in
The initial coordinates for the eight DHFR inhibitors shown in
Specifically,
Though small, this group exhibits important features. The sizes of the inactives range from the small and rigid sugars that are Glycogen Phosphorylase inhibitors, to the larger flexible Thermolysin inhibitors. The eight DHFR inhibitors are consistently smaller than MTX, and their activity spans a wide range (G. M. Crippen, Quantitative structure activity relationships by distance geometry: Systematic analysis of dihydrofolate reductase inhibitors, J. Med. Chem., 23:599-606, 1980), as shown in the last column of Table X in
Specifically,
Using a torsion angle resolution consistent with the methotrexate characterization described above, the number of conformations represented ranges from a few dozen to over 10 million for the DHFR inactives. The number of conformations represented for the DHFR inhibitors totals under a thousand for the set, due to their limited flexibility. Thus, with a fairly small set of molecules, the inventors created a virtual conformational database where the DHFR inhibitors are highly under-represented.
The inventors used the same conditions as were described above (e.g., “Example: Application of the Present Invention to Dihydrofolate and Methotrexate”). All single bonds were sampled at 60° intervals. Terminal groups such as carboxy, isopropyl, and phenyl were left rigid. Amides were left in a trans configuration. All property field settings were used as specified in Table III in
Ligands from 1tlp, 1tmn, 7cpa, and 5tmn were partitioned into four fragment nodes; and the ligands from 3tmn, 1cbx, 3gpb, 4gpb, and the eight DHFR inhibitors were partitioned into two fragment nodes.
It should be noted that the atom numbering for the non-hydrogen atoms was kept the same as in the corresponding pdb files. For the nine molecules studied, the bonds that were cut to define fragments are as follows (pdb label: {(atom number)-(atom number), . . . }): 1cbx: {3-5}, 1tlp: {12, 15-16, 23-24}, 1tmn: {113, 14-15, 22-25}, 3tmn: {8-9}, 5tmn: {12-13, 17-18, 24-25}, 7cpa: {9-10, 18-41, 23-26}, 3gpb: {1-7}, 4gpb: {1-7}, and 4dfr: {11-12, 24-25}. The internal torsional angles that were rotated are as follows: 1cbx: {2-3, 5-6}, 1tlp: {2-12, 16-17, 16-19, 27-28, 24-27}, 1tmn: {1-2, 2-3, 13-14, 14-17, 21-22, 25-26}, 3tmn: {12-13, 9-12, 2-3}, 5tmn: {3-4, 11-12, 16-17, 17-20, 25-28}, 7cpa: {23-24, 36-41, 18-40, 10-11, 10-39, 2-32}, 3gpb: {5-6}, 4gpb: {5-6}, and 4dfr: {9-13, 14-19, 25-29, 29-30}. All angles were sampled at 60° with the exception of 14-19 in 4dfr, which was sampled at 180°. For the eight additional DHFR inhibitors, the structural parameters of C-S-C groups in molecules 2 and 67 were constrained to the experimental values of CS bond lengths and CSC bond angles in dimethylsulfide. Structural parameters of C-SO2-C groups in molecules 1, 4 and 68 were constrained to the experimental values of CS and SO bond lengths and CSC, CSO and OSO bond angles in dimethylsulfone (David R. Lide, CRC Handbook of Chemistry and Physics, 75th edition, 1994, pg. 9-31).
The fragmentation statistics for the sixteen molecules are shown along with those for methotrexate in Table IX in
The number of fragment pairs is the number that have passed a bump check of 1.7 Å. Features were computed at each atom center for each fragment pair. One can see from the data in the table that there are considerable savings in breaking the larger structures up and characterizing fragment pairs rather than molecular conformations. Due to the single partitioning of the DHFR inhibitors, only the bump checks reduced the number of fragment pairs used for feature extraction.
Feature computation can be expensive, but it may only have to be performed once, when molecular information is added to the database. Its cost will vary widely depending on the complexity of the field computation and on the grid resolution, if the field is computed on a grid. Once the features are computed, however, timings for the alignment and assembly stages of a query are independent of the field complexity. Feature computation with the property field and operational settings discussed above averaged about 4 features on a workstation (the timings were produced running the program on an IBM RS/6000 43P Model 260 workstation, which has a Power3 CPU running at 200 MHz and a 4 MB L2 cache).
Selectivity During Alignment Stage
It is important for the efficiency and scalability of the method that the burden on the assembly stage is kept to a minimum. The results shown in Tables X, XI, XII and XIII (
If the other eight inhibitors are included, this number only rises to 1,458,843. The correspondences of MTX and the eight inhibitors are only 0.19% of the total. Each correspondence has an associated transformation which is clustered as described above. Clustering produced 1,053,312 transformation clusters in total. These transformation clusters are ranked according to the number of correspondences they contain. From this pool of transformation clusters, each of which describes a fragment pair aligned on the query molecule, one can select for assembly either the top ranking clusters from each molecule, or the top ranking clusters from the entire pool of clusters. For either choice, one sees a high degree of selectivity for MTX, which has been achieved by the choice of property field and functional grouping scaling scheme.
Table X (
As shown in
Thus, even though MTX represents a small fraction of the total correspondences considered, clustering clearly reveals MTX fragment pairs as the best ones to use in the assembly stage. Subsequent assembly of these MTX fragment pairs leads to the results seen in the previous sections. The additional 23 million conformations represented by the 13 thousand fragment pairs of the molecules in
Table XI in
Further, as seen in Table XI, despite the low number of conformations relative to the inactives, the DHFR inhibitors scored consistently high and were within the selection threshold of 6 votes or higher. A threshold of 6 or more reduces the number of alignments for consideration to 606, 478 of which are from the inhibitor set (including MTX). Even with a threshold of 5 votes or higher, the number of alignments to consider is 5,221, with 1,031 being from the inhibitor set. This is out of the 1,061,703 total clusters.
Selection of the larger clusters for assembly therefore appears to be a valid means of significantly reducing the aligned fragment pairs prior to assembly. Given the elementary level of consideration for medicinal chemistry, the dramatic reduction in the number of clusters to consider is encouraging. It is clear that such specific pre-screening is essential for larger datasets, particularly when the final scoring function used is expensive.
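The vote-threshold selection described above can be sketched as a filter followed by a ranking. The `(molecule_label, votes)` tuple layout and the function name are assumptions made for illustration, and any vote counts passed in would be hypothetical rather than taken from the tables.

```python
def select_clusters(clusters, min_votes):
    """Keep clusters with at least `min_votes` correspondences, largest first.

    `clusters` is a list of (molecule_label, votes) tuples; this layout is an
    assumption, not the record format used by the inventive system.
    """
    kept = [cluster for cluster in clusters if cluster[1] >= min_votes]
    # sorted() is stable, so clusters with equal votes keep their input order.
    return sorted(kept, key=lambda cluster: cluster[1], reverse=True)
```

Raising `min_votes` from 5 to 6 is the kind of change that, in the results above, cuts the alignments passed to assembly from a few thousand to a few hundred.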
Selectivity During Assembly Stage
If, however, the top ranking transformation clusters from each molecule are considered individually, it is interesting to see whether there will still be selectivity for MTX. Table XII in
Table XIII in
Based on the size of the transformation clusters, the top ranking fragment pair alignments were selected for each molecule separately. For each molecule, enough cluster sizes were considered to yield at least 500 fragment pair alignments. This resulted in the number of initial alignments shown in Table XII (
Thus, it can be seen that the bulk of the time for the Thermolysin and Carboxypeptidase inhibitors is spent indexing features and clustering transformations during the alignment stage, where most of the potential assembly load is screened out. For MTX, however, most of the time is spent on assembly due to the large number of fragment pair alignments that could be merged, screened for shape and scored. All assembly runs were exhaustive for the selected fragment pairs.
The assembly results again strongly favor MTX over any of the other molecules investigated. Most of the molecules did not have sufficient similarity to DHF for a complete assembly of the molecule. In fact, only MTX showed any merges of fragment pairs. For those molecules that had opportunities to merge, fragment pairs must be placed such that the common fragments are proximal enough to be merged. For molecules where only a small part is similar to the DHF query, the assembly stage will yield as candidates the fragment pairs or partial assemblies aligned to the corresponding query regions. This is the case in Table XII (
Alignments must also pass a shape screen in order to qualify as candidates and be scored. One can see from Table XII (
The last column of Table XII (
Inspection of the alignments of the DHFR inhibitors reveals that in each case, the binding mode corresponding to MTX is observed (e.g., see
As noted above, Table XIII (
One of the alignments of 3tmn has a Carbo score of 0.54. This is the only compound with a comparable Carbo score to the DHFR inhibitors. Inspection of this alignment, however, reveals that 3tmn is aligned to the branched chain of DHF. Again, this could well be screened out by filters favoring the known binding mode of DHFR inhibitors, since the branched chain region is known from crystal structures to be solvent exposed. This is also the case with the alignments observed with the sugars 3gpb and 4gpb.
A further note on the lower scores of the DHFR inhibitors is that the Carbo score penalizes for any mismatched shape. When aligned correctly to DHF, the DHFR inhibitors have no corresponding structure to fill the volume corresponding to the chain of DHF. This is reflected in the scores of the best alignments, which are around 0.5.
Even though the inventors selected the top ranking fragment pair alignments for each molecule, fragment pairs did not match across the entire volume of the query, so there were not multiple fragment pairs to be considered for merging to form more complete conformers (and, therefore, higher scoring candidates) for molecules other than MTX.
The structure of DHFR was not used at any phase of the screening process, since this is a similarity searching application. It is not our intention here to further refine or optimize the binding mode. Our results show that from a modest flexible dataset, we are able to reduce the number of conformations and alignments to a few hundred, from several million.
In summary, the inventive system 100 offers a systematic way to use arbitrary property fields for the purposes of similarity search and alignment. Context specific information can be used to scale the relative importance of high dimensional descriptors derived from the property field under study. The system 100 may be designed to operate on overlapping parts, or fragment pairs, of molecules, and utilizes an efficient method of conformational space representation.
The inventors applied the inventive system 100 to the benchmark dihydrofolate-methotrexate system, and have demonstrated that by injecting context specific knowledge into the feature classification scheme, one arrives at the reasonable alignments more efficiently. For this study, the inventors used a very simple property field. Even with this simple property field the appropriate inclusion of context in the definition of similarity allowed the production of alignments consistent with the binding modes present in the crystal structures in both rigid and flexible treatments of conformational space.
Further, the inventors looked in detail at how the inventive system 100 works in performing a query for alignments on dihydrofolate from a database of seventeen molecules representing approximately 23 million conformations. The inventive system 100 exhibits a high degree of selectivity for alignments of methotrexate as well as other DHFR inhibitors. The selectivity is apparent at the initial alignment and assembly stages, and is reflected in the resulting Carbo scores.
Referring again to the drawings,
Referring again to the drawings,
In addition to the system described above, a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.
Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal-bearing media.
Thus, this aspect of the present invention is directed to a programmed product, including signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor to perform the above method.
Such a method may be implemented, for example, by operating the CPU 311 to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal bearing media.
Thus, this aspect of the present invention is directed to a programmed product, comprising signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor incorporating the CPU 311 and hardware above, to perform the method of the invention.
This signal-bearing media may include, for example, a RAM contained within the CPU 311, as represented by the fast-access storage for example. Alternatively, the instructions may be contained in another signal-bearing media, such as a magnetic data storage diskette 400 (
Whether contained in the computer server/CPU 311, or elsewhere, the instructions may be stored on a variety of machine-readable data storage media, such as DASD storage (e.g., a conventional “hard drive” or a RAID array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an optical storage device (e.g., CD-ROM, WORM, DVD, digital optical tape, etc.), paper “punch” cards, or other suitable signal-bearing media including transmission media such as digital and analog communication links and wireless links. In an illustrative embodiment of the invention, the machine-readable instructions may comprise software object code, compiled from a language such as “C,” etc.
With its unique and novel features, the present invention provides a fast and efficient field-based similarity search system and method which are not confined to a particular field definition. For example, the inventive system and method may be designed for simplistic, phenomenological fields, as well as for fields derived from quantum mechanical calculations. Therefore, the claimed system and method provides improved versatility over conventional systems and methods.
While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
Claims
1. A field-based similarity search system, comprising:
- an input device which inputs a query molecule; and
- a processor which: partitions a conformational space of said query molecule into a fragment graph comprising an acyclic graph including plural fragment nodes connected by rotatable bond edges; computes a property field on fragment pairs of fragments of said query molecule from said fragment graph, said property field comprising a local approximation of a property field of said query molecule; constructs a set of features of said fragment pairs based on said property field, said features comprising a set of local, rotationally invariant, and moment-based descriptors generated from all conformations of said fragment graph of said query molecule; and weights said descriptors according to importance as perceived from a training set of descriptors to generate a context-adapted descriptor-to-key mapping which maps said set of descriptors to a set of feature keys comprising indices that label grid cells in discriminant space,
- wherein said processor partitions said conformational space by selecting of a set of bonds in said molecule to cleave.
2. The field-based similarity search system of claim 1, wherein said processor further:
- aligns fragment pairs of a candidate molecule stored in a feature database to said query molecule.
3. The field-based similarity search system of claim 2, wherein said processor further:
- assembles said aligned candidate molecule fragment pairs to construct a full aligned candidate molecule.
4. The field-based similarity search system of claim 3, further comprising:
- a display device for displaying said full aligned candidate molecule.
Type: Application
Filed: Oct 17, 2013
Publication Date: Feb 20, 2014
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventors: Michael C. Pitman (Stamford, CT), Blake G. Fitch (White Plains, NY), Hans W. Horn (San Jose, CA), Wolfgang Huber (Murg-Ninederhof), Julia E. Rice (Sunnyvale, CA), William C. Swope (Morgan Hill, CA)
Application Number: 14/056,517
International Classification: G06F 17/30 (20060101);