Methods, software arrangements, storage media, and systems for genotyping or haplotyping polymorphic genetic loci or strain identification

Info

Patent number: 7917297
Type: Grant
Filed: Aug 2, 2004
Date of Patent: Mar 29, 2011
Patent Publication Number: 20070043514
Assignee: New York University (New York, NY)
Inventors: Vera Cherepinsky (Sandy Hook, CT), Bhubaneswar Mishra (Great Neck, NY)
Primary Examiner: Jerry Lin
Attorney: Dorsey & Whitney LLP
Application Number: 10/567,126

Abstract

The present invention relates generally to systems, methods, storage media, and software arrangements for genotyping and/or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample. Exemplary embodiments of systems, methods, storage media, and software arrangements may perform the optimization of the design of one or more microarrays, each containing a set of oligonucleotide probes capable of detecting one or more known genotypes and/or haplotypes at given polymorphic genetic loci or identifying the strain variant, by optimizing the set of oligonucleotides to be incorporated into the microarrays and by optimizing the arrangement of a set of oligonucleotides on the microarrays. The optimization may be achieved through the application of one or more optimization procedures. The instant invention may be useful in typing individuals at the HLA loci or other polymorphic genetic loci, or may be employed to quickly identify viral or bacterial pathogens from which genome sequence information is available.

Description

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national stage application of PCT Application No. PCT/US2004/024766 which was filed on Aug. 1, 2004 and published in English on Feb. 10, 2005 as International Publication No. WO 2005/013091 (the “International Application”), This application also claims priority from U.S. Patent Application Ser. No. 60/492,210, filed on Aug. 1, 2003 (the “'210 Application”). The entire disclosures of these applications are incorporated herein by reference. This application claims priority from the International Application pursuant to 35 U.S.C. §365, and from the '210 Application pursuant to 35 U.S.C. §§119(e) and 365.

FIELD OF THE INVENTION

The present invention relates generally to systems, methods, and software arrangements for genotyping or haplotyping polymorphic genetic loci or strain identification.

BACKGROUND OF THE INVENTION

The human leukocyte antigen (“HLA”) region on chromosome 6 is highly polymorphic. In particular, the sequence of this region varies from person to person. See, e.g., Consolandi et al., Human Immunology 2003, 64:168. Approximately 1,750 different sequence variants or “alleles” have been identified to date at the HLA locus. There are many biological implications of the high degree of heterogeneity of this region. For example, the presence or absence of a given HLA allele or “HLA type” may predict the presence or absence of diseases, dictate the course of treatment for a patient, or, most notably, determine the compatibility of a potential transplant recipient with the donor organ or bone marrow.

One of the approaches to finding the right allele is to design a microarray experiment that provides the allele as an answer. In fact, HLA typing by sequence hybridization with sequence-specific oligonucleotide probes (“SSOP”) is currently practiced by the National Marrow Donor Program (“NMDP”) for donor-recipient matching, along with more traditional serological-based methods. See, e.g., Cao et al., Reviews in Immunogenetics 1999, 1:177; and Noreen et al., Tissue Antigens 2001, 57:221. In a format that is popular in many current test methods, the DNA samples to be classified are amplified with locus-specific primers, and spotted onto the microarray chips, thus resulting in multiple copies of identical chips; each chip is then hybridized to a different probe. See Balazs et al., Human Immunology 2001, 62:850; Consolandi et al., Human Immunology 2003, 64:168. This methodology necessitates a new design process every time a new set of patient samples must be classified. Moreover, most of the currently used techniques are both time-consuming and lack optimality.

The system, process, storage medium and software arrangement according to one exemplary embodiment of present application provides a graph model on the set of potential probes in which the HLA typing problem is formulated mathematically as an optimization problem. According to the present application, it is also possible to utilize an algorithm for solving the optimization problem. The processes of translating the typing problem to the graph model and translating the optimizing probe set back to an experimental design for HLA typing are also described. Extensions of the graph model to more detailed physical models are discussed as well.

SUMMARY OF THE INVENTION

Embodiments of systems, processes, storage media and software arrangements for genotyping or haplotyping the polymorphic genetic loci or strain identification according to the present invention may optimize the design of one or more microarrays, each containing a set of oligonucleotides that are capable of detecting known genotypes or haplotypes at given polymorphic genetic loci. This can be done by optimizing the set of oligonucleotides to be incorporated into the microarrays and/or by optimizing the arrangement of a set of oligonucleotides on the microarrays. Such optimization may also be achieved through the application of one or more optimization algorithms. The present invention may be useful in typing individuals at the HLA loci or other polymorphic genetic loci, and/or may be employed to quickly identify viral or bacterial pathogens from which genome sequence information is available.

In contrast to the conventional systems and methods according to the present invention, the sequence-specific probes may be placed on a microarray chip, and each patient sample can be applied to the chip to allow the hybridization with one or more of the chip-bound probes to occur. With an appropriate selection of probes, the same chip can be used for all classifications. However, the number of probes to be used, their sequence compositions, and their arrangement on the chip are some of the design variables that should preferably be considered in preparing the microarray. A general solution to solving these design problems is preferably one that allows the “recognition” of all existing alleles at a target locus, and/or that can decide that the given DNA sequence contains an allele that is not in the “known” list. Such an allele may be a new, previously unknown allele, or one of the very rare alleles that occur so infrequently that they are not considered HLA types. For example, an exemplary embodiment of the present invention is directed toward systems, processes, storage media and software arrangements for genotyping or haplotyping a DNA sample at one or more polymorphic loci contained in the sample through the use of microarray μA, which can be defined by a set of oligonucleotide hybridization probes configured on the surface of the microarray in a two-dimensional arrangement. The process of querying a given polymorphic locus in a DNA sample (hereafter referred to as a “target” sequence) by a hybridization experiment can be denoted by the expression
(T_j,μA_k)→D→{circumflex over (T)}_j (1)
where T_jis the true allele contained in the target sequence, μA_kis the microarray used in the query, D is the data output of the hybridization experiment, and {circumflex over (T)}_jis the allele inferred from the data. Both processes in equation (1) are described below.

The problem of genotyping or haplotyping then can be formulated as that of designing, e.g., the “best” microarray, namely, the set and arrangement of oligonucleotide probes that “works” for all known alleles (i.e., ∀j). In the notation employed herein, this means finding μA_kwhich solves the optimization

$\begin{matrix} \min \sum_{type j} w_{j} E [Π_{T_{j} \neq {\hat{T}}_{j}}] \Leftrightarrow \min \sum_{type j} w_{j} \Pr (T_{j} \neq {\hat{T}}_{j}) . & (2) \end{matrix}$
Here, Π_Xis the indicator function

$Π_{X} = {\begin{matrix} 1, & if X is true \\ 0, & otherwise \end{matrix}}$
and w_jis the weight assigned to type j. Initially, w_jcan be set such that w_j=1∀j. At a later point it may be desirable to weigh different HLA types differently, based on the frequency of their occurrence in human population or some other criteria.

According to another exemplary embodiment of the present invention, a pseudocode can be provided to select an optimal set of oligonucleotide probes to be incorporated into the one or more microarray devices to be used to genotype or haplotype polymorphic loci or identify the strain present in a DNA sample. According to this exemplary embodiment, vertex boosting weights (initially set to probe information weights) can be used to define a probability distribution on a vertex set present in a graphical representation of “response vectors” derived from each potential probe sequence. On each iteration of a boosting loop, a random subset of a specified size can be selected according to the current probability distribution. All edges in the induced subgraph on this random subset are broken, with one of the terminal vertices removed. The boosting weights of the elements of the subset can then be modified so that the vertices that stayed in the subset are more likely, and the vertices that were thrown out are less likely to be chosen on the next iteration. The boosting loop may be terminated after a predetermined number of iterations have been performed without further improvement to the list of top independent sets. In addition, the boosting loop can be restarted several times with original probe information weights, which prevents convergence to a local optimum.

The exemplary embodiments of the systems, processes, storage media and software arrangements of the present invention are generally more beneficial in comparison to conventional methods in that they require fewer probes, thereby minimizing the cost associated with the use of such probes. The exemplary embodiments of the systems, processes, storage media and software arrangements of the present invention are also preferable to conventional methods in that they can minimize competition among neighboring probes, thereby reducing or eliminating the occurrence of systematic biases in the error process.

For a better understanding of the present invention, together with other and further objects, reference is made to the following description, taken in conjunction with the accompanying drawings, and its scope will be pointed out in the appended claims.

DETAILED DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1A is an exemplary embodiment of a system according to the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 1B is an exemplary embodiment of a procedure according to the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 2 is a first exemplary embodiment of a probe selection procedure (Shown in FIG. 1B) of the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 3 is a second exemplary embodiment of the probe selection procedure shown in FIG. 1B of the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci, in which exemplary embodiments of certain steps of FIG. 2 are further illustrated.

DETAILED DESCRIPTION OF THE INVENTION

An exemplary embodiment of the present invention is directed to systems, processes, storage media and software arrangements for genotyping or haplotyping polymorphic genetic loci (e.g., an HLA locus, in a subject) or strain identification. HLA haplotyping or genotyping may assist in, e.g., a prediction of a presence or absence of diseases, selecting a course of treatment for a patient, and/or determining the compatibility of a potential transplant recipient with the donor organ or bone marrow. The systems, processes, storage media and software arrangements of the present invention also may be useful for e.g., identifying the presence of an unknown pathogen, including but not limited to a virus or a bacterium, in a sample.

By providing a relatively rapid, inexpensive, and potentially highly accurate way of genotyping or haplotyping polymorphic genetic loci or strain identification, the exemplary systems, processes, storage media and software arrangements of the present invention also can be useful in elucidating genotype/phenotype correlations in complex genetic disorders, i.e., those in which more than one gene may play a significant role, or in genetic diseases characterized by the presence of a relatively large number of genotypes or haplotypes at polymorphic loci. The knowledge obtained from the exemplary systems, processes, storage media, and software arrangements of the present invention therefore may assist in facilitating the diagnosis, treatment, and prognosis of individuals bearing a given phenotype.

FIG. 1A shows an exemplary embodiment of a system according to the present invention for determining a particular set of oligonucleotide probes for a use in a microarray designed to perform genotyping and/or haplotyping of at least one polymorphic genetic locus. For example, the system includes a processing arrangement 10 (e.g., a computer), which stores a computer program on a storage arrangement 15 (e.g., memory, hard drive, etc.) to execute the exemplary techniques described herein below. In particular, the computer program, when executed by the processing arrangement 10, causes the processing arrangement to obtain known sequences representing classes, e.g., HLA allele sequences, from an external source 20 or any source (as described below). Then the processing arrangement 10 applies a probe selection technique according to the present invention. Thereafter, the processing arrangement 10 outputs the results of the execution of this technique.

FIG. 1B illustrates a first exemplary embodiment of a process for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci or strain identification which can use the exemplary system shown in FIG. 1A. In this exemplary embodiment, the process executes a first step 100 in which nucleotide sequences of known genotypes or haplotypes at predetermined polymorphic loci, or strain-specific sequences, may be obtained. Such information can be obtained from a number of sources. For example, this data can be obtained from genetic databases including, but not limited to, GenBank and/or other databases maintained by public or private (commercial and noncommercial) institutions. The databases suitable for use with the systems, processes, storage media and software arrangements of the present invention will be readily apparent to those of ordinary skill in the art of DNA sequencing. Alternatively, such information may-be obtained from the prior direct, sequencing of one or more test samples.

Once a set of known sequences corresponding to genotypes or haplotypes at polymorphic genetic loci or strains is assembled, the probe selection procedure 200 of the present invention may be applied to this set to identify an optimal subset of oligonucleotide probes required to genotype or haplotype the polymorphic genetic loci or identify the strain. This can be executed by the processing arrangement of FIG. 1A. The optimization of the set of oligonucleotide probes may include, but is not limited to, the determination of the overall lengths of the various probes, their sequence compositions, and the minimum number of probes required to identify all selected target genotypes or haplotypes or strains. Many if not all selected target genotypes or haplotypes or strains may include all known genotypes or haplotypes at predetermined polymorphic genetic loci or strains, or a subset thereof. For example, the identification may be limited only to those alleles that are most prevalent in the population from which the DNA sample to be genotyped and/or haplotyped has been obtained.

In another step of the process for determining the optimal set of oligonucleotide probes of FIG. 1B, an optimal subset of oligonucleotide probes preferred for genotyping and/or haplotyping polymorphic genetic loci or strain identification may be generated by the probe selection technique 200 and processing arrangement 10. The optimal probe set 300 may then be arranged on one or more microarray devices to determine the genotype or haplotype of polymorphic genetic loci or identify the strain present in a DNA sample.

Another exemplary embodiment of the present invention further relates to the optimal arrangement of the optimal set of oligonucleotide probes on the one or more microarray devices. The exemplary arrangement of the probes may be based on one or more “conceptual” measures of probe distance. The “conceptual” probe distance can use available biological knowledge about the portions of the genome containing the probes in question, as well as a measure of competition between these probes, as described in Chapter 3 of Cherepinsky, Ph.D. Thesis, New York University 2003. An exemplary recursive technique according to the present invention can facilitate a separation between the “nearest” probe pairs by at least a specified minimum physical distance on the chip. This optimized arrangement can be designed to be disruptive to neighborhoods, defined with respect to the “conceptual” probe distance. Thus, the exemplary arrangement can place conceptually nearby probes far apart on the surface, thereby minimizing the likelihood for competitive interactions between neighboring probes.

FIG. 2 illustrates a first exemplary embodiment of the probe selection algorithm 200 of the present invention for determining the optimal set of oligonucleotide probes. In this exemplary embodiment, the biological problem to be solved, e.g., designing an optimal set of probes to be used on a microarray for determining the genotype or haplotype of polymorphic genetic loci, such as the human HLA region, or strain, variation may first be translated into a graph optimization problem 210. The graph optimization problem can then be solved (item 220) through an application of the exemplary technique according to the present invention. The graph optimization solution 230 can then be translated back into the context of the biological problem to be solved.

FIG. 3 shows a second exemplary embodiment of the probe selection technique (see process 200 of FIG. 1B) according to the present invention for determining the optimal set of oligonucleotide probes, in which the exemplary embodiments of the processes 210 and 220 of FIG. 2 are further illustrated. In this exemplary embodiment, selected target genotypes and/or haplotypes, which may include all known genotypes or haplotypes at one or more predetermined polymorphic genetic loci or a subset thereof, depending on whether pre-processing has been performed, may be used to generate potential probe sequences 211. The potential probe sequences 211 can then be used to generate probe response vectors (PRVs) 212. Using the PRVs 212, a complete edge-weighted and vertex-weighted graph G=(V,E) 213 may be constructed and the algorithm parameters estimated and set. Said parameters include the following: M, the upper bound on the size of the independent set sought; a, the scaling factor used to modify boosting weights of vertices on each iteration of the algorithm; ρ, the edge threshold; parameters minRestartNum and noImprovements, used to set loop termination conditions; as well as unnamed parameters such as the size of the “current-best” list of independent sets. Criteria for estimating M are obtained via probabilistic analysis, discussed in section C.6 of present application. Other parameters are chosen by trial and error.

Once the graph G is constructed, the exemplary technique of the present invention may be applied to identify one or more optimal subsets of PRVs (step 221), which are output from this exemplary technique. In a post-processing procedure, one or more of the candidate optimal subsets 222 can be selected for a maximum discriminatory power by testing allele coding vectors for redundancy.

The PRVs present in the optimal subset may then be converted back into probe nucleotide sequences by reporting the DNA sequence associated with the probe used to generate each PRV contained in the optimal-subset (item 231). This exemplary procedure may be equivalent to translating the solution of the graph optimization problem back into a biological context.

Overall Process Description According to the Present Invention

A. Mathematical Formulation

1. Definitions

Let the different HLA types, or alleles, be denoted by T_j, j=1, . . . , N. (Here, N=1750, the approximate number of known HLA types.) Let a given microarray be denoted by μA_k,k ε , where a microarray is defined by a set of hybridization probes and their two-dimensional arrangement on the chip surface. The process of querying the given DNA sequence (hereafter referred to as a “target” sequence) by hybridization can be denoted by the expression
(T_j,μA_k)→D→{circumflex over (T)}_j (1)
where {circumflex over (T)}_jis the true allele contained in the target sequence, μA_kis the microarray used in the query, D is the data output of the hybridization experiment, and {circumflex over (T)}_jis the allele inferred from the data. Both processes in (1) are described below.

The problem of HLA typing can then be formulated as that of designing the best microarray, namely, the set and arrangement of probes, which “works” for all known HLA types (i.e., ∀j). In the present notation, this can mean finding μA_kwhich solves the optimization problem

$\begin{matrix} \min \sum_{type j} w_{j} E [Π_{T_{j} \neq {\hat{T}}_{j}}] \Leftrightarrow \min \sum_{type j} w_{j} \Pr (T_{j} \neq {\hat{T}}_{j}) . & (2) \end{matrix}$
Here, Π_Xis the indicator function

$Π_{X} = {\begin{matrix} 1, & if X is true \\ 0, & otherwise \end{matrix}}$
and w_jis the weight assigned to type j. Initially, w_jcan be set such that w_j=1∀j. At a later point, it may be desirable to weigh different HLA types differently, based on the frequency of their occurrence in human population or some other criteria.

There are several procedures that should be considered in detail: obtaining data D from an experiment based on allele T_jand microarray μA_k(described below in Section A.2), inferring allele {circumflex over (T)}_jfrom the data (described below in Section A.3), generating potential microarrays for the typing experiments (described below in Section A.4), and selecting the optimal microarray (Section B).

2. (T_j,μA_k)→D

Consider the set of probes {P₁,_k, . . . , P_n(k),k} constituting microarray μA_k, initially neglecting their arrangement. Ideally, the outcome of the hybridization of the target sequence with each probe P would be binary: 1, if the target contains a subsequence complementary to P, and 0, otherwise. Using n=n(k) probes then yields a binary string of length n, or, alternately, a vector of length n, as a code for the target sequence. In practice, hybridization results may not necessarily be binary. In particular, the measurements are the analog intensity values corresponding to the amount of formed probe-target complex for each probe. In addition, in an attempt to “factor out” the non-specific signal, each probe can often be present in two versions: one (e.g., a perfect match, or “pm”) perfectly complementary to a region on the target, and the other (e.g., a mismatch, or “mm”) slightly mismatched, the latter usually containing a single base mismatch near the center of the probe. This is the case, for example, in Affymetrix GeneChips. In such setup, the signal from probe P may be the match-to-mismatch ratio, i.e., the ratio of the intensities corresponding to the matched and mismatched probe-target complexes. Furthermore, the signal can be log-transformed, so that the hybridization outcome for probe P is really the value of

$\log (\frac{{TP}_{pm}}{{TP}_{mm}})$

The situation may further be complicated because the probes may hybridize to positions on the target other than those they were designed to detect—this is known as “cross-hybridization.” In addition, the fact that many probes are present in the system may cause the signal (i.e., the hybridization outcome) from a given probe to differ from the signal of the same probe in the absence of other probes. This is described in more detail in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3.

Thus, the actual result of a hybridization experiment is a vector of n measurements, D ε ⁿ.

3. D→{circumflex over (T)}_j

To infer the allele from the n-vector, the ideal process should be referred to again, where the outcome is D ε {0,1}ⁿ. If the probes are selected in such a way so as to provide a distinct binary string for each known allele (so that the Hamming distance d_Hbetween any pair of data vectors D_i, D_kis at least 1), then these n probes can be sufficient to identify the allele of the target sequence. What is preferred is to query the sequence with the n probes and read off the allele to which the pattern corresponds. Furthermore, if it is required that d_H(D_i, D_k)≧α for some α>1, then the discrimination power can be increased and error-correction is possible. This is discussed in greater detail below (e.g., Section B.6).

In a practical setting, D ε ⁿ. Thus, as a first procedure, some thresholding process must be applied to D to reduce it to a binary string.

4. Generating Potential Microarrays

- a. Selection of Informative Probes. A set of n probes, each of length L, that are at least d letters apart (pairwise), must be provided for optimal discrimination among the allele sequences.

If L is not specified, it can be chosen arbitrarily (say, 20), or allowed to vary from probe to probe.

With no restrictions, a very large n can be chosen; for example, every possible 20-mer could be used as a probe. However, this would result in 4²⁰=2⁴⁰=(2¹⁰)⁴>(10³)⁴=10¹², or over a trillion, probes. Such a large set may not be desirable, since many of these probes would give the same results, and it is too expensive to produce all of them. Allowing both n and L to vary may give an even larger number of potential probe sequences.

The probe design problem relates to selecting which of the probes are most useful in discriminating among the given allele sequences, and how many (or rather, how few) one can use appropriately.

- b. Arrangement of Probes on the Chip. When a set of probes has been selected using techniques described herein above, a question still remains of how to arrange these probes on the microarray chip. Several studies indicate that the patterns observed in the results of chip experiments may be due to the arrangement of probes on the chip. See e.g., Kluger et al., submitted to Nature Genetics, 2004 at URL http://bioinfo.mbb.yale.edu/˜kluger/pipeline/KLUGERetal_NG.pdf; Yu et al., submitted to Nature Biotechnology, 2004 at URL http://bioinfo.mbb.yale.edu/˜kluger/pipeline/YQK_NB.pdf; and Qian et al., submitted to Biotechniques, 2004 at URL http://bioinfo.mbb.yale.edu/˜kluger/pipeline/QYK_artifact.pdf. In particular, it has been observed that the probes are arranged on a chip based on the labels of the genes they represent, and a gene label is often related to the function and/or disorder in which the gene is involved. As a result, genes of shared function have similar labels and are coexpressed, generating monochromatic bands on microarray chip scans.

These studies indicate that additional consideration should be given to the arrangement of the probes on the chip, based on certain “conceptual” measure of probe distance. The “conceptual” probe distance can use available biological knowledge about the portions of the genome containing the probes in question, as well as a measure of competition between these probes, as discussed Chapter 3 of Cherepinsky, Ph.D. Thesis, New York University 2003. The recursive technique described below can ensure that the “nearest” probe pairs would be separated by at least a specified minimum physical distance on the chip. It is designed to be disruptive to neighborhoods, defined with respect to the “conceptual” probe distance. Thus, the exemplary technique places conceptually nearby probes far apart on the surface.

Consider a bijective function f: {0, . . . , N²−1}→{0, . . . ,N−1}×{0, . . . ,N−1} that maps every pair of “nearby” points in the domain space to a pair of “distant” points in the range space. In particular, the following devised function ƒ with the following property should be considered: For every x, y, if |x−y|≦4^α, then ∥ƒ(x)−ƒ(y)∥₁≧N/(2^α+1). This function likely gives an optimal placement. If the elements of the domain space satisfy other distance properties, this technique can be suitably generalized to handle similar properties with respect to the new distance metric.

This function can play an important role in determining how to place a set of oligonucleotide probes on a microarray surface in such a manner that if two probes are close to one another in their genome locations then they are reasonably far apart on the array. Thus, a placement determined by the function can minimize the competition among the probes for the genomic targets, as well as the systematic biases in the error processes.

Inductively, a uniform family of functions ƒ_kmay be defined as follows. Let k<lg N.

$\begin{matrix} f_{k + 1} : {0, \dots, N^{2} - 1} \to {0, \dots, N - 1} \times {0, \dots, N - 1} \\ : x \mapsto 〈 i, j 〉 . \end{matrix}$
f_k+1is defined in terms of f_k,
ƒ_k:{0, . . . ,N²/4−1}→
{0, . . . ,N/2−1}×{0, . . . ,N/2−1},
as follows:

$\begin{matrix} f_{k + 1} (x) = {\begin{matrix} f_{k} ([\frac{x}{4}]), & if x \equiv 0 \mod 4 \\ f_{x} ([\frac{x}{4}]) + 〈 0, \frac{N}{2} 〉, & if x \equiv 1 \mod 4 \\ f_{x} ([\frac{x}{4}]) + 〈 \frac{N}{2}, 0 〉, & if x \equiv 2 \mod 4 \\ f_{x} ([\frac{x}{4}]) + 〈 \frac{N}{2}, \frac{N}{2} 〉, & if x \equiv 3 \mod 4 \end{matrix}} & (3) \end{matrix}$

This function can be generalized, without its general properties being affected, by including a random permutation π_k+1: {0, . . . ,3}→{0, . . . ,3} as follows:

$f_{k + 1} (x) = {\begin{matrix} f_{k} ([\frac{x}{4}]), & if x \equiv π_{k + 1} (0) \mod 4 \\ f_{k} ([\frac{x}{4}]) + 〈 0, \frac{N}{2} 〉 & if x \equiv π_{k + 1} (1) \mod 4 \\ f_{k} ([\frac{x}{4}]) + 〈 \frac{N}{2}, \frac{N}{2} 〉 & if x \equiv π_{k + 1} (2) \mod 4 \\ f_{k} ([\frac{x}{4}]) + 〈 \frac{N}{2}, \frac{N}{2} 〉 & if x \equiv π_{k + 1} (3) \mod 4 \end{matrix}}$

Herein below, k may take the value (lg N−1), and the base case is given by the function
ƒ₂:{0, . . . ,15}→{0, . . . ,3}×{0, . . . ,3}
where
00,0 10,2 22,0 32,240,1 50,3 62,1 72,381,0 91,2 103,0 113,2121,1 131,3 143,1 153,3 (4)
This base map can be described in matrix format as follows:

$\begin{matrix} [\begin{matrix} 0 & 4 & 1 & 5 \\ 8 & 12 & 9 & 13 \\ 2 & 6 & 3 & 7 \\ 10 & 14 & 11 & 15 \end{matrix}] & (5) \end{matrix}$
Taking N=2^l, N²=4^lprobes can be placed by applying ƒ_l: {0, . . . ,4^l−1}, which after l −2 recursive steps, defined in equation (3), may reduce to the base case ƒ₂: {0, . . . ,15} shown in equations (4) and (5).

Function ƒ_lcan have the following distance properties. Let D(i, j) be the distance between probes p_iand p_j, when arrayed on a line (by relabeling the probes, one can view this as the index separation |i−j|). Let d(i, j) be their distance when arrayed on the surface. Then, the mapping ƒ_lmay guarantee that for all p_i, p_jfor which D(i, j)≦4^k, d(i, j)≧2^l−k−1, where k=0, . . . ,l−1. Furthermore, if d(i, j)=1, that is p_iis placed next to p_jon the surface, then D(i, j) ≧3·4^l−2.

For example, let l=3. There are N²=4^l=64 probes, which are placed by ƒ₃according to

$[\begin{matrix} 0 & 16 & 4 & 20 & 1 & 17 & 5 & 21 \\ 32 & 48 & 36 & 52 & 33 & 49 & 37 & 53 \\ 8 & 24 & 12 & 28 & 9 & 25 & 13 & 29 \\ 40 & 56 & 44 & 60 & 41 & 57 & 45 & 61 \\ 2 & 18 & 6 & 22 & 3 & 19 & 7 & 23 \\ 34 & 50 & 38 & 54 & 35 & 51 & 39 & 55 \\ 10 & 26 & 14 & 30 & 11 & 27 & 15 & 31 \\ 42 & 58 & 46 & 62 & 43 & 59 & 47 & 63 \end{matrix}]$
The probe distances

$\begin{matrix} k & D & d \\ 0 & 1 & 4 \\ 1 & 4 & 2 \\ 2 & 16 & 1 \end{matrix}$
likely satisfy the condition that if D≦4^k, then d≧N/2^k+1=2^l−k−1.

The problem of an automatic generation of probe sets for DNA microarrays is described in Krause et al., Second IEEE International Workshop on High Performance Computational Biology (HiCOMB 2003), online proceedings at URL http://hpc.eece.unm.edu/HiCOMB/proceedings.html. However, the work described by Krause et al. aims for a probe set that is, even in ideal circumstances, asymptotically much larger than the one generated by the exemplary embodiments of the present invention.

Other biological problems, such as identifying an unknown pathogen as a member of a list of known pathogens, be they viral (see Rash and Gusfield, in Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB '02), ACM Press, pp. 254-261) or bacterial (see Borneman et al., Bioinformatics 2001, 17:S39), likely have the same mathematical formulation as the problem of HLA typing discussed here. Those applications can also benefit from the improvements to existing approaches provided by the exemplary embodiments of the present invention.

B. Definitions

The exemplary problem of selecting the optimal microarray is described below. The problem of selecting the constituent probes can be reduced to a “best independent set” problem. The following sections can define the graph model employed, as well as the meaning of the term “best independent set,” and describe the optimizing algorithm.

1. Notation

There are N known alleles, and n potential probes. Each probe can be described by a “response vector” {right arrow over (v)}_{j ε{}0,1}^N, j=1, . . . ,n. The response vector data can be represented in tabular form:

$\begin{matrix} \begin{matrix} {\vec{v}}_{1} & {\vec{v}}_{2} & \dots & {\vec{v}}_{n} \\ {HLA}_{1} & 1 & 0 & \dots & 1 \\ {HLA}_{2} & 1 & 1 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ \\ {HLA}_{N} & 0 & 0 & \dots & 1 \end{matrix} & (6) \end{matrix}$
Column j is the response vector for probe {right arrow over (v)}_j:
{right arrow over (v)}_j=(v_j[1], v_j[2], . . . ,v_j[N])^T, (7)
and row i is the code for allele i, which we will call HLA_i:
HLA_i=(v₁[i], v₂[i], . . . ,v_n[i]) (8)

2. Original Graph

Each potential probe can form a vertex in the graph. Conceptually, an edge in the graph should connect two probes with shared characteristics.

First, essentially a complete edge- and vertex-weighted undirected graph G=(V,E) on n vertices is constructed, where n is the number of potential probes. In a general problem formulation, n can be very large. For each probe length L, there are likely 4^Lpossible probes. For instance, as shown in above in Section A.4, there may be over a trillion possible probes of length 20. Thus, the graph in the probe interaction model can be very large.

Second, weights are assigned to each vertex v and to each edge e, 0≦w(v), w(e)≦1. The weight of a vertex can be initially set to the “information content” of the corresponding probe response vector (PRV) with respect to the HLA typing problem:
w(v)=min {percentage of 0's, percentage of 1's}/100. (9)
The term “information content” can be used here differently than defined in information theory, where it may mean the minimum amount of information needed to send a string. See e.g., Cover and Thomas. Elements of Information Theory. John Wiley and Sons, New York. 1991. Ideally, if possible, all vertices should have weight 0.5. A vertex with a weight too close to zero can be uninformative, and the corresponding probe should only be used if it serves to differentiate an allele that is not distinguishable by using other, more informative probes.

The weight of an edge is initially set to the scaled Hamming distance of the probe response vectors represented by its endpoints:
w(e)=Hamming distance/vector length (10)
with values close to zero corresponding to sequence-similar probes.

3. Thresholded Graph

Third, the graph G is transformed by thresholding the edges. A threshold ρ, is selected and used to generate a modified graph G_mod=(V, E_mod), where
E_mod={e ε E: w(e)≦ρ}
is a set of unweighted edges, and the set of weighted vertices V is unchanged. The choice of threshold ρ will be discussed in further detail below in section D. Hereafter, this modified vertex-weighted graph is employed and denoted by G.

An independent set can be defined on the graph. An independent set is defined as a set of vertices such that for any pair of vertices, there is no edge between them. See Dictionary of Algorithms and Data Structures at NIST at URL http://www.nist.gov/dads/. In particular, a set of vertices V′⊂V can be an independent set if (u ε V′ and v ε V′) implies that {u, v} ∉ E.

4. Exemplary Goal

The best microarray can be defined above in Section A. The corresponding concept is now formulated on the graph model. The best independent set can be defined as a maximum weight yet minimum size independent set. Thus, such a set S ⊂ V must be an independent set, have maximum weight w(S)=Σ_vεSw(v), and minimum cardinality |S|. The condition of independence is meant to preclude any unintended interaction among the chosen probes. Maximum weight provides S with maximum discrimination power. Minimum size likely ensures that the smallest collection of probes is used to perform the preferred functions.

Since all vertex weights are nonnegative, the requirements of maximum weight and minimum cardinality are clearly contradictory. The definition can be relaxed somewhat by specifying a priori the desired size M of the set, and looking instead for the maximum weight independent set of size ≦M.

C. Optimization Procedure

To achieve this goal, a modification of a Maximal Independent Set procedure, described in Luby, Proceedings of the Seventh Annual ACM Symposium on Theory of Computing (STOC '85), pp. 1-10, is utilized.

1. Procedure Pseudocode

Given graph G and set size M:

- a) Initialization:
  - (i) Initialize a “current-best” list of independent sets, with associated information weights. It stores a list of the best, say, 20, independent sets seen so far, sorted by information weight.
- b) Restart Loop: Execute at least minRestartNum times; if the “current-best list” is not full (i.e., does not have 20 independent sets) by then, keep repeating until the list is filled.
  - (i) Initialize boosting weights; this was accomplished by setting the boosting weights to the information weights of the vertices:
    ∀v ε V, w_b(v)←w(v).
  - (ii) Boosting Loop: Repeat until no improvements have been made to the “current-best” list for a fixed number of iterations (say, five iterations).
    - aa. Choose a set S of M vertices randomly from V, with

$P (v \in S) = \frac{w_{b} (v)}{\sum_{u \in V} w_{b} (u)} .$

- - - bb. For each edge {u,v} in G|_s(the induced subgraph on S), eliminate one of the endpoint vertices. This leaves a set, S₁, of K≦M independent vertices.
    - cc. Adjust the boosting weights of vertices in S; for example, increase the boosting weights of the vertices in S₁:
      w_b(v)←a w_b(v) ∀v ε S₁
    - and decrease the boosting weights of the vertices in S-S₁:

$\begin{matrix} w_{b} (v) \leftarrow \frac{1}{a} w_{b} (v) & \forall v \in S - S_{1} \end{matrix}$

- - - where a≧1 is some previously selected constant.
    - dd. If S₁is not already in the “current-best” list and provides an improvement over some current member of the list, reset the noImprovements counter, locate an appropriate location for S₁in the list, and update the list. Otherwise, make a note that no changes to “current-best” list were made on this iteration (ie., increment the noImprovements counter).
  - (iii) If the condition for continuing the restart loop holds (namely, minRestartNum restarts have not yet been executed or the “current-best” list is not yet full), reset the noImprovements counter and repeat step (b).

2. Procedure Description

The procedure can utilize vertex boosting weights (e.g., initially set to probe information weights) to define a probability distribution on the vertex set. On each iteration of the boosting loop (step b.ii above), a random subset of a specified size can be selected according to the current probability distribution (step b.ii.aa). All edges in the induced subgraph on this random subset may be broken, with one of the terminal vertices removed (step b.ii.bb). The boosting weights of the elements of the subset can then be modified (step b.ii.cc), so that the vertices that stayed in the subset are more likely, and the vertices that were removed are less likely to be selected on the next iteration. The boosting loop can terminate after a certain number of iterations with no improvement to the list of top independent sets (to allow some flexibility, the algorithm keeps track of several of the top independent sets instead of only storing the best one seen so far).

The procedure can also restart the boosting loop several times with original probe information weights. This feature can be used to prevent convergence to a local optimum, which is possible for high values of the boosting factor.

3. Detailed Explanation of Procedure

The boosting procedure (as a whole) can be viewed as operating on the probability space of all subsets of our graph. Step b.ii.bb provides that the selected subset is independent, so that the probability distribution can only be supported on independent sets (i.e., the distribution is zero on all non-independent sets). In this view, the procedure can converge to a probability distribution where the best independent set has the highest probability. Each iteration of the boosting loop adjusts the probabilities associated with each vertex in the graph. The subset of interest is always drawn randomly according to the current probability distribution.

If the solution S* is known a priori, its selection by the procedure can possibly be guaranteed by initializing the boosting weights in step b.i to be
∀v ε S* , w_b(v)←1,
∀v ε V−S*, w_b(v)←0.

In other words, the associated probability distribution may have a probability of 1 for each vertex v ε S*, and a probability of 0 for each remaining vertex v ε V−S*.

Given an unlimited time for obtaining a solution (and an appropriate set of parameters), the boosting procedure can converge to this ideal distribution. However, when time is limited, a “good” (i.e., informative) independent set of size ≦M is only “more likely” than other independent sets of similar size. The procedure is provided to give an effective solution in a limited time, and yet be able to improve on it iteratively when more time is permitted (with minimal modifications to the loop terminating conditions).

For example, the best independent set may be, ideally, a fixed point for the algorithm, in the sense that if the procedure starts at a perturbed location in the subset probability space, it should converge to the optimal set. In particular, if the initial probability distribution is heavily favored towards a set that does not differ from the best set in many vertices, the procedure likely converges to the best set.

4. Breaking Edges

In step b.ii.bb of the boosting procedure can keep the vertex that has the higher boosting weight. If vertices have equal boosting weights, one at random (with probability ½) can be selected.

5. Selecting Scaling Factor a

In step b.ii.cc of the boosting algorithm, the weights of those vertices that were selected and kept are boosted (scaled up) by a factor of a≧1, while the weights of discarded vertices are scaled down by the same factor. This has the effect of noting which vertices are selected for membership in the independent set, and increasing the likelihood that these vertices will be selected in the future, with the reverse effect on the discarded vertices. The manner in which the value of the scaling factor a affects the “memory” of the probability space evolution is discussed below. A single “restart” of the procedure (namely, step b.ii) is described.

a) Extreme cases:

- a=1: No memory of previous selections. Ignore the current selection, and choose anew on the next iteration. The boosting procedure performs an exhaustive search.
- a=∞: Perfect memory. Once a set S of vertices is selected and pruned, and its elements' boosting weights are modified, each of the vertices remaining in the independent set S₁will have a boosting weight of ∞ and each of those thrown out of the independent set due to conflicts will have a boosting weight of 0. Thereafter, the boosting procedure will always choose the independent set selected on the first run.

b) Real values:

The boosting procedure is executed on the same graph model with several values of a ε {2, 1.5, 1.2, 1.1}. The executions with higher values of a were observed to terminate a single “restart” after a smaller number of iterations than those with lower values of a. Values of a can be chosen by many methods, known to those with ordinary skill in the art.

6. Choosing M: the Maximum Size of the Independent Set

This section contains a probabilistic analysis of an answer to the following question: What are the bounds on the number of probes, k, that is sufficient to distinguish N known alleles? In order to answer this question, certain assumptions can be made on the random distribution from which the known alleles are assumed to be drawn.

a) Similar PRV Entries

Assume that each probe, at each index i=1, . . . ,N, assumes values 0 and 1 independently and with equal probability. Consider k such probes and two alleles (HLA_land HLA_m). Thus, if HLA_lis fixed:
HLA_l=(HLA_l[1], . . . ,HLA_l[k]),
then for each j,
Pr(HLA_m[j]=HLA_l[j])=½
Pr(HLA_m[j]≠HLA_l[j])=½
then, for these two HLA vectors,

$\begin{matrix} \Pr (The Hamming distance between 2 HLA vectors = x) = (\begin{matrix} k \\ x \end{matrix}) 2^{- k}, & (11) \end{matrix}$
which can easily be seen as follows:

$\begin{matrix} \Pr (\begin{matrix} The Hamming dist bet ’ n \\ 2 HLA vectors = x \end{matrix}) = \Pr (\begin{matrix} HLA vectors differ in \\ exactly x positions \end{matrix}) \\ = \Pr (\begin{matrix} x successes in k Bernoulli \\ trials, where \\ success = {{HLA}_{m} [j] \neq {HLA}_{l} [j]} \\ and p = \Pr (success) = \frac{1}{2} \end{matrix}) \\ = (\begin{matrix} k \\ x \end{matrix}) {(\frac{1}{2})}^{x} {(\frac{1}{2})}^{k - x} \\ = (\begin{matrix} k \\ x \end{matrix}) 2^{- k} \end{matrix}$
Thus, for a fixed pair of alleles,

$\begin{matrix} \begin{matrix} \Pr (x \geq 1) = 1 - \Pr (x = 0) \\ = 1 - (\begin{matrix} k \\ 0 \end{matrix}) 2^{- k} & (by (11)) \\ = 1 - 2^{k}, \end{matrix} & (12) \\ and \\ \begin{matrix} \Pr (\forall_{pairs} x \geq 1) = \prod_{pairs} \Pr (x \geq 1) & (by indep .) \\ = \prod_{pairs} (1 - 2^{- k}) \\ = {(1 - 2^{- k})}^{# pairs} & (by (12)) \\ = {(1 - 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} \end{matrix} & (13) \end{matrix}$
since there are N distinct allele vectors and pairs are unordered.

This probability ideally should be greater than (1−ε) for some fixed small 0<ε<<1, i.e.,

$\begin{matrix} {(1 - 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} \overset{want}{>} 1 - ε . & (14) \end{matrix}$
First, the left-hand side term is bound:

$\begin{matrix} {(1 - 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} = {[{(1 - 2^{- k})}^{2^{k}}]}^{(\begin{matrix} N \\ 2 \end{matrix}) 2^{- k}} \\ > {(ⅇ^{- 1 - 2^{- k}})}^{(\begin{matrix} N \\ 2 \end{matrix}) 2^{- k}} \\ = ⅇ^{- (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k}} \end{matrix}$
where the inequality comes from the bound (appendix A)

$\begin{matrix} \begin{matrix} {(1 - \frac{1}{n})}^{n} > ⅇ^{- 1 - \frac{1}{n}} & for large n . \end{matrix} & (15) \end{matrix}$
For the inequality in (14) to operate, the bound (15) should be in the correct direction. Suppose a>b is desired. If instead a>c is demonstrated and the parameter is selected such that c>b, then it can be concluded from a>c>b that a>b. Therefore, the above inequality chain would work if found (15) holds.

Hereafter, the symbol is used to indicate the steps in the inequality reduction that can satisfy the previous statements whenever the parameter in question is selected to satisfy the current statement.

Thus, an inequality (14) can be reduced to the following:

$\begin{matrix} ⅇ^{- (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k}} > 1 - ε ⟺ - (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k} > \ln (1 - e) & (16) \end{matrix}$
Next, consider the right-hand term: ln (1−x)<−x for 0<x<1. Again, a>b can be desired. If b<d is demonstrated and the parameter is selected such that a>d, it can be concluded from a>d>b that a>b. This permits the reduction of the inequality (16) to the following:

$\begin{matrix} - (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k} > - ε & (17) \\ ⟺ ε > (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k} \\ ⟺ \frac{4^{k}}{2^{k} + 1} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}), & (18) \\ since \\ (1 + 2^{- k}) 2^{- k} = (2^{k} + 1) 2^{- 2 k} = (2^{k} + 1) 4^{- k} . & (19) \end{matrix}$
Furthermore,

$\begin{matrix} \begin{matrix} \frac{β^{2}}{β + 1} = \frac{β^{2} + β - β - 1 + 1}{β + 1} \\ = β - 1 + \frac{1}{β + 1} > β - 1 & \forall β > - 1. \end{matrix} & (20) \end{matrix}$
Hence, taking β=2^kyields

$\begin{matrix} \frac{4^{k}}{2^{k} + 1} > 2^{k} - 1. & (21) \end{matrix}$
Thus, (18) follows if k is chosen to satisfy

$(22)$ $2^{k} - 1 > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) ⟺ 2^{k} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) + 1$
The remaining chain of inequalities, from “desired” to “obtained,” can now be verified. For example, the inequality (17) can be extended

$- (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k} > - ε > \ln (1 - ε) ⟹ ⅇ^{- (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k}} > 1 - ε,$
which in turn can be extended

${(1 - 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} > ⅇ^{- (\begin{matrix} N \\ 2 \end{matrix}) (1 + 2^{- k}) 2^{- k}} > 1 - ε \begin{matrix} ⟹ {(1 - 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} > 1 - ε, & as desired . \end{matrix}$
Therefore, k (given ε, N) is selected to satisfy (22):

$2^{k} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) + 1$
A simpler bound on k can be obtained by imposing a stronger condition

$\begin{matrix} 2^{k} \overset{want}{>} (2 / ε) (\begin{matrix} N \\ 2 \end{matrix}), & (23) \end{matrix}$
which implies (22) since

$(1 / \in) (\begin{matrix} N \\ 2 \end{matrix}) > 1.$
The right-hand side of equation (23) simplifies to

$(2 / ε) (\begin{matrix} N \\ 2 \end{matrix}) = (2 / ε) \frac{N (N - 1)}{2} = (1 / ε) N (N - 1),$
so that
k>lg N+lg (N−1)−lg ε (24)
is equivalent to equation (23). Furthermore, since 2 lg N>lg N+lg (N−1), selecting
k>2 lg N−lg ε
certainly gives a value of k that satisfies (23). Therefore, requiring

$\begin{matrix} k > 2 \lg N - \lg ε & (25) \end{matrix}$
imposes the strongest condition of those listed above. Hence, a value of k that satisfies equation (25) also satisfies equation (22), and therefore the original desired inequality (14).

b) Dissimilar PRV Entries. A violation of the similarity assumptions may be modeled by an error term δ. Suppose a probe fails to contribute to a Hammning distance with probability (1+δ)/2. As discussed previously, each position of the HLA code vector can be considered as a Bernoulli trial, where success is defined as the event that j^thentry of a code vector contributes to the Hamming distance, i.e., {HLA_m[j]≠HLA_l[j]}, so that
q=Pr(failure)=(1+δ)/2
p=Pr(success)=(1−δ)/2
Therefore,

$\begin{matrix} \begin{matrix} \Pr (The Hamm dist = x) = (\begin{matrix} k \\ x \end{matrix}) {(\frac{1 - δ}{2})}^{x} {(\frac{1 + δ}{2})}^{k - x} \\ = (\begin{matrix} k \\ x \end{matrix}) {(1 - δ)}^{x} {(1 + δ)}^{k - x} 2^{- k} \end{matrix} & (26) \end{matrix}$
Continuing as before, the following condition can be obtained.

$\begin{matrix} \Pr (x \geq 1) = 1 - \Pr (x = 0) = 1 - (\begin{matrix} k \\ 0 \end{matrix}) {(1 - δ)}^{0} {(1 + δ)}^{k - 0} 2^{- k} = 1 - {(1 + δ)}^{k} 2^{- k}, and \Pr (\forall_{pairs} x \geq 1) = {(1 - {(1 + δ)}^{k} 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} . & (27) \end{matrix}$
This probability should be bigger than (1−ε). In other words,

$\begin{matrix} {(1 - {(1 + δ)}^{k} 2^{- k})}^{(\begin{matrix} N \\ 2 \end{matrix})} \overset{want}{>} 1 - ε & (28) \\ LHS (28) \overset{by (15)}{>} ⅇ^{- (\begin{matrix} N \\ 2 \end{matrix}) (1 + {((1 + δ) / 2)}^{k}) {((1 + δ) / 2)}^{k}} & (29) \\ ⟸ \overset{want}{>} 1 - ε \\ ⟺ - (\begin{matrix} N \\ 2 \end{matrix}) [1 + {(\frac{1 + δ}{2})}^{k}] {(\frac{1 + δ}{2})}^{k} > \ln (1 - ε) & (30) \\ ⟸ - (\begin{matrix} N \\ 2 \end{matrix}) [1 + {(\frac{1 + δ}{2})}^{k}] {(\frac{1 + δ}{2})}^{k} \overset{want}{>} - ε \\ ⟺ ε > (\begin{matrix} N \\ 2 \end{matrix}) [1 + {(\frac{1 + δ}{2})}^{k}] {(\frac{1 + δ}{2})}^{k} \\ ⟺ \frac{{(\frac{2}{1 + δ})}^{2 k}}{{(\frac{2}{1 + δ})}^{k} + 1} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}), \end{matrix}$
where the last transformation is obtained as in equation (19), replacing 2 by 2/(1+δ). The same substitution in (21) (i.e., taking β=(2/(1+δ)^kin (20)) yields

$\begin{matrix} \begin{matrix} \frac{{(\frac{2}{1 + δ})}^{2 k}}{{(\frac{2}{1 + δ})}^{k} + 1} > {(\frac{2}{1 + δ})}^{k} - 1 & \forall k \in ℕ \end{matrix} & (31) \end{matrix}$
Thus, equation (30) follows if k is chosen to satisfy

$(32)$ ${(\frac{2}{1 + δ})}^{k} - 1 > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) ⟺ {(\frac{2}{1 + δ})}^{k} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) + 1$
Again, a simpler bound on k can be obtained by imposing a stronger condition

$\begin{matrix} {(\frac{2}{1 + δ})}^{k} \overset{want}{>} (2 / ε) (\begin{matrix} N \\ 2 \end{matrix}) = (1 / ε) N (N - 1) & (33) \end{matrix}$
so that

$\begin{matrix} \lg [{(\frac{2}{1 + δ})}^{k}] = k (1 - \lg (1 + δ)) & (34) \\ \overset{want}{>} \lg N + \lg (N - 1) - \lg ε \\ ⟸ k (1 - \lg (1 + δ)) > 2 \lg N - \lg ε . \\ k > \frac{1}{(1 - \lg (1 + δ))} [2 \lg N - \lg ε] & (35) \end{matrix}$

c) Non-unit Minimum Hamming Distance. Further, the preferable size k can be estimated for (almost) any desired minimum Hamming distance between allele code vectors. As demonstrated above, the Hamming distance between a pair of HLA vectors is a binomial random variable x˜S (n, p) where # trials ≡n=k, Pr (success)≡p=(1−δ)/2, and Pr (failure)≡q=(1+δ)/2:

$\Pr (x) = (\begin{matrix} k \\ x \end{matrix}) {(1 - δ)}^{x} {(1 + δ)}^{k - x} 2^{- k}$
Its mean is np=k(1−δ)/2 and variance is npq=k(1−δ)/4. The following estimate can be obtained (using Chernoff bounds):
Pr(x≦k(1−δ)/4)≦e^{−k(1−δ)/16} (36)
Chernoff inequality states (see Appendix B for the proof):

$\begin{matrix} \Pr (S (n, p) \leq (1 - λ) np) \leq e^{- \frac{λ^{2}}{2} np} & (37) \end{matrix}$
n=k, p=(1−δ)/2, and let λ=½. Then by equation (37),

$\begin{matrix} \Pr (x \leq k (1 - δ) / 4) \leq ⅇ^{- \frac{{(1 - 2)}^{2}}{2} k \frac{1 - δ}{2}} \\ = ⅇ^{- \frac{1}{8} k \frac{1 - δ}{2}} \\ = ⅇ^{- k (1 - δ) / 16} \end{matrix}$
Thus, it can be estimated for which k
Pr(∀_pairsx≧k(1−δ)/4)≧1−ε (38)
From equation (36), it is possible to obtain
Pr(x≧k(1−δ)/4)≧1−e^{−k(1−δ)/16}
and hence,

$\begin{matrix} \begin{matrix} \Pr (\forall_{pairs} x \geq k (1 - δ) / 4) = \Pr {(x \geq k (1 - δ) / 4)}^{(\begin{matrix} N \\ 2 \end{matrix})} \\ \geq {(1 - ⅇ^{- k (1 - δ) / 16})}^{(\begin{matrix} N \\ 2 \end{matrix})} \\ \overset{want}{>} 1 - ε \\ ⟸ (expression (39)) > \\ \exp {- (\begin{matrix} N \\ 2 \end{matrix}) (1 + ⅇ^{- k (1 - δ) / 16}) ⅇ^{- k (1 - δ) / 16}} \overset{want}{>} 1 - ε \\ ⟺ - (\begin{matrix} N \\ 2 \end{matrix}) (1 + ⅇ^{- k (1 - δ) / 16}) ⅇ^{- k (1 - δ) / 16} > \ln (1 - ε) \\ ⟸ - (\begin{matrix} N \\ 2 \end{matrix}) (1 + ⅇ^{- k (1 - δ) / 16}) ⅇ^{- k (1 - δ) / 16} > - ε \\ (see (17)) \\ ⟺ ε > (\begin{matrix} N \\ 2 \end{matrix}) (1 + ⅇ^{- k (1 - δ) / 16}) ⅇ^{- k (1 - δ) / 16} \\ ⟺ \frac{ⅇ^{k (1 - δ) / 8}}{ⅇ^{k (1 - δ) / 16}} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) \\ ⟸ ⅇ^{k (1 - δ) / 16} - 1 > \begin{matrix} (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) & (by (20)) \end{matrix} \\ ⟺ ⅇ^{k (1 - δ) / 16} > (1 / ε) (\begin{matrix} N \\ 2 \end{matrix}) + 1 \\ ⟸ ⅇ^{k (1 - δ) / 16} > (2 / ε) (\begin{matrix} N \\ 2 \end{matrix}) = (1 / ε) N (N - 1) \\ (see (23)) \\ ⟺ k (1 - δ) / 16 > \ln N + \ln (N - 1) - \ln ε \\ ⟸ k (1 - δ) / 16 > 2 \ln N - \ln ε \\ ⟺ k > \frac{16}{1 - δ} [2 \ln N - \ln ε] \end{matrix} & (39) \end{matrix}$
Therefore if,

$(40)$ $k > \frac{16}{1 - δ} [2 \ln N - \ln ε]$ $then$ $\Pr (\forall_{pairs} x \geq k (1 - δ) / 4) \geq 1 - ε$
For this exemplary probe set (with k=k(ε, N, δ)), an arbitrarily high probability can be obtained, by the selection of ε in (38), that all BLA coding vectors have pairwise Hamming distance of at least k(1−δ)/4. Thus, this exemplary probe set is able to correct k(1−δ)/8 errors by choosing the coding vector closest to that obtained.

The error term δ should then be estimated. This can be accomplished on a given set S of probes by sampling pairs {l, m} of indices on probes from S and examining the resulting 2-vectors on {0, 1}. The probability of a failure to contribute to the Hamming distance (given by (1+δ)/2) can be estimated by the frequency f₌ of observing equal entries in the 2-vector, since each probe with equal entries in the 2-vector fails to contribute to the Hamming distance between alleles l and m. Therefore,
{circumflex over (δ)}=2f₌−1 (41)
Let f_≠ denote the frequency of observing unequal entries in the same setting. Thus, f₌+f_≠=1, and it is possible to obtain
1−δ=1−(2f₌−1)=2(1−f₌)=2f_≠. (42)
The bound in equation (40) becomes

$\begin{matrix} \begin{matrix} k > \frac{16}{2 f_{\neq}} [2 \ln N - \ln ε] \\ = \frac{8}{f_{\neq}} [2 \ln N - \ln ε] \end{matrix} & (43) \end{matrix}$

Thus, to generate distinct coding vectors for all alleles (e.g., to guarantee a Hamming distance d_H(c_i, c_j)≧1 w.p. >1−ε), it is possible to select M>k, where k satisfies (35) with δ estimated as in (41). It is also possible to choose M to allow for error correction of up to D/2 errors (guaranteeing w.p.>(1−ε) a minimum Hamming distance d_H(c_i, c_j)≧D): set
D=k(1−δ)/4=kf_≠/2
in equation (38), so that k=2D/f_≠ should satisfy equation (40), and, again, select M>k.
D. Pre-Processing

1. Initial Probe Selection

In section B.2, it was described that starting with all possible probes results in a graph that has many vertices. The discussion below provides certain pre-processing steps that allow the elimination of a large portion of this probe set.

a. Probes that do not hit the BLA region on any allele. Many of the possible length-L probes generally do not provide sequence-specific information about the target. As such, they may be safely left out of our probe selection process. This would allow for a reduction of the starting (perfectly matched) probe set to those probes that are complementary to a subsequence of at least one of the alleles. A way to obtain such set can be as follows. Assume that the allele sequences are provided in the 5′ to 3′ orientation.

A window of length L can be considered along allele T₁. Denote the length of the allelic sequence by len(T₁), and index elements of the sequence starting with 1, so that the entire allele sequence can be denoted by
T₁[1] . . . T₁[len(T₁)].
A probe complementary to the allele subsequence seen through the window [1 . . . L] can be constructed and placed in the set. The window then may be shifted by one nucleotide in the direction of the 3′-end. This process can be repeated until the last window [k . . . (L+k−1)] reaches the end of the target sequence:
L+k−1=len(T₁)
k=len(T₁)−L+1,
thus, generating a set of (len(T₁)−L+1) probes, each perfectly complementary to T₁.

The procedure described above can generate all probes of length L that are, e.g., perfectly complementary to a length-L subsequence of the target (ie., allele) sequence. Depending on the form in which the allele sequences are given, it may also be desirable to include probes corresponding to windows that are partially shifted off the allele, i.e., windows showing a portion of the given allele sequence together with the corresponding 5′-tail of the sequence, if the window is shifted to the left, or the 3′-tail, if the window is shifted to the right. There are 2(L−1) such probes, corresponding to indices
[(len(T₁)−L+2) . . . (len(T₁)+1)] , . . . ,
[len(T₁) . . . (len(T₁)+L−1)]
for the right-shifted windows and “indices”
[0 . . . (L−1)], . . . , [(2−L) . . . 1]
for the left-shifted windows.

This process can be repeated for the other alleles T₂, . . . ,T_N. To avoid placing duplicate probes in the set, a generated probe can be added to the set if its sequence is not already present. Alternately, the duplicates can be weeded out subsequently. It should be noted that this may have the added advantage of eliminating probes hitting sequence repeats.

The above-described exemplary process has the effect of eliminating probes that hit genomic sequences outside the target region, including probes that hit introns (if allelic sequences are provided in genomic DNA form) from the original collection of all possible probes of length L. The resulting set may contain only those probes complementary to subsequences of the HLA region, or sub-words of the pool of all allele sequences.

b. Non-informative probes. In the set created as described in the previous section, some (and perhaps many) of the probes would not be able to give any information useful for distinguishing among the alleles. These are the probes that may be drawn from the windows that are shared among the alleles—they hybridize to a common subsequence of the alleles. Any such probe may be useless for discriminating alleles: to such a probe, all alleles will look alike. Therefore, these probes can be safely eliminated from the potential probe set.

- To locate all such probes, it is necessary to find the common subsequences of length≧L of all the alleles, identify the probes complementary to these subsequences, and remove these probes from the set. This can be done as the next step in the “refinement” of the starting probe set, and/or included as a condition in the process for probe addition specified in the previous section.

c. Potential for cross-hybridization. The probes that are likely to hit multiple sites on the target sequence(s), such as those hitting a repeated region, should be eliminated, as is usually done in microarray design, as their use is likely to produce a high level of noise. Each probe is usually expected to have a unique site on the target. See e.g., Lockhart et al., Nature Biotechnology 1996, 14:1675; Kaderali and Schliep, An algorithm to select target specific probes for DNA chips at URL http://citeseer.nj.nec.com/kaderali01algorithm.html; Li and Stormo, Bioinformatics, 2001, 17:1067.

2. Graph Generation

a. Generating Probe Response Vectors. Once a set of initial probes is selected, as described above, a probe response vector (7) must be generated for each of these probes. To do that, given probe j, string-matching is performed on each of the N alleles for the Watson-Crick complement of probe j, and the results are used to set

$v_{j} [i] = {\begin{matrix} 1, & if there is a match with allele T_{i} \\ 0, & otherwise \end{matrix}$

b. Choice of Edge Threshold. The edge threshold parameter ρ was used in section B.3 to transform the initial complete edge-weighted graph. Its value determines how many edges remain in the graph, as well as how “independent” each independent set on the graph really is.

- If ρ is too small, there will be very few edges in the graph. Most of the random sets selected by the boosting algorithm will prove to be independent. However, upon examination in post-processing, described in section B.3, it may be found that many of these sets do not possess enough discrimination power to discern all N known alleles.
- If, on the other hand, ρ is too large (e.g., ρ>0.5), the graph will be very dense (i.e., have a lot of edges). The algorithm will then have a much harder time finding an independent set of large enough size. The output sets will likely contain much fewer than M vertices, and there may not be enough probes in the candidate sets to discern all N alleles.
- A reasonable value of ρ is obtained by trial and error on a given set of potential probes.
  E. Post-Processing

The procedure (as described in section C.1) can return a list of 20 best independent sets, sorted by the total information weight of the constituent vertices. Each set may be composed of e.g., at most M vertices (probes). While the independence and maximum weight conditions can be selected to steer each selected set towards maximum discrimination power, this desired outcome may not be guaranteed. Thus, each of these best independent sets should be checked for redundancy of the allele coding vectors. Given a set S of probe response vectors, the N allele coding vectors generated by S (the rows in (6)) may be extracted and their pairwise Hamming distances d_H(c_i,c_j), 1≦i≦j≦|S| may be computed. If min_{ij} d_H(c_i, c_j)=0, the set lacks discrimination power: at least two of the codes are the same, so the set will not be able to discern all known alleles. Such a set should not be used “as is”, and should either be discarded or supplemented by additional probes. This may indicate that the set is not truly independent, so the choice of edge threshold ρ, discussed in Section D.2.b, was inappropriate.

It is possible to make the testing more stringent, in order to allow for up to D/2 errors in the data, as discussed in Section C.6. Those independent sets of the list of best sets that pass the redundancy testing (by satisfying min_{ij} d_H(c_i, c_j)≧D), in fact satisfy a definition stronger than that formulated for the best independent set. D-best independent set denotes a best independent set with the additional condition min_{i,j} d_H(c_i, c_j)≧D.

Those sets that pass the redundancy test can be reordered by aveHamDist=ave_{i,j} d_H(c_i,c_j): once the minimum allele code separation is guaranteed, the usefulness of a probe set to the HLA typing problem can be judged by the metric aveHamDist.

F. Interpreting Results

Given the best independent set generated by the above-described procedure a determination regarding how it can be converted into a microarray for genotyping or haplotyping experiments must be made.

In order to generate the microarray corresponding to an independent set of vectors yielded by the procedure (followed by post-processing steps from Section E), the DNA sequence for each probe, which was used to generate the probe response vector used in the analysis, should be recalled. The spatial arrangement of these probes on the chip surface can be decided as discussed above in Section A.4.b.

G. Additional Applications

Many extensions of the approaches presented herein are possible within the scope of the present invention. Two exemplary approaches are discussed below.

1. Extending Weight Functions in the Graph Model

The graph model discussed herein relies on the characteristics of the probe response vectors to define the weights of vertices and edges. While this model can generate certain interesting results, it can be extended to a more meaningful model by incorporating the physical properties of the probe sequences and their interactions, some of which are described in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3.

In particular, the annotation of all potential probes with physical properties, such as melting temperature, free energy, entropy, and enthalpy of hybridization, for perfect matches and for closest matches in other alleles can be used to define cost functions that determine the weights. While the vertex weight may provide a measure of the performance of the corresponding probe in discriminating among known alleles, the pairwise probe interaction and the resulting competition effects, as described in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3, can be reflected in the edge weights.

2. Pooling Real Data from Previously Tested Chips

Another extension of the procedure discussed herein involves the use of data from microarray chips used for HLA typing by different companies. Many biotechnology companies are working on the HLA typing problem in the hope of designing probe sets that give the answer more quickly and with greater accuracy. The sequences of the probes may generally be considered to be proprietary information and thus not shared. As a result, the collection of experimental data from testing the various probes in different combinations and arrangements on the microarray chips generated by different companies may almost never be examined as a whole.

It is possible to employ the probe interaction model presented herein to make use of the aggregate experimental data. Suppose the following information can be obtained: a set of microarray chips along with some identifiers, if not the actual sequences, of the probes comprising each chip, and values measuring the performance of each chip in all previously conducted HLA typing experiments. That is, for each chip, there is a list of unique probe identifiers and some measure of how well this chip performed in HLA typing. It may not be necessary to know the sequence of each probe, so long as the uniqueness of the identifiers can be verified by the company providing the data. It is possible to combine the information from a large number of such previously tested chips to generate a plan for a new microarray chip (i.e., a collection of probe identifiers and their spatial arrangement) with a performance value higher than that of “input” chips by the following process. The probe content and arrangement for each chip, together with its performance value, can be used to build the graph model. Vertex weights can be inferred from chip membership information. Edge weights can be estimated from conditional probabilities using pairwise membership information—that is, by considering two chips at a time, quantities such as the conditional probability that probe P_iwas used on chip C_j, given that it was used on chip C_k, can be estimated. Once the graph is constructed, the boosting-algorithm can be used to generate the best set of probes, as discussed in Section C.1.

All publications cited above are incorporated herein by reference in their entireties.

APPENDIX

Appendix A: Exponential Limit Inequality: Proof

Claim: For large n,

$\begin{matrix} {(1 - \frac{1}{n})}^{n} > ⅇ^{- 1 - \frac{1}{n}} . & (A1) \end{matrix}$

Proof:

Inequality (A1) is equivalent to

$\begin{matrix} \ln [{(1 - \frac{1}{n})}^{n}] \overset{want}{>} - 1 - \frac{1}{n} . & (A2) \end{matrix}$

Since the series expansion of the logarithm is given by

$\begin{matrix} \ln (1 - x) = - \sum_{j = 1}^{\infty} \frac{x^{j}}{j} = - x - \frac{x^{2}}{2} - \frac{x^{3}}{3} - \dots for \langle x \rangle < 1, & (A3) \end{matrix}$
we can expand the left-hand side of (A2) as follows.

$\begin{matrix} \begin{matrix} \ln [{(1 - \frac{1}{n})}^{n}] = n \ln (1 - \frac{1}{n}) \\ = n {- \sum_{j = 1}^{\infty} \frac{{(\frac{1}{n})}^{j}}{j}} (by (A 3)) \\ = - n {\sum_{j = 1}^{\infty} \frac{1}{{jn}^{j}}} \\ = - \sum_{j = 1}^{\infty} \frac{1}{{jn}^{j - 1}} \\ = - 1 - \frac{1}{2 n} - \frac{1}{3 n^{2}} - \frac{1}{4 n^{3}} - \dots \end{matrix} & (A 4) \end{matrix}$

Thus, inequality (A2) reduces to

$\begin{matrix} - 1 - \frac{1}{2 n} - \frac{1}{3 n^{2}} - \frac{1}{4 n^{3}} - \dots \overset{want}{>} - 1 - \frac{1}{n} or, equivalently, & (A 5) \\ ⟺ \frac{1}{3 n^{2}} + \frac{1}{4 n^{3}} + \frac{1}{5 n^{4}} + \dots \overset{want}{<} - \frac{1}{2 n} + \frac{1}{n} = \frac{1}{2 n} & (A6) \end{matrix}$
Now,

$\begin{matrix} \frac{1}{3 n^{2}} + \frac{1}{4 n^{3}} + \frac{1}{5 n^{4}} + \dots < \frac{1}{3 n^{2}} + \frac{1}{3 n^{3}} + \frac{1}{3 n^{4}} + \dots \\ = \frac{1}{3 n^{2}} (1 + \frac{1}{n} + \frac{1}{n^{2}} + \dots) \\ = \frac{1}{3 n^{2}} \cdot \frac{1}{1 - \frac{1}{n}} & (geometric sum) \\ \frac{1}{3 n} \cdot \frac{1}{n - 1} \overset{want}{<} \frac{1}{2 n} & (by (A 6)) \end{matrix}$
Simplifying yields

$⟺ \frac{1}{3 (n - 1)} \overset{want}{<} \frac{1}{2} ⟺ 2 < 3 (n - 1) = 3 n - 3 ⟺ 5 < 3 n,$
which holds for every n≧2. Retracing the chain of inequalities, we obtain

$\begin{matrix} {(1 - \frac{1}{n})}^{n} > ⅇ^{- 1 - \frac{1}{n}} \forall n \geq 2 & (A7) \end{matrix}$
as desired.
Appendix B: Chernoff's Inequality: Proof

Claim:

$\begin{matrix} \Pr (S (n, p) \leq (1 - ε) n p) \leq e^{- s \frac{2}{2} n p}, ε \in (0, 1) . & (B1) \end{matrix}$

Proof:

S is a Binomial random variable:
S(n,p)=X₁+ . . . +X_n, (B2)
where X_iare i.i.d.r.v.'s with

$\begin{matrix} X_{i} = {\begin{matrix} 1 & w . p . p \\ 0 & w . p . (1 - p) \end{matrix}, i = 1, \dots, n & (B 3) \end{matrix}$
Therefore,

$\begin{matrix} E (S) = \sum_{i = 1}^{n} E (X_{i}) = \sum_{i = 1}^{n} p = np . & (B4) \end{matrix}$
Since

$\begin{matrix} S \leq (1 - ε) np ⟺ S - np \leq - ε np ⟺ \begin{matrix} λ (S - np) \leq - λ ε np & \forall λ > 0 \end{matrix} ⟺ \begin{matrix} - λ (S - np) \geq λ ε np & \forall λ > 0 \end{matrix} ⟺ e^{- λ (S - np)} \geq e^{λε np} \forall λ > 0, & (B5) \end{matrix}$
it follows that

$\begin{matrix} \Pr (S \leq (1 - ε) np) = \Pr (e^{- λ (S - np)} \geq e^{λ ε np}) & (B6) \\ \begin{matrix} \leq \frac{E [e^{- λ (S - np)}]}{e^{λ ε np}} & (by Markov ’ s inequality) \end{matrix} & (B7) \end{matrix}$
For a proof of Markov's inequality, see, e.g., [24].

From (B2) and (B3), we know that

$\begin{matrix} S - np = \sum_{i = 1}^{n} X_{i} - np = \sum_{i = 1}^{n} (X_{i} - p) . & (B8) \end{matrix}$
Therefore,

$\begin{matrix} E [e^{- λ (S - np)}] = E [e^{- λ \sum_{i} (X_{i} - p)}] & (B9) \\ = E [\prod_{i = 1}^{n} e^{- λ (X_{i} - p)}] \\ \begin{matrix} = \prod_{i = 1}^{n} E [e^{- λ (X_{i} - p)}] & (by independence) \end{matrix} \\ \begin{matrix} = {E [e^{- λ (X_{i} - p)}]}^{n} & (by (B3)) \end{matrix} \\ E [e^{- λ (X_{1} - p)}] = p e^{- λ (1 - p)} + (1 - p) e^{- λ (- p)} & (B10) \\ = e^{λ p} (p e^{- λ} + (1 - p)) \\ = e^{λ p} (1 + \underset{\underset{u}{︸}}{p (e^{- λ} - 1)}) \\ \begin{matrix} \leq e^{λ p} (e^{p (e^{- λ} - 1)}) & (since 1 + u \leq e^{u} \forall u) \end{matrix} \\ = e^{p (e^{- λ} - 1 + λ)} & (B11) \\ \leq e^{p \frac{λ^{2}}{2}}, \end{matrix}$
where the last inequality follows from

$\begin{matrix} \begin{matrix} e^{- λ} \leq 1 - λ + \frac{λ^{2}}{2} & \forall λ > 0 \end{matrix} ⟹ e^{- λ} - 1 + λ \leq \frac{λ^{2}}{2} & (B12) \end{matrix}$
Therefore, by (B9) and (B11),

$\begin{matrix} E [e^{- λ (S - np)}] \leq {(e^{p \frac{λ^{2}}{2}})}^{n} = e^{\frac{λ^{2}}{2} np} & (B13) \\ and \\ \Pr (S \leq (1 - ε) np) \leq e^{- λ ε np} E [e^{- λ (S - np)}] & (B14) \\ \leq e^{- λ ε np} e^{\frac{λ^{2}}{2} np} \\ \begin{matrix} = e^{np (\frac{λ^{2}}{2} - λ ε)} & \forall λ > 0 \end{matrix} \end{matrix}$
so that

$f (λ^{*}) = e^{np (\frac{ε^{2}}{2} - ε^{2})} = e^{- \frac{ε^{2}}{2} np}$
and, from (B14),

$\begin{matrix} \begin{matrix} \Pr (S \leq (1 - ε) np) \leq e^{- \frac{ε^{2}}{2} np} & \forall ε \in (0, 1) \end{matrix} & (B16) \end{matrix}$
as desired.

It remains to check that the optimizing λ=λ* is a minimum of f(λ), that is, f″(λ*)>0. By (B15),

$\begin{matrix} f^{″} (λ^{*}) = f^{'} ((λ) \cdot np (λ - ε) + f (λ) \cdot np |_{λ = λ^{*}} \\ = \underset{\underset{0}{︸}}{f^{'} (λ^{*})} \cdot {np (0)}^{'} + f (λ^{*}) \cdot np \\ = np e^{- \frac{ε^{2}}{2} np} > 0. \\ ∴ λ^{*} = ε is a minimum . \end{matrix}$

Claims

1. A method for at least one of genotyping or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample, comprising:

i) providing one or more microarrays that include a set of oligonucleotide probes that are capable of detecting the at least one of the genotypes, the haplotypes or the strain variant;

ii) hybridizing the DNA sample to the one or more microarrays to create a hybridization pattern;

iii) determining at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern; and

iv) optimizing at least one of the set or an arrangement of the oligonucleotide probes as a function of at least one of a match criteria or a mismatch criteria between a true allele contained in the DNA sample and an allele determined by the hybridizing step.

2. The method of claim 1, wherein the one or more microarrays include a set of oligonucleotide probes that are capable of detecting at least one of all known genotypes or all known haplotypes at the polymorphic genetic loci or the strain identification.

3. The method of claim 1, wherein the one or more microarrays are configured to include at least one of an optimal set or an optimal arrangement of oligonucleotide probes.

4. The method of claim 1, wherein the mismatch criteria is the following: min ⁢ ∑ type ⁢ ⁢ j ⁢ ⁢ w j ⁢ E [ ∏ T j ≠ T ^ j ⁢ ] ⇔ min ⁢ ∑ type ⁢ ⁢ j ⁢ ⁢ w j ⁢ Pr ( T j ≠ T ^ j ). ∏ x ⁢ ⁢ = { 1, if ⁢ ⁢ X ⁢ ⁢ is ⁢ ⁢ true 0, otherwise },

wherein Tj is a true allele contained in the DNA sample, {circumflex over (T)}j is the allele determined by the hybridization step,

and wj is a weight assigned to at least one of the genotype or the haplotype j.

5. The method of claim 4, wherein the weights are provided as follows: wj=1∀j, wherein ∀j is a set of at least one of all known genotypes or all known haplotypes at one or more predetermined polymorphic genetic loci.

6. The method of claim 4, wherein the weights are provided as follows: wj is different for each genotype or haplotype.

7. The method of claim 1, further comprising: at least one of displaying or storing data associated with the at least one of the genotype, the haplotype or the strain variant in a storage arrangement in at least one of a user-accessible format or a user-readable format.

8. A method for at least one of genotyping or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample, comprising: wherein step (iii) produces a vector of n measurements, and wherein n is a number of probes contained on the one or more microarrays.

i) providing one or more microarrays that include a set of oligonucleotide probes that are capable of detecting the at least one of the genotypes, the haplotypes or the strain variant;

ii) hybridizing the DNA sample to the one or more microarrays to create a hybridization pattern; and

iii) determining at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern,

9. The method of claim 8, wherein the n potential probes provided to identify N known genotypes or haplotypes are each associated with a response vector {right arrow over (υ)}j ε{0,1}N, j=1,..., n.

10. The method of claim 9, further comprising generating a graph G on vertices corresponding to probe response vectors.

11. The method of claim 10, wherein the graph G is a complete edge-weighted and vertex-weighted undirected graph G=(V, E) provided on n vertices, wherein n is the number of potential probes.

12. The method of claim 11, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.

13. The method of claim 12, wherein the weight w of a vertex v is set to:

w(v)=min{fraction of 0's, fraction of 1's}.

14. The method of claim 12, wherein the weight w of an edge e={u,v} is set to: wherein Hamming distance is measured between the probe response vectors corresponding to vertices u and v, and vector length is the length of the probe response vectors, namely, N.

w(e)=Hamming distance/vector length,

15. The method of claim 11, further comprising modifying the graph G by thresholding the edges such that the modified graph Gmod is defined as Gmod=(V, Emod), wherein Emod={e ε E: w(e)≦ρ}, and ρ is a selected threshold value.

16. The method of claim 15, wherein, for the modified graph Gmod and the probe set size M, the following is performed:

i) initializing a current-best list of independent sets with associated information weights,

ii) initializing vertex boosting weights to vertex weights w(v),

iii) defining a probability distribution on the vertex subset based on vertex boosting weights,

iv) choosing a random subset of vertices of a specified size M based on the probability distribution,

v) eliminating one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset,

vi) modifying the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in step (iv) but eliminated in step (v), and

vii) repeating steps (iii) through (vi) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.

17. The method of claim 16, wherein, for the modified graph Gmod and the probe set size M, steps (ii) through (vii) are repeated for a predetermined number of iterations, each iteration starting with reinitializing vertex boosting weights to vertex weights w(v) in step (ii).

18. The method of claim 17, wherein, for a given fixed small 0<ε<<1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧1)>1−ε.

19. The method of claim 17, wherein, for a given fixed small 0<ε<<1 and a fixed α>1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧α)>1−ε.

20. The method of claim 16, wherein the threshold ρ has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G.

21. The method of claim 20, wherein the lower bound A is a relatively small constant, and the upper bound B is a function of the number of vertices n.

22. The method of claim 8, further comprising: at least one of displaying or storing data associated with at least one of the genotype, the haplotype, the strain variant or the vector in a storage arrangement in at least one of a user-accessible format or a user-readable format.

23. A non-transitory storage medium which includes thereon a software arrangement for providing one or more microarrays, which configures a processing arrangement to perform the procedures comprising:

i) receiving information regarding a hybridization of the DNA sample to one or more microarrays to create a hybridization pattern, the one or more microarrays including a set of oligonucleotide probes that are capable of detecting at least one set of genotypes or haplotypes for a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample;

ii) determining at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern; and

iii) optimizing at least one of the set or an arrangement of the oligonucleotide probes as a function of at least one of a match criteria or a mismatch criteria between a true allele contained in the DNA sample and an allele determined from the hybridization.

24. The storage medium of claim 23, wherein the mismatch criteria is the following: min ⁢ ∑ type ⁢ ⁢ j ⁢ ⁢ w j ⁢ E ⁢ ⌊ ∏ T j ≠ T ^ j ⁢ ⌋ ⇔ min ⁢ ∑ type ⁢ ⁢ j ⁢ ⁢ w j ⁢ Pr ( T j ≠ T ^ j ). ∏ x ⁢ ⁢ = { 1, if ⁢ ⁢ X ⁢ ⁢ is ⁢ ⁢ true 0, otherwise },

wherein Tj is a true allele contained in the DNA sample, {circumflex over (T)}j is the allele determined by the hybridization step,

and wjis a weight assigned to at least one of the genotype or the haplotype j.

25. The storage medium of claim 24, wherein the weights are provided as follows: wj=1∀j, wherein ∀j is a set of at least one of all known genotypes or all known haplotypes at one or more predetermined polymorphic genetic loci.

26. The storage medium of claim 24, wherein the weights are provided as follows: wj is different for each genotype or haplotype.

27. A system for at least one of genotyping or haplotyping polymorphic genetic loci or strain identification in a deoxyribonucleic acid (DNA) sample, comprising:

a processing device which, when executed, is configured to:

i) receive information regarding a hybridization of the DNA sample to one or more microarrays to create a hybridization pattern, the one or more microarrays including a set of oligonucleotide probes that are capable of detecting at least one set of genotypes or haplotypes for a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample;

ii) determine at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern; and

iii) optimize at least one of the set or an arrangement of the oligonucleotide probes as a function of at least one of a match criteria or a mismatch criteria between a true allele contained in the DNA sample and an allele determined by the hybridization.

28. The system of claim 27, wherein the mismatch criteria is the following: min ⁢ ∑ type ⁢ ⁢ j ⁢ w j ⁢ ⁢ E [ Π T j ≠ T ^ j ⁢ ] ⇔ min ⁢ ∑ type ⁢ ⁢ j ⁢ w j ⁢ ⁢ Pr ⁡ ( T j ≠ T ^ j ). ∏ x ⁢ ⁢ = { 1, if ⁢ ⁢ X ⁢ ⁢ is ⁢ ⁢ true 0, otherwise },

wherein Tj, is a true allele contained in the DNA sample, {circumflex over (T)}j is the allele determined by the hybridization step,

and wjis a weight assigned to at least one of the genotype or the haplotype j.

29. The system of claim 28, wherein the weights are provided as follows:

wj=1∀j, wherein ∀j is a set of at least one of all known genotypes or all known haplotypes at one or more predetermined polymorphic genetic loci.

30. The system of claim 28, wherein the weights are provided as follows: wj is different for each genotype or haplotype.

31. A non-transitory computer-accessible medium having stored thereon computer executable instructions for at least one of genotyping or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample, wherein, when the executable instructions are executed by a processing arrangement, configure the processing arrangement to: wherein procedure (iii) produces a vector of n measurements, and wherein n is a number of probes contained on the one or more microarrays.

i) provide one or more microarrays that include a set of oligonucleotide probes that are capable of detecting the at least one of the genotypes, the haplotypes or the strain variant;

ii) hybridize the DNA sample to the one or more microarrays to create a hybridization pattern; and

iii) determine at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern,

32. The computer-accessible medium of claim 31, wherein the n potential probes provided to identify N known genotypes or haplotypes are each associated with a response vector {right arrow over (υ)}j ε{0,1}N, j=1,...,n.

33. The computer-accessible medium of claim 32, further comprising generating a graph G on vertices corresponding to probe response vectors, wherein the graph G is a complete edge-weighted and vertex-weighted undirected graph G=(V, E) provided on n vertices, wherein n is the number of potential probes.

34. The computer-accessible medium of claim 33, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.

35. The computer-accessible medium of claim 34, wherein at least one of (i) the weight w of a vertex v is set to: w(v)=min{fraction of 0's, fraction of 1's}, or (ii) the weight w of an edge e={u,v} is set to: w(e)=Hamming distance/vector length, wherein the Hamming distance is measured between the probe response vectors corresponding to vertices u and v, and wherein the vector length is the length of the probe response vectors N.

36. The computer-accessible medium of claim 33, further comprising modifying the graph G by thresholding the edges such that the modified graph Gmod is defined as Gmod=(V, Emod), wherein Emod={e ε E: w(e)≦ρ}, and ρ is a selected threshold value.

37. The computer-accessible medium of claim 36, wherein, for the modified graph Gmod and the probe set size M, when the executable instructions are executed, the processing arrangement is further configured to:

iv) initialize a current-best list of independent sets with associated information weights,

v) initialize vertex boosting weights to vertex weights w(v),

vi) defining a probability distribution on the vertex subset based on vertex boosting weights,

vii) choose a random subset of vertices of a specified size M based on the probability distribution,

viii) eliminate one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset,

ix) modify the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in procedure (vii) but eliminated in procedure (viii), and

x) repeat procedures (vi) through (ix) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.

38. The computer-accessible medium of claim 37, wherein, for the modified graph Gmod and the probe set size M, when executed, the processing arrangement is further configured to repeat procedures (v) through (x) for a predetermined number of iterations, and to start each iteration with a reinitialization of the vertex boosting weights to the vertex weights w(v) in procedure (v).

39. The computer-accessible medium of claim 38, wherein at least one of (i) for a given fixed small 0<ε<<1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧1)>1−ε, or (ii) for a given fixed small 0<ε<<1and a fixed α>1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧α)>1−ε.

40. The computer-accessible medium of claim 37, wherein the threshold p has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G, wherein the lower bound A is a relatively small constant, and wherein the upper bound B is a function of the number of vertices n.

41. A system for at least one of genotyping or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample, comprising:

a processing device that, when executed, is configured to: i) provide one or more microarrays that include a set of oligonucleotide probes that are capable of detecting the at least one of the genotypes, the haplotypes or the strain variant; ii) hybridize the DNA sample to the one or more microarrays to create a hybridization pattern; and iii) determine at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern,

wherein procedure (iii) produces a vector of n measurements, and wherein n is a number of probes contained on the one or more microarrays.

42. The system of claim 41, wherein the n potential probes provided to identify N known genotypes or haplotypes are each associated with a response vector {right arrow over (υ)}j ε{0,1}N, j=1,...,n.

43. The system of claim 42, further comprising generating a graph G on vertices corresponding to probe response vectors, wherein the graph G is a complete edge-weighted and vertex-weighted undirected graph G=(V, E) provided on n vertices, wherein n is the number of potential probes.

44. The system of claim 43, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.

45. The system of claim 44, wherein at least one of (i) the weight w of a vertex v is set to: w(v)=min{fraction of 0's, fraction of 1's}, or (ii) the weight w of an edge e={u,v} is set to: w(e)=Hamming distance/vector length, wherein the Hamming distance is measured between the probe response vectors corresponding to vertices u and v, and wherein the vector length is the length of the probe response vectors N.

46. The system of claim 43, further comprising modifying the graph G by thresholding the edges such that the modified graph Gmod is defined as Gmod=(V, Emod), wherein Emod={e ε E: w(e)≦ρ}, and ρ is a selected threshold value.

47. The system of claim 46, wherein, for the modified graph Gmod and the probe set size M, when executed, the processing device is further configured to:

iv) initialize a current-best list of independent sets with associated information weights,

v) initialize vertex boosting weights to vertex weights w(v),

vi) define a probability distribution on the vertex subset based on vertex boosting weights,

vii) choose a random subset of vertices of a specified size M based on the probability distribution,

viii) eliminate one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset,

ix) modify the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in procedure (vii) but eliminated in procedure (viii), and

x) repeat procedures (vi) through (ix) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.

48. The system of claim 47, wherein, for the modified graph Gmod and the probe set size M, when executed, the processing device is further configured to repeat procedures (v) through (x) for a predetermined number of iterations, and to start each iteration with a reinitialization of the vertex boosting weights to the vertex weights w(v) in procedure (v).

49. The computer-accessible medium of claim 48, wherein at least one of (i) for a given fixed small 0<ε<<1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧1)>1−ε, or (ii) for a given fixed small 0<ε<<1and a fixed α>1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧α)>1−ε.

50. The computer-accessible medium of claim 47, wherein the threshold p has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G, wherein the lower bound A is a relatively small constant, and wherein the upper bound B is a function of the number of vertices n.