Hybrid model for DNA probe design and validation using nonlinear and linear regression methods

Info

Publication number: 20070037201
Type: Application
Filed: Oct 13, 2006
Publication Date: Feb 15, 2007
Inventors: Nicholas Sampas (San Jose, CA), Peter Tsang (San Francisco, CA), Brian Giles (Fremont, CA)
Application Number: 11/580,583

Abstract

Methods and systems for selecting oligonucleotide probes for use in microarray applications are provided herein. The described methods use a combination of measured probe performance and predicted probe performance to select probes. Nucleic acid arrays containing probes selected by the described methods are described. Also included are algorithms for performing the subject methods recorded on computer-readable media and computational systems for analysis.

Description


CROSS REFERENCE TO RELATED APPLICATIONS

This Application is a continuation-in-part of, and claims priority to, U.S. patent application Ser. No. 10/996,323, filed Nov. 23, 2004.

BACKGROUND

Comparative genomic hybridization (CGH) and location analysis are important applications that allow scientists to make biological measurements involving genomics and cytogenetics and to study the expression and regulation of genes in biological systems. Both CGH and location analysis entail quantifying or measuring changes in copy number of genomic sequences in biological or medical samples. CGH is particularly important in developmental biology and in the study of the causes of cancer, and offers great potential in the diagnostics of cancer and developmental diseases. Recently, cDNA microarrays have been used for CGH studies. An oligo-array based approach has several substantial advantages over other technologies, in that it allows the designer to position the probes anywhere within the genomic or polynucleotide sequence of interest. The probes can be placed at any set of loci or positioned to span any genomic intervals of interest at whatever density is commensurate with the real estate or area available on the microarray (in terms of number of features). The copy numbers of DNA over the genomic regions of interest can be evaluated by analyzing the hybridization of target sequences to the surface-bound probes. The oligonucleotide probe approach also offers the flexibility of focusing in on regions within exons or introns of expressed sequences, including pre-microRNAs or intergenic regions and regulatory regions for location analysis, as well as any desirable admixture of the aforementioned.

Probes that work well on microarrays for gene expression generally do not work well for CGH arrays and are not appropriate for location analysis arrays. The overall performance of probes for CGH and location analysis arrays entails different optimization of their properties than probes utilized for gene expression. Most notably, these differences relate to the substantially greater complexity of the labeled target mixture for CGH and location analysis than for expression analysis, which demands greater specificity of the probes in discriminating against non-specific binding to competing targets. For comparison, the total number of nucleotide bases in the human transcriptome is approximately 10⁸, while the human genome contains over 3×10⁹ bases. Additionally, probes selected for gene expression come from within message sequences that are transcribed as RNA, i.e., exons, while probes for CGH need to be complementary, or nearly so, to contiguous targets selected from within a genome sequence, e.g., introns and/or exons.

Despite great interest in CGH technology, methods for evaluating probes in silico and also empirically for use in this technology are limited. A rigorous method would be to measure signals (e.g. ratios of signals) from each polynucleotide in controlled experiments with test samples containing known copy numbers for each probe sequence on the array. For example, a method used by several probe designers for measuring array performance for sets of polynucleotides specific for sequences on the X chromosome, is to use a series of cell lines with known variable copies of the X chromosome for CGH experiments. See, e.g., M. T. Barrett et al., Proc. Natl. Acad. Sci. USA 101(51): 17765-70 (2004). These cell lines (X series) are homogeneous and contain intact copies (e.g. 1 to 5) of the X chromosome permitting a rigorous measure of the relationship between copy number and signal intensities for each X chromosome specific polynucleotide on an array. However, cell lines containing known variable numbers of intact copies of most other chromosomes are not readily available. Furthermore, the aberrant X series cell lines are slow growing and can spontaneously vary in ploidy under standard culturing conditions. Such methods are complex and time-consuming and cannot readily be used to assay the relationship between the hybridization signal of polynucleotides on an array and the genomic copy number of sequences from each chromosome in a cell.

SUMMARY

This disclosure relates to methods for predicting probe performance for microarray applications. The methods described herein optimize probe performance by measuring the probe response in a model system and applying that response to predict the response for probes that have not yet been experimentally tested.

Methods for selecting an oligonucleotide probe with the best performance in a microarray application are provided herein. In an aspect, the methods include generating candidate probes and screening the probes with one or more metrics or parameters that can predict or classify probe performance. The resulting probe scores for each metric are combined using various statistical methods, and the probe with the best combined score is selected. In aspects, the methods described herein can be modified to obtain probes within a very narrow range of predicted properties for the probes.

Algorithms for performing the described methods recorded on computer-readable media, as well as computational analysis systems that include the same, are also provided. The disclosure also includes nucleic acid arrays with oligonucleotide probes whose performance is predicted using the subject methods, and methods using such arrays.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart generally depicting the methods described herein.

FIG. 2 is a flowchart showing a method to generate candidate probes whose performance is predicted by the methods described herein.

FIG. 3 is a flowchart depicting methods for calculating slope and combining measured slope with calculated parameters to predict probe performance.

FIG. 4 shows a distribution of measured slope against duplex melting temperature.

FIG. 5 shows a plot of smoothed measured slope against duplex melting temperature.

FIG. 6 shows a trend curve with a fitted curve of a 12th-order polynomial vs. duplex melting temperature.

FIG. 7 shows a graph of the fitted slopes vs. measured slopes for combined metrics.

FIG. 8 shows measured slope plotted against duplex melting temperature trend curves and various synthetic replacement curves.

FIG. 9 shows T_m distributions resulting from the use of combined synthetic and empirical scores.

DETAILED DESCRIPTION

Various embodiments will be described in detail with reference to the drawings, wherein like reference numerals represent like parts throughout the several views. Reference to various embodiments does not limit the scope of the claims attached hereto. Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the many possible embodiments for the claims.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Although any methods, devices and material similar or equivalent to those described herein can be used in practice or testing, the methods, devices and materials are now described.

All publications and patent applications in this specification are indicative of the level of ordinary skill in the art and are incorporated herein by reference in their entireties.

In this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural reference, unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.

Definitions

The term “genome” refers to all nucleic acid sequences (coding and non-coding) and elements present in or originating from a single cell, or from each cell type in an organism, or from a virus. The term “genome” encompasses all sources of genomic sequences or elements known to those of skill in the art. The term genome also applies to any naturally occurring or induced variation of these sequences that may be present in a mutant or disease variant of any virus or cell type. These sequences include, but are not limited to, those involved in the maintenance, replication, segregation, and higher order structures (e.g. folding and compaction of DNA in chromatin and chromosomes), or other functions, if any, of the nucleic acids as well as all the coding regions and their corresponding regulatory elements needed to produce and maintain each particle, cell or cell type in a given organism.

For example, the human genome consists of approximately 3×10⁹ base pairs of DNA organized into distinct chromosomes. The genome of a normal diploid somatic human cell consists of 22 pairs of autosomes (chromosomes 1 to 22) and either chromosomes X and Y (males) or a pair of X chromosomes (female) for a total of 46 chromosomes. A genome of a cancer cell may contain variable numbers of each chromosome in addition to deletions, rearrangements and amplification of any subchromosomal region or DNA sequence.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, usually up to about 10,000 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, or compounds produced synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions.

The terms “ribonucleic acid” and “RNA” as used herein mean a polymer composed of ribonucleotides.

The terms “deoxyribonucleic acid” and “DNA” as used herein mean a polymer composed of deoxyribonucleotides.

The term “oligonucleotide” as used herein denotes single stranded nucleotide multimers of from about 10 to 100 nucleotides and up to 200 nucleotides in length. Oligonucleotides are usually synthetic and, in many embodiments, are under 50 nucleotides in length.

The term “oligomer” is used herein to indicate a chemical entity that contains a plurality of nucleotide monomers, i.e., a nucleotide multimer. As used herein, the terms “oligomer” and “polymer” are used interchangeably, as it is generally, although not necessarily, smaller “polymers” that are prepared using the functionalized substrates of the invention, particularly in conjunction with combinatorial chemistry techniques. Examples of oligomers and polymers include polydeoxyribonucleotides (DNA), polyribonucleotides (RNA), other nucleic acids that are C-glycosides of a purine or pyrimidine base, polypeptides (proteins), polysaccharides (starches, or polysugars), and other chemical entities that contain repeating units of like chemical structure.

The term “sample” as used herein relates to a material or mixture of materials, typically, although not necessarily, in fluid form, containing one or more components of interest. Samples include, but are not limited to, biological samples obtained from natural biological sources, such as cells or tissue. The samples may also be derived from tissue biopsies and other clinical procedures.

The terms “nucleoside” and “nucleotide” are intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the terms “nucleoside” and “nucleotide” include those moieties that contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The phrase “surface-bound polynucleotide” refers to a polynucleotide that is immobilized on a surface of a solid substrate, where the substrate can have a variety of configurations, e.g., a sheet, bead, or other structure. In certain embodiments, the collections of oligonucleotide probe elements employed herein are present on a surface of the same planar support, e.g., in the form of an array.

The phrase “labeled population of nucleic acids” refers to a mixture of nucleic acids that are detectably labeled, e.g., fluorescently labeled, such that the presence of the nucleic acids can be detected by assessing the presence of the label. When a labeled population of nucleic acids is “made from” a chromosome sample, the chromosome sample is usually employed as a template for making the population of nucleic acids.

A “biological model system,” or “model system,” as provided herein, refers to a system for which a quantitative response in a microarray system can be expected with certainty (i.e. a system wherein a response can be detected or measured). Exemplary model systems include, without limitation, biological systems, such as titration series with different RNA samples at different concentrations, samples with known genomic aberrations, samples to be used for comparative genomic hybridization experiments, etc. The biological model systems are used to perform microarray experiments, to validate probes designed for microarray applications, to obtain sets of training data for statistical analysis, etc.

The term “array” encompasses the term “microarray” and refers to an ordered array presented for binding to nucleic acids and the like.

An “array,” includes any two-dimensional or substantially two-dimensional (as well as a three-dimensional) arrangement of spatially addressable regions bearing nucleic acids, particularly oligonucleotides or synthetic mimetics thereof, and the like. Where the arrays are arrays of nucleic acids, the nucleic acids may be adsorbed, physisorbed, chemisorbed, or covalently attached to the arrays at any point or points along the nucleic acid chain.

In those embodiments where an array includes two or more features immobilized on the same surface of a solid support, the array may be referred to as addressable. An array is “addressable” when it has multiple regions of different moieties (e.g., different oligonucleotide sequences) such that a region (i.e., a “feature” or “spot” of the array) at a particular predetermined location (i.e., an “address”) on the array will detect a particular sequence. Array features are typically, but need not be, separated by intervening spaces. In the case of an array in the context of the present application, the “population of labeled nucleic acids” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by “surface-bound polynucleotides” which are bound to the substrate at the various regions. These phrases are synonymous with the arbitrary terms “target” and “probe”, or “probe” and “target”, respectively, as they are used in other publications.

A “scan region” refers to a contiguous (preferably, rectangular) area in which the array spots or features of interest, as defined above, are found or detected. Where fluorescent labels are employed, the scan region is that portion of the total area illuminated from which the resulting fluorescence is detected and recorded. Where other detection protocols are employed, the scan region is that portion of the total area queried from which resulting signal is detected and recorded. For the purposes of this invention and with respect to fluorescent detection embodiments, the scan region includes the entire area of the slide scanned in each pass of the lens, between the first feature of interest, and the last feature of interest, even if there are intervening areas that lack features of interest.

The term “substrate” as used herein refers to a surface upon which marker molecules or probes, e.g., an array, may be adhered. Glass slides are the most common substrate for biochips, although fused silica, silicon, plastic, flexible web and other materials are also suitable.

An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location.

“Hybridizing” and “binding”, with respect to nucleic acids, are used interchangeably. The terms “hybridizing,” “hybridizing specifically to,” and “specific hybridization” as used herein, refer to the binding, duplexing, or hybridizing of a nucleic acid molecule preferentially to a particular nucleotide sequence under stringent conditions.

The term “stringent assay conditions” as used herein refers to conditions that are compatible to produce binding pairs of nucleic acids, e.g., probes and targets, of sufficient complementarity to provide for the desired level of specificity in the assay while being incompatible to the formation of binding pairs between binding members of insufficient complementarity to provide for the desired specificity. The term stringent assay conditions refers to the combination of hybridization and wash conditions.

“Stringent hybridization” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization (e.g., as in array, Southern or Northern hybridizations) are sequence dependent, and are different under different environmental parameters. Stringent hybridization conditions that can be used to identify nucleic acids within the scope of the invention can include, e.g., hybridization in a buffer comprising 50% formamide, 5×SSC, and 1% SDS at 42° C., or hybridization in a buffer comprising 5×SSC and 1% SDS at 65° C., both with a wash of 0.2×SSC and 0.1% SDS at 65° C. Exemplary stringent hybridization conditions can also include a hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at 37° C., and a wash in 1×SSC at 45° C. Alternatively, hybridization to filter-bound DNA in 0.5 M NaHPO₄, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at 65° C., and washing in 0.1×SSC/0.1% SDS at 68° C. can be employed. Yet additional stringent hybridization conditions include hybridization at 60° C. or higher and 3×SSC (450 mM sodium chloride/45 mM sodium citrate) or incubation at 42° C. in a solution containing 30% formamide, 1 M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of ordinary skill will readily recognize that alternative but comparable hybridization and wash conditions can be utilized to provide conditions of similar stringency.

In certain embodiments, the stringency of the wash conditions determines whether a nucleic acid is specifically hybridized to a probe. Wash conditions used to identify nucleic acids may include, e.g.: a salt concentration of about 0.02 M at pH 7 and a temperature of about 20° C. to about 40° C.; or, a salt concentration of about 0.15 M NaCl at 72° C. for about 15 minutes; or, a salt concentration of about 0.2×SSC at a temperature of about 30° C. to about 50° C. for about 2 to about 20 minutes; or, the hybridization complex is washed twice with a solution with a salt concentration of about 2×SSC containing 1% SDS at room temperature for 15 minutes and then washed twice by 0.1×SSC containing 0.1% SDS at 37° C. for 15 minutes; or, equivalent conditions. Stringent conditions for washing can also be, e.g., 0.2×SSC/0.1% SDS at 42° C. See Sambrook, Ausubel, or Tijssen (cited below) for detailed descriptions of equivalent hybridization and wash conditions and for reagents and buffers, e.g., SSC buffers and equivalent reagents and conditions.

A specific example of stringent assay conditions is rotating hybridization at 65° C. in a salt based hybridization buffer with a total monovalent cation concentration of 1.5 M (e.g., as described in U.S. patent application Ser. No. 09/655,482 filed on Sep. 5, 2000, the disclosure of which is herein incorporated by reference) followed by washes of 0.5×SSC and 0.1×SSC at room temperature.

Stringent hybridization conditions may also include a “prehybridization” of aqueous phase nucleic acids with complexity-reducing nucleic acids to suppress repetitive sequences. For example, certain stringent hybridization conditions include, prior to any hybridization to surface-bound polynucleotides, hybridization with Cot-1 DNA, or the like.

Stringent assay conditions are hybridization conditions that are at least as stringent as the above representative conditions, where a given set of conditions is considered to be at least as stringent if substantially no additional binding complexes that lack sufficient complementarity to provide for the desired specificity are produced in the given set of conditions as compared to the above specific conditions, where by “substantially no additional” is meant less than about 5-fold more, typically less than about 3-fold more. Other stringent hybridization conditions are known in the art and may also be employed, as appropriate.

The term “mixture”, as used herein, refers to a combination of elements that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution, or a number of different elements attached to a solid support at random or in no particular order in which the different elements are not spatially distinct. In other words, a mixture is not addressable. To be specific, an array of surface-bound polynucleotides, as is commonly known in the art and described below, is not a mixture of capture agents because the species of surface-bound polynucleotides are spatially distinct and the array is addressable.

“Isolated” or “purified” generally refers to isolation of a substance (compound, polynucleotide, protein, polypeptide, chromosome, etc.) such that the substance comprises the majority of the sample in which it resides. Typically in a sample a substantially purified component comprises 50%, preferably 80%-85%, more preferably 90-95% of the sample. Techniques for purifying polynucleotides, polypeptides and intact chromosomes of interest are well-known in the art and include, for example, ion-exchange chromatography, affinity chromatography, sorting, and sedimentation according to density.

The terms “assessing” and “evaluating” are used interchangeably to refer to any form of measurement, and include determining if an element is present or not. The terms “determining,” “measuring,” and “assessing,” and “assaying” are used interchangeably and include both quantitative and qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “using” has its conventional meaning, and, as such, means employing, e.g., putting into service, a method or composition to attain an end. For example, if a program is used to create a file, a program is executed to make a file, the file usually being the output of the program. In another example, if a computer file is used, it is usually accessed, read, and the information stored in the file employed to attain an end. Similarly if a unique identifier, e.g., a barcode is used, the unique identifier is usually read to identify, for example, an object or file associated with the unique identifier.

If a surface-bound polynucleotide “corresponds to” a chromosome, the polynucleotide usually contains a sequence of nucleic acids that is unique to that chromosome. Accordingly, a surface-bound polynucleotide that corresponds to a particular chromosome usually specifically hybridizes to a labeled nucleic acid made from that chromosome, relative to labeled nucleic acids made from other chromosomes. Array features, because they usually contain surface-bound polynucleotides, can also correspond to a chromosome.

A “non-cellular chromosome composition”, as will be discussed in greater detail below, is a composition of chromosomes synthesized by mixing pre-determined amounts of individual chromosomes. These synthetic compositions can include selected concentrations and ratios of chromosomes that do not naturally occur in a cell, including any cell grown in tissue culture. Non-cellular chromosome compositions may contain more than an entire complement of chromosomes from a cell, and, as such, may include extra copies of one or more chromosomes from that cell. Non-cellular chromosome compositions may also contain less than the entire complement of chromosomes from a cell.

A “probe” means a polynucleotide which can specifically hybridize to a target nucleotide, either in solution or as a surface-bound polynucleotide.

The term “validated probe” means a probe that has passed at least one screening or filtering process in which experimental data related to the performance of the probes was used as part of the selection criteria.

“In silico” means those parameters that can be determined without the need to perform any experiments, by using information either calculated de novo or available from public or private databases.

The term “duplex T_m” refers to the melting temperature of two oligonucleotides which have formed a duplex structure. Duplex T_m is calculated by a simple formula where each matching GC pair gets a value of 2, and each matching AT pair gets a value of 1. The sum of these approximate values gives the melting temperature.
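The simple duplex T_m formula in this definition can be illustrated with a minimal Python sketch. The function name `duplex_tm` and its `gc_value` and `at_value` parameters are illustrative assumptions, not part of the disclosure, and the sketch assumes the probe is perfectly complementary to its target, so that every base in the probe forms a matching pair.

```python
def duplex_tm(probe: str, gc_value: int = 2, at_value: int = 1) -> int:
    """Approximate duplex T_m by the simple sum described in the definition.

    Assumption: the probe is fully complementary to its target, so every
    G or C counts as a matching GC pair and every A or T as a matching AT pair.
    """
    total = 0
    for base in probe.upper():
        if base in "GC":
            total += gc_value   # each matching GC pair contributes 2 by default
        elif base in "AT":
            total += at_value   # each matching AT pair contributes 1 by default
        else:
            raise ValueError(f"unexpected base {base!r} in probe sequence")
    return total
```

Under these assumptions, `duplex_tm("GCAT")` sums 2 + 2 + 1 + 1. The per-pair values are exposed as parameters so that a different weighting can be substituted without changing the function.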

Approaches and Methods for Probe Selection

The present methods provide alternative and novel methods and systems for designing probes for CGH and location analysis in microarray applications that overcome the drawbacks of existing microarray probe selection techniques. General methods that utilize probe/target hybridization experiments and/or unique data analysis techniques to identify and select nucleotide probe(s) targeting polynucleotide fragments in a region of interest were described in U.S. Patent Publication No. 2006/0110744. The methods described herein provide statistical methods for combining and modifying probe scores in order to achieve desired results, such as selecting or designing probes with more robust probe performance, better probe signal, etc.

The present description provides methods, systems and computer readable media for identifying and selecting nucleic acid probes for detecting a target with a nucleic acid probe array or microarray. The methods comprise, in general terms: selecting genomic nucleotide ranges of interest; determining appropriate target sequences for CGH and/or location analysis; generating candidate probes specific for the target sequences; and analyzing candidate probes for specific probe properties by computational and/or experimental processes to optimize probe selection and reduce the number of probes to a value appropriate for placement on a microarray.

The description also provides microarrays comprising probes selected by the methods described herein. The microarrays comprise a solid support and a plurality of surface bound probes, the surface bound probes having very similar thermodynamic properties as well as similar GC content. More specifically, a large portion of the probes utilized in the microarrays of the invention have duplex melting temperatures (T_m) which are within a narrow temperature range compared to the T_m range of probes for other microarray systems, such as arrays for gene expression.

The methods provided herein are particularly useful with comparative genome hybridization microarrays, such as microarrays based on the human or mouse genome. These methods permit more cost-effective and efficient identification of gene regions or sections which can be associated with human disease, points of therapeutic intervention, and potential toxic side-effects of proposed therapeutic entities.

In general terms, the methods for probe selection and validation described herein comprise identifying probe properties that can be determined a priori by the probe's sequence and the sequence of the genome it is contained within, and may further comprise expanding the set of properties from those that can be determined a priori, to those that can be measured empirically through simple experiments, such as self-self experiments. The described methods may further comprise measuring the response of candidate probes to a known stimulus, where the stimulus is generated by a set of samples where the copy numbers for relatively small subsets of the genome are altered in known ways.

In designing an array comprising high-performance probes that comprehensively covers a whole genome (e.g. the human genome), the entire genomic sequence must be searched when generating specific candidate probes. This homology search is potentially the most time-consuming part of the probe design process. Ideally, a homology search would be the first part of the process; however, because of the scale of the human genome, executing an exhaustive search of all possible short oligo probes (<100 bases) can take computation time on the scale of a CPU year (based on ProbeSpec) on modern 3 GHz processors. This computation time can be reduced by any of a number of methods, most involving reducing the scale of the search. For example, known highly repetitive sequences can be removed by a process called RepeatMasking. Repeat-masked genomic sequences are publicly available on the web (e.g. UCSC's www.genomebrowser.org). Another approach is to reduce the number of probe sequences being searched up-front. This can be done on the basis of any known property of the probe, from thermodynamic properties, such as duplex T_m and hairpin free energy, to position on the genome. The present description provides methods which apply known probe information as a screening process to reduce the number of probe sequences to be analyzed in a homology search, thus reducing the computation time needed to identify appropriate probes for a CGH based array.
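The idea of screening candidates on inexpensive properties before the costly homology search can be sketched as follows. This is a minimal illustration in Python, not the disclosed implementation: the function names, the T_m window parameters, and the convention that repeat-masked bases appear as lowercase letters or `N` (soft-masked input) are all assumptions made for the example, and the T_m estimate reuses the simple sum given in the definition of duplex T_m.

```python
from typing import Iterable, List


def simple_tm(seq: str) -> int:
    """Simple duplex T_m sum: 2 per G/C base, 1 per A/T base (assumes full match)."""
    s = seq.upper()
    return 2 * (s.count("G") + s.count("C")) + (s.count("A") + s.count("T"))


def is_repeat_masked(seq: str) -> bool:
    """Assumption: input is soft-masked, so repeats are lowercase letters or N."""
    return any(ch.islower() or ch == "N" for ch in seq)


def prefilter(candidates: Iterable[str], tm_min: int, tm_max: int) -> List[str]:
    """Keep only candidates worth sending to the expensive homology search."""
    kept = []
    for seq in candidates:
        if is_repeat_masked(seq):
            continue  # drop probes that overlap masked repetitive sequence
        if not tm_min <= simple_tm(seq) <= tm_max:
            continue  # drop probes outside the desired T_m window
        kept.append(seq)
    return kept
```

Only the sequences returned by `prefilter` would then be passed to the homology search, so the cost of that search scales with the reduced candidate set rather than with every possible probe. Other properties named in the text, such as hairpin free energy or genomic position, could be added as further checks in the same loop.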

The present systems, techniques, methods and computer readable media also provide for streamlined workflow, since researchers need only to prepare and process one microarray instead of two or more per sample, with fewer steps in processing and tracking required.

Further, greater reproducibility of results is provided for, since all data for an entire genome is generated from a single microarray, resulting in less variability in the data. When two or more microarrays associated with the same sample are processed separately, there are always questions of variability of the experimental conditions used to process each microarray.

Designing a microarray involves determining the amount of “real estate” (number of probes) that is available for the final array. The array designer also determines the amount of probes or “real estate” to use for specified regulatory regions and intergenic regions, as well as the amount of probes necessary to adequately cover introns and exons of the chromosomes of interest. Initially, a designer will generate 20 to 40 million candidate probes and need to filter the probes for certain probe properties or parameters to obtain a final array with approximately 40,000 probes. Intermediate arrays are manufactured in some embodiments of the methods of the invention, which have a redundancy of 3 or 4 fold over the number of probes selected for the final array. These intermediate arrays are utilized to screen candidate probes for certain probe properties by direct or indirect experimentation.
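The scale of this filtering can be made concrete with a short calculation. The Python sketch below uses only the figures given in the paragraph above; the function names are illustrative assumptions.

```python
def intermediate_array_size(final_probes: int, redundancy: int) -> int:
    """Number of probes on an intermediate array with the given fold redundancy."""
    return final_probes * redundancy


def retained_fraction(final_probes: int, candidates: int) -> float:
    """Fraction of the initial candidate probes that reach the final array."""
    return final_probes / candidates
```

With a final array of 40,000 probes, a 3-fold or 4-fold redundancy corresponds to intermediate arrays of 120,000 or 160,000 probes, and starting from 20 to 40 million candidates means that only about 0.2% to 0.1% of candidates are retained.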

In many embodiments, the oligonucleotides (i.e. probes) contained in the features of the invention have been designed according to one or more particular parameters to be suitable for use in a given application, where representative parameters include, but are not limited to: length, melting temperature (T_m), non-homology with other regions of the genome, hybridization signal intensities, kinetic properties under hybridization conditions, etc., see e.g., U.S. Pat. No. 6,251,588, the disclosure of which is herein incorporated by reference.

Standard hybridization techniques (using high stringency hybridization conditions) are used to probe the subject array. Suitable methods are described in references describing CGH techniques (Kallioniemi et al., Science 258:818-821 (1992) and WO 93/18186). Several guides to general techniques are available, e.g., Tijssen, Hybridization with Nucleic Acid Probes, Parts I and II (Elsevier, Amsterdam 1993). For descriptions of techniques suitable for in situ hybridizations see Gall et al., Meth. Enzymol. 21:470-480 (1981) and Angerer et al. in Genetic Engineering: Principles and Methods (Setlow and Hollander, eds.), vol. 7, pp. 43-65 (Plenum Press, New York 1985). See also U.S. Pat. Nos. 6,335,167; 6,197,501; 5,830,645; and 5,665,549; the disclosures of which are incorporated herein by reference.

FIG. 1 shows a general description of the methods described herein. In an aspect, as in operation 100, a candidate oligonucleotide probe for a particular region of interest in a target nucleic acid sequence is generated. The candidate probe is then screened with one or more metrics or parameters that are predictive of probe performance, as in operation 102, which yields a probe score for each metric. The individual probe scores are then combined to produce a combined score for the probe in operation 103. The probe with the best score is then selected, as in operation 104, for a subsequent microarray application.

Methods for Selecting Oligonucleotide Probes

The methods described herein are directed to selection of oligonucleotide probes for use in microarray applications. Two or more candidate oligonucleotide probes are generated and analyzed using one or more metrics that are indicative of probe performance. An individual probe score is obtained with respect to each metric, and these probe scores are then combined into a single score for the probe. Probes with combined scores closest to an optimal score value are selected as ideal or best probes (i.e. those probes which are most suited to a particular microarray experiment, in terms of ability to hybridize to the target sequences, reproducibility, repeatability, etc). Probes may be scored on any numerical scale, with the best probes having scores closest to the high end of the numerical scale. For example, an optimal score value on a scale of 0.0 to 1.0 would be about 1.0. Similarly, on a numerical scale of probe scores from 50 to 100, an optimal score value would be about 100.

In embodiments, two or more candidate probes are generated by selecting one or more target sequences within a region of interest and tiling subsequences of the target across the entire region of interest to obtain a set of potential probes. In aspects, the subsequences of the target sequences are tiled in single base steps across the region of interest. This generates a large set of potential probes, which is reduced to a manageable number (such as greater than 2, but less than 1000, for example) by pairwise filtering.

Once the candidate probes are generated, they are analyzed using metrics that are indicative of probe performance, and each probe is assigned a probe score. These metrics include direct metrics, indirect metrics and in silico metrics. Direct metrics comprise the changes in probe response based on experimentally measured quantities, such as change in copy number. Indirect metrics used comprise changes in predicted probe response resulting from experimentally measured quantities for a target molecule, or changes in predicted probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes. The in silico metrics comprise changes in the probe response based on calculated quantities for a target molecule, or changes in probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes. To obtain a probe score from the application of one or more metrics, the slope for each candidate probe is calculated and plotted against the corresponding value for each metric to generate a trend curve. The trend curve is then fitted with a polynomial function to obtain the probe score for each metric. The order of the polynomial can range from 1 to 20.

Individual probe scores for each metric are then combined, by adding or averaging the individual scores, to give a combined probe score. The individual probe scores can also be fitted with a linear additive multivariate fitting function, or a linear multiplicative fitting function to give the combined score. In aspects, combined probe scores are obtained by combining the metrics in each category using a linear model to obtain intermediate scores. The intermediate scores are then multiplied together to give the combined score. Individual probe scores can also be combined by fitting the measured slope responses for a training data set with a change in copy number.
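By way of non-limiting illustration, the two-stage combination described above (a linear model within each category of metrics, followed by multiplication of the intermediate scores) might be sketched in Python as follows. The function name `combined_score`, the category labels and the weights are hypothetical and are chosen only for illustration; with the default equal weights, the linear model within a category reduces to a simple average.

```python
def combined_score(category_scores, weights=None):
    """Combine per-metric probe scores into a single score.

    category_scores: dict mapping a category name (e.g. "direct",
        "indirect", "in_silico"; hypothetical labels) to a list of
        per-metric scores on a 0.0 to 1.0 scale.
    weights: optional dict mapping the same category names to lists of
        linear weights; if omitted, each category is simply averaged.
    """
    combined = 1.0
    for category, scores in category_scores.items():
        if not scores:
            raise ValueError("category %r has no scores" % category)
        if weights is not None:
            w = weights[category]
        else:
            w = [1.0 / len(scores)] * len(scores)
        # Linear (weighted additive) model within the category
        intermediate = sum(wi * si for wi, si in zip(w, scores))
        # Intermediate scores are multiplied together across categories
        combined *= intermediate
    return combined


example = combined_score({"direct": [0.8, 1.0], "in_silico": [0.5]})
```

In this sketch a probe that scores poorly in any one category is pulled strongly toward zero, which is the practical effect of multiplying the intermediate scores rather than adding them.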

The combined scores can be synthetically modified to give probes with more robust predicted performance (i.e. a probe which more effectively mimics probe performance in an actual experiment). In aspects, the synthetic modification comprises generating a large candidate set of probes, and reducing the number of probes by pairwise reduction. A slope is calculated for each probe and plotted against the corresponding value for each metric to generate a trend curve. This trend curve is fitted to give a measured probe score, and the fitted trend curve is replaced using a synthetic curve and a predicted score is obtained. The predicted scores are combined with experimentally measured scores to give the combined score value for a particular probe. The probe (or probes) with a combined score value closest to the optimal score value is selected for microarray applications.

Probe selection is performed using a computational analysis system which comprises a computer-readable medium with a program that selects probes for microarray applications as in the methods described herein. The methods can be used to produce or fabricate a microarray comprising at least two probes selected according to the methods described herein.

Generating Candidate Probes

In an embodiment, a candidate oligonucleotide probe, or set of probes, for a particular region of interest in a target nucleic acid sequence is generated, as in operation 100, an expanded representation of which is shown in FIG. 2. Briefly, operation 100 begins with the selection or identification 200 of target nucleic acid sequences within a genome. The candidate probe or candidate set of probes is any probe or set of probes within (or capable of hybridizing to) the target sequence or genome including, without limitation, genes, exons, mRNA, a region of interest within the target sequence, probes used or selected for previous experiments, upstream or downstream regulatory regions of genes, methylated regions, regions associated with putative SNPs or CNPs, sequence aberrations known to be associated with particular disease states or phenotypes, histones or binding sites in the sequence for other molecules, etc. Potential target sequences of the nucleotide sample of interest are identified, filtered and reduced to a set of appropriate target sequences for CGH and/or location analysis. The potential target sequences are filtered by size, number of repeat-masked bases and/or GC-content. Target sequences are also filtered and reduced in number by eliminating repetitive target sequences. Another parameter which can be used to filter target sequence is to eliminate potential target sequences which comprise a restriction enzyme cut site. By limiting the size of the set of target sequences, the computational time needed to generate and analyze the candidate probes is decreased.

Generating a set of candidate probes comprises selecting subsequences of the selected target nucleic acid sequences across genomic regions of interest, as in operation 202. Probes are tiled in uniform or moderately uniform spacing, in steps as small as a single base, or as large as megabases, through the genome, targeted region of the genome, or target nucleic acid sequence. For example, probes may be tiled in steps of 50-100 bases across the entire genome, but the methods described herein are not dependent on the scale of the tiling. The smaller the scale of the tiling, the larger the number of potential candidate probes forming a plurality of candidate probes. The number of potential candidate probes over an interval should exceed the number expected to be selected over that same interval. Candidate probes are selected from the plurality of candidate probes based on parameters, for example, a narrow range of a specific parameter such as probe length. Probe parameters utilized to select candidate probes from a plurality of potential candidate probes may include, but are not limited to, target specificity, thermodynamic properties, expression and association with genes, homology and kinetic properties.

In some embodiments, the probe parameters include, but are not limited to, a range of T_M of about 0.25° C. to about 5° C., a T_M value of about 65° C. to about 85° C., a nucleotide length of 20 to 200 nucleotides, a range of % GC content of less than 10%, and/or a % GC content of about 30-40%. When the length of the probe is a criterion, probes have a nucleotide length of about 20 nucleotides to about 200 nucleotides, usually about 40 nucleotides to 100 nucleotides, and more usually 50 to 65 nucleotides.
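As a minimal sketch of such a parameter screen, the following Python example keeps a candidate only if its length, melting temperature and GC content fall inside the ranges recited above. The function names and default thresholds are illustrative assumptions, and the melting temperature is taken as a precomputed input because no particular T_M model is prescribed here.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a probe sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(seq, tm,
                   length_range=(20, 200),
                   tm_range=(65.0, 85.0),
                   gc_range=(30.0, 40.0)):
    """Return True if the probe satisfies all three range criteria.

    tm is a melting temperature in degrees C supplied by the caller
    (hypothetical input; any duplex T_M model could provide it).
    """
    return (length_range[0] <= len(seq) <= length_range[1]
            and tm_range[0] <= tm <= tm_range[1]
            and gc_range[0] <= gc_percent(seq) <= gc_range[1])


probe = "ATATATGCGC" * 3          # 30-mer with 40% GC content
keep = passes_filters(probe, tm=70.0)
```

Screening on inexpensive properties of this kind before the homology search is one way of reducing the number of sequences that must be searched.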

Typically, 30 to 60-mer candidate probes are selected, but the candidate probes may range from about 20-mer to about 200-mer. Typically, probes may be selected over spacings of approximately half the length of the probe. For example, for a 60-mer candidate probe, 30 bp intervals would be selected over the entire genome, or regions of interest. Usually, the repeat-masked regions are skipped, as they are usually insufficiently unique to be of use. Also, if the assay involves the use of a restriction digest, the restriction sites within the sequence for the restriction enzymes specified within the protocol are also typically excluded, or probes containing those sites are subsequently excluded from the candidate set.
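The tiling just described can be sketched in Python as follows. This is an illustration only: the function name `tile_candidates` is hypothetical, repeat-masked bases are assumed to be marked by lower-case letters or by N (the soft-masking and hard-masking conventions commonly used for repeat-masked genome files), and the restriction sites to exclude are supplied by the caller.

```python
def tile_candidates(target, probe_len=60, step=30, restriction_sites=()):
    """Tile fixed-length candidate probes across a target sequence.

    Returns a list of (start, probe) tuples. Windows that overlap a
    repeat-masked base (lower case or N; an assumed convention) or that
    contain any of the given restriction sites are skipped.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    candidates = []
    for start in range(0, len(target) - probe_len + 1, step):
        probe = target[start:start + probe_len]
        if any(base.islower() or base == "N" for base in probe):
            continue  # overlaps a repeat-masked region
        if any(site in probe for site in restriction_sites):
            continue  # contains a restriction enzyme cut site
        candidates.append((start, probe))
    return candidates


# Short toy sequence: an 8-mer tiled at half its length (4 bases)
toy = "ACGTACGT" + "acgt" + "ACGTACGT"
tiles = tile_candidates(toy, probe_len=8, step=4)
```

With the defaults, a 60-mer is tiled at 30 bp intervals, matching the half-length spacing given in the example above.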

The large number of candidate probes generated by this process is then reduced to a smaller set of candidate probes using a reduction method, such as the pairwise reduction method, for example, as shown in operation 204. The pairwise reduction method evaluates a pair of candidate probes for a probe property and scores the probes within the pair against each other according to the probe property analyzed. The pairwise reduction process reduces the number of probes by a factor of X, where X may be any number that significantly reduces the number of probes. For example, the pairwise reduction process may reduce the number of probes by a factor of 5, 10, 15, 20, 25, 30 and so on. The number of candidate probes can also be reduced by any other method or algorithm that uses the position of the probe and a combined or overall score to discriminate between probes.
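One simple way to picture a pairwise reduction is sketched below in Python. It is an assumption-laden illustration rather than the procedure of the incorporated references: neighboring probes are paired in positional order, the better-scoring probe of each pair is kept, and each round therefore roughly halves the candidate set. The function name `pairwise_reduce` and the scoring callable are hypothetical.

```python
def pairwise_reduce(probes, score, rounds=1):
    """Reduce a position-ordered list of probes by pairwise comparison.

    probes: list of candidate probes ordered by genomic position.
    score:  callable returning a numeric score for a probe
            (higher is better; hypothetical scoring function).
    rounds: number of halving rounds; k rounds give a reduction
            factor of roughly 2**k.
    """
    for _ in range(rounds):
        kept = []
        for i in range(0, len(probes), 2):
            pair = probes[i:i + 2]            # last pair may hold one probe
            kept.append(max(pair, key=score))  # keep the better of the pair
        probes = kept
    return probes


candidates = [("p1", 0.2), ("p2", 0.9), ("p3", 0.5), ("p4", 0.4), ("p5", 0.7)]
reduced = pairwise_reduce(candidates, score=lambda p: p[1], rounds=2)
```

Because probes are compared only with their positional neighbors, a reduction of this kind tends to preserve coverage along the region of interest while discarding the weaker member of each pair.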

Following the reduction process, the candidate probes are optionally experimentally validated, as in 206. The experimental validation process involves experiments which measure the properties of a probe that provide a good indication of the probe's performance (i.e., suitability) in a microarray experiment, in the absence of direct experiments or data. Experimentally measurable probe properties include, without limitation, raw signal intensity, reproducibility of signal intensity, dye bias, susceptibility to non-specific binding, etc. The processes for probe selection, pairwise reduction and experimental validation are described in detail in U.S. Patent Publication No. 2006/0110744 and WO 2004/059845, the disclosures of which are incorporated herein by reference.

Once at least two candidate probes are selected, the candidate probes are analyzed with one or more metrics that predict or indicate probe performance to generate a probe score for each metric.

Analyzing Candidate Probes with Metrics

As shown in FIG. 1, embodiments of the methods described herein include a process 102 for analyzing candidate probes with one or more probe performance parameters or metrics. The term “parameter” or “metric” refers to a quantity or property that is indicative of a probe's performance in a microarray experiment. Three types of metrics are used with the methods described herein: direct metrics, indirect metrics, and in silico metrics.

Direct metrics are those that directly measure probe performance. Direct metrics measure performance by observing the change in probe response, as measured by the signal or ratio of signals, or log of the ratio of signals, with respect to a reference sample, with a change in copy number of the target nucleic acid sequence or region of interest, using multiple hybridization experiments on multiple arrays, with the conditions maintained as similar as possible between arrays. The change in probe response is measured as a change in signal in a differential model system, such as a dye-swap or dye-flip experiment, for example, where DNA copy number is changed in known or predictable ways. For example, in an experiment to evaluate the performance of probes on the X chromosome using a normal pair of female-male samples, the probe is expected to produce a 2:1 signal ratio, as there are twice as many X chromosome target molecules in the female sample as in the male sample, and there are no Y chromosome target molecules in the female sample. Similarly, other differential model systems such as cell lines with well-known chromosomal aberrations, extra or missing chromosomes or regions of chromosomes, will also produce copy number changes in a predictable manner. Biological model systems that do not exist in nature, but are created using cell sorting techniques, or by mixing collections of BACs, cDNAs or other biologically-derived DNA samples can also be used for measuring probe performance.

Indirect metrics are measured parameters or metrics that are indicative of probe performance. Indirect metrics comprise observing the change in probe response in relatively simple experiments using a non-differential model. Indirect metrics (or indirect empirical parameters) include signal strength (in one or both channels from which signal is measured in a microarray experiment), dye bias (the LogRatio associated with a dye label rather than the LogRatio associated with copy number), differential signals obtained from experiments under various conditions (multiple annealing times for probes to target nucleic acid sequences, wash times, wash temperatures, etc.), for example. For dye bias measurements, the LogRatios for dye-flip experiments are averaged, rather than subtracted as they would be to calculate the effective LogRatios for copy number changes. Experiments using indirect metrics are considered non-differential, because, for most of the genome, the changes in probe response do not reflect changes in copy number (i.e. no change in copy number is expected, between the sample and a reference sequence). Rather, indirect metrics predict the performance of a probe in terms of sensitivity and specificity. For example, a measured signal that is too strong could represent cross-hybridization of the probes to multiple regions of the genome. On the other hand, a measured signal that is too weak is indicative of noise, or susceptible to changes in the condition or the quantity of DNA.

In silico metrics are calculated parameters that are indicative of probe performance. In silico metrics are those metrics that are calculated in the absence of any experimental data. These metrics are derived from the sequence of the probes themselves, and from the sequences of the genome, or the transcriptome of the organism being studied. In silico metrics for each candidate probe are obtained from the sequences directly, based on known laws of physics and chemistry, such as those related to thermodynamics. In silico metrics used in the methods described herein include, without limitation, duplex melting temperature (T_m or DuplexTm) between a probe and its complementary sequence, maximal subsequence duplex melting temperature of a probe (MaxSubSeqTm; the maximal T_m for any subsequence of length M within a longer sequence of length N), hairpin thermodynamic properties of the probe (i.e., hairpin melting temperature, Gibbs free energy, number of bases within turns, loops, stems, etc.), and sequence complexity (where complexity refers to the number of bases in the probe that are contained within short simple repeats, such as homopolymers, dimers, trimers, tetramers, etc., for example). For example, with the methods described herein, complexity typically refers to the number of bases contained within repeat units with six nucleotides, i.e. hexamers, but the methods described herein can generally be employed with repeats with any number of nucleotides.

The direct, indirect and in silico metrics are described in detail in U.S. Patent Publication No. 2006/0110744, the disclosure of which is incorporated herein by reference. The analytical process involves calculating a slope, or the responsiveness of a probe to a change in copy number of its complementary target sequence, for each candidate probe, as in 300, based on the response of the probe in an experiment with respect to a particular metric. The slope for each of a set of probes can be measured using a model system where the relative copy numbers of the target molecules for each probe in the set is known in each sample. The measured slope is calculated for each probe within the set, for example, X-chromosome probes, in the case of male and female samples. The slope can be estimated most simply by calculating the ratio of the signals in two samples with two different copy numbers of targets. It can also be the ratio of log-signals to the log-copy numbers, or the ratio of log ratios of signals. In a more complex system, a number of samples can be hybridized, where each pair of samples has a different set of copy numbers for each respective set of probes. For example, in a male-female model system, some sample pairs can be male referenced to female, others can be female referenced to male, and still others can be male referenced to male, or female referenced to female. This provides multiple data points for each probe. The slope for a two-color assay is then calculated by means of a linear regression of the ratios (of signals) for each probe as a function of the ratios of known target copy numbers in each sample. The y-intercept provided by such regression is also useful, as it provides the dye-bias. By analogy, in a single-color assay, the regression is between the measured signals and the known copy number.
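The regression described above for a two-color assay can be sketched with an ordinary least-squares line fit. The Python example below is a hedged illustration: it works in log2 space (one of the alternatives mentioned above), uses a hypothetical function name `fit_line`, and the measured values are invented purely to show the arithmetic. The slope is the probe's responsiveness and the y-intercept is its dye bias.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired data points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all copy-number ratios are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Expected log2 copy-number ratios for one X-chromosome probe:
# female/male = +1, male/female = -1, male/male = 0, female/female = 0
expected = [1.0, -1.0, 0.0, 0.0]
# Illustrative (made-up) measured log2 signal ratios for the same probe
measured = [0.9, -0.7, 0.1, 0.1]
slope, dye_bias = fit_line(expected, measured)   # about 0.8 and 0.1
```

A probe with ideal response would give a slope near 1 and an intercept near 0 in this representation.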

In embodiments, the slope is calculated (as in 300) from the performance of a probe analyzed using a direct metric, by observing the change in probe response, as measured by the signal or ratio of signals, or log of the ratio of signals, with respect to a reference sample, with a change in copy number of the target nucleic acid sequence or region of interest, using multiple hybridization experiments on multiple arrays, with the conditions maintained as similar as possible between arrays. The change in probe response is measured as a change in signal in a differential model system, such as a dye-swap or dye-flip experiment, for example, where DNA copy number is changed in known or predictable ways. For example, in an experiment to evaluate the performance of probes on the X chromosome using a normal pair of female-male samples, the probe is expected to produce a 2:1 signal ratio, as there are twice as many X chromosome target molecules in the female sample as in the male sample, and there are no Y chromosome target molecules in the female sample. Similarly, other differential model systems such as cell lines with well-known chromosomal aberrations, extra or missing chromosomes or regions of chromosomes, will also produce copy number changes in a predictable manner. Biological model systems that do not exist in nature, but are created using cell sorting techniques, or by mixing collections of BACs, cDNAs or other biologically-derived DNA samples can also be used for measuring probe performance.

In embodiments, using a direct metric, the change in signal is measured with respect to measured quantities such as LogRatio (i.e. the log of the ratio of red to green channels), LogIntensity (the log of the product of the red and green channel intensities), and dye bias (the average of the LogRatios for a dye-swap pair), for example. For the most robust probe performance, the change in probe response reflects the specific LogRatio change associated with changes in copy number in dye-flip experiments, as measured by subtracting LogRatios.
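The distinction between subtracting and averaging the LogRatios of a dye-flip pair can be illustrated with a short Python sketch. The function names, the use of base-2 logarithms and the division of the difference by two are assumptions made for illustration; the underlying idea is that swapping the dyes reverses the sign of the copy-number contribution but not the sign of the dye-related contribution.

```python
import math


def log_ratio(red, green):
    """Log (base 2, by assumption) of the red to green signal ratio."""
    return math.log2(red / green)


def dye_swap_summary(lr_forward, lr_flipped):
    """Separate copy-number response from dye bias for a dye-flip pair.

    lr_forward: LogRatio with the original dye assignment.
    lr_flipped: LogRatio of the same sample pair with the dyes swapped.
    """
    # Subtracting cancels the dye bias and keeps the copy-number signal
    copy_number_lr = (lr_forward - lr_flipped) / 2.0
    # Averaging cancels the copy-number signal and keeps the dye bias
    dye_bias = (lr_forward + lr_flipped) / 2.0
    return copy_number_lr, dye_bias


# Illustrative values: true LogRatio of 1.0 plus a dye bias of 0.1
signal, bias = dye_swap_summary(1.1, -0.9)   # about (1.0, 0.1)
```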

In embodiments, the slope for a probe is calculated, as in 300, based on the performance of a probe analyzed using an indirect metric, by observing the change in probe response in relatively simple experiments using a non-differential model. Indirect metrics (or indirect empirical parameters) include signal strength (in one or both channels from which signal is measured in a microarray experiment), dye bias (the LogRatio associated with a dye label rather than the LogRatio associated with copy number), differential signals obtained from experiments under various conditions (multiple annealing times for probes to target nucleic acid sequences, wash times, wash temperatures, etc.), for example. For dye bias measurements, the LogRatios for dye-flip experiments are averaged, rather than subtracted as they would be to calculate the effective LogRatios for copy number changes. Experiments using indirect metrics are considered non-differential, because, for most of the genome, the changes in probe response do not reflect changes in copy number (i.e. no change in copy number is expected, between the sample and a reference sequence). Rather, indirect metrics predict the performance of a probe in terms of sensitivity and specificity. For example, a measured signal that is too strong could represent cross-hybridization of the probes to multiple regions of the genome. On the other hand, a measured signal that is too weak is indicative of noise, or susceptible to changes in the condition or the quantity of DNA.

In embodiments, the slope calculation in operation 300 is based on the performance of a probe analyzed using in silico parameters or metrics. In silico metrics are those metrics that are calculated in the absence of any experimental data. These metrics are derived from the sequence of the probes themselves, and from the sequences of the genome, or the transcriptome of the organism being studied. In silico metrics for each candidate probe are obtained from the sequences directly, based on known laws of physics and chemistry, such as those related to thermodynamics. In silico metrics used in the methods described herein include, without limitation, duplex melting temperature (T_m or DuplexTm) between a probe and its complementary sequence, maximal subsequence duplex melting temperature of a probe (MaxSubSeqTm; the maximal T_m for any subsequence of length M within a longer sequence of length N), hairpin thermodynamic properties of the probe (i.e., hairpin melting temperature, Gibbs free energy, number of bases within turns, loops, stems, etc.), and sequence complexity (where complexity refers to the number of bases in the probe that are contained within short simple repeats, such as homopolymers, dimers, trimers, tetramers, etc., for example). For example, with the methods described herein, complexity typically refers to the number of bases contained within repeat units with six nucleotides, i.e. hexamers, but the methods described herein can generally be employed with repeats with any number of nucleotides.

In other embodiments, in silico parameters or metrics associated with the homology of a probe are used. These metrics include, without limitation, homology score (i.e. the distance to the nearest hit, not including the first target sequence, within the target sequence of interest or genome), homology signal-to-background, expressed on a log scale (HomLogS2B, described in U.S. Patent Publication No. 2006/0110744), and predicted homology response (S_Hom). The predicted homology response is similar to the HomLogS2B, but instead of predicting the signal-to-background, this score predicts the slope response of a probe based on homology calculations alone, under the assumption that thermodynamic and other properties of the probe are ideal. The predicted homology score is defined by Equation 1:
$$S_{Hom} = \frac{\sum_{j=1}^{\mathrm{TargetSeq}} P(mm_{j})}{\sum_{i=1}^{\mathrm{Genome}} P(mm_{i})} \qquad (1)$$
where P(mm_j) is a penalty term representing the signal contribution (under the specified hybridization conditions) for the hybridization of the probe of interest to each sufficiently complementary mismatch sequence within a specified target sequence or genome. The summation in the denominator in Equation 1 is over all the sequences in the genome, or within the complex set of sequences expected to be in a sample or set of samples. The numerator in Equation 1 represents the target sequence of interest. In the most specific case, the target sequence refers to the small specific sequence for which the probe is being designed (i.e. within a particular locus within a narrow region of a specific chromosome or region of interest in the genome for which the probe is designed).

In the specific case, the equation can be simplified as shown in Equation 2:
$$S_{Hom} = \frac{1}{\sum_{i=1}^{\mathrm{Genome}} P(mm_{i})} \qquad (2)$$
The function P(mm_j) can be calculated using a model for the hybridization between two oligonucleotide sequences using nearest neighbor models. The term is dependent on the number of mismatches, the distribution of mismatches through the aligned sequences, the specific mismatched bases, and the length of the overlap. Although all possible sequences within the target nucleic acid sequence or genome should be considered, in practice, only those sequences that are homologous enough to the probe sequence are considered. For example, with 60-mer probes, all subsequences in the genome that align with fewer than about 20 bases are considered.

This model can be further simplified, by approximating the homology slope response by using the distances or number of mismatches between the probe and the nearest hit (i.e. closest in homology) sequence, as shown in Equation 3:
$$S_{Hom} = \frac{\sum_{d=0}^{D} P_{d} M_{d}}{\sum_{d=0}^{D} P_{d} N_{d}} \qquad (3)$$
where N_d represents the total number of hits at a distance d, where d is defined as the number of single-base differences between the probe of interest and the target nucleic acid sequence or region of interest in the genome, and D is the maximum distance that needs to be considered. The denominator represents the signal contributions of all probes in the complex set of sequences, including the target sequence. The numerator represents either the target for the probe sequence, or, if a model system is being used, the region of the model system sequence that is being varied. For example, if the model system is a whole chromosome, then M_d represents all the hits within the chromosome at a distance d from the probe of interest. P_d is the signal penalty for each mismatch at a distance d. A perfect match has P_d=1, and the value of P_d decreases towards zero as the number of mismatches increases (i.e. as the system becomes more destabilized). This is an approximation based on the assumption that the average signal reduction across a large number of mismatches is a good representation for any single mismatch. That is, each mismatched base (or insertion or deletion) can be assigned a constant penalty P, giving Equation 4 as the relationship between a single-base penalty and distance:
P_d≈P^d (4)
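Equations 3 and 4 together reduce the predicted homology response to a ratio of two penalty-weighted hit counts. A minimal Python sketch is given below; the function name `homology_slope` and the single-base penalty of 0.5 are illustrative assumptions, not values taken from the described methods.

```python
def homology_slope(target_hits, genome_hits, penalty=0.5):
    """Predicted homology response S_Hom per Equations 3 and 4.

    target_hits[d]: M_d, number of hits at distance d within the target
                    (or the varied region of a model system).
    genome_hits[d]: N_d, number of hits at distance d in the whole
                    complex set of sequences, including the target.
    penalty:        single-base penalty P, so that P_d is taken as P**d
                    (illustrative value; not specified in the text).
    """
    numerator = sum(penalty ** d * m for d, m in enumerate(target_hits))
    denominator = sum(penalty ** d * n for d, n in enumerate(genome_hits))
    if denominator == 0:
        raise ValueError("no hits in the genome, including the target")
    return numerator / denominator


# A probe with one perfect match (its target) and four hits at distance 2:
# S_Hom = 1 / (1 + 4 * 0.5**2) = 0.5
s_hom = homology_slope(target_hits=[1], genome_hits=[1, 0, 4])
```

A probe whose only hit is its own perfect-match target gives S_Hom = 1, and each additional near-match elsewhere in the genome lowers the predicted response.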

In still other embodiments, in silico parameters or metrics that combine homology with thermodynamic properties may be used. For example, maxTemp, defined as the duplex melting temperature (or T_m) between the probe and the longest contiguous match within each homologous sequence in the background genome, can be used as an in silico metric for probe performance. In other embodiments, the melting temperature of the closest mismatch to the probe sequence in the genome (MMClosestDuplexTm) as calculated from the nearest neighbor model can also be used to predict probe performance.

The methods described herein for selecting an oligonucleotide probe for a microarray application include a step for screening candidate probes against probe performance metrics, as indicated in FIG. 1, at operation 102. This operation is further depicted in FIG. 3. The screening process begins with the calculation or determination of a slope for each candidate probe, based on each metric, as indicated in operation 300. Using the X chromosome as a model system to characterize probe performance, empirical measurements of signal changes or LogRatio changes are made. From these empirical results, the slope for each probe is calculated. The slope is defined in either linear or logarithmic space as the ratio of a measured signal or LogRatio to a known or deliberate change in copy number. For probes with measurements at multiple distinct copy numbers, the slope is calculated from the signals or ratios on the y-axis and the known or expected copy number (or fold-change) on the x-axis. For example, where there are only two copy number values, the slope is the difference between the two y-axis values divided by the difference between the two x-axis values. For probes with ideal response/performance, the slope approaches 1.00. For data points at more than two copy numbers, the slope for each probe is calculated from the best-fit line for a plot of the signal, signal ratios or LogRatios. In embodiments, the slope for data generated from more than two copy numbers is analyzed using statistical methods that eliminate outliers, such as a fitting method that weights data points by variance, for example.

The calculated slope is then plotted against each metric to give a trend curve, as in 302, which can be used to determine the relationship between a given metric and the performance of the probe. The trend curve is then smoothed or fitted with an appropriate theoretical function, such as a set of polynomials with order as high as 20, as in 304, in order to determine the effect variables have on the slope for a given metric. Any set of orthonormal basis functions, as known to those of skill in the art, can be used for the fit. The smoothed or fitted trend curve can then be used to generate a probe score, with each probe being assigned a score, as in 306. The probe scores are assigned based on an arbitrary numerical scale. A probe score at or near the highest end of the scale indicates optimal or best probe performance. For example, the probe scores could lie between 0 and 1 on a numerical scale, and probes are selected if the probe score is closer to 1.0 (i.e. a score closest to 1 implies ideal or best probe performance in a given microarray experiment). Similarly, a numerical scale from 50 to 100 could be used, where probes with scores closest to 100 are selected. In other words, the probe with the best or optimal score is selected depending on the scale employed. Any number of scales, with any variation of numerical ranges, can be employed.
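A trend-curve fit of this kind can be sketched in Python with a plain least-squares polynomial fit. The example below is illustrative only: the function names are hypothetical, the fit uses the normal equations in a monomial basis rather than an orthonormal basis, and the mapping from fitted slope to probe score (clamping the fitted slope to a 0 to 1 scale) is an assumed convention rather than one stated above.

```python
def polyfit(x, y, order):
    """Least-squares polynomial fit; returns c with y ~ sum(c[i] * x**i)."""
    n = order + 1
    # Normal equations A c = b, A[j][k] = sum(x**(j+k)), b[j] = sum(y * x**j)
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted trend curve at a metric value x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def probe_score(coeffs, metric_value):
    """Score a probe on a 0 to 1 scale from the fitted trend curve.

    Clamping the predicted slope to [0, 1] is an assumed convention.
    """
    return min(1.0, max(0.0, evaluate(coeffs, metric_value)))


# Illustrative data: measured slopes versus a (rescaled) metric value
metric = [0.0, 1.0, 2.0, 3.0, 4.0]
slopes = [0.20, 0.55, 0.80, 0.55, 0.20]
trend = polyfit(metric, slopes, order=2)
score = probe_score(trend, 2.0)
```

The normal equations become numerically ill-conditioned as the polynomial order grows, so for orders approaching 20 a fit in an orthogonal polynomial basis, with the metric values rescaled to a small interval, would generally be preferable to this monomial sketch.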

Generating Trend Curves for Measured and Predicted Slope

In embodiments, in order to characterize the relationship between the performance of a probe and the metric used to gauge that performance, the slope points calculated for each probe based on empirical data are plotted against the corresponding values for each metric, as indicated in FIG. 3, at operation 302. For example, using the differential model system of chromosome X and female-male pairs, data is obtained where the target copy numbers are changed by a predictable ratio (i.e. 2:1). When the measured slope for a set of probes is plotted against a given metric, a distribution plot is obtained. For example, FIG. 4 shows a distribution plot of the calculated slope from different arrays against duplex melting temperature (or DuplexT_m), an in silico metric. Useful information with regard to the relationship between probe performance and DuplexT_m exists if the distribution contains discrete data points (i.e. data points that do not cluster in a round and fuzzy manner).

The probe with the best or optimal score is assumed to be a “good” probe, i.e. one that is particularly suitable for use in a specific microarray experiment. For example, although not limited to this aspect, the best probe selected according to the present methods may be the one that hybridizes most strongly to the target sequences. The actual underlying relationship (between probes and scores for each metric) can be extracted from these distributions by generating a trend curve. Trend curves can be obtained from the slope data for each metric by a number of methods, including, without limitation, polynomial fits, cubic-spline fits, Fourier transforms, inverse transforms, smooth functional curves (for example, exponentials, arctangents, etc.), Boltzmann distribution curves, etc. Any curve that approximately follows the trend of the data is useful. In embodiments, a straight line fit is appropriate. In other embodiments, the data can also be smoothed and fitted using methods like moving averages, moving medians, LOWESS, LOESS, etc., for example.

An example of a trend curve used in the methods described herein is shown in FIG. 5 (for DuplexT_m vs. measured slope). Each point in the trend curve represents the median value for data sorted by rank on the x-axis and then smoothed in 1% bins using a non-linear polynomial fitting method (i.e., the range of data is split into equal-sized bins, each bin containing about 1% of the data). From the trend curve, it is possible to see a relationship between a given metric and the performance of the probe. For example, the trend curve in FIG. 5 indicates that probe performance is best for probes selected on the basis of DuplexT_m close to 80° C.
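The binning step can be illustrated with a short sketch. It covers only the rank-sorting and per-bin medians; the subsequent polynomial smoothing is not shown. The function name, the `bin_fraction` parameter, and the synthetic input are assumptions for illustration.

```python
from statistics import median


def binned_median_trend(metric_values, slopes, bin_fraction=0.01):
    """Return one (median metric, median slope) point per bin.

    Probes are sorted by metric value and split into consecutive bins
    that each hold about bin_fraction of the data (1% by default).
    """
    pairs = sorted(zip(metric_values, slopes))
    bin_size = max(1, round(len(pairs) * bin_fraction))
    trend = []
    for start in range(0, len(pairs), bin_size):
        chunk = pairs[start:start + bin_size]
        trend.append((median(x for x, _ in chunk),
                      median(y for _, y in chunk)))
    return trend


# Hypothetical data: 200 probes whose slope rises linearly with the metric.
metric = list(range(200))
slope = [v / 200 for v in metric]
trend = binned_median_trend(metric, slope)
print(len(trend), trend[0])
```

Medians are used within each bin because they are less sensitive to outlying probes than means.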

In embodiments, it is useful to make the trend curves for a given metric more pronounced, to determine the independent effect that a variable may have on the calculated slope. The response for a given metric can be improved by filtering out a set of values (for a second metric) that are not viable for good probes, or by tuning in on a narrow range where selected probes are expected to be found. For example, if most of the selected probes are expected to occur within a narrow range of DuplexT_m, then a trend curve can be generated by selecting probes within that narrow range for a particular metric. In embodiments, the trend curves are fitted with polynomials as high as 20th order, as in operation 304 in FIG. 3. An example of such a fitted slope for DuplexT_m is shown in FIG. 6.
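A polynomial fit of a trend curve might look like the following sketch, which uses NumPy's `Polynomial.fit` (it rescales the x-range internally, which helps at higher orders). The function names, the default order of 4, the synthetic trend data, and the clipping of scores to the range 0 to 1 are assumptions, not details from this description.

```python
import numpy as np
from numpy.polynomial import Polynomial


def fit_trend_curve(metric_values, trend_slopes, order=4):
    """Least-squares polynomial fit of slope against a metric."""
    return Polynomial.fit(metric_values, trend_slopes, deg=order)


def probe_score(fitted_curve, metric_value):
    """Score a probe by evaluating the fitted curve at its metric value.

    Clipping to [0, 1] is an assumed convention for a 0-to-1 scale.
    """
    return float(np.clip(fitted_curve(metric_value), 0.0, 1.0))


# Hypothetical trend curve that peaks at a DuplexT_m of 80 deg C.
tm = np.linspace(60.0, 100.0, 41)
slopes = 1.0 - ((tm - 80.0) / 40.0) ** 2
curve = fit_trend_curve(tm, slopes, order=2)
print(probe_score(curve, 80.0), probe_score(curve, 60.0))
```

The order is a tuning choice: a higher order follows the trend more closely but can oscillate where data are sparse.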

Statistical Methods for Combining Probe Scores

In embodiments, the trend curves are used to generate individual probe scores that are then combined to give a combined probe score assigned to each candidate probe, as indicated in operation 306 of FIG. 3. The combined or common probe score varies from approximately zero to approximately 1, with a score closer to 1 implying ideal probe performance. In the simplest form, individual probe scores for each metric (S_m(p_i)) are combined into a single combined score (S_c(p_i), for each probe p_i) by adding or averaging the scores for each metric, according to Equation 5: $\begin{matrix} S_{c}(p_{i}) = \sum_{m} S_{m}(p_{i}) & (5) \end{matrix}$
As long as the score for a given metric increases (or decreases) in the same direction, combining the individual probe scores by adding or averaging is sufficient to provide consistent improvements in probe performance with increasing values of the combined score. Once a probe score has been assigned to each probe (or subset of probes), it is straightforward to select probes with the highest score within a window of interest (i.e. a region of interest in the target nucleic acid sequence or genome). In embodiments, probes can also be selected via pairwise filtering/pairwise elimination, a process for reducing the number of probes in a large set, described in detail in U.S. Patent Publication No. 2006/0110744, which is incorporated by reference herein.
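The additive combination of Equation 5 and the selection of the highest-scoring probe within a window can be sketched as below. The probe records, the field names (`position`, `scores`), and the metric names are hypothetical.

```python
def combined_score(metric_scores, average=True):
    """Add (or average) the per-metric scores of one probe (Equation 5)."""
    total = sum(metric_scores.values())
    return total / len(metric_scores) if average else total


def best_probe_in_window(probes, start, end):
    """Return the probe with the highest combined score in [start, end)."""
    in_window = [p for p in probes if start <= p["position"] < end]
    if not in_window:
        return None
    return max(in_window, key=lambda p: combined_score(p["scores"]))


# Hypothetical candidate probes with per-metric scores on a 0-to-1 scale.
probes = [
    {"id": "p1", "position": 100, "scores": {"duplex_tm": 0.5, "homology": 1.0}},
    {"id": "p2", "position": 150, "scores": {"duplex_tm": 1.0, "homology": 1.0}},
    {"id": "p3", "position": 900, "scores": {"duplex_tm": 1.0, "homology": 1.0}},
]
print(best_probe_in_window(probes, 0, 500)["id"])
```

Averaging rather than summing keeps the combined score on the same 0-to-1 scale as the individual scores, and it does not change the ranking when every probe has the same number of metrics.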

In other embodiments, individual probe scores obtained from the trend curves are combined by fitting multiple scores for a training data set, using the change in the measured slope response with a change in copy number in a model system, as provided in Equation 6: $\begin{matrix} S_{c}(p_{i}) = \sum_{m} C_{m} S_{m}(p_{i}) & (6) \end{matrix}$
A number of different methods are available to combine and fit multivariate data in this manner, including, without limitation, principal component analysis (PCA), partial least squares (PLS), chemometrics, as well as other methods.

In still other embodiments, a linear fitting function is used, involving taking the inverse of a matrix (as implemented in Matlab). In this approach, the vector of the measured slope for each probe is represented as Y (one value per probe), and the matrix of the scores as M (number of metrics +1, # of probes), where all but one of the columns of M are vectors of scores for each metric, and the last column is a vector of ones, representing an additive constant. Equation 7 describes the basic relationship between the score S and the matrix of the scores M:
S=CM (7)
where C is a row vector of coefficients C_m (one term for each metric plus the constant term). Multiplying both sides by the inverse of M and solving for C gives the following (Equation 8):
C=SM⁻¹ (8)
where M M⁻¹ is I, the identity matrix, and M⁻¹ is approximated using pinv(M), the Moore-Penrose pseudoinverse of M. The pinv function is implemented in Matlab using a singular value decomposition (i.e., a common mathematical method to invert a matrix or solve a set of linear equations, available in many commercially available software packages). This approach allows any number of metrics to be included in the score calculation, as long as the metrics provide information in improving the performance of the selected probes. An example of fitted data obtained from using the above linear matrix functions is shown in FIG. 7, which depicts probe scores obtained by plotting the combined fitted slopes for four different metrics (DuplexTm, complexity, HomLogS2B and MaxSubSeqTm) against the measured slope for the X chromosome model system.
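The pseudoinverse fit of Equations 7 and 8 can be sketched in Python with NumPy's `numpy.linalg.pinv`, which also uses a singular value decomposition. This sketch assumes that S in Equation 7 is set to the measured slopes Y during fitting, and it stores one metric per row of M so that S = CM is dimensionally consistent. The function name and the example numbers are hypothetical.

```python
import numpy as np


def fit_score_weights(metric_scores, measured_slopes):
    """Solve S = C M for C using the Moore-Penrose pseudoinverse.

    metric_scores: array of shape (n_metrics, n_probes).
    measured_slopes: array of shape (n_probes,), used as S when fitting.
    Returns (C, M), where the last element of C is the additive constant.
    """
    scores = np.asarray(metric_scores, dtype=float)
    ones = np.ones((1, scores.shape[1]))       # row for the constant term
    M = np.vstack([scores, ones])              # (n_metrics + 1, n_probes)
    C = np.asarray(measured_slopes, dtype=float) @ np.linalg.pinv(M)
    return C, M


# Hypothetical training data: two metrics scored for four probes.
scores = np.array([[0.2, 0.4, 0.6, 0.8],
                   [0.9, 0.1, 0.5, 0.3]])
measured = 0.5 * scores[0] + 0.25 * scores[1] + 0.1
C, M = fit_score_weights(scores, measured)
print(C)        # fitted weights and constant
print(C @ M)    # combined score for each probe
```

Because the pseudoinverse returns the least-squares solution, the same call works when there are far more probes than metrics, which is the usual case for a training set.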

In embodiments, improved probe performance with respect to various metrics may also be obtained using a multiplicative fitting function, rather than an additive function. In a multiplicative curve fit, several individual scores, or combined scores, are multiplied together to produce a combined score. The metrics in each category are first combined using a linear method (such as the additive fitting already described) to produce intermediate scores. These intermediate scores are then combined using a multiplicative approach.

In embodiments, the scores associated with different probes are combined linearly to give an overall score for a particular metric. The overall score for each metric is then combined in a multiplicative fit with the overall score for other metrics. For example, the thermodynamic scores related to the duplex melting temperature (DuplexTm) are combined linearly to give an overall duplex-thermodynamics score D; the homology scores are also combined to give an overall homology score H, and any structural scores for the probe are combined independently giving P, with target structural scores (if any) combined to give an overall target score T. Each of these phenomena can independently lead to decreased probe performance. For example, a nearly ideal probe I with perfect homology scores, but poor thermodynamics scores may only have a slope of 50%. Similarly, a probe with perfect thermodynamic scores but poor homology score also may have only a slope of 50%. It follows then that a probe with both relatively poor thermodynamics (50%) and relatively poor homology scores (50%) will have a slope of 25% rather than 50%, as would be predicted by a linear model. This process will yield an overall probe score that varies between approximately zero and one. The coefficients for the combining of additive terms and the offsets are fitted to the data according to the following equation (Equation 9):
S=C_mDPHT+C_a (9)
where C_m are the multiplicative coefficients, and C_a is an additive coefficient.
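The multiplicative combination of Equation 9 reduces to a few lines. The sketch below reproduces the 50% and 50% example from the text; the function name and the default coefficient values are assumptions.

```python
def multiplicative_score(d, p, h, t, c_m=1.0, c_a=0.0):
    """Combined score S = C_m * D * P * H * T + C_a (Equation 9).

    d, p, h, t are the intermediate duplex-thermodynamics, probe-structure,
    homology and target-structure scores, each obtained beforehand by a
    linear combination within its own category.
    """
    return c_m * d * p * h * t + c_a


# Poor thermodynamics (0.5) and poor homology (0.5), ideal structure scores.
print(multiplicative_score(d=0.5, p=1.0, h=0.5, t=1.0))
```

With unit coefficients, the two independent 50% penalties compound to a combined score of 0.25, matching the 25% slope described above.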

When observing the relationship between probe performance, as measured by the slope response, and T_m, it is seen that the slope trend continues to improve as the trend tends towards lower T_m. This would suggest that probes with very low T_m show ideal performance. However, while this is true with respect to the model system, and systems where high-quality DNA is plentiful, it will not necessarily be true for many biological and clinical samples where the DNA quantity is very low, or where the DNA is degraded (as in a biopsy sample, for example). Therefore, in embodiments, to increase the robustness of the methods described herein, the score curves are modified to take into account effects associated with real samples, such as, but not limited to, DNA degradation or low DNA concentration. The methods produce more robust results if the modified score curves more accurately reflect the performance of real probes in a real biological sample. In particular, the methods are modified in order to produce consistently high signals and robust results, while minimizing the negative impact on probe response.

In embodiments, the methods herein are modified by replacing the fitted DuplexTm trend curves, as in FIG. 8, with various synthetic curves, in order to determine the effects of the modification on probe T_m distributions and signal distributions. In FIG. 8, the solid line represents the fitted Tm-slope response curve (i.e. not a synthetic curve), while the dotted line represents a synthetically generated asymmetric Lorentzian curve, with half-width to left of center at 20° C., and half-width to right of center at 10° C. The dash-dotted line is a symmetric triangle function, with half-width at 10° C., while the dashed line is a symmetric exponential decay function with a half-width of 7° C. All the synthetic curves in FIG. 8 are centered at 80° C. In an aspect, the generation of synthetic curves begins with a candidate pool with a large number of X chromosome probes (about 1.4 million). Pairwise filtering, as discussed earlier, is used to select different sets of approximately evenly spaced probes from a candidate set on the basis of the combined score for each set of probes (i.e. probes with the optimal score on an arbitrary numerical scale). This method helps enrich the candidate pool with probes with higher scores (i.e. “good” probes). Briefly, the pairwise filtering method uses the probe's combined score as the target value or parameter, with the pairwise algorithm selecting one of each pair of probes that has the closest score to the target value (i.e. 1). The goal is to select probes within a relatively narrow T_m range, based on the idea that probes with near ideal performance will typically fall within a narrow T_m range.
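The three synthetic score curves can be sketched as simple functions of T_m. The text does not define "half-width" precisely, so this sketch assumes it means the distance from the center at which the curve falls to half of its peak, and it assumes each curve is normalized to a peak of 1. The function names are hypothetical.

```python
import math

CENTER_TM = 80.0  # all synthetic curves in FIG. 8 are centered at 80 deg C


def asymmetric_lorentzian(tm, center=CENTER_TM, hw_left=20.0, hw_right=10.0):
    """Lorentzian with a different half-width on each side of the center."""
    hw = hw_left if tm < center else hw_right
    return 1.0 / (1.0 + ((tm - center) / hw) ** 2)


def symmetric_triangle(tm, center=CENTER_TM, hw=10.0):
    """Triangle that falls to 0.5 at hw and to zero at 2 * hw from center."""
    return max(0.0, 1.0 - abs(tm - center) / (2.0 * hw))


def symmetric_exponential(tm, center=CENTER_TM, hw=7.0):
    """Two-sided exponential decay that falls to 0.5 at hw from center."""
    return math.exp(-math.log(2.0) * abs(tm - center) / hw)


for tm in (60.0, 70.0, 80.0, 90.0):
    print(tm, asymmetric_lorentzian(tm), symmetric_triangle(tm),
          symmetric_exponential(tm))
```

Any of these functions could stand in for the fitted T_m-slope score of a probe; the wider left side of the Lorentzian penalizes low-T_m probes less severely than high-T_m probes.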

The score curves synthesized in this manner are used to generate the selected measured slope distributions shown in FIG. 9, which shows the probe T_m distributions that result from the use of the various combined scores in the pairwise reduction by about a factor of 180:1. The thick solid line is a distribution of T_m values of the candidate probes. The thin solid line shows a distribution of selected probe melting temperatures when the fitted T_m-slope response is used as one component of the combined score in the selection of probes. In each of the following, the synthetic curve replaces only the fitted T_m-slope response component in its contribution to the total combined score. The weights C_i of the various scores are kept constant. The dotted line is the T_m distribution using the asymmetric Lorentzian function component, the dash-dotted line is the T_m distribution using the symmetric triangle function component, and the dashed line is the T_m distribution using the symmetric exponential decay function component.

It can be seen from FIG. 9 that the peaks in the distribution shift from low values with the original scoring system to values closer to the optimal 80 degree T_m, with the modified scoring system (the sharp curve with a peak at 80 degrees is an artifact resulting from the candidate probe selection method, and not a function of the modified scoring system).

In embodiments, the methods described herein use a combination of experimentally measured slopes and predicted slopes to select probes for a microarray application. Such a combination is possible because the predicted slope has the same units and varies over the same range as the measured slope. Consequently, as experimental data becomes available, the predicted slope can be replaced with the measured slope when performing probe selection. This approach can be applied in a number of different ways. For example, the predicted slope can simply be replaced with the measured slope when experimental data is collected. In another embodiment, probes with measured slopes may be preferred over those with comparable predicted slopes by applying a numerical bias to the score, thereby reducing the risk of selecting a probe with good predicted parameters but poor actual performance. In yet another embodiment, uncertainty values are assigned to both the predicted and measured slopes, and the score values and uncertainties are taken into account for probe selection.
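The first two embodiments above (direct replacement, and a bias in favor of measured slopes) can be sketched as follows. The field names, the additive form of the bias, and its value of 0.05 are assumptions for illustration; the uncertainty-weighted embodiment is not shown.

```python
def effective_slope(predicted, measured=None, measured_bias=0.0):
    """Use the measured slope when one exists, else the predicted slope.

    measured_bias is an assumed additive bonus that favors probes backed
    by experimental data over probes with comparable predictions.
    """
    if measured is not None:
        return measured + measured_bias
    return predicted


def select_probe(probes, measured_bias=0.0):
    """Return the probe with the highest effective slope."""
    return max(
        probes,
        key=lambda p: effective_slope(p["predicted"], p.get("measured"),
                                      measured_bias),
    )


# Hypothetical candidates: probe "a" has only a prediction, "b" was measured.
candidates = [
    {"id": "a", "predicted": 0.90},
    {"id": "b", "predicted": 0.60, "measured": 0.88},
]
print(select_probe(candidates)["id"])
print(select_probe(candidates, measured_bias=0.05)["id"])
```

Without a bias the higher predicted slope wins; with the bias the measured probe is preferred even though its raw slope is slightly lower.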

Arrays

The present description also provides nucleic acid microarrays produced using the subject methods, as described herein. The subject arrays include at least two distinct nucleic acids that differ by monomeric sequence immobilized on, e.g., covalently on, different and known locations on the substrate surface. In certain embodiments, each distinct nucleic acid sequence of the array is typically present as a composition of multiple copies of the polymer on the substrate surface, e.g., as a spot on the surface of the substrate. The number of distinct nucleic acid sequences, and hence spots or similar structures, present on the array may vary, but is generally at least 2, usually at least 5 and more usually at least 10, where the number of different spots on the array may be as high as 100, 1000, 10,000, 100,000, 1,000,000 or higher, depending on the intended use of the array. The spots of distinct polymers present on the array surface are generally present as a pattern, where the pattern may be in the form of organized rows and columns of spots, e.g., a grid of spots, across the substrate surface, a series of curvilinear rows across the substrate surface, e.g., a series of concentric circles or semi-circles of spots, and the like. The density of spots present on the array surface may vary, but will generally be at least about 10 and usually at least about 100 spots/cm², where the density may be as high as 10⁶ or higher. In other embodiments, the polymeric sequences are not arranged in the form of distinct spots, but may be positioned on the surface such that there is substantially no space separating one polymer sequence/feature from another. An exemplary array is described in U.S. Patent Publication No. 20050095596, which is incorporated herein by reference.

Arrays can be fabricated using drop deposition from pulsejets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351, U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. These references are incorporated herein by reference. Other drop deposition methods can be used for fabrication, as previously described herein.

A feature of the subject arrays is that they include one or more, usually a plurality of, oligonucleotide probes predicted by the statistical methods described herein. The oligonucleotide probes selected according to the subject methods are suitable for use in a plurality of different gene expression or genomic microarray applications. The statistical regression method evaluates probe performance, without using any assumptions about the functional relationship between the oligonucleotide sequence and the predictive parameters. Oligonucleotide probes that “cluster” (i.e. consistently produce the same response) will perform substantially similarly under a plurality of different experimental conditions.

The arrays as described herein can be used in a variety of different microarray applications, including gene expression experiments and genomic analysis. In using an array, the array will typically be exposed to a sample (for example, a fluorescently labeled analyte, such as a sample containing genomic DNA) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at each feature of the array to detect any binding complexes on the surface of the array. For example, a scanner may be used for this purpose that is similar to the AGILENT MICROARRAY SCANNER available from Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. patent application Ser. No. 09/846,125 “Reading Multi-Featured Arrays” by Dorsel et al.; and Ser. No. 09/430,214 “Interrogating Multi-Featured Arrays” by Dorsel et al. As previously mentioned, these references are incorporated herein by reference.

However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. No. 6,221,583 and elsewhere). Results from the reading may be raw results (such as fluorescence intensity readings for each feature in one or more color channels) or may be processed results such as obtained by rejecting a reading for a feature which is below a predetermined threshold and/or forming conclusions based on the pattern read from the array (such as whether or not a particular target sequence may have been present in the sample or an organism from which a sample was obtained exhibits a particular condition). The results of the reading (processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).

Systems

The methods described herein are carried out in part with the aid of a computer-based system, driven by software specific to the methods. A “computer-based system” refers to the hardware, software, and data storage used to analyze the information of the present disclosure. Typical hardware of the computer-based systems of the present disclosure comprises a central processing unit (CPU), input, output, and data storage. A skilled artisan can readily appreciate that any one of the currently available computer-based systems is suitable for use in the present disclosure. The data storage means may comprise any manufacture comprising a recording of the present information as described above, or a memory access means that can access such a manufacture. In certain instances a computer-based system may include one or more wireless devices.

Data from at least one of the detecting and deriving steps, as described above, is transmitted to a remote location. By “remote location” is meant a location other than the location at which the array is present and hybridization occurs. For example, a remote location could be another location (e.g. office, lab, etc.) in the same city, another location in a different city, another location in a different state, another location in a different country, etc. As such, when one item is indicated as being “remote” from another, what is meant is that the two items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart. “Communicating” information means transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. The data may be transmitted to the remote location for further evaluation and/or use. Any convenient telecommunications means may be employed for transmitting the data, e.g., facsimile, modem, internet, etc.

To “record” data, programming or other information on a computer-readable medium refers to a process for storing information on a recordable storage medium, using any such methods as known in the art. Examples include magnetic media such as hard drives, tapes, disks, and the like. Optical media can include CDs, DVDs, and the like. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and the formats can be used for storage, e.g., word processing text file, database format, etc.

A “processor” references any hardware and/or software combination that will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of an electronic controller, mainframe, server or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic medium or optical disk may carry the programming, and can be read by a suitable reader communicating with each processor at its corresponding station.

In aspects, the methods described herein are performed using computer-readable media containing programming stored thereon implementing the subject methods. The computer-readable media may be, for example, in the form of a computer disk or CD, a floppy disk, a magnetic “hard card”, a server, or any other computer-readable media capable of containing data or the like, stored electronically, magnetically, optically or by other means. Accordingly, stored programming embodying steps for carrying out the subject methods may be transferred to a computer such as a personal computer (PC), (i.e. accessible by a researcher or the like), by physical transfer of a CD, floppy disk, or like medium, or may be transferred using a computer network, server, or any other interface connection, e.g., the Internet.

In an embodiment, the system described herein may include a single computer or the like with a stored algorithm capable of evaluating probe performance, as described herein, i.e. a computational analysis system that performs statistical regression analysis on a set of training data. In certain embodiments, the system is further characterized in that it provides a user interface, where the user interface presents to a user the option of selecting among one or more different, or multiple different inputs. For example, in the systems described herein, the user has the option of selecting various predictive parameters, such as composition factors, thermodynamic factors, kinetic factors, and mathematical combinations of such factors, as well as analogous parameters for the intended genomic targets. Computational systems that may be readily modified to become systems of the subject invention include those described in U.S. Pat. No. 6,251,588, the disclosure of which is incorporated herein by reference.

The various embodiments described above are provided by way of illustration only and should not be construed to limit the invention. Those skilled in the art will readily recognize various modifications and changes that may be made to the present methods without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the claims attached hereto.

Claims

1. A method for selecting an oligonucleotide probe for use on a microarray, comprising:

generating two or more candidate oligonucleotide probes;

analyzing the two or more candidate probes with one or more metrics that indicate probe performance to obtain an individual probe score for each metric;

combining the individual probe score for each metric into a single combined score for the probe; and

selecting the probe with a combined score closest to an optimal score value for use on a microarray, wherein the optimal score value is the score at, or nearest to, the highest end of a numerical scale of probe scores.

2. The method of claim 1, wherein the optimal score value is about 1.0 on a scale of probe scores ranging from 0.0 to 1.0.

3. The method of claim 1, wherein the optimal score value is about 100 on a scale of probe scores ranging from 50 to 100.

4. The method of claim 1, wherein generating a candidate set of oligonucleotide probes comprises:

selecting one or more target sequences within a region of interest; and

tiling subsequences of each target sequence across each region of interest to generate the candidate set of potential probes.

5. The method of claim 4, further comprising:

generating a large set of potential probes by tiling the target sequences in single base steps across the region of interest; and

applying pairwise reduction to reduce the number of probes by a factor of greater than about 2 and less than about 1000.

6. The method of claim 1, wherein the metrics used to analyze the candidate probes comprise direct metrics, indirect metrics, in silico metrics, or combinations thereof.

7. The method of claim 6, wherein direct metrics used to analyze the candidate probes comprise the changes in probe response based on experimentally measured quantities, further comprising known changes in copy number of a target molecule.

8. The method of claim 6, wherein indirect metrics used to analyze the candidate probes comprise changes in predicted probe response resulting from experimentally measured quantities for a target molecule.

9. The method of claim 6, wherein indirect metrics used to analyze the candidate probes comprise changes in predicted probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes.

10. The method of claim 6, wherein in silico metrics used to analyze the candidate probes comprise changes in probe response based on calculated quantities for a target molecule.

11. The method of claim 6, wherein in silico metrics used to analyze the candidate probes comprise changes in probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes.

12. The method of claim 1, wherein analyzing the candidate probes with one or more metrics to obtain individual probe scores further comprises:

calculating the slope for each candidate probe;

plotting the slope against the corresponding value for each of the metrics to obtain a trend curve; and

fitting the trend curve with a polynomial function with order n to generate an individual probe score.

13. The method of claim 12, wherein the order n of the polynomial function ranges from n=1 to n=20.

14. The method of claim 1, wherein combining individual probe scores for each metric to obtain a combined score comprises adding or averaging the probe score for each metric.

15. The method of claim 1, wherein combining the individual probe scores to obtain a combined score comprises fitting the scores with a linear additive multivariate fitting function.

16. The method of claim 15, wherein combining the individual probe scores further comprises fitting measured slope responses for a well-characterized training data set to a change in copy number.

17. The method of claim 1, wherein combining the individual probe scores to obtain a combined score comprises fitting the scores with a linear multiplicative curve-fitting function.

18. The method of claim 17, wherein combining the individual probe scores further comprises:

combining metrics in each category using a linear model to obtain intermediate scores; and

multiplying together the intermediate scores to generate the combined score.

19. The method of claim 1, wherein combining the individual probe scores further comprises synthetically modifying the combined score to obtain probes with more robust performance, the synthetic modification further comprising:

generating a candidate set of probes;

applying pairwise reduction to reduce the number of probes in the candidate set;

calculating the slope for each probe;

plotting the slope against the corresponding value for each of the metrics to obtain a trend curve;

fitting the trend curve to generate a measured probe score;

replacing the fitted trend curve with a synthetic curve; and

using the synthetic curve to generate a predicted score for each probe.

20. The method of claim 1, wherein selecting the probe for use in a microarray application comprises:

combining experimentally measured scores with predicted scores to obtain a combined score value for the probe; and

selecting the probe with a combined score value closest to an optimal score value, wherein the optimal score value is the score at, or nearest to, the highest end of a numerical scale of probe scores.

21. A computer-readable medium having recorded thereon a program that selects a probe for use in microarray applications according to the method of claim 1.

22. A computational analysis system comprising the computer-readable medium according to claim 21.

23. A method of fabricating a nucleic acid microarray, comprising producing at least two different oligonucleotide probes on a microarray substrate, wherein at least one of the two different oligonucleotide probes is a probe selected according to the method of claim 1.

24. A nucleic acid microarray produced according to the method of claim 23.