Classification of pixels in a microarray image based on pixel intensities and a preview mode facilitated by pixel-intensity-based pixel classification
One disclosed embodiment is a method based on an iteratively employed Bayesian-probability-based pixel classification, used to refine an initial feature mask that specifies those pixels in a region of interest, including and surrounding a feature in the scanned image of a microarray, that together compose a pixel-based image of the feature within the region of interest. In a described embodiment, a feature mask is prepared using only the pixel-based intensity data for a region of interest, a putative position and size of the feature within the region of interest, and mathematical models of the probability distribution of background-pixel and feature-pixel signal noise and mathematical models of the probabilities of finding feature pixels and background pixels at various distances from the putative feature position. In a described embodiment, preparation of a feature mask allows a feature-extraction system to display feature sizes and locations to a user prior to undertaking the computationally intensive and time-consuming task of feature-signal extraction from the pixel-based intensity data obtained by scanning a microarray.
The present invention is related to processing of microarray data and, in particular, to a method and system for partitioning pixels in an image of a microarray into a set of feature pixels and a set of background pixels.
BACKGROUND OF THE INVENTION

The present invention is related to methods and systems for determining which pixels, in a digital image of a microarray, are associated with features of the microarray, and which pixels are background pixels associated with inter-feature regions of the microarray. A general background of microarray technology is first provided, in this section, to facilitate discussion of microarray-data processing in following subsections. It should be noted that microarrays are also referred to as “molecular arrays” and simply as “arrays.” These alternate terms may be used interchangeably in the context of microarrays and microarray technologies. Art described in this section is not admitted to be prior art to this application.
Array technologies have gained prominence in biological research and are likely to become important and widely used diagnostic tools in the healthcare industry. Currently, microarray techniques are most often used to determine the concentrations of particular nucleic-acid polymers in complex sample solutions. Molecular-array-based analytical techniques are not, however, restricted to analysis of nucleic acid solutions, but may be employed to analyze complex solutions of any type of molecule that can be optically or radiometrically scanned and that can bind with high specificity to complementary molecules synthesized within, or bound to, discrete features on the surface of an array. Because arrays are widely used for analysis of nucleic acid samples, the following background information on arrays is introduced in the context of analysis of nucleic acid solutions following a brief background of nucleic acid chemistry.
Deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) are linear polymers, each synthesized from four different types of subunit molecules. The subunit molecules for DNA include: (1) deoxy-adenosine, abbreviated “A,” a purine nucleoside; (2) deoxy-thymidine, abbreviated “T,” a pyrimidine nucleoside; (3) deoxy-cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) deoxy-guanosine, abbreviated “G,” a purine nucleoside. The subunit molecules for RNA include: (1) adenosine, abbreviated “A,” a purine nucleoside; (2) uridine, abbreviated “U,” a pyrimidine nucleoside; (3) cytidine, abbreviated “C,” a pyrimidine nucleoside; and (4) guanosine, abbreviated “G,” a purine nucleoside.
The DNA polymers that contain the organization information for living organisms occur in the nuclei of cells in pairs, forming double-stranded DNA helixes. One polymer of the pair is laid out in a 5′ to 3′ direction, and the other polymer of the pair is laid out in a 3′ to 5′ direction. The two DNA polymers in a double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through attractive forces including hydrophobic interactions between stacked purine and pyrimidine bases and hydrogen bonding between purine and pyrimidine bases, the attractive forces emphasized by conformational constraints of DNA polymers. Because of a number of chemical and topographic constraints, double-stranded DNA helices are most stable when deoxy-adenylate subunits of one strand hydrogen bond to deoxy-thymidylate subunits of the other strand, and deoxy-guanylate subunits of one strand hydrogen bond to corresponding deoxy-cytidylate subunits of the other strand.
FIGS. 2A-B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands.
Two DNA strands linked together by hydrogen bonds form the familiar helix structure of a double-stranded DNA helix.
Double-stranded DNA may be denatured, or converted into single-stranded DNA, by changing the ionic strength of the solution containing the double-stranded DNA or by raising the temperature of the solution. Single-stranded DNA polymers may be renatured, or converted back into DNA duplexes, by reversing the denaturing conditions, for example by lowering the temperature of the solution containing complementary single-stranded DNA polymers. During renaturing or hybridization, complementary bases of anti-parallel DNA strands form Watson-Crick (“WC”) base pairs in a cooperative fashion, leading to reannealing of the DNA duplex. Strictly A-T and G-C complementarity between anti-parallel polymers leads to the greatest thermodynamic stability, but partial complementarity, including non-WC base pairing, may also occur to produce relatively stable associations between partially complementary polymers. In general, the longer the regions of consecutive WC base pairing between two nucleic acid polymers, the greater the stability of hybridization between the two polymers under renaturing conditions.
The ability to denature and renature double-stranded DNA has led to the development of many extremely powerful and discriminating assay technologies for identifying the presence of DNA and RNA polymers having particular base sequences or containing particular base subsequences within complex mixtures of different nucleic acid polymers, other biopolymers, and inorganic and organic chemical compounds. One such methodology is the array-based hybridization assay.
Once an array has been prepared, the array may be exposed to a sample solution of target DNA or RNA molecules (410-413 in
Finally, as shown in
One, two, or more than two data subsets within a data set can be obtained from a single microarray by scanning the microarray for one, two, or more than two types of signals. Two or more data subsets can also be obtained by combining data from two different arrays. When optical scanning is used to detect fluorescent or chemiluminescent emission from chromophore labels, a first set of signals, or data subset, may be generated by scanning the microarray at a first optical wavelength, a second set of signals, or data subset, may be generated by scanning the microarray at a second optical wavelength, and additional sets of signals may be generated by scanning the microarray at additional optical wavelengths. Different signals may be obtained from a microarray by radiometric scanning to detect radioactive emissions at one, two, or more than two different energy levels. Target molecules may be labeled with either a first chromophore that emits light at a first wavelength, or a second chromophore that emits light at a second wavelength. Following hybridization, the microarray can be scanned at the first wavelength to detect target molecules, labeled with the first chromophore, hybridized to features of the microarray, and can then be scanned at the second wavelength to detect target molecules, labeled with the second chromophore, hybridized to the features of the microarray. In one common microarray system, the first chromophore emits light at a red visible-light wavelength, and the second chromophore emits light at a green visible-light wavelength.
The data set obtained from scanning the microarray at the red wavelength is referred to as the “red signal,” and the data set obtained from scanning the microarray at the green wavelength is referred to as the “green signal.” While it is common to use one or two different chromophores, it is possible to use one, three, four, or more than four different chromophores and to scan a microarray at one, three, four, or more than four wavelengths to produce one, three, four, or more than four data sets.
In a raw and/or partially processed scanned image of a microarray, features may be asymmetrical, may contain highly non-uniform intensities due to a large number of different possible procedural, experimental, and instrumental errors and instabilities, and may be substantially offset from their expected positions within the general grid-like, regular pattern in which features are deposited. All of these effects can lead to difficulties in employing the currently used, intensity-based feature-pixel/background-pixel partitioning methods described above. For these reasons, there is a need for other methods for partitioning pixels within a digital image of a microarray into sets of feature pixels and background pixels.
SUMMARY OF THE INVENTION

In one embodiment of the present invention, a method based on an iteratively employed Bayesian-probability-based pixel classification is used to refine an initial feature mask that specifies those pixels in a region of interest, including and surrounding a feature in the scanned image of a microarray, that together compose a pixel-based image of the feature within the region of interest. In the described embodiment, a two-dimensional Boolean array is employed as a feature mask, each cell within the two-dimensional array indicating whether or not a corresponding pixel within the pixel-based scanned image of the region of interest surrounding and including the feature is a feature pixel or a background pixel. The feature mask is prepared using the pixel-based intensity data for a region of interest, a putative position and size of the feature within the region of interest, and mathematical models of the probability distributions of background-pixel and feature-pixel signal noise and mathematical models of the probabilities of finding feature pixels and background pixels at various distances from the putative feature position. Preparation of a feature mask allows a feature-extraction system to display feature sizes and locations to a user prior to undertaking the computationally intensive and time-consuming task of feature-signal extraction from the pixel-based intensity data obtained by scanning a microarray.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 2A-B illustrate the hydrogen bonding between the purine and pyrimidine bases of two anti-parallel DNA strands.
FIGS. 8A-E illustrate feature-pixel/background-pixel partitioning problems when the partitioning is based on feature-pixel/background-pixel intensity differentials.
FIGS. 17A-B illustrate the probability distributions for signal noise and for pixel intensities with respect to pixel location, respectively.
One embodiment of the present invention provides a method and system for discriminating, in a digital image of a microarray, between pixels associated with features and pixels in inter-feature regions of the microarray image, referred to as background pixels. In a first subsection, below, additional information about microarrays is provided. Readers familiar with microarrays may skip over this first subsection. In a second subsection, embodiments of the present invention are described through examples, graphical representations, and with reference to several control-flow diagrams. In a third subsection, a C++-like pseudocode implementation of one embodiment of the present invention is provided, to illustrate details omitted from the higher-level discussion in the previous subsection. An alternative embodiment of the present invention is included in Appendix A.
Additional Information About Molecular Arrays

An array may include any one-, two-, or three-dimensional arrangement of addressable regions, or features, each bearing a particular chemical moiety or moieties, such as biopolymers, associated with that region. Any given array substrate may carry one, two, or four or more arrays disposed on a front surface of the substrate. Depending upon the use, any or all of the arrays may be the same or different from one another, and each may contain multiple spots or features. A typical array may contain more than ten, more than one hundred, more than one thousand, more than ten thousand features, or even more than one hundred thousand features, in an area of less than 20 cm2 or even less than 10 cm2. For example, square features may have widths, or round features may have diameters, in the range from 10 μm to 1.0 cm. In other embodiments, each feature may have a width or diameter in the range of 1.0 μm to 1.0 mm, usually 5.0 μm to 500 μm, and more usually 10 μm to 200 μm. Features other than round or square may have area ranges equivalent to those of circular features with the foregoing diameter ranges. At least some, or all, of the features may be of different compositions (for example, when any repeats of each feature composition are excluded, the remaining features may account for at least 5%, 10%, or 20% of the total number of features). Interfeature areas are typically, but not necessarily, present. Interfeature areas generally do not carry probe molecules. Such interfeature areas typically are present where the arrays are formed by processes involving drop deposition of reagents, but may not be present when, for example, photolithographic array fabrication processes are used. When present, interfeature areas can be of various sizes and configurations.
Each array may cover an area of less than 100 cm2, or even less than 50 cm2, 10 cm2 or 1 cm2. In many embodiments, the substrate carrying the one or more arrays will be shaped generally as a rectangular solid having a length of more than 4 mm and less than 1 m, usually more than 4 mm and less than 600 mm, more usually less than 400 mm; a width of more than 4 mm and less than 1 m, usually less than 500 mm and more usually less than 400 mm; and a thickness of more than 0.01 mm and less than 5.0 mm, usually more than 0.1 mm and less than 2 mm, and more usually more than 0.2 mm and less than 1 mm. Other shapes are possible, as well. With arrays that are read by detecting fluorescence, the substrate may be of a material that emits low fluorescence upon illumination with the excitation light. Additionally in this situation, the substrate may be relatively transparent to reduce the absorption of the incident illuminating laser light and subsequent heating if the focused laser beam travels too slowly over a region. For example, a substrate may transmit at least 20%, or 50% (or even at least 70%, 90%, or 95%), of the illuminating light incident on the front as may be measured across the entire integrated spectrum of such illuminating light or alternatively at 532 nm or 633 nm.
Arrays can be fabricated using drop deposition from pulsejets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, U.S. Pat. No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351, U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Interfeature areas need not be present particularly when the arrays are made by photolithographic methods as described in those patents.
A microarray is typically exposed to a sample including labeled target molecules, or, as mentioned above, to a sample including unlabeled target molecules followed by exposure to labeled molecules that bind to unlabeled target molecules bound to the array, and the array is then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif., may be used for this purpose. Other suitable apparatus and methods are described in U.S. patent applications: Ser. No. 10/087447 “Reading Dry Chemical Arrays Through The Substrate” by Corson et al., and in U.S. Pat. Nos. 6,518,556; 6,486,457; 6,406,849; 6,371,370; 6,355,921; 6,320,196; 6,251,685; and 6,222,664. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques, such as detecting chemiluminescent or electroluminescent labels, or electrical techniques, where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. No. 6,251,685, U.S. Pat. No. 6,221,583 and elsewhere.
A result obtained from reading an array may be used in that form or may be further processed to generate a result such as that obtained by forming conclusions based on the pattern read from the array, such as whether or not a particular target sequence may have been present in the sample, or whether or not a pattern indicates a particular condition of an organism from which the sample came. A result of the reading, whether further processed or not, may be forwarded, such as by communication, to a remote location if desired, and received there for further use, such as for further processing. When one item is indicated as being remote from another, this means that the two items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart. Communicating information references transmitting the data representing that information as electrical signals over a suitable communication channel, for example, over a private or public network. Forwarding an item refers to any means of getting the item from one location to the next, whether by physically transporting that item or, in the case of data, physically transporting a medium carrying the data or communicating the data.
As pointed out above, array-based assays can involve other types of biopolymers, synthetic polymers, and other types of chemical entities. A biopolymer is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides, peptides, and polynucleotides, as well as their analogs such as those compounds composed of, or containing, amino acid analogs or non-amino-acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids, or synthetic or naturally occurring nucleic-acid analogs, in which one or more of the conventional bases has been replaced with a natural or synthetic group capable of participating in Watson-Crick-type hydrogen bonding interactions. Polynucleotides include single or multiple-stranded configurations, where one or more of the strands may or may not be completely aligned with another. For example, a biopolymer includes DNA, RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein, regardless of the source. An oligonucleotide is a nucleotide multimer of about 10 to 100 nucleotides in length, while a polynucleotide includes a nucleotide multimer having any number of nucleotides.
As an example of a non-nucleic-acid-based microarray, protein antibodies may be attached to features of the array that would bind to soluble labeled antigens in a sample solution. Many other types of chemical assays may be facilitated by array technologies. For example, polysaccharides, glycoproteins, synthetic copolymers, including block copolymers, biopolymer-like polymers with synthetic or derivatized monomers or monomer linkages, and many other types of chemical or biochemical entities may serve as probe and target molecules for array-based analysis. A fundamental principle upon which arrays are based is that of specific recognition, by probe molecules affixed to the array, of target molecules, whether by sequence-mediated binding affinities, binding affinities based on conformational or topological properties of probe and target molecules, or binding affinities based on spatial distribution of electrical charge on the surfaces of target and probe molecules.
Scanning of a microarray by an optical scanning device or radiometric scanning device generally produces a scanned image comprising a rectilinear grid of pixels, with each pixel having a corresponding signal intensity. These signal intensities are processed by an array-data-processing program that analyzes data scanned from an array to produce experimental or diagnostic results which are stored in a computer-readable medium, transferred to an intercommunicating entity via electronic signals, printed in a human-readable format, or otherwise made available for further use. Molecular array experiments can indicate precise gene-expression responses of organisms to drugs, other chemical and biological substances, environmental factors, and other effects. Molecular array experiments can also be used to diagnose disease, for gene sequencing, and for analytical chemistry. Processing of microarray data can produce detailed chemical and biological analyses, disease diagnoses, and other information that can be stored in a computer-readable medium, transferred to an intercommunicating entity via electronic signals, printed in a human-readable format, or otherwise made available for further use.
Overview of the Present Invention

When a microarray is scanned, data may be collected as a two-dimensional digital image of the microarray, each pixel of which represents the intensity of phosphorescent, fluorescent, chemiluminescent, or radioactive emission from an area of the microarray corresponding to the pixel. A microarray data set may comprise a two-dimensional image, a list of numerical or alphanumerical pixel intensities, or any of many other computer-readable data sets. An initial series of steps employed in processing scanned, digital microarray images includes constructing a regular coordinate system for the digital image of the microarray by which the features within the digital image of the microarray can be indexed and located. For example, when the features are laid out in a periodic, rectilinear pattern, a rectilinear coordinate system is commonly constructed so that the positions of the centers of features lie as closely as possible to intersections between horizontal and vertical gridlines of the rectilinear coordinate system. Then, regions of interest (“ROIs”) are computed, based on the initially estimated positions of the features in the coordinate grid, and centroids for the ROIs are computed in order to refine the positions of the features. Once the position of a feature is refined, feature pixels can be differentiated from background pixels within the ROI, and the signal corresponding to the feature can then be computed by integrating the intensity over the feature pixels.
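The centroid computation mentioned above, a first moment of intensity over a two-dimensional pixel space, can be sketched as follows. This is a minimal illustration only: the function name, the flat row-major data layout, and the fallback to the geometric center for an all-zero ROI are assumptions for illustration, not the implementation described in this document.

```cpp
#include <utility>
#include <vector>

// Compute the intensity-weighted centroid (first moment of intensity) of an
// ROI stored row-major in a width x height vector of pixel intensities.
// Sketch only; names and data layout are illustrative assumptions.
std::pair<double, double> roiCentroid(const std::vector<double>& intensity,
                                      int width, int height)
{
    double total = 0.0, xMoment = 0.0, yMoment = 0.0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double i = intensity[y * width + x];
            total   += i;
            xMoment += i * x;
            yMoment += i * y;
        }
    }
    if (total == 0.0) {
        // No signal at all: fall back to the geometric center of the ROI.
        return { (width - 1) / 2.0, (height - 1) / 2.0 };
    }
    return { xMoment / total, yMoment / total };
}
```

A bright, well-centered feature pulls the centroid toward its center, which is exactly why, as the following paragraphs note, this moment-of-intensity approach degrades for low-signal features in noisy backgrounds.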
The techniques described above generally rely on discriminating feature pixels from background pixels on the basis of differences in intensity between feature pixels and background pixels. The centroids of ROIs are moments of intensity in a two-dimensional pixel space. Unfortunately, there are many cases where discriminating features based on intensity leads to inaccurate and even grossly inaccurate partitioning of pixels between feature pixels and background pixels, and may lead to an inability to identify feature pixels, especially in the case of low-signal features surrounded by noisy background pixels.
FIGS. 8A-E illustrate feature-pixel/background-pixel partitioning problems when the partitioning is based on feature-pixel/background-pixel intensity differentials.
Visual inspection of the pixel intensities in the microarray-image region shown in
Embodiments of the present invention are described, in overview, in this subsection with reference to control-flow diagrams provided as
Returning to
Returning to
In the iterative, feature-mask refinement step, 912, the routine “generateFeatureMask” carries out a Bayesian-probability-based method for repartitioning pixels based on mathematical models for the probability distributions of feature and background pixel intensity noise and mathematical models for probability distributions for feature-pixel and background-pixel intensities with respect to the distances of pixels from a putative location of the center of the feature within the ROI, as independently determined by a gridding method.
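At a high level, the iterative refinement performed by “generateFeatureMask” can be sketched as a loop that reclassifies each pixel and stops when the partition stabilizes. The driver below is an illustrative assumption: the classifier callback, the flip-count convergence test, and the default thresholds merely mirror the kinds of limits discussed later (e.g., FlipLim and IterationLim), not the exact control flow of the pseudocode implementation.

```cpp
#include <cstddef>
#include <vector>

// Sketch of an iterative mask-refinement driver: repeatedly reclassify each
// pixel and stop when the number of pixels that change category ("flips")
// drops below a convergence threshold, or when an iteration limit is hit.
// The classifier callback and thresholds are illustrative assumptions.
template <typename Classifier>
int refineMask(std::vector<bool>& mask, Classifier classify,
               int flipLim = 26, int iterationLim = 10)
{
    for (int iter = 0; iter < iterationLim; ++iter) {
        int flips = 0;
        for (std::size_t p = 0; p < mask.size(); ++p) {
            bool isFeature = classify(p);  // Bayesian repartitioning step
            if (isFeature != mask[p]) {
                mask[p] = isFeature;
                ++flips;
            }
        }
        if (flips < flipLim)
            return iter + 1;  // converged: few pixels changed category
    }
    return iterationLim;  // iteration limit reached without convergence
}
```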
Next, in step 918, the routine “generateFeatureMask” determines the centroid and feature radius of the feature represented in the final feature mask, an example of which is shown in
The Bayesian-probability-based repartitioning, undertaken in step 1508 of the iterative feature-mask refinement routine, is based on the following application of Bayes' theorem:

P(F/i, x)=P(F, i, x)/P(i, x)=P(i/x, F)P(F, x)/P(i, x)=P(i/x, F)P(F/x)P(x)/P(i, x)

P(B/i, x)=P(B, i, x)/P(i, x)=P(i/x, B)P(B, x)/P(i, x)=P(i/x, B)P(B/x)P(x)/P(i, x)

where
- P(F/i, x) is the probability that a pixel is a feature pixel given the pixel's intensity i and location x;
- P(B/i, x) is the probability that a pixel is a background pixel given the pixel's intensity i and location x;
- P(F, i, x) is the probability of a pixel being a feature pixel having intensity i and location x;
- P(B, i, x) is the probability of a pixel being a background pixel having intensity i and location x;
- P(i, x) is the probability of a pixel having intensity i and location x;
- P(i/x, F) is the probability of a pixel having intensity i given the pixel's location x and the fact that the pixel is a feature pixel;
- P(i/x, B) is the probability of a pixel having intensity i given the pixel's location x and the fact that the pixel is a background pixel;
- P(F, x) is the probability of a pixel being a feature pixel and having a location x;
- P(B, x) is the probability of a pixel being a background pixel and having a location x;
- P(F/x) is the prior probability that a pixel belongs to a feature given its location x;
- P(B/x) is the prior probability that a pixel belongs to the background given its location x; and
- P(x) is the probability that a pixel has location x.
The probability distributions of pixel intensities due to noise for both feature and background pixels are modeled as Gaussian distributions:

P(i/x, F)=(1/(σF√2π))exp(−(i−μF)2/2σF2)

P(i/x, B)=(1/(σB√2π))exp(−(i−μB)2/2σB2)

where
- σF is the standard deviation of the feature-pixel intensities;
- σF2 is the variance of the feature-pixel intensities;
- μF is the mean of the feature-pixel intensities;
- σB is the standard deviation of the background-pixel intensities;
- σB2 is the variance of the background-pixel intensities;
- μB is the mean of the background-pixel intensities.
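The Gaussian noise model above reduces to a single normal-density helper, evaluated with the feature-pixel parameters (μF, σF) for P(i/x, F) and the background-pixel parameters (μB, σB) for P(i/x, B). The function name below is an illustrative assumption.

```cpp
#include <cmath>

// Normal (Gaussian) density used to model feature-pixel and background-pixel
// intensity noise: pass (mu_F, sigma_F) or (mu_B, sigma_B) as appropriate.
double gaussianDensity(double i, double mu, double sigma)
{
    const double pi = 3.14159265358979323846;
    double z = (i - mu) / sigma;  // standardized distance from the mean
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}
```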
The prior probabilities are modeled as functions of the distance of a pixel from the putative feature centroid:
P(F/x)=fF(|xputative−x|)
P(B/x)=fB(|xputative−x|)
The exact functions fF and fB are provided in the C++-like pseudocode implementation in the following subsection, and can be replaced by more complex functions requiring more computation time.
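Because the exact fF and fB appear only in the pseudocode subsection, the sketch below substitutes a simple logistic falloff as a stand-in: the feature prior decreases smoothly with a pixel's distance d from the putative feature center, passing through 0.5 at the putative feature radius, with fB = 1 − fF. The logistic form and the steepness constant are assumptions for illustration, not the functions used in the described embodiment.

```cpp
#include <cmath>

// Illustrative stand-in priors: P(F/x) falls off logistically with the
// distance d = |x_putative - x| from the putative feature center, crossing
// 0.5 at the putative radius. P(B/x) is the complement. Assumed forms only.
double priorFeature(double d, double putativeRadius, double steepness = 2.0)
{
    return 1.0 / (1.0 + std::exp(steepness * (d - putativeRadius)));
}

double priorBackground(double d, double putativeRadius, double steepness = 2.0)
{
    return 1.0 - priorFeature(d, putativeRadius, steepness);
}
```

Any replacement pair of functions need only preserve the qualitative behavior described here: high feature probability near the putative center, high background probability far from it.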
FIGS. 17A-B illustrate the general probability distributions for the prior probabilities.
Up to this step, all of the above-mentioned calculations are performed separately for each of the two channels. The red/green suffix is omitted from each equation to avoid unnecessary notational complexity. A joint probability distribution of the two channels is then computed. In the described embodiment, the two channels are considered independent, and hence the joint probability density function is the product of the probability density functions for the red and green channels.
The Bayesian-probability calculations and calculation of the joint probability density function are simplified by observing that, for repartitioning, one only needs to determine the relative probabilities of a pixel being a feature or background pixel, given its intensity and location, as shown below:

log[P(F/i, x)/P(B/i, x)]=log P(i/x, F)+log P(F/x)−log P(i/x, B)−log P(B/x)

If the log ratio of the joint probability of a pixel being a feature pixel to the joint probability of a pixel being a background pixel is calculated, as described in the above equation, then the criterion for assigning a pixel to the feature-pixel category is as follows:

log[P(F/i, x)/P(B/i, x)]>0

while the criterion for assigning a pixel to the background-pixel category is as follows:

log[P(F/i, x)/P(B/i, x)]<0
where all the probability density functions are the joint probability density functions. By computing the log of the ratio, one can avoid the fairly expensive computation of the exponential term and sequential double-precision multiplications.
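The log-ratio test for two independent channels can be sketched as follows. All names and the parameterization are illustrative assumptions; the point of the sketch is that working in the log domain replaces the exponential term and repeated double-precision multiplications with additions, as noted above.

```cpp
#include <cmath>

// Log of the Gaussian density: -0.5*z^2 - log(sigma) - 0.5*log(2*pi).
// Avoids evaluating the exponential when only a log ratio is needed.
double logGaussian(double i, double mu, double sigma)
{
    const double pi = 3.14159265358979323846;
    double z = (i - mu) / sigma;
    return -0.5 * z * z - std::log(sigma) - 0.5 * std::log(2.0 * pi);
}

// Classify a pixel as a feature pixel when the joint (red x green) log ratio
//   log P(i/x,F) + log P(F/x) - log P(i/x,B) - log P(B/x)
// is positive. Parameter names (means/SDs per channel) are illustrative.
bool isFeaturePixel(double iRed, double iGreen,
                    double muFr, double sdFr, double muBr, double sdBr,
                    double muFg, double sdFg, double muBg, double sdBg,
                    double priorF, double priorB)
{
    double logRatio =
        logGaussian(iRed,   muFr, sdFr) - logGaussian(iRed,   muBr, sdBr) +
        logGaussian(iGreen, muFg, sdFg) - logGaussian(iGreen, muBg, sdBg) +
        std::log(priorF) - std::log(priorB);
    return logRatio > 0.0;
}
```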
The generation of feature masks by the above-described method provides a number of advantages during analysis of microarray data. First, feature-mask generation is a robust and accurate method that can be performed during feature extraction to identify the features within the ROIs. In addition, because the feature-mask generation method described above relies only on putative feature sizes and positions and the intensity data within an ROI intensity data set, feature masks for ROIs can be easily generated in advance of complex feature-extraction calculations in order to provide a preview of the detected features, including detected feature sizes and locations, to users prior to undertaking feature extraction.
In this subsection, a straightforward C++-like pseudocode implementation of one embodiment of the present invention is provided for illustration only. A commercial implementation is provided in Appendix A. This pseudocode implementation is provided for clarity and illustration, and employs programming constructs and methodologies that most simply illustrate the feature-mask generation method that represents one embodiment of the present invention. As with any software implementation, greater efficiencies may be achieved by employing less clear, but more efficient techniques, well known to ordinarily skilled software developers. For example, nested for-loop iteration of elements in two-dimensional arrays may be more efficiently handled by special assembly-language iterators. As another example, various techniques may be employed to optimize arithmetic calculations to avoid the expense of floating-point operations without sacrificing the accuracy of calculated results.
The described method that represents one embodiment of the present invention generates a Boolean-valued feature mask from a pixel-based ROI intensity data set, as described in the previous subsection. The ROI data set is represented, in the following pseudocode implementation, by an instance of the class “scannedROI” that follows:
The function-member implementations of the public methods of the above class “scannedROI” are not provided, as they are outside the scope of the present discussion. An instance of the class “scannedROI” can be accessed through the listed public function members on lines 5-12, above, that include: (1) “pixel,” a function member that returns the intensity of the pixel at two-dimensional-array coordinates “x,y”; (2) “width,” a function member that returns the width of the two-dimensional-array representation of the ROI, in pixels; (3) “height,” a function member that returns the height of the two-dimensional-array representation of the ROI, in pixels; (4) “total,” a function member that returns the total number of pixels in the ROI; (5) “getX,” a function member that returns the putative x-coordinate for the center of the feature within the ROI; (6) “getY,” a function member that returns the y coordinate for the putative center of the feature within the ROI; (7) “getRad,” a function member that returns the putative radius of the feature within the ROI; and (8) a constructor for the class “scannedROI.” The above-listed set of function members provides all the information needed by the feature-mask generation routine in order to prepare a Boolean-valued feature mask indicating the positions of the feature pixels within the ROI.
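The interface just described might be sketched as follows. The original class listing is not reproduced in this text, so the member names below follow the description above, while the internal data layout (a flat row-major intensity vector) and the constructor signature are assumptions for illustration.

```cpp
#include <utility>
#include <vector>

// Sketch of the "scannedROI" interface described in the text. Member names
// follow the description; the data layout and constructor are assumptions.
class scannedROI {
public:
    scannedROI(int w, int h, int px, int py, int prad, std::vector<int> data)
        : w_(w), h_(h), px_(px), py_(py), prad_(prad),
          data_(std::move(data)) {}

    int pixel(int x, int y) const { return data_[y * w_ + x]; } // intensity at (x, y)
    int width()  const { return w_; }       // ROI width, in pixels
    int height() const { return h_; }       // ROI height, in pixels
    int total()  const { return w_ * h_; }  // total number of pixels in the ROI
    int getX()   const { return px_; }      // putative feature-center x-coordinate
    int getY()   const { return py_; }      // putative feature-center y-coordinate
    int getRad() const { return prad_; }    // putative feature radius

private:
    int w_, h_, px_, py_, prad_;
    std::vector<int> data_;  // assumed flat row-major pixel-intensity storage
};
```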
Next, a number of integer constants are provided:
- const int FlipLim=26;
- const int IterationLim=10;
- const int MainLoopIterationLim=10;
- const int OutlierIterationLim=10;
These constants are representative threshold values that indicate convergence or that limit the number of iterations of loops within function members, as discussed below.
Next, a declaration for the class “featureFinder” that implements one embodiment of the present invention is provided:
The class “featureFinder” includes the following private data members: (1) “s,” a pointer to an output stream through which results are output to files; (2) “roi,” a pointer to a region of interest from which a feature mask is generated; (3) “fVar,” “bVar,” “fSD,” “bSD,” “fAv” and “bAv,” declared above on lines 7-9, which store the variances, standard deviations, and means for the intensity distributions of feature pixels and background pixels; (4) “curX,” “curY,” and “curRad,” declared above on line 10, that store a feature position and feature radius calculated from a generated feature mask; (5) “featureMask,” a two-dimensional Boolean-valued representation of the ROI, as described in a previous subsection; and (6) “tMask,” a second Boolean-valued representation of the ROI, used for storing intermediate results and for storing indications of intensity outliers. The class “featureFinder” includes the following private function members: (1) “KMeans,” declared above on line 14, a function member that performs an initial partitioning of ROI pixels into a set of feature pixels and a set of background pixels; (2) “holeFill,” declared above on line 15, a function member that performs a hole-filling operation on a generated feature mask; (3) “andNeighborhood,” declared above on line 16, a function member that calculates the Boolean AND of the feature-mask values corresponding to pixels adjoining a specified pixel along edges of the specified pixel; (4) “orNeighborhood,” a function member similar to the previously described function member that computes the Boolean OR of the Boolean values corresponding to neighboring pixels of the specified pixel; (5) “computeStatistics,” declared above on line 18, a function member that computes the statistical values stored in the data members declared on lines 7-9, above; (6) “initialAcceptance,” a function member that returns a Boolean value indicating whether or not an initially generated feature mask represents an acceptable first 
pass at identifying feature pixels within the ROI; (7) “alternateInitialClassification,” a function member that performs an alternative initial partitioning of pixels into feature pixels and background pixels; (8) “rejectOutliers,” a function member that identifies and removes intensity outliers from consideration in subsequent calculations; (9) “BayesianClassify,” declared above on line 22, a function member that carries out an iterative Bayesian-probability-based refinement of an initially generated feature mask; and (10) “findCentroid,” a function member that determines the coordinates of the center of the feature based on the generated feature mask.
The public function members for the class “featureFinder” include: (1) “newROI,” declared above on line 26, which receives a pointer to an instance of the class “scannedROI” representing a pixel-based ROI intensity data set from which a feature mask can be prepared; (2) “generateFeatureMask,” declared above on line 27, a function member that generates a feature mask, as described in the previous subsection, and returns the calculated location and radius of the feature identified in the feature mask via the variable parameters “x,” “y,” and “rad”; and (3) a constructor.
An implementation of the private function member KMeans is next provided:
The function member “KMeans” iteratively partitions pixels into a set of feature pixels and a set of background pixels. First, in the nested for-loops on lines 10-18, the pixels of the ROI are scanned to determine the highest and lowest intensity values within the ROI. The highest intensity value is stored in the variable “high,” while the lowest intensity value detected in the ROI is stored in the variable “low.” Then, on line 19, the variable “pivot” is initialized to the intensity value intermediate between the highest and lowest intensity values detected within the ROI. Next, in the for-loop of lines 20-53, the pixels are iteratively partitioned until either a number of iterations equal to the constant “IterationLim” has been executed, or the number of pixels reclassified from feature to background and from background to feature falls below the threshold value “FlipLim,” as detected on line 51. In the nested for-loops of lines 28-50, the intensity of each pixel in the ROI is compared to the current value of the variable “pivot,” on line 33. If the intensity value of the pixel is greater than or equal to the current value of the variable “pivot,” the pixel is classified as a feature pixel, and is otherwise classified as a background pixel. If, in the current iteration of the for-loop of lines 20-53, the classification of a pixel changes, then the value stored in the variable “flips” is incremented, on line 37. In addition, a running count of the number of feature pixels and background pixels, and a running sum of the intensities of the feature pixels and background pixels, are maintained in the variables “high,” “low,” “sumHigh,” and “sumLow,” updated on lines 39-48. 
If convergence is not reached in the current iteration of the for-loop of lines 20-53, as detected on line 51, then a new value for the variable “pivot” is calculated on line 52 as an intensity value intermediate between the mean intensity of the feature pixels and the mean intensity of the background pixels.
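The iterative intensity partition described above can be sketched as follows. This is an illustrative reconstruction, not the omitted pseudocode listing: the function name "kMeansPartition" and the simplified signature, which operates on a flat vector of intensities rather than an instance of "scannedROI," are assumptions; the constants follow the text.

```cpp
#include <vector>
#include <cstddef>

const int FlipLim = 26;
const int IterationLim = 10;

// Illustrative sketch: partitions pixel intensities around an
// iteratively refined pivot; mask[i] == true marks a feature pixel.
void kMeansPartition(const std::vector<double>& intensity,
                     std::vector<bool>& mask)
{
    double low = intensity[0], high = intensity[0];
    for (double v : intensity) {            // find extreme intensities
        if (v < low)  low = v;
        if (v > high) high = v;
    }
    double pivot = (low + high) / 2.0;      // initial pivot: midpoint
    mask.assign(intensity.size(), false);
    for (int iter = 0; iter < IterationLim; ++iter) {
        int flips = 0, nHigh = 0, nLow = 0;
        double sumHigh = 0.0, sumLow = 0.0;
        for (std::size_t i = 0; i < intensity.size(); ++i) {
            bool isFeature = intensity[i] >= pivot;
            if (isFeature != mask[i]) ++flips;   // count reclassifications
            mask[i] = isFeature;
            if (isFeature) { ++nHigh; sumHigh += intensity[i]; }
            else           { ++nLow;  sumLow  += intensity[i]; }
        }
        if (flips < FlipLim || nHigh == 0 || nLow == 0) break;
        // new pivot: midway between the two class means
        pivot = (sumHigh / nHigh + sumLow / nLow) / 2.0;
    }
}
```

The pivot update, midway between the two class means, is the one-dimensional, two-cluster K-means update the text describes.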
Next, an implementation of the function member “andNeighborhood” is provided. A detailed annotation of this implementation is not provided, as the implementation is quite straightforward. The function member “andNeighborhood” computes the Boolean AND of the Boolean values corresponding to pixels adjacent to the pixel specified by the input arguments “x” and “y.” The input parameter “t” specifies whether the Boolean AND operation is conducted on the two-dimensional Boolean array “featureMask” or on the two-dimensional Boolean array “tMask.” An implementation for the similar function member “orNeighborhood” is not provided, as that implementation is almost identical to the implementation of the function member “andNeighborhood,” with the exception that all “&&” operators are replaced with “||” operators. It should also be noted that more efficient implementations of “andNeighborhood” and “orNeighborhood” are possible, in which the treatment of special-case corner and edge pixels in the implementation provided below can be avoided.
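Under the same caveat, the neighborhood operations might look like the following sketch, here written over a two-dimensional Boolean vector rather than the fixed member arrays of the class "featureFinder" (an assumption); edge and corner pixels are handled by skipping out-of-range neighbors rather than by separate special cases.

```cpp
#include <vector>

typedef std::vector<std::vector<bool> > Mask;

// AND of the edge-adjacent (4-connected) neighbors of (x, y)
bool andNeighborhood(const Mask& m, int x, int y)
{
    int h = (int)m.size(), w = (int)m[0].size();
    bool r = true;
    if (x > 0)     r = r && m[y][x - 1];
    if (x < w - 1) r = r && m[y][x + 1];
    if (y > 0)     r = r && m[y - 1][x];
    if (y < h - 1) r = r && m[y + 1][x];
    return r;
}

// OR variant: identical except that && becomes ||, as noted in the text
bool orNeighborhood(const Mask& m, int x, int y)
{
    int h = (int)m.size(), w = (int)m[0].size();
    bool r = false;
    if (x > 0)     r = r || m[y][x - 1];
    if (x < w - 1) r = r || m[y][x + 1];
    if (y > 0)     r = r || m[y - 1][x];
    if (y < h - 1) r = r || m[y + 1][x];
    return r;
}
```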
The function member “holeFill” successively applies a dilation operation, in the nested for-loops of lines 5-11, an erosion operation, in the nested for-loops of lines 12-18, and a final hole-filling operation, in the nested for-loops of lines 19-28. The results of the dilation operation are placed into the two-dimensional Boolean array “tMask,” on line 9, and are then used in the erosion operation, the results of which are, in turn, migrated back to the two-dimensional array “featureMask.” The dilation operation tends to flip Boolean values “0” to Boolean values “1” for pixels in the neighborhood of feature pixels, the erosion operation expands areas of Boolean value “0” within the feature mask, and the hole-filling operation detects and removes feature pixels surrounded by background pixels and background pixels surrounded by feature pixels.
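The three-stage sequence can be sketched as below. The helper names "anyNbr" and "allNbr" and the in-place final pass are illustrative assumptions over 4-connected neighborhoods, not the patent's listing.

```cpp
#include <vector>

typedef std::vector<std::vector<bool> > Mask;

// OR (any) and AND (all) of the 4-connected neighbors of (x, y)
static bool anyNbr(const Mask& m, int x, int y)
{
    int h = (int)m.size(), w = (int)m[0].size();
    return (x > 0 && m[y][x - 1]) || (x < w - 1 && m[y][x + 1]) ||
           (y > 0 && m[y - 1][x]) || (y < h - 1 && m[y + 1][x]);
}

static bool allNbr(const Mask& m, int x, int y)
{
    int h = (int)m.size(), w = (int)m[0].size();
    return (x == 0 || m[y][x - 1]) && (x == w - 1 || m[y][x + 1]) &&
           (y == 0 || m[y - 1][x]) && (y == h - 1 || m[y + 1][x]);
}

void holeFill(Mask& featureMask)
{
    int h = (int)featureMask.size(), w = (int)featureMask[0].size();
    Mask t = featureMask, e = featureMask;
    for (int y = 0; y < h; ++y)            // dilation: grow feature areas
        for (int x = 0; x < w; ++x)
            t[y][x] = featureMask[y][x] || anyNbr(featureMask, x, y);
    for (int y = 0; y < h; ++y)            // erosion: shrink them back
        for (int x = 0; x < w; ++x)
            e[y][x] = t[y][x] && allNbr(t, x, y);
    for (int y = 0; y < h; ++y)            // remove surrounded holes/specks
        for (int x = 0; x < w; ++x) {
            if (e[y][x] && !anyNbr(e, x, y))      e[y][x] = false;
            else if (!e[y][x] && allNbr(e, x, y)) e[y][x] = true;
        }
    featureMask = e;
}
```

Dilation followed by erosion (a morphological closing) smooths the mask boundary, and the final pass flips isolated pixels of either class.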
Next, an implementation for the function member “initialAcceptance” is provided:
The function member “initialAcceptance” determines a theoretical, area-based radius by taking the square root of the number of feature pixels divided by π, on line 16. If the difference between the area-based radius and the putative feature radius reported by the instance of the class “scannedROI” is less than 25 percent of the putative feature radius, then the function member “initialAcceptance” returns the Boolean value “TRUE,” indicating that the initially generated feature mask is acceptable. Otherwise, the Boolean value “FALSE” is returned. Note that, in alternative embodiments, the threshold for acceptability may differ from 25 percent of the putative feature radius.
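The acceptance test reduces to a few lines. This sketch assumes the feature-pixel count and putative radius are passed in directly; the actual member reads them from "featureMask" and "roi."

```cpp
#include <cmath>

// Area-based acceptance: the radius implied by the feature-pixel count,
// r = sqrt(n / pi), must lie within 25% of the putative radius.
bool initialAcceptance(int numFeaturePixels, double putativeRadius)
{
    double areaRadius = std::sqrt(numFeaturePixels / 3.14159265358979);
    return std::fabs(areaRadius - putativeRadius) < 0.25 * putativeRadius;
}
```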
Next, an implementation of the function member “alternateInitialClassification” is provided below:
The function member “alternateInitialClassification” simply generates an initial feature mask with Boolean-value “1” values corresponding to the pixels within the putative feature as described by the feature location and feature radius reported by an instance of the class “scannedROI.” This is accomplished by considering each pixel, in the nested for-loops of lines 6-16, and determining whether the distance of the pixel from the putative feature center is less than or equal to the putative radius of the feature, as reported by the instance of the class “scannedROI.”
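A sketch of this circular initialization, with the ROI dimensions and putative geometry passed as plain parameters (an assumption; the member reads them from "roi"):

```cpp
#include <vector>

typedef std::vector<std::vector<bool> > Mask;

// Mark as feature pixels all pixels within the putative radius
// of the putative center (cx, cy).
Mask alternateInitialClassification(int width, int height,
                                    double cx, double cy, double rad)
{
    Mask m(height, std::vector<bool>(width, false));
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            double dx = x - cx, dy = y - cy;
            m[y][x] = (dx * dx + dy * dy) <= rad * rad;  // inside circle
        }
    return m;
}
```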
Next, an implementation of the function member “computeStatistics” is provided below:
The function member “computeStatistics” traverses all of the pixels within the ROI in the nested for-loops of lines 11-31, accumulating a sum of feature-pixel intensities, a sum of the squares of feature-pixel intensities, and a number of feature pixels, and similar quantities for background pixels. From these values, the means, variances, and standard deviations for the feature pixels and background pixels are calculated on lines 32-37.
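The single-pass accumulation can be sketched for one class of pixels as follows. The struct name "ClassStats" is an assumption; the member computes these quantities for feature pixels and background pixels simultaneously in one traversal.

```cpp
#include <vector>
#include <cmath>

struct ClassStats { double mean, var, sd; };

// One pass accumulates the sum and sum of squares; mean, variance,
// and standard deviation are then derived from those accumulators.
ClassStats computeStats(const std::vector<double>& v)
{
    double sum = 0.0, sumSq = 0.0;
    for (double x : v) { sum += x; sumSq += x * x; }
    ClassStats s;
    double n = (double)v.size();
    s.mean = sum / n;
    s.var  = sumSq / n - s.mean * s.mean;   // population variance
    s.sd   = std::sqrt(s.var);
    return s;
}
```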
Next, an implementation for the function member “rejectOutliers” is provided, below:
The function member “rejectOutliers” iteratively identifies intensity outliers in the while-loop of lines 14-49. Initially, the outlier rejection is accepted if fewer than 25 percent of the total number of pixels are rejected as intensity outliers. However, if a larger percentage of pixels is rejected, then outlier rejection is performed again with widened intensity criteria for non-outlying pixels. Note that the two-dimensional Boolean array “tMask” is used to indicate intensity outliers with the Boolean value “1.” The range of acceptable intensities for feature pixels is stored in the two variables “initial_fLow” and “initial_fHigh,” and the range of acceptable background intensities is stored in the variables “initial_bLow” and “initial_bHigh.” Thus, in the nested for-loops of lines 19-43, all pixels are considered, and those pixels with intensities lying outside the acceptable range are flagged as intensity outliers. If, following consideration of all pixels, more than 25 percent of the total pixels are classified as outliers, the ranges of acceptable intensities are broadened on lines 44-47.
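The widening loop can be sketched as follows for a single class of pixels. The initial band of two standard deviations about the class mean and the widening factor of 1.5 are illustrative assumptions; the text specifies only the 25-percent cutoff and an iteration limit.

```cpp
#include <vector>
#include <cstddef>

// Flag pixels outside an acceptable intensity band; if more than 25%
// are flagged, widen the band and repeat (up to an iteration limit).
std::vector<bool> rejectOutliers(const std::vector<double>& intensity,
                                 double mean, double sd)
{
    std::vector<bool> outlier(intensity.size(), false);
    double k = 2.0;                         // band half-width, in SDs (assumed)
    for (int iter = 0; iter < 10; ++iter) { // OutlierIterationLim analogue
        std::size_t nOut = 0;
        for (std::size_t i = 0; i < intensity.size(); ++i) {
            outlier[i] = intensity[i] < mean - k * sd ||
                         intensity[i] > mean + k * sd;
            if (outlier[i]) ++nOut;
        }
        if (nOut * 4 <= intensity.size()) break;  // <= 25% rejected: accept
        k *= 1.5;                                 // widen acceptable range
    }
    return outlier;
}
```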
Next, an implementation of the function member “BayesianClassify” is provided, below:
The function member “BayesianClassify” computes, for each pixel, the Bayesian probability of the pixel being a feature pixel and the Bayesian probability of the pixel being a background pixel. When the ratio of the Bayesian probability of a pixel being a feature pixel to the Bayesian probability of the pixel being a background pixel is greater than or equal to one, the pixel is classified as a feature pixel. The detailed mathematical models for the Bayesian probability calculation are discussed in the previous subsection. In the nested for-loops of lines 15-74, the Bayesian probabilities for each non-outlying pixel are calculated, and the pixel is classified according to those calculated probabilities. First, the probability of the observed intensity of the pixel, given the location of the pixel and the fact that the pixel is a feature pixel, is calculated on lines 22-24. A corresponding probability of the observed intensity of the pixel, given the location of the pixel and the fact that the pixel is a background pixel, is calculated on lines 25-27. Then, the probabilities of the pixel being a background pixel, given its location, and of the pixel being a feature pixel, given its location, are calculated on lines 33-51. The logarithms of the probabilities, rather than the probabilities themselves, are calculated and maintained by the function member “BayesianClassify,” because logarithmic values are more easily and efficiently manipulated. Finally, if the pixel was originally classified as a feature pixel, as determined on line 53, then, if the ratio of the probability of the pixel being a feature pixel to the probability of the pixel being a background pixel is greater than or equal to one or, in other words, the log ratio is greater than or equal to zero, the pixel remains classified as a feature pixel. 
Otherwise, the pixel is reclassified as a background pixel, and the variable “numFlipped” is incremented in order to maintain a count of the total number of pixels reclassified by the function member “BayesianClassify.” Similarly, if the pixel was initially classified as a background pixel, the classification is flipped if the log ratio is greater than or equal to zero. Finally, on line 72, the function member “BayesianClassify” returns a Boolean value indicating whether or not the number of pixels reclassified is less than one percent of the total number of pixels; if so, the Boolean value “TRUE” is returned. In alternative embodiments, a different convergence criterion may be used.
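The per-pixel decision can be sketched in log space as follows. The Gaussian intensity models and the simple inside/outside location prior (0.9 within the putative radius, 0.1 outside) are stand-in assumptions for the mathematical models of the previous subsection; only the log-ratio decision rule itself is taken from the text.

```cpp
#include <cmath>

// log of a Gaussian density N(mean, sd^2) at intensity i
double logGauss(double i, double mean, double sd)
{
    double z = (i - mean) / sd;
    return -0.5 * z * z - std::log(sd)
           - 0.5 * std::log(2.0 * 3.14159265358979);
}

// true -> classify as feature pixel
bool bayesianClassify(double i,           // observed intensity
                      double dist,        // distance from putative center
                      double rad,         // putative feature radius
                      double fAv, double fSD,   // feature mean, SD
                      double bAv, double bSD)   // background mean, SD
{
    // assumed location prior: feature likelier inside the putative radius
    double pF = dist <= rad ? 0.9 : 0.1;
    double logF = logGauss(i, fAv, fSD) + std::log(pF);
    double logB = logGauss(i, bAv, bSD) + std::log(1.0 - pF);
    return logF - logB >= 0.0;            // log ratio >= 0 -> feature
}
```

Working in logarithms turns products of densities and priors into sums and avoids floating-point underflow, which is the efficiency point the text makes.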
Next, an implementation of the function member “findCentroid” is provided, below:
The function member “findCentroid” computes the first moment of intensity for the pixels identified as feature pixels with respect to the x coordinate axis and the y coordinate axis in order to determine the x,y coordinates of the center of the feature. The radius of the feature is calculated based on the total area of the feature mask represented by the Boolean value “1.”
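A sketch of the moment computation; the signature, passing the mask and intensities directly, is an assumption.

```cpp
#include <vector>
#include <cmath>

typedef std::vector<std::vector<bool> > Mask;

// Intensity-weighted first moments over feature pixels give the center;
// the radius follows from the masked area, r = sqrt(area / pi).
void findCentroid(const Mask& mask,
                  const std::vector<std::vector<double> >& intensity,
                  double& cx, double& cy, double& rad)
{
    double wSum = 0.0, xMom = 0.0, yMom = 0.0;
    int area = 0;
    for (int y = 0; y < (int)mask.size(); ++y)
        for (int x = 0; x < (int)mask[y].size(); ++x)
            if (mask[y][x]) {
                double w = intensity[y][x];
                wSum += w; xMom += w * x; yMom += w * y;
                ++area;
            }
    cx = xMom / wSum;                     // first moment / total intensity
    cy = yMom / wSum;
    rad = std::sqrt(area / 3.14159265358979);
}
```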
A simple implementation of the function member “newROI” is provided, below, without additional annotation:
Next, an implementation of the function member “generateFeatureMask” is provided, below:
The function member “generateFeatureMask” closely corresponds to the control-flow diagrams discussed in the previous subsection. First, on line 6, the function member “KMeans” is called to initially classify pixels as being either feature pixels or background pixels. Next, the function member “holeFill” is called, on line 7, in order to fill holes within the initially generated feature mask. If the initially generated feature mask is not acceptable, as determined by calling the function member “initialAcceptance” on line 8, then an alternative initial feature mask is generated by calling the function member “alternateInitialClassification” on line 9. In the while-loop of lines 10-17, the function member “generateFeatureMask” iteratively computes statistics for feature and background pixels, on line 12, rejects intensity outliers, on line 13, re-computes statistics, on line 14, and reclassifies pixels based on Bayesian probabilities calculated for each pixel, on line 15. The while-loop terminates either when the repartitioning of pixels by the function member “BayesianClassify” indicates convergence of the reclassification, or when the maximum number of iterations, represented by the constant “MainLoopIterationLim,” has been executed. Next, on line 18, the function member “holeFill” is called in order to fill any holes arising from the reclassification of pixels. Then, on line 19, the function member “findCentroid” is called in order to compute the centroid and radius of the feature as represented by the generated feature mask. When the computed location and size of the feature are close to the putative feature location and size, as determined on lines 20-30, the calculated feature position and size are reported back to the caller of the function member “generateFeatureMask” via the variable arguments.
Finally, for completeness, a constructor for the class “featureFinder” is provided without further annotation, below:
A simple, example routine in which a feature mask is generated from a pixel-based ROI data intensity set is next provided, without further annotation:
Although the present invention has been described in terms of a particular embodiment, it is not intended that the invention be limited to this embodiment. Modifications within the spirit of the invention will be apparent to those skilled in the art. For example, an almost limitless number of software implementations of the feature-mask-generation method can be crafted by ordinarily skilled software developers using different modularizations, control structures, data structures, programming languages, and various other characteristics and techniques. In the described embodiment, various limits on the number of loop iterations, along with constants that specify convergence criteria, are employed; in alternative embodiments, the constants may have different values, or entirely different means for constraining the overall calculation may be employed. As discussed above, generation of feature masks by the method of the present invention may be employed within feature-extraction packages both to provide a preview mode, in which users can interactively review and change feature positions and sizes prior to the time-consuming and computationally expensive feature-extraction steps, and to accurately identify feature sizes and locations within scanned images during the feature-extraction steps. In the described embodiment, the method of the present invention is applied to a single pixel-based ROI intensity data set, but the method can be expanded for employment on multi-signal intensity data sets, including one or more regions of interest. The classification may be stored in Boolean arrays or bit-map arrays, or in alternative data structures.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. The foregoing descriptions of specific embodiments of the present invention are presented for purpose of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously many modifications and variations are possible in view of the above teachings. The embodiments are shown and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents:
Appendix A The following is an alternative implementation of the present invention. The following code is neither annotated nor further described, but is readily understood by ordinarily skilled software developers, especially given the thorough description of the invention in previous subsections. The following code contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Claims
1. A method for classifying pixels of a microarray image with observed intensities within a region of interest, the method comprising:
- initially classifying pixels in the region of interest as either feature pixels or background pixels based on the intensities of the pixels; and
- iteratively computing, for pixels within the region of interest, probabilities that the pixels are feature pixels and probabilities that the pixels are background pixels, based on pixel locations and intensities, and accordingly classifying the pixels as either feature pixels or background pixels.
2. The method of claim 1 wherein a feature-pixel and background-pixel classification is stored in a feature mask.
3. The method of claim 2 wherein the feature mask includes binary values corresponding to pixels in the region of interest, a first binary value indicating that a corresponding pixel is a feature pixel and a second binary value indicating that a corresponding pixel is a background pixel.
4. The method of claim 1 wherein classifying pixels in the region of interest as either feature pixels or background pixels based on the observed intensities of the pixels further includes:
- determining a high pixel intensity and a low pixel intensity for the region of interest;
- determining an intermediate point between the high pixel intensity and the low pixel intensity;
- classifying pixels with observed pixel intensities greater than or equal to the intermediate point as feature pixels and classifying pixels with observed pixel intensities less than the intermediate point as background pixels; and
- iteratively reclassifying pixels based on an intermediate intensity between the mean intensity of feature pixels and the mean intensity of background pixels.
5. The method of claim 1 further including identifying hole pixels that are feature pixels surrounded by background pixels and background pixels surrounded by feature pixels and reclassifying hole pixels in order to increase the continuity of feature-pixel and background-pixel classification with respect to location within the region of interest.
6. The method of claim 1 wherein iteratively computing, for pixels within the region of interest, probabilities that the pixels are feature pixels and probabilities that the pixels are background pixels, based on pixel locations and intensities, and accordingly classifying the pixels as either feature pixels or background pixels further includes:
- iteratively computing intensity-based outlier feature-pixel statistics and outlier background-pixel statistics; from the most recently computed intensity-based feature-pixel statistics and background-pixel statistics, determining, for each pixel, a Bayesian posterior probability P(F/i,x) that the pixel is a feature pixel and a Bayesian posterior probability P(B/i,x) that the pixel is a background pixel and classifying the pixel as a feature pixel when P(F/i,x)>=P(B/i,x);
- until either a maximum number of iterations are performed or until fewer than a threshold number of pixels are reclassified from feature-pixel to background-pixel and from background-pixel to feature-pixel status in the most recently executed iteration.
7. The method of claim 6
- wherein the Bayesian posterior probability P(F/i,x) is calculated as
- P(F/i,x) = P(F,i,x)/P(i,x) = P(i/x,F)P(F,x)/P(i,x) = P(i/x,F)P(F/x)P(x)/P(i,x);
- wherein the Bayesian posterior probability P(B/i,x) is calculated as
- P(B/i,x) = P(B,i,x)/P(i,x) = P(i/x,B)P(B,x)/P(i,x) = P(i/x,B)P(B/x)P(x)/P(i,x);
- and wherein a pixel is classified as a feature pixel when
- P(F/i,x)/P(B/i,x) >= 1.
8. The method of claim 7 wherein Bayesian posterior probabilities P(F/i,x) and P(B/i,x) are calculated for each channel of a two-channel microarray, and a joint probability distribution function for two channels is then computed, which is then used for classifying pixels as feature pixels or background pixels.
9. Computer instructions encoded in a computer-readable medium that implement the method of claim 1.
10. A data structure containing a feature-pixel and background-pixel classification carried out by the method of claim 1 stored in a computer-readable medium.
11. A feature extraction program that includes a feature-location-and-size determination step that includes the method for classifying pixels with observed intensities within a region of interest of claim 1.
12. Data produced by the feature extraction program of claim 11, stored in a printed medium or a computer readable medium, or encoded in electromagnetic signals, and transferred to a remote location.
13. Data produced by the feature extraction program of claim 11, stored in a printed medium or a computer readable medium, or encoded in electromagnetic signals, and received from a remote location.
14. A feature-extraction system comprising:
- a means for receiving and storing a scanned image of a microarray;
- a gridding means for determining putative feature positions and sizes within the scanned image of the microarray;
- feature-mask-generating logic that classifies pixels as feature-pixels and background-pixels based on pixel locations and intensities;
- preview-mode display logic that displays feature positions and sizes obtained from the generated feature mask, solicits feedback from a user, and corrects the feature positions and sizes; and
- a feature extraction module that extracts signal data from the scanned image of the microarray following user acceptance of initial feature locations and sizes displayed in preview mode.
15. The feature-extraction system of claim 14 wherein the feature-mask-generating logic classifies pixels as feature-pixels and background-pixels based on pixel locations and intensities by:
- initially classifying pixels in a region of interest as either feature pixels or background pixels based on the intensities of the pixels; and
- iteratively computing, for pixels within the region of interest, probabilities that the pixels are feature pixels and probabilities that the pixels are background pixels, based on pixel locations and intensities, and accordingly classifying the pixels as either feature pixels or background pixels.
16. The feature-extraction system of claim 15 wherein a feature-pixel and background-pixel classification is stored in a feature mask.
17. The feature-extraction system of claim 15 wherein classifying pixels in the region of interest as either feature pixels or background pixels based on the observed intensities of the pixels further includes:
- determining a high pixel intensity and a low pixel intensity for the region of interest;
- determining an intermediate point between the high pixel intensity and the low pixel intensity;
- classifying pixels with observed pixel intensities greater than or equal to the intermediate point as feature pixels and classifying pixels with observed pixel intensities less than the intermediate point as background pixels; and
- iteratively reclassifying pixels based on an intermediate intensity between the mean intensity of feature pixels and the mean intensity of background pixels.
18. The feature-extraction system of claim 15 wherein iteratively computing, for pixels within the region of interest, probabilities that the pixels are feature pixels and probabilities that the pixels are background pixels, based on pixel locations and intensities, and accordingly classifying the pixels as either feature pixels or background pixels further includes:
- iteratively computing intensity-based outlier feature-pixel statistics and outlier background-pixel statistics; from the most recently computed intensity-based feature-pixel statistics and background-pixel statistics, determining, for each pixel, a Bayesian posterior probability P(F/i,x) that the pixel is a feature pixel and a Bayesian posterior probability P(B/i,x) that the pixel is a background pixel and classifying the pixel as a feature pixel when P(F/i,x) >=P(B/i,x);
- until either a maximum number of iterations are performed or until fewer than a threshold number of pixels are reclassified from feature-pixel to background-pixel and from background-pixel to feature-pixel status in the most recently executed iteration.
19. The feature-extraction system of claim 18
- wherein the Bayesian posterior probability P(F/i,x) is calculated as
- P(F/i,x) = P(F,i,x)/P(i,x) = P(i/x,F)P(F,x)/P(i,x) = P(i/x,F)P(F/x)P(x)/P(i,x);
- wherein the Bayesian posterior probability P(B/i,x) is calculated as
- P(B/i,x) = P(B,i,x)/P(i,x) = P(i/x,B)P(B,x)/P(i,x) = P(i/x,B)P(B/x)P(x)/P(i,x);
- and wherein a pixel is classified as a feature pixel when
- P(F/i,x)/P(B/i,x) >= 1.
20. The feature-extraction system of claim 19 wherein Bayesian posterior probabilities P(F/i,x) and P(B/i,x) are calculated for each channel of a two-channel microarray, and a joint probability distribution function for two channels is then computed, which is then used for classifying pixels as feature pixels or background pixels.
Type: Application
Filed: Jan 22, 2004
Publication Date: Apr 20, 2006
Inventors: Jayati Ghosh (San Jose, CA), Xiangyang Zhou (Mountain View, CA)
Application Number: 10/763,645
International Classification: G06K 9/34 (20060101); G06K 9/00 (20060101);