System and method for computer-designing optimum oligo-nucleic acid sequences from nucleic acid base sequences, and oligo-nucleic acid array mounted with the designed oligo-nucleic acid sequences
A computer software program and a method are provided to design an optimum oligo-nucleic acid sequence candidate from a nucleic acid base sequence. The program comprises a first command for receiving and storing a specified tolerance of double-chain bond temperature; a second command for computing the double-chain bond temperature of a partial sequence at each length, as extending the length; and a third command for determining whether or not the double-chain bond temperature computed by the second command falls within the tolerance specified by the first command, and in case it falls within the range, for outputting the partial sequence with the length as an oligo-nucleic acid sequence candidate. Based on the inputted tolerance of double-chain bond temperature, oligo-nucleic acid sequences with various lengths that meet the double-chain bond temperature condition can be obtained. An oligo-nucleic acid array mounted with the designed oligo-nucleic acid sequences is also provided.
[0001] This application claims priority under 35 U.S.C. 119 based upon Japanese Patent Application Serial No. 2001-225181, filed on Jun. 20, 2001. The entire disclosure of the aforesaid application is incorporated herein by reference.
BACKGROUND OF THE INVENTION
[0002] The present invention generally relates to a computer software program and a method to design optimum oligo-nucleic acid sequence candidates from nucleic acid base sequences.
[0003] To analyze the expression in cells of a gene that is the object of an experiment, an element called DNA chip is generally used. This DNA chip is formed by arranging on a glass or silicon substrate DNA fragments and/or RNA fragments having thousands to tens of thousands of different pieces of base sequence information.
[0004] The nucleic acid sequence of a plurality of DNA fragments and/or RNA fragments arranged on a DNA chip is called a capture, and is appropriately arranged so that binding, i.e., hybridization will occur with the specific gene, which is the object of the experiment. With this type of DNA chips, for instance, when a healthy cell has turned to a sick cell, it is possible to find the expressed gene causing the illness by examining which gene in this cell has hybridization.
[0005] The nucleic acid sequence of the DNA fragments used as a capture is generally selected from a library. A library is an aggregate of DNA samples or an aggregate of cDNA samples prepared by cloning fragments of genes obtained from a cell or the like. Note that cDNA (complementary DNA) means the bases of DNA sequences that can be combined with all bases of the messenger RNA, i.e., a synthesized DNA that is complementary to the messenger RNA.
[0006] However, it is difficult in terms of time, cost and technology for researchers to obtain actual samples that can be a capture, since they would have to obtain existing DNA fragments from cells. Therefore, researchers have recently begun using a method wherein an oligo-base sequence with a length of approximately several tens of bases is determined by use of the sequence information on the genome whose sequence information has already been read out, or the sequence information called EST (Expressed Sequence Tag) that identifies the sequence information of the poly A sequence end portion of the messenger RNA. (Poly A is a sequence present in the RNA end portion in the form of ---AAAAOH.) The oligo-base sequence so determined is then chemically synthesized and mounted on a substrate. Note that an oligo-nucleic acid means a nucleic acid having a relatively short base sequence (e.g., approximately 200 base pairs).
[0007] In the past, to determine an appropriate oligo-nucleic acid sequence, researchers partially extracted sequences from a library or from the gene used as the object of an experiment, compared these sequences through visual inspection, and searched for the similarities and differences present in the sequences. However, in recent years, DNA chips and DNA arrays have reached higher levels of integration, meaning that more nucleic acid fragments are integrated on them. Searches through visual inspection are no longer realistic. Thus, computers are now commonly used to determine the base sequences of the nucleic acid fragments to be arranged on a substrate.
[0008] An example of a conventional technology to realize this method is shown in WO94/11837 that disclosed an oligo-probe design station, which can design common probes and specific probes through computer processing by use of the data in gene sequence data sources.
[0009] However, this type of conventional computer-processing technology simply computes and provides hybridization strength modeling, upon which a user selects an appropriate probe. This type of technology is not capable of improving, for instance, the accuracy of the bond temperature of the probe.
[0010] When many different probes are designed for DNA chips or for other purposes, all of these probes must have the same double-chain bond temperature. The condition for the double-chain bond temperature is given by the Tm value. Note here that the Tm value is the temperature at which 50 percent of the double-chain bonds are present, and it is determined by the GC content, among other factors. However, the GC content varies depending on the base sequence and its length. Therefore, it is difficult to determine a sequence that has the specific sequence within the base length and, at the same time, meets the synthesis condition and the temperature condition.
[0011] According to the technology disclosed in the aforementioned WO94/11837, the strength of hybridization between oligo-nucleic acid sequence candidates and the specific gene is obtained based on the double-chain bond temperature; and the information is presented to a user so that the user can select the probe that meets the optimum temperature condition. However, since the oligo-nucleic acid sequence candidates used in this technology are determined by neglecting the double-chain bond temperature condition, the above process cannot decrease the variance of the double-chain bond temperatures of the oligo-nucleic acid sequence candidates. Thus, if we try to obtain a large number of probes from the oligo-nucleic acid sequence candidates, the variance will be significantly large. According to the analyses made by the inventors, the error range of double-chain bond temperatures of the oligo-nucleic acid base sequences obtained in the prior art will be as large as ±20 degrees. On the other hand, if we try to decrease this error range, we will end up with insufficient number of oligo-nucleic acid base sequences.
[0012] Another application that requires determination of oligo-nucleic acid sequences is the design of probes that provide a gene amplification means in the PCR (Polymerase Chain Reaction) method, among others. In the PCR method, searching for a specific base sequence part and amplifying it requires the design of suitable probe base sequences, several tens of bases long, for the initial positions at both ends of the amplified part. As in the case of designing the base sequence for a capture, specific sequence parts must be designed so as to avoid double-chain bonds outside of the applicable part. Furthermore, all the double-chain bond temperatures must be under the same temperature condition.
[0013] For the purpose above, the designed probe must be a specific sequence that amplifies only the desired part of the applicable gene or of the intermingled nucleic acids. In the case where a plurality of sequences need to be concurrently amplified, it is necessary that each sequence is an appropriate sequence for the desired bonded part and it also meets the double-chain bond temperature condition. The aforementioned patent disclosed a technology related to probe designs in the PCR method as well. However, because of the aforementioned reasons, it does not offer a solution that meets the appropriate double-chain bond temperature condition.
[0014] As mentioned above, according to the prior art, it is difficult to determine a sequence, which has the specific sequence within the base length and also meets the synthesis condition as well as the appropriate temperature condition. The present invention was made considering the situation above. The object is to offer a computer program and a method, which can concurrently determine many oligo-nucleic acid sequences having a high level of accuracy in the value of double-chain bond temperature or the melting temperature (Tm).
[0015] A more detailed object of this invention is to offer a computer program and a method, wherein the desired tolerance for the melting temperature is specified, and the oligo-nucleic acid sequences that meet the temperature condition can be determined.
BRIEF SUMMARY OF THE INVENTION
[0016] To address the aforementioned issue, according to the first aspect of the present invention, provided is a computer software program to design an optimum oligo-nucleic acid sequence candidate from a nucleic acid base sequence. The program comprises a first command for receiving and storing the specified tolerance of double-chain bond temperature; a second command for computing, as extending the length of a partial sequence in the nucleic acid base sequence, the double-chain bond temperature at each length; and a third command for determining whether or not the double-chain bond temperature computed by the second command falls within the tolerance specified by the first command, and in case it falls within the range, for outputting the partial sequence with said length as an oligo-nucleic acid sequence candidate.
[0017] According to this configuration, based on the inputted tolerance of double-chain bond temperature, an oligo-nucleic acid sequence that meets the double-chain bond temperature condition can be obtained while varying the starting point and the length. In this manner, it will be possible to determine and output many oligo-nucleic acid sequences that meet the double-chain bond temperature condition.
[0018] According to a preferred embodiment of the present invention, the second command, within the nucleic-acid sequences being analyzed, successively shifts the starting point of the partial sequence of which the double-chain bond temperature is obtained, and extends the length of the partial sequence from the shifted starting point. In this case, the second command preferably shifts the starting point to the next starting point without extending the length further when, according to the third command, the double-chain bond temperature of the partial sequence has been determined to be outside of the tolerance.
[0019] According to another embodiment of the present invention, the program further comprises a fourth command for displaying a plurality of oligo-nucleic acid sequence candidates outputted by the third command.
[0020] According to another embodiment of the present invention, this program further comprises a fifth command for comparing homomeric genes among a plurality of nucleic acid sequences, and for identifying the sequence parts that are specific and those that are non-specific to each nucleic acid sequence. The second command successively shifts the starting point of the partial sequence of which the double-chain bond temperature is obtained, and extends the length of the partial sequence from the shifted starting point.
[0021] In this case, the second command preferably shifts the starting point to the next starting point without further extending the length when, according to the third command, the double-chain bond temperature of the partial sequence has been determined to be outside of the tolerance.
[0022] The second command further comprises a sixth command which, when the extended partial sequence contains said non-specific sequence part, determines whether to compute the double-chain bond temperature of said partial sequence or to set it as an oligo-nucleic acid sequence candidate. And the second command preferably shifts the starting point of said partial sequence to the next specific sequence part when, according to the sixth command, it is determined that no double-chain bond temperature is to be computed or that no oligo-nucleic acid sequence candidate is to be set. In this case, the sixth command determines whether to compute the double-chain bond temperature of said part or to set it as an oligo-nucleic acid sequence candidate based on the ratio between the specific sequence part and the non-specific sequence part or on the respective sequence numbers contained in said extended partial sequence.
[0023] According to another embodiment, the fourth command, when the partial sequence that falls within the tolerance of double-chain bond temperature contains said non-specific sequence part, outputs the partial sequence as a low-grade oligo-nucleic acid sequence candidate. Furthermore, the second command, when no partial sequence with a length that reaches the double-chain bond temperature can be obtained from each specific sequence part, preferably extends the length of the partial sequence to the non-specific sequence part, and allows the partial sequence that falls within the tolerance of double-chain bond temperature to be outputted as a low-grade oligo-nucleic acid sequence candidate.
[0024] According to another aspect of the present invention, provided is a method to design an optimum oligo-nucleic acid sequence candidate from a nucleic acid base sequence. The method comprises the steps of (a) specifying the tolerance of double-chain bond temperature; (b) computing the double-chain bond temperature of a partial sequence at each length as extending the length in said nucleic acid base sequence; and (c) determining whether or not the double-chain bond temperature computed by the computing means falls within the tolerance specified by the specifying means, and if it falls within the range, outputting the partial sequence with said length as an oligo-nucleic acid sequence candidate.
[0025] According to this configuration, a method can be implemented corresponding to the aforementioned first aspect.
[0026] According to the third principal aspect of the present invention, an oligo-nucleic acid array mounted with a plurality of oligo-nucleic acids is provided. The oligo-nucleic acids have sequence lengths that are designed so as to have a predetermined double-chain bond temperature.
[0027] According to this configuration, an oligo-nucleic acid array mounted with oligo-nucleic acid sequences with various lengths, which are designed based on the double-chain bond temperature, can be obtained. In this manner, an oligo-nucleic acid array having a very high accuracy in double-chain bond temperature can be realized.
[0028] Having described the invention, the following examples are given to illustrate specific applications of the invention including the best mode now known to perform the invention. These specific examples are not intended to limit the scope of the invention described in this application.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] FIG. 1 is a system configuration diagram illustrating an embodiment of this invention.
[0030] FIG. 2 illustrates an input screen to input the condition for determining an oligo-nucleic acid sequence.
[0031] FIG. 3 is a schematic diagram illustrating a nucleic acid base sequence being analyzed.
[0032] FIG. 4 is a schematic diagram illustrating the procedure for determining an oligo-nucleic acid sequence candidate from base sequences being analyzed.
[0033] FIG. 5 is a flow chart to explain the procedure for determining an oligo-nucleic acid sequence.
[0034] FIG. 6 is a front view illustrating an oligo-nucleic acid array obtained by this embodiment.
[0035] FIG. 7 is a flow chart to explain the procedure for determining an oligo-nucleic acid sequence pertaining to another embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0036] A preferred embodiment of the present invention will be described in detail below with reference to the accompanying diagrams. The diagrams illustrate only an example of the embodiments of the invention. Unless otherwise specified, the terms in the explanation bear the meaning ordinarily used by those skilled in the art to which this invention belongs.
[0037] FIG. 1 is a block diagram of the entire program to explain the present invention.
[0038] This program comprises a data storage unit 8 and a program storage unit 9, both connected to a bus 7, to which a CPU 1, a RAM 2, input devices 3 including a keyboard and a mouse, output devices 4 including a display and a printer, and a modem 5 are also connected.
[0039] In the data storage unit 8, the components pertinent to this invention are: an oligo-nucleic acid sequence determining condition 11, an analyzed nucleic acid base sequence file 12, a reference-only base sequence file 15, similarity discrimination results 13 of the nucleic acid base sequences analyzed and the oligo-nucleic acid sequence candidates 14.
[0040] In the oligo-nucleic acid sequence determining condition 11, at least a double-chain bond temperature 16, an oligo-nucleic acid length condition 17, and a low-grade threshold value 18 are stored. In this embodiment, the double-chain bond temperature 16 is set as a range around a desired double-chain bond temperature Tm, for instance, with the highest tolerated temperature Tmu=Tm+3° C. and the lowest tolerated temperature Tml=Tm−3° C. The length condition 17 is set as a range of, for instance, 50 to 100 bases (the shortest length is 50 bases, and the longest is 100 bases) to effectively prevent mis-hybridization.
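The determining condition 11 can be pictured as a small record holding the temperature range, the length range, and the low-grade threshold. The following is only a minimal sketch under stated assumptions: the class name, field names, and the choice to treat both ends of each range as inclusive are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignCondition:
    """Illustrative container for the determining condition 11 (hypothetical names)."""
    tml: float        # lowest tolerated temperature Tml, in deg C
    tmu: float        # highest tolerated temperature Tmu, in deg C
    ls: int = 50      # shortest sequence length, in bases
    ll: int = 100     # longest sequence length, in bases
    ln: float = 0.5   # low-grade threshold value (non-specific : specific)

    @classmethod
    def from_target(cls, tm, tolerance=3.0, **kwargs):
        # Tmu = Tm + tolerance and Tml = Tm - tolerance, as in the example above.
        return cls(tml=tm - tolerance, tmu=tm + tolerance, **kwargs)

    def accepts(self, tm, length):
        # Assumption: both ranges are inclusive at their end points.
        return self.tml <= tm <= self.tmu and self.ls <= length <= self.ll


cond = DesignCondition.from_target(70.0)
```

With a target of 70° C. and the default tolerance, `cond` describes the range from 67° C. to 73° C. for lengths of 50 to 100 bases.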
[0041] The low-grade threshold value 18 is the ratio of the number of bases in the non-specific part allowed to be contained in an oligo-nucleic acid sequence candidate to the number of bases in the specific part contained in the same oligo-nucleic acid sequence candidate. In this embodiment, it is, for instance, set to 50%. Oligo-nucleic acid sequences that partially contain sequences of the non-specific part are then outputted as “low grade,” and are distinguished from oligo-nucleic acid sequence candidates fully composed of the specific part.
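As an illustration of how the low-grade threshold value 18 might be applied, the sketch below grades a candidate from a list of per-base flags, with 1 for a base in the specific part and 0 for a base in the non-specific part. The function names and the three return values are assumptions made for this example only.

```python
def nonspecific_ratio(flags):
    """Ratio of non-specific bases (flag 0) to specific bases (flag 1)."""
    specific = sum(flags)
    nonspecific = len(flags) - specific
    if specific == 0:
        return float("inf")  # no specific bases at all
    return nonspecific / specific


def grade(flags, threshold=0.5):
    """Return 'normal', 'low grade', or None when the threshold is exceeded."""
    ratio = nonspecific_ratio(flags)
    if ratio > threshold:
        return None
    return "low grade" if ratio > 0 else "normal"
```

For example, a candidate with eight specific bases and two non-specific bases has a ratio of 0.25 and would be graded "low grade" under the 50% threshold.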
[0042] The analyzed nucleic acid base sequence file 12 consists of data containing a plurality of nucleic acid base sequences, which a user has been interested in and has collected. The reference-only base sequence file 15 consists of base sequences exclusively for reference, which have been optionally added and set from external databases such as the cDNA/EST database. These sequence files 12 may also contain data downloaded from one or more specific external databases 19 connected through the modem 5.
[0043] The similarity discrimination results 13 are the results including the specific sequence parts and nonspecific sequence parts of each base sequence, which are identified through determination of the similarities between the nucleic acid base sequences, and between the nucleic acid sequences analyzed and the reference-only base sequences. The oligo-nucleic acid sequence candidates 14 are oligo-nucleic acid sequence candidates with various base lengths computed based on the similarity discrimination results 13 and the oligo-nucleic acid sequence determining condition 11.
[0044] In the program storage unit 9, the components pertinent to this invention are: an oligo-nucleic acid base sequence determining condition input unit 20, a specific part sequence filter unit 21, a double-chain bond temperature condition filter unit 22, and an oligo-nucleic acid base sequence determination result display unit 23.
[0045] In actuality, these components 20˜23 include a certain region secured in a recording medium such as a hard disk, and one or more program commands of computer software are stored in the region. Any time when the CPU 1 calls them onto the RAM 2 to run them, they will perform the function of this invention. Next, the detailed configuration and function of the aforementioned components will be explained along with the actual oligo-nucleic acid base sequence determining procedure that is executed by this computer program.
[0046] The oligo-nucleic acid base sequence determining condition input unit 20 displays a screen for a user to input conditions on, for instance, the display (output device 4). An example of this screen is shown with Key 25 in FIG. 2. The screen includes an input box 26 for nucleic acid base sequence file name, input boxes 27 and 28 respectively for the upper limit and the lower limit of double-chain bond temperature, input boxes 29 and 30 respectively for the shortest and the longest values for the sequence length condition, and an input box 32 to specify the external database name. When the user enters values in respective input boxes 26˜32 and presses the OK button 31 afterward, the nucleic acid base sequence file 12 (external database 19) is specified, and at the same time, the oligo-nucleic acid sequence determining condition 11 is stored in the data storage unit 8.
[0047] The specific part sequence filter unit 21 has the function of reading information on each nucleic acid base sequence from the analyzed nucleic acid base sequence file 12 and from the reference-only base sequence file 15, and evaluating the similarities among the base sequences. The similarities are evaluated by simply comparing the character strings corresponding to the bases. Since accurate one-to-one comparisons of similarities and differences are required of the base sequences in selecting appropriate sequences, a homology search including insertion and deletion, which is frequently used in the gene sequence search, is not suitable. Sequences are compared strictly without assuming insertion or deletion. Therefore, a search means that neglects gaps is suitable.
[0048] If the BLAST method is used, a method that neglects gaps should be used, and the expected value (E-value) that varies depending on the database size should be set considerably loose (high) so that even a small partial concordance can be extracted. Note here that the E-value means the expected value for a fragment of the gene to be found when a database of a specific size is searched. Furthermore, by referring to scores of the fragments found, those with a higher score than the score with the threshold value are set as similar sequences. Note here that the score means a quantity that shows the level of concordance (the length of sequence in concordance or the level of similarity) of an object.
[0049] FIG. 3 illustrates one of the nucleic acid base sequences analyzed. In this figure, for the convenience of explanation, one nucleic acid base sequence analyzed is folded and displayed in several lines. Each base information of the nucleic acid, such as A, C, G, T(U), is indicated by a square.
[0050] The specific part sequence filter unit 21 will register, according to the homology search by the aforementioned BLAST method, sequences that are partially homologous to other nucleic acid base sequences analyzed or to the reference-only base sequences, as non-specific partial sequences (or common partial sequences). In FIG. 3, the parts that are colored black (shown with Key 33 in the figure) indicate non-specific partial sequences. The white squares (shown with Key 34 in the figure) indicate specific partial sequences.
[0051] Instead of the BLAST method, a technique of character string concordance search may be applied, wherein an appropriate sequence width is set as a window width, and is shifted for comparisons.
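The window-based character string concordance search can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: it looks only for exact, gap-free matches of a fixed window width, ignores reverse complements and scoring, and the function name and default window width are hypothetical.

```python
def mark_nonspecific(target, others, window=12):
    """Return per-base flags for `target`: 1 = specific, 0 = non-specific.

    A base is flagged non-specific when it lies inside any window of
    `window` bases that also occurs verbatim (no gaps) in another sequence.
    """
    # Collect every window-length substring of the other sequences.
    seen = set()
    for seq in others:
        for i in range(len(seq) - window + 1):
            seen.add(seq[i:i + window])

    flags = [1] * len(target)
    # Slide the window along the target one base at a time.
    for i in range(len(target) - window + 1):
        if target[i:i + window] in seen:
            for j in range(i, i + window):
                flags[j] = 0
    return flags
```

A wider window demands a longer exact match before a region is treated as common, so it marks fewer bases as non-specific.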
[0052] By using a method such as above, parts that are in concordance with each other at or above a desired threshold value are extracted, and the hit results are registered as non-specific partial sequences (33). At the same time, the score based on the length of the character string in concordance and the information on the position of concordance are also registered. If necessary, repeated sequence parts are preferably excluded as non-specific partial sequences.
[0053] The specific part sequence filter unit 21, after comparing all nucleic acid sequences analyzed, tabulates the results (parts with high levels of concordance or similarity). In this manner, the sequence parts remaining after deletion of the non-specific sequence parts 33 with a high level of similarity are outputted as the filtered specific partial sequences (or different partial sequences). These are the parts indicated by Key 34 in the figure. FIG. 3 illustrates the result of such filtering. The base sequences after this similarity discrimination are stored in the data storage unit 8 as the similarity discrimination results 13.
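Once each base carries a specific or non-specific flag, the filtered specific partial sequences are simply the runs of consecutive specific bases. The sketch below extracts those runs; the flag representation (1 for specific, 0 for non-specific) and the function name are assumptions for illustration.

```python
def specific_segments(flags):
    """Return (start, end) pairs, 0-based and end-exclusive, of runs where flag == 1."""
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag == 1 and start is None:
            start = i                      # a specific run begins here
        elif flag != 1 and start is not None:
            segments.append((start, i))    # the run ended at position i - 1
            start = None
    if start is not None:
        segments.append((start, len(flags)))  # run extends to the last base
    return segments
```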
[0054] The double-chain bond temperature condition filter unit 22 performs the function of determining oligo-nucleic acid sequences with a length that meets the specified double-chain bond temperature condition from the nucleic acid base sequences obtained as the above similarity discrimination results (13).
[0055] This double-chain bond temperature condition filter unit 22, as illustrated in FIG. 1, comprises a starting point setting unit 35, a length setting unit 36, a double-chain bond temperature computing unit 37, and an oligo-nucleic acid sequence candidate determining unit 38.
[0056] The double-chain bond temperature computing unit 37 computes the double-chain bond temperature of the oligo-base sequence, which has the starting point set by the starting point setting unit 35 and has the length set by the length setting unit 36. As a method to compute the double-chain bond temperature, for instance, the Nearest-Neighbor method (SantaLucia, J. Jr., Proc. Natl. Acad. Sci. USA, 95, 1460-1465, 1998) is used for those with 36 bases or less; and for those with 37 bases or more, the method described in J. Sambrook, E. F. Fritsch, T. Maniatis, Molecular Cloning: A Laboratory Manual, p. 11.46, Cold Spring Harbor Laboratory Press, 1989 is preferably used at the present time. However, any other methods may be used.
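The length-dependent choice of computation method can be sketched as a simple dispatch. In the example below, the nearest-neighbor parameters are not reproduced, so the method for short sequences is supplied by the caller. The function for longer sequences uses a widely quoted salt-adjusted GC-content approximation; its coefficients and the default sodium concentration are assumptions for illustration and are not confirmed to be the exact expression on the cited page. All names are hypothetical.

```python
import math


def tm_gc_formula(seq, na_molar=0.05):
    """Salt-adjusted GC-content approximation (illustrative coefficients)."""
    n = len(seq)
    gc = sum(1 for base in seq.upper() if base in "GC")
    gc_percent = 100.0 * gc / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 600.0 / n


def compute_tm(seq, short_method, long_method=tm_gc_formula, cutoff=36):
    """Use `short_method` for up to `cutoff` bases and `long_method` beyond that."""
    method = short_method if len(seq) <= cutoff else long_method
    return method(seq)
```

Keeping the two methods as interchangeable functions reflects the statement that any other method may be used.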
[0057] The oligo-nucleic acid sequence candidate determining unit 38 receives the computation result each time the double-chain bond temperature computing unit 37 computes the double-chain bond temperature. Then, it outputs, as candidates, the oligo-nucleic acid sequences that fall within the temperature range and the length range inputted by the oligo-nucleic acid sequence determining condition 11. By applying this process while shifting the starting point and extending the length of the sequence as described above, oligo-nucleic acid base sequence candidates of various lengths that fall within the desired double-chain bond temperature condition can be obtained.
[0058] Next, based on FIG. 4 to FIG. 6, one of the nucleic acid base sequences analyzed as well as the procedure to extract oligo-nucleic acid base sequence candidates from it by estimating the temperature will be described in detail.
[0059] FIG. 4 is a diagram illustrating this procedure. In this figure, Key 41 indicates a similarity discrimination result as in FIG. 3. For the sequences in this similarity discrimination result, the double-chain bond temperature is computed successively from the starting point (n=1) in the first specific sequence part 34 as the length of the sequence is extended. When the temperature falls within the pre-specified temperature range, the part is saved as an oligo-nucleic acid base sequence candidate, as indicated by Key 42. Candidates continue to be saved until the sequence part is extended to a point where the temperature surpasses the upper limit temperature Tmu.
[0060] In this example, to simplify the illustration, those that are shorter than the length condition 17 set above are also displayed as candidates. Only base sequences that meet the set length condition will be left as candidates. When the upper limit Tmu is reached, the initial position is shifted by one, and from the new starting point, the double-chain bond temperature is computed in the same manner as the length of the sequence is successively extended. In this manner, another group of candidates is obtained, as indicated by Key 43 in the figure.
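The procedure of extending the length from each starting point and then shifting the starting point by one can be sketched as two nested loops. The example below ignores the distinction between specific and non-specific parts, uses 0-based positions, and relies on a toy temperature function that is only a stand-in and not a real Tm model. All names are hypothetical.

```python
def toy_tm(seq):
    """Toy stand-in for a Tm model: 2 per A/T and 4 per G/C. Not a real predictor."""
    return sum(4 if base in "GC" else 2 for base in seq)


def scan_candidates(seq, tm_func, tml, tmu, ls, ll):
    """Return (start, end, tm) triples, 0-based and end-exclusive."""
    candidates = []
    for ip in range(len(seq)):                        # shift the starting point by one
        last_end = min(ip + ll, len(seq))
        for ep in range(ip + 1, last_end + 1):        # extend the length by one base
            tm = tm_func(seq[ip:ep])
            if tm > tmu:
                break      # assume a longer sequence would only raise the temperature
            if tm >= tml and ep - ip >= ls:
                candidates.append((ip, ep, tm))
    return candidates


cands = scan_candidates("ACGTACGTAC", toy_tm, tml=10, tmu=14, ls=3, ll=6)
```

Each starting point can contribute several candidates of different lengths, which is how candidates of various lengths within the same temperature range arise.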
[0061] When the specific part sequence 34 is so short that only a small number of candidates with lengths that meet the aforementioned double-chain bond temperature condition are obtained, bases from the non-specific sequence part with a small score are gradually added to extend the length until it meets the double-chain bond temperature condition. The result is then displayed as a low-grade candidate for the oligo-nucleic acid base sequence. Specifically, by referring to the low-grade threshold value 18, the sequence is extended until the non-specific part surpasses the threshold value 18. When it surpasses the value, the initial position is shifted.
[0062] The lengths of oligo-nucleic acid base sequences outputted as candidates are preferably between 50 bases and 100 bases. In this embodiment, with a length of around 60 to 70 bases as the threshold value, it will be probabilistically difficult for samples other than the applicable nucleic acid base sequences to hybridize to the probe. Thus, noise can be reduced.
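A rough sense of why longer probes are unlikely to match by chance can be obtained from a simple estimate. The sketch below assumes that the four bases are equally likely and independent and that only exact matches count; real genomes and real hybridization, which tolerates some mismatches, do not satisfy these assumptions, so the result is an order-of-magnitude guide only. The function name is hypothetical.

```python
def expected_random_hits(probe_length, background_bases):
    """Expected number of exact chance matches to one probe in a random background.

    Assumes independent, equally likely bases (an idealization).
    """
    positions = max(background_bases - probe_length + 1, 0)
    return positions * 0.25 ** probe_length
```

Under these assumptions, each additional base reduces the expected number of chance matches by a factor of four.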
[0063] Next, by referring to the flow chart in FIG. 5, an actual processing procedure will be explained.
[0064] In the explanation and the flow chart below, each of the constants and variables are defined as follows:
[0065] n: Sequence number counted from the start of each nucleic acid base sequence (1, 2, 3, 4 . . . in FIG. 4)
[0066] nm: Last base number of each nucleic acid base sequence
[0067] PR(n): 1 in the case of a specific sequence part; 0 in the case of a non-specific sequence part
[0068] ip: Initial position of the oligo-nucleic acid sequence for which the double-chain bond temperature is obtained
[0069] ep: Ending position of the oligo-nucleic acid sequence for which the double-chain bond temperature is obtained
[0070] Tm (ip, ep): Double-chain bond temperature of the sequence between the initial position ip and the ending position ep
[0071] Tmu: Upper limit value for tolerated melting temperature
[0072] Tml: Lower limit value for tolerated melting temperature
[0073] Ls: Shortest sequence length
[0074] Ll: Longest sequence length
[0075] Ln: Low-grade threshold value (Ratio of tolerated nonspecific sequence part to specific sequence part)
[0076] First, in Step S1, the PR value is successively set while scanning from n=1 to n=nm. In this manner, each base in the sequence will be set as follows: if it is present in the specific partial sequence region, PR (n)=1 (white part 34 in FIG. 4); and if it is present in the non-specific partial sequence region, PR (n)=0 (black part 33 in FIG. 4).
[0077] Next, in Step S2, the initial values for the initial position number ip and the ending position number ep, ip=1 and ep=1 are set. In the next step S3, it is determined whether or not the ending position number, ep, has reached the last base number, nm, of the applicable nucleic acid base sequence; and if it has not reached the number, in the following step S4, it is determined whether or not the length of the partial sequence between ip and ep has surpassed the upper limit sequence length Ll.
[0078] If it has surpassed the value, the process will proceed to Step S12 (will be explained later), wherein the initial position will be shifted. If it has not surpassed the value, in Step S5, based on the low-grade threshold value Ln, it will be checked whether the ratio of nonspecific sequence part to specific sequence part in the sequence between ip and ep is larger than Ln. In this example, Ln is 50%. Therefore, it is checked whether or not the ratio of the number of bases having PR (n)=0 to the number of bases having PR (n)=1 is larger than 50% in the sequence between ip and ep.
[0079] If it is larger, it will proceed to Step S12 wherein the initial position will be shifted. If it is smaller, the process will proceed to Step S6 to obtain double-chain bond temperature. In Step S6, the double-chain bond temperature Tm (ip, ep) value for the sequence between ip and ep is computed, and the process will proceed to Step S7. In Step S7, it is determined whether or not the Tm (ip, ep) value is larger than the upper limit value Tmu. If it is larger, without leaving this sequence as a candidate, the process will proceed to the step S12 to shift the initial position. If it is smaller, the process will proceed to the next step S8. In general, the longer the sequence is, the higher the Tm value is; thus when the Tm (ip, ep) value is higher than Tmu, further extending the partial base sequence is meaningless.
[0080] Next, in Step S8, it is checked whether or not the Tm (ip, ep) value is higher than Tml. If it is higher, the double-chain bond temperature of this sequence is determined to fall between the upper limit value Tmu and the lower limit value Tml, and the process will proceed to the next step S9. In Step S9, it is determined whether or not the length of this sequence is longer than the lower limit Ls; if it is longer, this sequence (ip, ep) will be determined to be an oligo-nucleic acid sequence candidate, and will be stored in the data storage unit 8. If a non-specific sequence part is contained in this oligo-nucleic acid sequence candidate (when the ratio in Step S5 is 1 or higher), said sequence will be saved with a low-grade flag on it.
[0081] When the double-chain bond temperature is determined to be lower than the lower limit value Tml in Step S8, or when the sequence length is determined to be too short in Step S9, the process will proceed to Step S11, and the ending position number will be increased by one (ep=ep+1). Then, Steps S3 to S10 will be repeated. To compute the double-chain bond temperature Tm (ip, ep), the previous computation result Tm (ip, ep−1) can be reused for faster computation.
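The reuse of Tm (ip, ep−1) can be sketched as follows. The sketch does not use the Tm formula of this specification; as a stand-in it assumes the well-known Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), chosen only because each base contributes independently, so the incremental update is easy to see. The function names are hypothetical.

```python
# Stand-in per-base contributions (Wallace rule), an assumption for illustration.
WALLACE = {"A": 2, "T": 2, "G": 4, "C": 4}


def tm_full(seq, ip, ep):
    """Tm of the 1-based inclusive window [ip, ep], summing over every base."""
    return sum(WALLACE[base] for base in seq[ip - 1:ep])


def tm_extend(prev_tm, seq, ep):
    """Tm(ip, ep) obtained from Tm(ip, ep-1) by adding only the new base ep."""
    return prev_tm + WALLACE[seq[ep - 1]]


seq = "ATGCGC"                      # hypothetical sequence
tm = tm_full(seq, 1, 1)             # Tm(1, 1), computed from scratch once
for ep in range(2, len(seq) + 1):
    tm = tm_extend(tm, seq, ep)     # one addition per extension step
    assert tm == tm_full(seq, 1, ep)
print(tm)
```

With a formula whose terms depend on adjacent base pairs, the same idea would apply, adding the term for the newly formed pair rather than for a single base.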
[0082] By repeating these steps, base sequences of various lengths with the same starting point will be saved as candidates.
[0083] If the sequence does not meet the condition in Step S4, S5 or S7, the process of shifting the starting point will be applied in Step S12. Thus, in S12, (1) the starting point is shifted by one (ip=ip+1); and (2) the ending position number ep is adjusted to this initial position number ip (ep=ip). In this manner, the starting point is shifted, and the length is reset. Then, by repeating Steps S3 to S10, oligo-nucleic acid sequences with the shifted starting point will be successively outputted as candidates.
[0084] If the starting point has entered the non-specific region 33, the ratio checked in Step S5 is determined to be 100%, which exceeds Ln, so that the double-chain bond temperature will not be computed until the starting point comes out of this non-specific region. In this manner, the non-specific region is skipped. That is, the process skips the non-specific part sequence regions, and only the partial base sequences whose double-chain bond temperature falls between Tml and Tmu can be saved in the storage.
[0085] When the initial position ip has shifted to the ending position nm of the nucleic acid base sequence, this fact is detected in Step S3, and all the steps will end.
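The flow of Steps S2 through S12 can be summarized in a short sketch. It is an illustration under stated assumptions, not the program of the embodiment: the Tm estimator is a stand-in (Wallace rule), the scan ends once ip has passed nm, a window that would run past nm simply triggers a shift of the starting point, and a length equal to Ls is accepted. All function and parameter names are hypothetical.

```python
WALLACE = {"A": 2, "T": 2, "G": 4, "C": 4}


def wallace_tm(seq, ip, ep):
    """Stand-in Tm estimate for the 1-based inclusive window [ip, ep]."""
    return sum(WALLACE[base] for base in seq[ip - 1:ep])


def design_candidates(seq, pr, tm, tmu, tml, ls, ll, ln=0.5):
    """Return (ip, ep, low_grade) tuples; pr is indexed 1..nm, pr[0] unused."""
    nm = len(seq)
    candidates = []
    ip = ep = 1                                   # Step S2
    while ip <= nm:                               # Step S3 (assumed end test)
        length = ep - ip + 1
        window = pr[ip:ep + 1]
        nonspecific = window.count(0)
        specific = window.count(1)
        if (ep > nm                               # assumption: ran past the end
                or length > ll                    # Step S4
                or nonspecific > ln * specific):  # Step S5
            ip, ep = ip + 1, ip + 1               # Step S12: shift start, reset
            continue
        t = tm(seq, ip, ep)                       # Step S6
        if t > tmu:                               # Step S7
            ip, ep = ip + 1, ip + 1               # Step S12
            continue
        if t > tml and length >= ls:              # Steps S8 and S9
            # Store the candidate; flag it low-grade if it contains
            # any non-specific base.
            candidates.append((ip, ep, nonspecific > 0))
        ep += 1                                   # Step S11: extend by one base
    return candidates


# Hypothetical input: 10 bases, base 6 marked non-specific.
seq = "ATGCGCATAT"
pr = [0] + [1] * 10
pr[6] = 0
cands = design_candidates(seq, pr, wallace_tm, tmu=20, tml=14, ls=4, ll=8)
for ip, ep, low_grade in cands:
    print(ip, ep, seq[ip - 1:ep], "low-grade" if low_grade else "")
```

Because several end positions can satisfy the Tm tolerance for one starting point, the list may contain candidates of different lengths that share the same ip.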
[0086] According to this type of processing based on the double-chain bond temperature, candidates can be determined with a wide variety of lengths of oligo-nucleic acid base sequences. Thus, even when designing is made with a narrow range of the double-chain bond temperature, many oligo-nucleic acid sequences can be obtained.
[0087] The oligo-nucleic acid base sequence candidates obtained in this manner can be retrieved from the data storage unit 8 by the oligo-nucleic acid base sequence determination result display unit 23, and can be displayed on a display (output device 4).
[0088] These design results will be displayed for each nucleic acid base sequence analyzed. If necessary, they may be sorted (classified) by length. Alternatively, the ease with which each candidate forms a secondary structure may be evaluated, and candidates that are less likely to form a secondary structure may preferably be displayed first.
[0089] FIG. 6 illustrates an oligo-nucleic acid array 71 on which the oligo-nucleic acid base sequences determined in this embodiment are mounted. This oligo-nucleic acid array 71 is formed by mounting, in predetermined compartments 74 on a glass substrate 73 coated with poly-L-lysine 72, the sequences selected from the oligo-nucleic acid base sequence candidates by use of a spot device. According to this type of oligo-nucleic acid array 71, although the lengths of the oligo-nucleic acid base sequences differ from one compartment 74 to another, each spot has a double-chain bond temperature within the appropriate range, so that it is possible to obtain stable results with no mis-hybridization.
[0090] In this embodiment, an example was illustrated wherein oligo-nucleic acid base sequences are mounted on a glass substrate (73). However, for the substrate, other materials such as a resin may be used instead of glass. Also, similar effects can be realized in an array wherein the sequences are spotted onto a membrane, or in any two-dimensional array wherein each oligo-nucleic acid base sequence is embedded in a partitioned region so as to be present in an individual area.
[0091] Obviously, many modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described.
[0092] For instance, in the embodiment above, before the double-chain bond temperature was obtained, the non-specific region was skipped at Step S5. The scope of the invention is not limited to this method. That is, in this method, the starting point of an oligo-nucleic acid base sequence candidate is always a specific sequence part; however, it may start from a non-specific part. Thus, for instance, as illustrated in FIG. 7, Step S5 may be executed after Step S9 in the aforementioned embodiment.
[0093] According to this configuration, irrespective of whether the first base is a specific sequence part or a non-specific sequence part, as long as the ratio of non-specific part to specific part is Ln (for instance 50%) or smaller, the sequence will be saved as a candidate. Candidates partially containing non-specific sequence parts as in this example will be outputted as low-grade candidates (Step S10). As in this case, it is possible to increase the number of candidates by loosening the condition.
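The grading in this variant can be illustrated with a small sketch. It assumes that the window has already passed the Tm and length checks, and it only decides how the window is saved; the function name `grade_candidate` and its return values are hypothetical.

```python
def grade_candidate(pr, ip, ep, ln=0.5):
    """Grade a window [ip, ep] (1-based, inclusive) that already meets the
    Tm tolerance and the length limits.

    Returns None if the ratio of non-specific to specific bases exceeds ln,
    "low-grade" if the window contains any non-specific base, else "normal".
    """
    window = pr[ip:ep + 1]
    nonspecific = window.count(0)
    specific = window.count(1)
    if nonspecific > ln * specific:
        return None                  # ratio larger than Ln: not saved
    return "low-grade" if nonspecific > 0 else "normal"


# Hypothetical PR values (index 0 is padding); base 1 is non-specific.
pr = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(grade_candidate(pr, 1, 5))     # starts on a non-specific base
print(grade_candidate(pr, 2, 7))     # entirely specific
print(grade_candidate(pr, 6, 10))    # mostly non-specific
```

Note that the first window is accepted even though its first base is non-specific, which is the difference from the flow in which Step S5 precedes the Tm computation.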
[0094] As explained above, according to the present invention, a computer program and a method are provided to concurrently determine many oligo-nucleic acid sequences having a high level of accuracy in double-chain bond temperature or melting temperature (Tm).
[0095] Also, an oligo-nucleic acid array mounted with the oligo-nucleic acid sequences obtained in this method is provided.
Claims
1. A computer software program for designing an optimum oligo-nucleic acid sequence candidate from a nucleic acid base sequence, said program comprising:
- a first command for receiving and storing a specification for a tolerance of double-chain bond temperature;
- a second command for computing the double-chain bond temperature of a partial sequence at each length as extending the length of the partial sequence in said nucleic acid base sequence; and
- a third command for determining whether or not the double-chain bond temperature computed by the second command falls within the tolerance obtained by the first command, and, if it does, for outputting the partial sequence of the length as an oligo-nucleic acid sequence candidate.
2. The computer software program according to claim 1,
- wherein the second command successively shifts a starting point of the partial sequence for which the double-chain bond temperature is obtained in said nucleic acid sequences, and extends the length of the partial sequence from the shifted starting point.
3. The computer software program according to claim 2,
- wherein the second command, when the third command determines that the double-chain bond temperature of the partial sequence is outside of the tolerance, shifts the starting point to the next point without extending the length any longer.
4. The computer software program according to claim 1,
- said program further comprises:
- a fourth command for displaying a plurality of oligo-nucleic acid sequence candidates outputted by the third command.
5. The computer software program according to claim 1,
- said program further comprises:
- a fifth command for comparing homomeric genes among a plurality of nucleic acid sequences and for identifying sequence parts specific and non-specific to each nucleic acid sequence, whereby the second command successively shifts a starting point of said specific sequence part in the partial sequence for which the double-chain bond temperature is obtained, and extends the length of the partial sequence from the shifted starting point.
6. The computer software program according to claim 5,
- wherein the second command, when the third command determines that the double-chain bond temperature of the partial sequence is outside of the tolerance, shifts the starting point to the next point without extending the length any longer.
7. The computer software program according to claim 5,
- wherein the second command further comprises a sixth command, which, when the extended partial sequence contains said non-specific sequence parts, determines whether to compute the double-chain bond temperature of said partial sequence or to set it as an oligo-nucleic acid sequence candidate; and the second command, when the sixth command determines that no double-chain bond temperature is to be computed or no oligo-nucleic acid sequence candidate is to be set, shifts the starting point of said partial sequence to the beginning of the next specific sequence part.
8. The computer software program according to claim 7,
- wherein the sixth command determines, based on the ratio of the specific sequence parts and the non-specific sequence parts or the respective sequence numbers contained in said extended partial sequence, whether to compute the double-chain bond temperature of said extended partial sequence or to set it as an oligo-nucleic acid sequence candidate.
9. The computer software program according to claim 7,
- wherein the fourth command, when said non-specific sequence parts are contained in the partial sequence that falls within the tolerance of the double-chain bond temperature, outputs the partial sequence as a low-grade oligo-nucleic acid sequence candidate.
10. The computer software program according to claim 5,
- wherein the second command, when no partial sequence with the length that has the double-chain bond temperature within the tolerance can be obtained from each specific sequence part, extends the length of the partial sequence to the non-specific sequence parts, and outputs the partial sequence that falls within the tolerance of the double-chain bond temperature as a low-grade oligo-nucleic acid sequence candidate.
11. A method for designing an optimum oligo-nucleic acid sequence candidate from a nucleic acid base sequence, said method comprising the steps of:
- (a) specifying a tolerance of double-chain bond temperature;
- (b) computing, within said nucleic acid base sequence, the double-chain bond temperature of a partial sequence at each length as extending the length; and
- (c) determining whether or not the double-chain bond temperature computed by said computing means falls within the tolerance specified by said specifying means, and if it does, outputting the partial sequence of the length as an oligo-nucleic acid sequence candidate.
12. The method according to claim 11,
- wherein Step (b) successively shifts a starting point of the partial sequence for which the double-chain bond temperature is obtained in said nucleic acid sequence, and extends the length of the partial sequence from the shifted starting point.
13. An oligo-nucleic acid array mounted with a plurality of oligo-nucleic acids having sequence lengths designed so as to have a predetermined double-chain bond temperature.
Type: Application
Filed: Jun 19, 2002
Publication Date: Dec 26, 2002
Applicant: Kabushikigaisha DYNACOM (Chiba)
Inventors: Hitoshi Fujimiya (Mobara-shi), Junichiro Miura (Mobara-shi), Yoshiaki Aoki (Mobara-shi)
Application Number: 10174630
International Classification: G06G007/48; G06G007/58; G06F019/00; G01N033/48; G01N033/50;