Ribonucleic acid interference molecules

Info

Publication number: 20080125583
Type: Application
Filed: Feb 10, 2006
Publication Date: May 29, 2008
Applicant: International Business Machines Corporation (Armonk, NY)
Inventors: Isidore Rigoutsos (Astoria, NY), Tien Huynh (Yorktown, NY), Kevin Charles Miranda (McDowall)
Application Number: 11/352,152

Abstract

Ribonucleic acid interference molecules are provided. For example, in one aspect of the invention, at least one nucleic acid molecule comprising at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948 and one or more mature sequences having SEQ ID NO: 1 through SEQ ID NO: 126,499 is provided. One or more of the at least one of one or more precursor sequences and one or more mature sequences may be computationally predicted, e.g., from publicly available genomes, using a pattern discovery method. In another aspect of the invention, a method for regulating gene expression comprises the following step. At least one nucleic acid molecule comprising at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948, each one of the precursor sequences containing one or more mature sequences having SEQ ID NO: 1 through SEQ ID NO: 126,499, is used to regulate the expression of one or more genes, e.g., by inducing post-transcriptional silencing of the one or more genes.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/652,499, filed Feb. 11, 2005, the disclosure of which is incorporated by reference herein.

This application is related to U.S. patent application entitled “System and Method for Identification of MicroRNA Target Sites and Corresponding Targeting MicroRNA,” (attorney docket no. YOR920060077US1), and to U.S. patent application entitled “System and Method for Identification of MicroRNA Precursor Sequences and Corresponding Mature MicroRNA Sequences from Genomic Sequences” (attorney docket no. YOR920060075US1), both filed concurrently herewith, the disclosures of which are incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to genes and, more particularly, to ribonucleic acid interference molecules and their role in gene expression.

BACKGROUND OF THE INVENTION

The ability of an organism to regulate the expression of its genes is of central importance to life. A breakdown in this homeostasis leads to disease states, such as cancer, where a cell multiplies uncontrollably, to the detriment of the organism. The general mechanisms utilized by organisms to maintain this gene expression homeostasis are the focus of intense scientific study.

It recently has been discovered that some cells are able to down-regulate their gene expression through certain ribonucleic acid (RNA) molecules. Namely, RNA molecules can act as potent gene expression regulators either by inducing mRNA degradation or by inhibiting translation; this activity is summarily referred to as post-transcriptional gene silencing or PTGS for short. An alternative name by which it is also known is RNA interference, or RNAi. PTGS/RNAi has been found to function as a mediator of resistance to endogenous and exogenous pathogenic nucleic acids and also as a regulator of the expression of genes inside cells.

The term ‘gene expression,’ as used herein, refers generally to the transcription of messenger-RNA (mRNA) from a gene, and its subsequent translation into a functional protein. One class of RNA molecules involved in gene expression regulation comprises microRNAs, which are endogenously encoded and regulate gene expression by either disrupting the translation processes or by degrading mRNA transcripts, e.g., inducing post-transcriptional repression of one or more target sequences.

The RNAi/PTGS mechanism allows an organism to employ short RNA sequences to either degrade or disrupt translation of complementary mRNA transcripts. Early studies suggested only a limited role for RNAi, that of a defense mechanism against pathogens. However, the subsequent discovery of many endogenously-encoded microRNAs pointed towards the possibility of this being a more general control mechanism. Recent evidence has led the community to hypothesize that a wider spectrum of biological processes is affected by RNAi, thus extending the range of this presumed control layer.

A better understanding of the mechanism of the RNA interference process would benefit drug design, the fight against disease, and the understanding of host defense mechanisms.

SUMMARY OF THE INVENTION

Ribonucleic acid interference molecules are provided. For example, in one aspect of the invention, at least one nucleic acid molecule comprising at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948 and one or more mature sequences having SEQ ID NO: 1 through SEQ ID NO: 126,499 is provided. For example, molecules may be one or more instances of a precursor type, one or more instances of a mature type, or some combinations thereof. One or more of the sequences may be computationally predicted, e.g., from publicly available genomes, using a pattern discovery method.

It is to be understood that “SEQ ID NO.” stands for sequence identification number. Each sequence identification number corresponds to a sequence stored in a text file on the accompanying CDROM.

In another aspect of the invention, a method for regulating gene expression comprises the following step. At least one nucleic acid molecule comprising at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948, each one of the precursor sequences containing one or more mature sequences having SEQ ID NO: 1 through SEQ ID NO: 126,499, is used to regulate the expression of one or more genes.

The method may further comprise inserting the at least one nucleic acid molecule into an environment where the at least one nucleic acid molecule can be produced biochemically. The method may further comprise inserting the at least one nucleic acid molecule into an environment where the at least one nucleic acid molecule can be produced biochemically, giving rise to one or more interfering ribonucleic acids which affect one or more target sequences.

One or more of the sequences may be synthetically removed from the genome that contains them naturally. One or more of the sequences may be synthetically introduced in a genome that does not contain them naturally.

One or more target sequences may be encoded by the same genome as the one or more sequences. One or more target sequences may be encoded by a different genome from the one or more sequences. One or more target sequences may be naturally occurring. One or more target sequences may be synthetically constructed.

One or more sequences may be transcribed, giving rise to one or more interfering ribonucleic acids which induce post-transcriptional repression of one or more target sequences. One or more sequences may be transcribed, giving rise to one or more interfering ribonucleic acids which induce gene silencing of one or more target sequences. One or more of the sequences may be synthetically constructed.

In yet another aspect of the invention, at least one nucleic acid molecule comprises at least a portion of a precursor sequence having one of SEQ ID NO: 1 through SEQ ID NO: 103,948, wherein the portion comprises an amount of the sequence that does not significantly alter a behavior of the complete precursor sequence.

In a further aspect of the invention, at least one nucleic acid molecule comprises at least a portion of a mature sequence having one of SEQ ID NO: 1 through SEQ ID NO: 126,499, wherein the portion comprises an amount of the sequence that does not significantly alter a behavior of the complete mature sequence.

A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The teachings of the present invention relate to ribonucleic acid (RNA) molecules and their role in gene expression regulation. The term ‘gene expression,’ as used herein, refers generally to the transcription of messenger-RNA (mRNA) from a gene, and, e.g., its subsequent translation into a functional protein. One class of RNA molecules involved in gene expression regulation comprises microRNAs, which are endogenously encoded and regulate gene expression by either disrupting the translation processes or by degrading mRNA transcripts, e.g., inducing post-transcriptional repression of one or more target sequences. MicroRNAs are transcribed by RNA polymerase II as parts of longer primary transcripts known as pri-microRNAs. Pri-microRNAs are subsequently cleaved by Drosha, a double-stranded-RNA-specific ribonuclease, to form microRNA precursors or pre-microRNAs. Pre-microRNAs are exported by Exportin-5 from the nucleus into the cytoplasm where they are processed by Dicer. Dicer is a member of the RNase III family of nucleases that cleaves the pre-microRNA and forms a double-stranded RNA with 3′ overhangs at both ends that are one to four nucleotides long. The mature microRNA is derived from either the leading or the lagging arm of the microRNA precursor. Finally, a helicase separates the double-stranded RNA species into single strands, and the strand containing the mature microRNA becomes associated with an effector complex known as RISC (for RNA-induced silencing complex). The RISC+microRNA construct base pairs with its target in a sequence-specific manner using Watson-Crick pairing (and the occasional formation of G:U pairs). If the microRNA is loaded into an Argonaute-2 RISC, the target is cleaved at the binding site and degraded. In the presence of mismatches between a microRNA and its target, post-transcriptional gene silencing is effected through translational inhibition.
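The base-pairing rule mentioned above (Watson-Crick pairs plus occasional G:U pairs) can be illustrated with a short sketch. This is not a target-prediction method; it only labels each position of an equal-length microRNA/target alignment. The function name, the symbols used in the output, and the toy sequences are assumptions made for this illustration, and both inputs are assumed to be RNA strings written 5′ to 3′.

```python
# Illustrative only: classify each position of an antiparallel RNA:RNA alignment.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}  # the G:U pairs mentioned in the text


def pairing_profile(mirna, target_site):
    """Return one symbol per microRNA position (read 5' to 3').

    '|' marks a Watson-Crick pair, ':' a G:U pair, ' ' a mismatch.
    Both strings are written 5' to 3', so the target is read in reverse
    to model antiparallel binding.
    """
    if len(mirna) != len(target_site):
        raise ValueError("this sketch only handles equal-length, gap-free alignments")
    symbols = []
    for m, t in zip(mirna.upper(), reversed(target_site.upper())):
        if (m, t) in WATSON_CRICK:
            symbols.append("|")
        elif (m, t) in WOBBLE:
            symbols.append(":")
        else:
            symbols.append(" ")
    return "".join(symbols)


# Toy sequences, not taken from the sequence listing.
profile = pairing_profile("GAGU", "GCUC")
```

A profile made up only of '|' and ':' symbols corresponds to a fully paired site, whereas spaces mark the mismatches that the text associates with translational inhibition.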

According to the teachings presented herein, the target sequence(s) may be naturally occurring. Alternatively, the target sequences may be synthetically constructed. A target sequence may be synthetically constructed so as to test prediction methods and/or to induce the RNAi/PTGS control of genes of interest. Additionally, a target sequence may be synthetically constructed so as to control multiple genes with a single RNA molecule, and also possibly to modify, in a combinatorial manner, the kinetics of the reaction by, for example, introducing multiple target sites. Similarly, the precursor sequence(s) may be either naturally occurring or synthetically constructed. For example, a precursor sequence of interest may be synthetically constructed and introduced into a cell that lacks that particular precursor. Further, when any of the above sequences are naturally occurring, they may be synthetically removed, for analysis purposes, from the genome that contains them, e.g., using standard molecular techniques.

As mentioned above, the present application is related to U.S. patent application entitled “System and Method for Identification of MicroRNA Target Sites and Corresponding Targeting MicroRNA,” (attorney docket no. YOR920060077US1), and to U.S. patent application entitled “System and Method for Identification of MicroRNA Precursor Sequences and Corresponding Mature MicroRNA Sequences from Genomic Sequences” (attorney docket no. YOR920060075US1), both filed concurrently herewith, the disclosures of which are incorporated by reference herein.

In such related applications, several important questions are addressed. For example, for a given nucleotide sequence, is it part of or does it contain a microRNA precursor? Or, given the sequence of a microRNA precursor, where is the segment which will give rise to the mature microRNA? Further, is there more than one mature microRNA produced by a particular precursor, and if so, where are the segments which, after transcription, will give rise to these additional mature microRNAs? Another question addressed is the following: given the 3′ untranslated region (3′UTR) of a given gene, which region(s) of it will function as a target(s) for some mature microRNA? This last question can also be asked when we are instead presented with the 5′ untranslated region (5′UTR) or the amino acid coding region of a given gene. Also, for a given putative target site, which microRNA, if any, will bind to the putative target site?

For the purposes of this discussion, we only focus on the problem of whether a specific nucleotide sequence corresponds to a microRNA precursor or to a mature microRNA. A method for answering this question is described in the above referenced YOR920060075US1 patent application.

Summarily, the method comprises a first phase during which patterns are generated by processing an appropriate training set using a pattern discovery algorithm. If the training set comprises sequences of microRNA precursors, then the generated patterns, after appropriate attribute-based filtering, will be microRNA-precursor specific. If the training set comprises sequences of mature microRNAs, then the generated patterns, after appropriate attribute-based filtering, will be mature-microRNA specific. Alternatively, the training set can comprise putative mature microRNAs or putative microRNA precursors. In a preferred embodiment, two training sets were used, one comprising sequences of known microRNA precursors and one comprising sequences of known mature microRNAs.

The basic idea of this pattern-based method is to replace the training set of sequences with an “equivalent” representation that consists of patterns. The patterns can be derived using a pattern discovery algorithm, such as the Teiresias algorithm. See, for example, U.S. Pat. No. 6,108,666 issued to A. Floratos and I. Rigoutsos, entitled “Method and Apparatus for Pattern Discovery in 1-Dimensional Event Streams,” the disclosure of which is incorporated by reference herein. The patterns are, ideally, maximal in composition and length (properties which are, by default, guaranteed by the Teiresias algorithm).

The generated microRNA-precursor-specific or mature-microRNA-specific patterns can then be used as predicates to identify, in a de novo manner, microRNA precursors from genomic sequence, or mature microRNAs in the sequence of a putative microRNA precursor. This is exploited in the method's second phase, during which the patterns at hand are sought in the sequence under consideration: to determine whether a given nucleotide sequence S is part of, or encodes, a microRNA precursor, the microRNA-precursor-specific patterns are used; and to determine whether a given nucleotide sequence S corresponds to, or contains, a mature microRNA, the mature-microRNA-specific patterns are used.

In general, one anticipates numerous instances of microRNA-precursor-specific patterns in sequences that correspond to microRNA precursors whereas background and unrelated sequences should receive few or no such hits. If the number of pattern instances exceeds a predetermined threshold, then the corresponding segment of the sequence that receives the pattern support (and possibly an appropriately sized flanking region) is reported as a putative microRNA precursor. Analogous comments can be made about mature-microRNA-specific patterns and sequences containing mature microRNAs.
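The thresholding step described above can be sketched as follows. This is a minimal illustration only; it is neither the Teiresias algorithm nor the method of the related application. The function names, the use of '.' as a single-position wildcard, the toy patterns, and the threshold values are all assumptions made for the example.

```python
import re


def count_pattern_instances(sequence, patterns):
    """Count every (possibly overlapping) instance of each pattern in the sequence.

    Patterns are assumed to consist of the letters A, C, G, T and the
    wildcard '.', which matches any single nucleotide.
    """
    total = 0
    for pattern in patterns:
        body = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
        # A zero-width lookahead lets instances overlap one another.
        regex = re.compile("(?=" + body + ")")
        total += sum(1 for _ in regex.finditer(sequence))
    return total


def is_putative_precursor(sequence, patterns, threshold):
    """Report the sequence when the number of pattern instances exceeds the threshold."""
    return count_pattern_instances(sequence, patterns) > threshold


# Toy data, not actual precursor-specific patterns.
toy_patterns = ["AC.T", "GG"]
hits = count_pattern_instances("ACGTACGT", toy_patterns)
```

In practice the same counting would be applied to windows of a genomic sequence, with the supported segment (and possibly a flanking region) reported whenever the count exceeds the chosen threshold.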

In the present application, pattern-discovery techniques, such as those described above, have been used in conjunction with recently released, publicly available genomic sequences to predict microRNA precursor and mature microRNA sequences related to the following organisms: C. elegans (Wormbase release 140); D. melanogaster (Berkeley Drosophila Genome Project release 3.2); M. musculus (Ensembl assembly based on the NCBI 31 assembly); and H. sapiens (Ensembl assembly based on the NCBI 31 assembly). Namely, precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 57,431 derived from the genome of H. sapiens are presented; precursor sequences having SEQ ID NO: 57,432 through SEQ ID NO: 101,967 derived from the genome of M. musculus are presented; precursor sequences having SEQ ID NO: 101,968 through SEQ ID NO: 103,203 derived from the genome of D. melanogaster are presented; precursor sequences having SEQ ID NO: 103,204 through SEQ ID NO: 103,948 derived from the genome of C. elegans are presented; mature sequences having SEQ ID NO: 1 through SEQ ID NO: 69,388 derived from the genome of H. sapiens are presented; mature sequences having SEQ ID NO: 69,389 through SEQ ID NO: 124,057 derived from the genome of M. musculus are presented; mature sequences having SEQ ID NO: 124,058 through SEQ ID NO: 125,536 derived from the genome of D. melanogaster are presented; and mature sequences having SEQ ID NO: 125,537 through SEQ ID NO: 126,499 derived from the genome of C. elegans are presented.
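The SEQ ID NO ranges listed above lend themselves to a simple lookup. The sketch below encodes the upper bound of each range exactly as stated in this description; the table and function names are assumptions made for illustration.

```python
from bisect import bisect_left

# (last SEQ ID NO in range, organism), taken from the ranges stated in the description.
PRECURSOR_RANGES = [
    (57431, "H. sapiens"),
    (101967, "M. musculus"),
    (103203, "D. melanogaster"),
    (103948, "C. elegans"),
]
MATURE_RANGES = [
    (69388, "H. sapiens"),
    (124057, "M. musculus"),
    (125536, "D. melanogaster"),
    (126499, "C. elegans"),
]


def organism_for(seq_id, ranges):
    """Return the organism whose SEQ ID NO range contains seq_id."""
    upper_bounds = [upper for upper, _ in ranges]
    if seq_id < 1 or seq_id > upper_bounds[-1]:
        raise ValueError("SEQ ID NO %d is outside the listed ranges" % seq_id)
    # bisect_left finds the first range whose upper bound is >= seq_id.
    return ranges[bisect_left(upper_bounds, seq_id)][1]
```

For example, `organism_for(57432, PRECURSOR_RANGES)` resolves to M. musculus, the first precursor range after the H. sapiens block.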

These predicted precursor and mature sequences are submitted herewith in electronic text format as the files ALL_MATURES.txt, created on Friday, Feb. 10, 2006, having a size of 11.4 Megabytes, and ALL_PRECURSORS.txt, created on Friday, Feb. 10, 2006, having a size of 14.5 Megabytes, on compact disc (CDROM), the contents of which are incorporated by reference herein. Two identical copies of the sequences are submitted herewith.

With respect to the sequences submitted herewith, for each precursor sequence that is listed, five features are presented (in addition to the sequence ID number and the corresponding organism name). One, the chromosome number (e.g., the chromosome identifier) is displayed. Two, the precursor start and end points on the corresponding chromosome are denoted. Three, the strand, either forward or reverse, on which the precursor will be found, is listed. Four, since the sequences displayed are predicted to fold into hairpin-like shapes, the predicted folding energy (also known as the energy required to denature the precursor) is presented. Five, each precursor sequence is presented.

As above, with respect to the sequences submitted herewith, for each mature sequence predicted, six features are presented (in addition to the sequence ID number and the corresponding organism name). One, as above, the chromosome number, (e.g., chromosome identifier) is displayed. Two, as above, the start and end points of the corresponding precursor sequence on the corresponding chromosome are denoted. Three, as above, the strand, either forward or reverse, on which the corresponding precursor will be found is listed. Four, as above, since the sequences displayed are derived from precursors which are predicted to fold into hairpin-like shapes, the folding energy of the corresponding precursor (also known as the energy required to denature the precursor) is presented. Five, the start and end points of the mature sequence on the corresponding chromosome are denoted. Six, each mature sequence is presented.
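The features enumerated above for precursor and mature entries can be pictured as two record types. The sketch below is only one possible in-memory representation: the class and field names are assumptions, the layout of the submitted text files is not described here and is not modeled, and the example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecursorRecord:
    seq_id: int            # sequence ID number
    organism: str          # corresponding organism name
    chromosome: str        # feature one: chromosome identifier
    start: int             # feature two: precursor start on the chromosome
    end: int               # feature two: precursor end on the chromosome
    strand: str            # feature three: "forward" or "reverse"
    folding_energy: float  # feature four: predicted folding energy
    sequence: str          # feature five: the precursor sequence (DNA)


@dataclass(frozen=True)
class MatureRecord:
    seq_id: int
    organism: str
    chromosome: str                  # feature one
    precursor_start: int             # feature two
    precursor_end: int               # feature two
    strand: str                      # feature three
    precursor_folding_energy: float  # feature four
    mature_start: int                # feature five
    mature_end: int                  # feature five
    sequence: str                    # feature six: the mature sequence (DNA)

    def lies_within_precursor(self):
        """Sanity check: the mature coordinates should fall inside the precursor span."""
        return (self.precursor_start <= self.mature_start
                <= self.mature_end <= self.precursor_end)


# Invented values, used only to exercise the record type.
example = MatureRecord(1, "H. sapiens", "1", 1000, 1080, "forward", -30.0,
                       1010, 1031, "ACGTACGTACGTACGTACGTAC")
```

Keeping the precursor coordinates on each mature record mirrors the listing described above, in which every mature entry repeats the location, strand, and folding energy of its precursor.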

All of the sequences presented herein, whether precursors or matures, are deoxyribonucleic acid (DNA) sequences. One of ordinary skill in the art would easily be able to derive the RNA transcripts corresponding to these DNA sequences. As such, the RNA forms of these DNA sequences are considered to be within the scope of the present teachings. Also, it should be understood that the locations of the described sequences are given in the form of global coordinates, i.e., in terms of distances from the leftmost tip of the forward strand in the chromosome where the sequence at hand is located. In other words, all of the stated coordinates use the beginning of each chromosome's forward strand as a point of reference. If a sequence is reported to be on the reverse strand between locations X and Y, then one can generate its nucleotide sequence by excising the string contained between locations X and Y of the forward strand and then generating its reverse complement. These global coordinates are likely to change from one release of the genomic assembly to the next. Nonetheless, even though its location may change, the sequence that corresponds to a microRNA precursor or a mature microRNA is expected to remain unique and thus the corresponding sequence's new location will still be identifiable (except of course for the case where the sequence at hand corresponds to a segment that has been removed from the genomic assembly that is being examined).
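The coordinate convention and the derivation of an RNA form can be sketched in a few lines. The sketch assumes that coordinates are 1-based and inclusive, which is not stated above, and that a listed DNA sequence is written in the same orientation as its transcript, so that the RNA form is obtained by replacing T with U. The function names and the toy chromosome string are likewise assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement each nucleotide, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_sequence(forward_strand, x, y, strand):
    """Return the sequence between global coordinates x and y.

    Coordinates are measured from the start of the forward strand and are
    assumed here to be 1-based and inclusive.
    """
    segment = forward_strand[x - 1:y]
    if strand == "forward":
        return segment
    if strand == "reverse":
        return reverse_complement(segment)
    raise ValueError("strand must be 'forward' or 'reverse'")


def to_rna(dna):
    """RNA form of a DNA sequence, assuming it is written in transcript orientation."""
    return dna.replace("T", "U").replace("t", "u")


toy_chromosome = "AACCGGTT"  # toy forward strand, not real genomic data
forward_seq = extract_sequence(toy_chromosome, 2, 5, "forward")
reverse_seq = extract_sequence(toy_chromosome, 2, 5, "reverse")
```

Because coordinates shift between assembly releases, a stored sequence can also be relocated by searching a new assembly for the sequence itself and for its reverse complement.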

One of ordinary skill in the art would also recognize that sequences that are either homologous or orthologous to the sequences presented herein, e.g., sequences that are related by vertical descent from a common ancestor or through other means (e.g., through horizontal gene transfer), will likely be present in genomes other than the ones mentioned herein. Such homologous/orthologous sequences are expected to generally differ from the sequences listed herein by only a small number of locations. Thus, the teachings presented herein should be construed as being broadly applicable to such homologous/orthologous sequences from species other than those listed above.

According to an exemplary embodiment, nucleic acid molecules may be generated based on the predicted precursor and mature sequences. The nucleic acid molecules generated may then be used to regulate gene expression. For example, as described generally above, mechanisms exist by which RNA molecules affect the expression of genes. By way of example only, the nucleic acid molecules generated may regulate the expression of a gene, or genes, by inducing post-transcriptional silencing of the gene, e.g., as described above. Using the predicted precursor and mature sequences to study gene expression may be conducted using techniques and procedures commonly known to those skilled in the art.

Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention. For example, one may modify one or more of the described precursor sequences by adding or removing a number of nucleotides which is small enough to not significantly or radically alter the original sequence's behavior. The percentage of the resulting portion, with respect to the complete original sequence, that does not significantly alter such behavior depends on the sequence under consideration. Or one may insert one or more of the described mature sequences in an appropriately constructed “container sequence” (e.g., a precursor-like construct that is different than the precursor where this mature sequence naturally occurs) that still permits the excision of effectively the same mature sequence and thus the generation of an active molecule whose action is essentially unchanged with respect to that of the molecule corresponding to the starting mature sequence.

Claims

1. At least one nucleic acid molecule, comprising:

at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948 and one or more mature sequences having SEQ ID NO: 103,949 through SEQ ID NO: 230,447.

2. The at least one nucleic acid molecule of claim 1, wherein one or more of the at least one of one or more precursor sequences and one or more mature sequences have been computationally predicted using a pattern discovery method.

3. The at least one nucleic acid molecule of claim 1, wherein one or more of the at least one of one or more precursor sequences and one or more mature sequences regulate gene expression in one or more genes by inducing post-transcriptional silencing of the one or more genes.

4. The at least one nucleic acid molecule of claim 1, wherein one or more of the sequences encode ribonucleic acid sequences.

5. The at least one nucleic acid molecule of claim 1, wherein one or more of the sequences encode interfering ribonucleic acid sequences.

6. The at least one nucleic acid molecule of claim 1, wherein the precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 57,431 are derived from a genomic sequence corresponding to H. sapiens.

7. The at least one nucleic acid molecule of claim 1, wherein the precursor sequences having SEQ ID NO: 57,432 through SEQ ID NO: 101,967 are derived from a genomic sequence corresponding to M. musculus.

8. The at least one nucleic acid molecule of claim 1, wherein the precursor sequences having SEQ ID NO: 101,968 through SEQ ID NO: 103,203 are derived from a genomic sequence corresponding to D. melanogaster.

9. The at least one nucleic acid molecule of claim 1, wherein the precursor sequences having SEQ ID NO: 103,204 through SEQ ID NO: 103,948 are derived from a genomic sequence corresponding to C. elegans.

10. The at least one nucleic acid molecule of claim 1, wherein the mature sequences having SEQ ID NO: 103,949 through SEQ ID NO: 173,336 are derived from a genomic sequence corresponding to H. sapiens.

11. The at least one nucleic acid molecule of claim 1, wherein the mature sequences having SEQ ID NO: 173,337 through SEQ ID NO: 228,005 are derived from a genomic sequence corresponding to M. musculus.

12. The at least one nucleic acid molecule of claim 1, wherein the mature sequences having SEQ ID NO: 228,006 through SEQ ID NO: 229,484 are derived from a genomic sequence corresponding to D. melanogaster.

13. The at least one nucleic acid molecule of claim 1, wherein the mature sequences having SEQ ID NO: 229,485 through SEQ ID NO: 230,447 are derived from a genomic sequence corresponding to C. elegans.

14. A method for regulating gene expression, the method comprising the step of:

using at least one nucleic acid molecule, comprising at least one of one or more precursor sequences having SEQ ID NO: 1 through SEQ ID NO: 103,948 or one or more mature sequences having SEQ ID NO: 103,949 through SEQ ID NO: 230,447 to regulate the expression of one or more genes.

15. The method of claim 14, further comprising the step of inserting the at least one nucleic acid molecule into an environment where the at least one nucleic acid molecule can be produced biochemically.

16. The method of claim 14, further comprising the step of inserting the at least one nucleic acid molecule into an environment where the at least one nucleic acid molecule can be produced biochemically, giving rise to one or more interfering ribonucleic acids which affect one or more target sequences.

17. The method of claim 14, wherein one or more of the sequences are synthetically removed from the genome that contains them naturally.

18. The method of claim 14, wherein one or more of the sequences are synthetically introduced in a genome that does not contain them naturally.

19. The method of claim 14, wherein one or more target sequences are encoded by a same genome as the one or more sequences.

20. The method of claim 14, wherein one or more target sequences are encoded by a different genome from the one or more sequences.

21. The method of claim 14, wherein one or more sequences are transcribed, giving rise to one or more interfering ribonucleic acids which induce post-transcriptional repression of one or more target sequences.

22. The method of claim 14, wherein the one or more sequences are transcribed, giving rise to one or more interfering ribonucleic acids which induce gene silencing of one or more target sequences.

23. The method of claim 14, wherein one or more target sequences are naturally occurring.

24. The method of claim 14, wherein one or more target sequences are synthetically constructed.

25. The method of claim 14, wherein one or more of the sequences are synthetically constructed.

26. At least one nucleic acid molecule, comprising:

at least a portion of a precursor sequence having one of SEQ ID NO: 1 through SEQ ID NO: 103,948, wherein the portion comprises an amount of the sequence that does not significantly alter a behavior of the complete precursor sequence.

27. At least one nucleic acid molecule, comprising:

at least a portion of a mature sequence having one of SEQ ID NO: 103,949 through SEQ ID NO: 230,447, wherein the portion comprises an amount of the sequence that does not significantly alter a behavior of the complete mature sequence.