BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR
A system and method for facilitating processing of a request in a biological data network comprised of a plurality of biological data units stored at a plurality of network nodes is disclosed herein. The disclosed method includes receiving, at a first network node included within the plurality of network nodes, the request from a client device wherein the first network node is configured to communicate with other network nodes included within the plurality of network nodes. The method also includes performing a first processing operation with respect to at least one of the biological data units based upon the request. Upon a determination being made that the processing of the request is complete, a response is sent from the first network node to the client device.
The present application claims the benefit of priority under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 61/451,086, entitled BIOLOGICAL DATA NETWORK, filed on Mar. 9, 2011, of U.S. Provisional Patent Application Ser. No. 61/539,942, entitled SYSTEM AND METHOD FOR SECURE, HIGHSPEED TRANSFER OF VERY LARGE FILES, filed Sep. 27, 2011, and of U.S. Provisional Patent Application Ser. No. 61/539,931, entitled SYSTEM AND METHOD FOR FACILITATING NETWORK-BASED TRANSACTIONS INVOLVING SEQUENCE DATA, filed Sep. 27, 2011, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is related to U.S. Utility patent application Ser. No. 12/837,452, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA, filed on Jul. 15, 2010, which claims priority to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, and to U.S. Utility patent application Ser. No. 12/828,234, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA, filed on Jun. 30, 2010, which claims priority to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is also related to U.S. Utility patent application Ser. No. 13/223,077, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,084, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,088, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 
13/223,092, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, and to U.S. Utility patent application Ser. No. 13/223,097, entitled METHODS AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION, filed on Aug. 31, 2011, the content of each of which is hereby incorporated by reference herein in its entirety for all purposes. This application is also related to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012, and to U.S. Utility patent application Ser. No. ______, entitled BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR, filed on Mar. 9, 2012.
FIELD
This application is generally directed to processing and networking polymeric sequence information, including biopolymeric sequence information such as DNA sequence information.
BACKGROUND
Deoxyribonucleic acid (“DNA”) sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of DNA sequences is invaluable in basic biological research as well as in numerous applied fields such as, but not limited to, medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and life sciences.
Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two-dimensional chromatography. Due to the initial difficulties in sequencing in the early 1970s, the cost and speed could be measured in scientist years per nucleotide base as researchers set out to sequence the first restriction endonuclease site containing just a handful of bases. Thirty years later, the entire 3.2 billion bases of the human genome have been sequenced, with a first complete draft of the human genome done at a cost of about three billion dollars. Since then sequencing costs have rapidly decreased.
Today, the cost of sequencing the human genome is on the order of $5000 and is expected to hit the $1000 mark later this year with the results available in hours, much like a routine blood test. As the cost of sequencing the human genome continues to plummet, the number of individuals having their DNA sequenced for medical, as well as other purposes, will likely increase significantly. Currently, the nucleotide base sequence data collected from DNA sequencing operations are stored in multiple different formats in a number of different databases.
Such databases also contain annotations and other attribute information related to the DNA sequence data including, for example, information concerning single nucleotide polymorphisms (SNPs), gene expression, copy number variations and methylation sequence. Moreover, transcriptomic and proteomic data are also present in multiple formats in multiple databases. This renders it impractical to exchange and process the sources of genome sequence data and related information collected in various locations, thereby hampering the potential for scientific discoveries and advancements.
SUMMARY
In one aspect the disclosure relates to a method for facilitating processing of a request in a biological data network comprised of a plurality of biological data units stored at a plurality of network nodes. The method includes receiving, at a first network node included within the plurality of network nodes, the request from a client device wherein the first network node is configured to communicate with other network nodes included within the plurality of network nodes. The method also includes performing a first processing operation with respect to at least one of the biological data units based upon the request. Upon a determination being made that the processing of the request is complete, a response is sent from the first network node to the client device.
The disclosure also relates to a method for facilitating processing of genomic sequence information in a biological data network including a plurality of biological data units stored at a plurality of network nodes. The method includes receiving, at a network node included within the plurality of network nodes, a segment of a genome sequence of an organism and comparing the segment of the genome sequence to a reference sequence. The method further includes identifying sequence variants between the segment of the genome sequence and the reference sequence. A request for information relating to the sequence variants is then sent from the network node. The method further includes receiving, from another network node included within the plurality of network nodes, the information relating to the sequence variants.
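By way of illustration only, the comparison and variant-identification steps of this aspect might be sketched as follows. The sketch assumes the segment has already been aligned to the reference at a known offset and considers substitutions only; the names Variant and find_variants and the toy sequences are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    position: int  # 0-based position within the reference
    ref_base: str  # base found in the reference
    alt_base: str  # base found in the received segment


def find_variants(segment: str, reference: str, offset: int = 0) -> List[Variant]:
    """Compare a pre-aligned segment to the reference, base by base."""
    if offset < 0 or offset + len(segment) > len(reference):
        raise ValueError("segment does not fit within the reference")
    variants = []
    for i, base in enumerate(segment):
        ref_base = reference[offset + i]
        if base != ref_base:
            variants.append(Variant(offset + i, ref_base, base))
    return variants


reference = "ACGTACGTAC"  # toy reference sequence (assumption)
segment = "GTTCG"         # toy segment, assumed aligned at offset 2
variants = find_variants(segment, reference, offset=2)
print(variants)
```

In such an arrangement, only the resulting list of variants, rather than the segment itself, would need to accompany the request for information sent to another network node.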
In another aspect the disclosure pertains to a method for facilitating processing of a disease-related query within a biological data network including a plurality of biological data units stored at a plurality of network nodes. The method includes receiving, at a first network node of the plurality of network nodes, a query relating to a specified disease and a genomic sequence associated with the query. The method further includes identifying, relative to a control sequence, any variant alleles within the genomic sequence. Information identifying the variant alleles is then sent from the first network node to a second network node of the plurality of network nodes. The method further includes receiving, at the first network node, information relating to the variant alleles.
In yet another aspect the disclosure relates to a network-based method for facilitating processing of a disease-related query. The method includes receiving, at a first network node, a query relating to a specified disease and a genomic sequence associated with the query. The method further includes identifying, relative to a control sequence, any variant alleles within the genomic sequence. Information identifying the variant alleles is then sent from the first network node to a second network node. The method further includes receiving, at the first network node, pharmacological response data associated with those of the variant alleles included within genes associated with the specified disease and sending a response to the query based upon the pharmacological response data.
The disclosure also pertains to a method for facilitating processing of a disease-related query within a biological data network. The method includes receiving, at a network node, information identifying variant alleles within a genomic sequence associated with a query relating to a specified disease. The method further includes performing, at the network node, a statistical correlation analysis in order to identify those of the variant alleles included within genes associated with the specified disease. In addition, the method includes sending results of the statistical correlation to another network node for further processing.
In a further aspect the disclosure relates to a method for facilitating the processing of biological data within a network including a plurality of nodes. The method includes receiving, at a first node of the plurality of nodes, a request to process the biological data. The method further includes performing a first processing operation with respect to at least a DNA-specific layer of the biological data based upon the request. In addition, the method includes sending, to a second node of the plurality of nodes, results of the first processing operation wherein the second node is configured for processing of an RNA-specific layer of the results.
In another aspect the disclosure is directed to a network node for use within a biological data network comprised of a plurality of biological data units. The network node includes a network interface and an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a request from a client device. The network node further includes a processing module configured to perform a first processing operation with respect to at least one of the biological data units based upon the request and to determine that the processing of the request is complete. In addition, the network node includes a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of a response to the client device.
The disclosure also pertains to a network node for use within a biological data network comprised of a plurality of biological data units. The network node includes a network interface and an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive an input packet including a segment of a genome sequence of an organism. The network node also includes a processing module configured to compare the segment of the genome sequence to a reference sequence and to identify sequence variants between the segment of the genome sequence and the reference sequence. In addition, the network node also includes a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of a request for information relating to the sequence variants. The network interface is further configured to receive, from another network node included within the plurality of network nodes, the information relating to the sequence variants.
In a further aspect the disclosure pertains to a network node for use within a biological data network comprised of a plurality of biological data units. The network node includes a network interface and an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a query relating to a specified disease and a genomic sequence associated with the query. The network node also includes a processing module configured to identify, relative to a control sequence, any variant alleles within the genomic sequence. The network node further includes a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of information identifying the variant alleles to another network node wherein the network interface is further configured to receive, from the another network node, information relating to the variant alleles.
In another aspect the disclosure pertains to a network node for facilitating processing of a disease-related query. The network node includes a network interface and an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a query relating to a specified disease and a genomic sequence associated with the query. The network node further includes a processing module configured to identify, relative to a control sequence, any variant alleles within the genomic sequence. In addition, the network node includes a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of information identifying the variant alleles to another network node and the sending of a response to the query based upon pharmacological response data associated with those of the variant alleles included within genes associated with the specified disease.
In yet another aspect, the disclosure is directed to a network node for facilitating processing of a disease-related query. The network node includes a network interface and an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive information identifying variant alleles within a genomic sequence associated with a query relating to a specified disease. A processing module is configured to perform a statistical correlation analysis in order to identify those of the variant alleles included within genes associated with the specified disease. The network node further includes a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of results of the statistical correlation to another network node for further processing.
Various objects and advantages and a more complete understanding of the disclosure are apparent and more readily appreciated by reference to the following Detailed Description and to the appended claims when taken in conjunction with the accompanying Drawings wherein:
This disclosure relates generally to an innovative new biological data network and related methods capable of efficiently handling the massive quantities of DNA sequence data and related information expected to be produced as sequencing costs continue to decrease. The disclosed network and approaches permit such sequence data and related medical or other information to be efficiently stored in data containers provided at either a central location or distributed throughout a network, and facilitate the efficient network-based searching, transfer, processing, management and analysis of the stored information in a manner designed to meet the demands of specific applications.
The disclosed approaches permit such sequence data and any related medical, biological, referential or other information, be it computed, human-entered/directed or a combination thereof, to be efficiently transmitted and/or shared or otherwise conveyed from a centralized location or either partly or wholly distributed throughout the biological data network. These approaches also facilitate data formats and encodings used in the efficient processing, management and analysis of various “omics” (i.e., proto/onco/pharma) information. The innovative new biological data network or, equivalently, BioIntelligence network, is configured to operate with respect to biological data units stored at various network locations.
Each biological data unit will generally be comprised of one or more BioIntelligence headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest. The term header in this context refers to one or more pieces of information that have relevance to the payload, without regard to how or where such information is physically stored or represented within the BioIntelligence network. As is discussed below, it will be appreciated that certain operations performed by the nodes or elements of the biological data network may be effected with respect to the entirety of the biological data units undergoing processing; that is, with respect to representations of both the segmented sequence data and BioIntelligence headers of such biological data units.
However, the elements of the biological data network may perform other operations by, for example, comparing or correlating only the BioIntelligence headers of the biological data units being processed. In this way network bandwidth may be conserved by obviating the need for network transport of segmented biological sequence data, or some representation thereof, in connection with various processing operations involving biological units nominally stored at different network locations.
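A minimal sketch of such a header-only operation is given below, purely for purposes of illustration. It assumes that BioIntelligence headers can be represented as simple key/value pairs; the class BiologicalDataUnit, the function shared_headers and all header names and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class BiologicalDataUnit:
    unit_id: str
    headers: Dict[str, str]           # BI headers: small, inexpensive to transmit
    payload: str = field(repr=False)  # segmented sequence data: potentially very large


def shared_headers(a: Dict[str, str], b: Dict[str, str]) -> Set[str]:
    """Return the header keys on which two units agree, using headers only."""
    return {key for key, value in a.items() if b.get(key) == value}


# Two units nominally stored at different network nodes (toy data).
unit_a = BiologicalDataUnit(
    "unit-a", {"gene": "GENE_X", "chromosome": "chr7", "sample": "S1"}, "ACGT" * 1000
)
unit_b = BiologicalDataUnit(
    "unit-b", {"gene": "GENE_X", "chromosome": "chr7", "sample": "S2"}, "TTGA" * 1000
)

# Only the headers would cross the network; the payloads stay where they are stored.
common = shared_headers(unit_a.headers, unit_b.headers)
print(sorted(common))
```

Here the correlation is computed from a few short strings per unit, while the payloads of several thousand characters each are never read or transferred.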
The biological data network may be comprised of a plurality of network nodes configured with processing and analytical capabilities, which are individually or collectively capable of responding to machine or user queries or requests for information. As is discussed below, the functionality of the new biological data network may be integrated into the current architectural framework of the Open Systems Interconnection (OSI) seven-layer model and the Transmission Control Protocol and Internet Protocol (TCP/IP) model for network and computing communications. This will allow service providers to configure existing network infrastructure to accommodate biological sequence data to deliver optimized quality of service for medical and health professionals practicing genomics-based personalized medicine. Alternatively or in addition, the new biological data network may be realized as an Internet-based overlay network capable of providing biological, medical and health-related intelligence to applications supported by the network.
The new biological data network facilitates overcoming the daunting challenges associated with analysis of various pertinent omics data types together with, and in the context of, all relevant, available prior knowledge. In this regard the new biological data network may facilitate development of an integrated ecosystem in which distributed databases are accessible on a network and in which the data stored therein is configured to be linked by BioIntelligence. This new biological data network may enable, for example, forming, securing, linking, searching, filtering, sorting, aggregating and connecting an individual's genome data with a layered data model of existing knowledge in order to facilitate extraction of new and meaningful information.
Overview of Biological Data Units and BioIntelligence Headers
As disclosed herein, the innovative new biological data network is configured to operate with respect to biological data units stored at various network locations. Biological data units can be considered as a set of information that is known or can be predicted to be associated with certain segments of genome sequences. Biological data units will generally be comprised of one or more BioIntelligence headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest.
The biological data units may be generated by dividing source DNA sequences into segments and associating one or more BioIntelligence headers (also referred to herein as “BI headers” or annotations or attributes) with one or more segments of genome sequence data. The various component parts of the header information contained in biological data units can be stored as XML metadata files in distributed storage containers that are accessible on a network. Furthermore, the different segments of whole genome sequence data contained in the payload of biological data units may be stored in multiple BAM files at various different locations on a network.
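The generation of biological data units by segmentation and header association might be sketched as follows. This is an illustrative sketch only: it assumes fixed-length segments and annotations expressed as half-open coordinate intervals, and the function names, the dictionary layout and the annotation shown are hypothetical.

```python
from typing import Dict, List, Tuple

# (start, end, key, value), with half-open coordinates on the source sequence
Annotation = Tuple[int, int, str, str]


def segment_sequence(sequence: str, segment_length: int) -> List[Tuple[int, str]]:
    """Divide a source sequence into fixed-length segments with their start offsets."""
    return [
        (start, sequence[start:start + segment_length])
        for start in range(0, len(sequence), segment_length)
    ]


def build_units(sequence: str, segment_length: int,
                annotations: List[Annotation]) -> List[Dict]:
    """Associate each segment with the headers of every annotation that overlaps it."""
    units = []
    for start, segment in segment_sequence(sequence, segment_length):
        end = start + len(segment)
        headers = {
            key: value
            for a_start, a_end, key, value in annotations
            if a_start < end and start < a_end  # interval overlap test
        }
        units.append({"start": start, "payload": segment, "headers": headers})
    return units


source = "ACGTACGTACGT"                   # toy source sequence
annotations = [(2, 6, "gene", "GENE_X")]  # hypothetical annotation
units = build_units(source, 4, annotations)
for unit in units:
    print(unit)
```

In this toy case the annotation spans the boundary between the first two segments, so both resulting units carry the same header while the third carries none.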
Each BI header can be considered a specific piece of information or set of information that may be associated with or have biological relevance to one or more specific segments of DNA sequence data within the payload of the biological data unit. It should be appreciated that any information that is relevant to the segmented sequence data payload of a biological data unit can be placed in the one or more BioIntelligence headers of the data unit or, as is discussed below, within BioIntelligence headers of other biological data units. It should also be clearly understood that the information contained in any biological data unit can be highly distributed and network linked in such a manner that allows filtration and dynamic recombination of any permutation of associated attributes and sequence segments.
The BioIntelligence headers may be arranged in any order, whether dependent upon or independent of the payload data. However, in one embodiment the BioIntelligence headers are each respectively associated with at least one layer of a biological data model of existing knowledge that is representative of the biological sequence data which, for example, may be stored as BAM files within the payloads of the distributed biological data units with which such headers or XML metadata attributes are associated.
Although the present disclosure provides specific examples of the use of BI headers in the context of a layered data model, it should be understood that BI headers may be realized in essentially any form capable of embedding information within, or associating such information with, all or part of any biological or other polymeric sequence or plurality thereof. For example, one or more BI headers could be associated with any permutation of segments of DNA sequence or other such polymeric sequence or within any combination thereof, in any analog or digital format.
The BI headers could also be placed within a representation of associated polymeric sequence data, or could be otherwise associated with any electronic file or other electronic structure representative of molecular information. In other words, the one or more metadata attributes that are stored in multiple storage containers on a network may compose BioIntelligence headers that are specifically associated with at least one segment of sequence contained in a file transfer session.
In the case in which BioIntelligence data is embedded within DNA or other biological sequence information, the BI headers or tags including the BioIntelligence data may be placed in front of, behind or in any arbitrary position within any particular segmented sequence data or multiple segmented data sequences. In other words, in one particular embodiment of the invention, information that is associated directly or indirectly may be stored within the base calls of reads that are contained in BAM files or any other sequence file format or internal memory structures, for example. This approach would involve a method for integrating at least one specific attribute of information that is associated with a genome sequence between and/or among the base calls contained within reads of sequence data files.
In addition, the BioIntelligence data may be embedded in a contiguous or dispersed manner among and within the base calls of the segmented sequence data. When this highly structured and layered approach is applied to the storage configuration of this sequence data and associated information, it will advantageously facilitate the computationally efficient, effective and rapid analysis of, for example, the massive quantities of genome sequence data being generated by next-generation, high-throughput DNA sequencing machines.
In particular, distributed biological data units containing segmented DNA sequence data and associated attributes may be stored, sorted, filtered and operated on for various scope and depth of analysis based upon the said associated information which is contained within the BioIntelligence headers. This obviates the need to manipulate, transfer and otherwise breach the security of the segmented DNA sequence data in order to process and analyze such data.
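For illustration, filtering and sorting on header information alone might look like the sketch below. It assumes each unit exposes its headers as a dictionary and refers to its payload only by a storage location; the function select_units, the header names and the locations are hypothetical.

```python
from typing import Any, Dict, List

# Each entry carries headers plus a pointer to where the payload is stored.
units: List[Dict[str, Any]] = [
    {"unit_id": "u1", "payload_location": "node-a/seg1.bam",
     "headers": {"chromosome": "chr7", "variant_count": 3}},
    {"unit_id": "u2", "payload_location": "node-b/seg2.bam",
     "headers": {"chromosome": "chr1", "variant_count": 9}},
    {"unit_id": "u3", "payload_location": "node-c/seg3.bam",
     "headers": {"chromosome": "chr7", "variant_count": 5}},
]


def select_units(all_units: List[Dict[str, Any]], **required: Any) -> List[Dict[str, Any]]:
    """Keep the units whose headers match every required key/value pair."""
    return [
        unit for unit in all_units
        if all(unit["headers"].get(key) == value for key, value in required.items())
    ]


selected = select_units(units, chromosome="chr7")
ranked = sorted(selected, key=lambda u: u["headers"]["variant_count"], reverse=True)
print([u["unit_id"] for u in ranked])
```

Neither the filter nor the sort opens a payload file; only the payload locations of the selected units would be needed if a later step required the sequence data itself.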
One embodiment of the layered data model of the existing body of relevant knowledge includes not only BioIntelligence of or pertaining to biologically-relevant data but also other metadata which are associated with the nucleic acid sequence files. Such MetaIntelligence™ metadata may include, for example, facts, information, knowledge and prediction derived from biological, clinical, pharmacological, environmental, medical or other health-related data, including but not limited to other biological sequence data such as methylation sequence data as well as information on differential expression, alternative splicing, copy number variation and other related information.
The DNA sequence information included within the biological data units described herein may be obtained from a variety of sources. For example, DNA sequence information may be obtained “directly” from DNA sequencing apparatus, as well as from sequence data files that are stored in private and publicly accessible genome data repositories. Additionally, it may be computationally derived and/or manually gathered or inferred. In the case of the database of Genotypes and Phenotypes at the National Center for Biotechnology Information at the National Library of Medicine, the DNA sequence entries may be stored as BAM, SRF, fastq as well as in the FASTA format, which includes annotated information concerning the sequence data files. In one embodiment certain of the information contained within the one or more BioIntelligence headers of each biological data unit would be obtained from publicly accessible databases containing genome data sequences.
Turning now to
In addition, it should be understood that the BioIntelligence header information and sequence payload that is contained within biological data units relate directly to attributes in XML metadata files and BAM sequence files, respectively. Any key value can associate with one or more sequence files or segments of sequence within such files. In one particular aspect of the disclosed approach, the key value may be information of or pertaining to a drug or its effect and the sequence may be a segment of sequence contained in a genomic sequence object file transfer session.
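One way such a key-value association could be represented is sketched below, for illustration only. The sketch assumes an in-memory index keyed by (key, value) pairs; the class HeaderIndex, its method names, the drug-response key and the file names are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

SegmentRef = Tuple[str, int, int]  # (sequence file, start, end)


class HeaderIndex:
    """Associates a key value with one or more segments of one or more sequence files."""

    def __init__(self) -> None:
        self._index: DefaultDict[Tuple[str, str], Set[SegmentRef]] = defaultdict(set)

    def associate(self, key: str, value: str, file_name: str, start: int, end: int) -> None:
        self._index[(key, value)].add((file_name, start, end))

    def lookup(self, key: str, value: str) -> List[SegmentRef]:
        # .get avoids creating empty entries for keys that were never associated
        return sorted(self._index.get((key, value), set()))


index = HeaderIndex()
index.associate("drug_response", "DRUG_Y:reduced", "sample1_chr7.bam", 1000, 1500)
index.associate("drug_response", "DRUG_Y:reduced", "sample2_chr7.bam", 1000, 1500)
hits = index.lookup("drug_response", "DRUG_Y:reduced")
print(hits)
```

A single key value here resolves to segments held in two different files, which mirrors the one-to-many association described above.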
The BioIntelligence header information may associate with or relate to for example a microRNA sequence or the regulatory region of a gene or interaction with another gene product from at least one molecular pathway. Since the example that is presented as
In one embodiment, the exemplary biological data unit that is depicted in
Similarly, a protein protocol data unit (PPDU) may be comprised of peptide-specific BioIntelligence headers and a payload containing a representation of amino acid sequence data. The biological sequence data that is contained in the payload of PPDUs may be from mass spectrometry protein sequencing data or deduced from the DNA sequence data of the DPDU of
Attention is now directed to
Alternatively, a given biological data unit which may be stored in multiple storage containers may comprise a payload containing a representation of biological sequence data and a plurality of BioIntelligence headers, each of which is associated with one or more of the layers of the biological data model of
BioIntelligence headers may be associated with any form of intelligence or information capable of being represented as headers, tags or other parametric information which relates to the biological sequence data within the payload of a biological data unit. Alternatively or additionally, BioIntelligence headers may point to relevant or unique (or arbitrarily assigned for the processing purpose) information that is associated with the biological sequence data within the payload.
A BioIntelligence header may be associated with any information which is either known or predicted based upon scientific evidence, and may also serve as a placeholder for information which is currently unknown but which later may be discovered or otherwise becomes known. For example, such information may include any type of information related to the source biological sequence data including, for example, analytical or statistical information, testing-based data such as gene expression data from microarray analysis, theories or facts based on research and studies (either clinical or laboratory), or information at the community or population level based study or any such related observation from the wild or nature.
In one embodiment relevant information concerning a certain segment of DNA sequence or biological sequence data may be considered metadata and could, for example, include clinical, pharmacological, phenotypic or environmental data capable of being embedded and stored in more than one storage container but with very close association with the sequence data as part of the payload or included within a look-up table.
One distinct advantage to storing metadata and sequence files in a manner that allows for effective and robust tracking and linking of the data is that it enables DNA and other biological sequences that make up large data files to be more efficiently processed and managed. The type of information that may be embedded or associated with segments of DNA sequences or any other biological, chemical or synthetic polymeric sequence can be represented in the form of packet headers, but any other format or method capable of representing this information in association with one or more segments of biological sequence data within a data unit is within the scope of the teachings presented herein.
The systems described herein are believed to be capable of facilitating real-time processing of biological sequence data and other related data such as, for example and without limitation, gene expression data, deletion analysis from comparative genomic hybridization, quantitative polymerase chain reaction, quantitative trait loci data, CpG island methylation analysis, alternative splice variants, microRNA analysis, SNP and copy number variation data as well as mass spectrometry data on related protein sequence and structure. Such real-time processing capability may enable a variety of applications including, for example, medical applications.
The types of medical applications that could be facilitated by this approach may include an automated computer-implemented algorithm that allows the storing, filtering, sorting and tracking of an individual's whole genome sequence in segments as they relate to all the attributes and annotations in association with a biological data model of existing knowledge to extract meaningful and relevant results to specific queries. The processing and analysis of this data will unveil a new class of rich BioIntelligence information that can be utilized in accordance with the layered data model of prior knowledge.
BI headers may be used for the embedding of biologically relevant information, in full or in part, in combination with any polymeric sequence or part or combination thereof, and may be placed at either end of such polymeric sequence or in association within any combination of such polymeric sequences. In addition, embedded information can be considered to be information that is clustered and linked in such a way that relevant information related to sequence data files is linked to allow for precipitation of meaningful new insight. Furthermore, the various components of the metadata information and sequence segments can be accessible from multiple storage containers on a network.
BI headers may be configured to be in any format and may be associated with one or more segments of polymeric sequence data. Furthermore, in certain cases the components of biological data units may be stored in a centralized container and in such case the BI Headers may be positioned in front of or behind (tail) the polymeric sequence data, or at any set of arbitrary locations within the representation of the segmented sequence data. Moreover, the BI headers may comprise contiguous strings of information or may be themselves segmented and the constituent segments placed (randomly or in accordance with a known pattern) among and between the segments of sequence data which is comprised within one or more biological data units.
The use of BI headers in representing genome sequence data in a structured format advantageously provides an enhanced capability for classifying and filtering the sequence data based upon any of several stored existing knowledge fields that are related to the said sequence segment. This approach allows for the sequence data to be sorted based on the abstracted descriptive information which is contained within the BI headers relating to the segmented sequence data of a specific biological data unit.
For example, the segmented genome sequence data represented by a plurality of biological data units could be processed such that, a particular gene that is normally known to be located at a certain position on chromosome 1 could be sorted along with other genes or gene products from the same or a different chromosome if the corresponding genes or gene products are associated with a particular molecular pathway, drug treatment, health condition, diagnosis, disease or phenotype. Alternatively, it should be known that certain chromosomal rearrangements could generate a similar result when a portion of one chromosome is transferred through translocation and becomes part of another.
In the general case not all of the segments of DNA sequence data within the set of biological data units resulting from segmentation of an individual genome will directly associate with every field of the applicable BI header attributes. For example, a certain biological data unit may contain a segment of DNA sequence lacking an open reading frame, in which case the exon count field of the DNA-specific BI header would not be applicable. In any case, the particular header information type along with other header information types are maintained as place holders for future scaling of the depth and scope of intelligence that is contained within the XML metadata files. This permits biological information relating to the segmented DNA sequence data of a certain biological data unit which is not yet known to be easily added to the appropriate layer of the biological data model once the information becomes known and, in certain cases, scientifically validated.
In certain exemplary embodiments disclosed herein, the biological or other polymeric sequence data contained within the payload of a biological data unit is represented in a two-bit binary format. However, it should be appreciated that other representations are within the scope of the teachings herein. For example, the instruction set architecture described in co-pending application Ser. No. 12/828,234 (the “'234 application”) may be employed in certain embodiments described herein to more efficiently represent and process the segmented genome sequence data within the payload of biological data units. Accordingly, in order to facilitate comprehension of these certain embodiments, a description is provided below of certain aspects of the instruction set architecture described in the '234 application.
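By way of illustration only, the two-bit binary representation mentioned above can be sketched as follows. The particular bit assignment (A=00, C=01, G=10, T=11), the four-bases-per-byte layout and the function names are assumptions made for this sketch; the disclosure does not specify them.

```python
# Hypothetical 2-bit code: the source does not say which bit pattern maps to which base.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking always reads from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

Because the final byte may carry padding, the sequence length has to be stored alongside the packed payload (for example, in a header field) for the round trip to be lossless.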
Overview of Instruction Set Architecture for Polymeric Sequence Processing
Set forth hereinafter are the general descriptions for the instruction set architectures comprised of instructions for processing biological sequences, as well as descriptions of associated biological sequence processing methods and apparatus configured to implement the instructions. The instructions may be recorded upon a computer storage medium, and a sequence processing system may contain the storage media and a processing apparatus which can be configured to implement the processing and analysis that is defined by the set of instructions that are designed specifically for operating on the associated attributes. In addition, a computer data storage product may contain sequence data encoded using instruction-based encoding in order to generate a biologically relevant representation of the segmented genome sequence data.
Also described herein is an article of manufacture in a system for processing biopolymeric information, where the article of manufacture comprises a machine readable method for comparative sequence analysis which comprises an instruction set architecture that includes a plurality of instructions for execution by a processor, each of the plurality of instructions being at least implicitly defined relative to at least one controlled sequence, and representative of a biological, chemical, medical, pharmacological, clinical, environmental or physical event affecting one or more aspects of a biopolymeric molecule.
The plurality of instructions may include a set of operation codes corresponding to the biological event and an operand relating to at least a portion of a monomeric unit of the biopolymeric sequence. The one or more aspects may include a monomer of the biological polymer molecule. The event that affects the one or more aspects may include a structural representation of the biopolymeric molecule. The biopolymeric molecule may comprise a segment of a DNA molecule and the monomer may comprise at least a portion of a nucleotide base of the DNA sequence.
Genomic-Based Instructions
Herein, genomic sequences are defined as sequences of data that are handwritten or stored digitally and that describe the genomic characteristics of a particular organism. The term “genomic” in general will refer to sequence data that encode genes (also referred to as “genetic” data) as well as data that is believed to be non-coding.
The phrase “a particular organism” will mean the organism from which cells were used to prepare DNA for sequencing. Cells will refer to any and all cell types integral to the particular organism, including normal cells and tumor cells, as well as cells from plants and animals that may be in the digestive tract of the organism. Furthermore, this will include bacteria, viruses and mobile DNA elements that are attached to the organism on the outside or inside. The terms “bacteria” and “viruses” will refer specifically to detection of any evidence of these microbial organisms' DNA sequences, which may be endogenous or exogenous.
The term “genome” will refer to an organism's entire hereditary information. Genomic sequencing is the process of determining a particular organism's genomic sequence. This term will further reference an organism's inheritable “genome” which will include methylation sequencing epigenomics data as well as microbiomics data and known or predicted non-Mendelian trans-generational transmission of RNA sequence data.
The human genome, as well as that of other organisms, can be generally thought of as being made of four chemical units called nucleotide bases (also referred to herein as “bases” for brevity). These bases are adenine(A), thymine(T), guanine(G) and cytosine(C). Double stranded sequences are made of paired nucleotide bases, where each base in one strand normally pairs with a base in the other strand, according to the Watson-Crick pairing rules, i.e., A pairs with T and C pairs with G (In RNA, Thymine is replaced with Uracil (U), which pairs with A and less often with G).
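The Watson-Crick pairing rules described above may be illustrated with a minimal sketch. The function names are hypothetical and only the four canonical bases are handled; ambiguity codes and the less frequent G-U pairing are outside the scope of this example.

```python
# Translation table implementing A<->T and C<->G pairing for DNA.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return seq.translate(DNA_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return complement(seq)[::-1]


def transcribe(seq: str) -> str:
    """Rewrite a DNA sequence in the RNA alphabet by replacing thymine with uracil."""
    return seq.replace("T", "U")
```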
A sequence is a series of bases, ordered as they are arranged in molecular DNA or RNA. For example, a sequence may include a series of bases arranged in a particular order, such as the following example sequence fragment: ACGCCGTAACGGGTAATTCA. The human haploid genome contains approximately 3.2 billion base pairs, which may be further broken down into a set of 23 chromosomes. It is approximated that the 23 chromosomes encode about 30,000 genes. While each individual's sequence is different, there is much redundancy between individuals of a particular genome, and in many cases there is also much redundancy across similar species. For example, in the human genome the sequences of two individuals are about 99.5% equivalent, and are therefore highly redundant. Viewed in another way, the number of differences in bases in sequences of different individuals is correspondingly small. These differences may include differences in the particular nucleotide at a position in the sequence, also known as a single nucleotide polymorphism or SNP, as well as addition, subtraction, or rearrangement or repeats or any genetic or epigenetic variation of nucleotides between individuals' sequences at corresponding positions in the sequences.
Because of the enormous size of the human genome, as well as the genomes of many other organisms, storing and processing genomic sequences (which are typically separate sequences generated from a particular individual or organism, but may also be a sequence fragment, sub-sequence, sequence of a particular gene coding sequence or non-coding sequences between genes, etc.) creates problems with processing, analysis, memory storage, data transmission, and networking. Consequently, it is usually beneficial to store the sequences in as little space as possible. At present, there are several well recognized efforts to develop efficient means of achieving the smallest footprint. Moreover, it is typically important that no information is lost in storage and transmission. Accordingly, processing for storage or transmission of whole or partial sequences should include removing redundant information in a sequence in a lossless fashion.
Variations in the DNA sequences of different individuals are a result of deviations (also known as mutations). For example, one particular type of mutation may relate specifically to substitutions of nucleotide bases at common or certain reference positions in the sequence. A base substitution (also known as a point mutation) is the result of one base in a sequence at a particular position or reference location being replaced with a different one (relative to another sequence, which may be a reference sequence from which other sequences are compared).
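As a simplified sketch of how base substitutions relative to a reference sequence can be recorded and losslessly reversed, consider the following. It assumes equal-length sequences that differ only by substitutions, uses 0-based positions, and the function names are hypothetical; insertions, deletions and rearrangements would need a richer representation.

```python
from typing import List, Tuple

Substitution = Tuple[int, str]  # (0-based position, base observed in the sample)


def diff_against_reference(reference: str, sample: str) -> List[Substitution]:
    """List the positions at which the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only (equal lengths)")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_substitutions(reference: str, subs: List[Substitution]) -> str:
    """Rebuild the sample sequence from the reference plus its substitutions."""
    bases = list(reference)
    for pos, base in subs:
        bases[pos] = base
    return "".join(bases)
```

Since two individuals' sequences are largely identical, the substitution list is typically far shorter than the sequence itself, while the original sample can still be reconstructed exactly.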
A base substitution can be either a transition (e.g., between G and A, or C and T) or a transversion (e.g., between G and its paired base C or a T, or between A and its paired base T or a C).
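The distinction can be made concrete with a short sketch: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any other substitution is a transversion. The function name is an assumption introduced for illustration.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base DNA substitution as a transition or a transversion."""
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    pair = {ref, alt}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"
```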
Representation of Polymeric Sequence Data Using Biological Data Units
One aspect of the present disclosure describes an innovative methodology for biological sequence manipulation well-suited to address the difficulties that are related to the processing and comparative sequence analysis of large quantities of DNA sequence data. The disclosed methodologies enable segmented representations of such sequence data to be efficiently stored (either locally or in a distributed fashion), searched, moved, processed, managed and analyzed in an optimal manner in light of the demands of specific applications.
The disclosed method involves breaking whole genome DNA sequence entries into deliberate segments and packetizing the fragments in association with BioIntelligence header information to form biological data units. In one embodiment much of the BioIntelligence header information may be obtained from private or public databases containing information pertaining to involved molecular pathways, drug databases, published research data that can be found in well-established databases such as, for example, dbGaP and EMBL. The DNA sequence entries within many public databases may be stored in a BAM file format, which accommodates the inclusions of annotated information concerning the sequence. For example, an entry for a DNA sequence recorded in the BAM file format could include annotated information identifying the name of the organism from which the DNA was isolated and the gene or genes contained in the specific sequence entry.
Alternatively, the sequence file may contain the base sequence information while the ancillary metadata information could be contained in XML files as specific attributes that are associated with a particular segment of the sequence. The associated information that is contained in these files may relate to prior knowledge that is configured in a biological model that is consistent with a layered data model.
In addition, information identifying the chromosome from which the particular DNA sequence segment was obtained, and the starting and ending base positions of the sequence, would also typically be available. Furthermore, other public and private databases include information relating to, for example, the location of human CpG islands and their methylation sequence, as well as the genes with which such islands are associated (see, e.g., http://data.microarrays.ca/cpg/index.htm).
For each identifiable gene there will be an essential need for a normal control state of the particular gene. Database entries that contain genes that are identified as being associated with a RefSeqGene, which pertains to a project within NCBI's Reference Sequence (RefSeq) project, provide another potential source of BioIntelligence header information. The RefSeqGene project, which is a part of the Locus Reference Genomic (LRG) project, defines the DNA sequences of genes that are well-characterized by leaders in the scientific community to be used as reference standards. In particular, sequences labeled with the keyword RefSeqGene serve as a stable foundation for reporting mutations, for establishing conventions for numbering exons and introns, and for defining the coordinates of other biologically significant variation. DNA sequence entries that associate directly with the RefSeqGene will be well-supported, exist in nature, and, to the extent possible, represent a prevalent, ‘normal’ allele.
It should be appreciated that there may be different schemas for segmentation and packetizing sequence entries in order to associate the highly relevant attribute information with specific sequence segments. For example, in the case in which it is suitable to segment sequence entries into packets containing genes or, alternatively, into introns and exons, relevant data is available for placement into the BioIntelligence header information relating to the metadata attributes of the biological data units containing such sequence segments.
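One possible segmentation schema, in which a sequence entry is split along annotated features such as exons and introns and each segment is packetized with header information, is sketched below. The dictionary layout, the field names, and the use of 0-based half-open coordinates are assumptions made for this example and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]  # (feature name, start, end), 0-based, end exclusive


def packetize(sequence: str, features: List[Feature], chromosome: str) -> List[Dict]:
    """Cut a sequence entry at feature boundaries and wrap each segment with a header."""
    units = []
    for name, start, end in features:
        if not 0 <= start < end <= len(sequence):
            raise ValueError(f"feature {name!r} lies outside the sequence")
        units.append({
            "header": {
                "feature": name,
                "chromosome": chromosome,
                "start_pos": start,
                "end_pos": end,
            },
            "payload": sequence[start:end],
        })
    return units
```

When the features tile the entry without gaps or overlaps, concatenating the payloads in order reproduces the original sequence, so the segmentation itself loses no information.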
Biological Data Units Including BioIntelligence Headers
Referring again to
In one approach, biological data units are created at least in part by specifically linking information from XML metadata files with particular segments of BAM file sequence data. In this case, each biological data unit can be considered a unit of information embodying a certain relationship, which can be stored at, or streamed to and from, multiple nodes on a network. The information contained within the BI header is then distributed and is able to link specifically with sequence segments. The protocols used for the transmission of these precisely related clusters of information in biological data units are integrated with a computer-implemented program that defines and classifies the links between and among the BioIntelligence header information and the segments of sequence payload.
It should be appreciated that
In addition, although the following generally describes information as being contained or included within various sections of the BioIntelligence header 110, it should be understood that in various embodiments such headers may be distributed and may contain pointers, tags or links to other structures or memory locations storing the associated header information.
Similarly, the payload 120 may contain a representation of the segmented DNA sequence data of interest, or may include one or more pointers or links to other structures or locations containing a representation of such sequence data. In this case, the various segments of a particular whole genome sequence may be stored in a distributive manner in multiple containers that are accessible on a network.
A first section 101 of the BioIntelligence header 110 provides information concerning CpG methylation sequence data that pertains to the various positions of the DNA sequence segment within the payload 120 of the biological data unit 100. In other words, the information that is contained in the ancillary files that are associated with the sequence points to section 101. Identification of these CpG islands and the methylation sequence will likely play an important role in understanding regulation of the associated genes and any involvement with disease.
The header information contained in the header 110 also includes, in section 102, a chromosome banding pattern property containing information concerning any chromosomal rearrangement that is observed, known, as yet unknown, or predicted to involve at least one segment of genome sequence data linked to this attribute. These types of cytogenetic abnormalities are often associated with severe phenotypic effects. This information may be configured in any other format to represent the genomic effects of chromosomal rearrangements, which are known to be common in cancer tumor genomics.
Header sections 103 and 104 provide information identifying the beginning and ending positions for the exons that are contained in the DNA sequence segment included within the payload 120. In the case of whole exome sequencing this information represents exons throughout the whole genome that are expressed in genes. Since exon selection has tissue and cell type specificity, these positions may be different in the various cell types resulting from a splice variant or alternative splicing. Along with this DNA coding information for individual exons, header section 105 may represent information in a metadata file of a count of the number of exons contained in the DNA sequence segment included within the payload 120. This type of information is known to be relevant in disorders involving exon skipping and exon duplication.
Attribute information that links specifically to one or more DNA sequence segments within the payload 120 having some association with a disease will be represented within section 106. Information pertaining to certain known molecular pathways or systems that may have molecular interactions with other genes or gene products would also be described within this section of the BI header. Alternatively, since variations of a given gene could be involved in one or more diseases, such information would also generally be contained within header section 106.
To the extent the DNA sequence segment in the payload 120 contains a part of a gene, a gene or plurality of genes, then the header section 107 provides all of the pertinent information that relates specifically to the applicable known gene name or gene ID. Header section 108 may represent the type of information that specifies the tissue or cell type which may be relevant to the extent and level of expression of the various exons that may be encoded in the gene or segment of genome that is described in section 105.
The metadata attribute located in the header section 109 will provide information concerning all possible open reading frames present within the segment of genome sequence data that is contained within the payload 120. This type of BioIntelligence attribute will be crucial for characterizing disease associated variants which are contained within what appear to be open reading frames that express no proteins or peptides that are detectable with today's methods.
Header sections 110 and 111 represent the metadata annotations that specify the start and end positions of the DNA sequence segment that is linked to a specific segment of a BAM file, represented by the payload 120. These positions may be considered arbitrary since the positions in the sequence could be expressed relative to more than one reference sequence.
Section 112 indicates if the segmented DNA sequence data within the payload 120 is chromosomal, microbial or mitochondrial. Furthermore, section 113 provides information concerning the genus and species of the origin of the DNA sequence segment represented within the payload 120. It should be appreciated that sections 112 and 113 will provide the information that describes all the DNA sequence data that is associated with an individual including, but not limited to, microbes attached on the outside and found on the inside of said individual as well as genome sequence data from plants and other higher animals found in the digestive tract.
All of the metadata annotations and attributes that are within the header 110 will generally contain prior knowledge information relating to the BioIntelligence that is relevant to the DNA sequence which is functionally utilized while the data is being sorted, filtered and processed. This packetized structure of the DNA sequence data that is represented in bits and encapsulated with BioIntelligence headers and other relevant information advantageously facilitates processing by existing network elements operative in accordance with layered or stacked protocol architectures.
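For illustration, the header sections 101 through 113 described above might be modeled as the following data structure. The class and field names are hypothetical, and every field defaults to an empty value so that it can act as a placeholder for information that is not applicable or not yet known; the payload may either be held inline or referenced by a pointer to another storage location.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BioIntelligenceHeader:
    """Hypothetical field names; comments give the header section each one mirrors."""
    cpg_methylation: Optional[str] = None                  # section 101
    chromosome_banding: Optional[str] = None               # section 102
    exon_starts: List[int] = field(default_factory=list)   # section 103
    exon_ends: List[int] = field(default_factory=list)     # section 104
    exon_count: Optional[int] = None                       # section 105
    disease: Optional[str] = None                          # section 106
    gene_id: Optional[str] = None                          # section 107
    tissue_type: Optional[str] = None                      # section 108
    open_reading_frames: List[str] = field(default_factory=list)  # section 109
    start_pos: Optional[int] = None                        # section 110
    end_pos: Optional[int] = None                          # section 111
    dna_source: Optional[str] = None   # section 112: chromosomal, microbial or mitochondrial
    organism: Optional[str] = None     # section 113: genus and species


@dataclass
class BiologicalDataUnit:
    header: BioIntelligenceHeader
    payload: Optional[str] = None      # inline sequence segment, or
    payload_ref: Optional[str] = None  # pointer or link to where the segment is stored


unit = BiologicalDataUnit(
    header=BioIntelligenceHeader(gene_id="BRCA1", organism="Homo sapiens",
                                 dna_source="chromosomal"),
    payload="ACGCCGTAACGGGTAATTCA",
)
```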
For example, The Cancer Genome Atlas consortium has elected to implement biological data units comprised of BioIntelligence headers consisting of information contained in XML metadata files and payloads comprised of genome sequence data contained in the BAM files. In this exemplary implementation a first specific type of BioIntelligence information may reference the tissue type or cell type of the sequence files (section 108 of
Attention is now directed to
The distributed packetizing of segmented DNA sequence data files and the embedding of biologically and clinically relevant information in biological data units will enable development of a networked processing architecture within which such data may be organized and configured in a layered format. Based on preliminary results, the architecture is expected to be particularly suited for effecting rapid analysis of large amounts of data of this type.
In one approach, the header contained within such biological data units is used to qualify or characterize the fragmented or otherwise segmented genome sequence data included within the payloads of such data units. In so doing, biological data units containing segmented DNA sequence data or other sequence data may now be sorted, filtered and operated upon based on the associated attribute information contained within the ancillary metadata files of the highly distributed data units.
For example, a data repository containing biological data units incorporating segmented DNA sequence data and related attribute information similar to that associated with the header 110 of
It is expected that it would be beneficial to arrange and represent all of the genomic sequence information from an individual organism, whether a bacterium, animal, plant or human, in accordance with the layered data architecture illustrated in
Consider further that if, for example, the DNA sequence data of interest is a particular variant of a human gene associated with breast cancer, such as BRCA1, then such data could be extracted from the container by filtering the contents of the data container for metadata attributes associated specifically with the segment of DNA sequence data from the organism Homo sapiens. The data units containing the specific BRCA1 variant along with all other DNA data packets containing human DNA sequence data may be easily extracted. However, separating human DNA sequence data from the DNA sequence data of other organisms may not be sufficiently selective in view of the technical requirements of certain applications. Accordingly, additional processing and comparative analysis may be performed in which specific data units comprising certain segments of sequence data from human chromosome 17 would be filtered from the data container.
Biological data units having payloads containing DNA sequence segments from chromosome 17 may provide a reasonable level of filtering. However, in order to efficiently analyze the gene most notably associated with breast cancer, further processing, sorting and filtering will be necessary. This may be achieved using several methods including, but not limited to, filtering on the specific start and end positions within the chromosome (S pos and E pos), on the gene ID (GID), or by disease (e.g., breast cancer). If the biological data units that are being sorted contain sequence segment data associated with an alternately-spliced variant of BRCA1, then this information may be contained in the header information representing the total exon count (see, e.g., header section 105 of
The packetized structural configuration of the disclosed distributed biological data units further enables functional integration of a layered data model such as that depicted in
The use of BioIntelligence header information which is consistent with a layered data architecture also advantageously enables substantial changes to be made to the information associated with one layer of the model without necessitating that corresponding modifications be made to other layers of the model. For example, sequence variants may be observed at splice donor and splice acceptor sites which may change the splicing pattern and mRNA size, protein structure and function, and these changes may yet be accommodated and mapped to the DNA layer without requiring that corresponding changes be made to the DNA layer of the existing knowledge data model.
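The successive filtering described above can be sketched as follows, with data units represented as plain dictionaries. The header keys (`organism`, `chromosome`, `gene_id`), the sample contents and the `filter_units` helper are hypothetical and serve only to illustrate narrowing a container by header attributes.

```python
from typing import Dict, List


def filter_units(units: List[Dict], **criteria) -> List[Dict]:
    """Keep the data units whose header matches every given attribute value."""
    return [
        unit for unit in units
        if all(unit["header"].get(key) == value for key, value in criteria.items())
    ]


# Illustrative container contents; the payloads are placeholders, not real gene sequences.
container = [
    {"header": {"organism": "Homo sapiens", "chromosome": "17", "gene_id": "BRCA1"},
     "payload": "ACGCCGTAAC"},
    {"header": {"organism": "Homo sapiens", "chromosome": "1", "gene_id": None},
     "payload": "GGGTAATTCA"},
    {"header": {"organism": "Escherichia coli", "chromosome": None, "gene_id": None},
     "payload": "TTGACAGCTA"},
]

human_units = filter_units(container, organism="Homo sapiens")
chr17_units = filter_units(human_units, chromosome="17")
```

Each call narrows the result of the previous one, so additional header attributes can be applied in the same way when finer selection is required.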
Attention is now directed to
In one embodiment the process 400 utilizes sequence feature information of the type annotated in well-established nucleotide databases 410 such as, for example, NCBI, EMBL and DDBJ for sorting, configuring and operating on the sequence data. By mapping the biological information within these databases into various layers of BioIntelligence header information, a layered data model of existing knowledge can be constructed.
Referring to
In a stage 420, the sequence data obtained from storage elements 410 is mapped and aligned with the reference genomic sequence data. The DNA sequence is associated with a set of relevant molecular features using, for example, biological data 414 deemed valid by the scientific community. This data 414 is mapped to specific regions of a sequence entry. In addition, clinical and pharmacological data 416 demonstrated to be associated with any coding or non-coding regions of a sequence entry is also mapped.
In one embodiment layer-1 biological data units 444_1 include a payload comprised of segmented DNA sequence data and a DNA layer header. Similarly, layer-2 biological data units 444_2 may include a payload comprised of segmented DNA sequence data, a DNA layer header and an RNA layer header. A layer-N biological data unit 444_N may include a payload comprised of segmented DNA sequence data, a DNA layer header, an RNA layer header, and other headers associated with higher layers of the relevant data model.
Alternatively, in one embodiment layer-1 biological data units 444_1 may include a payload comprised of segmented DNA sequence data and a DNA layer header, layer-2 biological data units 444_2 may be comprised of segmented RNA sequence data and an RNA layer header, and so on. In one embodiment a base unit may be prepended to or otherwise associated with each biological data unit in order to identify the specific headers included within the data unit and/or the number thereof.
In one embodiment BioIntelligence headers 424 may include physical, chemical, or biological knowledge or findings, or any related molecular data that has been peer reviewed, published and accepted as valid. BioIntelligence headers 424 may also include clinical, pharmacological and environmental data, as well as data from gene expression and methylation.
In certain embodiments BioIntelligence headers 424 may further include information relating to gene and gene product interaction with other components of a pathway or related pathways. The information within BioIntelligence headers 424 may also be obtained from, for example, microarray studies, copy number variation data, SNP data, comparative genomic hybridization, PCR and other related techniques, data types and studies.
The prior scientific knowledge and information associated with a specific sequence and included within a BioIntelligence header 424 may be of several different types including, for example, molecular biological, clinical, medical and pharmacological information. In this regard such molecular and biological information could be separated and layered based on data from, for example, genomics, exomics, epigenomics, transcriptomics, proteomics, and metabolomics in order to yield BioIntelligence data.
The BioIntelligence data may also include DNA mutation data, splicing and alternative splicing data, as well as data relating to posttranscriptional control (including microRNA and other non-coding silencing RNA and other nuclease degradation pathways). Mass spectrometric data on protein structure and function, mutant protein products with reduced or null function, as well as toxic products could also be utilized as BioIntelligence information.
In addition, pharmacological and clinical data relating to specific genes or gene regions disposed to exert effects through interaction with gene products or other components of a pathway could be considered as a class of BioIntelligence header information. Finally, BioIntelligence header information could also include environmental conditions or effects correlated with certain genes or gene products known or predicted to be related to a certain phenotypic effect or disease onset.
As mentioned above, during stage 440 BioIntelligence headers 424 are associated with segmented DNA sequence data to form biological data units comprised of a BioIntelligence header 424 encapsulating a payload containing the segmented DNA sequence data. In this process the association of a BioIntelligence header 424 with a payload containing segmented genome sequence data may be carried out in any of a number of ways. For example, such association may be effected using a pointer table, tag, graph, dictionary structure, key-value store or by embedding header information directly into the segmented sequence data.
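One of the association mechanisms listed above, a pointer table over key-value stores, might look like the following minimal sketch. The in-memory dictionaries stand in for storage containers that could reside on different network nodes, and the key naming scheme and function names are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Stand-ins for separate storage containers (hypothetical; could live on different nodes).
header_store: Dict[str, dict] = {}
payload_store: Dict[str, str] = {}

# Pointer table: data unit id -> (key of its header, key of its payload).
pointer_table: Dict[str, Tuple[str, str]] = {}


def register_unit(unit_id: str, header: dict, payload: str) -> None:
    """Store header and payload separately and record the link between them."""
    header_key = f"hdr:{unit_id}"
    payload_key = f"seq:{unit_id}"
    header_store[header_key] = header
    payload_store[payload_key] = payload
    pointer_table[unit_id] = (header_key, payload_key)


def resolve_unit(unit_id: str) -> Tuple[dict, str]:
    """Follow the pointer table to reassemble a data unit's header and payload."""
    header_key, payload_key = pointer_table[unit_id]
    return header_store[header_key], payload_store[payload_key]
```

Keeping the link in a separate table allows header information to be updated or extended without rewriting the sequence payload.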
In a stage 460, the biological data units 444 may be organized into encapsulated data units in accordance with the requirements of particular applications. For example, in certain cases it may be desired to create encapsulated biological data units including only a subset of the headers which would otherwise be included in the biological data units associated with at least one particular layer of the biological data model of prior knowledge. For example, a certain application may require encapsulated biological data units having headers associated with only layers 1, 2 and 5 of a data model.
Another application may require, for example, encapsulated biological data units having headers associated with only layers 2, 3 and 4 of the data model. Similarly, other applications may require that the headers of the encapsulated biological data units be arranged in a particular order, e.g., the header for layer 4, followed by the header for layer 1, followed by the header for layer 2.
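The application-specific selection and ordering of headers described above may be sketched, under stated assumptions, as a small helper. The function name `encapsulate` and the placeholder header contents are hypothetical and serve only to illustrate the idea.

```python
def encapsulate(headers, layer_order):
    """Return (layer, header) pairs for the requested layers, in the requested order.

    `headers` maps layer number -> header content; layers not requested are omitted.
    """
    missing = [layer for layer in layer_order if layer not in headers]
    if missing:
        raise KeyError(f"no header available for layers {missing}")
    return [(layer, headers[layer]) for layer in layer_order]


# Hypothetical five-layer unit; header contents are placeholders.
all_headers = {n: f"header-L{n}" for n in range(1, 6)}

subset = encapsulate(all_headers, [1, 2, 5])      # only layers 1, 2 and 5
reordered = encapsulate(all_headers, [4, 1, 2])   # layer 4, then 1, then 2
```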
In a stage 480, the encapsulated biological data units created in stage 460 are stored in a manner consistent with being interoperable with one or more multi-layered, multi-dimensional data containers 464. The content of the headers of the encapsulated biological data units is chosen to promote optimal interoperability among and between layers. For example, in one simplified case each biological data unit included within the data container 4641 may include at least a DNA layer header, an RNA layer header, and a protein layer header. It is a feature of the present system that information within higher-layer headers (e.g., RNA layer headers or protein layer headers) may be “mapped” to lower-layer headers and/or sequence information in such a way as to establish a relationship provenance between information within various layers.
Consider an example wherein data concerning a particular protein product that is expressed in a certain tissue type (i.e., protein layer information) may also provide information relating to splicing (i.e., RNA layer information) or to a SNP at the genomic level (i.e., DNA layer information) resulting in a premature termination codon. In other words, protein structure related data can provide RNA level knowledge on alternative splicing as well as primary sequence data on amino acid substitutions revealing SNPs and indels in the DNA sequence.
In another case, the diagnosis of a certain disease in a certain patient or, for example, results from a mammogram screen or prostate-specific antigen results, may provide information that is directly related to hyper-methylation of certain regions of the DNA sequence segment included within a DNA layer biological data unit. These epigenetic markers, along with the methylation profile at CpG islands associated with certain genes, could provide crucial BioIntelligence header information to relate and correlate with appropriate gene and disease conditions.
One advantage of the layered architecture of the data containers 464 is that modification or updating of the data content associated with a given layer has minimal or no effect on the processing of data in the remaining layers. In one embodiment layers are advantageously designed to be operated on independently while retaining the capability to integrate, and interoperate with, data and existing knowledge of other layers. In addition, data can be organized within each data container 464 in accordance with the requirements of specific applications.
All or part of this data may be mapped, via linked relationships between information within BioIntelligence headers or metadata attributes that are associated with different layers of a data model, to a disease condition capable of being associated with a region of segmented DNA sequence data contained within a biological data unit. This enables biological data units to be grouped and analyzed based upon the classification schema required by a particular application.
In a stage 490, biological data units encapsulated with BioIntelligence headers and stored within the data containers 464 may subsequently be filtered, sorted or operated upon based on information included within such headers. The layered structure of biological data units including encapsulated BioIntelligence headers enables querying of the information included within one or more such headers to be performed and results returned based upon a set of rules specified by, for example, the application issuing the query.
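The filtering and sorting of stage 490 may be illustrated by the following sketch, in which the rule is supplied by the application issuing the query. The function name `query_units`, the dictionary layout of each unit and the sample field values are assumptions made for illustration.

```python
def query_units(units, rule, sort_key=None):
    """Return the units whose headers satisfy `rule`, optionally sorted.

    `units` is a list of dicts with "headers" (layer name -> fields) and "payload".
    `rule` is a predicate supplied by the application issuing the query.
    """
    matches = [u for u in units if rule(u["headers"])]
    if sort_key is not None:
        matches.sort(key=lambda u: sort_key(u["headers"]))
    return matches


# Placeholder units; field names and values are illustrative only.
units = [
    {"headers": {"dna": {"chromosome": "2", "start": 500}}, "payload": "ACGT"},
    {"headers": {"dna": {"chromosome": "1", "start": 900}}, "payload": "GGCA"},
    {"headers": {"dna": {"chromosome": "1", "start": 100}}, "payload": "TTAG"},
]

on_chr1 = query_units(
    units,
    rule=lambda h: h["dna"]["chromosome"] == "1",
    sort_key=lambda h: h["dna"]["start"],
)
```

Only header fields are inspected; the payloads are returned untouched.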
Attention is now directed to
The biological data network 500 may in one aspect be viewed as comprising a network of data stored within the databases 550 as well as within storage (not shown) at the network nodes 510. In one embodiment each biological data sequence or other sequence information stored within the network 500 may be accorded a unique identifier such as, for example, an IP address, in order to facilitate the establishment of such a data network. Moreover, tables may be maintained at each network node 510 for data tracking purposes (references herein to network node 510 are generally also intended to refer to network nodes 510′, unless the context of the reference clearly suggests otherwise). In particular, such tables may be used to track the sequence information available directly or indirectly (via other network nodes 510) from other network nodes 510, as well as the results of processing such sequence information at various nodes 510. These tables may be updated as biological data units containing sequence information and/or BioIntelligence and/or MetaIntelligence™ headers are transported between nodes for processing. Alternatively or in addition, overhead messages may be exchanged between network nodes 510 for the purpose of propagating the information stored within ones of these tables to the tables maintained by other nodes 510. Such messaging and updating of tables between network nodes 510 generates a type of BioIntelligent data awareness that provides a distinct advantage for processing and sharing data on network 500. Furthermore, the network processing that is carried out allows seamless access to network-associated processing functions, shared data as well as support databases that also contain properties of and information about the data.
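A minimal sketch of the per-node tracking tables and of the overhead messages used to propagate their contents is given below. The class `NodeTable`, its methods and the sequence identifiers are hypothetical; the sketch tracks only sequence locations, not processing results.

```python
class NodeTable:
    """Hypothetical per-node table: sequence identifier -> set of nodes holding it."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.locations = {}

    def record_local(self, sequence_id):
        # Note that this node itself stores the sequence.
        self.locations.setdefault(sequence_id, set()).add(self.node_id)

    def advertisement(self):
        # Overhead message: a copy of everything this node currently knows.
        return {seq: set(nodes) for seq, nodes in self.locations.items()}

    def merge(self, message):
        # Fold another node's advertisement into the local table.
        for seq, nodes in message.items():
            self.locations.setdefault(seq, set()).update(nodes)


node_a, node_b = NodeTable("A"), NodeTable("B")
node_a.record_local("seq-001")
node_b.record_local("seq-002")
node_b.merge(node_a.advertisement())   # B learns that A holds seq-001
```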
Structure and Operation of Network Nodes of Biological Data Network
During operation of the network 500, requests from a client terminal 560 are received by a network node 510. Such requests are interpreted at the network node 510 and appropriate processing is carried out at such network node 510, and potentially other network nodes 510, in order to produce the requested results. In this regard BioIntelligence headers relating to all of the data throughout the network 500 that is designated as or otherwise made network accessible may be accessed and processed in response to requests from a client terminal 560. In this way intelligent information concerning data stored remote from a client terminal 560 and its associated network node 510, and/or such data itself, may be processed in a manner transparent to such terminal 560 and node 510.
Although certain of the embodiments disclosed herein contemplate that various ones of the network nodes 510 may perform specialized processing functions and operate cooperatively to produce an overall processing result, in other embodiments certain nodes may be capable of performing all of the processing functions necessary to deliver results in response to queries.
In certain aspects of the invention whereby cooperative operations and processing functions are coordinated at various distributed network nodes 510, queries can be made that would facilitate the simulation, study and comprehension of systems in biology. In this case, BioIntelligence header information fields at the DNA, RNA and protein layers, along with query-dependent processing function requirements, serve as the activated substrates for generating a result.
In general, when a query/request is made, a suite of protocols is invoked based upon the properties of the request. For example, a request can be made from any client on the network 500, and the stack of application protocols uses processing functions at multiple nodes to access the associated data and a process management function to tabulate, coordinate and combine the partial information from multiple nodes to return the query result. In this regard, processing at a network node 510 can be achieved using either of at least two approaches. In a first approach of cooperative processing functions, data and/or partial processing results can be moved to the desired functional node 510 to be processed. Alternatively, the required processing function can be moved from a network node 510 to the location of the network accessible data at 550 and the data is processed at the site at which it resides on the network 504. Furthermore, a combination of the two approaches can be used to return the query result to end nodes or terminals 560. In addition, any result from processing that is new network information can be used to update tables at nodes 510 to enhance network awareness.
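The second approach described above, in which the processing function travels to the data and a process management function combines the partial results, may be sketched as follows. The function names, site names and gene labels are hypothetical, and counting labels stands in for whatever processing a real request would require.

```python
from collections import Counter


def run_at_data_sites(data_sites, function):
    """Apply `function` where each data set resides; return per-site partial results."""
    return {site: function(records) for site, records in data_sites.items()}


def combine_partials(partials):
    """Process-management step: merge per-site counts into a single result."""
    total = Counter()
    for partial in partials.values():
        total.update(partial)
    return dict(total)


# Hypothetical sites, each holding gene labels observed in its local data.
data_sites = {
    "site-1": ["gene-A", "gene-B", "gene-A"],
    "site-2": ["gene-B", "gene-C"],
}

partials = run_at_data_sites(data_sites, Counter)   # the function travels, the data stays
result = combine_partials(partials)
```

Under the first approach the same `combine_partials` step would apply, with the records themselves moved to the processing node before counting.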
The network nodes 510 are aware of the types, the content and location of all network accessible data and its intelligence. Moreover, the network nodes 510 are aware of the types, locations and capabilities of processing functions on the network 504. In this regard each node 510 is regularly updated with the activities being performed by, and processing results generated by, each other node 510 of the network 500. In one embodiment, network-based applications and protocols are aware of the information contained in the different fields of the BI headers associated with the biological data units stored within the databases 550 and access such information to the extent necessary to process queries from terminals 560.
Turning now to
The DPS™ is intended to enable existing Internet infrastructure to efficiently process and transport DNA sequence-based data. The DPS™ protocol stack comprises a DNA Transport Protocol™ (DTP™), DNA Signaling Protocol™ (DSP™), and DNA Control Protocol™ (DCP™). In one embodiment the DTP™ protocols enable network elements such as routers and switches to process, transport, and communicate biological data such as DNA sequence data and related information between single or multiple sources of streaming DNA servers (discussed below). The servers will include or have access to data containers (e.g., storage devices) including biological data units and/or unprocessed or partially processed DNA sequence data.
The functions of the DPS™ protocol suite comprise processing, transporting, controlling, switching and routing biological data such as DNA sequence information as streaming data so as to enable such data to be utilized for a variety of “streaming” applications. In this regard the DPS™ protocol stack will be used for pulling streaming biological data from servers having access to containers of biological sequence data. Such streaming applications are capable of continuously “pushing” and “pulling” biological sequence data as necessary to support the functionality of each particular application.
Various options exist for introducing the DPS™ protocol suite into existing network infrastructure. In one implementation, for example, the DPS™ protocol suite may be distributed throughout the routers/switches of a given service provider. In another implementation, the DPS™ protocol suite may reside only in one or more network elements near an edge of the service provider's network in an overlay network.
Attention is now directed to
Each incoming IP packet containing biologically-relevant headers is received via a network interface 810 and provided to an input packet processor 820. The input packet processor 820 removes the IP header information and parses the higher-layer content included within the packet. A classification module 830 may then assign the packet to a particular class based upon this higher-layer content. The biologically-relevant header information included within the packet may then be passed to a configurable processing module 850 for processing in the manner described hereinafter based upon the determined class and any policies applicable to such class defined by policy module 840. As is also described hereinafter, the biologically-relevant header information may then be processed by the configurable processing module 850 with reference to various sequence location tables 870 and layered data tables 860 maintained at the network node 510. The layered data tables 860 are structured consistently with the biological data model (
Based upon the results of the processing performed by the configurable processing module 850, outgoing biologically-relevant header information associated with the biological sequence identified within the input IP packet or other processing results is provided to a transmit controller module 880 for packetization within an outgoing IP packet. To the extent the outgoing biologically-relevant header information requires further processing by another network node 510 in order to render an appropriate response to the user request received by the network 500, a load balancing module 882 within the transmit controller module 880 selects such a network node 510 from among the group of such nodes capable of performing the required processing. Such selection may be based upon, for example, the processing loads associated with each node within the group. Additionally, selection may be based upon processing results that are passed to the transmit controller module 880. A QoS module 884 places each outgoing IP packet in one or more queues in accordance with, for example, the applicable class accorded the corresponding incoming IP packet by the classification module 830 and the policy associated with such class. Each outgoing IP packet will generally include identifying information similar to that included within each incoming IP packet. The outgoing IP packets are provided by the transmit controller module from the applicable queue to the network interface for transmission to a destination network node 510.
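The load balancing and queueing behavior of the transmit controller module may be sketched as follows. The sketch assumes least-loaded selection and strict-priority service between classes; both are illustrative choices rather than requirements of the disclosure, and the names `select_node`, `QosQueues` and the class labels are hypothetical.

```python
from collections import deque


def select_node(candidates, loads):
    """Pick the candidate node with the lowest reported processing load."""
    return min(candidates, key=lambda node: loads[node])


class QosQueues:
    """One FIFO queue per class; a lower priority number is served first."""

    def __init__(self, class_priority):
        self.class_priority = class_priority
        self.queues = {cls: deque() for cls in class_priority}

    def enqueue(self, packet, cls):
        self.queues[cls].append(packet)

    def dequeue(self):
        # Serve the highest-priority non-empty queue (assumed strict priority).
        for cls in sorted(self.queues, key=self.class_priority.get):
            if self.queues[cls]:
                return self.queues[cls].popleft()
        return None


loads = {"node-1": 0.8, "node-2": 0.3, "node-3": 0.5}
target = select_node(["node-1", "node-2", "node-3"], loads)

qos = QosQueues({"clinical": 0, "research": 1})   # hypothetical classes
qos.enqueue("pkt-r1", "research")
qos.enqueue("pkt-c1", "clinical")
first, second = qos.dequeue(), qos.dequeue()
```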
In one embodiment the BioIntelligence headers within each IP packet received by a network node 510 will be functionally associated with or contain information having biological relevance to a segment of DNA sequence data, MetaIntelligence™ metadata, or both. It should be appreciated that the BioIntelligence headers may be arranged in any order, whether dependent upon or independent of any associated payload data. However, in one embodiment the BioIntelligence headers are each respectively associated with a particular layer of a biological data cube model representative of the biological sequence data contained within the payloads of the biological data units with which such headers are associated. Moreover, it should be understood that any patient-related data which is not predicated upon genomic sequence information but is nonetheless pertinent to the processing by the network 500 of a request may be included within the BioIntelligence headers of a received IP packet.
It should be further understood that BI headers may be realized in essentially any form capable of embedding information within, or associating such information with, all or part of any biological or other polymeric sequence or plurality thereof. BI headers may also be placed within a representation of associated DNA sequence data, or could be otherwise associated with any electronic file or other electronic structure representative of molecular information. In particular, biological data units containing segmented DNA sequence data may be sorted, filtered and operated upon based on the associated information contained within the BioIntelligence header fields.
Attention is now directed to
In an initial step of the variants processing procedure, a determination is made as to whether any differences exist between the biological data sequence associated with the query and the reference sequence. To the extent differences are detected, the nature of the differences and their locations with respect to the reference sequence are recorded. In this regard the sequence data associated with the query could comprise a portion of a gene or plurality of genes, an entire genomic sequence from normal cells, and/or an entire genomic sequence from diseased cells. The sequence data for a particular patient could comprise any, or a combination, of these types of sequence data.
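The initial comparison step may be illustrated by the following simplified sketch, which records the position and nature of each difference relative to the reference. It assumes the two sequences are already aligned and of equal length, so only substitutions are detected; the function name `find_substitutions` and the sample sequences are hypothetical.

```python
def find_substitutions(reference, query):
    """Record (position, reference_base, query_base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length, so only
    substitutions are detected; insertions and deletions are out of scope here.
    Positions are 1-based with respect to the reference.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, qry_base)
        for pos, (ref_base, qry_base) in enumerate(zip(reference, query), start=1)
        if ref_base != qry_base
    ]


differences = find_substitutions("ACGTACGT", "ACGAACCT")
```

An empty list indicates that no differences were detected between the query and the reference.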
In other embodiments a clinically transformed version of a patient's genomic sequence data, rather than the sequence data itself, is associated with user requests received by the network 500. Such a clinical transformation may involve, for example, associating a patient's medical records or health related information with any or a combination of the patient's genomic sequence or the patient's transcriptomic, proteomic, metabolomic or lipidomic information, or any other such related data. For example, such transformation could involve using certain minor allele variations in or near certain genes that are associated with certain phenotypes, symptoms, syndromes, diseases, disorders, etc. Furthermore, certain knowledge of the linkage disequilibrium that is associated with the haplotype map genome sequence of the patient might provide a detailed transformation of this genotyping data into information on protein concentrations in blood, urine and other body fluids. Information on functional activity of these proteins and their metabolic state which might include posttranslational modifications could be a useful part of improving the granularity of the patient's genomic-based transformed data. Accordingly, the present disclosure advantageously provides a mechanism for networking and sharing genomic-based data without requiring a corresponding sharing of a patient's genomic sequence data.
Again considering the process of
For simplicity, in the case where SNPs are the only variants, dbSNP can be used to validate common SNPs. In addition, data on minor alleles with disease association might be present in other cancer genome databases that are maintained by public and private entities such as, but not limited to, CGP (Cancer Genome Project at Sanger Institute), TCGA (at NIH's National Cancer Institute), RCGDB (Roche Cancer Genome Database), and the like.
Attention is now directed to
In the general case, once the processing to be performed at a given network node 510 has been completed, a decision will be made to route or switch the processing to another network node 510 based upon the results of such processing (stage 960). The extent of the processing to be performed by the network 500 with respect to a particular request will of course be dependent upon the nature of the request.
Turning now to
In one embodiment each network node 510 implements a method which generally involves performing a processing operation involving ones of a first set of biological data units and a second set of biological data units. The processing might further involve a comparison of the called variant with access to established variants databases.
In the general case, the biological data unit encapsulated within the IP packet received by a network node 510 will contain a first header associated with first information relating to segmented biological sequence data and a second header associated with second information relating to the segmented biological sequence data. The method includes processing of the first information and the second information in relation to the content of the payload of the biological data unit. In one embodiment processing is carried out at each network node 510 with respect to biological data units including a first header associated with information relating to a first-layer representation of biological sequence data and a second header associated with information relating to a second-layer representation of biological sequence data wherein a biological, clinical, pharmacological, medical or other such relationship exists between the first-layer and second-layer representations. For example, the DNA sequence for a gene may be related to the cDNA or RNA sequence of that gene or the protein sequence, structure or function of the gene product. In one embodiment all of the data contained within a layered representation of the DNA sequence information (see
As may be appreciated with reference to
Attention is now directed to
The platform 1100 may further include a CAM memory device 1150, which is configured for very high speed data location by accessing content in the memory rather than addresses as is done in traditional memories. In addition, one or more databases 1160 may be included to store data such as compressed or uncompressed biological sequences, dictionary information, metadata or other data or information, such as computer files. Database 1160 may be implemented in whole or in part in CAM memory 1150 or may be in one or more separate physical memory devices.
The platform 1100 may also include one or more network connections 1140 configured to send or receive biological data, sequences, instruction sets, or other data or information from other databases or computer systems. The network connection 1140 may allow users to receive uncompressed or compressed biological sequences from others as well as send uncompressed or compressed sequences. Network connection 1140 may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies as are known or developed in the art.
Memory space 1170 may be configured to store data as well as instructions for execution on processor(s) 1110 to implement the methods described herein. In particular, memory space 1170 may include a network processing module 1172 for performing networked-based processing functions as described herein. Memory space 1170 may further include an operating system (OS) module 1174, a data module 1176 configured to temporarily store sequence data and/or associated attributes or metadata, and a module 1178 for storing results of the processing effected by the network processing module 1172.
The various modules included within memory space 1170 may be combined or integrated, in whole or in part, in various implementations. In some implementations, the functionality shown in
Attention is now directed to
In one embodiment none of the data which is stored in the local storage container 1220 is generally accessible to clients 560 of the network 500. Movement of data between storage containers associated with or accessible to different network nodes 510 may be governed by the policies established by the one or more clients 560 controlling such containers. For example, depending on the policy in place at a first network node 510, certain aspects of actual patient data or a transformed version of such data might be “pulled” in whole or in part from data containers accessible to a second network node 510.
BioIntelligence Access to Existing Knowledge
Attention is now directed to
As may be appreciated by reference to
Referring now specifically to
It should be noted that
In the embodiment of
The segmented sequence data within the payload of the biological data unit identified by the field H1 within the L1 header 1310 may represent a certain region of a genome that may be positioned in similar but not necessarily identical base positions. For example, the comparison of this region or section of the genome that is represented in the payload for a particular gene would be expected to code for the same genes or at least different isoforms of the same gene.
As a result, the L1H1 header field (layer 1, header field 1) from the stored DNA data would give comparable results for the various DNA layer annotations that are present in that data container. Such DNA layer information could include, for example, gene ID, chromosome, base positions, regulatory regions, 5′ and 3′ UTR, variant alleles and other DNA-based information related to the gene. Based on the query message, the individual network processing node 510 accesses information within the data cubical of prior knowledge 1308 relating to, for example, chromosome number (for simplicity, not shown) and base positions identified by the L1H1 header field.
Referring now to
In one embodiment this field should contain at least one representation for the name of the gene and/or gene product that is encoded by the DNA sequence in the payload of the biological data unit associated with headers 1304. In cases where more than one name is used to identify a gene, gene product or the activity associated with that gene, the most current and widely accepted names are listed. Any gene ID name that is used to relate specifically to the sequence represented by the chromosome number and base positions that are indicated in the first header field of layer 1 should be encoded by this particular sequence in this region of the genome. However, because of gene duplication, copy number variations, existence of gene families, repeat sequences, mobile transposable elements and other such related molecular phenomena, certain classes of redundancy will exist. Furthermore, one gene or the polypeptide product of a gene or the enzymatic activity of a gene could be associated with more than one disease, syndrome, disorder, phenotype, etc.
Turning now to
For simplicity and clarity, the supportive data in this case show three different cancer types that are associated with packaged genome sequence data attached to the exemplary header fields. The diseases that are known to have association with the segmented sequence in the payload of this biological data unit in this case are colon, cervical and breast cancers. The gene or sequence segment might represent an up-regulated oncogene or proto-oncogene, a down-regulated tumor suppressor gene or a structural or functional gene involved in a pathway with other genes associated with the disease.
Referring now to
As shown in
Attention is now directed to
Referring now to
In this example, the variations in the number of exons that are contained in this gene indicate the existence of different splice variants that are associated with the transcripts from cells taken from the breast tumor tissue. The defect in splicing could be from variants of the gene or some component of the splicing mechanism.
In the embodiment of
Although
Attention is now directed to
In one embodiment the Smart Repository™ 1910 comprises a node of, or is in network communication with, a biological data network 1914 containing a plurality of other nodes 1918 and/or is in communication with other data networks, such as the Internet. In such embodiment the Smart Repository™ 1910 may be, except with regard to the information aggregation functionality described below, functionally and architecturally similar or identical to the network nodes 510 described above. In other embodiments the Smart Repository™ 1910 is configured to perform only the information aggregation functions described hereinafter and is not otherwise configured for networked-based processing of sequence-related information. In still other embodiments the Smart Repository™ 1910 is not included within a biological data network but has access to information within other networks, such as the Internet.
As shown in
In one embodiment the transcriptor 1930 operates to substantially continuously monitor the biological data network 1914 and such other data networks for information of potential relevance to users of the Smart Repository™ 1910. Certain of such information may then be retrieved by the Smart Repository™ 1910 and cached within the genome data repository 1940. For example, the transcriptor 1930 may collect drug efficacy and other information relating to sequence data stored within the repository 1940 which contains various biomarkers and is associated with particular disease conditions. Such information may be relatively detailed and comprehensive. For example, information relating to drug efficacy may include a “confidence” score associated with the information; that is, an indication of the level of confidence associated with the efficacy information. A high confidence score could be assigned to drugs for which relatively large amounts of patient data are available to confirm the reported efficacy, while relatively lower confidence scores could be assigned in the absence of extensive corroborating patient data.
In a first mode of operation, the SmartTracker™ module 1920 receives a query 1950 from an actor 1956 in the form of, for example, a client computer similar or identical to the client terminal 560. In one embodiment the SmartTracker™ module 1920 is, among other things, configured to track the uploading and downloading of sequence-related information occurring between the actor 1956 and the Smart Repository™ 1910. In one embodiment the transcriptor 1930 assembles, based upon the subject matter of the query 1950, the sequence-related upload and/or download activity of the requesting actor and/or other aspects of the actor's profile, a script of information of various different types determined to be of relevance to the query in view of the interests of the requesting actor. This assembling may include, for example, parsing fields of the metadata files 1944 associated with files of sequence data 1942 determined to be relevant to a query 1950 in order to identify clinical and/or biological information related to the query. Once such clinical and/or biological information has been identified, its relevance may be quantified and ranked and a script 1960 containing such information is provided to the requesting actor 1956. In one embodiment the script 1960 may also be locally cached within the genome data repository 1940 and provided to other actors (not shown) which the SmartTracker™ module 1920 presently or subsequently determines possess interests similar to the requesting actor 1956. In other embodiments the script 1960 may be aggregated with other information scripts cached within the genome data repository 1940. These aggregated scripts may then be combined with other relevant information for inclusion within scripts of information subsequently generated by the transcriptor 1930 in response to subsequent requests for genomic-related information received by the Smart Repository™ 1910.
In a second mode of operation, the SmartTracker™ module 1920 tracks the uploading and downloading activity of sequence-related information occurring between the actor 1956 and the Smart Repository™ 1910 and instructs the transcriptor 1930 to assemble a script of related information based upon one or more aspects of this activity. The resultant script of related information may then be provided by the transcriptor 1930 to the actor 1956 in connection with an uploading or downloading transaction initiated by the actor 1956. In one embodiment the contents of this script of related information are not necessarily pertinent to specific information included within a particular request from a requesting actor, but rather are selected by the transcriptor 1930 based upon the uploading/downloading activity and/or other system usage of the actor 1956.
In one embodiment the sequence data and BioIntelligence information assembled by the transcriptor 1930 in response to a request from an actor is provided to such actor through an entitlement control module 1970. For example, an authenticated actor can query the repository for a list of all the genome data files of individuals with colorectal cancer that were uploaded within the current year by the actor 1956 or other actors (not shown). Such a query would return a complete list of the files. However, in one embodiment the actor 1956 would be permitted to access only those files within the list which the entitlement control module 1970 determines the actor 1956 is authorized to access. For example, in certain embodiments only the owners of a subset of the listed genome data files may have consented to permit the actor 1956 to download or otherwise access such files. In embodiments in which the actor 1956 subscribes to services offered by the Smart Repository™ 1910 and/or biological data network 1914, the entitlement control module 1970 could be further configured to determine whether or not the requesting actor 1956 has a current subscription and, if so, the level or “quality of service” associated with the subscription. For example, in embodiments in which the actor 1956 has subscribed to a relatively higher quality of service, the information within the script provided by the transcriptor 1930 may be more recent or drawn from a wider variety of sources than would be the case had the actor 1956 opted for a lower quality of service.
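The distinction drawn above between the complete list returned by a query and the subset of files an actor may actually access can be sketched as two separate steps. The function names, the file records and the consent mapping are hypothetical and are offered only as one possible arrangement of an entitlement check.

```python
def list_matching_files(files, disease, year):
    """Return every file matching the query, regardless of entitlement."""
    return [f for f in files if f["disease"] == disease and f["year"] == year]


def accessible_files(listing, actor, consents):
    """Keep only files whose owner has consented to access by `actor`.

    `consents` maps file id -> set of actors permitted by the file's owner.
    """
    return [f for f in listing if actor in consents.get(f["id"], set())]


# Placeholder records; identifiers and years are illustrative only.
files = [
    {"id": "f1", "disease": "colorectal cancer", "year": 2012},
    {"id": "f2", "disease": "colorectal cancer", "year": 2012},
    {"id": "f3", "disease": "breast cancer", "year": 2012},
]
consents = {"f1": {"actor-1956"}, "f2": {"actor-other"}}

listing = list_matching_files(files, "colorectal cancer", 2012)
granted = accessible_files(listing, "actor-1956", consents)
```

A file with no consent entry is treated as inaccessible, which is a conservative default assumed for this sketch.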
In one embodiment the transcriptor 1930 may track attribute information located in the fields of the metadata files 1944 or BioIntelligence headers associated with the genomic sequence data files 1942. The information contained in the metadata files 1944 may be of any pertinent type including, without limitation, health record, image, clinical, pharmacological, medical, environmental and social data.
In certain embodiments the genome sequence data files 1942 may comprise disease-normal matched pair genomic sequence data (i.e., genomic sequence data associated with diseased tissue and “matched” genomic sequence data associated with normal tissue from the same individual). In this embodiment the Smart Tracker™ 1920 may track metadata attributes containing higher-order information with research and clinical relevance and instruct the transcriptor 1930 to include this type of information within the metadata files 1944. For example, the metadata files 1944 may contain annotation and attribute information such as, without limitation, germline and somatic genome variants, CNV data, methylation sequence data, microbiome, metabolome, transcriptome, proteome and any other related structure, function and genetic data.
The functionality of the transcriptor 1930 may be determined at least in part by the type of information incorporated into the metadata files 1944. For example, in metadata files 1944 containing transcriptome-related information, the transcriptor 1930 may be configured to aggregate data such as microRNA-Seq, mRNA-Seq and any other transcriptomic information of or pertaining to alternative splicing, differential expression and regulation.
Attention is now directed to
As shown in
In general, the transactor 2020 functions to monitor such interaction based upon information provided by the SmartTracker™ module 2018 and assign actors 2024 exhibiting similar behavior to the same cast of actors. For example, actors 2024 tending to download, from the Smart Repository™ 2010, similar segments of genomic sequence data or other information could be grouped by the transactor 2020 into the same Cast X 2030 comprised of actors 2024X.
As is described in the above-referenced provisional application Ser. No. 61/539,942, grouping of the actors 2024X into a common Cast X 2030 enables large files of sequence data 2044 and other information to be downloaded by such actors 2024X using a genomic sequence transfer protocol designed to more efficiently utilize the network bandwidth available to the Smart Repository™ 2010. The disclosed genomic sequence transfer protocol provides a method for secure, high-speed file transfer which is capable of overcoming the disadvantages of TCP and existing peer-to-peer protocols with respect to the distribution of files of very large size. Like other peer-to-peer file distribution systems, the high-speed file transfer system disclosed in the above-referenced provisional application utilizes a tracker (e.g., the SmartTracker™ 1920) to enable a plurality of actors 2024X within the Cast X 2030 to cooperatively distribute a file of interest. Within the context of the genomic sequence transfer protocol, the transactor 2020 operates to identify and make a record of those actors 2024 which request a certain file of interest (e.g., “file X”). The transactor 2020 will also generally include or be paired with an entitlement control module 2040 configured to determine the authentication and entitlement of each actor based on authorization rules and using a secure key distribution scheme.
In one embodiment the transactor 2020 may determine which actors 2024 are assigned to a particular cast (e.g., Cast X 2030) based upon, for example, the file requested, the location of the file (i.e., with which actor(s) 2024 the file is currently stored), as well as the credentials of the actors 2024 requesting access to the file. Once an actor 2024 has been directed to a particular cast, the actor 2024 exchanges messages 2046 with other actors 2024 within the cast in order to determine and receive the portions of the file of interest currently possessed by the Cast X 2030. Stated differently, the transactor 2020 proactively directs a requesting leecher actor to a feeder affinity group such that the leecher receives as much of the requested file as possible without, to the extent possible, incrementing the burden on the seed of file X.
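The cast assignment described above may be illustrated by the following minimal sketch, which groups actors solely by the file requested. The function name `assign_casts` and the identifiers `actor_a`, `file_x` and the like are hypothetical and do not appear in the disclosure; an actual transactor could also weigh file location and actor credentials.

```python
from collections import defaultdict


def assign_casts(requests):
    """Group actor IDs into casts keyed by the file each actor requested.

    `requests` is an iterable of (actor_id, file_id) pairs. Both names are
    illustrative assumptions, not part of the disclosure.
    """
    casts = defaultdict(list)
    for actor_id, file_id in requests:
        # An actor joins a given cast only once, even if it repeats a request.
        if actor_id not in casts[file_id]:
            casts[file_id].append(actor_id)
    return dict(casts)


requests = [
    ("actor_a", "file_x"),
    ("actor_b", "file_x"),
    ("actor_c", "file_y"),
    ("actor_a", "file_x"),  # repeated request, ignored
]
casts = assign_casts(requests)
```

In this sketch the actors requesting “file X” end up in one cast and may then exchange portions of that file among themselves rather than each drawing on the seed.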
In the case of very large files, such as files containing genomic or other biological sequence information, the disclosed genomic sequence transfer approach effectively “parallelizes” the transfer of file information and reduces the burden on the initial seed or seeds of file X. Moreover, the use of parallel streams within the disclosed system minimizes the effect of a multiplicative decrease in the speed of any one stream resulting from the characteristics of TCP. Thus, use of the disclosed genomic sequence transfer protocol may reduce the likelihood of bottlenecks developing around overburdened seed servers in connection with the transfer of very large data files.
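By way of a simplified worked example, the following sketch compares the aggregate transfer rate immediately after a single stream reacts to a loss event. It assumes that the affected stream's rate is halved, as in conventional TCP congestion control; the disclosure itself refers only to a multiplicative decrease, and the function name and rate values are hypothetical.

```python
def aggregate_rate_after_loss(num_streams, per_stream_rate, decrease_factor=0.5):
    """Aggregate rate right after ONE of `num_streams` equal streams backs off.

    `decrease_factor` = 0.5 models the usual TCP halving (an assumption).
    """
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    unaffected = (num_streams - 1) * per_stream_rate
    affected = decrease_factor * per_stream_rate
    return unaffected + affected


# Same nominal total of 100 units/s, carried by 1 stream or by 8 streams.
single = aggregate_rate_after_loss(1, 100.0)  # 50.0: half the capacity is lost
eight = aggregate_rate_after_loss(8, 12.5)    # 93.75: only 1/16 is lost
```

Under these assumptions the fractional loss in aggregate rate is 1/(2N) for N equal streams, which is why a single loss event matters less as the number of parallel streams grows.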
The use of such parallel streams also enables the separate encryption of each individual file segment, thus obviating the need for re-encryption and retransmission of the entire file in the event of corruption of an individual segment. Particularly in the case of very large data files containing sensitive information (e.g., files containing genomic sequence information), this aspect of the disclosed genomic sequence transfer protocol may offer considerable advantages relative to existing methods of file distribution.
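The benefit of handling segments individually may be sketched as follows. Because the Python standard library provides no cipher, this sketch substitutes a per-segment SHA-256 digest for per-segment encryption; it illustrates only the point that a corrupted segment can be identified and resent on its own. The function names and the segment size are hypothetical.

```python
import hashlib


def split_segments(data, segment_size):
    """Split `data` (bytes) into consecutive segments of at most `segment_size` bytes."""
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]


def segment_digests(segments):
    """Compute one SHA-256 digest per segment."""
    return [hashlib.sha256(segment).hexdigest() for segment in segments]


def corrupted_indices(received, expected_digests):
    """Return the indices of received segments whose digest does not match."""
    return [
        index
        for index, (segment, digest) in enumerate(zip(received, expected_digests))
        if hashlib.sha256(segment).hexdigest() != digest
    ]


original = split_segments(b"ACGT" * 8, 8)  # 32 bytes -> 4 segments
digests = segment_digests(original)

received = list(original)
received[2] = b"ACGTACGA"                  # simulate corruption of one segment
to_resend = corrupted_indices(received, digests)
```

Only the segment listed in `to_resend` would need to be retransmitted; the remaining segments are retained as received.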
Turning now to
In one embodiment, the transcriptor 2128 will use the metadata collected by the SmartTracker™ module 2118 and stored in the metadata files 2142 of the genome data repository 2140 to form an aggregated script of associated information. Based on classification schemes which may be continuously updated and curation of such information within the metadata files 2142 by, for example, subject matter experts, the transcriptor 2128 assembles one or more scripts 2160 of stratified, highly-relevant clinical and research data. This assembling may include, for example, parsing fields of the metadata files 2142 associated with sequence data files 2144 determined to be relevant to a query 2150 in order to identify clinical and/or biological information related to the query 2150. This assembling may further include, for example, evaluating the sequence-related upload/download activities of the requesting actor 2142X. Once such clinical and/or biological information has been identified and such upload/download activity evaluated, the relevance of the information may be quantified and ranked and a script 2160 containing such information is provided to the requesting actor 2142X.
The information within the metadata files 2142 will typically include one or more tags identifying the corresponding sequence data to which such information relates as well as relevant annotations and abstracted data. The information within the metadata files 2142 or BioIntelligence headers will typically include, without limitation, information from pathology reports associated with the corresponding sequence data such as, for example, gross and microscopic descriptions, diagnosis, tumor size and grade. The metadata files 2142 may also include information concerning a patient's blood report that is relevant to a given diagnosis, as well as treatment options relevant to such diagnosis. The information fields relating to such blood report will generally include, for example, red blood cell (RBC), leukocyte, and platelet or thrombocyte counts, hemoglobin concentration, hematocrit measures, erythrocyte size test and mean corpuscular measures. The metadata files 2142 may also include information concerning a patient's cytology report indicating, for example, the presence or absence of atypical cells and/or malignant dysplasia.
Metadata Attributes Associated with Genome Data
Set forth below is a representative list of the type of exemplary information fields which may be included within the metadata files 2142 or otherwise stored within the genome data repository 2140 as associated BioIntelligence header information.
Molecular Metadata Attributes
- ID or UUID: Universal Unique ID corresponding to a sequence data file or to other information related to the sequence information of a sequence data file
- Disease: The diseases which are associated with a sequence data file
- Cell: Cell or tissue type used to prepare analyte
- CNV: Relevant information on copy number variation
- SV: Related structural variants and chromosomal rearrangements
- SNP: SNPs associated with the diseases
- microRNA: Correlated microRNA expression information
- mRNA: Differential expression of associated genes
- Splice: Any information on splice variants and alternative splicing
- Methylation: DNA methylation, hetero chromatin and Methyl-Seq information
- Pathway: Information on known or predicted pathways
- Gene: Information on known or predicted genes
- Activity: Molecular activities in the related pathway; kinase, methylation, phosphorylation
- Regulation: Mutations in known or predicted regulatory regions
- Exogenous: Relevant microbial genome information
- Mobile: Information on transposable DNA elements
- Repeats: Available information on any tandem or interspersed DNA repeat sequences associated with the disease
- Protein: Information on body fluids protein concentration and activity
Age:
Tumor size:
Tumor grade: Cellular differentiation
Tumor stage:
Tumor behavior:
Origin: Organ, tissue, cell
Node status: Positive or negative
Hormone receptor: Positive or negative
Laboratory procedures ordered:
- DNA preparation
- RNA purification
- DNase treatment
- PCR amplification
- cDNA purification
- microarray hybridization; scanning
- next generation sequencing
- raw reads
- sorting
- align and map
- calling variants
- clinical associations
- molecular associations
In one embodiment some or all of the information within the genome data repository 2140 is linked on the basis of UUIDs corresponding to ones of the sequence data files 2144. For example, consider the exemplary case in which a sample of a cancerous tumor is taken from the patient. In this case a first UUID could be assigned to the tumor sample and stored within the genome data repository 2140 in association with information relating to the tumor (e.g., size of tumor, date taken, procedures performed). This first UUID could also be linked to medical or other records associated with the patient.
Based upon tissue from the tumor sample, various analytes (e.g., DNA, RNA) may be derived and purified. Each of these purified analytes may then also be associated with a different UUID, all of which are linked to the first or primary UUID associated with the tumor sample itself. Various information relating to each such analyte (e.g., concentration of the analyte within the test tube or other analyte repository, name or other information identifying the individual responsible for preparing the analyte sample) may be stored in connection with the UUID corresponding to the analyte. An aliquot of the solution of one such purified analyte (DNA, RNA, etc.) may then be obtained, assigned a UUID linked to, for example, either or both of the UUID of the analyte solution and the first or primary UUID of the tumor sample. The aliquot may then be provided to a sequencing machine and another UUID assigned to the resultant sequence data. In addition to base-pair sequence data, such sequence data may include information relating to, for example, the machine used to perform the sequencing and the individual(s) responsible for operating such machine at the time of the sequencing. In addition, variant calls may be made with respect to such sequence information and the corresponding sequence variants may be assigned UUIDs linked to the UUID of such sequence information.
In one embodiment the Smart Repository™ 2110 provides the sequence data generated by the sequencing machine to another node of a biological data network in which the Smart Repository™ 2110 is included and such node performs variants call processing to determine the sequence variants corresponding to one or more portions of the sequence data and provides such sequence variants to the Smart Repository™ 2110. Each of these sequence variants may then be assigned a UUID linked to the underlying sequence data and stored within the genome data repository 2140. Similarly, either the underlying sequence data or the associated sequence variants may be correlated with, for example, drug efficacy information. Such correlated information may be assigned one or more UUIDs linked to the UUIDs of the underlying sequence data and/or the sequence variants and stored within the genome data repository 2140.
As a consequence of this linked relationship among data records within the genome data repository 2140, an actor 2140X may submit a query to the repository 2140 identifying one or more UUIDs (e.g., the UUID relating to a particular tumor sample), and receive information relating to some or all of the data records associated with UUIDs linked to the identified UUID(s). The particular fields of the records within the repository 2140 which are evaluated by the transcriptor 2128 in response to a query in order to identify a relevant set of UUIDs to be returned in response to such a query will generally be dependent upon the subject matter of the query. For example, the transcriptor 2128 would evaluate attribute information from a different set of fields within the metadata files 2142, BioIntelligence headers or other records stored within the repository 2140 in response to a query relating to breast cancer than would be evaluated in response to a query relating to prostate cancer. This difference might range from particular types of sequence variants and modifications associated with certain cancers, to the pertinent clinical information that may be taken from the patient's laboratory reports.
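The linked relationship among UUIDs described above can be pictured as a graph in which each record is a node and each link is an edge. The following minimal sketch, offered only as an illustration, retrieves every UUID reachable from a queried UUID; the in-memory dictionary and the function names `link` and `linked_records` are assumptions and do not reflect how the genome data repository 2140 is actually implemented.

```python
import uuid
from collections import deque


def link(links, first, second):
    """Record a bidirectional link between two UUIDs."""
    links.setdefault(first, set()).add(second)
    links.setdefault(second, set()).add(first)


def linked_records(links, start):
    """Return all UUIDs reachable from `start` through one or more links."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


# Tumor sample -> analyte -> aliquot -> sequence data -> sequence variant.
tumor, dna, aliquot, sequence, variant = (str(uuid.uuid4()) for _ in range(5))
unrelated = str(uuid.uuid4())

links = {}
link(links, tumor, dna)
link(links, dna, aliquot)
link(links, aliquot, sequence)
link(links, sequence, variant)

related = linked_records(links, tumor)
```

A query identifying the UUID of the tumor sample would, in this sketch, return the UUIDs of the analyte, aliquot, sequence data and variant records, while leaving out records that are not linked to it.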
Referring to
Turning now to
The GeneTransfer Executive module 2310 includes a transactor 2340, a GeneTransfer module 2344, access control manager 2348 and an encryption engine 2352. In one embodiment the GeneTransfer module 2344 generates packetized biological data units in the manner described herein. These packetized biological data units are then encrypted by the encryption engine 2352 prior to being sent by a network interface 2360 to a requesting client actor (not shown).
The Smart Repository™ 2300 further includes a genome data repository 2370, metadata storage 2374 and a repository of prior knowledge 2378. The network interface 2360 also facilitates the transfer of genomic sequence data, metadata and prior knowledge between, on the one hand, the genome data repository 2370, metadata storage 2374 and repository of prior knowledge 2378 and, on the other hand, the GeneTransfer Executive 2310 and the transcriptor 2320.
Attention is now directed to
The actor 2400 may also include one or more network connections 2440 configured to send or receive biological data, sequences, instruction sets, or other data or information to and from Smart Repositories or other databases or computer systems. The network connection 2440 may allow users to receive uncompressed or compressed biological sequences from, for example, the Smart Repository™ 2300 as well as send uncompressed or compressed sequences. Network connection 2440 may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies as are known or developed in the art.
Memory space 2470 may be configured to store data as well as instructions for execution on the processor 2470 to implement the methods described herein. In particular, memory space 2470 may include a GeneTransfer client module 2472 for transferring genomic sequence information to and from the GeneTransfer module 2344 within the Smart Repository™ 2300. Memory space 2470 may further include an operating system (OS) module 2474, a data module 2476 configured to temporarily store sequence data and/or associated attributes or metadata, and a decryption module 2478 for decrypting encrypted biological data units or other encrypted information received from the Smart Repository™ 2300.
Attention is now directed to the process flow diagram of
The process 2500 is initiated in response to a query sent by the actor 2124 to one or more Smart Repositories, such as the Smart Repository™ 2110 (stage 2501). It should be understood that in certain embodiments the actor 2124 may submit requests to a single Smart Repository™, such as the Smart Repository™ 2110. In such embodiments the Smart Repository™ 2110 may be capable of responding to the request directly. Alternatively, the Smart Repository™ 2110 may parse the request, send corresponding requests to one or more other Smart Repositories included within a biological data network accessible to the Smart Repository™ 2110, and forward a response to the request to the actor 2124 based upon information included within the Smart Repository™ 2110 and/or provided by such other Smart Repositories. In other embodiments the actor 2124 may send the request to a group of Smart Repositories and further process the set of results received from this group.
The query received during stage 2501 may exhibit varying degrees of specificity. For example, the query could request that the Smart Repository™ 2110 return information relating to all of the whole genome sequences included within the Smart Repository™ 2110 (or available within other Smart Repositories included within a biological data network in which the Smart Repository™ 2110 is also included) associated with a diagnosis of prostate cancer. Somewhat more specifically, the query received during stage 2501 could request generation of a complete list of all of the genome sequence files 2144 (e.g., BAM files) and associated ancillary metadata files 2142 (e.g., XML files) that have been submitted to the Smart Repository™ 2110 by a particular actor 2124 (e.g., a genome sequencing center) during a certain time frame. In this case the scripted response 2160 to the query could comprise a list of those UUIDs representative of all such genome sequence files 2144 and associated ancillary metadata files 2142.
In other cases the query received during stage 2501 may identify a particular disease type. In this regard the query could request information relating to the genome sequence files 2144 associated with tissue from individuals that have been diagnosed with a certain disease type and sequenced within a particular time period at a specific sequencing center. As a specific example of this case, the query could request all sequence files generated at the Broad sequencing center based upon patients diagnosed with prostate cancer which were uploaded to a Smart Repository™ within a particular biological data network between Jun. 1, 2011 and Aug. 31, 2011. In this case the expected response would be a list of those UUIDs representative of all the genome sequence files and metadata files (e.g., XML files) relating to patients diagnosed with prostate cancer.
In a stage 2502, the SmartTracker™ 2118 evaluates the received query and identifies all of the sequence data files 2144 and metadata files 2142 within the Smart Repository™ 2110 encompassed by such query. The SmartTracker™ 2118 may also identify all other sequence data files and metadata files accessible within any biological data network(s) in which the Smart Repository™ 2110 is included that are requested by the query. In one embodiment the SmartTracker™ 2118 identifies the appropriate genome sequence data files and associated metadata files by tracking and parsing the attribute fields of such metadata files in accordance with the received query. For example, in cases in which the received query indicates an interest in sequence data files from individuals diagnosed with prostate cancer which have been uploaded to a particular sequencing center during a particular time, the SmartTracker™ module 2118 would evaluate the attribute fields of all available metadata files relating to these parameters and generate a list of UUIDs corresponding to the metadata files and the associated genome sequence data files (stage 2503).
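The evaluation of attribute fields against a received query may be sketched as follows. The field names (`disease_type`, `center`, `uploaded`), the placeholder identifiers and the in-memory list of records are hypothetical; the disclosure does not specify a particular metadata schema or query interface.

```python
from datetime import date

# Hypothetical metadata records; "uuid-1" etc. stand in for real UUIDs.
METADATA = [
    {"uuid": "uuid-1", "disease_type": "prostate cancer",
     "center": "Center A", "uploaded": date(2011, 6, 15)},
    {"uuid": "uuid-2", "disease_type": "prostate cancer",
     "center": "Center A", "uploaded": date(2011, 9, 2)},
    {"uuid": "uuid-3", "disease_type": "prostate cancer",
     "center": "Center B", "uploaded": date(2011, 7, 1)},
    {"uuid": "uuid-4", "disease_type": "colorectal cancer",
     "center": "Center A", "uploaded": date(2011, 7, 20)},
]


def matching_uuids(records, disease_type, center, start, end):
    """Return UUIDs of records whose attribute fields satisfy every query parameter."""
    return [
        record["uuid"]
        for record in records
        if record["disease_type"] == disease_type
        and record["center"] == center
        and start <= record["uploaded"] <= end
    ]


uuid_list = matching_uuids(
    METADATA, "prostate cancer", "Center A", date(2011, 6, 1), date(2011, 8, 31)
)
```

Here only the first record satisfies the disease type, sequencing center and time window together, so the generated list contains a single UUID.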
In one embodiment the list of UUIDs generated by the SmartTracker™ module 2118 is filtered by the entitlement control module 2140 based upon, for example, patient consent, subscription parameters and the like (stage 2503A). The filtered list of UUIDs produced by the entitlement control module 2140, which comprises an initial, specific response to the query received during stage 2501, is then sent to the requesting actor 2124 (stage 2504). In other embodiments the list of UUIDs is not filtered prior to being provided to the requesting actor 2124. However, in this case the entitlement control module 2140 enforces conditional access rules when the requesting actor 2124 attempts to download the genome sequence data files or metadata files corresponding to any of the listed UUIDs. That is, the requesting actor 2124 is permitted to download only those sequence-related or metadata files which the entitlement control module 2140 determines that such actor 2124 is entitled to access.
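A minimal sketch of the filtering performed during stage 2503A follows. It assumes that consent is recorded as a mapping from each UUID to the set of actors permitted to access the corresponding file and that subscription status is a simple flag; both representations, and the name `filter_entitled`, are illustrative assumptions only.

```python
def filter_entitled(uuids, actor_id, consents, has_subscription=True):
    """Keep only the UUIDs that `actor_id` is entitled to access.

    `consents` maps a UUID to the set of actor IDs granted access by the
    file's owner (a hypothetical representation of patient consent).
    """
    if not has_subscription:
        return []
    return [u for u in uuids if actor_id in consents.get(u, set())]


uuids = ["uuid-1", "uuid-2", "uuid-3"]
consents = {
    "uuid-1": {"actor_a"},
    "uuid-2": {"actor_b"},
    "uuid-3": {"actor_a", "actor_b"},
}

filtered = filter_entitled(uuids, "actor_a", consents)
lapsed = filter_entitled(uuids, "actor_a", consents, has_subscription=False)
```

The same check could instead be applied at download time, which corresponds to the alternative embodiment in which the list is returned unfiltered and access rules are enforced per file.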
At stage 2504A, the requesting actor 2124 prompts the transcriptor 2128 to initiate a process of generating a script of supplementary information related to the subject matter of the initial, specific response provided to the requesting actor 2124 during stage 2504. In one embodiment the requesting actor 2124 automatically prompts the transcriptor 2128 to initiate such process upon receiving the initial, specific response during stage 2504, provided the requesting actor 2124 has expressed a preference to receive such an information script (either as part of the request received during stage 2501 or otherwise). In other embodiments stage 2504A is initiated only after the requesting actor 2124 has received the initial, specific response during stage 2504 and subsequently explicitly requested the script of supplementary information.
In order to generate the script of supplementary information, the transcriptor 2128 evaluates the query received during stage 2501 and the initial, specific response delivered during stage 2504. Next, in a stage 2505, the transcriptor 2128 initiates a script request by commencing a process for identifying a set of highest ranking attributes (HRAs) inherent within the metadata files returned as part of the initial, specific response during stage 2504. This could involve, for example, determining the relative frequency at which various attributes appear within the metadata files returned during stage 2504. For example, if a particular attribute appears in every one of the metadata files subsequently returned during stage 2504, then such attribute would likely be included among the set of HRAs associated with the corresponding query received during stage 2501. Based upon an evaluation of, for example, the relative frequency at which various attributes appear within the metadata files returned during stage 2504, a set of HRAs may be determined by the transcriptor 2128. In certain embodiments other considerations may bear upon whether a particular attribute is included among the set of HRAs corresponding to a given query. For example, the distribution of the particular attribute information as it relates to the universally unique identifier (UUID) and the strength of the curated evidence associated with such attribute information may also be considered by the transcriptor 2128 when determining a set of HRAs.
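The frequency-based portion of this determination may be sketched as follows. The sketch ranks attributes by the fraction of returned metadata files in which they appear and keeps those at or above a threshold. The threshold value, the attribute labels and the function name are hypothetical, and the additional considerations noted above (distribution across UUIDs, strength of curated evidence) are not modeled.

```python
from collections import Counter


def highest_ranking_attributes(metadata_files, min_fraction=0.75):
    """Return attributes present in at least `min_fraction` of the files.

    `metadata_files` is a list of attribute collections, one per file.
    Results are ordered by descending frequency, then alphabetically.
    """
    total = len(metadata_files)
    if total == 0:
        return []
    counts = Counter()
    for attributes in metadata_files:
        counts.update(set(attributes))  # count each attribute once per file
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, count in ranked if count / total >= min_fraction]


files = [
    {"PTEN deletion", "PSA>4"},
    {"PTEN deletion"},
    {"PTEN deletion", "smoker"},
    {"PSA>4"},
    {"PTEN deletion"},
]
hras = highest_ranking_attributes(files)  # "PTEN deletion" appears in 4 of 5 files
```

With these illustrative inputs only “PTEN deletion” clears the threshold; lowering `min_fraction` would admit less frequent attributes.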
In addition to parsing the metadata files returned during stage 2504 in order to determine a set of HRAs, during stage 2506 the transcriptor 2128 will typically also evaluate those metadata files which are generally encompassed by the query received during stage 2501 but which were not identified during stage 2503 because of other limitations or constraints present within the query. For example, in the case in which a query received during stage 2501 requests a list of sequence files associated with a diagnosis of prostate cancer which were uploaded to the Smart Repository™ 2110 by a particular sequencing center during a particular time window, the transcriptor 2128 may nonetheless evaluate the attribute information included within those metadata files associated with a diagnosis of prostate cancer which are also associated with sequence information uploaded by other than the specified sequencing center and/or were uploaded outside of the particular time window. Similarly, the transcriptor may evaluate metadata files associated with a diagnosis of prostate cancer which are stored within other Smart Repositories as part of the process of determining HRAs corresponding to a particular query. In this way the set of HRAs may be determined to include attributes not otherwise known to be associated with a disease type specified within a query (e.g., prostate cancer) but which in fact are highly correlated to known cases of such disease type.
The HRAs may or may not be within a field identified by the query received during stage 2501. In one embodiment ranking of the attributes during stage 2668 determines which information fields and sequence data files, including those files identified in the initial, specific response to the received query, may be most relevant to the primary subject of the query received during stage 2501. For example, in the case in which the received query specifically requests a list of sequence files relating to prostate cancer, all of the sequence files identified in the initial, specific response will be associated with metadata files containing the attribute “prostate cancer” within the “disease type” field. However, there could be instances in which a specific attribute of a metadata file such as, for example, blood PSA level, could be indicated to be significantly above a concentration level associated with a high risk for prostate cancer (e.g., 4 ng/ml), but in which such metadata file does not include an attribute indicating a diagnosis of prostate cancer and the associated individual exhibits no other symptoms of prostate cancer. Accordingly, even though a query received during stage 2501 having a primary subject of “prostate cancer” may not have included “blood PSA level” as a parameter, a blood PSA level in excess of 4 ng/ml could be deemed to be an HRA with respect to such query because of its correlation with prostate cancer. As another example, there could be cases associated with individuals which have developed a prostate cancer tumor in which one attribute of a metadata file relating to fluorescence in situ hybridization (FISH) reflects a “PTEN deletion” while another such attribute indicates a normal or low PSA blood level. Such a PTEN deletion could then be determined by the transcriptor 2128 to be an HRA.
As a consequence, it would be expected that most of the genome sequence data files on the list of UUIDs returned in a response to a query for which a PTEN deletion is determined to be an HRA will have metadata attributes indicative of such a deletion.
The HRAs may be considered to be data points that are heavily weighted based on a voting scheme and a set of rules that may be continuously modified based on new information and knowledge. In this regard, an HRA may be considered to be specific to a particular query and the response to such query.
Again referring to
In the exemplary case in which the primary subject of the query received during stage 2501 was determined to be prostate cancer, at least all of the metadata files associated with individuals who have been clinically diagnosed with prostate cancer by an oncologist will be evaluated by the transcriptor 2128. In one embodiment each such file would include the attribute “prostate cancer” within a “disease type” field of the file. However, simply because an individual is not clinically diagnosed as having prostate cancer using traditional approaches does not mean that such individual is not advancing towards this disease condition at the molecular level. As a consequence, sequence information derived from analytes produced from the tissue of such an individual may include cellular cues or biomarkers which contribute significantly to such disease condition and which therefore may be relevant to determining whether such condition exists in a given individual.
In a stage 2508, the transcriptor 2128 identifies those metadata files which are associated with a primary subject of the query received during stage 2501 (e.g., those metadata files having the attribute “prostate cancer” in the “disease type” field) but which do not include any of the HRAs determined based upon the received query request and the initial, specific response to such request. Again, in one embodiment the metadata files include the metadata files 2142 as well as the metadata files stored within other repositories networked with the Smart Repository™ 2110. In this exemplary case it is assumed that when individuals have been diagnosed with a particular disease (e.g., prostate cancer), an attribute reflecting this diagnosis is included within a “disease type” or similar field in the applicable metadata file. Accordingly, those patients associated with metadata files identified during stage 2508 may be considered to be “rare variants” in the sense that such patients have been diagnosed with a particular disease condition but exhibit none of the HRAs generally associated with such condition.
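The identification of such “rare variants” may be sketched as a simple set test: a file qualifies when its “disease type” field matches the primary subject of the query and it shares no attribute with the set of HRAs. The record layout, placeholder identifiers and function name below are hypothetical.

```python
def rare_variant_files(metadata_files, disease, hras):
    """Return UUIDs of files diagnosed with `disease` that contain none of the HRAs."""
    hra_set = set(hras)
    return [
        record["uuid"]
        for record in metadata_files
        if record["disease_type"] == disease
        and not (hra_set & set(record["attributes"]))
    ]


metadata_files = [
    {"uuid": "u1", "disease_type": "prostate cancer",
     "attributes": {"PTEN deletion"}},
    {"uuid": "u2", "disease_type": "prostate cancer",
     "attributes": {"smoker"}},
    {"uuid": "u3", "disease_type": "colorectal cancer",
     "attributes": {"smoker"}},
    {"uuid": "u4", "disease_type": "prostate cancer",
     "attributes": {"PSA>4", "smoker"}},
]

rare = rare_variant_files(metadata_files, "prostate cancer", ["PTEN deletion", "PSA>4"])
```

In this sketch only the file `u2` carries the diagnosis while exhibiting none of the HRAs, so it alone is flagged.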
In a stage 2509, the transcriptor 2128 aggregates the files identified during stages 2506, 2507 and 2508 (or the UUIDs corresponding to such files) in order to form a script of supplementary information relating to the query received during stage 2501 and the initial, specific response to the query provided during stage 2504. Once the script of supplementary information has been formed by the transcriptor 2128 it is sent to the requesting actor 2124 (stage 2510).
The above-described script of supplementary information advantageously enables the requesting actor to access potentially obscure but nonetheless relevant information that was not explicitly requested nor included within the initial, specific response to the request received during stage 2501. Such advantageous results are facilitated by the dynamic characterization of the original query and subsequent statistical correlation analysis of the various fields of the ancillary metadata associated with the genome sequence data from the patient of interest carried out in the manner described above. However, other approaches to developing such a script of supplemental information based upon an original query and/or initial query response may be apparent to those skilled in the art in view of the teachings and exemplary approaches described herein. Moreover, it should be appreciated that the schema and process disclosed in
Turning now to
Turning now to
Attention is now directed to
Referring now to
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
In one or more exemplary embodiments, the functions, methods and processes described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer.
By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
It is understood that the specific order or hierarchy of steps or stages in the processes and methods disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.
Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Additionally, the scope of the invention includes hardware not traditionally used, or thought of as having use, within general purpose computing, such as graphics processing units (GPUs).
The steps or stages of a method, process or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
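By way of illustration only, a software module of the kind referred to above might carry out the comparison of a received genome segment against a reference sequence and the identification of sequence variants along the following lines. This is a minimal sketch and not a definitive implementation: it assumes the segment and the reference are already aligned and of equal length, it reports substitutions only, and the names Variant and identify_variants are hypothetical and do not appear elsewhere in this disclosure.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """One substitution found between a segment and the reference (hypothetical type)."""
    position: int   # 0-based offset within the aligned segment
    reference: str  # base present in the reference sequence
    observed: str   # base present in the received segment


def identify_variants(segment: str, reference: str) -> List[Variant]:
    """Compare a pre-aligned segment to a reference and list every mismatch.

    Assumption: both sequences are aligned and equal in length, so
    insertions and deletions are outside the scope of this sketch.
    """
    if len(segment) != len(reference):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return [
        Variant(i, ref_base, seg_base)
        for i, (seg_base, ref_base) in enumerate(zip(segment, reference))
        if seg_base != ref_base
    ]


# Example: two substitutions, at offsets 4 and 7.
variants = identify_variants("ACGTTACA", "ACGTAACG")
```

In a deployed system the resulting list of variants would typically form the payload of a request sent to another network node for information relating to those variants.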
Certain of the disclosed methods may also be implemented using a computer-readable medium containing program instructions which, when executed by one or more processors, cause such processors to carry out operations corresponding to the disclosed methods.
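By way of illustration only, program instructions corresponding to the disclosed request-processing method (receiving a request at a first network node, performing a first processing operation, determining whether processing is complete, and either responding or passing the results to a second network node) might be organized as sketched below. The Node class, the select_peer routing rule and the string-based payloads are assumptions introduced solely for this example; an actual embodiment may select the second node and represent biological data units quite differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical signature: an operation takes a payload and returns
# (result, True if processing of the request is complete).
Operation = Callable[[str], Tuple[str, bool]]


@dataclass
class Node:
    name: str
    operation: Operation
    peers: Dict[str, "Node"] = field(default_factory=dict)

    def select_peer(self, result: str) -> "Node":
        # Assumed selection rule: the text before the first ':' in the
        # result names the peer that should continue the processing.
        key = result.split(":", 1)[0]
        return self.peers[key]

    def handle(self, request: str) -> str:
        # Perform this node's processing operation on the request.
        result, complete = self.operation(request)
        if complete:
            return result  # response returned toward the client
        # Processing is incomplete: choose a second node based on the
        # results and pass the results on to it.
        return self.select_peer(result).handle(result)


annotator = Node("annotator", lambda r: (r + " -> annotated", True))
aligner = Node(
    "aligner",
    lambda r: ("annotator:" + r + " -> aligned", False),
    peers={"annotator": annotator},
)
response = aligner.handle("sample-001")
```

Here the first node ("aligner") cannot complete the request by itself, so it forwards its results to the node it selects ("annotator") and relays the completed result back as the response.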
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is intended that the following claims and their equivalents define the scope of the disclosure.
Claims
1. A method for facilitating processing of a request in a biological data network comprised of a plurality of biological data units stored at a plurality of network nodes, the method comprising:
- receiving, at a first network node included within the plurality of network nodes, the request from a client device wherein the first network node is configured to communicate with other network nodes included within the plurality of network nodes;
- performing a first processing operation with respect to at least one of the biological data units based upon the request;
- determining that the processing of the request is complete; and
- sending, from the first network node, a response to the client device.
2. The method of claim 1 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance in a tag associated with the biological sequence data.
3. The method of claim 1 further including:
- determining, based upon results of the first processing operation, that the processing of the request is incomplete;
- selecting, based upon the results of the first processing operation, a second network node of the plurality of network nodes to perform a second processing operation; and
- sending, from the first network node, the results of the first processing operation to the second network node.
4. A method for facilitating processing of genomic sequence information in a biological data network including a plurality of biological data units stored at a plurality of network nodes, comprising:
- receiving, at a network node included within the plurality of network nodes, a segment of a genome sequence of an organism;
- comparing the segment of the genome sequence to a reference sequence;
- identifying sequence variants between the segment of the genome sequence and the reference sequence;
- sending, from the network node, a request for information relating to the sequence variants; and
- receiving, from another network node included within the plurality of network nodes, the information relating to the sequence variants.
5. The method of claim 4 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
6. A method for facilitating processing of a disease-related query within a biological data network including a plurality of biological data units stored at a plurality of network nodes, the method comprising:
- receiving, at a first network node of the plurality of network nodes, a query relating to a specified disease and a genomic sequence associated with the query;
- identifying, relative to a control sequence, any variant alleles within the genomic sequence;
- sending information identifying the variant alleles from the first network node to a second network node of the plurality of network nodes; and
- receiving, at the first network node, information relating to the variant alleles.
7. The method of claim 6 further including sending a response to the query based upon the information relating to the variant alleles.
8. The method of claim 1 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
9. The method of claim 6 further including:
- determining, at the first network node, that processing of the query is incomplete;
- selecting a second network node of the plurality of network nodes to perform a processing operation based upon the query; and
- sending, from the first network node, the information relating to the set of variant alleles to the second network node.
10. A network-based method for facilitating processing of a disease-related query, the method comprising:
- receiving, at a first network node, a query relating to a specified disease and a genomic sequence associated with the query;
- identifying, relative to a control sequence, any variant alleles within the genomic sequence;
- sending information identifying the variant alleles from the first network node to a second network node;
- receiving, at the first network node, pharmacological response data associated with those of the variant alleles included within genes associated with the specified disease; and
- sending a response to the query based upon the pharmacological response data.
11. A method for facilitating processing of a disease-related query within a biological data network, the method comprising:
- receiving, at a network node, information identifying variant alleles within a genomic sequence associated with a query relating to a specified disease;
- performing, at the network node, a statistical correlation analysis in order to identify those of the variant alleles included within genes associated with the specified disease; and
- sending results of the statistical correlation to another network node for further processing.
12. The method of claim 11 further including receiving, at the network node, results of the further processing.
13. The method of claim 11 wherein the biological data network includes a plurality of biological data units stored at a plurality of network nodes.
14. The method of claim 13 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
15. A method for facilitating the processing of biological data within a network including a plurality of nodes, the method comprising:
- receiving, at a first node of the plurality of nodes, a request to process the biological data;
- performing a first processing operation with respect to at least a DNA-specific layer of the biological data based upon the request; and
- sending, to a second node of the plurality of nodes, results of the first processing operation wherein the second node is configured for processing of an RNA-specific layer of the results.
16. The method of claim 15 further including selecting, based upon the results of the first processing operation, the second node to perform the processing of the RNA-specific layer of the result.
17. The method of claim 15 further including receiving, at the first node, results of the processing of the RNA-specific layer.
18. A network node for use within a biological data network comprised of a plurality of biological data units, the network node comprising:
- a network interface;
- an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a request from a client device;
- a processing module configured to perform a first processing operation with respect to at least one of the biological data units based upon the request and to determine that the processing of the request is complete; and
- a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of a response to the client device.
19. The network node of claim 18 wherein the input packet processor includes a classification module and a policy module.
20. The network node of claim 18 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
21. The network node of claim 20 further including sequence location tables containing information indicative of network locations at which are stored the biological sequence data included within ones of the plurality of biological data units.
22. A network node for use within a biological data network comprised of a plurality of biological data units, the network node comprising:
- a network interface;
- an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive an input packet including a segment of a genome sequence of an organism;
- a processing module configured to compare the segment of the genome sequence to a reference sequence and to identify sequence variants between the segment of the genome sequence and the reference sequence; and
- a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of a request for information relating to the sequence variants;
- wherein the network interface is configured to receive, from another network node included within the plurality of network nodes, the information relating to the sequence variants.
23. The network node of claim 22 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
24. A network node for use within a biological data network comprised of a plurality of biological data units, the network node comprising:
- a network interface;
- an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a query relating to a specified disease and a genomic sequence associated with the query;
- a processing module configured to identify, relative to a control sequence, any variant alleles within the genomic sequence; and
- a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of information identifying the variant alleles to another network node;
- wherein the network interface is configured to receive, from the another network node, information relating to the variant alleles.
25. The network node of claim 24 wherein the transmit controller further controls sending a response to the query based upon the information relating to the variant alleles.
26. The network node of claim 24 wherein each of the plurality of biological data units includes a representation of biological sequence data and at least one attribute of biological relevance associated with the biological sequence data.
27. A network node for facilitating processing of a disease-related query, the network node comprising:
- a network interface;
- an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive a query relating to a specified disease and a genomic sequence associated with the query;
- a processing module configured to identify, relative to a control sequence, any variant alleles within the genomic sequence; and
- a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of information identifying the variant alleles to another network node and the sending of a response to the query based upon pharmacological response data associated with those of the variant alleles included within genes associated with the specified disease.
28. A network node for facilitating processing of a disease-related query, the network node comprising:
- a network interface;
- an input packet processor in electronic communication with the network interface, the input packet processor being configured to receive information identifying variant alleles within a genomic sequence associated with a query relating to a specified disease;
- a processing module configured to perform a statistical correlation analysis in order to identify those of the variant alleles included within genes associated with the specified disease; and
- a transmit controller module in electronic communication with the network interface, the transmit controller module controlling the sending of results of the statistical correlation to another network node for further processing.
Type: Application
Filed: Mar 9, 2012
Publication Date: Sep 13, 2012
Applicant: ANNAI SYSTEMS, INC. (Los Gatos, CA)
Inventors: Lawrence Ganeshalingam (Los Gatos, CA), Patrick Nikita Allen (Scotts Valley, CA)
Application Number: 13/417,184
International Classification: G06F 17/30 (20060101);