SYSTEMS AND METHODS FOR ENCODING GENOMIC GRAPH INFORMATION
The invention provides systems and methods for encoding specific portions of a genomic graph in a way that is useful for copying or moving those portions from one location to another. The genomic graph can comprise nodes connected by edges, wherein the nodes and edges are stored as edge objects in non-transitory memory of a computer system and wherein an edge within the graph represents a genomic sequence. In on embodiment, the a method of encoding a genomic graph comprises selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph. The method can further comprise encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the edge on the genomic graph.
Latest Seven Bridges Genomics Inc. Patents:
This application claims priority from U.S. Provisional Patent Application Ser. No. 62/338,706, filed on May 19, 2016, entitled “SYSTEMS AND METHODS FOR ENCODING GENOMIC INFORMATION”, U.S. Provisional Patent Application Ser. No. 62/400,388, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR SEQUENCE ENCODING”, U.S. Provisional Patent Application Ser. No. 62/400,411, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR METADATA STORAGE AND COMPRESSION”, and U.S. Provisional Patent Application Ser. No. 62/400,123, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR GENOMIC GRAPH INSERT SPECIFICATION”, the disclosures of which are hereby incorporated by reference.
TECHNICAL FIELDThe disclosure relates to biology, genomics, and medical genetics. In particular, the disclosure relates to methods of encoding genomic information represented by graphs.
BACKGROUNDStudying genomes can potentially reveal much about the health and history of organisms and entire species. Researchers can identify genes associated with cancer and other diseases through genomic studies, and genomics also plays an important role in forensics, genealogy, and agriculture, among other fields. A frequent approach to genomic studies includes sequencing DNA from a sample and comparing the resulting sequence reads to each other or to a known reference to characterize the genomes of organisms represented in the sample. Reference genomes often include whole genomes that have been sequenced and published, often available online. Variation among genomes has been represented by structuring genomic sequences as graphs.
Unfortunately, storing and analyzing genomic information has become impractical or impossible using conventional techniques. For example, as more genomes continue to be sequenced, the time required to store, query, and manage the new resulting information also increases. Many conventional compression techniques developed to transmit genomic information require custom or proprietary decoders, which further limits broad access. Representing genomes and genomic variations as graphs presents further complexities. Therefore, as the number of sequenced genomes and catalogued variations increases, analyzing and comparing the information quickly becomes intractable because of the required processing time or storage of such large data sets, especially when computational resources are limited.
Further, existing approaches to structuring genomic sequences into graphs restrict the usefulness of available genomic data. Due to the fact that graphs are inherently multi-dimensional, consisting of a web-like network of inter-related pieces in computer memory, it is difficult to copy parts of the graph such as specific genetic sequences from one location to another, because the multi-dimensional nature of graph topology does not lend itself easily to expression in a flattened, serialized form. Additionally, as graphs start to represent genetic variation among populations of individuals, the size of the graph becomes quite large, further hindering the ability to copy or transfer the graph from one location to another.
SUMMARYThe present disclosure describes various systems and methods for encoding genomic information. In some embodiments, the present disclosure describes encoding specific portions of a genomic graph in a way that is useful for copying or moving those portions from one location to another via a file or data stream. The genomic graph has nodes connected by edges such that paths through the graph represent genetic sequences. A specific genetic sequence is represented using coordinates that identify a starting position and an end position of the sequence within the graph. The coordinates are encoded into strings, or sequences, of bits that represent the data efficiently, and compactly, resulting in easy and rapid transfer between locations in computer memory.
Each encoded set of coordinates is thus a specification describing the removal, or insertion, of an edge in a genomic graph as that graph is being deconstructed from, or built up within, respectively, a location in computer memory. Thus an entire graph, or substantial portions of a graph, can be moved by iteratively specifying the removal of an edge or set of edges from the graph in one location and building up the graph, or portion thereof, in a new location by iteratively inserting a single edge representing the removed edge or set of edges according to the encoded specifications. The specification for each edge—using coordinates to describe the removal or insertion of that edge for a respective genome graph copy or re-build operation—may be compactly, encoded using a variable length encoding scheme into a string of bits. Since all of the edges of the graph or portion thereof are encoded iteratively, the entire graph or portion thereof is encoded as a string of bits. The string of bits is linear—or serial—in nature, and thus the genomic graph is linearized, or serialized. The serialized data structure is conducive to rapid transfer and compact storage in computer systems.
The invention includes random access methods useful to load or deserialize only portions of a serialized graph back into memory as node and edge objects. The fully-encoded graph exists as a serialized stream of bits with a variable-length format because each insertion is represented by a variable number of bits. The fully-encoded graph is provided with a random-access entry point map that aids in randomly accessing serialized portions of the graph for deserialization. The entry point map provides an indexing structure that indicates location of particular offsets of encoded edge insertion representations for a serialized graph. To identify the location of a particular insertion, a decoder consults the entry point map for the nearest located insertion. The decoder then continues decoding from that insertion until it identifies the desired insertion. Use of the entry point map greatly decreases the average complexity of random access.
In the encoding of the insert specifications, each edge may be described by coordinates that include at least a starting coordinate that defines a starting position of the edge within the genomic graph and may also include information that identifies an end position of the edge within the genomic graph. One insight of the invention is that some edges may be described by a starting coordinate and an end coordinate that share a common prefix but that each have a unique suffix. Such edges may be described by a specification that includes a component for the common prefix, a component for the unique suffix of the starting coordinate, and a component for the unique suffix of the end coordinate. Each component may be independently encoded using a variable-length encoding scheme.
Because a genomic graph, or portions of a genomic graph, can be described as a series of edge insertions that is compactly encoded into a stream of bits, methods of the invention can be used to transfer graphs or portions thereof from one location in memory to others, or between two separate computing systems. Where graphs are implemented using node and edge objects that are connected via spatial referencing in memory (e.g., via adjacency lists or index-free adjacency), due to the spatial referencing being coupled intrinsically to physical locations in memory, those graphs cannot easily be copied or moved according to existing methods. Methods of the invention allow such graphs to be effectively copied or moved by describing graphs as a series of edge insertions. Each edge insertion is given a specification that is compactly encoded as a stream of bits. The entire graph or portion thereof is thus a serialized data structure, which may be compactly and quickly stored or moved.
In certain aspects, the invention provides a method for describing a genetic sequence. The method includes providing a genomic graph that has nodes connected by edges. The nodes and edges are stored as objects in non-transitory memory of a computer system. An edge (or in some embodiments, a node) within the genomic graph represents a genetic sequence. An edge of the genomic graph is converted into a string, or sequence, of bits in which at least a first component of the string of bits stores at least a portion of a coordinate that defines a position of the edge within the genomic graph.
In related aspects, the invention provides a system for describing a genetic sequence. The system includes a processor coupled to a non-transitory memory subsystem. A genomic graph that has nodes connected by edges is stored in the memory subsystem. The nodes and edges are stored as objects in the non-transitory memory subsystem. An edge (or in some embodiments, a node) within the genomic graph represents a genetic sequence. The system includes instructions that when executed by the processor cause the system to convert an edge of the genomic graph into a string of bits in which at least a first component of the string of bits stores at least a portion of a coordinate that defines a position of the edge within the genomic graph.
In various embodiments of methods and systems herein, the string of bits provides a representation of the edge that includes a coordinate and may also include information that identifies a start position and an end position of the edge within the genomic graph. The information that identifies the start position and end position of the edge may be a pair of coordinates that define a starting position and an ending position of the inserted edge within the genomic graph. The string of bits may include a second component that includes at least a portion of a start coordinate and an end coordinate.
In certain embodiments of the systems and methods, the starting coordinate and the end coordinate share a common prefix and the string of bits includes a third component that includes the common prefix.
In some embodiments, the first component includes length bits indicating a length of the first component and representation bits that include the portion of the starting coordinate. The first component may further include flag bits that represent a type of variation (e.g., an insertion of one or more nucleotides, a deletion of one more nucleotides, or a substitution of one more nucleotides) between the edge and a reference genome. The second component may include second length bits, second flag bits, and second representation bits. Each component may be stored as a sub-string of the string of bits in which a first portion of the sub-string represents a length of the sub-string and a second portion of the sub-string includes stored information.
The specifying and storing steps may be performed to convert all, or a substantial portion of, the genomic graph into a linear string of bits comprising a representation of the plurality of edges. A plurality of edges from the genomic graph may be stored, each as its own string of bits. As will be described in further detail herein, some of the edges of the plurality of edges may be merged together as a single string of bits in order to increase efficiency, lit) some embodiments, an entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing coordinates.
In some embodiments, the first component further comprises flag bits that identify a variant relationship between the genetic sequence and a reference genome represented by a reference path through the genomic graph.
The disclosure further provides systems and methods for describing a specific path within a genomic graph by assigning coordinates that specify a start point and an end point within the genomic graph and describe a unique path from the start point to the end point. Methods and systems of the disclosure are applicable to genomic graphs that store genomic sequence from multiple sample genomes as a single unified graph, and represent genetic variation as divergent paths within the graph while representing genetic similarities using convergent paths. Because the assigned start and end coordinates describe a unique segment within the graph, specific information and paths within the sequence graphs are easily transmitted between users and between different computer systems. For example, as genomic information, such as a variant, is identified (e.g., by mapping sequenced reads from a DNA sequencer to a reference graph), that information is transmitted to another party as a path within the graph without having to transmit or recreate the entire reference graph.
Additionally, some embodiments provide a variable-length encoding scheme by which coordinates (as described above) are compactly encoded in a binary format. This allows for an exchange of information from graphs while reducing the computational resources required for efficient and meaningful use of the sequence (e.g., genomic) data. To that end, some embodiments operate to efficiently encode and byte-align sequence data by, at least in part, assigning a coordinate to the sequence data that represents a path through the graph. Simply put, the assigned. coordinate can be an integer or plurality of integers that represent a location of a specific sequence within a graph, even as the sequence graph continues to rapidly expand with new information.
Beyond the use of coordinates to specify a path segment within a graph, methods and systems described herein may also encode coordinates using a minimal number of bytes necessary to preserve the accuracy of the original data. The disclosed embodiments provide for variable length encoding with lossless compression. A computer system analyzes the genomic information and chooses the minimal number of bytes required to store the information. For example, the system may receive genomic information in the form of a sequence read, identify a path segment within a sequence graph that matches that sequence read, and encode that segment by assigning a coordinate to the segment and compressing the coordinate using the variable length encoding scheme, which minimizes the number of bytes and embeds the number of bytes used (e.g., a length value) within the encoded information. The encoding is variable in that the system chooses the number of bytes based on the specific genomic information it is analyzing. By minimizing the number of bytes, the computer system achieves efficiencies in storage and speed while not losing data.
Variable-length encoding applies to a variety of formats of sequence data and genomic data, such as a multiple sequence alignment, a set of sequence read data, or a directed graph representing relationships among multiple genomic sequences. Since the variable length encoding may be applied to very large datasets, such as sequence reads, sequence alignments, sequence graphs (e.g., genomic graphs), very large genomic data sets may be encoded compactly. Since the variable length encoding embeds the number of bytes used (e.g., a length value) within the encoded information, a downstream decoder will not require prior knowledge of the number of bytes required for each component. Since the system can encode the data efficiently in terms of both compression ratio and time while also preserving the accuracy of the information, the size or required processing time are not limiting factors in analyses for medical genetics, genomic research, and next-generation sequencing (NGS).
Encoding schemes described herein may provide a solution for storing integers that efficiently encodes integers while remaining manageable. The encoding schemes may efficiently, represent a value and can accommodate auxiliary binary information (e.g., a set of flags), which can serve to identify the particular coding context used. Further, the encoding schemes present a compromise between encoding and decoding complexity and coding efficiency by using variable lengths, yet providing byte-aligned code, resulting in no need for additional padding or meaningless bytes between the various data structures. Advantageously, parameters can be adjusted to fit any application. Further, strings of encoded integers can be placed in sequence and can have different parameters, allowing for complex data structures and coordinates to be defined.
Compression techniques are available to reduce the amount of space required for a sequence graph. However, conventional compression techniques are generally suited for compression of text files, and thus, may not optimally compress genomic or graph related data. Using methods of the invention, the encoding process is efficient in terms of both compression ratio and time while also being lossless in terms of information accuracy.
In one aspect, the disclosure provides a computer-implemented method for encoding genetic sequences. The method includes obtaining a genomic graph representing genetic sequences from one or more organism, assigning a coordinate that defines a path from a first position to a second position within the genomic graph, and encoding the assigned coordinate using a variable length encoding scheme. The encoded coordinate includes length information.
In another aspect, the disclosure provides a system for specifying the genotype of an organism. The system includes a reference graph and instructions stored within a tangible memory subsystem. The reference graph includes nodes connected by edges into a plurality of paths. The plurality of paths each represents a variation from a reference sequence. When executed by the system, the instructions cause the system to assign a coordinate or pair of coordinates representing one or more of the plurality of paths. The number of paths depends upon the ploidy of the organism; for example a diploid organism which is heterozygous at a specific region within the sequence graph would actually require two concurrent paths through that region in order to specify the genotype. The instructions further cause the system to encode the coordinates using a variable length encoding scheme. The assigned coordinates define the path from a first position to a second position, and the encoded coordinates include length information about the size of the encoded representation. In certain implementations, the genotypes may be determined by evaluating the number of sequence reads aligned to particular paths within the reference graph.
In some embodiments, a variable length encoding scheme includes determining the number of bits to represent the coordinate, encoding the coordinate using the determined number of bits, and generating a header value that indicates a total number of octets representing the encoded coordinate. The encoded coordinate may be represented by at least one octet, the at least one octet being a header octet storing a header value. In some embodiments, the encoded coordinates are in a byte-aligned format. In some embodiments, the encoded coordinate is represented by a plurality of octets, at least one of the plurality of octets being a header octet including the header value.
In some embodiments, the genomic graph includes a plurality of nodes and a plurality of edges, wherein the nodes connected by the edges are stored in a tangible memory subsystem. Nodes and edges may be connected using adjacency lists including pointers to physical locations in the memory. In some embodiments, the genomic graph is provided within a tangible memory subsystem of a hardware computer system. In some embodiments, a processor is coupled to the tangible memory subsystem.
In some embodiments, encoded coordinates may be stored within the tangible memory subsystem of a hardware computer system. In some embodiments, an encoded coordinate can be decoded without prior knowledge of the total number of octets representing the encoded coordinate. Assigning a coordinate may include identifying a series of integers, where each integer is a component of a coordinate, and converting the series of integers to a binary format.
In some embodiments, a variable length encoding scheme also includes obtaining auxiliary binary information and encoding the auxiliary binary information using the variable length encoding scheme. In some embodiments, at least one of the plurality of octets includes the encoded auxiliary binary information.
In some embodiments, the genomic data includes a directed graph representing at least a portion of a plurality of genomes. The system may serialize the genomic data. In certain embodiments, encoding the characters within each of the plurality of segments transforms the directed graph into a sequence of bits. The system may transfer the sequence of bits serially between the memory subsystem and a second computer system. The system may support quick or random access. The system may write a table that includes a plurality of pairs <p, p′>, where p is an offset in the genomic data and p′ is an offset in the encoded characters, and allow characters of known position in the genomic data to be read from the encoded characters by reading from the table.
These and other aspects, features, and implementations, and combinations of them may be expressed as apparatus, methods, means or steps for performing functions, components, systems, program products, and in other ways.
Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Features and advantages of the claimed subject matter will be apparent from the following detailed description of embodiments consistent therewith, which description should be considered with reference to the accompanying drawings.
For a thorough understanding of the present disclosure, reference should be made to the following detailed description, including the appended claims, in connection with the above-described drawings. Although the present disclosure is described in connection with exemplary embodiments, the disclosure is not intended to be limited to the specific forms set forth herein. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient.
DETAILED DESCRIPTIONDescribed herein are systems and methods for describing a genetic sequence as an insertion of an edge to a genomic graph and compressing the description for compact storage and ease of transfer. Various embodiments of systems and methods disclosed herein may be referred to as Compressed Insert Specification (“CIS”) Format. Further, various methods and systems described throughout this disclosure provide an approach to encoding information (e.g., genetic information) that minimizes the number of required bytes, which is particularly advantageous when the information includes genomic or graph related data. For example, genetic information can be structured as a reference graph that can store genetic information from multiple individuals, or even multiple organisms. Using a minimal number of bytes, the invention encodes both coordinates and paths through a genomic reference graph using a variable number of bytes, while also storing information in the encoded information (e.g., in a header octet) about the total size of the total number of bytes. This minimizes wasted bytes as the total number of bytes is tuned to the specific information or query and simplifies the decoding process as a decoder is automatically provided with this key information.
One way to represent a genome is as a directed graph having a plurality of vertices and a plurality of directed edges in such a reference graph representation of a genome (e.g., a sequence graph or a genomic graph), genetic variations within species can be incorporated as divergent paths along the graph. Genetic similarities within a species can be represented as a single path. In other words, the data associated with genetic similarities must only be stored once in the sequence graph. Genomic data, such as nucleotide or protein sequences, may be associated with either the vertices or edges of the graph, and thus the genomic graph can be either vertex-based or edge-based. The direction of the edges indicates the allowable sequences that may be represented by following paths through the graph. Different paths through the graph may represent variation in sequences. An edge-based genomic graph can be built, for example, by first associating a single edge with a reference sequence. While the present disclosure illustrates edge-based graphs, however it should be noted that methods and systems described herein are equally applicable to vertex-based graphs.
In practice, a graph may be serialized and deserialized in order to store and load the graph in computer memory.
Variation of the genetic sequence can be described by adding subsequent edges to the is graph. An edge can be added to the genomic graph 201 by referring information specifying the edge, such as a sequence and a position on the genomic graph 201. As shown in this embodiment, the genetic sequence is associated with a single edge 209, and each letter within the sequence is associated with an increasing offset value starting from zero. Offset values may be used to identify particular positions within the graph for selecting desired insertion points of new edges, for example.
In certain embodiments, the offset values are retained as edges are segmented by the addition of new inserts. For example, as shown in
Further, as each edge is sequentially inserted into the graph, the new edge may also be associated with a unique identifier (e.g., a number that increments by one, relative to the identifier associated with the existing edge upon which the insertion is occurring). As previously inserted edges are segmented into multiple edges (i.e., by the addition of a new nodes in order to accommodate the inserted edge), these segmented edges may retain their initial identifier. Identifiers may be utilized later when determining an order for deconstructing the graph, for example. Typically a graph is deconstructed in the order in which it was constructed, i.e., the inserted edges are removed in the reverse order in which they were inserted.
The above-described method may also be reversed to serialize a genomic graph,
Identifying 509 an edge or set of edges to encode and remove can be performed in a variety of ways. In certain embodiments, the edge or set of edges may have an associated identifier (e.g., an identifier that was associated with the edge when it was first added to the graph). In these embodiments, the associated identifier indicates the order in which the edges were added to the graph. The most recently added edge or set of edges may then be selected for removal. Edges segmented during graph construction may retain the same identifier, and are thus selected together. Identifiers may be associated with edges as metadata, for example.
An ordering for edge removal and graph deconstruction can also be determined if no previously existing information is available. In certain embodiments, the ordering may be determined by identifying edges that were most recently added, i.e., in the reverse order of construction. For example, one may traverse the graph 201 to identify an edge or set of edges for removal. This can be performed by traversing the graph (e.g., by a depth-first search) and noting those paths through the graph which begin with nodes having multiple outgoing directed edges, end at nodes having multiple incoming edges, and have at least one directed edge specifying a linear path between them, without any intervening nodes. If no paths satisfy this rule, then the only remaining set of edges is linear (i.e., has no divergent paths) and can thus be selected as the final set of edges for encoding and removal. For example, as shown in the embodiment of
In other embodiments, an ordering for removal may be determined by identifying an order in which the edges may have initially been inserted into the graph and then annotating the graph with this information. In one embodiment, a first path is selected from the graph that traverses from a source vertex (i.e., the first vertex) to a sink vertex (i.e., the last vertex). This path is considered to be the backbone branch, and the edges and/or vertices in this path may be annotated with an initial identifier (e.g., 1 or 0), indicating that it should be inserted as a first edge insertion when creating the graph (or, if deconstructing the graph, the last to be encoded and removed). (The relevance of a backbone path, source vertex, and sink vertex are discussed in more detail below with respect to
Preferably, path selection identifies the longest available path available through the graph. For example, when selecting the backbone path, the graph may be traversed (e.g., via a depth-first search) to identify the longest path in terms of the number of symbols, edges, vertices, or other measure. Selecting the longest available path is beneficial because it reduces the number of subsequent edge insertions that need to be encoded and decoded, resulting in a more efficient and compact representation of the graph in serialized form. Accordingly, in certain embodiments, a graph may be deconstructed in a different manner from how it was initially constructed.
When annotating paths with ordering information, those edges connected to vertices having multiple edges may be annotated with an additional identifier indicating its order in terms of the number of available outgoing paths from that vertex. For example, after the backbone branch is first identified, the first directed edge not followed (which is then followed when identifying the second path) may be annotated with a “2”, indicating that it is the second available directed edge and specifying the second path that can be followed from that vertex. Similarly, the directed edge associated with the backbone branch may be annotated with a “1”.
Once an edge or set of edges have been identified 509 (i.e., the edge 219 representing “A” in the graph 201 of
Once the edge has been serialized and encoded, it may then be removed from the graph 201, e.g. as shown in
Linearizing graphs is useful because graphs can take up a large amount of computer memory, and thus it can be more efficient to load the graph or subsets of the graph only when needed. However, at the scales associated with genomic data, procedures associated with linearizing and de-linearizing graphs may also present a performance bottleneck. These issues can result in complexities during analysis, causing a non-negligible impact on genomic research. Systems and methods of the invention provide improvements in storing and representing graph-related data at genomic scales. While shown in
The present disclosure recognizes that a graph can be described as a plurality of edge insertions. Every inserted edge can be described by two parameters: a position of the edge within a graph, and the variation in genetic sequence that the edge represents. The position within the graph may comprise a start position and an end position. The variation in genetic sequence may comprise the length of the sequence or number of symbols associated with that edge, the genetic sequence itself, and/or an encoding of the type of variation. This information can be used to store graphs in a serialized format for ease of transport and instantiation. Further, this information can be efficiently represented to improve access and loading times, e.g., by using embodiments of encoding schemes as described below.
Graph Genome Coordinate EncodingThe LR8 and LFR8 encoding schemes described herein represent coordinate components as a variable number of bytes, wherein the representation itself includes information about its size. In particular, an encoding scheme for storing coordinate components is proposed in which a header octet indicates the number of additional bytes (“Length”) following the header octet that is used to accommodate the stored coordinate component (“Representation”). In this way, the encoding scheme is able to minimize the number of bytes required to store a coordinate component, thus achieving efficiencies in storage and speed while also not suffering from problems associated with pure variable length encoding schemes. Further, the header may also store other information related to the integer or position (e.g., whether the integer represents the last component in the coordinate) as additional binary information, such as a set of flags (“Flags”). The LFR8/LR8 encoding schemes can be used to efficiently encode any kind of data, including graph-related data, such as coordinates and paths.
Assigning a CoordinateMethods of the disclosure provide a coordinate system to identify locations or positions in a graph. Coordinates may comprise a series of integers (which may be referred to as “coordinate components”) describing a path through a graph that may be taken to arrive at a particular position. The positions at which serialized edges should be inserted into the graph can be defined by coordinates as described herein, for example.
The disclosed method encodes graphs and accompanying variation data, and is compatible with a variety of graph coordinate systems. For purposes of the present disclosure, a coordinate refers to an expression that defines a particular or relative position within the graph. Once a coordinate system has been established, pairs of coordinates can be used to define a trajectory specifying a path from a first position to a second position within the graph.
Often, reference graphs designate a backbone branch representing a particular linear reference sequence (e.g., the nucleotide sequence of chromosome 1 of the human genome). Additional branches diverging from and reconnecting to the backbone branch thus represent known variation from the linear reference sequence, and represent events such as single nucleotide polymorphisms (“SNPs”), small insertions and deletions (“indels”), and larger structural variants. As shown in
The reference graph 100 includes three additional branches specifying variations from the backbone branch 102. For example, a first branch 104 represents a nucleotide sequence “A”, a second branch 106 represents a nucleotide sequence “GACTG”, and a third branch 108 represents a nucleotide sequence “TACTG”. As shown, the first branch 104 and the second branch 106 are each connected to the backbone branch 102, and the third branch 108 is connected to the second branch 106. Similar to the backbone branch 102, a nucleotide sequence for each of the branches 104, 106, 108 is associated with an increasing offset value starting from zero.
Each branch describes a path or a particular set of edges and vertices within the reference graph 100 that may be followed through a series of increasing offset values. While the branches 102, 106, and 108 (and offsets) are shown at specific locations relative to the backbone branch 102, these or other branches could be specified in a variety of ways. For example, the backbone branch 102 could instead begin with “AAT” instead of “ACT”. In that case, the edge representing the nucleotide “A” in 104 would instead be reassigned to the backbone branch 102, with a corresponding reassignment of the offset values as well the edge within branch 102 that specifies “C” would instead be assigned an offset of “0” as a new alternate branch, whereas the edge specifying “A” would have offset “1” in the backbone branch). Accordingly, for a particular graph, designated branches and offset values may vary.
In practice, designating offsets and branches may be specified as a graph is created. The first reference sequence incorporated can be specified as the backbone branch, and its associated nucleotide sequence can be assigned increasing offset values. Subsequent sequences representing variations from the reference can be inserted as new branches connected to the backbone branch. These new branches will have new relative offset values starting from zero.
One exemplary coordinate system to identify particular positions on the graph specifies a sequence of (offset, branch) pairs. Pairs may be specified as a list of integers, each separated by a period “.”. Each integer in the coordinate may also be referred to as a coordinate component. The sequence of (offset, branch) pairs indicates a unique path that may be taken to reach a particular position within the graph. For example, starting from the source vertex 130 (i.e., the first vertex on the backbone branch 102), each (offset, branch) pair represents (i) the desired offset to be reached on the current branch, followed by (ii) an indication of whether an edge to an alternate branch (and having a new relative offset) is followed before that offset. Subsequent pairs in the coordinate are similarly followed. In this exemplary system, the first integer is zero-based (i.e., the first offset is “0”, the second offset is “1”, etc.), whereas the second integer begins from 1 (i.e., the first outgoing edge is “1”, the second outgoing edge is “2”, etc.).
For example, an “A” 110 in the first branch 104 may be identified by proceeding to the first offset (1), and moving to an alternate branch, i.e., by following a second outgoing edge 120 (2) from the vertex immediately before that offset. Therefore, the location of the “A” 104 in the reference graph 100 can be represented by the coordinate 1.2. While not necessary, the coordinate may also include that no further offset movements are required. Therefore, the “A” 110 can also be represented by the coordinate 1.2.0.
Similarly, if no divergent paths from the current branch are required to reach the desired position, the second integer in the coordinate can be omitted. In another example, a “T” nucleotide 112 on the backbone branch 102 may be specified by the coordinates “2,” which simply represents a traversal to the nucleotide having offset “2” in the backbone branch 102 from the source vertex 130.
More complicated coordinates can include multiple (offset, branch) pairs, indicating the traversal across multiple branches to reach a desired position. For example, a “C” nucleotide 114 in the third branch 108 can be reached by proceeding, from the source vertex 130 on the backbone branch 102, to the nucleotide having offset 3 (3), following a second outgoing edge 122 (2) before that offset to the second branch 106, proceeding to offset 3 (3) on the second branch 106, taking a second outgoing edge 124 (2) just before that offset to the third branch 108, and then traversing to offset two (2) on the third branch 108. As a result, the location of the “C” nucleotide 114 in the reference graph 100 can be represented by the coordinate 3.2.3.2.2.
In use, an initial offset may be much larger than the prior examples due to the large size of a reference genome. For example, a SNP on chromosome 1 at position 203,422,877 could be identified by the coordinates: “203422877.2”, representing a traversal at that position along a second outgoing edge (2) to an alternate branch representing the SNP.
In some examples, an offset may be larger than the number of nucleotides on a current branch. In these cases, the offset is considered to “fall off” the branch (e.g., along an outgoing edge 120 from the first branch 104, an outgoing edge 126 from the second branch 106, or an outgoing edge 128 from the third branch 108) and continue along the connected edges until the desired relative offset is reached. For example, the second “C” nucleotide 116 in the backbone branch 102 (at offset 5) can also be reached by coordinates “3.2.6”, representing a traversal to the vertex having offset 3 on the backbone branch 102, following the second outgoing edge 122 to the second branch 106 before that vertex, and then traversing an additional six offsets. However, after offset 4, the only available outgoing edge is edge 126, which “falls off,” or returns to the backbone branch 102. An additional two offsets are then traversed to arrive at the “C” nucleotide (having offset 5 on the backbone branch 102).
The coordinates essentially describe a path that may be taken to identify a particular position in the reference graph 100 from the source vertex 130. As graphs are multi-dimensional structures, a single position may be represented (or reached) by multiple different coordinates. Each unique coordinate describes a unique path to the selected position within the reference graph 100. For example, the “T” nucleotide 112 in the backbone branch 102 may also be represented by, the coordinate “1.2.1.” The coordinate 1.2.1 represents a path that: traverses to the nucleotide having offset 1 (i.e., the first “C” nucleotide in the backbone branch 102), follows the second outgoing edge 120 before that offset to the first branch 104, and traverses an additional two offsets to the “T” nucleotide 112. Similar to the previous example, the offset specified by the coordinate component is relative because the first branch 104 ends and returns to the backbone branch 102. Similarly, the “C” nucleotide 114 in the third branch 108 may be represented by the coordinates “3.2.3.2.2” and also “1.2.2.2.3.2.2”.
When a single coordinate is used to represent a single point or offset within the graph, they essentially describe a path that may be taken, starting with the leftmost or “source” vertex 130, to identify a particular position in the reference graph 100. However, pairs of coordinates may be specified indicating a start position and an end position. The trajectory between the start position and end position defines a path that may be taken through the reference graph 100. By specifying a pair of coordinates, a path may be specified that starts from a coordinate other than the source vertex 130.
Briefly, next-generation sequencing reads may be aligned against a graph genome (such as the reference graph 100) using a graph-aware alignment algorithm or another algorithm that has been modified to align to a graph. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. The start position of each sequence read may be located at any position within the graph genome. Further, alignments that describe variation (such as SNPs, short insertions or deletions, or larger structural variations) may have paths that span multiple branches. Paths and alignments can be represented in the reference graph 100 by defining a trajectory or path between a first coordinate and a second coordinate. The first coordinate can represent the position at which the 5′ end of the sequence read aligns with the reference sequence, and the second coordinate can represent the position at which the 3′ end aligns. Further, the path defined by the second coordinate can be relative to the first coordinate, and describe the path taken from the first coordinate over which the sequence read aligns.
For example, to describe the path 200 representing the position of the nucleotide sequence “AAT” (as generally indicated by edges 202, 120, 204), the path 200 begins at the source vertex 130 (e.g., offset 0), proceeds to the “C” nucleotide at offset 1, follows the second outgoing edge 120 before that offset, and then traverses two offsets to the “T” nucleotide 112. This path can be represented by the starting coordinate “0” and the ending coordinate “1.2.1”. The coordinates may be combined by a delimiter (e.g., “..”) to define a path “0..1.21.”
In another example, to describe the path 220 representing the position of the nucleotide sequence “TGACTGA” (as generally indicated by edges 204, 122, 222, 224, 226, 228), the path can be specified by the pair of coordinates “2..3.2.6.” In this case, the path begins at offset two on the backbone branch 102 at the “T” nucleotide 112 (2), and then proceeds (“..”) to offset three (3), follows the second outgoing edge 122 to the third branch 106 before the offset (2), and then proceeds seven offsets to the “C” nucleotide 116 (6). Note that proceeding seven offsets returns the path to the backbone branch 102.
When defining paths, the second coordinate may be specified relative to the branch reached by first coordinate. For example, the path 240 representing the position of the nucleotide sequence “GACTAC” begins at coordinates “3.2,” which begins the path at offset 0 in the second branch 106 (“G”). The second coordinate is “3.2.2”, meaning proceed to the third offset (3) on the current branch (i.e., the second branch 106), follow an outgoing directed edge (i.e., the edge 124) to an alternate branch (i.e., the third branch 108) before the third offset, and then proceed to offset two (2) on the alternate branch.
Specifying the second coordinate relative to the first coordinate has an advantage in that fewer bytes are required for storing the coordinates. However, in certain embodiments, both the first and second coordinates may be specified starting from a source vertex.
As noted above, the second coordinate defines a particular path to be taken from the first coordinate, which can be used to describe the path associated with a sequence read aligned to the reference graph 100. Therefore, the second coordinate may only be described by a single unique sequence of integers, and not by multiple different choices, as these would define different paths. However, the first coordinate may be specified in any manner. For example, the path 240 “GACTAC” may be specified by either “3.2..3.2.2” or “1.2.2.2..3.2.2.” In some cases, the coordinate requiring fewer or smaller integers is selected so as to maximize efficiency.
Reverse coordinates and paths (i.e., coordinates and paths that follow incoming as opposed to outgoing edges) may also be specified. As the second coordinate is relative to the branch of the first coordinate, when on the base branch, a reverse path may simply be specified by switching the first and second coordinates (e.g., “7..5”). In cases where the reverse path takes a divergent path along a vertex, negative numbers may be used. With coordinates specifying forward paths (as detailed above with reference to
A second reverse path 280 is specified by the pair of coordinates “2..1.−2.−1”. This describes a path having a start position at coordinate “2” (i.e., the “T” nucleotide 112 on the backbone branch 102) and proceeds “..” to offset 1 on the same branch. The second path 280 identifies the vertex connected in the direction of the edge at offset 1, and then follows the second incoming edge to an alternate branch (“−2”) (“A”). The second path 280 then traverses one additional offset against the directionality of the edge to arrive at offset 0 in the backbone branch 102 (“A”). Taken together, the pair of coordinates “2..1.−1.−1” thus represents the position of the nucleotide sequence “TAA”.
A third reverse path 300 is specified by the pair of coordinates “3.2.3.2.2..−2”. This describes a path having a start position at coordinate “3.2.3.2.2” (i.e., the “C” nucleotide in the third branch 108). The third reverse path 300 then proceeds (“..”) to offset −2, which is equivalent to offset 1 on the second branch 106. This represents the position of the nucleotide sequence “CATCA”.
Reverse paths may be used, for example, to indicate alignments that are the reverse complement of a forward path. In these cases, the 3′ end of a sequence read may have a mapped position before the mapped position of the 5′ end of the read.
Encoding the CoordinateConventional variable length encoding maps source symbols to a variable number of bits. For example, a conventional variable length code would store coordinate components using only those bits required for each component. However, this can be troublesome because a decoder will then require prior knowledge of the number of bits required for each component, which introduces additional overhead costs. Moreover, conventional variable length coding will result in a non-byte aligned format. However, most operating systems expect binary data to be represented in bytes, i.e., as 8-bit octets. Therefore, deviating from this norm can lead to problems and confusion during the development of software that uses a graph coordinate system.
In an embodiment, the data is divided into two parts: header values and a data representation. The header provides information about the length of the data representation and allows the data representation to be interpreted accurately. A computer system may be programmed with instructions that when executed by a processor cause the system to interpret the header and use that information to decode the data. In certain embodiments, the header is encoded according to a scheme that depicts a length of what is encoded, certain useful flags, and the encoded representation, all using 8-bit octets (where the encoded representation here will be a length of the data section, such as the number of octets following the header octet). Thus, methods and systems of the invention implement a length-flag-representation 8-bit scheme.
As shown in
Once encoded, a decoding agent or algorithm would receive the first octet of a coordinate component and then determine, based on information within the first octet, the total size of the representation.
Modifying the Encoding SchemesIn the encoding schemes discussed above (e.g., LFR8 and LR8 formats), the total number of length bits and flag bits does not exceed eight so that the header information remains within a first octet. If some bits are unused, these bits can be allocated to any position within the encoded representation. In the examples described above, the value representation is shifted to the rightmost position of the octets such that the last bit in the last octet is the first bit of the encoded integer. However, these value representations can be shifted in a different position or permuted in a different manner.
Any unused bits (i.e., to the left of the value and after the length and flag bits) may be padded as zeros. However, the LFR8 and LR8 formats could be modified to use additional header octets to store additional flags as needed. Similarly, flags could also be stored in the additional representation octets (e.g., LFR8<3,6> would require at least one flag bit in an additional octet).
As previously described, an LFR8 format is compatible if the particular application requires storage of additional flags. In these cases, encoding an integer may require an additional octet because there will be additional flag bits in the header. For example, encoding the integer “4,954” requires only two or three octets, depending on whether LR8 or LFR8 format is used. For example, a position 201,348,366 on human chromosome 1, represented in binary form, becomes: 1100000000000101010100001110. The binary form of this integer requires at least 28 bits, and thus requires at least four octets in order to be byte-aligned (32 bits), of which half of one octet is unused. When stored using LFR8 format, the header octet that includes the value representation begins with “011”, indicating that three additional octets follow the header octet.
The encoded representation, including the length (L), padded bits (P), and binary value (V) thus becomes:
If flags are desired, the encoding scheme can also accommodate flags (F), for example, following the length bits. In this case, if we would like to include four flags in the header octet, we would require five octets, as the binary value of the integer is then shifted to the right, as shown in the example below:
In the exemplary coordinate system described above, the first coordinate component (here, an integer) in a coordinate will typically be much larger than subsequent coordinate components. This is because the subsequent components typically describe relative offsets or branch numbers. (The use of relative offsets or branch numbers helps to reduce the size of the encoded representation.) Thus, the first component in a coordinate may be quite large and require a large number of bits, but subsequent components will be small and require a minimal number of bits. For example, if a coordinate or path describes traversing to offset 201,348,366 on the main branch, taking the second branch from that offset, and then proceeding to offset 154, the coordinate component “154” could be encoded as follows:
The number “2” (for the second branch) is small enough to fit entirely within the leftover value representation bits of a header octet, and above it has been shown that the number 201,348,366 can be encoded using four octets. Thus, using an LR8 encoding scheme, the entire coordinate (201,348,366.154) is represented using only 7 octets.
Padding (e.g., “0” bits) can be used when there are unused bits in the representation. Of course, these bits can be used to store additional binary information (e.g., up to 5 binary symbols or flags if three representation bits are used). For example, if we need to store 3 flags, we can convert the LR8<3> schema to LFR8<3,3>, e.g.:
This particular encoding scheme (i.e., one in which three length bits are present in the header) allows for up to seven additional octets to be used. However, if the header octet is entirely occupied by length bits and flag bits (that is, there are no bits left over for value representation), then, in certain embodiments, the encoding scheme imposes a special rule: add at least one extra octet, and the numbers encoded in the length bits hold the number of additional value representation octets to append above and beyond this. So, for example, in an LFR8<3,5> encoding scheme, the representation would include one header octet, and the number of additional value representation octets would range from one to eight, for a total code size (including the header) in the range of two to nine bytes. Accordingly, eight additional representation octets are available in addition to the header octet, allowing for the representation of a value having 264 positions—a value far in excess of the number of nucleotides in any presently known genome.
One reason to incorporate this special rule is so that when we encode the number “0”, it does not end up being represented as a “null” value (i.e., just a header octet, with no assigned value representation bits). The reason this is useful is because if the value is changed later on, e.g., from “0” to “1”, then the representation does not need to be resized in order to get the number “1” to fit. For example, imagine several million LFR8/LR8 representations, all assigned to a single contiguous block of memory. If one representation is resized by adding an extra byte, then every representation to the right would presumably have to be moved over by one byte in order to accommodate the change, which can be extremely inefficient.
When encoding, an encoder first determines, based on a coordinate component or value to be encoded, the number of octets required for representation in a particular LFR8 or LR8 format. Exemplary pseudocode is provided below which gives one example.
As shown in the above pseudocode, first the encoder receives a value (such as an integer or coordinate component) to encode. (In this example, if the value is negative, the encoder outputs an error.) The encoder then determines the number of available representation bits in the header by subtracting the number of length bits and the number of flag bits from the size of the header (i.e., 8 bits). For example, in an LFR8<3,2> representation, the number of available representation bits would be 8−3−2=3 bits. The encoder then calculates a “right shifted” value, which represents the number that would need to be encoded by the remaining binary digits once the least significant binary digits are stored in the (e.g., 3 bits of) the header. If the “right shifted” value is not zero, the encoder determines the number of representation octets required by calculating the logarithm of the right shifted value with base 256, and then identifying the ceiling the smallest integer value greater than or equal to this value) of the result to yield the number of additional octets required for the representation. The resulting code size is thus 1 plus this number (i.e., the header octet plus the additional octets required for the representation). However, if the “right shifted” value is zero, the encoder considers whether there are any available representation bits in the header. If not (meaning that the value itself is zero), then the encoder provides for one additional representation octet. (This is to ensure that if the value itself later changes, additional octets do not need to be added; see the description of the “special rule” above.) However, if it is the case that there are available representation bits in the header, the entire representation is fit into a single octet.
Note in the table of
Graph coordinates and paths as described above may be efficiently encoded and stored using combinations of LFR8 and LR8 encoding schemes. As previously noted, graph paths and coordinates according to the disclosure comprise a set of integers describing a path (whether relative or absolute) to arrive at a particular position within the graph. For example, a path having coordinate components X1, X2, . . . Xk could be represented as a sequence of LFR8 encoded integers as shown below in Equation (1).
PathRep(X1,X2, . . . Xk)=LFR83,2(|X1|,[ε1,γ1]),LFR83,2(|X2|,[ε2,γ2]), . . . LFR83,2(|Xk|,[εk,γk]) (1)
In Equation (1), ε is a single flag bit representing whether the encoded integer is the last integer in the coordinate, and, γ is a single flag bit representing the sign (whether positive or negative) of the stored integer. If only forward paths (i.e., no negative integers) need to be considered, then the path representation can be simplified as shown in Equation 2.
ForwardPathRep(X1,X2, . . . Xk)=LFR83,1(X1,ε1),LFR83,1(X2,ε2), . . . LFR83,1(Xk,εk) (2)
In contrast to Equation (1), Equation (2) requires only a single flag bit E, which indicates whether the stored integer is the last component in the coordinate, and represents the end of the path followed to reach that coordinate. (Delineating the end of the path in this way allows many paths to be stored together in a single contiguous block of memory without ambiguity as to where one path ends and the next one begins.) Accordingly, a decoder reading a sequence of LFR8 encoded integers would understand that a particular integer is the last integer in a coordinate by reading the value for the flag bit E.
As an example, the coordinates 1056.2.5 (representing a traversal to offset 1056 along the base branch, moving to a new branch having identifier “2”, and traversing to offset 5) could be encoded as follows using the forward path representation in Equation (2):
ForwardPathRep(1056.2.5)=LFR83,1(1056,0),LFR83,1(2,0),LFR83,1(5,1).
This would encode the coordinates 1056.2.5 as:
A decoder configured to decode format, when encountering the first octet above, would read the first three length bits of the first octet and understand there to be one additional representation octet. The entire representation may then be decoded to yield the component “1056”. The decoder would then read the next header octet, which indicates no additional representation octets, and is decoded to yield the component “2”. The next header octet indicates that the encoded integer is the last in the series of integers for the coordinate, and can be decoded to “5”. Thus, the decoded representation yields the coordinates “1056.2.5”. In this way, the decoder does not require any prior knowledge of the size of the representation of each component, but determines the size by analyzing the header octet. Further, additional octets are only used if required, leading to efficient encoding of the component.
Efficient Encoding of Edge InsertionsSystems and methods herein may be used to provide a specification that describes an edge as an insertion to a genomic graph. The specification makes use of coordinates that define the insertion and that specification is compactly encoded using the described LFR8/LR8 variable length encoding, allowing for fast storage, querying, and instantiation times. Since an edge is described as an insertion to a graph using a specification that is compactly encoded, the format of the string of bits describing an edge may be referred to as a compressed insert specification, or CIS format.
Before the variable length encoding, an edge is described as an insertion to a graph using a specification includes at least a position on the graph and information that defines the variation in genetic sequence represented by the edge. In preferred embodiments, an edge can be represented by a tuple that includes the 5′ starting position, 3′ ending position, and the length of the associated genetic sequence:
X1,X2, . . . Xm,Y1,Y2, . . . Yn,L),
Wherein each Xi and Yi are the coordinate components of the 5′ and 3′ coordinates of the inserted edge, respectively. In the case of a SNP, L can be 1, for storing the single nucleotide associated with the SNP. As will be described in detail below, each edge can be efficiently and compactly encoded using LFR8/LR8 encoding, allowing for fast storage, querying, and instantiation times.
Single Coordinate InsertionsIn certain cases, the end coordinate of an inserted edge can be inferred from the start coordinate and the type of variation represented by the edge. In these cases, the end coordinate can be omitted from the representation, leading to a more efficient representation of the edge. In particular, edges that affect only a single position in the reference genome (one of the most commonly observed types of genomic variation between of individuals within a population) allow for the 3′ end to be omitted from the edge representation. For these kinds of variations, the edge can be represented in CIS format in the following manner:
InsertionSpecificationRep(X1,X2, . . . L)=LFR83,5(X1,[ε1,
wherein X represents each coordinate component in the stored coordinate, L represents the length of the associated nucleotide sequence, ε1 is a flag that indicates whether the position is the last integer in the coordinate (i.e., it represents the end of the path followed to identify the position in the graph), and
As shown in
In this way, only a single coordinate needs to be associated with the edge, and information regarding whether the edge reconnects with the vertex after or prior to the specified edge can be encoded in the vector
Other variations may be more complex, and have edges with ending coordinates that differ from their starting coordinates. In these cases, both the starting and ending coordinates must be specified in the representation. However, often edges will have starting (5′) and ending (3′) coordinates that share several coordinate components. For example, a structural variant representing a CT→GAC mutation (e.g., as shown in the table in
X1,X2, . . . Xk,Xk+1, . . . Xm,Y1,Y2, . . . Yk,Yk+1, . . . Yn,L),
where X1=Y1, . . . Xk=Yk is a common path prefix (“10.2”), Xk+1, . . . Xm is a unique path suffix for the 5′ end (“5”), Yk+1, . . . Yn is a unique path suffix for the 3′ end (“7”), and L represents the number of symbols, such as nucleotides, that are associated with the edge. In this case, L is 1, for storing the nucleotide G. In CIS format, this representation can be specified as follows:
As shown above, the first component of the insertion specification representation is the common path prefix representation, followed by the representation of the unique suffixes for the start (5′) and end (3′) coordinates, and finally the representation of the number of symbols or nucleotides in the insertion. The common path between the start and end coordinates can be represented by:
CommonPathPrefixRep((X1, . . . ,Xk)=LFR83,5(X1,[ε1,ϑ1,
wherein ε is a flag that represents whether the position is the last integer in the coordinate (“end of path”), and the remaining four flag bits [ϑ1,
The second component of the insertion specification representation is the unique suffix representation of the starting coordinate. Occasionally this component may not be necessary. For example, an insertion starting at position “10.2” and ending at position “10.2.5” will have no unique suffix for the start coordinate. If the start coordinate does have a unique suffix, this can simply be represented by the forward path representation in Equation (2) above:
End5′PathSuffixRep(λk+1, . . . Xm)=ForwardPathRep(Xk+1, . . . Xm) (6)
The third component is the suffix of the ending (i.e., 3′) coordinate. To further minimize the storage of integers, the 3′ coordinate may be encoded differently, i.e., by having a suffix that is relative to the 5′ coordinate. In other words, the coordinate component of the 3′ coordinate that is aligned with the 5′ coordinate is subtracted from the coordinate component of the 5′ coordinate. This representation can be:
End3′RelativePathSuffixRep(Xk+1,Yk+1, . . . Yn)=LFR83,1(|Yk+1−Xk+1|,εk+1),LFR83,1(Yk+2,εk+2), . . . LFR83,1(Yn,εn) (7)
Finally, the number of nucleotides in the new branch can be encoded by LR83(L). It should be noted that in the case of variations representing deletions or single bases, the LR83(L) parameter may be omitted because it can be inferred from the encoded variation type. For example, a decoder reading the representation could be configured to understand that the length parameter is omitted due to the associated vector; e.g., for the first 3 cases shown in the table of
It should also be noted that while
Once a graph has been linearized, it is often advantageous to load or deserialize only, portions or subsets of the graph into memory for use, e.g., by loading only those CIS representations that are needed. In practice this presents some issues. A CIS-encoded graph is typically presented as a serialized stream. CIS efficiently encodes each insertion into a minimal number of bits, but this results in a variable length format, because each insertion will be represented by a variable number of bits. Thus, a decoder is not able to predict the location of a particular offset in an encoded stream simply based on its magnitude. To identify particular insertion representations for deserialization, a decoder must start at the first insertion in the encoded stream and continue until the desired insert is identified. Thus, if a graph has N insertions, it requires on average O(N/2) access time to locate the desired encoded insert for instantiation. When applied to genomic data scales, this access time may be too slow. In these cases, randomly accessing serialized portions of the graph for deserialization can be aided by use of a random-access entry point map.
Typically, the value chosen for K will be much smaller than N, as otherwise the random access entry point map itself may begin to take up a significant amount of memory. K can be adjusted to a variety of settings. In practice, suitable values for K are those in which O(N/K) is between 100 and 1,000. This presents a good compromise in random-access entry point map size and CIS lookup speed. For example, if an encoded CIS stream has 106 insertions, a random-access entry point map having 1000 entries will result in a distance of 1000 insertions between each entry. A decoder thus would only need to traverse, on average, about 500 entries before locating a desired insertion.
While this disclosure refers to a random-access entry point map, various other data structures could be used to more efficiently locate a particular (IS insert in an encoded stream. For example, a tree-based structure having multiple levels of random access could be used. The tree could be divided into several levels in which each branch is traversed until the desired level indicating the location in memory of a particular CIS insert is reached.
One application of the random-access entry point map is to generate specialized graphs from a larger graph specification. For example, one may wish to build a graph that includes only variations from certain databases, such as dbSNP. Alternately, one may wish to build a graph appropriate for particular genetic populations. In these cases, one could leverage the random-access entry point map to quickly identify and deserialize only those desired insertions into the graph. In such examples, the backbone branch would be loaded first, followed by the selected insertions. Note also that the edges should be inserted in an order in which they do not depend on other edges, which may not have been inserted. Care should be taken during graph construction in this regard; for example, preferably dbSNP insertions should not depend on (i.e., branch from) insertions from other variation databases, and similarly insertions representative of one genetic population should not depend on those from another. However, various embodiments are considered to be within the scope of the disclosure.
It should also be noted that the random-access entry point map discloses the location in memory of CIS insertions, which may not correspond to actual genome position. For example, the map may be able to inform a decoder as to the location of the 100,000th insertion, but is not necessarily useful in finding insertions located near a particular position (e.g., at the middle of chromosome 17). In these cases, a random-access entry point map may also contain additional information specifying the relative genome position of certain insertions. Alternately, a second table or map storing this information may be used.
Integration with Existing Representations of Variant Data
As noted above, a graph genome can be constructed from a reference sequence by, continuously adding new edges representing variants. Each edge can be defined in Compressed Insertion Specification (CIS) format. Converting existing variant information to CIS format can be relatively straightforward. For example, Variant Call Format (VCF) is an existing file format that describes variations from a reference sequence as an alternate base at a particular position on a reference genome. A VCF->CIS conversion program could iterate through each entry of a VCF file and create a new CIS entry for that variant describing an insertion on a graph. In this way, existing VCF data and data represented in other formats can be converted directly to CIS format.
Advantages of CISThe proposed specification efficiently stores graphs as a plurality of efficiently encoded edges. In particular, the proposed specification minimizes the number of edges that need to be stored by constructing a graph by adding edges individually, thus segmenting previously inserted edges into multiple edges. Further, when combined with the LFR8/LR8 encoding scheme, the specification requires only a minimal amount of storage space, leading to greatly compressed structures. For example, a single SNP in the human genome (the most common type of variant) requires only 40 bits to store. Further, use of a random-access entry point map or other indexing structure allows for fast random data access by allowing for particular inserts or components of graphs to be quickly identified and loaded into memory when needed.
Representative SystemProcessor refers to any device or system of devices that performs processing operations. A processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU). A processor may be provided by a chip from Intel or AMD, A processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
The memory subsystem 475 contains one or any combination of memory devices. A memory device is a mechanical device that stores data or instructions in a machine-readable format. Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein. Preferably, each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.
Using the described components, the system 401 is operable to, cause the system to assign a coordinate that defines a path from a first position to a second position within the genomic graph, and encode the coordinate using a variable length encoding scheme where the encoded coordinate includes length information. The system may receive or obtain a genomic graph representing genomic sequences from one or more organism through the use of one or more input/output device, labelled “I/O” in
In certain embodiments, the reference graph 100 is stored in the memory subsystem 475 using adjacency lists or index-free adjacency, which may include pointers to identify a physical location in the memory subsystem 475 where each vertex is stored. In a preferred embodiment, the graph is stored in the memory subsystem 475 using adjacency lists. In some embodiments, there is an adjacency list for each vertex. For discussion of implementations see ‘Chapter 4, Graphs’ at pages 515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson Education, Inc., Upper Saddle River N.J., 955 pages, the contents of which are incorporated by reference and within which pages 524-527 illustrate adjacency lists.
Sequencing TechnologiesIn use, sequence reads are aligned to edges or paths through a reference graph, and then genotyping or variant calling can be performed by counting the number of reads across each path or by identifying mismatches across each path. Because the sequence reads are aligned to the reference graph which includes known variation, the subsequent step of identifying mutations by, comparing to a table of known mutations can be eliminated. Further, alignment to reference graphs results in greatly improved results for larger structural variants, as well as indels and SNPs located near structural variants. Additionally, in some cases, there may be no need to perform an additional variant calling step because an ideal reference graph will not have any mismatches from the sequence read (of course, variant calling can still be performed to detect unknown variations).
Sequence reads may be obtained in a variety of ways.
In certain embodiments, sequence reads are obtained by performing sequencing on the sample 2202 from a subject (however in some embodiments, sequence reads are obtained when a read file is transferred into a system of the invention). Sequencing may be by any method and sequencing instrument known in the art. See, generally, Quail, et al., 2012, a tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences, and Illumina MiSeq sequencers, BMC Genomics 13:341. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing.
A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems and sequencing instruments sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, each incorporated by reference. 454 sequencing involves two steps. First, DNA is sheared into blunt-end fragments attached to capture beads and then amplified in droplets. Second, pyrosequencing is performed on each DNA fragment in parallel. Addition of nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument.
Another sequencing technique and instrument that can be used is SOLID technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to generate a fragment library. Clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and enriched and the sequence is determined by a process that includes sequential hybridization and ligation of fluorescently labeled oligonucleotides.
Another example of a DNA sequencing technique and instrument that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pubs. 2009/0026082, 2009/0127589, 2010/0035252, 2010/0137143, 2010/0188073, 2010/0197507, 2010/0282617, 2010/0300559, 2010/0300895, 2010/0301398, and 2010/0304982, each incorporated by reference. DNA is fragmented and given amplification and sequencing adapter oligos. The fragments can be attached to a surface. The addition of one or more nucleotides releases a proton (H+), which signal is detected and recorded in a sequencing instrument.
Another example of a sequencing technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented and attached to the surface of flow cell channels. Four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. Sequencing according to this technology is described in U.S. Pub. 2011/0009278, U.S. Pub. 2007/0114362, U.S. Pub. 2006/0024681, U.S. Pub, 2006/0292611, U.S. Pat. No. 7,960,120, U.S. Pat. No. 7,835,871, U.S. Pat. No. 7,232,656, U.S. Pat. No. 7,598,035, U.S. Pat. No. 6,306,597, U.S. Pat. No. 6,210,891, U.S. Pat. No. 6,828,100, U.S. Pat. No. 6,833,246, and U.S. Pat. No. 6,911,345, each incorporated by reference.
Other examples of a sequencing technology that can be used include the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.) and nanopore sequencing as described in Soni and Meller, 2007 Clin Chem 53:1996-2001.
As shown in
In some embodiments, the sequence reads 904 are assembled to provide a contig or consensus sequence, which contig or consensus sequence may be used in finding alignments to a reference graph. Sequence assembly may include any suitable methods known in the art including de novo assembly, reference-guided assembly, others, or combinations thereof. In a preferred embodiment, sequence reads are assembled using graph-based alignment methods. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. Embodiments of a graph and its use are discussed in greater detail below. The result of assembly is a sequence representing the corresponding portion of the original nucleic acid. The contig or consensus sequence or one or more of individual sequence reads 2204 may be mapped to a reference graph to find an alignment with an optimal score.
In certain embodiments, each sequence read 2204 is mapped to a reference graph and an alignment is found. As previously noted, alignments of sequence reads to a reference graph may be used as evidence to identify the likelihood of the possible diploid genotypes from an interval in a reference graph.
Exemplary Encoded and Decoded CoordinatesGenomic graphs, such as the reference graphs that may find use in sequence assembly, may be serialized, e.g. by the method 101 of
References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.
EQUIVALENTSVarious modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification, and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.
Claims
1. A method of encoding a genomic graph representing variations in genetic sequence, the method comprising:
- providing a genomic graph comprising nodes connected by edges, wherein the nodes and edges are stored as objects in non-transitory memory of a computer system and wherein an edge within the genomic graph represents a genomic sequence;
- selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph;
- encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the branch on the genomic graph.
2. The method of claim 1, wherein selecting at least one edge comprises selecting edges having an associated identifier.
3. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises traversing the genomic graph and identifying paths beginning with nodes having multiple outgoing edges, ending at nodes having multiple incoming edges, and having at least one directed edge specifying a linear path between.
4. The method of claim 3, wherein selecting at least one edge from the genomic graph comprises selecting the at least one directed edge specifying a linear path between a node having multiple outgoing edges and a node having multiple incoming edges.
5. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises determining an ordering of the edges in the genomic graph.
6. The method of claim 5, wherein determining an ordering comprises identifying a longest path in the graph from a source vertex to a sink vertex.
7. The method of claim 1, wherein the position of the at least one edge on the genomic graph further comprises a starting position and an ending position; and further wherein encoding the branch as a string of bits comprises encoding at least a portion of the starting position and a least a portion of the ending position of the at least one edge on the genomic graph.
8. The method of claim 1, wherein the starting position and the ending position share a common prefix, and encoding the branch as a string of bits further comprises including third component that includes the common prefix.
9. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the length of the associated genomic sequence.
10. The method of claim 1, wherein encoding the at least one edge as a string of bits further comprises encoding the associated genomic sequence.
11. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the type of variation represented by the genomic sequence.
12. The method of claim 11, wherein the type of variation is encoded as a vector.
13. The method of claim 11, wherein the type of variation is a deletion of a single nucleotide, a single nucleotide polymorphism, a single nucleotide insertion, a multiple nucleotide insertion, or a replacement of one nucleotide by multiple nucleotides.
14. The method of claim 13, wherein encoding the at least one edge as a string of bits comprises encoding only a start position of the at least one edge on the genomic graph.
15. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the associated genomic sequence.
16. The method of claim 1, further comprising removing the at least one edge from the graph.
17. The method of claim 16, further comprising selecting and encoding at least one second edge from the genomic graph.
18. The method of claim 1, wherein the first component includes length bits indicating a length of the first component and representation bits that include the portion of the starting coordinate.
19. The method of claim 18, wherein each component is stored as a sub-string of the string of bits wherein a first portion of the sub-string represents a length of the sub-string and a second portion of the sub-string includes stored information.
20. The method of claim 19, further comprising performing the selection and encoding steps to convert the genomic graph into a linear string of bits comprising a representation of each of the plurality of edges.
21. The method of claim 1, further comprising storing a plurality of edges from the genomic graph, each as its own string of bits.
22. The method of claim 1, further comprising storing all of the edges from the genomic graph so that the entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing the position.
23. A method of decoding a genomic graph from a serialized specification, the method comprising:
- receiving a representation of a genomic sequence;
- associating the genetic sequence with a first edge of a genomic graph;
- receiving a representation of a variation of the genetic sequence, the representation further comprising a position on the genomic graph of the variation;
- associating the representation of the variation with a second edge; and
- inserting the second edge into the genomic graph at the position.
24. The method of claim 23, further comprising segmenting the first edge at the position of the variation.
Type: Application
Filed: May 18, 2017
Publication Date: Dec 27, 2018
Applicant: Seven Bridges Genomics Inc. (Cambridge, MA)
Inventor: Vladimir Semenyuk (Monterey, CA)
Application Number: 15/598,404