SYSTEMS AND METHODS FOR ENCODING GENOMIC GRAPH INFORMATION

Info

Publication number: 20180373839
Type: Application
Filed: May 18, 2017
Publication Date: Dec 27, 2018
Applicant: Seven Bridges Genomics Inc. (Cambridge, MA)
Inventor: Vladimir Semenyuk (Monterey, CA)
Application Number: 15/598,404

Abstract

The invention provides systems and methods for encoding specific portions of a genomic graph in a way that is useful for copying or moving those portions from one location to another. The genomic graph can comprise nodes connected by edges, wherein the nodes and edges are stored as edge objects in non-transitory memory of a computer system and wherein an edge within the graph represents a genomic sequence. In on embodiment, the a method of encoding a genomic graph comprises selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph. The method can further comprise encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the edge on the genomic graph.

Description

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from U.S. Provisional Patent Application Ser. No. 62/338,706, filed on May 19, 2016, entitled “SYSTEMS AND METHODS FOR ENCODING GENOMIC INFORMATION”, U.S. Provisional Patent Application Ser. No. 62/400,388, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR SEQUENCE ENCODING”, U.S. Provisional Patent Application Ser. No. 62/400,411, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR METADATA STORAGE AND COMPRESSION”, and U.S. Provisional Patent Application Ser. No. 62/400,123, filed on Sep. 27, 2016, entitled “SYSTEMS AND METHODS FOR GENOMIC GRAPH INSERT SPECIFICATION”, the disclosures of which are hereby incorporated by reference.

TECHNICAL FIELD

The disclosure relates to biology, genomics, and medical genetics. In particular, the disclosure relates to methods of encoding genomic information represented by graphs.

BACKGROUND

Studying genomes can potentially reveal much about the health and history of organisms and entire species. Researchers can identify genes associated with cancer and other diseases through genomic studies, and genomics also plays an important role in forensics, genealogy, and agriculture, among other fields. A frequent approach to genomic studies includes sequencing DNA from a sample and comparing the resulting sequence reads to each other or to a known reference to characterize the genomes of organisms represented in the sample. Reference genomes often include whole genomes that have been sequenced and published, often available online. Variation among genomes has been represented by structuring genomic sequences as graphs.

Unfortunately, storing and analyzing genomic information has become impractical or impossible using conventional techniques. For example, as more genomes continue to be sequenced, the time required to store, query, and manage the new resulting information also increases. Many conventional compression techniques developed to transmit genomic information require custom or proprietary decoders, which further limits broad access. Representing genomes and genomic variations as graphs presents further complexities. Therefore, as the number of sequenced genomes and catalogued variations increases, analyzing and comparing the information quickly becomes intractable because of the required processing time or storage of such large data sets, especially when computational resources are limited.

Further, existing approaches to structuring genomic sequences into graphs restrict the usefulness of available genomic data. Due to the fact that graphs are inherently multi-dimensional, consisting of a web-like network of inter-related pieces in computer memory, it is difficult to copy parts of the graph such as specific genetic sequences from one location to another, because the multi-dimensional nature of graph topology does not lend itself easily to expression in a flattened, serialized form. Additionally, as graphs start to represent genetic variation among populations of individuals, the size of the graph becomes quite large, further hindering the ability to copy or transfer the graph from one location to another.

SUMMARY

The present disclosure describes various systems and methods for encoding genomic information. In some embodiments, the present disclosure describes encoding specific portions of a genomic graph in a way that is useful for copying or moving those portions from one location to another via a file or data stream. The genomic graph has nodes connected by edges such that paths through the graph represent genetic sequences. A specific genetic sequence is represented using coordinates that identify a starting position and an end position of the sequence within the graph. The coordinates are encoded into strings, or sequences, of bits that represent the data efficiently, and compactly, resulting in easy and rapid transfer between locations in computer memory.

Each encoded set of coordinates is thus a specification describing the removal, or insertion, of an edge in a genomic graph as that graph is being deconstructed from, or built up within, respectively, a location in computer memory. Thus an entire graph, or substantial portions of a graph, can be moved by iteratively specifying the removal of an edge or set of edges from the graph in one location and building up the graph, or portion thereof, in a new location by iteratively inserting a single edge representing the removed edge or set of edges according to the encoded specifications. The specification for each edge—using coordinates to describe the removal or insertion of that edge for a respective genome graph copy or re-build operation—may be compactly, encoded using a variable length encoding scheme into a string of bits. Since all of the edges of the graph or portion thereof are encoded iteratively, the entire graph or portion thereof is encoded as a string of bits. The string of bits is linear—or serial—in nature, and thus the genomic graph is linearized, or serialized. The serialized data structure is conducive to rapid transfer and compact storage in computer systems.

The invention includes random access methods useful to load or deserialize only portions of a serialized graph back into memory as node and edge objects. The fully-encoded graph exists as a serialized stream of bits with a variable-length format because each insertion is represented by a variable number of bits. The fully-encoded graph is provided with a random-access entry point map that aids in randomly accessing serialized portions of the graph for deserialization. The entry point map provides an indexing structure that indicates location of particular offsets of encoded edge insertion representations for a serialized graph. To identify the location of a particular insertion, a decoder consults the entry point map for the nearest located insertion. The decoder then continues decoding from that insertion until it identifies the desired insertion. Use of the entry point map greatly decreases the average complexity of random access.

In the encoding of the insert specifications, each edge may be described by coordinates that include at least a starting coordinate that defines a starting position of the edge within the genomic graph and may also include information that identifies an end position of the edge within the genomic graph. One insight of the invention is that some edges may be described by a starting coordinate and an end coordinate that share a common prefix but that each have a unique suffix. Such edges may be described by a specification that includes a component for the common prefix, a component for the unique suffix of the starting coordinate, and a component for the unique suffix of the end coordinate. Each component may be independently encoded using a variable-length encoding scheme.

Because a genomic graph, or portions of a genomic graph, can be described as a series of edge insertions that is compactly encoded into a stream of bits, methods of the invention can be used to transfer graphs or portions thereof from one location in memory to others, or between two separate computing systems. Where graphs are implemented using node and edge objects that are connected via spatial referencing in memory (e.g., via adjacency lists or index-free adjacency), due to the spatial referencing being coupled intrinsically to physical locations in memory, those graphs cannot easily be copied or moved according to existing methods. Methods of the invention allow such graphs to be effectively copied or moved by describing graphs as a series of edge insertions. Each edge insertion is given a specification that is compactly encoded as a stream of bits. The entire graph or portion thereof is thus a serialized data structure, which may be compactly and quickly stored or moved.

In certain aspects, the invention provides a method for describing a genetic sequence. The method includes providing a genomic graph that has nodes connected by edges. The nodes and edges are stored as objects in non-transitory memory of a computer system. An edge (or in some embodiments, a node) within the genomic graph represents a genetic sequence. An edge of the genomic graph is converted into a string, or sequence, of bits in which at least a first component of the string of bits stores at least a portion of a coordinate that defines a position of the edge within the genomic graph.

In related aspects, the invention provides a system for describing a genetic sequence. The system includes a processor coupled to a non-transitory memory subsystem. A genomic graph that has nodes connected by edges is stored in the memory subsystem. The nodes and edges are stored as objects in the non-transitory memory subsystem. An edge (or in some embodiments, a node) within the genomic graph represents a genetic sequence. The system includes instructions that when executed by the processor cause the system to convert an edge of the genomic graph into a string of bits in which at least a first component of the string of bits stores at least a portion of a coordinate that defines a position of the edge within the genomic graph.

In various embodiments of methods and systems herein, the string of bits provides a representation of the edge that includes a coordinate and may also include information that identifies a start position and an end position of the edge within the genomic graph. The information that identifies the start position and end position of the edge may be a pair of coordinates that define a starting position and an ending position of the inserted edge within the genomic graph. The string of bits may include a second component that includes at least a portion of a start coordinate and an end coordinate.

In certain embodiments of the systems and methods, the starting coordinate and the end coordinate share a common prefix and the string of bits includes a third component that includes the common prefix.

In some embodiments, the first component includes length bits indicating a length of the first component and representation bits that include the portion of the starting coordinate. The first component may further include flag bits that represent a type of variation (e.g., an insertion of one or more nucleotides, a deletion of one more nucleotides, or a substitution of one more nucleotides) between the edge and a reference genome. The second component may include second length bits, second flag bits, and second representation bits. Each component may be stored as a sub-string of the string of bits in which a first portion of the sub-string represents a length of the sub-string and a second portion of the sub-string includes stored information.

The specifying and storing steps may be performed to convert all, or a substantial portion of, the genomic graph into a linear string of bits comprising a representation of the plurality of edges. A plurality of edges from the genomic graph may be stored, each as its own string of bits. As will be described in further detail herein, some of the edges of the plurality of edges may be merged together as a single string of bits in order to increase efficiency, lit) some embodiments, an entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing coordinates.

In some embodiments, the first component further comprises flag bits that identify a variant relationship between the genetic sequence and a reference genome represented by a reference path through the genomic graph.

The disclosure further provides systems and methods for describing a specific path within a genomic graph by assigning coordinates that specify a start point and an end point within the genomic graph and describe a unique path from the start point to the end point. Methods and systems of the disclosure are applicable to genomic graphs that store genomic sequence from multiple sample genomes as a single unified graph, and represent genetic variation as divergent paths within the graph while representing genetic similarities using convergent paths. Because the assigned start and end coordinates describe a unique segment within the graph, specific information and paths within the sequence graphs are easily transmitted between users and between different computer systems. For example, as genomic information, such as a variant, is identified (e.g., by mapping sequenced reads from a DNA sequencer to a reference graph), that information is transmitted to another party as a path within the graph without having to transmit or recreate the entire reference graph.

Additionally, some embodiments provide a variable-length encoding scheme by which coordinates (as described above) are compactly encoded in a binary format. This allows for an exchange of information from graphs while reducing the computational resources required for efficient and meaningful use of the sequence (e.g., genomic) data. To that end, some embodiments operate to efficiently encode and byte-align sequence data by, at least in part, assigning a coordinate to the sequence data that represents a path through the graph. Simply put, the assigned. coordinate can be an integer or plurality of integers that represent a location of a specific sequence within a graph, even as the sequence graph continues to rapidly expand with new information.

Beyond the use of coordinates to specify a path segment within a graph, methods and systems described herein may also encode coordinates using a minimal number of bytes necessary to preserve the accuracy of the original data. The disclosed embodiments provide for variable length encoding with lossless compression. A computer system analyzes the genomic information and chooses the minimal number of bytes required to store the information. For example, the system may receive genomic information in the form of a sequence read, identify a path segment within a sequence graph that matches that sequence read, and encode that segment by assigning a coordinate to the segment and compressing the coordinate using the variable length encoding scheme, which minimizes the number of bytes and embeds the number of bytes used (e.g., a length value) within the encoded information. The encoding is variable in that the system chooses the number of bytes based on the specific genomic information it is analyzing. By minimizing the number of bytes, the computer system achieves efficiencies in storage and speed while not losing data.

Variable-length encoding applies to a variety of formats of sequence data and genomic data, such as a multiple sequence alignment, a set of sequence read data, or a directed graph representing relationships among multiple genomic sequences. Since the variable length encoding may be applied to very large datasets, such as sequence reads, sequence alignments, sequence graphs (e.g., genomic graphs), very large genomic data sets may be encoded compactly. Since the variable length encoding embeds the number of bytes used (e.g., a length value) within the encoded information, a downstream decoder will not require prior knowledge of the number of bytes required for each component. Since the system can encode the data efficiently in terms of both compression ratio and time while also preserving the accuracy of the information, the size or required processing time are not limiting factors in analyses for medical genetics, genomic research, and next-generation sequencing (NGS).

Encoding schemes described herein may provide a solution for storing integers that efficiently encodes integers while remaining manageable. The encoding schemes may efficiently, represent a value and can accommodate auxiliary binary information (e.g., a set of flags), which can serve to identify the particular coding context used. Further, the encoding schemes present a compromise between encoding and decoding complexity and coding efficiency by using variable lengths, yet providing byte-aligned code, resulting in no need for additional padding or meaningless bytes between the various data structures. Advantageously, parameters can be adjusted to fit any application. Further, strings of encoded integers can be placed in sequence and can have different parameters, allowing for complex data structures and coordinates to be defined.

Compression techniques are available to reduce the amount of space required for a sequence graph. However, conventional compression techniques are generally suited for compression of text files, and thus, may not optimally compress genomic or graph related data. Using methods of the invention, the encoding process is efficient in terms of both compression ratio and time while also being lossless in terms of information accuracy.

In one aspect, the disclosure provides a computer-implemented method for encoding genetic sequences. The method includes obtaining a genomic graph representing genetic sequences from one or more organism, assigning a coordinate that defines a path from a first position to a second position within the genomic graph, and encoding the assigned coordinate using a variable length encoding scheme. The encoded coordinate includes length information.

In another aspect, the disclosure provides a system for specifying the genotype of an organism. The system includes a reference graph and instructions stored within a tangible memory subsystem. The reference graph includes nodes connected by edges into a plurality of paths. The plurality of paths each represents a variation from a reference sequence. When executed by the system, the instructions cause the system to assign a coordinate or pair of coordinates representing one or more of the plurality of paths. The number of paths depends upon the ploidy of the organism; for example a diploid organism which is heterozygous at a specific region within the sequence graph would actually require two concurrent paths through that region in order to specify the genotype. The instructions further cause the system to encode the coordinates using a variable length encoding scheme. The assigned coordinates define the path from a first position to a second position, and the encoded coordinates include length information about the size of the encoded representation. In certain implementations, the genotypes may be determined by evaluating the number of sequence reads aligned to particular paths within the reference graph.

In some embodiments, a variable length encoding scheme includes determining the number of bits to represent the coordinate, encoding the coordinate using the determined number of bits, and generating a header value that indicates a total number of octets representing the encoded coordinate. The encoded coordinate may be represented by at least one octet, the at least one octet being a header octet storing a header value. In some embodiments, the encoded coordinates are in a byte-aligned format. In some embodiments, the encoded coordinate is represented by a plurality of octets, at least one of the plurality of octets being a header octet including the header value.

In some embodiments, the genomic graph includes a plurality of nodes and a plurality of edges, wherein the nodes connected by the edges are stored in a tangible memory subsystem. Nodes and edges may be connected using adjacency lists including pointers to physical locations in the memory. In some embodiments, the genomic graph is provided within a tangible memory subsystem of a hardware computer system. In some embodiments, a processor is coupled to the tangible memory subsystem.

In some embodiments, encoded coordinates may be stored within the tangible memory subsystem of a hardware computer system. In some embodiments, an encoded coordinate can be decoded without prior knowledge of the total number of octets representing the encoded coordinate. Assigning a coordinate may include identifying a series of integers, where each integer is a component of a coordinate, and converting the series of integers to a binary format.

In some embodiments, a variable length encoding scheme also includes obtaining auxiliary binary information and encoding the auxiliary binary information using the variable length encoding scheme. In some embodiments, at least one of the plurality of octets includes the encoded auxiliary binary information.

In some embodiments, the genomic data includes a directed graph representing at least a portion of a plurality of genomes. The system may serialize the genomic data. In certain embodiments, encoding the characters within each of the plurality of segments transforms the directed graph into a sequence of bits. The system may transfer the sequence of bits serially between the memory subsystem and a second computer system. The system may support quick or random access. The system may write a table that includes a plurality of pairs <p, p′>, where p is an offset in the genomic data and p′ is an offset in the encoded characters, and allow characters of known position in the genomic data to be read from the encoded characters by reading from the table.

These and other aspects, features, and implementations, and combinations of them may be expressed as apparatus, methods, means or steps for performing functions, components, systems, program products, and in other ways.

Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Features and advantages of the claimed subject matter will be apparent from the following detailed description of embodiments consistent therewith, which description should be considered with reference to the accompanying drawings.

FIG. 1 diagrams an embodiment of a method for decoding a genomic graph from a serialized specification.

FIG. 2 shows a genomic graph comprising two nodes connected by an edge, wherein the edge is associated with a nucleotide sequence.

FIG. 3 shows an initial stage of adding an edge to the genomic graph of FIG. 2 by adding two vertices to a previous edge, thus segmenting the previous edge.

FIG. 4 shows the addition of a new edge to the genomic graph of FIG. 3.

FIG. 5 diagrams a method for encoding a genomic graph into a serialized specification of the genomic graph.

FIG. 6 illustrates the genomic graph of FIG. 4 prior to encoding.

FIG. 7 illustrates converting information describing an edge in the genomic graph of FIG. 6 into a string of bits.

FIG. 8 illustrates the genomic graph of FIG. 6 with an edge removed.

FIG. 9 diagrams an embodiment of a method of processing bioinformatics data.

FIG. 10 shows certain paths in a directed genomic graph starting from a source vertex.

FIG. 11 shows certain paths in a directed genomic graph starting from several different vertices.

FIG. 12 shows certain reverse paths in a directed genomic graph.

FIG. 13 depicts an embodiment of a method of variable length encoding according to the disclosure.

FIG. 14 shows a representation of the encoded information using an embodiment of a variable length encoding scheme (e.g., a length-flag-representation-8 bit octet (LFR8) encoding scheme).

FIG. 15 shows an encoding of a coordinate value, “4,954”, in a length-representation-8 bit (LR8) encoding scheme where three length bits in a header octet represent a header value.

FIG. 16 shows an encoding of a coordinate value, “4,954”, in a length-flags-representation (LFR8) encoding scheme where three length bits in a header octet represent a header value and three length bits in the header represent auxiliary binary information (e.g., a set of flags).

FIG. 17 depicts a table showing how an encoding scheme according to the disclosure can be modified to accommodate both a variety of length bits and flags.

FIG. 18 depicts a table illustrating several kinds of sequence variations that affect a single position in the genome, and including vectors that may be associated with variations and examples of corresponding graphs showing those variations.

FIG. 19 is a table illustrating two kinds of sequence variations that affect multiple positions in the genome, including vectors and example graphs.

FIG. 20 illustrates an embodiment of a random-access entry point map.

FIG. 21 diagrams systems of the invention.

FIG. 22 illustrates obtaining a set of sequence reads from a sample.

FIG. 23 illustrates an example of an encoded LFR8 coordinate.

FIG. 24 illustrates an example of the encoded LFR8 coordinate of FIG. 23 after decoding.

FIG. 25 illustrates a toy example of a serialized genomic graph.

For a thorough understanding of the present disclosure, reference should be made to the following detailed description, including the appended claims, in connection with the above-described drawings. Although the present disclosure is described in connection with exemplary embodiments, the disclosure is not intended to be limited to the specific forms set forth herein. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient.

DETAILED DESCRIPTION

Described herein are systems and methods for describing a genetic sequence as an insertion of an edge to a genomic graph and compressing the description for compact storage and ease of transfer. Various embodiments of systems and methods disclosed herein may be referred to as Compressed Insert Specification (“CIS”) Format. Further, various methods and systems described throughout this disclosure provide an approach to encoding information (e.g., genetic information) that minimizes the number of required bytes, which is particularly advantageous when the information includes genomic or graph related data. For example, genetic information can be structured as a reference graph that can store genetic information from multiple individuals, or even multiple organisms. Using a minimal number of bytes, the invention encodes both coordinates and paths through a genomic reference graph using a variable number of bytes, while also storing information in the encoded information (e.g., in a header octet) about the total size of the total number of bytes. This minimizes wasted bytes as the total number of bytes is tuned to the specific information or query and simplifies the decoding process as a decoder is automatically provided with this key information.

One way to represent a genome is as a directed graph having a plurality of vertices and a plurality of directed edges in such a reference graph representation of a genome (e.g., a sequence graph or a genomic graph), genetic variations within species can be incorporated as divergent paths along the graph. Genetic similarities within a species can be represented as a single path. In other words, the data associated with genetic similarities must only be stored once in the sequence graph. Genomic data, such as nucleotide or protein sequences, may be associated with either the vertices or edges of the graph, and thus the genomic graph can be either vertex-based or edge-based. The direction of the edges indicates the allowable sequences that may be represented by following paths through the graph. Different paths through the graph may represent variation in sequences. An edge-based genomic graph can be built, for example, by first associating a single edge with a reference sequence. While the present disclosure illustrates edge-based graphs, however it should be noted that methods and systems described herein are equally applicable to vertex-based graphs.

In practice, a graph may be serialized and deserialized in order to store and load the graph in computer memory. FIG. 1 diagrams an embodiment of a method 101 for decoding a genomic graph from a serialized specification by constructing the graph as a plurality of edge insertions. The method 101 can begin by selecting 105 a first edge insertion from the serialized specification and associating 109 a reference sequence from the serialized specification with a first edge of the graph. This first edge may be referred to as the backbone of the graph. A second edge insertion and its position on the graph are then selected 113 from the serialized specification. A sequence from the second edge insertion is associated 117 with a second edge and inserted into the graph at the position. Subsequent edge insertions are similarly processed 121, e.g., by creating new edges within the graph at various positions. Once there are no remaining edge insertions from the serialized specification, the genomic graph is complete 125.

FIGS. 2-4 illustrate an example embodiment of constructing a genomic graph according to the method 101 of FIG. 1. FIG. 2 shows an example of a genomic graph 201 comprising nodes 205 connected by edges 209, wherein the nodes and edges are stored as objects in non-transitory memory of a computer system and wherein an edge 209 within the genomic graph represents a genetic sequence (e.g., TACTGGAC). In this embodiment, the graph 201 has been created by selecting 105 a first edge insertion from a serialized specification and associating 109 a reference sequence (here, the genetic sequence TACTGGAC) from the serialized specification with a first edge 209 of the graph (e.g., by the method 101 of FIG. 1). However, in certain embodiments, the graph 201 may be a pre-existing construct or have been generated in another manner.

Variation of the genetic sequence can be described by adding subsequent edges to the is graph. An edge can be added to the genomic graph 201 by referring information specifying the edge, such as a sequence and a position on the genomic graph 201. As shown in this embodiment, the genetic sequence is associated with a single edge 209, and each letter within the sequence is associated with an increasing offset value starting from zero. Offset values may be used to identify particular positions within the graph for selecting desired insertion points of new edges, for example.

FIG. 3 shows an initial stage of adding an edge to the genomic graph 201. In this example, the edge to be inserted represents the nucleotide sequence “A”, and has a position on the graph 201 beginning before offset 3 and ending before offset 4. In order to accommodate the new edge, new nodes 205 are added to the graph, thus segmenting the edge 209. A first node 205 is added at a starting position 211 and a second node is added at an end position 215. The starting node can be added by referencing a position, such as a starting coordinate (described later herein) that defines the starting position 211 of the edge within the genomic graph 201. It will be appreciated that different methods may be used to define an end position 215 of the soon-to-be-added edge. For example, an end coordinate may be explicitly specified with the serialized specification. Alternately, the numerical value of the end coordinate may be inferred by associating the edge with a commonly recognized insertion type, and then calculating the end coordinate value based on the designated properties of that edge type. To illustrate, if the edge is defined as representing a SNP, its end position can be inferred as one nucleotide away from the starting position. Given the starting position 211 and information that identifies the end position 215 of the new edge, the new edge can be added to the genomic graph 201.

FIG. 4 shows the addition of a new edge 219 to the genomic graph 201. As illustrated in FIG. 4, a variation, such as a single nucleotide polymorphism (SNP), can be inserted into the genomic graph 201 as a new edge. As shown in FIG. 3, the reference sequence edge 209 has been segmented by the addition of two nodes 205 at the position in the reference sequence where the variation occurs. Each end of the new edge representing the variation is then connected to the graph at the newly added nodes, respectively. Two paths are now available through the graph: 1) the reference sequence (TACTGGAC), and 2) the reference sequence including the SNP (TACAGGAC). This process may be repeated until the genomic graph is complete. At each step, an edge is inserted into the graph, which may require segmenting previously existing edges into multiple edges by the addition of new nodes 205.

In certain embodiments, the offset values are retained as edges are segmented by the addition of new inserts. For example, as shown in FIG. 2, the edge 209 has increasing offset values from zero to seven, and the multiple edges created by segmenting the edge 209 (e.g., by adding nodes at the position 211 and position 215 of FIG. 3) retain their increasing offset values, despite now being associated with separate edges. As will be described in further detail below, this feature may be used to help determine which edge or set of edges to encode and remove from the graph during serialization.

Further, as each edge is sequentially inserted into the graph, the new edge may also be associated with a unique identifier (e.g., a number that increments by one, relative to the identifier associated with the existing edge upon which the insertion is occurring). As previously inserted edges are segmented into multiple edges (i.e., by the addition of a new nodes in order to accommodate the inserted edge), these segmented edges may retain their initial identifier. Identifiers may be utilized later when determining an order for deconstructing the graph, for example. Typically a graph is deconstructed in the order in which it was constructed, i.e., the inserted edges are removed in the reverse order in which they were inserted.

The above-described method may also be reversed to serialize a genomic graph, FIG. 5 diagrams a method 501 for serializing a genomic graph. The method 501 includes the steps of providing 505 a genomic graph comprising nodes connected by edges, wherein the nodes and edges are stored as objects in non-transitory memory of a computer system and wherein an edge within the genomic graph represents a genetic sequence or variation in genetic sequence. An edge or set of edges are identified 509 and selected for serialization, and a specification of the edge or set of edges are generated 513. In certain embodiments, the edge is converted into a string of bits such that at least a first component of the string of bits stores at least a portion of a coordinate that defines a position of the edge within the genomic graph. That string of bits is a specification that describes the edge as an insertion to the genomic graph. By having written the edge to the string of bits, that edge can be removed 517 from the graph. The identification 509, conversion 513, and removal 517 steps can be iterated 521 to serialize 525 all or substantial portions of the graph.

FIGS. 6-8 illustrate an exemplary deconstruction of the graph 201 (created in FIGS. 2-4 using the method of FIG. 1) by identifying, encoding, and removing an edge or set of edges. Graph deconstruction, or serialization of a graph into CIS format, can be performed by selecting an edge or set of edges that were previously inserted (or, if this information is not available, could have been previously inserted) as a single edge into the graph. (The edge or set of edges may be referred to as a branch or path through the graph. In genome graphs, alternate branches describing genomic variation typically return to the base branch, and thus may also be referred to as “anabranches.”) Further, FIGS. 6-8 illustrate how the graph 201 (which has four vertices and four edges) can be described as only two edge insertions, resulting in an efficient storage representation.

Identifying 509 an edge or set of edges to encode and remove can be performed in a variety of ways. In certain embodiments, the edge or set of edges may have an associated identifier (e.g., an identifier that was associated with the edge when it was first added to the graph). In these embodiments, the associated identifier indicates the order in which the edges were added to the graph. The most recently added edge or set of edges may then be selected for removal. Edges segmented during graph construction may retain the same identifier, and are thus selected together. Identifiers may be associated with edges as metadata, for example.

An ordering for edge removal and graph deconstruction can also be determined if no previously existing information is available. In certain embodiments, the ordering may be determined by identifying edges that were most recently added, i.e., in the reverse order of construction. For example, one may traverse the graph 201 to identify an edge or set of edges for removal. This can be performed by traversing the graph (e.g., by a depth-first search) and noting those paths through the graph which begin with nodes having multiple outgoing directed edges, end at nodes having multiple incoming edges, and have at least one directed edge specifying a linear path between them, without any intervening nodes. If no paths satisfy this rule, then the only remaining set of edges is linear (i.e., has no divergent paths) and can thus be selected as the final set of edges for encoding and removal. For example, as shown in the embodiment of FIG. 6, the edge 219 could be selected for removal because it begins with a node at the position 211 that has multiple outgoing edges, ends with a node at the position 215 having multiple incoming edges, and has a single directed edge 219 between. Similarly, an edge 213 representing the “T” nucleotide could also be selected by this rule. In such cases, an encoder may also consult the offset values to determine which edge or set of edges to remove. For example, the edge 213 representing the “T” nucleotide has an offset value of 3, indicating that it is not the first letter of the originally inserted nucleotide sequence. Accordingly, if one were to deconstruct the graph of FIG. 6, one would start by selecting the edge 219 representing “A” and having an offset of 0. Because this procedure identifies an edge or set of edges that could have been inserted last, it can be performed dynamically, e.g., by identifying an edge or set of edges to remove, encoding the edge or set of edges as a single edge, removing the edge or set of edges from the graph, and repeating until there are no remaining edges to remove.

In other embodiments, an ordering for removal may be determined by identifying an order in which the edges may have initially been inserted into the graph and then annotating the graph with this information. In one embodiment, a first path is selected from the graph that traverses from a source vertex (i.e., the first vertex) to a sink vertex (i.e., the last vertex). This path is considered to be the backbone branch, and the edges and/or vertices in this path may be annotated with an initial identifier (e.g., 1 or 0), indicating that it should be inserted as a first edge insertion when creating the graph (or, if deconstructing the graph, the last to be encoded and removed). (The relevance of a backbone path, source vertex, and sink vertex are discussed in more detail below with respect to FIG. 9.) Next, a second path is selected by following the backbone branch from the source vertex to the first untaken directed edge, or noting the first available edge that does not belong to the first path or backbone branch. A path starting from that edge and reconnecting with the backbone branch is then selected and annotated with an incremented identifier, indicating that the edges along that path should be inserted into the graph as a second edge insertion (or, encoded and removed next-to-last). This process continues iteratively until each path is annotated with an identifier. The identifiers specify the order in which the graph could have been created via edge insertions, and thus can be used to specify an order in which the graph can be deconstructed and encoded.

Preferably, path selection identifies the longest available path available through the graph. For example, when selecting the backbone path, the graph may be traversed (e.g., via a depth-first search) to identify the longest path in terms of the number of symbols, edges, vertices, or other measure. Selecting the longest available path is beneficial because it reduces the number of subsequent edge insertions that need to be encoded and decoded, resulting in a more efficient and compact representation of the graph in serialized form. Accordingly, in certain embodiments, a graph may be deconstructed in a different manner from how it was initially constructed.

When annotating paths with ordering information, those edges connected to vertices having multiple edges may be annotated with an additional identifier indicating its order in terms of the number of available outgoing paths from that vertex. For example, after the backbone branch is first identified, the first directed edge not followed (which is then followed when identifying the second path) may be annotated with a “2”, indicating that it is the second available directed edge and specifying the second path that can be followed from that vertex. Similarly, the directed edge associated with the backbone branch may be annotated with a “1”.

Once an edge or set of edges have been identified 509 (i.e., the edge 219 representing “A” in the graph 201 of FIG. 6), a serialized specification of a single edge representing the edge or set of edges is then created 513. In this example, first, the position of the edge 219 is noted. As shown in FIG. 7, the position of the edge 219 can comprise the starting coordinate and end coordinate (e.g., “3” and “4”, as shown in box 601). The serialized specification may also comprise a length of the edge (i.e., the length of the associated nucleotide sequence, here “1”), and/or the nucleotide sequence itself (here, “A”). This information is then encoded into a serialized specification of the edge. Using methods described below (e.g., using “LFR8” and “LR8” encoding), the specification of the edge can be converted 711 into a string of bits 701 in which at least a first component stores at least a portion of the position. The string of bits will provide a representation of the edge that includes the position of the edge within the genomic graph.

Once the edge has been serialized and encoded, it may then be removed from the graph 201, e.g. as shown in FIG. 8. The process can be repeated 521 until the graph has been entirely, deconstructed, with the result being a series of statements, each describing the properties of a particular set of edges in the graph. Note that the graph 201 now includes two additional nodes which are no longer necessary. From the perspective of serialization, these nodes are essentially, ignored. (This feature results in an advantage of the present invention in that sets of edges may be serialized together, thus leading to efficient serialization and a reduction in the number of steps required to construct or deconstruct a graph.) As the graph now only contains a single linear set of edges, these edges are then selected together as a single edge for serialization and encoding. Deconstructing a graph in this manner can be referred to as serialization 125 or linearization, because a multi-dimensional structure becomes linearized into a discrete set of edges. Similarly, the graph can be reconstructed (i.e., de-linearized) by sequentially adding each linearized edge back to the graph and segmenting pre-existing edges as needed.

Linearizing graphs is useful because graphs can take up a large amount of computer memory, and thus it can be more efficient to load the graph or subsets of the graph only when needed. However, at the scales associated with genomic data, procedures associated with linearizing and de-linearizing graphs may also present a performance bottleneck. These issues can result in complexities during analysis, causing a non-negligible impact on genomic research. Systems and methods of the invention provide improvements in storing and representing graph-related data at genomic scales. While shown in FIGS. 2-4 in terms of the insertion of an edge into a genomic graph, methods of the invention are operable for the removal of an edge and the conversion of the edge into a string of bits as shown in FIGS. 6-8.

The present disclosure recognizes that a graph can be described as a plurality of edge insertions. Every inserted edge can be described by two parameters: a position of the edge within a graph, and the variation in genetic sequence that the edge represents. The position within the graph may comprise a start position and an end position. The variation in genetic sequence may comprise the length of the sequence or number of symbols associated with that edge, the genetic sequence itself, and/or an encoding of the type of variation. This information can be used to store graphs in a serialized format for ease of transport and instantiation. Further, this information can be efficiently represented to improve access and loading times, e.g., by using embodiments of encoding schemes as described below.

Graph Genome Coordinate Encoding

FIG. 9 diagrams a method 901 of processing bioinformatics data. Using genomic sequence data as an example, a computer system obtains 902 a genomic or reference graph representing genetic sequences from an organism. The genomic graph may be stored within the computer system or retrieved from a remote location. The computer system can then execute an assigning process 904 in which a coordinate is assigned that defines a path from a first position to a second position within the genomic graph. The computer system can then execute an encoding process 906 in which the coordinate is encoded using a variable length encoding scheme (which may be referred to as an LFR8 or LR8 encoding scheme).

The LR8 and LFR8 encoding schemes described herein represent coordinate components as a variable number of bytes, wherein the representation itself includes information about its size. In particular, an encoding scheme for storing coordinate components is proposed in which a header octet indicates the number of additional bytes (“Length”) following the header octet that is used to accommodate the stored coordinate component (“Representation”). In this way, the encoding scheme is able to minimize the number of bytes required to store a coordinate component, thus achieving efficiencies in storage and speed while also not suffering from problems associated with pure variable length encoding schemes. Further, the header may also store other information related to the integer or position (e.g., whether the integer represents the last component in the coordinate) as additional binary information, such as a set of flags (“Flags”). The LFR8/LR8 encoding schemes can be used to efficiently encode any kind of data, including graph-related data, such as coordinates and paths.

Assigning a Coordinate

Methods of the disclosure provide a coordinate system to identify locations or positions in a graph. Coordinates may comprise a series of integers (which may be referred to as “coordinate components”) describing a path through a graph that may be taken to arrive at a particular position. The positions at which serialized edges should be inserted into the graph can be defined by coordinates as described herein, for example.

The disclosed method encodes graphs and accompanying variation data, and is compatible with a variety of graph coordinate systems. For purposes of the present disclosure, a coordinate refers to an expression that defines a particular or relative position within the graph. Once a coordinate system has been established, pairs of coordinates can be used to define a trajectory specifying a path from a first position to a second position within the graph.

FIG. 10 shows certain exemplary paths in an embodiment of a reference graph 100 (which is an example of a genomic graph). The reference graph 100 can be a directed graph that depicts known variation in a reference genome. As shown in this embodiment, the reference graph 100 associates nucleotide sequences with the edges (and thus is an edge-based representation). Edges in the graph 100 are connected by a plurality of vertices. Al each vertex, there may be one or more outgoing directed edges specifying the possible paths that may be followed from that vertex. A series of distinct outgoing edges connected by vertices may be referred to as a branch. (Similarly, the term “path” may describe a particular series of outgoing edges connected by vertices, which may span multiple branches.) Each letter in the nucleotide sequence may be associated with an offset, which is an integer specifying the relative position of the letter within the branch. In other words, along a given branch, the offset starts from “0” (at the first nucleotide) and increases by “1” for each successive nucleotide. In the described examples, the offsets are zero-based. However, other offset values or even coordinate systems may be used.

Often, reference graphs designate a backbone branch representing a particular linear reference sequence (e.g., the nucleotide sequence of chromosome 1 of the human genome). Additional branches diverging from and reconnecting to the backbone branch thus represent known variation from the linear reference sequence, and represent events such as single nucleotide polymorphisms (“SNPs”), small insertions and deletions (“indels”), and larger structural variants. As shown in FIG. 10, the graph 100 comprises a backbone branch 102 having 5 edges which, when read sequentially in the direction associated with each edge, represents the nucleotide sequence “ACTGACTG”. Each nucleotide in the sequence is associated with an increasing offset value (0, 1, 2, etc.) indicating its relative position along the branch. Further, the backbone branch 102 has two special vertices: a source vertex 130, designating the first vertex in the graph 100 and the 5′ end of the associated nucleotide sequences; and a sink vertex 132, designating the last vertex in the graph and the 3′ end of the associated nucleotide sequences.

The reference graph 100 includes three additional branches specifying variations from the backbone branch 102. For example, a first branch 104 represents a nucleotide sequence “A”, a second branch 106 represents a nucleotide sequence “GACTG”, and a third branch 108 represents a nucleotide sequence “TACTG”. As shown, the first branch 104 and the second branch 106 are each connected to the backbone branch 102, and the third branch 108 is connected to the second branch 106. Similar to the backbone branch 102, a nucleotide sequence for each of the branches 104, 106, 108 is associated with an increasing offset value starting from zero.

Each branch describes a path or a particular set of edges and vertices within the reference graph 100 that may be followed through a series of increasing offset values. While the branches 102, 106, and 108 (and offsets) are shown at specific locations relative to the backbone branch 102, these or other branches could be specified in a variety of ways. For example, the backbone branch 102 could instead begin with “AAT” instead of “ACT”. In that case, the edge representing the nucleotide “A” in 104 would instead be reassigned to the backbone branch 102, with a corresponding reassignment of the offset values as well the edge within branch 102 that specifies “C” would instead be assigned an offset of “0” as a new alternate branch, whereas the edge specifying “A” would have offset “1” in the backbone branch). Accordingly, for a particular graph, designated branches and offset values may vary.

In practice, designating offsets and branches may be specified as a graph is created. The first reference sequence incorporated can be specified as the backbone branch, and its associated nucleotide sequence can be assigned increasing offset values. Subsequent sequences representing variations from the reference can be inserted as new branches connected to the backbone branch. These new branches will have new relative offset values starting from zero.

One exemplary coordinate system to identify particular positions on the graph specifies a sequence of (offset, branch) pairs. Pairs may be specified as a list of integers, each separated by a period “.”. Each integer in the coordinate may also be referred to as a coordinate component. The sequence of (offset, branch) pairs indicates a unique path that may be taken to reach a particular position within the graph. For example, starting from the source vertex 130 (i.e., the first vertex on the backbone branch 102), each (offset, branch) pair represents (i) the desired offset to be reached on the current branch, followed by (ii) an indication of whether an edge to an alternate branch (and having a new relative offset) is followed before that offset. Subsequent pairs in the coordinate are similarly followed. In this exemplary system, the first integer is zero-based (i.e., the first offset is “0”, the second offset is “1”, etc.), whereas the second integer begins from 1 (i.e., the first outgoing edge is “1”, the second outgoing edge is “2”, etc.).

For example, an “A” 110 in the first branch 104 may be identified by proceeding to the first offset (1), and moving to an alternate branch, i.e., by following a second outgoing edge 120 (2) from the vertex immediately before that offset. Therefore, the location of the “A” 104 in the reference graph 100 can be represented by the coordinate 1.2. While not necessary, the coordinate may also include that no further offset movements are required. Therefore, the “A” 110 can also be represented by the coordinate 1.2.0.

Similarly, if no divergent paths from the current branch are required to reach the desired position, the second integer in the coordinate can be omitted. In another example, a “T” nucleotide 112 on the backbone branch 102 may be specified by the coordinates “2,” which simply represents a traversal to the nucleotide having offset “2” in the backbone branch 102 from the source vertex 130.

More complicated coordinates can include multiple (offset, branch) pairs, indicating the traversal across multiple branches to reach a desired position. For example, a “C” nucleotide 114 in the third branch 108 can be reached by proceeding, from the source vertex 130 on the backbone branch 102, to the nucleotide having offset 3 (3), following a second outgoing edge 122 (2) before that offset to the second branch 106, proceeding to offset 3 (3) on the second branch 106, taking a second outgoing edge 124 (2) just before that offset to the third branch 108, and then traversing to offset two (2) on the third branch 108. As a result, the location of the “C” nucleotide 114 in the reference graph 100 can be represented by the coordinate 3.2.3.2.2.

In use, an initial offset may be much larger than the prior examples due to the large size of a reference genome. For example, a SNP on chromosome 1 at position 203,422,877 could be identified by the coordinates: “203422877.2”, representing a traversal at that position along a second outgoing edge (2) to an alternate branch representing the SNP.

In some examples, an offset may be larger than the number of nucleotides on a current branch. In these cases, the offset is considered to “fall off” the branch (e.g., along an outgoing edge 120 from the first branch 104, an outgoing edge 126 from the second branch 106, or an outgoing edge 128 from the third branch 108) and continue along the connected edges until the desired relative offset is reached. For example, the second “C” nucleotide 116 in the backbone branch 102 (at offset 5) can also be reached by coordinates “3.2.6”, representing a traversal to the vertex having offset 3 on the backbone branch 102, following the second outgoing edge 122 to the second branch 106 before that vertex, and then traversing an additional six offsets. However, after offset 4, the only available outgoing edge is edge 126, which “falls off,” or returns to the backbone branch 102. An additional two offsets are then traversed to arrive at the “C” nucleotide (having offset 5 on the backbone branch 102).

The coordinates essentially describe a path that may be taken to identify a particular position in the reference graph 100 from the source vertex 130. As graphs are multi-dimensional structures, a single position may be represented (or reached) by multiple different coordinates. Each unique coordinate describes a unique path to the selected position within the reference graph 100. For example, the “T” nucleotide 112 in the backbone branch 102 may also be represented by, the coordinate “1.2.1.” The coordinate 1.2.1 represents a path that: traverses to the nucleotide having offset 1 (i.e., the first “C” nucleotide in the backbone branch 102), follows the second outgoing edge 120 before that offset to the first branch 104, and traverses an additional two offsets to the “T” nucleotide 112. Similar to the previous example, the offset specified by the coordinate component is relative because the first branch 104 ends and returns to the backbone branch 102. Similarly, the “C” nucleotide 114 in the third branch 108 may be represented by the coordinates “3.2.3.2.2” and also “1.2.2.2.3.2.2”.

When a single coordinate is used to represent a single point or offset within the graph, they essentially describe a path that may be taken, starting with the leftmost or “source” vertex 130, to identify a particular position in the reference graph 100. However, pairs of coordinates may be specified indicating a start position and an end position. The trajectory between the start position and end position defines a path that may be taken through the reference graph 100. By specifying a pair of coordinates, a path may be specified that starts from a coordinate other than the source vertex 130. FIG. 11 illustrates three exemplary paths 200, 220, 240 between pairs of coordinates in the reference graph 100 in which the starting positions are specified by a starting coordinate, such that the path to identify the desired position can begin from vertices other than the source vertex 130. This can be particularly useful when handling sequence information from next-generation sequencing machines, such as for representing the alignment of a plurality of sequence reads to a reference graph.

Briefly, next-generation sequencing reads may be aligned against a graph genome (such as the reference graph 100) using a graph-aware alignment algorithm or another algorithm that has been modified to align to a graph. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. The start position of each sequence read may be located at any position within the graph genome. Further, alignments that describe variation (such as SNPs, short insertions or deletions, or larger structural variations) may have paths that span multiple branches. Paths and alignments can be represented in the reference graph 100 by defining a trajectory or path between a first coordinate and a second coordinate. The first coordinate can represent the position at which the 5′ end of the sequence read aligns with the reference sequence, and the second coordinate can represent the position at which the 3′ end aligns. Further, the path defined by the second coordinate can be relative to the first coordinate, and describe the path taken from the first coordinate over which the sequence read aligns.

For example, to describe the path 200 representing the position of the nucleotide sequence “AAT” (as generally indicated by edges 202, 120, 204), the path 200 begins at the source vertex 130 (e.g., offset 0), proceeds to the “C” nucleotide at offset 1, follows the second outgoing edge 120 before that offset, and then traverses two offsets to the “T” nucleotide 112. This path can be represented by the starting coordinate “0” and the ending coordinate “1.2.1”. The coordinates may be combined by a delimiter (e.g., “..”) to define a path “0..1.21.”

In another example, to describe the path 220 representing the position of the nucleotide sequence “TGACTGA” (as generally indicated by edges 204, 122, 222, 224, 226, 228), the path can be specified by the pair of coordinates “2..3.2.6.” In this case, the path begins at offset two on the backbone branch 102 at the “T” nucleotide 112 (2), and then proceeds (“..”) to offset three (3), follows the second outgoing edge 122 to the third branch 106 before the offset (2), and then proceeds seven offsets to the “C” nucleotide 116 (6). Note that proceeding seven offsets returns the path to the backbone branch 102.

When defining paths, the second coordinate may be specified relative to the branch reached by first coordinate. For example, the path 240 representing the position of the nucleotide sequence “GACTAC” begins at coordinates “3.2,” which begins the path at offset 0 in the second branch 106 (“G”). The second coordinate is “3.2.2”, meaning proceed to the third offset (3) on the current branch (i.e., the second branch 106), follow an outgoing directed edge (i.e., the edge 124) to an alternate branch (i.e., the third branch 108) before the third offset, and then proceed to offset two (2) on the alternate branch.

Specifying the second coordinate relative to the first coordinate has an advantage in that fewer bytes are required for storing the coordinates. However, in certain embodiments, both the first and second coordinates may be specified starting from a source vertex.

As noted above, the second coordinate defines a particular path to be taken from the first coordinate, which can be used to describe the path associated with a sequence read aligned to the reference graph 100. Therefore, the second coordinate may only be described by a single unique sequence of integers, and not by multiple different choices, as these would define different paths. However, the first coordinate may be specified in any manner. For example, the path 240 “GACTAC” may be specified by either “3.2..3.2.2” or “1.2.2.2..3.2.2.” In some cases, the coordinate requiring fewer or smaller integers is selected so as to maximize efficiency.

Reverse coordinates and paths (i.e., coordinates and paths that follow incoming as opposed to outgoing edges) may also be specified. As the second coordinate is relative to the branch of the first coordinate, when on the base branch, a reverse path may simply be specified by switching the first and second coordinates (e.g., “7..5”). In cases where the reverse path takes a divergent path along a vertex, negative numbers may be used. With coordinates specifying forward paths (as detailed above with reference to FIGS. 10 and 11), the first integer refers to an edge having a particular position offset, and the second integer refers to an outgoing edge on the vertex immediately before the offset (i.e., the vertex in the opposite direction of the directed edge having that offset). Conversely, for reverse paths, a negative number refers to the incoming (as opposed to outgoing) edge pointing toward the vertex immediately after the offset (i.e., the vertex in the direction of the directed edge having that offset). In other words, if a reference graph is specified such that the directionality of the vertices is generally from left to right, the vertex used for identifying edges to alternate branches are different: forward paths are specified relative to the left vertex, and reverse paths relative to the right vertex.

FIG. 12 provides several examples of reverse paths in the reference graph 100. As shown in FIG. 12, a first reverse path 260 begins at starting coordinate “7” and ends at ending coordinate “5” against the directionality of edge 288 (“7..5”), and describes the nucleotide sequence “GTC”.

A second reverse path 280 is specified by the pair of coordinates “2..1.−2.−1”. This describes a path having a start position at coordinate “2” (i.e., the “T” nucleotide 112 on the backbone branch 102) and proceeds “..” to offset 1 on the same branch. The second path 280 identifies the vertex connected in the direction of the edge at offset 1, and then follows the second incoming edge to an alternate branch (“−2”) (“A”). The second path 280 then traverses one additional offset against the directionality of the edge to arrive at offset 0 in the backbone branch 102 (“A”). Taken together, the pair of coordinates “2..1.−1.−1” thus represents the position of the nucleotide sequence “TAA”.

A third reverse path 300 is specified by the pair of coordinates “3.2.3.2.2..−2”. This describes a path having a start position at coordinate “3.2.3.2.2” (i.e., the “C” nucleotide in the third branch 108). The third reverse path 300 then proceeds (“..”) to offset −2, which is equivalent to offset 1 on the second branch 106. This represents the position of the nucleotide sequence “CATCA”.

Reverse paths may be used, for example, to indicate alignments that are the reverse complement of a forward path. In these cases, the 3′ end of a sequence read may have a mapped position before the mapped position of the 5′ end of the read.

Encoding the Coordinate

Conventional variable length encoding maps source symbols to a variable number of bits. For example, a conventional variable length code would store coordinate components using only those bits required for each component. However, this can be troublesome because a decoder will then require prior knowledge of the number of bits required for each component, which introduces additional overhead costs. Moreover, conventional variable length coding will result in a non-byte aligned format. However, most operating systems expect binary data to be represented in bytes, i.e., as 8-bit octets. Therefore, deviating from this norm can lead to problems and confusion during the development of software that uses a graph coordinate system.

FIG. 13 diagrams an encoding process, such as the encoding process 906 of the method 901 of FIG. 9. A computer system scans the assigned coordinates (as described above) and determines 1302 the number of bits sufficient to represent the coordinate. A computer system then encodes 1304 the coordinate using the determined number of bits and generates 1306 a header value that indicates the total number of octets representing the encoded coordinate. The header value may be embedded together with the encoded coordinate. Optionally, auxiliary information (flags) can also be encoded.

In an embodiment, the data is divided into two parts: header values and a data representation. The header provides information about the length of the data representation and allows the data representation to be interpreted accurately. A computer system may be programmed with instructions that when executed by a processor cause the system to interpret the header and use that information to decode the data. In certain embodiments, the header is encoded according to a scheme that depicts a length of what is encoded, certain useful flags, and the encoded representation, all using 8-bit octets (where the encoded representation here will be a length of the data section, such as the number of octets following the header octet). Thus, methods and systems of the invention implement a length-flag-representation 8-bit scheme.

FIG. 14 shows a representation of encoded information 301 using a variable length encoding scheme (e.g., a length-flag-representation-8 bit octet (LFR8) encoding scheme). The proposed scheme can be used to serialize information about coordinates, paths, and positions in a graph. In this example, the data is divided into a series of octets. A first octet 302 defines a length 308 of the encoded information 301. For example, the length 308 indicates the number of octets (or bytes) following the first octet 302. In some cases, the first octet 302 is also referred to as a header octet. The first octet can optionally also include additional bits of information (e.g., auxiliary binary information), which can function as flags 310. In some cases, these flags can indicate the context of the encoded coordinate or coordinate component. For example, a flag can be used to indicate whether the sign of a coordinate component is negative or positive, or to indicate whether a coordinate component is the first or last component in the coordinate. If the first octet 302 has available space (e.g., as shown in FIG. 14) not occupied by length bits or flag bits, the first octet 302 can also include a portion of the bits required for value representation 312, i.e., data bits representing the binary form of the encoded integer. By using any available space within the first octet 302, the use of available space is maximized. Similarly, if the available space in the first octet 302 is insufficient to represent all of the data bits, then the first octet can be followed by, additional octets 303. For example, the encoded information 301 can also include a second octet 304 to an Nth octet 306 to represent all of the data bits in the value representation 312.

FIG. 15 shows an encoding of a coordinate value 402 (as shown, the integer “4,954”), in a length-representation 8-bit encoding scheme where 3 bits in a header octet represent a length value. For example, when represented in binary form, the coordinate “4,954” on a reference genome becomes 1001101011010. The binary form requires 13 bits, and thus requires, at least, two octets, of which three bits are unused in one octet. LR8<3> refers to LR8 format using three (3) length bits in the header octet. The remainder of the header octet is then available for storing the representation of the value, i.e., the encoded coordinate component. As shown in this example, the encoded representation in LR8<3> format comprises two octets: a header octet 404, and a value representation octet 406. The three length bits 308 in the header octet indicate that there is one additional octet in the representation (“001”). Based on the size of the encoded coordinate, the value representation parameter of the encoding scheme can be selected to accommodate the number. A user or a computer could identify the range of numbers expected to be encountered in a particular application, and then select an appropriate value representation parameter that is large enough to accommodate that number. For example, the number of length bits used by the format can vary, e.g., LR8<4> format, LR8<5> format, and so on.

FIG. 16 shows an encoding of the coordinate value 402 (“4,954”), in a length-flag-representation 8 bit (LFR8) encoding scheme where three length bits 308 in a header octet represent a header value and three flag bits 310 in the header represent auxiliary binary information (flags). (This leaves an additional 2 bits available within the header octet, which may be used by value representation bits 312). In some instances, it may be desirable to include additional binary, information within the encoded representation of the coordinate component, e.g., as a set of flags. Flags can be included in the header octet using LFR8 format after the length bits. Sometimes, there is a need to store additional flags with a position. The encoding scheme parameters can be adjusted to accommodate these bits. In these cases, the flag information can be encoded as flag bits 310 (e.g., “101”). As shown in FIG. 16, the encoded representation comprises three octets: a header octet 408, and value representation octets 410, 412. Three flag bits 310 occupy additional bits in the header, leaving insufficient bits available to store the coordinate value 402 (“4,954”) in only one additional octet. Accordingly, two (“010”) additional octets are required for the LFR8 representation.

As shown in FIGS. 15-16, the number of length bits and flags may be provided as a parameter, such as by LR8<X> or LFR8<X,Y>, wherein X represents the number of length bits and Y represents the number of flag bits. The actual value to be encoded and flags may also be provided as secondary parameters, such as by LR8<X>(Value) or LFR8<X,Y>(Value, Flags). Accordingly, the LFR8 and LR8 encoding schemes can be written as a function that encodes and outputs an integer value and a set of flags as a variable number of bits. In this way, the disclosed encoding schemes describe a highly efficient, yet lossless, means for compressing data related to graph coordinates, paths, and positions.

Once encoded, a decoding agent or algorithm would receive the first octet of a coordinate component and then determine, based on information within the first octet, the total size of the representation.

Modifying the Encoding Schemes

In the encoding schemes discussed above (e.g., LFR8 and LR8 formats), the total number of length bits and flag bits does not exceed eight so that the header information remains within a first octet. If some bits are unused, these bits can be allocated to any position within the encoded representation. In the examples described above, the value representation is shifted to the rightmost position of the octets such that the last bit in the last octet is the first bit of the encoded integer. However, these value representations can be shifted in a different position or permuted in a different manner.

Any unused bits (i.e., to the left of the value and after the length and flag bits) may be padded as zeros. However, the LFR8 and LR8 formats could be modified to use additional header octets to store additional flags as needed. Similarly, flags could also be stored in the additional representation octets (e.g., LFR8<3,6> would require at least one flag bit in an additional octet).

As previously described, an LFR8 format is compatible if the particular application requires storage of additional flags. In these cases, encoding an integer may require an additional octet because there will be additional flag bits in the header. For example, encoding the integer “4,954” requires only two or three octets, depending on whether LR8 or LFR8 format is used. For example, a position 201,348,366 on human chromosome 1, represented in binary form, becomes: 1100000000000101010100001110. The binary form of this integer requires at least 28 bits, and thus requires at least four octets in order to be byte-aligned (32 bits), of which half of one octet is unused. When stored using LFR8 format, the header octet that includes the value representation begins with “011”, indicating that three additional octets follow the header octet.

The encoded representation, including the length (L), padded bits (P), and binary value (V) thus becomes:

LLLPVVVV VVVVVVVV VVVVVVVV VVVVVVVV 01101100 00000000 01010101 00001110

If flags are desired, the encoding scheme can also accommodate flags (F), for example, following the length bits. In this case, if we would like to include four flags in the header octet, we would require five octets, as the binary value of the integer is then shifted to the right, as shown in the example below:

LLLFFFFP PPPPVVVV VVVVVVVV VVVVVVVV VVVVVVVV 10010110 00001100 00000000 01010101 00001110

In the exemplary coordinate system described above, the first coordinate component (here, an integer) in a coordinate will typically be much larger than subsequent coordinate components. This is because the subsequent components typically describe relative offsets or branch numbers. (The use of relative offsets or branch numbers helps to reduce the size of the encoded representation.) Thus, the first component in a coordinate may be quite large and require a large number of bits, but subsequent components will be small and require a minimal number of bits. For example, if a coordinate or path describes traversing to offset 201,348,366 on the main branch, taking the second branch from that offset, and then proceeding to offset 154, the coordinate component “154” could be encoded as follows:

LLLPPPPP VVVVVVVV 00100000 10011010

The number “2” (for the second branch) is small enough to fit entirely within the leftover value representation bits of a header octet, and above it has been shown that the number 201,348,366 can be encoded using four octets. Thus, using an LR8 encoding scheme, the entire coordinate (201,348,366.154) is represented using only 7 octets.

Padding (e.g., “0” bits) can be used when there are unused bits in the representation. Of course, these bits can be used to store additional binary information (e.g., up to 5 binary symbols or flags if three representation bits are used). For example, if we need to store 3 flags, we can convert the LR8<3> schema to LFR8<3,3>, e.g.:

LLLFFFPP VVVVVVVV 00101100 10011010

This particular encoding scheme (i.e., one in which three length bits are present in the header) allows for up to seven additional octets to be used. However, if the header octet is entirely occupied by length bits and flag bits (that is, there are no bits left over for value representation), then, in certain embodiments, the encoding scheme imposes a special rule: add at least one extra octet, and the numbers encoded in the length bits hold the number of additional value representation octets to append above and beyond this. So, for example, in an LFR8<3,5> encoding scheme, the representation would include one header octet, and the number of additional value representation octets would range from one to eight, for a total code size (including the header) in the range of two to nine bytes. Accordingly, eight additional representation octets are available in addition to the header octet, allowing for the representation of a value having 2⁶⁴positions—a value far in excess of the number of nucleotides in any presently known genome.

One reason to incorporate this special rule is so that when we encode the number “0”, it does not end up being represented as a “null” value (i.e., just a header octet, with no assigned value representation bits). The reason this is useful is because if the value is changed later on, e.g., from “0” to “1”, then the representation does not need to be resized in order to get the number “1” to fit. For example, imagine several million LFR8/LR8 representations, all assigned to a single contiguous block of memory. If one representation is resized by adding an extra byte, then every representation to the right would presumably have to be moved over by one byte in order to accommodate the change, which can be extremely inefficient.

When encoding, an encoder first determines, based on a coordinate component or value to be encoded, the number of octets required for representation in a particular LFR8 or LR8 format. Exemplary pseudocode is provided below which gives one example.

# Invalid input: negative values should be represented # by setting a flag bit. if value < 0 code_size_in_bytes = −1 else n_repbits_in_header = 8 − (n_lenbits + n_flagbits) # Hypothetically, if we used the value representation # bits in the header to store the least significant # binary digits of the input value, this would be the # number encoded by the remaining binary digits that we # would have to fit inside the variable length value # representation bytes. value_rightshifted = value / (2 {circumflex over ( )} n_repbits_in_header) if value_rightshifted == 0 # Edge case: the only way both “if” clauses are true # is if value == 0, but you should newer represent # the number zero as a null, always reserve some # value representation bits in case we want to # increment later, then we don't need to resize # immediately. if n_repbits_in_header == 0 n_repbytes = 1 # No representation bytes needed, non-zero value # fits entirely in header else n_repbytes = 0 else n_repbytes = ceil(log256(value_rightshifted)) # Code size includes 1 byte for the header, plus any value # representation bytes, if required. code_size_in_bytes = 1 + n_repbytes

As shown in the above pseudocode, first the encoder receives a value (such as an integer or coordinate component) to encode. (In this example, if the value is negative, the encoder outputs an error.) The encoder then determines the number of available representation bits in the header by subtracting the number of length bits and the number of flag bits from the size of the header (i.e., 8 bits). For example, in an LFR8<3,2> representation, the number of available representation bits would be 8−3−2=3 bits. The encoder then calculates a “right shifted” value, which represents the number that would need to be encoded by the remaining binary digits once the least significant binary digits are stored in the (e.g., 3 bits of) the header. If the “right shifted” value is not zero, the encoder determines the number of representation octets required by calculating the logarithm of the right shifted value with base 256, and then identifying the ceiling the smallest integer value greater than or equal to this value) of the result to yield the number of additional octets required for the representation. The resulting code size is thus 1 plus this number (i.e., the header octet plus the additional octets required for the representation). However, if the “right shifted” value is zero, the encoder considers whether there are any available representation bits in the header. If not (meaning that the value itself is zero), then the encoder provides for one additional representation octet. (This is to ensure that if the value itself later changes, additional octets do not need to be added; see the description of the “special rule” above.) However, if it is the case that there are available representation bits in the header, the entire representation is fit into a single octet.

FIG. 17 is a table showing how the proposed encoding scheme can be modified to accommodate both a variety of length bits and flags. The number of bits and flags may be modified as required to suit a particular application. In practice, three length bits are typically all that is needed, as this allows for storing positions far in excess of those needed for any known reference genome. Additionally, there may be many instances in which only a single octet is required for the total representation; for example, using LR8<1> format, 7 data bits would be available, which could be used to encode integers less than 27 (128) in a single octet.

Note in the table of FIG. 17 the special rule previously described: if the number of value representation length bits and number of flags occupies all 8 bits of the header octet, then the value of the length bits refers to the number of additional octets in the encoded representation, giving us a total code length of between 2 and 9 bytes. In contrast, if the number of value representation bits occupies less than 8 bits of the header octet, then the value of the length bits refers to the total number of octets in the encoded representation, including the header, giving us a total code length of between 1 and 8 bytes. (However, in other embodiments, the special rule may not be applied.)

Encoding Graph Coordinates and Paths

Graph coordinates and paths as described above may be efficiently encoded and stored using combinations of LFR8 and LR8 encoding schemes. As previously noted, graph paths and coordinates according to the disclosure comprise a set of integers describing a path (whether relative or absolute) to arrive at a particular position within the graph. For example, a path having coordinate components X₁, X₂, . . . X_kcould be represented as a sequence of LFR8 encoded integers as shown below in Equation (1).

PathRep(X₁,X₂, . . . X_k)=LFR83,2(|X₁|,[ε₁,γ₁]),LFR83,2(|X₂|,[ε₂,γ₂]), . . . LFR83,2(|X_k|,[ε_k,γ_k]) (1)

In Equation (1), ε is a single flag bit representing whether the encoded integer is the last integer in the coordinate, and, γ is a single flag bit representing the sign (whether positive or negative) of the stored integer. If only forward paths (i.e., no negative integers) need to be considered, then the path representation can be simplified as shown in Equation 2.

ForwardPathRep(X₁,X₂, . . . X_k)=LFR83,1(X₁,ε₁),LFR83,1(X₂,ε₂), . . . LFR83,1(X_k,ε_k) (2)

In contrast to Equation (1), Equation (2) requires only a single flag bit E, which indicates whether the stored integer is the last component in the coordinate, and represents the end of the path followed to reach that coordinate. (Delineating the end of the path in this way allows many paths to be stored together in a single contiguous block of memory without ambiguity as to where one path ends and the next one begins.) Accordingly, a decoder reading a sequence of LFR8 encoded integers would understand that a particular integer is the last integer in a coordinate by reading the value for the flag bit E.

As an example, the coordinates 1056.2.5 (representing a traversal to offset 1056 along the base branch, moving to a new branch having identifier “2”, and traversing to offset 5) could be encoded as follows using the forward path representation in Equation (2):

ForwardPathRep(1056.2.5)=LFR83,1(1056,0),LFR83,1(2,0),LFR83,1(5,1).

This would encode the coordinates 1056.2.5 as:

LLLFVVVV VVVVVVVV LLLFPPVV LLLFPVVV 00101100 00100000 00000010 00010101

A decoder configured to decode format, when encountering the first octet above, would read the first three length bits of the first octet and understand there to be one additional representation octet. The entire representation may then be decoded to yield the component “1056”. The decoder would then read the next header octet, which indicates no additional representation octets, and is decoded to yield the component “2”. The next header octet indicates that the encoded integer is the last in the series of integers for the coordinate, and can be decoded to “5”. Thus, the decoded representation yields the coordinates “1056.2.5”. In this way, the decoder does not require any prior knowledge of the size of the representation of each component, but determines the size by analyzing the header octet. Further, additional octets are only used if required, leading to efficient encoding of the component.

Efficient Encoding of Edge Insertions

Systems and methods herein may be used to provide a specification that describes an edge as an insertion to a genomic graph. The specification makes use of coordinates that define the insertion and that specification is compactly encoded using the described LFR8/LR8 variable length encoding, allowing for fast storage, querying, and instantiation times. Since an edge is described as an insertion to a graph using a specification that is compactly encoded, the format of the string of bits describing an edge may be referred to as a compressed insert specification, or CIS format.

Before the variable length encoding, an edge is described as an insertion to a graph using a specification includes at least a position on the graph and information that defines the variation in genetic sequence represented by the edge. In preferred embodiments, an edge can be represented by a tuple that includes the 5′ starting position, 3′ ending position, and the length of the associated genetic sequence:

X₁,X₂, . . . X_m,Y₁,Y₂, . . . Y_n,L),

Wherein each X_iand Y_iare the coordinate components of the 5′ and 3′ coordinates of the inserted edge, respectively. In the case of a SNP, L can be 1, for storing the single nucleotide associated with the SNP. As will be described in detail below, each edge can be efficiently and compactly encoded using LFR8/LR8 encoding, allowing for fast storage, querying, and instantiation times.

Single Coordinate Insertions

In certain cases, the end coordinate of an inserted edge can be inferred from the start coordinate and the type of variation represented by the edge. In these cases, the end coordinate can be omitted from the representation, leading to a more efficient representation of the edge. In particular, edges that affect only a single position in the reference genome (one of the most commonly observed types of genomic variation between of individuals within a population) allow for the 3′ end to be omitted from the edge representation. For these kinds of variations, the edge can be represented in CIS format in the following manner:

InsertionSpecificationRep(X₁,X₂, . . . L)=LFR83,5(X₁,[ε₁,λ],ForwardPathRep(X₂, . . . X_m),LR83(L) (3)

wherein X represents each coordinate component in the stored coordinate, L represents the length of the associated nucleotide sequence, ε₁is a flag that indicates whether the position is the last integer in the coordinate (i.e., it represents the end of the path followed to identify the position in the graph), and X is a vector of flag bits that uniquely describes the kind of insertion. In one embodiment of CIS format, X has four bits, of which one of the bits indicates whether there will be an ending coordinate specified (as will be described in more detail below with respect to FIG. 18, we arbitrarily select the second bit, counting upward from the least significant bit on the right, and assign “0” to mean that an ending coordinate is not specified), while the other three bits uniquely describe the type of variation. (Of course, the particular positions of the bits shown in FIG. 18 can be permuted while still keeping with the spirit of the disclosure.)

FIG. 18 depicts a table showing five different kinds of sequence variations which affect only a single position (e.g., at a particular nucleotide) in the reference genome and can be represented in this manner.

As shown in FIG. 18, a single nucleotide deletion is identified using the vector “0000”. In this case, the 5′ end of the inserted edge will connect with the vertex prior to the specified starting coordinate, and the 3′ end will connect with the vertex after the specified starting coordinate. (This is because the coordinates describe the location of sequence data within the edges of the graph, whereas the ends of the inserted edge must connect at a particular vertex.) An edge representing a single nucleotide polymorphism (0001) or a multiple nucleotide replacement (1011) will reconnect with the graph in a similar manner. However, in the case of either a single (0010) or multiple (0011) nucleotide insertion, both ends of the inserted edge connect with the vertex prior to the specified coordinate. A decoder would simply need to read the vector to determine where to connect the inserted edge in the graph based on the sole coordinate provided.

In this way, only a single coordinate needs to be associated with the edge, and information regarding whether the edge reconnects with the vertex after or prior to the specified edge can be encoded in the vector λ as a set of flag bits (in contrast to storing an entire additional coordinate). As the vast majority of catalogued variations in the human reference genome are small SNPs and insertions and deletions, storing only a single coordinate helps to minimize the amount of storage space required for each insertion.

Two Coordinate Insertions

Other variations may be more complex, and have edges with ending coordinates that differ from their starting coordinates. In these cases, both the starting and ending coordinates must be specified in the representation. However, often edges will have starting (5′) and ending (3′) coordinates that share several coordinate components. For example, a structural variant representing a CT→GAC mutation (e.g., as shown in the table in FIG. 19, described in further detail below) may be represented by a new edge that connects at a vertex before the coordinate 10.2.5, has one edge representing the alternate nucleotides (GAC), and reconnects with the graph at the vertex before coordinate 10.2.7. Thus, the information that needs to be efficiently encoded is the common path prefix (“10.2”), the 5′ edge suffix (“5”), the 3′ edge suffix (“7”), and the length of the associated nucleotide sequence:

X₁,X₂, . . . X_k,X_k+1, . . . X_m,Y₁,Y₂, . . . Y_k,Y_k+1, . . . Y_n,L),

where X₁=Y₁, . . . X_k=Y_kis a common path prefix (“10.2”), X_k+1, . . . X_mis a unique path suffix for the 5′ end (“5”), Y_k+1, . . . Y_nis a unique path suffix for the 3′ end (“7”), and L represents the number of symbols, such as nucleotides, that are associated with the edge. In this case, L is 1, for storing the nucleotide G. In CIS format, this representation can be specified as follows:

(4) InsertionSpecificationRep(X₁. ... X_k.X_k+1. ... X_m, Y₁. ... Y_k. T_k+1. ... Y_n, L) = CommonPathPrefixRep(X₁. ... X_k), End5′PathSuffixRep(X_k+1. ... X_m), End3′RelativePathSuffixRep(X_k+1. Y_k+1. ... Y_n), LR8(3)(L).

As shown above, the first component of the insertion specification representation is the common path prefix representation, followed by the representation of the unique suffixes for the start (5′) and end (3′) coordinates, and finally the representation of the number of symbols or nucleotides in the insertion. The common path between the start and end coordinates can be represented by:

CommonPathPrefixRep((X₁, . . . ,X_k)=LFR83,5(X₁,[ε₁,ϑ₁,α,δ]),LFR83,2(X₂,[ε₂,ϑ₂]), . . . LF83,2(X_k,[ε_k,ϑ_k]) (5)

wherein ε is a flag that represents whether the position is the last integer in the coordinate (“end of path”), and the remaining four flag bits [ϑ₁, α, δ] are assigned in such a way that their combination will never result in one of the five values already reserved in the single coordinate insertions shown in the table of FIG. 18. Within the remaining four bits, ϑ is a flag that represents whether the integer is the first-different integer between the start and end coordinates, δ is a flag that indicates whether the first-different integer for the 5′ end is less than or greater than the 3′ end, and α is a vector (here, a pair) of flag bits defining the type of variation. As shown in the two coordinate table of FIG. 19, the first bit of α is always “1”, to indicate, in contrast to the single coordinate variant types in FIG. 18, that a two coordinate description will follow.

FIG. 19 is a table illustrating two kinds of sequence variations that affect multiple positions in the genome, including a vector α and a corresponding graph depicting the variation. Here, the two variation types are a multiple nucleotide deletion and a multiple nucleotide variation. It should be noted that the first LFR8 representation in Equations (3) and (5) are very similar. The third bit (i.e., the second bit of λ, or first bit of α) can be used to inform a decoder as to which representation is being used. Described in another way, sequence variations can be compactly encoded using an LFR8<3,5> header in which the encoding comprises a 5-bit vector [ε, α, λ]. ε is a one-bit flag that represents whether we are at the end of a path. α is a one-bit flag that represents whether this is a one-coordinate (“0”) or two-coordinate (“1”) specification. λ is a 3-bit flag that follows the following rule. If α is set to “0”, then λ is allowed to take any values that uniquely indicate a variation type (for example, the 1^st, 3^rd, and 4^thbits and corresponding variation types as shown in the table of FIG. 18). However, if α is set to “1”, then λ is further subdivided into 3 one-bit components: ϑ, δ, and φ. Two of those bits (ϑ and δ) will be defined in the same way as described above, whereas the third bit (φ) will specify whether we intend to encode either the first (“0”) or second (“1”) variation types, such as the multiple nucleotide deletion and multiple nucleotide variation shown in the table of FIG. 19.

The second component of the insertion specification representation is the unique suffix representation of the starting coordinate. Occasionally this component may not be necessary. For example, an insertion starting at position “10.2” and ending at position “10.2.5” will have no unique suffix for the start coordinate. If the start coordinate does have a unique suffix, this can simply be represented by the forward path representation in Equation (2) above:

End5′PathSuffixRep(λ_k+1, . . . X_m)=ForwardPathRep(X_k+1, . . . X_m) (6)

The third component is the suffix of the ending (i.e., 3′) coordinate. To further minimize the storage of integers, the 3′ coordinate may be encoded differently, i.e., by having a suffix that is relative to the 5′ coordinate. In other words, the coordinate component of the 3′ coordinate that is aligned with the 5′ coordinate is subtracted from the coordinate component of the 5′ coordinate. This representation can be:

End3′RelativePathSuffixRep(X_k+1,Y_k+1, . . . Y_n)=LFR83,1(|Y_k+1−X_k+1|,ε_k+1),LFR83,1(Y_k+2,ε_k+2), . . . LFR83,1(Y_n,ε_n) (7)

Finally, the number of nucleotides in the new branch can be encoded by LR83(L). It should be noted that in the case of variations representing deletions or single bases, the LR83(L) parameter may be omitted because it can be inferred from the encoded variation type. For example, a decoder reading the representation could be configured to understand that the length parameter is omitted due to the associated vector; e.g., for the first 3 cases shown in the table of FIG. 18, the LR83(L) parameter can be omitted because for those three variant types, by definition, L=1.)

It should also be noted that while FIG. 19 shows edges associated with nucleotide symbols, any kind of symbols may be used. For example, edges could be associated with protein information, such as a sequence of amino acids. Edges may also lack symbols or associations (such as for the single nucleotide deletion example above). Further, while the present disclosure describes edge-based graphs, vertex-based graphs could also be substituted and similarly encoded.

Random Access of CIS Graph Genome Elements

Once a graph has been linearized, it is often advantageous to load or deserialize only, portions or subsets of the graph into memory for use, e.g., by loading only those CIS representations that are needed. In practice this presents some issues. A CIS-encoded graph is typically presented as a serialized stream. CIS efficiently encodes each insertion into a minimal number of bits, but this results in a variable length format, because each insertion will be represented by a variable number of bits. Thus, a decoder is not able to predict the location of a particular offset in an encoded stream simply based on its magnitude. To identify particular insertion representations for deserialization, a decoder must start at the first insertion in the encoded stream and continue until the desired insert is identified. Thus, if a graph has N insertions, it requires on average O(N/2) access time to locate the desired encoded insert for instantiation. When applied to genomic data scales, this access time may be too slow. In these cases, randomly accessing serialized portions of the graph for deserialization can be aided by use of a random-access entry point map.

FIG. 20 illustrates a random-access entry point map. A random-access entry point map is essentially an indexing structure that indicates, for K intervals, the location in memory of particular offsets of encoded CIS insertion representations for a serialized graph. To identify the location of a particular insertion, a decoder consults the random access entry point map for the nearest located insertion. The decoder then continues from that insertion to identify the desired insertion. Thus, use of a random-access entry point map decreases the average time complexity of random access to O(N/K). As the random-access entry point map for each CIS stream is unique, each encoded CIS stream representing a particular graph may be accompanied with its own random-access entry point map.

Typically, the value chosen for K will be much smaller than N, as otherwise the random access entry point map itself may begin to take up a significant amount of memory. K can be adjusted to a variety of settings. In practice, suitable values for K are those in which O(N/K) is between 100 and 1,000. This presents a good compromise in random-access entry point map size and CIS lookup speed. For example, if an encoded CIS stream has 10⁶insertions, a random-access entry point map having 1000 entries will result in a distance of 1000 insertions between each entry. A decoder thus would only need to traverse, on average, about 500 entries before locating a desired insertion.

While this disclosure refers to a random-access entry point map, various other data structures could be used to more efficiently locate a particular (IS insert in an encoded stream. For example, a tree-based structure having multiple levels of random access could be used. The tree could be divided into several levels in which each branch is traversed until the desired level indicating the location in memory of a particular CIS insert is reached.

One application of the random-access entry point map is to generate specialized graphs from a larger graph specification. For example, one may wish to build a graph that includes only variations from certain databases, such as dbSNP. Alternately, one may wish to build a graph appropriate for particular genetic populations. In these cases, one could leverage the random-access entry point map to quickly identify and deserialize only those desired insertions into the graph. In such examples, the backbone branch would be loaded first, followed by the selected insertions. Note also that the edges should be inserted in an order in which they do not depend on other edges, which may not have been inserted. Care should be taken during graph construction in this regard; for example, preferably dbSNP insertions should not depend on (i.e., branch from) insertions from other variation databases, and similarly insertions representative of one genetic population should not depend on those from another. However, various embodiments are considered to be within the scope of the disclosure.

It should also be noted that the random-access entry point map discloses the location in memory of CIS insertions, which may not correspond to actual genome position. For example, the map may be able to inform a decoder as to the location of the 100,000^thinsertion, but is not necessarily useful in finding insertions located near a particular position (e.g., at the middle of chromosome 17). In these cases, a random-access entry point map may also contain additional information specifying the relative genome position of certain insertions. Alternately, a second table or map storing this information may be used.

Integration with Existing Representations of Variant Data

As noted above, a graph genome can be constructed from a reference sequence by, continuously adding new edges representing variants. Each edge can be defined in Compressed Insertion Specification (CIS) format. Converting existing variant information to CIS format can be relatively straightforward. For example, Variant Call Format (VCF) is an existing file format that describes variations from a reference sequence as an alternate base at a particular position on a reference genome. A VCF->CIS conversion program could iterate through each entry of a VCF file and create a new CIS entry for that variant describing an insertion on a graph. In this way, existing VCF data and data represented in other formats can be converted directly to CIS format.

Advantages of CIS

The proposed specification efficiently stores graphs as a plurality of efficiently encoded edges. In particular, the proposed specification minimizes the number of edges that need to be stored by constructing a graph by adding edges individually, thus segmenting previously inserted edges into multiple edges. Further, when combined with the LFR8/LR8 encoding scheme, the specification requires only a minimal amount of storage space, leading to greatly compressed structures. For example, a single SNP in the human genome (the most common type of variant) requires only 40 bits to store. Further, use of a random-access entry point map or other indexing structure allows for fast random data access by allowing for particular inserts or components of graphs to be quickly identified and loaded into memory when needed.

Representative System

FIG. 21 diagrams a system 401 that may perform methods disclosed herein. The system 401 includes at least one processor coupled to a tangible memory subsystem 475 containing instructions. The at least one processor may be provided by any one or more of a computer 433, a server 409, or a dedicated lab computer such as a sequence computer 451 operably coupled to a nucleic acid sequencing instrument 455. Any of the computer 433, the server 409, the sequence computer 451, or the sequencing instrument 455 may be operably coupled to one another via a communications network 415. The tangible memory subsystem 475 contains instructions that when executed by the processor cause the system to assign a coordinate that defines a path from a first position to a second position within the genomic graph and encode the coordinate using a variable length encoding scheme where the encoded coordinate includes information about the length of the data representation.

Processor refers to any device or system of devices that performs processing operations. A processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU). A processor may be provided by a chip from Intel or AMD, A processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).

The memory subsystem 475 contains one or any combination of memory devices. A memory device is a mechanical device that stores data or instructions in a machine-readable format. Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein. Preferably, each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.

Using the described components, the system 401 is operable to, cause the system to assign a coordinate that defines a path from a first position to a second position within the genomic graph, and encode the coordinate using a variable length encoding scheme where the encoded coordinate includes length information. The system may receive or obtain a genomic graph representing genomic sequences from one or more organism through the use of one or more input/output device, labelled “I/O” in FIG. 21. An input/output device is a mechanism or system for transferring data into or out of a computer. Exemplary input/output devices include a video display unit p., a liquid crystal display (LCD) or a cathode ray tube (CRT)), a printer, an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a speaker, a touchscreen, an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device, which can be, for example, a network interface card (NIC), Wi-Fi card, or cellular modem.

In certain embodiments, the reference graph 100 is stored in the memory subsystem 475 using adjacency lists or index-free adjacency, which may include pointers to identify a physical location in the memory subsystem 475 where each vertex is stored. In a preferred embodiment, the graph is stored in the memory subsystem 475 using adjacency lists. In some embodiments, there is an adjacency list for each vertex. For discussion of implementations see ‘Chapter 4, Graphs’ at pages 515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson Education, Inc., Upper Saddle River N.J., 955 pages, the contents of which are incorporated by reference and within which pages 524-527 illustrate adjacency lists.

Sequencing Technologies

In use, sequence reads are aligned to edges or paths through a reference graph, and then genotyping or variant calling can be performed by counting the number of reads across each path or by identifying mismatches across each path. Because the sequence reads are aligned to the reference graph which includes known variation, the subsequent step of identifying mutations by, comparing to a table of known mutations can be eliminated. Further, alignment to reference graphs results in greatly improved results for larger structural variants, as well as indels and SNPs located near structural variants. Additionally, in some cases, there may be no need to perform an additional variant calling step because an ideal reference graph will not have any mismatches from the sequence read (of course, variant calling can still be performed to detect unknown variations).

Sequence reads may be obtained in a variety of ways. FIG. 22 illustrates obtaining 2201 a set of sequence reads 2204 from a sample 2202. As a preliminary step (not depicted), nucleic acid may be isolated and/or enriched.

In certain embodiments, sequence reads are obtained by performing sequencing on the sample 2202 from a subject (however in some embodiments, sequence reads are obtained when a read file is transferred into a system of the invention). Sequencing may be by any method and sequencing instrument known in the art. See, generally, Quail, et al., 2012, a tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences, and Illumina MiSeq sequencers, BMC Genomics 13:341. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing.

A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems and sequencing instruments sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, each incorporated by reference. 454 sequencing involves two steps. First, DNA is sheared into blunt-end fragments attached to capture beads and then amplified in droplets. Second, pyrosequencing is performed on each DNA fragment in parallel. Addition of nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument.

Another sequencing technique and instrument that can be used is SOLID technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to generate a fragment library. Clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and enriched and the sequence is determined by a process that includes sequential hybridization and ligation of fluorescently labeled oligonucleotides.

Another example of a DNA sequencing technique and instrument that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pubs. 2009/0026082, 2009/0127589, 2010/0035252, 2010/0137143, 2010/0188073, 2010/0197507, 2010/0282617, 2010/0300559, 2010/0300895, 2010/0301398, and 2010/0304982, each incorporated by reference. DNA is fragmented and given amplification and sequencing adapter oligos. The fragments can be attached to a surface. The addition of one or more nucleotides releases a proton (H+), which signal is detected and recorded in a sequencing instrument.

Another example of a sequencing technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented and attached to the surface of flow cell channels. Four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. Sequencing according to this technology is described in U.S. Pub. 2011/0009278, U.S. Pub. 2007/0114362, U.S. Pub. 2006/0024681, U.S. Pub, 2006/0292611, U.S. Pat. No. 7,960,120, U.S. Pat. No. 7,835,871, U.S. Pat. No. 7,232,656, U.S. Pat. No. 7,598,035, U.S. Pat. No. 6,306,597, U.S. Pat. No. 6,210,891, U.S. Pat. No. 6,828,100, U.S. Pat. No. 6,833,246, and U.S. Pat. No. 6,911,345, each incorporated by reference.

Other examples of a sequencing technology that can be used include the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.) and nanopore sequencing as described in Soni and Meller, 2007 Clin Chem 53:1996-2001.

As shown in FIG. 22, sequencing generates a set of sequence reads 2204. Reads according to the invention generally include sequences of nucleotide data anywhere from tens to thousands of bases in length. A set of sequence reads 904 will typically be provided in a suitable format such as, for example, a FASTA or FASTQ file. FASTA is originally a computer program for searching sequence databases and the name FASTA has come to also refer to a standard file format. See Pearson & Lipman, 1988, Improved tools for biological sequence comparison, PNAS 85:2444-2448, A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. FASTQ files are similar to PASTA but further include a line of quality scores. Typically, sequence reads will be obtained in a format such as FASTA, FASTQ, or similar.

In some embodiments, the sequence reads 904 are assembled to provide a contig or consensus sequence, which contig or consensus sequence may be used in finding alignments to a reference graph. Sequence assembly may include any suitable methods known in the art including de novo assembly, reference-guided assembly, others, or combinations thereof. In a preferred embodiment, sequence reads are assembled using graph-based alignment methods. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. Embodiments of a graph and its use are discussed in greater detail below. The result of assembly is a sequence representing the corresponding portion of the original nucleic acid. The contig or consensus sequence or one or more of individual sequence reads 2204 may be mapped to a reference graph to find an alignment with an optimal score.

In certain embodiments, each sequence read 2204 is mapped to a reference graph and an alignment is found. As previously noted, alignments of sequence reads to a reference graph may be used as evidence to identify the likelihood of the possible diploid genotypes from an interval in a reference graph.

Exemplary Encoded and Decoded Coordinates

FIG. 23 illustrates a report 1002 of encoded coordinates 1004 that may be generated by the system 401. The report 1002 may be generated by following the processes shown in FIGS. 9 and 13, for example. As shown in the report 1002, a sequence of encoded LFR8 and/or LR8 coordinate components (e.g., the encoded coordinates 1004) may be stored adjacent to one another, such that a decoder executing on the system 401 can receive each octet sequentially and determine the number of octets used for each representation, in particular, the report 1002 depicts the coordinate “201348366.2.154” encoded as a sequence of LFR8<3,3> encoded integers. Of course, in certain examples, various formats of LFR8 and LR8 integers may be used in different combinations.

FIG. 24 illustrates a report 1102 that may be generated by the system 401 which provides decoded coordinates 1104. The report 1102 may be generated by decoding the encoded coordinates 1004 from the report 1002 of FIG. 23, for example. A decoder can then decode the encoded coordinates 1004 by first receiving an octet, such as the first octet in the report 1002 of FIG. 23. The decoder then analyzes the first octet to determine the total number of octets in the current representation. The decoder then reads the total number of octets and decodes the encoded component accordingly. In some embodiments, the decoder could also determine whether the encoded component is a last component of the coordinate, e.g., based on additional or auxiliary binary information embedded within the first octet or elsewhere in the representation. As shown, the report 1102 indicates the result of decoding the LFR8<3,3> integers from the report 1002 of FIG. 23, i.e., “201348366.2.154”. In certain embodiments, the report 1102 can also include a summary of the analysis resulting from such a comparison. The proposed variable length encoding schemes described herein compactly and efficiently represent a plurality of values whose frequency may reasonably be expected to follow a descending probability distribution. For example, graph coordinates will typically involve a large first component, representing a traversal to a random position in the genome (e.g., 201348366). This component can be represented using 5 octets (e.g. as shown in 11-12). However, the remaining components “2” and “154” may be encoded using only 1 and 2 octets, respectively. Similarly, paths in which the second coordinate is relative to the first coordinate will typically have smaller components that can be represented using a minimal number of bytes. Accordingly, a few values will require a large number of bytes to represent; however, a large number of values will only require a few bytes. Therefore, in common with other variable length encoding schemes, the LFR8/LR8 standard achieves overall compactness of representation by assigning short code words to the most frequently used symbols or values, and longer code words to the less frequent ones.

Exemplary Serialized Genomic Graphs

Genomic graphs, such as the reference graphs that may find use in sequence assembly, may be serialized, e.g. by the method 101 of FIG. 1. Specifically, all of the edges of a genomic graph may be stored so that the entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing coordinates.

FIG. 25 illustrates an example of a serialized genomic graph 1202 produced by an embodiment of the disclosure. The serialized genomic graph 1202 comprises a sequence of bits 1204 made up of each individual strings of bits that were created 113 as edges of the graph were converted 109 into the strings of bits. Thus, the file 1201 illustrates the storage of a plurality of edges from the genomic graph, each as its own string of bits. More specifically, the file 1201 may be taken to illustrate the storage of all of the edges from the genomic graph so that the entire genomic graph is stored as a sequence of strings of bits 1204, each string of bits comprising a first portion representing a length of the string and a second portion representing the position of an edge representing the string within the genomic graph. Thus the file 1201 includes a serialized genome graph 1202. This was produced by performing the specifying and storing steps to convert the genomic graph into a linear string of bits comprising a representation of each of the plurality of edges.

INCORPORATION BY REFERENCE

References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.

EQUIVALENTS

Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification, and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.

Claims

1. A method of encoding a genomic graph representing variations in genetic sequence, the method comprising:

providing a genomic graph comprising nodes connected by edges, wherein the nodes and edges are stored as objects in non-transitory memory of a computer system and wherein an edge within the genomic graph represents a genomic sequence;

selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph;

encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the branch on the genomic graph.

2. The method of claim 1, wherein selecting at least one edge comprises selecting edges having an associated identifier.

3. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises traversing the genomic graph and identifying paths beginning with nodes having multiple outgoing edges, ending at nodes having multiple incoming edges, and having at least one directed edge specifying a linear path between.

4. The method of claim 3, wherein selecting at least one edge from the genomic graph comprises selecting the at least one directed edge specifying a linear path between a node having multiple outgoing edges and a node having multiple incoming edges.

5. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises determining an ordering of the edges in the genomic graph.

6. The method of claim 5, wherein determining an ordering comprises identifying a longest path in the graph from a source vertex to a sink vertex.

7. The method of claim 1, wherein the position of the at least one edge on the genomic graph further comprises a starting position and an ending position; and further wherein encoding the branch as a string of bits comprises encoding at least a portion of the starting position and a least a portion of the ending position of the at least one edge on the genomic graph.

8. The method of claim 1, wherein the starting position and the ending position share a common prefix, and encoding the branch as a string of bits further comprises including third component that includes the common prefix.

9. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the length of the associated genomic sequence.

10. The method of claim 1, wherein encoding the at least one edge as a string of bits further comprises encoding the associated genomic sequence.

11. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the type of variation represented by the genomic sequence.

12. The method of claim 11, wherein the type of variation is encoded as a vector.

13. The method of claim 11, wherein the type of variation is a deletion of a single nucleotide, a single nucleotide polymorphism, a single nucleotide insertion, a multiple nucleotide insertion, or a replacement of one nucleotide by multiple nucleotides.

14. The method of claim 13, wherein encoding the at least one edge as a string of bits comprises encoding only a start position of the at least one edge on the genomic graph.

15. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the associated genomic sequence.

16. The method of claim 1, further comprising removing the at least one edge from the graph.

17. The method of claim 16, further comprising selecting and encoding at least one second edge from the genomic graph.

18. The method of claim 1, wherein the first component includes length bits indicating a length of the first component and representation bits that include the portion of the starting coordinate.

19. The method of claim 18, wherein each component is stored as a sub-string of the string of bits wherein a first portion of the sub-string represents a length of the sub-string and a second portion of the sub-string includes stored information.

20. The method of claim 19, further comprising performing the selection and encoding steps to convert the genomic graph into a linear string of bits comprising a representation of each of the plurality of edges.

21. The method of claim 1, further comprising storing a plurality of edges from the genomic graph, each as its own string of bits.

22. The method of claim 1, further comprising storing all of the edges from the genomic graph so that the entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing the position.

23. A method of decoding a genomic graph from a serialized specification, the method comprising:

receiving a representation of a genomic sequence;

associating the genetic sequence with a first edge of a genomic graph;

receiving a representation of a variation of the genetic sequence, the representation further comprising a position on the genomic graph of the variation;

associating the representation of the variation with a second edge; and

inserting the second edge into the genomic graph at the position.

24. The method of claim 23, further comprising segmenting the first edge at the position of the variation.