METHOD AND SYSTEMS FOR THE REPRESENTATION AND PROCESSING OF BIOINFORMATICS DATA USING REFERENCE SEQUENCES

Info

Publication number: 20190214111
Type: Application
Filed: Jul 11, 2017
Publication Date: Jul 11, 2019
Inventors: Claudio Alberti (Geneva), Giorgio Zoia (Lausanne), Daniele Renzi (Lausanne), Mohamed Khoso Baluch (Chantilly, VA)
Application Number: 16/337,639

Abstract

Method and apparatus for the representation and processing of genome sequence data, produced by genome sequencing machines, when aligned on one or more reference sequences. Sequence reads are coded by aligning them with respect to pre-existing or constructed reference sequences. After the alignment, the coding process is composed of a classification of the reads into data classes, followed by the coding of each data class in terms of a multiplicity of descriptors layers. Specific source models and entropy coders are used for the coding of the sub-sets of descriptors used to represent each data class.

Description

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and the benefit of Patent Applications PCT/EP2016/074311, PCT/EP2016/074301, PCT/EP2016/074307, PCT/EP2016/074297, PCT/US2017/17842, PCT/US2017/17841.

TECHNICAL FIELD

This disclosure provides a novel method of representation for genome sequencing data aligned with respect to one or more reference sequences having multiple alignment coordinates or alignments which require the fragmentation of reads into smaller segments (also known as “spliced reads”). The disclosed representation reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.

BACKGROUND

An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing genome sequencing data and metadata.

Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.

Hospitals, genomic analysis providers, bioinformatics and large biological data storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which could enable scaling genomic medicine to a world-wide scale. Since one of the bottleneck in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated.

The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).

Another important limitation of SAM is the lack of appropriate support of the representation of multiple alignments (also known as multiple mappings) associated to genomic sequence reads or read pairs. Genomic sequence reads alignment is a process consisting in reconstructing the genomic information of a sequenced sample from the sequence reads generated by Next Generation Sequencing technology. The reconstruction can be performed either without any prior knowledge of the originating genome, or using a pre-existing genome as a reference. The latter approach is known in the field as “reference-based alignment”. In reference-based alignment the genomic sequence reads generated from a sequenced sample are compared to a pre-existing reference sequence to find the region of the reference sequence that presents the smallest number of differences (if any) with respect to the sequence read. This process is called “aligning” or “mapping” the sequence reads against a reference sequence.

Due to the repetitive nature of some genomic regions, sequence reads can be aligned to several positions with the same degree of accuracy. For example the same sequence read can perfectly (i.e. without any mismatch) match two or more segments of the same length on the reference sequence. In this case the two or more alignments are considered to be equivalent and the sequence read is said to have “multiple alignments”. This case is illustrated in FIG. 15. In other situations the different alignments can have different degrees of accuracy, for example one alignment can have no mismatches (perfect match) and another can have one or more mismatches. In this case a scoring system is used to rank the multiple alignments.

In some cases, in order to find alignment position that meet pre-established matching criteria such as the maximum number of mismatches, sequence reads have to be split into two or more sub-segments. In this case the reads are called “spliced reads” and each sub-segment is called a “splice”. This case is illustrated in FIG. 16.

The current specification of SAM does not support the representation of multiple alignments and splices using the eleven mandatory fields, but requires the use of optional fields that are used in different and thus inefficient ways by different implementations of tools used for sequence reads alignment. The invention described in this document provides a solution to the problem of both representing multiple alignments and spliced reads and preserving compression and accessibility efficiency.

A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM in terms of compression is CRAM. CRAM provides more efficient compression for the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data and appropriate representation of multiple alignments and spliced reads.

These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operation or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped reads by coding all the elements necessary to reconstruct it.

CRAM presents the following drawbacks and limitations that are solved and removed by the invention described in this document:

1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.

2. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.

3. CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched.

Conversely, in the present invention there is the notion of data classes coded separately in separated information layers and there is no notion of record encapsulating each read. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.

4. In a CRAM record each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding.

Instead in the present invention, there is no notion of flag denoting data because this is intrinsically defined by the information “layer” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different “layers” enables the encoder to reuse the same symbol across each layer with different meanings according to the context. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.

5. In CRAM substitutions, insertions and deletions are represented by using different syntax elements, option that increases the size of the information source alphabet and yields a higher source entropy. Conversely, the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model which coding yields bitstreams characterized by high compression performance.

6. CRAM lacks the support of an appropriate representation of multiple alignments and spliced reads for both single reads and paired-end reads.

Once mapped at one or more positions on a reference sequence, genomic sequence reads can either perfectly match the reference sequence segment they are mapped to or present some mismatches. Mismatches can be of type:

- substitution: a single nucleotide in the mapped read is different from the corresponding nucleotide on the reference sequence
- insertion: the mapped read contains nucleotides that are not present in the reference sequence. After the alignment process they result “inserted” between mapped nucleotides
- deletions: one or more nucleotides present in the reference sequence are not present at the corresponding positions in the mapped sequence read
- soft clips: one or more nucleotides at the edges of the sequence read or read pair do not map on the reference sequence but are preserved by the alignment tool.
- hard clips: one or more nucleotides at the edges of the sequence read or read pair do not map on the reference sequence and are not preserved by the alignment tool as part of the mapped read. In this case the mapped read is shorter than the unmapped one.

The present invention aims at compressing genomic sequences by classifying data according to the result of the sequence reads alignment process and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.

When genomic data are classified according to the result of the alignment process, the representation in terms of syntax elements disclosed in this invention enables more efficient entropy coding, selective access to data and incremental updates. Enhanced compression is due to the fact that the data are partitioned into independent data blocks with homogeneous statistical properties. When data are structured in such blocks which can be independently decompressed, selective access requires less computational power and bandwidth and incremental update is possible with the addition of new encoded data blocks without the need of re-encoding the entire dataset.

One of the aspects of the presented approach is the definition of classes of data and metadata structured in different layers and encoded separately. The more relevant improvements of such approach with respect to existing methods consist in:

1. the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient source model for each class of data or metadata;
2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;
3. the possibility to incrementally (i.e. without the need of decoding and re-encoding) update compressed data and metadata with new sequencing data and/or metadata and/or new analysis results associated to specific sets of sequencing reads.

The representation of genomic sequence data disclosed in this invention relies on the concept of “descriptors”. A descriptor is defined as an element of the syntax used to represent genomic sequence data to be compressed using entropy coders. The representation of the original genomic sequence data by means of different descriptors enables more efficient compression and enhanced selective access to data. More efficient compression is achieved by employing different entropy coders per each type of descriptor or sub-sets of descriptors sharing the same statistical properties. More efficient selective access is enabled by the definition of descriptors that enable partitioning the genomic information in sub-set of data according to the different biological meaning. Since each sub-set of data can be decoded independently of the other data, the required processing power is reduced and the decoding time is shorter.

The descriptors defined in this invention disclosure are structured in a multiplicity of “Descriptor Streams” which are blocks of compressed syntax element of the same type.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the coordinate system on a Reference Sequence and mapping of reads and read pairs on a Reference Sequence.

FIG. 2 illustrates how the Genomic Record Length is defined as the number of genomic positions on a reference sequence separating the left-most mapped base from the right-most mapped base of a read or read pair. In case of read pairs this is the number of genomic position on the reference sequence separating the left-most base of read 1 from the right-most base of its mate read 2 when both reads are mapped on the same reference sequence.

FIG. 3 shows an example of calculation of Genomic Record Length in case of paired reads.

FIG. 4 shows how aligned reads or read pairs can span two or more Access Units. This happens when the respective genomic record length is larger than the distance of the read or read pair mapping position from the end of the AU containing the records. A consistent selective access algorithm shall be able to retrieve all bases overlapping the specified genomic region even if the read is encoded in an Access Units not covering the specified genomic region.

FIG. 5 shows how the position of the first read of three read pairs mapped are encoded in the “pos” descriptor.

FIG. 6 shows an example of read pair where read 1 has origin from strand 1 and read 2 from strand 2.

FIG. 7 shows how the reverse complement of read 2 is coded if strand 1 is used as reference.

FIG. 8 shows the four possible combinations of reads composing a reads pair and the respective coding in the “rcomp” descriptor.

FIG. 9 shows how to calculate the pairing distance for three read pairs.

FIG. 10 shows the descriptors used to encode a pair of constant length reads perfectly mapping on the reference sequence (Class P)

FIG. 11 shows the descriptors used to encode a read of variable length perfectly mapping on the reference sequence (Class P)

FIG. 12 shows the descriptors used to encode a pair of constant length reads with unknown bases with respect to the reference sequence (Class N)

FIG. 13 shows the descriptors used to encode a pair of constant length reads mapping on the reference sequence with at least one substitution (Class M)

FIG. 14 shows the descriptors used to encode a pair of constant length reads mapping on the reference sequence with at least one insertion, deletion or soft clipped base (Class I)

FIG. 15 shows a read pair with N alignments for the left-most read and M alignments for the right-most read.

FIG. 16 illustrates how the mapping of a read pair according to given constraints required to split the left-most read (Read 1) in two splices and the right-most read (Read 2) in four.

FIG. 17 shows multiple alignments without splicing. For each read or read pair N is the first value of the mmap descriptor and signals the number of alignments of the first read. In case of paired-end reads, the following N values of the mmap descriptor are used to calculate P which is the number of alignments of the second read.

FIG. 18 shows how to use the pos, pair and mmap descriptors to encode multiple alignments without splices.

FIG. 19 shows an example of multiple alignments with splices. N represents the number of splices of the first read and is encoded as first value of the mmap descriptor. P represents the number of splices of the second read and is calculated using the next N values of the mmap descriptor. N1 and N2 represent the number of alignments of the first and second read and are calculated using the N+P values of the splen descriptor.

FIG. 20 shows how to use the pos, pair, mmap and splen descriptors to represent multiple alignments with splices.

FIG. 21 shows the two coding modes which can be used to write coded genomic information on a storage medium according to this invention disclosure. When the Access Units Contiguous (AUC) mode is adopted, each Access Unit is stored on a contiguous area of the storage medium. When the Descriptor Stream Contiguous (DSC) mode is adopted, all descriptors of the same type are stored as a single block on a contiguous area of the storage medium. As a consequence each AU is “distributed” among different segments of the storage medium.

FIG. 22 shows how Access Units are composed by blocks of encoded descriptors. To enable transport over a network, blocks are further decomposed in packets.

FIG. 23 shows that an Access Unit is composed by a header and the multiplexing of data blocks of different streams (one per descriptor). Several packets of the same type are encapsulated in a block and a multiplicity of blocks are multiplexed in one Access Unit.

FIG. 24 shows how an “internal” reference can be constructed by clustering reads and assembling the segments taken from each cluster. Each cluster contributes to one segment of the constructed reference.

FIG. 25 shows that a strategy of constructing a reference consists in storing the most recent reads once a specific sorting (e.g. lexicographic order) has been applied to the reads.

FIG. 26 shows that a read belonging to the class of unmapped reads (Class U) can be coded using six descriptors stored or carried in the corresponding streams.

FIG. 27 shows an alternative coding of reads of Class U where a signed pos descriptor is used to code the mapping position of a read on the constructed reference.

FIG. 28 shows the architecture of a genomic encoder implementing this invention disclosure

FIG. 29 shows the architecture of a genomic decoder implementing this invention disclosure

FIG. 30 shows a read pair with three alignments for read 1 and four alignments for read 2. All alignments are on the same chromosome (e.g. Chr1) while the third alignment for read 1 is on ChrX. The first value of the pair descriptor is used to signal that one alignment is present on a different reference sequence than the others.

FIG. 31 shows an example of use of the mmsc descriptor to represent one secondary alignment not preserving the mapping contiguity of the primary alignment

FIG. 32 shows how half mapped read pairs (Class HM) can be used to fill unknown regions of a reference sequence by assembling longer sequences (“also known as “contigs”) with unmapped reads.

SUMMARY

The features of the claims below solve the problem of existing prior art solutions by providing:

a method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
aligning said reads to one or more reference sequences thereby creating aligned reads,
classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads,
encoding said classified aligned reads as a multiplicity of streams of syntax elements,
wherein encoding said classified aligned reads as a multiplicity of streams of syntax elements comprises selecting said syntax elements according to said classes of aligned reads,
providing said streams of syntax elements with header information thereby creating successive data blocks in order to entropy code said genomic data blocks into separately accessible data units.

In another aspect the encoding method further comprises:

classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads encoding said classified unmapped reads as a multiplicity of streams of syntax elements,
providing said streams of syntax elements and said encoded reference sequences with header information thereby creating successive Access Units.

In another aspect the encoding method further comprises that said classifying comprises identifying genomic reads having multiple alignment positions on the reference sequences used for alignment.

In another aspect the encoding method further comprises that said classifying comprises identifying genomic reads which need to be split into multiple segments named splices to satisfy the matching rules for alignment.

In another aspect the encoding method further comprises that the reads of the genomic sequence to be encoded are paired.

In another aspect the encoding method further comprises the steps of:

identifying the number of alignments of a read and representing this number with a specific syntax element,
identifying per each alignment the corresponding mapping position and representing each mapping position with a specific syntax element,
assigning an alignment score to each alignment to identify primary and secondary alignments,
identifying as primary alignment the alignment with the highest score,
identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor,
identifying if any alignment does not preserve the a different contiguity on the reference sequence of the primary alignment and representing this information using a specific syntax element.

In another aspect the encoding method further comprises the steps of:

identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according to pre-defined matching rules defining the matching with said one or more reference sequences,
signaling the presence of spliced reads using a global configuration parameter,
representing the number of splices using a specific syntax element,
representing the length of each splice using a specific syntax element,

In another aspect the encoding method further comprises the steps of:

identifying the number of alignments of each read in a pair and representing this number with a specific syntax element,
identifying per each alignment of the left-most read in the pair the corresponding mapping position and representing each mapping position with a specific syntax element,
identifying per each alignment of the left-most read the associated alignments of the right-most read in the pair and representing the association with a specific syntax element,
assigning an alignment score to each pair of alignments to identify primary and secondary alignments,
identifying as primary alignment the pair of alignments with the highest score,
identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor,
identifying if any alignment presents a different contiguity on the reference sequence than the primary alignment and representing this information using a specific syntax element.

In another aspect the encoding method further comprises the steps of:

identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according pre-defined matching rules,
signaling the presence of spliced reads using a global configuration parameter,
representing the number of splices of the left-most read in a pair using a specific syntax element,
representing the number of splices of the right-most read associated to each alignment of the left-most read with a vector of specific syntax elements,
representing the length of each splice using a specific syntax element.

In another aspect the encoding method further comprises that:

said streams of syntax elements comprise a genomic dataset header comprising,
a dataset group identifier used to uniquely identify each dataset group,
a genomic dataset identifier used to uniquely identify each dataset,
a brand identifier used to identify the data format specification the dataset complies with,
a minor version number used to identify the data format specification the dataset complies with,
the length of encoded genomic reads in nucleotides used to signal constant length reads,
a flag signaling the presence of paired end reads,
a flag signaling the presence of blocks headers,
a flag signaling the mode in which Access Units are stored on a storage medium in order to facilitate data access when decoding said Access Units,
the type of alphabet used to encode mismatches of sequence reads with respect to the reference sequences,
the number of reference sequences used to code the dataset,
a numeric identifier per each reference sequence used to uniquely identify each reference sequence,
a string identifier per each reference sequence used to uniquely identify each reference sequence,
the number of coded Access Units per reference sequence used to count the Access Units associated to each reference sequence,
the type of coded genomic data used to distinguish among aligned reads, unaligned reads,
unmapped reads and reference sequences,
the number of data classes coded in the dataset,
the number of descriptors used per each data class coded in the dataset used during the decoding process,
the total number of clusters used to index encoded unmapped reads,
the number of bits used to represent the integer values used to encode the cluster signatures used to decode encoded clusters signatures,
a flag signaling if all the cluster signatures have the same length in terms of number of nucleotides the length of the clusters signatures.

In another aspect the encoding method further comprises that:

said streams of syntax elements comprise a master index table, containing one section for each Class and sub-class of aligned reads, said section comprising,
the mapping positions on said one or more reference sequences of the primary alignment of the left-most read of each Access Units of each Class or sub-class of data,
the position on said one or more reference sequences of the right-most mapped base among all primary alignments of each Access Units of each Class or sub-class of data,
the offset in bytes of each block of coded syntax elements composing each Access Unit,

In another aspect the encoding method further comprises that said Master Index Table further contains the size of each coded descriptors stream.

In another aspect the encoding method further comprises that said Master Index Table further contains the size of each Access Unit.

In another aspect the encoding method further comprises that genomic reads have multiple alignments and said Master Index Table contains:

the mapping positions on said one or more reference sequences of the left-most alignment among all reads of each Access Units of each Class or sub-class of data,
the position on said one or more reference sequences of the right-most mapped base among all alignments of each Access Units of each Class or sub-class of data,
the offset in bytes of each block of coded syntax elements composing each Access Unit.

In another aspect the encoding method further comprises that said Access Units contain coded read pairs.

In another aspect the encoding method further comprises that said Master Index Table is jointly coded with said Access Unit data.

In another aspect the encoding method further comprises that said Genomic Dataset Header is jointly coded with said Access Unit data.

In another aspect the encoding method further comprises that said streams of syntax elements further comprise information related to the type of reference used (pre-existing or constructed) and the segments of the read that do not match on the reference sequence.

In another aspect the encoding method further comprises that said classified aligned reads as multiplicity of streams of syntax elements comprises the step of associating a specific source model and a specific entropy coder to each descriptor stream.

In another aspect the encoding method further comprises that said entropy coder is one of a context adaptive arithmetic coder, a variable length coder or a golomb coder.

A method for decoding encoded genomic data comprising the steps of:

parsing Access Units containing said encoded genomic data to extract multiple streams of syntax elements by employing header information,
decoding said multiplicity of streams of syntax elements to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.

In another aspect the decoding method further comprises decoding unmapped genomic reads.

In another aspect the decoding method further comprises decoding a genomic dataset header containing global configuration parameters.

In another aspect the decoding method further comprises decoding a master index table containing one section for each class of reads and the associated relevant mapping positions and coded blocks offsets.

In another aspect the decoding method further comprises decoding information related to the type of reference used: pre-existing, transformed or constructed.

In another aspect the decoding method further comprises that said genomic reads are paired.

In another aspect the decoding method further comprises that said genomic data are entropy decoded.

In another aspect the decoding method further comprises the decoding of multiple alignments information comprising the steps of:

decoding the number of alignments of each read,
decoding the position of each alignment,
identifying the primary alignment by decoding the score associated to each alignment,
identifying if any secondary alignment has a different contiguity with respect to the reference sequence than the primary alignment by decoding the corresponding syntax elements.

In another aspect the decoding method further comprises the steps of:

Identifying if any coded read is split into two or more splices,
decoding the length of each splice,
decoding the mapping position of each splice.

In another aspect the decoding method further comprises that the coded genomic read are paired end reads and the steps of:

decoding the number of alignments of the right-most read associated with each alignment of the left-most read,
decoding the pairing information associating each alignment of the left-most read with one or more alignment of the right-most read.

In another aspect the decoding method further comprises that the coded genomic read are split into two or more splices and the steps of:

decoding the length of each coded splice,
decoding the mapping position of each splice.

The present invention further provides a genomic encoder (2810) for the compression of genome sequence data 289, said genome sequence data 289 comprising reads of sequences of nucleotides, said genomic encoder (2810) comprising:

an aligner unit (281), configured to align said reads to one or more reference sequences thereby creating aligned reads,
a constructed-reference generator unit (282), configured to produce constructed reference sequences,
a data classification unit (284), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (288),
one or more descriptors stream encoding units (285-287), configured to encode said classified aligned reads as streams of syntax elements by selecting said syntax elements according to said classes of aligned reads,
a multiplexer (2816) for multiplexing the compressed genomic data and metadata.

In another aspect the genomic encoder further comprises:

A data classification unit (284) contains encoders of data classes N, M and I configured with vectors of thresholds generating sub-classes of data classes N, M and I.

In another aspect the genomic encoder further comprises the features suitable for executing all the aspects of the previously mentioned coding methods.

The present invention further provides a genomic decoder (298) for the decompression of a compressed genomic stream (291) said genomic decoder (298) comprising:

a demultiplexer (290) for demultiplexing compressed genomic data and metadata,
parsing means (292-294) configured to parse said compressed genomic stream into streams of syntax elements (295),
one or more syntax elements streams decoders (296-297), configured to decode the descriptors streams into classified reads of sequences of nucleotides (2911),
genomic data classes decoders (299) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.

In another aspect the genomic decoder further comprises that the one or more reference sequences are stored in the compressed genome stream (291).

In another aspect the genomic decoder further comprises that the one or more reference sequences are provided to the decoder via an out of band mechanism.

In another aspect the genomic decoder further comprises that the one or more reference sequences are built at the decoder.

The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.

The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.

The present invention further provides a support data storing genomic encoded according perform all the aspects of the previously mentioned coding methods.

DETAILED DESCRIPTION

The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.

Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a.k.a. “bases”) represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the nucleotide at that position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine as vocabulary, the alphabet used for the symbols is composed of the following symbols: {A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -}.

In the context of the present invention a “Genomic Dataset” is defined as any structured set of genomic data including, for example, genomic data of a living organism, one or more sequences and metadata generated by the genomic sequencing of a living organism or by any other step of genomic data processing performed on the original sequencing data.

The nucleotides sequences produced by sequencing machines are called “reads”. Sequence reads can be composed of a number of nucleotides ranging between a few dozens to several thousand. Some sequencing technologies produce sequence reads composed of “pairs” of which one read is originated from one DNA strand and the other is originated from the other strand. A read associated to another read in a sequencing process producing pairs is said to be its “mate”.

Throughout this disclosure, a reference sequence is a sequence of nucleotides associated to a mono-dimensional integer coordinate system for which each integer coordinate is associated to a single nucleotide. Coordinate values can only be equal or larger than zero. This coordinate system in the context of this invention is zero-based (i.e. the first nucleotide has coordinate 0 and it is said to be at position 0) and linearly increasing from left to right.

When mapping sequence reads on a reference sequence, said reference sequence is used as axis of a mono-dimensional coordinate system in which the left-most position is denoted as position 0. In a sequence read, mapped to a reference sequence, the nucleotide composing the read mapped at the reference sequence position identified by the smallest coordinate number is usually referred to as the “left-most” nucleotide, whereas the nucleotide composing the read mapped at the reference sequence position identified by the largest coordinate number is referred to as the “right-most” nucleotide. This is illustrated in FIG. 1. Throughout this disclosure, a nucleotide is also referred to as a “base”.

When a sequence read is mapped to a reference sequence, the coordinate of the left-most mapped base is said to represent the mapping position of the read on the reference sequence.

A reference genome is composed by one or more reference sequences and it is assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived and merely constructed to improve the compressibility of the reads in view of their further processing.

Throughout this disclosure, a genomic record is defined as a coded representation of

- either a single sequence read or a paired sequence read optionally associated with alignment information, read identifier and quality values
- a reference sequence (e.g. a chromosome) or a portion thereof.

Throughout this disclosure, a genomic record position is defined as the position on the reference sequence of the left-most mapped base of the read or read pair coded in a genomic record. Mapped bases in a read or read pair record include:

- a base of the aligned read matching the corresponding base on the reference sequence
- a base of the aligned read that does not match the corresponding base (a.k.a. Single Nucleotide Polymorphism)

A base present in the aligned read and not present in the reference sequence (a.k.a. insertion) and bases preserved by the alignment process, but not mapped on the reference sequence (a.k.a. soft clips) do not have mapping positions.

The read composing a read pair with a base mapping on the smallest coordinate on a reference is referred to, in this disclosure, as “Read 1” whereas its mate is referred to as “Read 2”.

The distance, expressed as number of nucleotides (or bases), separating two reads generated as a pair, by a sequencing machine using today technology, is unknown, and it is determined by mapping both reads composing the pair (i.e. minimizing appropriate matching functions) to a reference sequence.

Throughout this disclosure, a genomic record length is defined as the number of coordinate positions between the left-most mapped base coded in the record and the right-most mapped base coded in the record.

Throughout this disclosure, a pairing distance is defined as the number of coordinate positions between the left-most mapped base coded in the record and the left-most mapped base of Read 2 coded in the record. An example of pairing distance is shown in FIG. 1.

Throughout this disclosure, in case of single reads the genomic record length (GRL) is calculated as the integer obtained by subtracting the mapping position of the left-most base from the mapping position of the right-most mapped base and adding “1”.

GRL=(right-most base position)−(left-most base position)+1

Throughout this disclosure in case of read pairs, the genomic record length (GRL) is calculated as the integer obtained by subtracting the mapping position of the left-most base of the read mapping at the smallest position on a reference sequence (Read 1) from the coordinate of the mapping position of the right-most base of its mate (Read 2) and adding “1”. Such definition of genomic record length is illustrated in FIG. 3.

Throughout this disclosure, in case of genomic record encoding a reference sequence or a portion thereof, the genomic record length is defined as the number of nucleotides composing the encoded sequence.

Throughout this disclosure, a genomic range is defined as a contiguous segment of coordinates on a reference sequence defined by a start coordinate S and an end coordinate E such that S≤E. The start and end position of a genomic range are always included in the Range.

This invention aims at defining a new method enabling the efficient access to aligned genomic sequence reads mapped to any genomic region when the sequence reads are compressed by means of sets of descriptors included into a number of data blocks called Access Units.

Throughout this disclosure, an Access Unit (AU) is defined as a logical data structure containing a coded representation of genomic information or related metadata to facilitate the bit stream access and manipulation. It is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure.

According to the type of coded information, an AU can be decoded either independently of any other AU or using information contained in other AUs.

AUs can be of a multiplicity of types according to the nature of the coded data. An Access Unit contains either a reference sequence, or a portion thereof, or encoded reads or read pairs belonging to a single class of data. For example an Access Unit may contain the entire chromosome 1 of GRCh37, the Genome Reference Consortium human genome (build 37). Another Access Unit may contain the coded representation of nucleotides of chromosome 1 of GRCh37 that are located between coordinates 50,000 and 150,000. Another Access Unit may contain only reads or read pairs that perfectly map on the reference sequence without any mismatch. Another Access Unit may contain reads or read pairs that only contain “N” symbols as mismatches with respect to the reference sequence. Another Access Unit may contain reads or read pairs that contain any type of substitutions (e.g. one base present in the read or read pair is different from the base at the corresponding mapping position in the reference sequence). Another Access Unit may contain reads or read pairs that contain mismatches, insertions, deletions and soft clipped bases. Another Access Unit may contain only read or read pairs that do not map on the reference sequence. Another Access Unit may contain only read pairs in which one read is mapped and the other is not mapped on the reference sequence. Another type of Access Unit may contain only encoded segments of a reference genome composed by one or more reference sequences (for example chromosomes).

The essential characteristic of an Access Unit is that it contains in compressed form all elements needed to reconstruct the genomic information (sequence reads or read pairs, reference sequences), associated alignment information and metadata of reads or read pairs it represents. In other words, to fully reconstruct the reads, read pairs or reference sequence and associated information carried by an Access Unit it is only necessary to retrieve the Access Unit itself and, when needed, the Access Units containing the reference sequence it refers to.

In each Access Unit the descriptors listed in the next section representing the encoded read or read pairs are aggregated in separate data blocks—one per type—in order to exploit their homogeneous statistical properties when entropy coding is applied to compress them.

Each Access Unit contains the compressed sub-set of descriptors representing sequence reads or read pairs belonging to the same class mapped to a genomic region on a reference sequence. Such genomic region on the reference sequence is defined by a start coordinate (or start position) and an end coordinate (or end position).

An example of Access Unit is illustrated in FIG. 22. Access Units are composed by blocks of encoded descriptors (described in the next section). To enable transport over a network, blocks are further decomposed into packets.

Descriptors are syntax elements representing part of the information necessary to reconstruct (i.e. decode) coded reference sequences, sequence reads and the associated mapping information or pairs of sequence reads and associated mapping information. Different types of descriptors are defined to express:

- the mapping position of a read on a reference sequence,
- the distance between a read and its mate,
- the length of a sequence read,
- the position of mismatches in aligned reads with respect to reference sequences,
- the types of mismatches with respect to reference sequences at the associated positions,
- the bases that could not be mapped on the reference sequence by the mapping procedure and are classified as “soft clipped” bases,
- the sequence reads lengths,
- the mapping flags as specified by the SAM specification,
- the multiple mapping positions that are associated to a single read or read pair by the mapping procedure,
- the identification of the existence of spliced reads (i.e. reads that when split in chunks admits mapping positions with higher degrees of matching accuracy then when they are mapped as single connected read mapped on a single position on a reference sequence)
- the specific type of reference sequence used such as:
  - reference genomes as those published by consortia like the Genome Reference Consortium (e.g. GRCh37), the University of California Santa Cruz (e.g. hg19),
  - reference sequences built by using specified sets of reads and specified sets of assembly rules,
- the positions and types of modifications applied to reference sequences with the objective of reducing the entropy of the descriptors used to represent the mismatches of sequence reads mapped on such modified reference sequences,
- the representation of sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy,
- the representation of entire reference sequences or portions thereof.

The complete list and the precise definition of each descriptor referred to in this disclosure, and used by this invention is provided in the following sections.

As mentioned above, a multiplicity of descriptors are used by this invention to represent in compressed form reference sequences, sequence reads or read pairs so that they can be fully reconstructed with their associated information. In case reads or read pairs are also classified and separated into different classes, according to the results of the mapping on reference sequences, and entropy coded into separate data blocks, different sub-sets of descriptors are used for representing each class or reads or read pairs. Therefore, an Access Unit contains only those entropy coded descriptors that are needed to represent either a reference sequences—or portions thereof—or reads or read pairs belonging to the same class. This is shown in FIG. 11 for reads of variable length and in FIG. 12, FIG. 13 and FIG. 14 for read pairs of constant length.

Throughout this disclosure, entropy coded descriptors of the same type are said to compose a Descriptor Stream.

Throughout this disclosure, an Access Unit Start Position is defined as the left-most Genomic Record Position among all Genomic Records contained in the Access Unit.

Throughout this disclosure, an Access Unit End Position is defined as the right-most mapped base position among all mapped bases of all Genomic Records contained in the Access Unit

Throughout this disclosure, the Access Unit Range is defined as the Genomic Range comprised between the AU Start Position and the right-most genomic record position among all genomic records contained in the Access Unit. Its value in number of positions can be calculated by subtracting the AU Start Position from the AU End Position and adding “1”.

Throughout this disclosure, the Access Unit Covered Region is defined as the Genomic Range comprised between the AU Start Position and the AU End Position.

Throughout this disclosure, an Access Unit is also said to cover the genomic region between its Start Position and its End Position.

Some genomic records coded in an AU can have mapping position at a distance from the AU end position that is smaller than the genomic record length. This implies that some bases belonging to the read or read pair coded in the genomic record are mapped in genomic regions covered by one of the following AUs. This is shown in FIG. 4.

According to the definitions provided above two useful ways of building Access Units can be identified:

- 1. A so-called “non-overlapping mode” in which the genomic ranges of Access Units of the same data class never overlap,
- 2. A so-called “overlapping mode” in which the genomic ranges of Access Units of the same data class may overlap.

The “non-overlapping mode” is preferable in scenarios when genomic data are compressed and stored in memory as files as well as in streaming scenarios when stored files are streamed from one storage device to another since it enables the implementation of efficient selective access procedures. The “overlapping mode” supports streaming scenarios when portions of the genomic datasets become available for coding into Access Units and transmission before the entirety of the genomic sequence data is available at the transmitting device.

The main innovative aspects of the disclosed invention are the following:

1 The encoding of each sequence read or read pair into a genomic record is implemented by means of a sub-set of descriptors that expresses how the encoded sequence read or read pair is mapped to a reference sequence and how it can be fully reconstructed from its compressed representation.
2 A specific sub-set of descriptors is used to represent sequence reads or read pairs belonging to each class of data in which reads or read pairs are partitioned according to the result of the mapping on a reference sequence. This representation supports multiple mapping positions and spliced reads.
3 A specific sub-set of descriptors is used to represent reference sequences or portions thereof.
4 In the case of sequence read pairs, with each read having the same length, each pair is entropy encoded as a single entity in a genomic record associated to a single sub-set of descriptors so as to maximize the coding performance.
5 The sub-set of descriptors in compressed form (i.e. entropy coded) coded in a genomic record, is included in an Access Unit having range that includes the left-most mapped base of the read, or the left-most mapped base of Read 1 for read pairs.
6 The definition of a coding method that determines if the reads belonging to one pair have to be encoded as a single entity and the sub-set of descriptors representing the pair, has to be included in compressed form in a single Access Unit or if the pair has to be split and represented in compressed form as two separate reads in two different Access Units. The method defines a splitting rule using as input parameter the distance between the two reads in the pair and the Access Unit range.
7 The application of the new coding method at the encoder side and the transmission of the encoding parameters in the compressed bitstream enables at decoder side the identification of the minimum number of Access Units that need to be decoded when it is required to access all sequence reads and read pairs mapped (i.e. belonging) to a specific genomic region.
8 The definition of the coding method, the transmission of the parameters of the coding method at decoder side, the determination of the minimum number of Access Units and their identification for retrieving all reads mapping on any given genomic region enable the implementation of high-performance selective access to sequence reads and read pairs stored or transported (i.e. streamed) in compressed form.
9 Thanks to the use of descriptors and their partitioning in homogeneous data blocks within Access Units, effective entropy coding can be applied to compress the genomic information representation.
The definition of an indexing tool called Master Index Table (MIT) containing
- a. per each class of data an index containing the Start Position and End Position of each AU contained in a genomic dataset. In each index the AU Start Positions are sorted in ascending order.
- b. per each class of data a vector of pointers to the physical locations on a storage medium of the coded AU belonging to each class of data. Each vector is sorted as the corresponding index of AU Start Positions

The MIT enables efficient random access to reads or read pairs mapped to specified genomic regions by associating genomic regions on a reference sequence with the corresponding locations on a storage device of the Access Units containing a compressed representation of said reads or read pairs.

11 In the case of single reads with variable length (i.e. reads with a length that may assume any value), a new coding method that evaluates the length distribution of the reads coded in each Access Units and their mapping coordinates, determines the largest value of the coordinate of the mapping position of the base belonging to a read contained into the Access Unit (AU End Position), writes a representation of such value in the MIT so that it is available by a decoder for implementing efficient selective access operations on specific genomic regions. The decoder, by inspecting only the MIT to retrieve the value of the End Position of each AU covering the specified selective access region, can determine the minimum number of Access Units that need to be decoded to guarantee that all reads covering the genomic region for which selective access is required are retrieved.

In the following, each of the above innovative aspects will be further described in full details.

Classification of the Sequence Reads According to Matching Rules

The sequence reads generated by sequencing machines are classified by the disclosed invention into six different “classes” according to the matching results of the alignment with respect to one or more “pre-existing” reference sequences.

When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:

1. A region in the reference sequence is found to match the sequence read without any error (i.e. perfect mapping). Such sequence of nucleotides is referenced to as “perfectly matching read” or denoted as “Class P”.
2. A region in the reference sequence is found to match the sequence read with a type and a number of mismatches determined only by the number of positions in which the sequencing machine generating the read was not able to call any base (or nucleotide). Such type of mismatches are denoted by an “N”, the letter used to indicate an undefined nucleotide base. In this document this type of mismatch are referred to as “n type” mismatch. Such sequences belong to “Class N” reads. Once the read is classified to belong to “Class N” it is useful to limit the degree of matching inaccuracy to a given upper bound and set a boundary between what is considered a valid matching and what it is not. Therefore, the reads assigned to Class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (i.e. bases called as “N”) that a read can contain. Such classification implicitly defines the required minimum matching accuracy (or maximum degree of mismatch) that all reads belonging to Class N share when referred to the corresponding reference sequence, which constitutes an useful criterion for applying selective data searches to the compressed data.
3. A region in the reference sequence is found to match the sequence read with types and number of mismatches determined by the number of positions in which the sequencing machine generating the read was not able to call any nucleotide base, if present (i.e. “n type” mismatches), plus the number of mismatches in which a different base, than the one present in the reference, has been called. Such type of mismatch denoted as “substitution” is also called Single Nucleotide Variation (SNV) or Single Nucleotide Polymorphism (SNP). In this document this type of mismatch is also referred to as “s type” mismatch. The sequence read is then referenced to as “M mismatching reads” and assigned to “Class M”. Like in the case of “Class N”, also for all reads belonging to “Class M” it is useful to limit the degree of matching inaccuracy to a given upper bound, and set a boundary between what is considered a valid matching and what it is not. Therefore, the reads assigned to Class M are also constrained by defining a set of thresholds, one for the number “n” of mismatches of “n type” (MAXN) if present, and another for the number of substitutions “s” (MAXS). A third constraint is a threshold defined by any function of both numbers “n” and “s”, f(n,$). Such third constraint enables to generate classes with an upper bound of matching inaccuracy according to any meaningful selective access criterion. For instance, and not as a limitation, f(n,$) can be (n+s)½ or (n+s) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class M”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it makes possible to set a further boundary to any possible combination of the number of “n type” mismatches and “s type” mismatches (substitutions) beyond the simple threshold applied to the one type or to the other.
4. A fourth class is constituted by sequencing reads presenting at least one mismatch of any type among “insertion”, “deletion” (a.k.a. indels) and “clipped”, plus, if present, any mismatches type belonging to class N or M. Such sequences are referred to as “I mismatching reads” and assigned to “Class I”. Insertions are constituted by an additional sequence of one or more nucleotides not present in the reference, but present in the read sequence. In this document this type of mismatch is referred to as “i type” mismatch. In literature when the inserted sequence is at the edges of the sequence it is also referred to as “soft clipped” (i.e. the nucleotides are not matching the reference but are kept in the aligned reads contrarily to “hard clipped” nucleotides which are discarded). In this document this type of mismatch is referred to as “c type” mismatch. Deletion are “holes” (missing nucleotides) in the read with respect to the reference. In this document this type of mismatch is referred to as “d type” mismatch. Like in the case of classes “N” and “M” it is possible and appropriate to define a limit to the matching inaccuracy. The definition of the set of constraints for “Class I” is based on the same principles used for “Class M” and is reported in Table 1 in the last table lines. Beside a threshold for each type of mismatch admissible for Class I data, a further constraint is defined by a threshold determined by any function of the number of the mismatches “n”, “s”, “d”, “i” and “c”, w(n,s,d,i,c). Such additional constraint make possible to generate classes with an upper bound of matching inaccuracy according to any meaningful user defined selective access criterion. For instance, and not as a limitation, w(n,s,d,i,c) can be (n+s+d+i+c)⅕ or (n+s+d+i+c) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class I”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it enables to set a further boundary to any possible combination of the number of mismatches admissible in “Class I” reads beyond the simple threshold applied to each type of admissible mismatch.
5. A fifth class includes all reads that do not find any mapping considered valid (i.e not satisfying the set of matching rules defining an upper bound to the maximum matching inaccuracy as specified in Table 1) for each data class when referring to the reference sequence. Such sequences are said to be “Unmapped” when referring to the reference sequences and are classified as belonging to the “Class U”.

Classification of Read Pairs According to Matching Rules

The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generates read in pairs (i.e. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class. A read that is coupled with another is said to be its “mate”.

If both paired reads belong to the same class the assignment to a class of the entire pair is obvious: the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:

P<N<M<I

in which “Class P” has the lowest priority and “Class I” has the highest priority.

In case only one of the reads belongs to “Class U” and its mate to any of the Classes P, N, M, I a sixth class is defined as “Class HM” which stands for “Half Mapped”.

The definition of such specific class of reads is motivated by the fact that it is used for attempting to determine gaps or unknown regions existing in reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the pair read that can be mapped on the known regions. The unmapped mate is then used to build the so called “contigs” of the unknown region as it is shown in FIG. 32. Therefore providing a selective access to only such type of read pairs greatly reduces the associated computation burden enabling much efficient processing of such data originated by large amounts of data sets that using the state of the art solutions would require to be entirely inspected.

The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provide rules in terms of maximum threshold for each mismatch type and any function f(n,$) and w(n,s,d,i,c) of the possible mismatch types.

TABLE 1 Type of mismatches and set of constrains that each sequence reads must satisfy to be classified in the data classes defined in this invention disclosure. Number and types of mismatches found when matching a read with a reference sequence Number of Set of matching unknown Number of Number of Number of Number of accuracy Assignement bases (“N”) substitutions deletions Insertions clipped bases constraints Class 0 0 0 0 0 0 P n > 0 0 0 0 0 n ≤ MAXN N n > MAXN U n ≥ 0 s > 0 0 0 0 n ≤ MAXN and M s ≤ MAXS and f(n, s) ≤ MAXM n > MAXN or U s > MAXS or f(n, s) > MAXM n ≥ 0 s ≥ 0 d ≥ 0* i ≥ 0* c ≥ 0* n ≤ MAXN and I s ≤ MAXS and d ≤ MAXD and i ≤ MAXI and c ≤ MAXC w(n, s, d, i, c) ≤ MAXTOT d ≥ 0 i ≥ 0 c ≥ 0 n > MAXN or U s > MAXS or d > MAXD or i > MAXI or c > MAXC w(n, s, d, i, c) > MAXTOT *At least one mismatch of type d, i, c must be present (i.e. d > 0 or i > 0 or c > 0)

Compressed Representation of Genomic Sequence Reads and Reference Sequences

A common element of efficient approaches to genomic sequence reads compression is the exploitation of the correlation of sequence data with respect to a reference sequence. Even if the somatic profile of the human population is extremely diversified, the actual portion of the number of nucleotides that differs from person to person is about only 0.1% of the total number of nucleotides composing an entire genome. Therefore, the specific genomic information characterizing each individual is very limited with respect to the entire information carried by an entire genome. When a pre-existing reference genome is available, be it for previous sequencing or as a published “average” consensus reference, the most efficient way to encode the actual information is to identify and encode only the differences with respect to the reference genome.

In order to do so with raw sequence reads in the form of FASTQ data, a preliminary pre-processing step of mapping on an available reference genome is performed. In case a reference genome is not available or if the bias introduced by the usage of a specific reference is not desirable, the construction of a new reference sequence by means of assembling the available sequence reads into longer sequences is a possible alternative.

When sequence reads have been mapped with respect to a pre-existing or to a constructed reference sequence, each sequence read can be fully represented by a number of elements denoted in this disclosure as “read descriptors” or simply “descriptors”.

For example, in the case of a sequence read that perfectly matches a segment of a reference sequence, the only sub-set of descriptors needed to represent the sequence read is composed by the coordinate of the mapping position on the reference (usually the coordinate of the mapping position of the left-most base of the sequence read), the length of the sequence read itself and the information indicating if the read is mapping on the direct or reverse DNA strand with respect to the reference sequence strand.

In case it is not possible to find any mapping position for which all the bases of the sequence read match all bases of the reference sequence, the mapping (or mappings) with the minimal number of mismatches are retained. In such case a different sub-set of descriptors is needed to also express the substitutions, the insertions, the deletions and the clipped bases that may occur in correspondence of the mapping position with the minimal or close to minimal number of mismatches. With such sub-set of descriptors the sequence reads can be reconstructed using the information carried by the descriptors and the information carried by the reference sequence.

The mapping process can also produce other types of information such as: multiple possible mapping positions and related scores, the quality of mapping, the specification of spliced reads, the mapping on two different references (usually chromosomes) of reads belonging to a pair, features of the sequencing process (e.g. PCR or optical duplicate). All such information requires specific additional descriptors that extend each sub-sets that then is compressed by applying for each sub-set of descriptors appropriate entropy coding algorithms.

The genome sequencing process can generate read duplicates (i.e. two or more exact copies of the same genomic sequence) due to:

- the chemistry of the genome sequencing process (Polymerase Chain Reaction duplicate),
- the data acquisition process (optical duplicate). A read is called as an optical duplicate if the pair of reads are both on the same tile, and the distance between reads is less than a given configuration parameter depending on the experiment.

Each read or read pair can therefore be uniquely represented by a specific sub-set of descriptors according to the results of the mapping process.

Commonly used approaches such as SAM and CRAM do not encode reads or read pairs according to the specific sub-set of descriptors needed to represent their mapping information. SAM and CRAM do not classify sequence reads into data classes according to the number and type of mismatches they contain with respect to the reference sequence they are mapped to. Furthermore, these formats do not code sequence reads separately into Access Units containing in compressed form only sequence reads belonging to a single data class. In the case of sequence reads generated in pairs, state of the art approaches do not code them as single elements partitioned into classes according to their mapping accuracy with respect to the reference sequence. Such state of the art approaches are characterized by the following limitations and drawbacks:

- 1. The coding of reads or reads pairs without classifying sequence reads into separate data classes according to mapping results versus a reference sequence and using a unique super-set of descriptors is an inefficient approach that yields poor compression performance.
- 2. The coding of read pairs as separate sequence reads requires the duplication of several descriptors carrying the same information, thus results inefficient and yields poor compression performance.
- 3. The retrieval of the information needed to reconstruct read pairs result to be complex and inefficient since the process requires a brute-force sequential search in possibly the entire dataset which as in case of next generation sequencing (NGS) technology can be extremely large.
- 4. The selective access to read or read pairs mapped to specific genomic regions requires to search the entire dataset to have the guarantee that all reads or read pairs are retrieved.

When coding read pairs by means of a single sub-set of descriptors, the following technical advantages are evident to a person skilled in the art:

- 1. The information common to both reads, which is clearly redundant, is not replicated by coding a pair as single element (e.g. read pair identifiers, mapping distance, mapping reference identifiers, the various mapping quality information currently encoded by specific flags in the SAM file format)
- 2. The retrieval of the mutual pairing information (i.e. the information providing which read is the mate of any read at hand) is straightforward and does not require any further processing. Conversely in the state of the art approaches it may require to parse the entire volume of data.

In order to enable efficient selective access to specific portions of sequencing data and being able to transport them on a digital data network, the set of descriptors used to represent sequence reads aligned to a reference are structured in logically separate and independent data blocks called Access Units (AU). Each Access Unit contains only the compressed representation of a single data class, and can be decoded either independently of any other Access Units or using only Access Units carrying the coded representation of the reference sequence region used for mapping. This enables selective access and out-of-order transport capabilities.

In order to increase the compression efficiency this invention eliminates the need of specifying the “mapping reference identifier” descriptor for each read pair having both pairs mapped on the same reference sequence. Each Access Unit can contain only reads or pairs that map on the same reference. Using such solution the descriptor representing the reference sequence identifier needs to be encoded only once per each Access Unit or set of Access Units (and not repeated for each read as currently done in SAM/BAM formats).

The only exception of the rule expressed above is the case of read pairs having the two reads mapped on different reference sequences (e.g. chromosomes). In this case the pair is split and the two reads are coded as two separate genomic records and each encoded read contains the identifier of the reference sequence its mate is mapped to.

A person skilled in the art knows that classifying information in groups of elements with homogeneous statistical properties provides better compression performance with respect to the usage of a general purpose compressor (e.g. LZ type algorithm) applied to a heterogeneous set of data. As a consequence, when encoding genomic sequence reads in pairs by means of a specific sub-set of descriptors, higher compression is achieved thanks to the lower entropy characterizing each separate sub-set of descriptors and higher processing efficiency when reconstructing and retrieving read pairs.

Sequence Reads Descriptors

This section introduces the descriptors specified to represent genomic sequence reads mapped on a reference sequence. The specific sub-set of descriptors used to represent each read or read pair depends on the result of the mapping versus a reference sequence (i.e. presence or absence of mismatches between the read, or read pair and the reference sequence).

Position

A read or read pair position is defined as the mapping position on the reference sequence of the left-most base in the read or read pair. A descriptor of type “position” is necessary per each read or read pair. The value of the “position” descriptor represents:

- the value of the coordinate of the left-most base in the read or read pair on the reference sequence
- or the difference with respect to the coordinate of the previous read or read pair coded in the same Access Unit.

A “position” descriptor is required to represent each encoded read or read pair.

In this invention disclosure such descriptor will be referred to as pos descriptor.

Reads Pairing

In case of read pairs, the descriptor that represents how each read is associated with its mate in the pair can be expressed by several syntax elements as it may represent:

- The difference of coordinate between a base of a read with the respective base in the mate (e.g. the left-most mapped base in a read with the left-most mapped base in the mate). In this invention disclosure such descriptor will be referred to as pair descriptor.
- The absolute coordinate of the mate on the reference sequence with the identifier of the reference sequence the mate maps to. Such representation option is used when:
  - the two reads of a pair map on different reference sequences (e.g. chromosomes) or
  - when the two reads of a pair map on the same reference, but are separated by a number of bases exceeding a given value which is specified to be the maximum likely admissible pairing distance.

In this invention disclosure such descriptor will be referred to as abspair descriptor. In case of mate mapped on a different reference sequence the descriptor identifying the reference sequence is referred to as refid

- The number of encoded reads separating a read from its mate in case of paired reads. In this invention disclosure such descriptor will be referred to as pcount descriptor.

Read Length

In the case of reads with variable length, a descriptor per read is used to represent the length expressed as the number of nucleotides composing the read. Obviously, a read length descriptor is required per each read only in the case of variable read length.

In this disclosure this descriptor is also referred to as rlen descriptor.

Reverse Complement

The DNA is composed by a double helix in which each strand is the complement of the other since Adenine (‘A’) only couples with Thymine (‘T’) and Cytosine (‘C’) only couples with Guanine (‘G’). Therefore, it is only necessary to represent one strand to know the nucleotide composition of the other. This is the reason for which reference sequences are always represented by a single sequence, and mapping tools are capable of finding mapping positions for reads belonging to both strands. If a read is mapped on the complementary strand of the DNA helix, it is said to be “reverse complemented”. A descriptor is necessary to carry such information and carries the information indicating if the original read is the reverse complement or not of the reference sequence it is mapped to.

A reverse complement descriptor is required per each read.

In this disclosure such descriptor is also referred to as rcomp descriptor.

Unknown Bases Positions During the sequencing process it may occur that the machine is unable to call any base at a given positions of a read or read composing a pair. Such event is identified by a special symbol ‘N’ at the corresponding read position. A descriptor identifying each occurrence of an ‘N’ symbol in a read is thus needed.

The descriptor may represent:

- The absolute position of the ‘N’ symbol in read or read of a pair expressed as coordinate of the reference sequence or
- The relative position of the previous ‘N’ in the same read or read of a pair.

In this disclosure such descriptor is also referred to as nmis descriptor.

Mismatches Position and Type

Sequence reads mapped on a reference sequence may present mismatches with respect to the reference sequence segment they are mapped to. These mismatches can be classified and are denoted as substitutions, deletions or insertions according to the following cases:

- The presence of a different nucleotide (base) with respect to the reference sequence (substitution)
- The absence of a nucleotide in the mapped read (deletion)
- The presence of a nucleotide in the read that is not present in the reference (insertion)

The representation of each mismatch type implies the use of three descriptors, one representing the mismatch position in the read or read pair (also referred to as mmpos), another representing the type of mismatch when only substitutions are present (also referred to as subtype) and another representing the type of mismatch when substitutions, insertions and deletions are present (also referred to as mmtype).

Soft Clips

Genomic sequence reads mapped on a reference sequence may present at their edges portions of the sequence of nucleotides that do not match any or just a few of those present on the reference sequence at the mapping position. These sequences portions are called soft-clips and can be represented by a descriptor simply composed by the string of symbols representing the bases of the sequence portion.

Reads can admit only one or two soft clips, at the beginning of the read and/or at the end.

In this document such descriptor is also referred to as sclips descriptor.

Mapping Flags

Mapping flags are used to carry specific information relative to the alignment process such as:

- The presence of multiple mapping position for a read or read pair
- The presence of spliced reads
- The presence of PCR (Polymerase Chain Reaction) or optical duplicates
- Supplementary alignment (used in case the aligner has produced several possible mapping position for the same read or read pair)
- The read fails the quality check (i.e. technology vendor specific procedure to measure the quality of the sequencing process)

In this document such descriptors are also referred to as flags descriptors.

Unmapped Reads

In case a read is not mapped at any position of the reference sequences the read is classified as unmapped. In such case all unmapped reads are grouped according to some shared characteristics. This process is called “clustering”. The groups of reads sharing the same characteristic are called clusters. Throughout this invention disclosure the characteristic shared among sequence reads belonging to the same cluster is called cluster signature or signature.

A signature can be composed by any number of nucleotides from two to several thousands and signatures can have either constant length for all clusters or variable length. The alphabet of symbols that can belong to a signature depends on the specific genomic sample that has been sequenced to produce the sequence reads being processed. As an example, but not as a limitation, the following alphabets can be used:

- for DNA
  - {A, G, C, T, N}
  - {A, G, C, T, R, Y, S, W, K, M, B, D, H, V, N, ., -} (IUPAC notation)
- for RNA
  - {A, G, C, U, N}
  - {A, G, C, U, R, Y, S, W, K, M, B, D, H, V, N, ., -} (IUPAC notation)
- for amino acids
  - {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}

The type of alphabet used to compute a cluster signature is identified by a parameter Alphabet_ID carried by a data structure called Genomic Dataset Header described in this disclosure.

Signatures of clusters belonging to the same genomic dataset can be of constant or variable length. A global parameter encoded in the genomic dataset header is used to signal if the signatures length is constant or variable. If the signature length is constant a second global parameter represents the length in symbols of the clusters signatures. This value is 0 in case of variable signatures length. Each cluster of unmapped reads is coded in one or more Access Units.

Encoding Reads or Read Pairs without Mismatches

In case a read or read pair perfectly maps on the reference sequence (i.e. without any mismatch) the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:

- A position descriptor per read or read pair (pos)
- A reverse complement descriptor per read or two per read pair (rcomp)
- A length descriptor per read (in case of variable length reads only) (Hen)
- A pairing descriptor per read pair (pair)
- A set of mapping flags (flags)

In this invention such read or read pair is classified as belonging to Class P.

The position descriptor pos represents the position on the reference genome of the left-most mapping base of the read or read pair. It's use is depicted in FIG. 5.

The reverse complement descriptor rcomp indicates whether a read is mapped on the direct or reverse strand of the reference sequence. The meaning and syntax of this descriptor are depicted in FIG. 7 and FIG. 8.

In case of variable length reads a descriptor rlen encodes the read length.

The pair descriptor carries the information necessary to reconstruct the entire pair. The syntax of the descriptor is provided in FIG. 9.

An example of the encoding of a read pair belonging to Class P is provided in FIG. 10, the corresponding example for a single read of variable length is provided in FIG. 11.

Encoding Read or Read Pairs with Mismatches Represented Only by Unknown Bases

In the case a read or read pair maps on the reference sequence, but contains at least one unknown base, the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:

- A position descriptor per read or read pair (pos)
- A reverse complement descriptor per read or two per read pair (rcomp)
- A position per each unknown base (nmis)
- A length descriptor per read (in case of variable length reads only) (rlen)
- A pairing descriptor per read pair (pair)
- A set of mapping flags (flags)

Descriptors already present in Class P sub-set have the same syntax and semantics. The additional descriptor nmis provides the position in a read (pair) of the bases called as “unknown” by the sequencing process (symbol ‘N’).

In this invention such read or read pair is classified as to belonging to Class N.

An example of the encoding of a read pair of Class N is provided in FIG. 12.

Encoding Read or Read Pairs with Unknown Bases and Substitutions

In the case a read or a read pair maps on the reference sequence and presents at least one substitution, but no deletions or insertions, the following sub-set of descriptors is needed to reconstruct the read and associated mapping information:

- A position descriptor per read or read pair (pos)
- A reverse complement descriptor per read or two per read pair (rcomp)
- A descriptor per each substitution position (mmpos)
- A descriptor per each substitution type (subtype)
- A position per each unknown base (nmis)
- A length descriptor per read (in case of variable length reads only) (rlen)
- A pairing descriptor per read pair (pair)
- A set of mapping flags (flags)

Descriptors already present in class P sub-set have the same syntax and semantics. The additional descriptors used for such sequence read data class are mmpos to represent the position of substitutions and subtype to represent the type of substitution.

An example of the encoding of a read pair of this type is provided in FIG. 13.

In this invention disclosure such read or read pair is said to belong to Class M.

Encoding Read or Read Pairs with at Least One Insertion, Deletion or Soft Clip

When a read or read pair maps on the reference sequence with at least one insertion deletion or soft clip, the following sub-set of descriptors is defined:

- A position descriptor per read or read pair (pos)
- A reverse complement descriptor per read or two per read pair (rcomp)
- A descriptor per each position of mismatch (insertion, deletion, substitution) (mmpos)
- A descriptor per each type of mismatch (insertion, deletion, substitution) (mmtype)
- A descriptor per each sequence of soft clips (sclips)
- A position per each unknown base (nmis)
- A length descriptor per read (in case of variable length reads only) (Hen)
- A pairing descriptor per read pair (pair)
- A set of mapping flags (flags)

Descriptors already present in class M sub-set have the same syntax and semantics. The additional descriptors used in this case are mmpos to represent the position of substitutions, insertions and deletions, mmtype to represent the type of mismatch (substitutions, insertion or deletion) and sclips to represent the soft clipped bases.

In this invention disclosure such read or read pair is said to belong to Class I. An example of the encoding of a read pair belonging to Class I is provided in FIG. 14.

Read Pairs in which Only One Read is Mapped to a Reference Sequence

In the case a read pair is composed by a mapped read (belonging to any of the class P, N, M or I) and an unmapped read, the pair is classified as belonging to a separate class called Class HM (Half Mapped).

The read mapped on the reference sequence can be of any of the classes described above (P, N, M and I) and will be encoded using the sub-set of descriptors already described for each class. The unmapped read will be encoded by compressing the string of symbols representing it using an appropriate entropy coder.

Encoding Unmapped Reads or Read Pairs and Construction of “Internal” References

Reads belonging to Class U or the unmapped mate of read pairs belonging to Class HM cannot be mapped to any “existing” reference sequence satisfying the specified set of matching accuracy constraints. This invention discloses a method to construct one or more “internal” reference sequences to be used to align and compress reads belonging to these data classes.

Several approaches are possible to construct appropriate “internal” references such as for instance and not as limitation:

- the partitioning of the unmapped reads into clusters containing reads that share a common contiguous genomic sequence of at least a minimal size (signature). Each cluster can be uniquely identified by its signature as shown in FIG. 24.
- the sorting of reads in any meaningful order (e.g. lexicographic order) and the use of the last N reads as “internal” reference for the encoding of the N+1. This method is shown in FIG. 25.
- performing a so called “de-novo assembly” on a subset of the reads of class U so as to be able to align and encode all or a relevant sub-set of the reads belonging to said class according to the specified matching accuracy constraints or a new set of constraints.

If the read being coded can be mapped on the “internal” reference satisfying the specified set of matching accuracy constraints, the information necessary to reconstruct the read after compression is coded using syntax elements that can be of the following types:

- 1. Start position of the matching portion on the internal reference in terms of read number in the internal reference (pos descriptor). This position can be encoded either as absolute or differential value with respect to the previously encoded read.
- 2. Offset of the start position from the beginning of the corresponding read in the internal reference (pair descriptor). E.g. in case of constant read length the actual position is pos*length+pair.
- 3. Possibly present mismatches coded as mismatch position (snpp descriptor) and type (snpt descriptor)
- 4. Those parts of the reads (typically the edges identified by pair) that do not match with the internal reference (or do so, but with a number of mismatches above a defined threshold) are encoded in the indc descriptor. A padding operation can be performed to the edges of the part of the internal reference used in order to reduce the entropy of the mismatches encoded in the indc descriptor, as shown in FIG. 26. The most appropriate padding strategy can be chosen by the encoder according to the statistical properties of the genomic data being processed. Possible padding strategies include:
  - a. No padding
  - b. Constant padding pattern chosen according to its frequency in the currently encoded data.
  - c. Variable padding pattern according to the statistical properties of the current context defined in terms of the latest N encoded reads

The specific type of padding strategy will be signaled by special values in the indc descriptor stream header

- 5. A flag that indicates if the read has been encoded using an internal self-generated, external or no-reference (rtype descriptor)
- 6. Reads which are encoded verbatim (ureads descriptor).

FIG. 26 provides an example of such coding procedure.

FIG. 27 shows an alternative encoding of unmapped reads on the internal reference where pos+pair syntax elements are replaced by a signed pos. In this case pos would express the distance—in terms of positions on the reference sequence—of the left most nucleotide position of read n with respect of the position of the left most nucleotide of read n−1.

In case reads of class U present variable length, an additional descriptor rlen is used to store each read length.

This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by those sequencing technology (e.g. from Pacific Bioscience) producing very long reads (50K+ bases) which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads defined as reads that align to two distinct portions of the genome with little or no overlap.

The approach described above can be clearly applied beyond the simple class U and could be applied to any stream containing syntax elements related to reads positions (pos streams).

Reads and Read Pairs with Multiple Mapping Positions

The case of a read or a reads pair being mapped to multiple coordinates of the reference sequence is supported by state of the art approaches like SAM or CRAM by replicating the encoded data records and by the ad-hoc additions of optional fields, with a consequent scattering in the data representation leading to an evident loss of compression efficiency due to the introduced redundancy. Moreover, some mapping configurations of paired-end reads whereby one read is paired to multiple mapping positions of the respective mate are supported only by user-defined optional fields. The same can be said for the representation of reads and read pairs which need to be split into two or more sub-sequences in order to find appropriate mapping on the reference sequence. The main results of such approaches is a proliferation of inefficient representations in terms of compression efficiency with also a major impact on data reusability and loss of information when passing from one partial representation to another.

The method described in this disclosure supports the representation of both multiple mapping positions and spliced reads and results much more efficient than existing solutions both in terms of compression and compressed data accessibility. Better compression is provided by the possibility of grouping descriptors with homogeneous statistical properties and different entropy coders with appropriate contexts. Better file accessibility is provided by the definition of indexing mechanisms that enable the decompression and retrieval of specific type of genomic data only without the need to decompress and access the entire coded information. For example the invention described in this disclosure provides indexing mechanism enabling the retrieval of sequence reads or reads pair which have multiple mapping position with or without spliced reads. This is currently not possible with state of the art genomic information representation formats such as SAM and CRAM.

When encoding read pairs with multiple alignments, SAM and CRAM lack the possibility to support the representation of multiple alignments of one read in a pair associated with a single alignment of the other read. A person skilled in the art knows that this is a frequent case in experiments such as RNA-seq and ChIP-seq and today SAM and CRAM have no way to support all the possible combinations of coupling among multiple alignments of reads in a pair. The solution proposed in this disclosure is able to support all the possible configuration today found in genomic data generated by High-Throughput Sequencing (HTS) machines.

When mapping genomic sequence reads to a reference sequence the following outcomes are possible:

- 1. the read maps to a single region of the reference sequence according to established constraints. In this can the mapping is said to be “unique”;
- 2. the read perfectly matches on more than one genomic region of the reference sequence;
- 3. a multiplicity of possible mapping positions are reported by the aligning tool. Each mapping presents a different editing distance from the respective region, but the aligner considered all of them valid. Aligners usually assign a score to each alignment in a completely implementation-dependent way. According to the assigned scores a “primary” alignment is usually flagged. It is possible that an alignment presents exactly the same editing distance and the same mismatches than some others.

Case 1 simply requires that the uniqueness of the alignment is encoded.

Case 2 need to acknowledge that a “primary” or a “secondary” alignment does not exists since all mappings are equally probable. The only additional information to be encoded is the vector of all mapping positions.

Case 3 requires that all the mapping positions are encoded as an array of encoded reads. All required descriptors disclosed in this invention have to be replicated per each mapping position when needed. Different mapping positions can present different levels of errors (substitutions, indels, clipped bases) with respect to the reference.

Spliced Reads

A spliced read is defined as a sequence read that needs to be split into two or more sub-reads in order to find suitable mapping regions on a reference sequence. In this case, the distance among the sub-reads (called “junction”) is usually too large to be considered a deletion. The mapping of a spliced read can refer to the direct or reverse strand of the reference sequence, therefore this information, called “absolute strandedness” has to be preserved and encoded.

The encoding of spliced reads requires the preservation of the mapping position of each splice, which has to be considered as a variable length read, encoded in a genomic dataset containing constant length reads only.

Descriptors for Multiple Alignments and Spliced Reads

In the following description the term template is used as in the SAM specification and identifies a nucleotides sequence part of which is sequenced on a sequencing machine or assembled from raw sequences. According to the sequencing technology used, the sequencing of a template can produce either a single sequence of nucleotides (one read) or two sequences which are said to be “paired”. In this context, a segment is defined as a contiguous sequence or subsequence.

When encoding reads or read pairs with multiple alignment position, the information to be conveyed by the encoded data is the following:

- All the mapping positions for each segment in the template (e.g. Read 1 and Read 2)
- The presence of any primary mapping for the template (i.e. Read 1 and Read 2)

In this invention disclosure, the coding of multiple mapping positions and spliced reads is supported by:

- 1. two global flags defined at the level of the encoded dataset and valid for all Access Units composing the dataset:
  - ma: when set it indicates the presence of multiple alignments in the dataset
  - sr: when set it indicates the presence of splices reads in the dataset
- 2. the following descriptors:
  - the mmap descriptor is used to signal in how many positions the read or the leftmost read of a pair has been aligned; the first number (N) refers to the read as a single segment (if sr flag=0) or instead to all the segments into which the read has been spliced for the multiplicity of possible alignments. The value of N indicates how many values of the pos descriptor are coded for the template in this record. N is followed by one or more numbers M_ias described below for the case without splices and with splices.
  - the splen descriptor is used in the case of spliced alignments. It is a list of lengths signaling how many bases compose each splice of the current record. Since the length of each read is known (from the global reads length in case of constant reads length or from the len descriptor in case of variable length reads), this descriptor is used by the decoder to calculate how many segments have been used for each alignment.
  - the mmscore descriptor provides the value of the score per alignment and the indication of the primary alignment. A score provides the level of confidence assigned by the aligning tool to each specific mapping configuration. It usually assumes floating point values. For example the primary alignment (i.e. the one considered the best candidate) can be the one with the highest score.
  - the mmsc descriptor supports alternative secondary alignments which do not preserve the same contiguity of mapping of the primary alignment (the most common example is a case where the CIGAR string contains indels while the primary one only contains M). In this case it is necessary to encode the difference in contiguity between the primary and the other alignments.
    Multiple Alignments without Splices

If no splices are present in the dataset, the global sr flag is unset and the splen descriptor is not used. In paired-end sequencing, the mmap descriptor is structured into one value N followed by one or more numbers (M_i), with i going from 1 to the number of complete first (the leftmost) read alignments (N₁). If no splices are present for that read, then N₁=N. For each first read alignment, spliced or not, one value M_iis used to signal how many segments are used to align the second read (in this case equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.

The values of M_ishall be used to calculate P=Σ_i=1^NM_iwhich indicates the number of alignments of the second read.

This method is illustrated in FIG. 17.

A special value of Mi=0 indicates that the i^thalignment of the left-most read is paired with an alignment of the right-most read that is already paired with a k^thalignment of the left-most read with k<i (then there is no new alignment detected, which is consistent with the equation above).

As an example, in the simplest cases:

- 1. If there is a single alignment for the leftmost read and two alternative alignments for the rightmost, N assumes the value of 1 and M₁assumes the value of 2.
- 2. If two alternative alignments are detected for the leftmost read, but only one for the rightmost, N assumes the value of 2, M₁of 1 and M₂of 0.

When M_iis equal to 0 the associated value of the pair descriptor points to an existing second read alignment.

FIG. 17 illustrates the meaning of N, P and M_iin case of multiple alignments without splices and FIG. 18 shows how the pos, pair and mmap descriptors are used to encode the multiple alignments information.

Multiple Alignments with Splices

In case of spliced reads, the splen descriptor is used. It is composed for each record of an array of N+P values. The first N values indicate the length of each aligned segment of the single read or the first (e.g. the leftmost) read of a pair. The following P values indicate the length of each aligned segment of the second read, in case of paired-end sequencing. P is calculated as calculate P=Σ_i=1^NM_iwhere each value of M_iapplies to each individual alignment of the first read to compose an alignment of the entire template.

The first N values of the splen descriptor for a record enable the calculation of N₁, which represents the number of alignments of the first read. If N₁=N no splices are present for the first read.

The following P values of the splen descriptor for a record enable the calculation of N₂, which signals the number of alignments of the second read. If N₂=P no splices are present for the second read.

The mmap and the splen descriptors defined enable the unique identification of how many reads or read pairs represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in FIG. 19 and FIG. 20.

Alignment Score

The mmscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it has N₁values per template; in paired-end sequencing it has a value for each alignment of the entire template. In other words every pair composed by one alignment of read 1 and one alignment of read 2 can have an associated score. In case of paired end reads the number of total scores is calculated as

N.scores=MAX(N₁,N₂)+M₀

where N₁is the number of total alignments of read 1, N₂is the number of total alignments for read 2 and the number of M_ivalues that are equal to 0.

In case of single reads the number of scores equals to N₁

Multiple Alignments without Splices Descriptors

The table below summarizes the semantic and effect of the use of mmap and mmscore descriptors defined in this invention disclosure in case of multiple alignments without spliced reads.

Semantic mmap mmscore Effect Read (or paired- Single read: N Single read: N values Single read: end read) with Read pair: Read pair: MAX(N, Σ(M_i)) the read has multiple mappings multiple N, M_i values + Σ(M_i= 0) and it is encoded as a sequence mappings, not where Introducing a separator would of N consecutive segments spliced 1 ≤ i ≤ N enable having an arbitrary belonging to the class with the number of scores. Otherwise a highest ID. 0 should be used if not present. N pos descriptors are used These are floating point values. Read pair: the read pair has multiple mappings and it is encoded as a sequence of N segments for the first read P = Σ(M_i) pairings to the alignments of the second read N pos descriptors are used NP pair descriptors are used with NP = Σ_i=1^NMi + (no. of Mi = 0) + (optional) 1 The optional pairing descriptor is used when alignments are present on different reference sequences than the one currently encoded N + 1 mmap descriptors read (pair) 0 0 uniquely mapped

Multiple Alignments with Splices Descriptors

The table below summarizes the semantic and effect of the use of mmap and mmscore descriptors defined in this invention disclosure in case of multiple alignments with spliced reads.

Semantic mmap mmscore Effect Read (or paired- Single read: N Single read: N₁values Single read: end read) with Read pair: Read pair: Max(N₁, N₂) + the read has multiple mappings multiple N, M_i Σ(M_i== 0) values and it is encoded as a sequence mappings, with where N₁and N₂are calculated using of N consecutive segments splices 1 ≤ i ≤ N the N + P splen descriptors belonging to the class with the P = Σ_i=1^N1Mi highest ID. Introducing a separator would N pos descriptors are used enable having an arbitrary Read pair: number of scores. Otherwise a the read pair has multiple 0 should be used if not present. mappings and it is encoded These are floating point values as a sequence of as defined for the MAF format N segments for the first read P = Σ(M_i) pairings to the alignments of the second read N pos descriptors are used read (pair) 0 0 uniquely mapped

Multiple Alignments on Different Chromosomes

It may happen that the alignment process finds alternative mappings to another reference sequence than the one where the primary mapping is positioned.

In this case it is of paramount importance, in terms of applications, to maintain a fast (in terms of random access complexity) link between the two or more, necessarily different, Access Units where the several alternative mappings for a template are coded.

For read pairs that are uniquely aligned, this invention disclosure defines a descriptor named pair used to represent a chimeric alignment where the two reads in a pair are mapped on different chromosomes. This descriptor can be used to signal the reference and the position of the next record containing further alignments for the same template. This is shown in FIG. 30.

In case one or more alignments for the leftmost read in a pair are present on a different reference sequence than the one related to the currently encoded AU, a reserved value of the pair descriptor shall be used (not the same as the one used for alignments present to another reference in case of unique alignment). The reserved value shall be followed by the reference and position of the leftmost alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).

Multiple Alignments with Insertions, Deletions, Unmapped Portions

In some cases multiple alignments can present different configurations of matching and mismatching bases, insertions, deletions and soft clips. For example it is not infrequent that while a primary alignment only has matching or mismatching bases and it is therefore mapped as a contiguous sequence of nucleotides, secondary alignments may present insertions, deletions, soft clips or splices. Throughout this disclosure, when a mapped read does not contain insertions, deletions or soft clipped bases is said to have mapping contiguity. This invention disclosure defines a third descriptor signaling whether the secondary alignment belongs to a data class which preserves the same mapping contiguity of the primary alignment (like if it were P, M or N) or not (I, U, spliced). This descriptor, mmsc (for multiple mapping sub-class) is in principle only a flag per alignment. When mmsc is set, it is followed by the verbatim representation of the SAM cigar string generated by the aligner extended with an additional symbol “U” representing unmapped nucleotides and followed by the string of unmapped nucleotides. FIG. 31 illustrates an example of use of this descriptor. The syntax of this descriptor is the following:

- If all alignments share the same mapping contiguity a single value is present and set to 0 (N=0).
- If at least one secondary alignment does not preserve the mapping contiguity the descriptor is structured as follow:
  - The first value N>0 indicates how many secondary alignments do not preserve mapping contiguity. The following elements are then repeated N times per each of the secondary alignments not preserving mapping contiguity:
  - One value indicates the pairing value related to the secondary alignment not preserving the mapping contiguity
  - The following value indicates which read in the pair does not preserve the mapping contiguity
  - The following value contains the cigar string representing the mapping characteristics of the read not preserving the mapping contiguity
  - One optional field present only if the cigar string contains the symbol “S” contains the verbatim string of soft-clipped nucleotides.

Reference Sequences Descriptors

A reference sequence is commonly represented as a string of symbols representing the nucleotides that can be found in the corresponding biological sample. In the case of DNA the nucleotides are four and represented by the symbols A, C, G and T. In the case of RNA the T is replaced by U. A fifth symbol is added to denote a coordinate in the sequence where the sequencing device was not able to determine the type of nucleotides according to the confidence required by the experiment. In this invention disclosure a reference sequence can be encoded either entirely in one Access Unit or partitioned into two or more sub-sequences.

The descriptor used to represent reference sequences or sub-sequences to be entropy coded is the verbatim representation of the sequence or sub-sequence in terms of the allowed symbols of the respective alphabet.

Source Models, Entropy Coders and Coding Modes

For each data class, sub-class and associated descriptor stream of the genomic data structure disclosed in this invention, different coding algorithms may be adopted according to the specific features of the data or metadata carried by each stream and its statistical properties. The “coding algorithm” has to be intended as the association of a specific “source model” of the descriptor stream with a specific “entropy coder”. The specific “source model” can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy. The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific “coding algorithm”, also referred to as “coding mode” can be applied to an entire “descriptor stream” associated to a data class or sub-class for the entire data set, or different “coding modes” can be applied for each portion of descriptors partitioned into Access Units.

Each “source model” associated to a coding mode is characterized by:

- The definition of the syntax elements emitted by each source (i.e. the set of descriptors used to represent a class of data such as reads position, reads pairing information, mismatches with respect to a reference sequence as defined in Table 2).
- The definition of the associated probability model.
- The definition of the associated entropy coder.

Further Advantages

The classification of sequence data into the defined data classes and sub-classes permits the implementation of efficient coding modes exploiting the lower information source entropy characterizing by modelling the sequences of syntax elements by single separate data sources (e.g. distance, position, etc.).

Another advantage of the invention is the possibility to access only the subset of type of data of interest. For example one of the most important application in genomics consists in finding the differences of a genomic sample with respect to a reference (SNV) or a population (SNP). Today such type of analysis requires the processing of the complete sequence reads whereas by adopting the data representation disclosed by the invention the mismatches are already isolated into one to three data classes only (depending on the interest in considering also “n type” and “i type” mismatches). A further advantage is the possibility of performing efficient transcoding from data and metadata compressed with reference to a specific “external” reference sequence to another different “external” reference sequence when new reference sequences are published or when re-mapping is performed on the already mapped data (e.g. using a different mapping algorithm) obtaining new alignments.

FIG. 28 shows an encoding apparatus 287 according to the principles of this invention. The encoding apparatus 287 receives as input a raw sequence data 289, for example produced by a genome sequencing apparatus 280. Genome sequencing apparatus 280 are known in the art, like the Illumina HiSeq 2500, the Thermo-Fisher Ion Torrent devices or the Oxford Nanopore MinION. The raw sequence data 289 is fed to an aligner unit 281, which prepares the sequences for encoding by aligning the reads to a reference sequence 2820. Alternatively, a dedicated module 282 can be used to generate a reference sequence from the available reads by using different strategies as described in this document in section “Construction of internal references for unmapped reads of Class U” and “Class HM”. After having been processed by the reference generator 282, reads can be mapped on the obtained longer sequence. The aligned sequences are then classified by data classification module 284. The data classes 288 are then fed to descriptors encoders 285-287. The genomic descriptors streams 2811 are then fed to arithmetic encoders 2812-2814 which encode the layers according to the statistical properties of the data or metadata carried by the layer. The result are one or more genomic streams 2815.

FIG. 29 shows a decoding apparatus 298 according to the principles of this disclosure. A decoding apparatus 298 receives a multiplexed genomic bitstream 2910 from a network or a storage element. The multiplexed genomic bitstream 2910 is fed to a demultiplexer 290, to produce separate genomic bitstreams 291 which are then fed to entropy decoders 292-294, to produce genomic descriptors streams 295. The extracted genomic descriptors streams are fed to descriptors decoders 296-297 to further decode the descriptors into classes of sequence reads. Class decoders 299 further process the genomic descriptors 2911 and the transformed reference 2914, and merge the results to produce uncompressed reads of sequences, which can then be further stored in the formats known in the art, for instance a text file or zip compressed file, or FASTQ or SAM/BAM files.

Class decoders 299 are able to reconstruct the original genomic sequences by leveraging the information on the original reference sequences carried by one or more genomic bitstreams. In case the reference sequences are not transported by the genomic streams they must be available at the decoding side and accessible by the class decoders.

The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.

The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.

Access Unit Types

Throughout this invention disclosure, the genomic data classified in data classes and structured in compressed or uncompressed layers are organized into different Access Units as defined above. Access Units are differentiated by:

- type, characterizing the nature of the genomic data and data sets they carry and the way they can be accessed,
- order, providing a unique order to Access Units belonging to the same type.

Access units of any type can be further classified into different “categories”.

Hereafter follows a non-exhaustive list of definition of different types of genomic Access Units:

1) Access units of type 0 do not need to refer to any information coming from other Access Units to be accessed or decoded and accessed. The entire information carried by the data or data sets they contain can be independently read and processed by a decoding device or processing application. As an example, but not as a limitation, Access Units of type 0 can be used to carry encoded reference sequences such as chromosomes or entire reference genomes or parts thereof.
2) Access units of type 1 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 1 requires having access to one or more Access Units of type 0. Access unit of type 1 encode genomic data related to sequence reads of “Class P”
3) Access Units of type 2 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 2 requires having access to one or more Access Units of type 0. Access unit of type 2 encode genomic data related to sequence reads of “Class N”
4) Access Units of type 3 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 3 requires having access to one or more Access Units of type 0. Access unit of type 3 encode genomic data related to sequence reads of “Class M”
5) Access Units of type 4 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 4 requires having access to one or more Access Units of type 0. Access unit of type 4 encode genomic data related to sequence reads of “Class I”
6) Access Units of type 5 contain reads that cannot be mapped on any available reference sequence (“Class U”) and are encoded used an internally constructed reference sequence. Access Units of type 5 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 5 requires having access to one or more Access Units of type 0.
7) Access Units of type 6 contain read pairs where one read can belong to any of the four classes P, N, M, I and the other cannot be mapped on any available reference sequence (“Class HM”). Access Units of type 6 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 6 requires having access to one or more Access Units of type 0.
8) Access Units of type 7 contain metadata (e.g. quality scores) and/or annotation data associated to the data or data sets contained in the access unit of type 1. Access Units of type 7 may be classified and labelled in different layers.
9) Access Units of type 8 contain data or data sets classified as annotation data. Access Units of type 8 may be classified and labelled in layers.
10) Access Units of additional types can extend the structure and mechanisms described here. As an example, but not as a limitation, the results of genomic variant calling, structural and functional analysis can be encoded in Access Units of new types. The data organization in Access Units described herein does not prevent any type of data to be encapsulated in Access Units being the mechanism completely transparent with respect to the nature of encoded data.

Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing).

Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming).

FIG. 22 shows how Access Units are composed by a header and one or more streams of homogeneous data. Each stream can be composed by one or more blocks. Each block contains several packets and the packets are a structured sequence of the descriptors introduced above to represent e.g. reads positions, pairing information, reverse complement information, mismatches positions and types etc.

Each Access unit can have a different number of packets in each block, but within an Access Unit all blocks have the same number of packets.

Each data packet can be identified by the combination of 3 identifiers X Y Z where:

- X identifies the access unit it belongs to
- Y identifies the block it belongs to (i.e. the data type it encapsulates)
- Z is an identifier expressing the packet order with respect to other packets in the same block

FIG. 23 shows an example of Access Units and packets labelling where AU_T_N is an access unit of type T with identifier N which may or may not imply a notion of order according to the Access Unit Type. Identifiers are used to uniquely associate Access Units of one type with those of other types required to completely decode the carried genomic data.

Access Units of any type can be further classified and labelled in different “categories” according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when

- 1. sequencing the same organism at different times (Access Units contain genomic information with a “temporal” connotation),
- 2. sequencing organic samples of different nature of the same organisms (e.g. skin, blood, hair for human samples). These are Access Units with “biological” connotation.

Data Storage Modes

When storing coded descriptors on a storage medium, two are the approaches described in this invention disclosure:

- 1. Access Unit Contiguous (AUC) mode
- 2. Descriptor Stream Contiguous (DSC) mode

When AUC is applied encoded data blocks belonging to the same Access Unit (but different Descriptors Streams) are stored in contiguous areas of the storage medium.

The AUC mode can be implemented in two different ways in terms of data storage:

- a) Class Contiguous (CC) in which the order of the Access Units contiguously stored on the storage medium can be class-based, i.e. all Access Units of the same class (Class_ID) are stored contiguously in the storage medium.
- b) Genomic Region Contiguous (GRC) in which the order of the Access Units reference-based, i.e. all Access Units with the same AU_ID (i.e., mapping to the same genomic region) are stored contiguously.

The AUC/CC mode method is more efficient when accessing data of a single class. The AUC/GRC mode is more efficient when accessing data of any class mapping to the same genomic region. The invention described in this disclosure and the related syntax supports all modes DSC, AUC/CC and AUC/GRC methods and leaves the encoder with the freedom to select any mode according to the desired selective access performance. If AUC/CC mode or AUC/GRC mode is used is signaled by a flag named CC_Mode_Flag carried by the Genomic Dataset Header as listed in Table 2.

When DSC is applied blocks belonging to the same Descriptors Stream are stored in contiguous areas of the storage medium. Genomic data are in fact organized per Descriptors Streams (composed by one or more Blocks), representing homogeneous data in terms of entropy encoding.

The storage alternative used in a genomic dataset among the coding methods is signaled by flags referred to as Block_Header_Flag stored in the Genomic Dataset Header as listed in Table 2.

The difference between the AUC and DSC modes is illustrated in FIG. 21 where Access Units are built in a coordinate system having on the vertical dimension the Descriptor_ID which identifies the type of descriptor coded in data block Bn_m and on the horizontal dimension the Access Unit ID. The data block Bn_m contains coded descriptors of type (i.e. identifier) n for Access Unit m.

Indexing Compressed Genomic Data for Efficient Selective Access

In order to support selective access to specific regions of the aligned data, two data structures are described in this disclosure: a Genomic Dataset Header carrying global parameters and the indexing tool called Master Index Table (MIT) used during the encoding and decoding process. The syntax of the Genomic Dataset Header is provided in Table 2 and the syntax of the Master index table is provided in Table 3.

Genomic Dataset Header

The Genomic Dataset Header is a data structure carrying global parameters used by the encoder and the decoder to manipulate the encoded genomic information.

The information carried by the Genomic Dataset Header includes:

- a dataset group identifier used to uniquely identify each dataset group,
- a genomic dataset identifier used to uniquely identify each dataset,
- a brand identifier used to identify the data format specification the dataset complies with,
- a minor version number used to identify the data format specification the dataset complies with,
- the length of encoded genomic reads in nucleotides used to signal constant length reads,
- a flag signaling the presence of paired end reads,
- a flag signaling the presence of blocks headers,
- a flag signaling which AU coding mode is used in the dataset: AUC/CC or AUC/GRC,
- the type of alphabet used to encode mismatches of sequence reads with respect to the reference sequences,
- the number of reference sequences used to code the dataset,
- a numeric identifier per each reference sequence used to uniquely identify each reference sequence,
- a string identifier per each reference sequence used to uniquely identify each reference sequence,
- the number of coded Access Units per reference sequence used to count the Access Units associated to each reference sequence,
- the type of coded genomic data used to distinguish among aligned reads, unaligned reads, unmapped reads and reference sequences,
- the number of data classes coded in the dataset,
- the number of descriptors used per each data class coded in the dataset used during the decoding process,
- the total number of clusters used to index encoded unmapped reads,
- the number of bits used to represent the integer values used to encode the cluster signatures used to decode encoded clusters signatures,
- a flag signaling if all the cluster signatures have the same length in terms of number of nucleotides,
- the length of the clusters signatures.

The syntax and semantics of each element of the Genomic Dataset Header are listed in Table 2 below.

TABLE 2 Genomi Dataset Header syntax Syntax Semantic Dataset_Header { Dataset_Group_ID Identifier of Dataset Group ID Dataset_ID Identifier of the Dataset. Its value shall be one of the IDs listed in the Datasets Group Header. Major_Brand Brand identifier, identifying the data (compression) format specification the Dataset complies with. Minor_Version Minor version number of the data (compression) format specification the Dataset complies with. Reads_Length Length of reads in nucleotides. 0 in case of variable reads length. Paired_Reads 1 if all reads in the Dataset are paired, 0 otherwise. Block_Header_Flag If set, all Blocks composing the Dataset are preceded by a Block Header CC_Mode_Flag Used to signal the AU coding mode: AUC/CC or AUC/GRC. If set, two Access Units of one class cannot be separated by Access Units of a different class in the storage device. If unset, Access Units are ordered by Access Unit Start Position in the storage device. Alphabet_ID A unique identifier for the type of alphabet used to encode mismatches of sequence reads with respect to the reference sequences. For example it is used to signal if DNA, RNA, IUPAC alphabets Seq_Count are used. Number of reference sequences used in this Dataset. for (seq=0; seq<Seq_Count; seq++) { Reference_ID[seq] Unique identification number of the reference within this Dataset Group. Sequence_ID[seq] Unambiguous identifier identifying the reference sequence(s) used in this Dataset. In common reference genomes such as GRCh37 these include at least the 24 chromosomes identifiers. } for (seq=0; seq<Seq_Count; seq++) { Seq_Blocks[seq] Number of AUs per sequence. A value of 0 means not specified (e.g. in Transport Format). } Dataset type Type of genomic data coded in the dataset. For example “aligned”, “not aligned”. Num_Classes Number of classes encoded in the Dataset. for (Class_ID =0;Class_ID<Num_Classes; Class_ID ++) { Num_Descriptors[Class_ID] Maximum number of Descriptors per Class encoded in the Dataset. } U_clusters_num Number of clusters of unmapped reads U_signature_constant_length 1 = constant length signature 0 = variable length signature U_signature_size Size in bits of each integers used to represent the encoded signatures if(U_signature_constant_length){ U_signature_length Length of clusters signature in nucleotides } }

Master Index Table

An indexing tool called Master Index Table (MIT) is disclosed in this invention.

The Master Index Table (MIT) is a data structure based on multi-dimensional array containing:

- The position, as number of nucleotides, of the left-most matching base among the primary alignments of all reads or read pairs contained in the Access Unit, as a set of Blocks from different Descriptors Streams, with regards to the reference sequence. This is represented by Start_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] in Table 3.
- The position, as number of nucleotides, of the right-most matching base among the primary alignments of all reads or read pairs contained in the Access Unit, as a set of Blocks from different Descriptors Streams, with regards to the reference sequence. This is represented by End_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] in Table 3.
- The byte offset of the first byte of each coded Block of descriptors composing each AU of each class encoded with respect to each reference sequence. The offset is calculated with respect to the first byte of the Dataset Payload (0-based). If the Block is empty and (1) Block_Header_Flag is set it is equal to 0xFFFFFFFF. If the Block is empty and (2) Block_Header_Flag is unset, it is equal either to the Block_Byte_Offset value of the next block in the Descriptor Stream or, for the last Block in the Descriptor Stream, to Descriptor_Stream_Size[Class_ID]. This is represented by Block_Byte_Offset[Sequence_ID][Class_ID][AU_ID][Descriptor_ID] in Table 3.
- According to the coding method used, which is signaled by a global configuration parameter, two alternative blocks of information:
  - The size of each Access Unit in bytes if each Access Unit is stored on the storage medium as a contiguous block of data, or
  - The size of each block of encoded descriptors if all descriptors of the same type are encoded and stored on the storage medium as a contiguous block of data.

The last section of the MIT contains two alternative sections that are used according to the presence of headers prepended to each coded Block of descriptors. If Block headers are present (Block_Header_Flag set) the MIT contains the size in byte of each Descriptor Stream. If Block headers are not present (Block_Header_Flag unset) the MIT contains the size in bytes of each Access Unit. The alternative between the two coding methods is signaled by the flag called Block_Header_Flag in Table 2.

TABLE 3 Master Index Table Syntax Semantics Master_Index_Table { for (Sequence_ID=0;Sequence_ID<Seq_Count;Sequence_ID++) { Sequence_ID is a unique identifier of a reference sequence and corresponds to the variable seq in the Dataset Header, as defined in Table 2. Seq_Count is the total number of reference sequences and is encoded in the Genomic Dataset Header, as defined in Table 2. for (Class_ID=0;Class_ID<Num_Classes - 1;Class_ID++) { Num_Classes is equal to field Num_Classes of Genomic Dataset Header, as defined in Table 2. for (AU_ID=0;AU_ID<Seq_Blocks[Sequence_ID];AU_ID++) { AU_ID is a unique identifier of the Access Unit. Seq_Blocks is an array of number of blocks per reference sequence and is encoded in the Dataset Header, as defined in Table 2. Start_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] The absolute position on the reference sequence identified by Sequence ID of the left-most mapped base among the primary alignments of all reads or read pairs contained in the AU and belonging to the class identified by Class_ID, in Access Unit identified by AU_ID. A reserved value is used if the Access Unit is empty. End_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] The absolute position on the reference sequence identified by Sequence ID of the right-most mapped base among the primary alignments of all reads or read pairs contained in the AU and belonging to the class identified by Class_ID, in Access Unit identified by AU_ID. A reserved value is used if the Access Unit is empty. for (Descriptor_ID=0; Descriptor_ID < Num_Descriptors[Class_ID] Num_Descriptors[Sequence_ID][Class_ID]; Descriptor_ID++) { [Sequence_ID] is encoded in the Num_Descriptors field of the Dataset Header. Syntax of Num_Descriptors Dataset Header field is defined in Table 2. Byte offset of the first byte in the Block_Byte_Offset[Sequence_ID][Class_ID][AU_ID][Descriptor_ID] Block, with respect to the first byte of the Dataset Payload (0-based). If the Block is empty and (1) Block_Header_Flag is set it is equalt o the reserved value signaling empty AU. If the Block is empty and (2) Block_Header_ Flag is unset, it is equal either to the Block_Byte_Offset value of the next block in the Descriptor Stream or, for the last Block in the Descriptor Stream, to Descriptor_Stream_ Size[Class_ID]. } } } } if (!Block_Header_Flag) { Block_Header_Flag is encoded in the Dataset Header, as defined in Table 2. for (Class_ID=0;Class_ID<Num_Classes;Class_ID++) { for (Descriptor_ID=0; Descriptor_ID < Num_Descriptors[Seq_Count-1][Class_ID]; Descriptor_ID++) { Descriptor_Stream_Size[Class_ID][Descriptor_ID] Descriptors Stream size in bytes. } } } else { for (Sequence_ID=0;Sequence_ID<Seq_Count;Sequence_ID++) { for (Class_ID=0;Class_ID<Num_Classes;Class_ID++) { for (AU_ID=0;AU_ID<Seq_Blocks[Sequence_ID];AU_ID++) { AU_Size[Sequence_ID][Class_ID][AU_ID] Access Unit size in bytes. } } } } for (Cluster_ID=0;Cluster_ID<U_Clusters_num;Cluster_ID++) { for (AU_ID=0;AU_ID<Seq_Blocks[Sequence_ID];AU_ID++) { U_Cluster_Signature[Cluster_ID][AU_ID] Cluster signature for (Descriptor_ID=0; Descriptor_ID < Num_Descriptors[Sequence_ID][Class_ID]; Descriptor_ID++) { U_Cluster_Byte_Offset[Cluster_ID][AU_ID][Descriptor_ID] Byte offset of the first byte in the Block, with respect to the first byte of the Dataset Payload (0-based). } } }

Master Index Table and Multiple Alignments

In case of presence of multiple alignments, the MIT introduced above is replicated to provide an indexing tool which takes into consideration the multiple alignments of reads or reads pairs coded into the Access Unit. The extended Master Index Table contains

- The position, as number of nucleotides, of the left-most matching base among all alignments of all reads or read pairs contained in the Access Unit, as a set of Blocks from different Descriptors Streams, with regards to the reference sequence. This is represented by Start_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] in Table 3.
- The position, as number of nucleotides, of the right-most matching base among all alignments of all reads or read pairs contained in the Access Unit, as a set of Blocks from different Descriptors Streams, with regards to the reference sequence. This is represented by End_AU_Ref_Position[Sequence_ID][Class_ID][AU_ID] in Table 3.

Claims

1. A method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:

aligning said reads to one or more reference sequences thereby creating aligned reads,

classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads;

encoding said classified aligned reads as a multiplicity of streams of syntax elements;

wherein encoding said classified aligned reads as a multiplicity of streams of syntax elements comprises selecting said syntax elements according to said classes of aligned reads;

providing said streams of syntax elements with header information thereby creating successive data blocks in order to entropy code said genomic data blocks into separately accessible data units.

2. The encoding method of claim 1, further comprising:

classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads;

encoding said classified unmapped reads as a multiplicity of streams of syntax elements,

providing said streams of syntax elements and said encoded reference sequences with header information thereby creating successive Access Units.

3. The method of claim 2, wherein said classifying comprises identifying genomic reads having multiple alignment positions on the reference sequences used for alignment

4. The method of claim 3, wherein said classifying comprises identifying genomic reads which need to be split into multiple segments named splices to satisfy the matching rules for alignment.

5. The encoding method of claim 4, wherein the reads of the genomic sequence to be encoded are paired.

6. The encoding method of claim 5, further comprising the steps of:

Identifying the number of alignments of a read and representing this number with a specific syntax element;

Identifying per each alignment the corresponding mapping position and representing each mapping position with a specific syntax element;

Assigning an alignment score to each alignment to identify primary and secondary alignments;

Identifying as primary alignment the alignment with the highest score;

Identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor; and

Identifying if any alignment does not preserve the a different contiguity on the reference sequence of the primary alignment and representing this information using a specific syntax element.

7. The encoding method of claim 6 further comprising the steps of:

Identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according to pre-defined matching rules defining the matching with said one or more reference sequences;

Signaling the presence of spliced reads using a global configuration parameter;

Representing the number of splices using a specific syntax element; and

Representing the length of each splice using a specific syntax element.

8. The encoding method of claim 7, further comprising the steps of:

Identifying the number of alignments of each read in a pair and representing this number with a specific syntax element;

Identifying per each alignment of the left-most read in the pair the corresponding mapping position and representing each mapping position with a specific syntax element;

Identifying per each alignment of the left-most read the associated alignments of the right-most read in the pair and representing the association with a specific syntax element;

Assigning an alignment score to each pair of alignments to identify primary and secondary alignments;

Identifying as primary alignment the pair of alignments with the highest score;

Identifying if any alignment is found on a different reference than the primary and representing this information using a specific descriptor; and

Identifying if any alignment presents a different contiguity on the reference sequence than the primary alignment and representing this information using a specific syntax element.

9. The encoding method of claim 8, further comprising the steps of:

Identifying reads which need to be split into two or more splices in order to be aligned on the reference sequence according pre-defined matching rules;

Signaling the presence of spliced reads using a global configuration parameter;

Representing the number of splices of the left-most read in a pair using a specific syntax element;

Representing the number of splices of the right-most read associated to each alignment of the left-most read with a vector of specific syntax elements; and

Representing the length of each splice using a specific syntax element.

10. The method of claim 9, wherein said streams of syntax elements comprise a genomic dataset header comprising:

a dataset group identifier used to uniquely identify each dataset group,

a genomic dataset identifier used to uniquely identify each dataset,

a brand identifier used to identify the data format specification the dataset complies with,

a minor version number used to identify the data format specification the dataset complies with,

the length of encoded genomic reads in nucleotides used to signal constant length reads,

a flag signaling the presence of paired end reads,

a flag signaling the presence of blocks headers,

a flag signaling the mode in which Access Units are stored on a storage medium in order to facilitate data access when decoding said Access Units,

the type of alphabet used to encode mismatches of sequence reads with respect to the reference sequences,

the number of reference sequences used to code the dataset,

a numeric identifier per each reference sequence used to uniquely identify each reference sequence,

a string identifier per each reference sequence used to uniquely identify each reference sequence,

the number of coded Access Units per reference sequence used to count the Access Units associated to each reference sequence,

the type of coded genomic data used to distinguish among aligned reads, unaligned reads, unmapped reads and reference sequences,

the number of data classes coded in the dataset,

the number of descriptors used per each data class coded in the dataset used during the decoding process,

the total number of clusters used to index encoded unmapped reads,

the number of bits used to represent the integer values used to encode the cluster signatures used to decode encoded clusters signatures,

a flag signaling if all the cluster signatures have the same length in terms of number of nucleotides, and

the length of the clusters signatures.

11. The method of claim 10, wherein said streams of syntax elements comprise a master index table, containing one section for each Class and sub-class of aligned reads, said section comprising:

the mapping positions on said one or more reference sequences of the primary alignment of the left-most read of each Access Units of each Class or sub-class of data,

the position on said one or more reference sequences of the right-most mapped base among all primary alignments of each Access Units of each Class or sub-class of data,

the offset in bytes of each block of coded syntax elements composing each Access Unit.

12. The method of claim 11, wherein said master index table further comprises the size of each coded descriptors stream and the size of each Access Unit.

13. The method of claim 12, wherein genomic reads have multiple alignments and said master index table comprises:

the mapping positions on said one or more reference sequences of the left-most alignment among all reads of each Access Units of each Class or sub-class of data,

the position on said one or more reference sequences of the right-most mapped base among all alignments of each Access Units of each Class or sub-class of data, and

the offset in bytes of each block of coded syntax elements composing each Access Unit.

14. The method of claim 13, wherein said Access Units contain read pairs.

15. The method of claim 14, wherein said master index table is jointly coded with said Access Unit data.

16. The method of claim 15, wherein said genomic dataset header is jointly coded with said Access Unit data.

17. The method of claim 16, wherein said streams of syntax elements further comprise information related to the type of reference used (pre-existing or constructed) and the segments of the read that do not match on the reference sequence.

18. The method of claim 17, wherein the encoding of said classified aligned reads as multiplicity of streams of syntax elements comprises the step of associating a specific source model and a specific entropy coder to each descriptor stream.

19. The method of claim 18, wherein said entropy coder is one of a context adaptive arithmetic coder, a variable length coder or a golomb coder.

20. A method for decoding encoded genomic data comprising the steps of:

parsing Access Units containing said encoded genomic data to extract multiple streams of syntax elements by employing header information; and

decoding said multiplicity of streams of syntax elements to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.

21. The decoding method of claim 20, further comprising the decoding of unmapped genomic reads.

22. The decoding method of claim 21, further comprising decoding a genomic dataset header containing global configuration parameters.

23. The decoding method of claim 22, further comprising decoding a master index table containing one section for each class of reads and the associated relevant mapping positions and coded blocks offsets.

24. The decoding method of claim 23, further comprising decoding information related to the type of reference used: pre-existing, transformed or constructed.

25. The decoding method of claim 24, wherein said genomic reads are paired.

26. The decoding method of claim 25, wherein said genomic data are entropy decoded.

27. The decoding method of claim 26, further comprising the decoding of multiple alignments information comprising the steps of:

decoding the number of alignments of each read,

decoding the position of each alignment,

identifying the primary alignment by decoding the score associated to each alignment, and

identifying if any secondary alignment has a different contiguity with respect to the reference sequence than the primary alignment by decoding the corresponding syntax elements.

28. The decoding method of claim 27, further comprising the steps of:

Identifying if any coded read is split into two or more splices,

decoding the length of each splice, and

decoding the mapping position of each splice.

29. The decoding method of claim 28, wherein the coded genomic read are paired end reads further comprising the steps of:

decoding the number of alignments of the right-most read associated with each alignment of the left-most read, and

decoding the pairing information associating each alignment of the left-most read with one or more alignment of the right-most read,

30. The decoding method of claim 29, wherein the coded genomic read are split into two or more splices further comprising the steps of:

decoding the length of each coded splice, and

decoding the mapping position of each splice.

31. A genomic encoder (2810) for the compression of genome sequence data 289, said genome sequence data 289 comprising reads of sequences of nucleotides, said genomic encoder (2810) comprising:

an aligner unit (281), configured to align said reads to one or more reference sequences thereby creating aligned reads;

a constructed-reference generator unit (282), configured to produce constructed reference sequences;

a data classification unit (284), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (288);

one or more descriptors encoding units (285-287), configured to encode said classified aligned reads as streams of syntax elements by selecting said syntax elements according to said classes of aligned reads;

one or more entropy encoding units (2812-2814), configured to compress said streams of syntax elements according to their statistical properties to produce Genomic Streams (2815); and

a multiplexer (2816) for multiplexing the compressed genomic data and metadata.

32. The genomic encoder of claim 31, further comprising coding means suitable for executing the coding.

33. A genomic decoder (298) for the decompression of a compressed genomic stream (291) said genomic decoder (298) comprising:

a demultiplexer (290) for demultiplexing compressed genomic data and metadata,

parsing means (292-294) configured to parse said compressed genomic stream into genomic layers of syntax elements (295),

one or more layer decoders (296-297), configured to decode the genomic layers into classified reads of sequences of nucleotides (2911), and

genomic data classes decoders (299) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.

34. The genomic decoder of claim 33, wherein the one or more reference sequences are stored in a compressed genome stream (291).

35. The genomic decoder of claim 34, wherein the one or more reference sequences are provided to the decoder via an out of band mechanism.

36. The genomic decoder of claim 35, wherein the one or more reference sequences are built at the decoder.

37. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 1.

38. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 2.

39. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 3.

40. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 6.

41. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 7.

42. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 8.

43. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 9.

44. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method of claim 27.

45. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method of claim 28.

46. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method of claim 29.

47. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method of claim 30.

48. Support data storing genomic encoded according to the method of claim 1.

49. Support data storing genomic encoded according to the method of claim 2.

50. Support data storing genomic encoded according to the method of claim 3.

51. Support data storing genomic encoded according to the method of claim 6.

52. Support data storing genomic encoded according to the method of claim 7.

53. Support data storing genomic encoded according to the method of claim 8.

54. Support data storing genomic encoded according to the method of claim 9.