DOUBLE-PASS LEMPEL-ZIV DATA COMPRESSION WITH AUTOMATIC SELECTION OF STATIC ENCODING TREES AND PREFIX DICTIONARIES
A method includes receiving an input data stream at a processor, and for each byte sequence from a plurality of byte sequences of the input data stream, a hash is generated and compared to a hash table to determine whether a match exists. If a match exists, that byte sequence is incrementally expanded to include one or more additional adjacent bytes from the input data stream, to produce multiple expanded byte sequences. Each of the expanded byte sequences is compared to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences. A representation of the maximum-length matched byte sequence is stored in the memory. If a match does not exist, a representation of that byte sequence is stored as a byte sequence literal in the memory.
This application is a Continuation of U.S. patent application Ser. No. 17/382,015, filed Jul. 21, 2021 and titled “Double-Pass Lempel-Ziv Data Compression with Automatic Selection of Static Encoding Trees and Prefix Dictionaries,” which claims the benefit of U.S. Provisional Patent Application No. 63/056,160, filed Jul. 24, 2020 and titled “Double-Pass Lempel-Ziv Data Compression with Automatic Selection of Static Encoding Trees and Prefix Dictionaries,” the entire contents of each of which are incorporated by reference herein.
FIELD
The present disclosure relates to systems and methods for compressing data in a lossless manner, with particular improvements in compression and decompression performance across a range of data.
BACKGROUND
The process of reducing the size of a data file is often referred to as data compression. Data compression involves encoding information using fewer bits than the original representation, and can be lossless or lossy.
SUMMARY
Systems and methods for encoding and decoding data are described. In addition, systems and methods for performing searches on compressed data, without decompressing the data, are described. Example features, structure and operation of various embodiments are described in detail below with reference to the accompanying drawings.
In some embodiments, a method includes receiving an input data stream at a processor. For each byte sequence from a plurality of byte sequences of the input data stream, a hash of that byte sequence is generated by the processor. The hash is compared, via the processor, to a hash table to determine whether a match exists. The hash table is stored in a memory operably coupled to the processor. If a match exists, that byte sequence is incrementally expanded, via the processor, to include one or more additional adjacent bytes from the input data stream, to produce a plurality of expanded byte sequences. Each expanded byte sequence from the plurality of expanded byte sequences is compared, via the processor, to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences, and a representation of the maximum-length matched byte sequence is stored in the memory. If a match does not exist, a representation of that byte sequence is stored as a byte sequence literal in the memory.
In some embodiments, a system includes a processor and a memory that is operably coupled to the processor and that stores instructions that, when executed by the processor, cause the processor to perform a method. The method includes, for each byte sequence from a plurality of byte sequences of an input data stream, generating a hash of that byte sequence and comparing the hash to a hash table to determine whether a match exists. The hash table is stored in a memory operably coupled to the processor. If a match exists, that byte sequence is incrementally expanded to include one or more additional adjacent bytes from the input data stream, to produce a plurality of expanded byte sequences. Each expanded byte sequence from the plurality of expanded byte sequences is compared to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences. A representation of the maximum-length matched byte sequence is stored in the memory. If a match does not exist, a representation of that byte sequence is stored as a byte sequence literal in the memory.
In some embodiments, a non-transitory, processor-readable medium stores instructions to perform a method. The method includes, for each byte sequence from a plurality of byte sequences of an input data stream, generating a hash of that byte sequence and comparing the hash to a hash table to determine whether a match exists. The hash table is stored in a memory operably coupled to the processor. If a match exists, that byte sequence is incrementally expanded to include one or more additional adjacent bytes from the input data stream, to produce a plurality of expanded byte sequences. Each expanded byte sequence from the plurality of expanded byte sequences is compared to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences. A representation of the maximum-length matched byte sequence is caused to be stored in the memory. If a match does not exist, a representation of that byte sequence as a byte sequence literal is caused to be stored in the memory.
In some embodiments, a method includes receiving, at a processor, an input data stream, and generating, via the processor and for each byte sequence from a plurality of byte sequences of the input data stream, a hash of that byte sequence, to define a plurality of hashes. The method also includes storing, in a memory operably coupled to the processor, an array that includes (1) a plurality of positions, each position from the plurality of positions being a position within the input data stream of a hash from the plurality of hashes, and (2) a last observed position of each hash from the plurality of hashes. The method also includes identifying, via the processor, a plurality of potential matches between the plurality of byte sequences and a hash table based on the array, and calculating a score, from a plurality of scores, for each potential match from the plurality of potential matches. A subset of potential matches is selected from the plurality of potential matches, based on the plurality of scores, and a representation of the selected subset of potential matches is stored in the memory.
In some embodiments, a system includes a processor and a memory, operably coupled to the processor and storing instructions that, when executed by the processor, cause the processor to perform a method. The method includes generating, for each byte sequence from a plurality of byte sequences of an input data stream, a hash of that byte sequence, to define a plurality of hashes. The method also includes storing, in the memory, an array that includes (1) a plurality of positions, each position from the plurality of positions being a position within the input data stream of a hash from the plurality of hashes, and (2) a last observed position of each hash from the plurality of hashes. The method also includes identifying a plurality of potential matches between the plurality of byte sequences and a hash table based on the array, and calculating a score, from a plurality of scores, for each potential match from the plurality of potential matches. The method also includes selecting a subset of potential matches from the plurality of potential matches, based on the plurality of scores, and storing a representation of the selected subset of potential matches in the memory.
In some embodiments, a method includes generating, via a processor and for each byte sequence from a plurality of byte sequences of an input data stream, a hash of that byte sequence, to define a plurality of hashes. The method also includes comparing each hash from the plurality of hashes to a hash table to identify a plurality of matched hashes associated with a first subset of byte sequences from the plurality of byte sequences, a second subset of byte sequences from the plurality of byte sequences including byte sequences that are not associated with a matched hash from the plurality of matched hashes. The method also includes selecting a static Huffman tree to encode the second subset of byte sequences, based on a predefined encoding strategy, and calculating an entropy associated with the selected static Huffman tree. The method also includes calculating a result size associated with the selected static Huffman tree, based on the entropy, and determining whether the result size is within a predefined percentage of a number of byte sequences in the second subset of byte sequences. If the result size is within the predefined percentage of the number of byte sequences in the second subset of byte sequences, an encoding type is set to static encoding. If the result size is not within the predefined percentage of the number of byte sequences in the second subset of byte sequences, the number of byte sequences in the second subset of byte sequences is less than a predefined first threshold value, and the result size is less than the predefined first threshold value, the encoding type is set to an encoding procedure that is performed based on an inverted index array and a rank table. If the result size is not within the predefined percentage of the number of byte sequences in the second subset of byte sequences, and if at least one of: (1) the number of byte sequences in the second subset of byte sequences is not less than the predefined first threshold value, or (2) the result size is not less than the predefined first threshold value: a custom prefix is generated and compared to the selected static Huffman tree. If the custom prefix is preferable to the selected static Huffman tree, the encoding type is set to custom, and if the custom prefix is not preferable to the selected static Huffman tree, the encoding type is set to static encoding.
In some embodiments, a method includes identifying, via a processor, a first subset of an input data and a second subset of the input data. A Huffman tree is selected, via the processor and from a plurality of Huffman trees, based on a predefined encoding level, and the first subset of the input data is encoded using the selected Huffman tree. A frequency curve is generated for the second subset of the input data, using one of (1) a Riemann Sum or (2) a triple lookup table having a plurality of accuracies, and the second subset of the input data is encoded based on the frequency curve.
As the availability of storage capacity and network bandwidth increases, the desirability of new and useful data compression techniques also increases. Data compression techniques can generally be divided into two major categories: lossy and lossless. Lossless data compression techniques are typically employed when it is particularly important that no information is lost during the compression/decompression process. Lossy data compression techniques are typically employed in processing applications such as the transmission and storage of digital video and audio data that can tolerate some information loss (e.g., since human vision is forgiving of potential artifacts). Lossy data compression techniques typically yield greater compression ratios than their lossless counterparts. Over the past 30 years, lossy data compression methods have gained tremendous importance for their use in video conferencing/streaming to a wide variety of devices, and home entertainment systems. Most other applications employ lossless data compression techniques.
For applications using data types such as video, it is possible to achieve compression ratios of 150:1 for Quarter Common Intermediate Format (QCIF)@15 fps over 64 Kbps (typically used in wireless video telephony applications) or 1080P High Definition (HD)@60 fps at 20 Mbps over broadband networks. These applications typically use the modern International Telecommunication Union (ITU) H.264 video compression standard, resulting in high quality video. However, for data types/files such as documents, spreadsheets, SQL files, etc., lossless data compression is generally strongly preferred. Compression ratios for lossless methods are typically much lower than those for lossy methods. For example, lossless compression ratios can range from 1.5:1 for arbitrary binary data files, to 3.0:1 for files such as text documents, in which there is substantially more redundancy.
Transmitting compressed data takes less time than transmitting the same data without first compressing it. In addition, compressed data uses less storage space than uncompressed data. Thus, for a device with a given storage capacity, more files can be stored on the device if the files are compressed. As such, two of the primary advantages for compressing data are increased storage capacity and decreased transmission time.
Embodiments of the present disclosure set forth novel methods for accomplishing data compression in lossless and/or lossy contexts. When the parsing quality of a data compression technique is higher, the compression ratio is typically also higher; however, increasing the parsing quality can also result in a slower process. In view of this trade-off, multiple different embodiments of encoders (and associated methods) are presented herein.
LZ Modeling
In some embodiments, an encoder is part of a “Lempel-Ziv” (“LZ”)-modeled encoder family. LZ modeling makes it possible for the encoder/compressor to identify byte sequences that are similar to one another within an input bit stream. The identified similar byte sequences can, in turn, be used to compress the data of the input bit stream. For example, the first time that a given byte sequence appears within the input bit stream, the LZ modeling function may identify that byte sequence as a “literal byte” sequence. Subsequently, whenever the same byte sequence occurs, the LZ modeling function can identify that byte sequence as a “match.” The foregoing process is referred to herein as “parsing” the data. As discussed above, when the parsing quality is higher, the compression ratio is typically also higher; however, increasing the parsing quality can also result in a slower process. In view of this trade-off, multiple different embodiments of encoders (and associated methods) are presented herein, ranging from encoders having the fastest compression to encoders having the slowest compression but the highest compression ratio. The encoder embodiments set forth herein leverage modern processor architectures, while innovating the manner in which data is parsed, for example using different numbers of passes based on the parsing quality selected.
In some embodiments, LZ modeling is performed on the encoder but not on the associated decoder, and the quality of the parsing used on the encoder does not affect the decoder speed.
Single-Pass Modeling
In some embodiments, a processor-implemented encoder employs one-pass modeling, or single-pass modeling (SPM), implemented by a dedicated modeling function, and exhibits the fastest parsing of the encoders described herein. SPM includes creating a hash table to check and store the positions of each byte sequence in an input bit stream. Each instance of a byte sequence having the same hash value as a previously observed instance of the byte sequence is used to overwrite that previously observed instance. A size of the byte sequences can be, for example, four bytes or six bytes, and may be determined by a size of the input bit stream. In some implementations, a size of the hash table is relatively large (e.g., 64 kilobytes (KB)), to reduce the likelihood of collisions.
The following code illustrates a process to hash a single byte sequence, according to some embodiments:
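The following is a minimal sketch in C of such a single-sequence hash, assuming a multiplicative (Knuth-style) hash into a 16-bit table index; the constant, shift, and HASH_BITS value are illustrative assumptions, not the patented values:

#include <stdint.h>
#include <string.h>

#define HASH_BITS 16   /* e.g., 2^16 slots; actual table size depends on entry width */

/* Hash one 4-byte sequence starting at p: multiply by a Knuth-style
   constant and keep the top HASH_BITS bits of the 32-bit product. */
static inline uint32_t hash_sequence(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);          /* unaligned-safe 32-bit load */
    return (v * 2654435761u) >> (32 - HASH_BITS);
}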
In some embodiments, to leverage modern x86 architectures, SPM hashes four candidate byte sequences at a time (i.e., concurrently) before checking for a match against the hash table. This allows the processor to perform the comparisons Out-of-Order (OoO) and feed the pipeline. The following code illustrates a process to hash four consecutive candidate byte sequences, according to some embodiments:
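A sketch of the four-at-a-time variant, reusing hash_sequence( ) from the sketch above; because no hash depends on another, the four multiplies can issue out of order:

/* Hash four consecutive candidate positions; the four results are
   independent, so the CPU can overlap the multiplies. */
static inline void hash_four(const uint8_t *p, uint32_t h[4])
{
    h[0] = hash_sequence(p);
    h[1] = hash_sequence(p + 1);
    h[2] = hash_sequence(p + 2);
    h[3] = hash_sequence(p + 3);
}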
The hashes of the four candidate byte sequences are then sequentially compared to the hash table to attempt to identify a match. If a match is found, an expansion function is called and used to attempt to expand the size of the matching byte sequence in a forward direction within the input bit stream (e.g., incrementally expanding the byte sequence to include bits or bytes occurring subsequent to the byte sequence within the input bit stream). To obtain the size of a match, a De Bruijn sequence can be used, which allows a fast comparison of two byte sequences and returns the size of their common substring. Depending on the desired quality level, a match can also be expanded in a backward/reverse direction within the input bit stream (e.g., incrementally expanding the byte sequence to include bits or bytes preceding the byte sequence within the input bit stream) by a backward-expansion function. To store the match, a storage function is called. In some implementations, only the first match identified is stored, and the three other matches may be used as potential matches for future byte sequences, thereby improving the overall compression ratio of the encoder. If no matches are found among the four candidate byte sequences, the byte sequences may be stored (e.g., in a separate buffer) as byte literals.
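As an illustration of the De Bruijn comparison, the sketch below XORs two byte runs a word at a time and converts the first differing bit into a byte count using the classic 32-bit De Bruijn count-trailing-zeros table; a little-endian layout is assumed, and the routine is a sketch rather than the patented implementation:

/* Classic De Bruijn count-trailing-zeros table (Sean Eron Anderson's
   bit-hack formulation). */
static const int debruijn32[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

static inline int ctz32(uint32_t x)
{
    return debruijn32[((x & -x) * 0x077CB531u) >> 27];
}

/* Length of the common prefix of two byte runs, compared 4 bytes per step. */
static size_t match_length(const uint8_t *a, const uint8_t *b, size_t max)
{
    size_t n = 0;
    while (n + 4 <= max) {
        uint32_t x, y;
        memcpy(&x, a + n, 4);
        memcpy(&y, b + n, 4);
        if (x != y)
            return n + (size_t)(ctz32(x ^ y) >> 3);  /* little-endian */
        n += 4;
    }
    while (n < max && a[n] == b[n])
        n++;
    return n;
}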
A match can be represented by a structure that includes the following three variables, collectively referred to herein as a “triad”:
- Length: the size of the byte substring returned by the De Bruijn technique, plus optional backward expansion
- Offset: the distance between the matching byte sequence and the current byte sequence
- Number of literals: the number of byte literals between the match found and the previous match, within the bit stream
Example code illustrating the storage of the triad is as follows:
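The following is a sketch of the triad record and a store helper, with field widths taken from the description below (32-bit offset, 8-bit length and literal count); the names are illustrative:

typedef struct {
    uint32_t offset;        /* distance back to the matching sequence  */
    uint8_t  length;        /* match length from the De Bruijn compare */
    uint8_t  num_literals;  /* literals emitted since the last match   */
} triad_t;

static void store_triad(triad_t *buf, size_t *count,
                        uint32_t offset, uint8_t length, uint8_t num_literals)
{
    buf[*count].offset       = offset;
    buf[*count].length       = length;
    buf[*count].num_literals = num_literals;
    (*count)++;
}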
In some embodiments, the foregoing process is repeated until an end of the input bit stream is reached, at which time the SPM returns the literal buffer and the triad buffer to be encoded (see “Byte Literal Encoding” and “Triad Encoding” sections, below).
The offset portion of the triad is stored as a 32-bit integer, pre-encoded for faster retrieval, while the length and number of literals are each stored as 8-bit integers.
Double-Pass Modeling
In some embodiments, a processor-implemented encoder employs two-pass modeling, or “double-pass modeling” (“DPM”), an adaptive modeling technique that can accommodate multiple different parsing quality levels, and that uses Markov chain compression to store potential matches, leveraging modern processor architectures. A first pass of DPM includes creating a chain buffer (or “buffer road” or “buffer road map”) based on the Markov chain and the SPM pre-hash approach described above, while a second pass of DPM scores and selects the best estimated matches to be encoded.
The first pass is processed through a chain-building function, which hashes four byte sequences at a time, following the SPM pre-hash technique, and stores the positions of the sequences in a separate array, following the Markov chain, by storing, at the current position, the last observed position of the same hash value. As such, a first position having the same hash value as a second position is linked to the second position by containing a reference to the second position.
The buffer road size is the size of the input bit stream (e.g., up to 128 KB (2^17 bytes)), and the first pass is complete once the entire input bit stream has been hashed and processed. At the end of the first pass, the hash table can be deallocated (i.e., the contents of the hash table can be cleared), as all the potential match positions are stored in the buffer road. Subsequent modeling is based on the chain (i.e., the “separate array” referenced above), with the input bit stream only being accessed again if an expansion of a match is desired, which can serve to reduce cache misses.
During the second pass, searches for potential matches are performed using the buffer road, the potential matches are scored, and the best matches are selected. Searches of the buffer road are efficient, since every position is linked to the position of the next match candidate. A search-count variable defines the number of searches performed in the chain, and can be set according to the desired parsing quality. Given that a match can overlap another match, several steps forward in the input bit stream (i.e., several additional bits, in a forward direction of the input bit stream) may be considered, to ensure that a preferable match is not overlooked/missed. For each step taken forward in the input bit stream, the same number of checks may be performed in the chain. The number of steps taken forward in the input bit stream can be defined by the desired parsing quality.
Below is example pseudocode for a subroutine (also referred to herein as a “scoring function”), which returns a best potential match:
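The following sketch captures the shape of such a scoring function: it walks the chain from the current position, measures each candidate with match_length( ) from the earlier sketch, and scores candidates so that longer matches win while distant offsets (which cost more bits to encode) are penalized. The cost model, the minimum match length of 4, and the use of position 0 as a chain terminator are assumptions of this sketch:

typedef struct { uint32_t offset, length; } match_t;

static inline uint32_t ilog2_u32(uint32_t x)   /* floor(log2(x)), x > 0 */
{
    uint32_t r = 0;
    while (x >>= 1)
        r++;
    return r;
}

static match_t best_match(const uint8_t *in, uint32_t here, uint32_t end,
                          const uint32_t *chain, int max_checks)
{
    match_t best = { 0, 0 };
    int32_t best_score = 0;
    uint32_t cand = chain[here];               /* last position with same hash */

    for (int i = 0; i < max_checks && cand != 0; i++, cand = chain[cand]) {
        uint32_t len = (uint32_t)match_length(in + cand, in + here, end - here);
        /* longer matches gain bits; larger offsets cost bits */
        int32_t score = (int32_t)(len * 8) - (int32_t)ilog2_u32(here - cand);
        if (len >= 4 && score > best_score) {
            best_score  = score;
            best.offset = here - cand;
            best.length = len;
        }
    }
    return best;
}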
In some embodiments, prior to encoding a literal buffer, an encoding strategy is selected, for example using a strategy-selection function. Depending on the desired compression level (e.g., a desired quality of parsing and/or speed of parsing), one or more different heuristics, e.g., implemented by a static encoding (or Huffman) tree selector implemented in code stored in memory and executed by a processor operably coupled to the memory, can be used for selecting an appropriate (or “best”) static Huffman tree, from a set of available Huffman trees, for subsequent comparison with a custom prefix-free tree previously generated for the literals in the literal buffer.
In some embodiments, selecting the encoding strategy includes counting frequencies of occurrence of characters in the literal buffer. A counting function, which counts the characters using an unrolled loop, can be used for this purpose.
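A sketch of such a counting routine, using the common trick of four private count tables so that the unrolled increments do not contend on the same counters; the 4-way unroll factor is an assumption:

/* Frequency count with a 4-way unrolled loop into four private tables,
   merged at the end. */
static void count_frequencies(const uint8_t *buf, size_t n, uint32_t freq[256])
{
    uint32_t c0[256] = {0}, c1[256] = {0}, c2[256] = {0}, c3[256] = {0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0[buf[i]]++; c1[buf[i + 1]]++; c2[buf[i + 2]]++; c3[buf[i + 3]]++;
    }
    for (; i < n; i++)
        c0[buf[i]]++;
    for (int s = 0; s < 256; s++)
        freq[s] = c0[s] + c1[s] + c2[s] + c3[s];
}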
Depending on the selected encoding level, the static tree selector can use, for example, principal component analysis (PCA) or a cross-entropy heuristic to determine the best static Huffman tree. PCA may be selected, for example, when faster compression is desired, while cross-entropy may be selected, for example, when a desired accuracy in the compression is prioritized over speed of compression.
These selection functions can return a static tree index, and a size of the result of compression using the best chosen Huffman tree can be provided via parameters passed by reference.
In a next step, an entropy of the literal buffer can be calculated, for example using an entropy function that uses the natural logarithm (log_e) for the precise calculation of the entropy. The preferred prefix result size is then calculated based on the entropy percentage. The selected prefix result size can account for (i.e., include sufficient capacity for) the size of the custom prefix that would be saved in a header, by including a constant value for the custom prefix as part of the selected prefix result size.
In some embodiments, if a calculated difference in predicted size between the static tree size and the preferred prefix tree falls within a predefined threshold percentage of the number of literals in the literal buffer (e.g., set to 1.5% of the number of literals), and the static tree would not expand the literals buffer, static encoding may be selected, in which case the static tree is copied to the literal lookup table (LUT).
If the static tree appears to be worse than (e.g., less compressed than) the preferred prefix, the number of literals in the literal buffer is less than a predefined value (e.g., 2,000), and a calculated difference in predicted size between the static tree size and the selected prefix tree is still relatively small (e.g., not more than double the predefined threshold), a bubble-up encoder may be selected.
If the number of literals in the literal buffer is greater than the predefined value, and/or the calculated difference in predicted size between the static tree size and the selected prefix tree is greater than the predefined threshold, a custom canonical prefix may be generated, with a codeword length limited to 11 bits. Since the codeword length is limited, the custom canonical prefix may exhibit a reduced compression ratio. As such, the result size for the custom canonical prefix may be compared with the result size for the static tree, and the custom canonical prefix may be selected if it will produce a result having fewer bits than the static tree.
As discussed above, in some embodiments, a cross-entropy heuristic is used for the selection of static trees for the compression of byte sequences and/or buffers. When compressing a literal buffer, for example, a set of 256 distinct frequencies for each 8-bit character can be viewed or used as a probability distribution.
In general, cross entropy is calculated by using the following formula:
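H(P, Q) = -\sum_{x} p(x) \log q(x)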
where p(x) is a current probability distribution, q(x) is a probability distribution to which p(x) is compared, and H(P, Q) refers to the cross entropy of variable P and variable Q. The formula below describes the relationship between cross-entropy and Kullback-Leibler divergence:
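H(P, Q) = H(P) + D_{KL}(P \| Q)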
where H(P) is the entropy of P (which is the same as cross-entropy of P with itself).
In statistics, the Kullback-Leibler divergence is a measure of how a first probability distribution is different from a second probability distribution. As discussed above, the set of frequencies of symbols in the literal buffer can represent a probability distribution for the occurrence of a symbol in a buffer. In the case of compression, the Kullback-Leibler divergence can be used to represent the difference between an average codelength of a given static tree, and an average codelength of a preferred tree for the given set of symbol frequencies. As used herein, “codelength” refers to the length of encoding of a byte symbol, in bits (or, an “average codeword length”).
For a given encoding L={l1, l2, . . . , ln} and set of symbol frequencies P={p1, p2, . . . , pn}, an average codelength can be expressed by the following formula:
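\bar{l}(L, P) = \sum_{i=1}^{n} p_i l_i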
This average codelength for a Huffman code may approach an optimal codelength. When the codelengths for a given static tree are known, the cross-entropy can be expressed as a function of codelength, to obtain the following formula:
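H(P, Q) = \sum_{i=1}^{n} p_i l_{q_i}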
where l_{q_i} represents the length of the i-th symbol in the static tree with distribution q. Consequently, the Kullback-Leibler divergence is represented as:
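D_{KL}(P \| Q) = \sum_{i=1}^{n} p_i l_{q_i} - \sum_{i=1}^{n} p_i l_{p_i}

where l_{p_i} is the codelength of the i-th symbol in the preferred tree for the given set of symbol frequencies P.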
To reduce the number of calculations, sizes of compressed literal buffers for a given static tree may be calculated, since the smallest size represents the smallest cross-entropy. A selection function can, for example, call the counting function described above, and then select the best static tree using a for-loop. Example function code is as follows:
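The following is a sketch of such a selector. It picks the static tree whose codelengths give the smallest compressed size for the observed frequencies (smallest size implies smallest cross-entropy); static_tree_lengths[][] and NUM_STATIC_TREES are assumed to come from the generated static-tree header:

#define NUM_STATIC_TREES 64

static int select_static_tree_xentropy(const uint32_t freq[256],
                                       uint32_t *best_bytes_out)
{
    extern const uint8_t static_tree_lengths[NUM_STATIC_TREES][256];
    int best = 0;
    uint64_t best_bits = UINT64_MAX;

    for (int t = 0; t < NUM_STATIC_TREES; t++) {
        uint64_t bits = 0;
        for (int s = 0; s < 256; s++)
            bits += (uint64_t)freq[s] * static_tree_lengths[t][s];
        if (bits < best_bits) { best_bits = bits; best = t; }
    }
    *best_bytes_out = (uint32_t)((best_bits + 7) / 8);  /* result size in bytes */
    return best;
}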
In some embodiments, a set of 256 possible symbol frequencies for each 8-bit character of an input bit stream is treated as a set of dimensions, in the context of a PCA heuristic calculation. To identify a specific file type for a given file, principal components can be extracted from the file. After subsequently calculating the principal component for a current set of symbol frequencies, a static tree that is the most similar may be identified.
PCA can be performed by eigenvalue decomposition of a data covariance (or correlation) matrix, or by singular value decomposition of a data matrix after normalization of the initial data (where “initial data” refers to the frequencies of the byte symbols in the input file). A normalization can be performed by “mean centering” (i.e., subtracting the measured mean from each data value, to render the empirical average zero). In the case of a character/symbol frequency distribution, normalizing the variable using Z-scores may not be needed, as the values naturally add up to 1.
Singular value decomposition is based on the following formula:
X = UΣW^T
where X is a matrix having dimensions “n” by “p” (e.g., 64×256—a number of static trees by a number of characters), U is an “n” by “n” matrix whose columns are orthogonal unit vectors of length “n,” referred to as the left singular vectors of X, Σ is an “n” by “p” rectangular diagonal matrix of positive numbers σ, referred to as the singular values of X, and W is a “p” by “p” matrix whose columns are orthogonal unit vectors of length “p,” referred to as the right singular vectors of X.
To obtain the projection of data onto a small number of dimensions, the following calculation can be performed:
T_L = XW_L
where W_L is a truncated matrix comprising the first L (most important) right singular vectors of X.
In some embodiments, the task of reducing a set of 256 components to a few dimensions can be performed, for example, using Python libraries such as Pandas, NumPy and scikit-learn. For example, a header “pca.h” can be generated using a Python script. The header “pca.h” includes precomputed vectors derived from the saved symbol frequencies for the static trees that are computed during generation of the static tree header. The Python script also creates a sorted array that includes both projected values for each tree and their respective indexes. The sorted array is used to retrieve the index of the most similar static trees.
In some embodiments, to reduce the number of multiplications performed, only the first principal component is used, since a scikit-learn explained-variance analysis showed that the first principal component accounts for 49% of variance (with the subsequent two principal components accounting for 17% and 14% of variance, respectively). To obtain the first principal component, the symbol frequencies of the given literal buffer can be multiplied by the precomputed vector, as shown in the following example code:
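A sketch of the projection, assuming pca_component_1[] is the precomputed loading vector generated by the Python script described above, and normalizing the counts to frequencies before the dot product:

static double first_principal_component(const uint32_t freq[256], size_t total)
{
    extern const double pca_component_1[256];   /* from pca.h */
    double proj = 0.0;
    for (int s = 0; s < 256; s++)
        proj += ((double)freq[s] / (double)total) * pca_component_1[s];
    return proj;
}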
The first principal component can then be located within the presorted array that includes all values for all static trees.
To improve the robustness of the PCA heuristic, operations can be performed on a range of similar static trees (e.g., with 12 static trees being considered in each pass). Using this reduced set of static trees, the static tree yielding the smallest file size can be identified, e.g., in a manner similar to that of the cross-entropy method described herein. This reduced-set processing can increase the speed at which the best static tree is identified by at least four-fold, as the number of multiplications is significantly reduced.
The following is example source code for a function that is responsible for selecting a best static tree, using the PCA heuristic:
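The following is a sketch of such a selector: it locates the current buffer's projection in the presorted per-tree projection array, then scores the 12 nearest trees exactly, as described above, and returns the best. pca_sorted_proj[] and pca_sorted_index[] are assumed outputs of the Python generator, and the window handling is illustrative:

#define PCA_WINDOW 12

static int select_static_tree_pca(const uint32_t freq[256], size_t total)
{
    extern const double  pca_sorted_proj[NUM_STATIC_TREES];
    extern const int     pca_sorted_index[NUM_STATIC_TREES];
    extern const uint8_t static_tree_lengths[NUM_STATIC_TREES][256];

    double proj = first_principal_component(freq, total);

    /* find the insertion point in the sorted projections */
    int lo = 0;
    while (lo < NUM_STATIC_TREES - 1 && pca_sorted_proj[lo + 1] < proj)
        lo++;

    /* clamp a 12-tree window around the nearest neighbor */
    int start = lo - PCA_WINDOW / 2;
    if (start < 0) start = 0;
    if (start > NUM_STATIC_TREES - PCA_WINDOW) start = NUM_STATIC_TREES - PCA_WINDOW;

    int best = pca_sorted_index[start];
    uint64_t best_bits = UINT64_MAX;
    for (int i = start; i < start + PCA_WINDOW; i++) {
        int t = pca_sorted_index[i];
        uint64_t bits = 0;
        for (int s = 0; s < 256; s++)
            bits += (uint64_t)freq[s] * static_tree_lengths[t][s];
        if (bits < best_bits) { best_bits = bits; best = t; }
    }
    return best;
}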
In experiments comparing the cross-entropy heuristic with the PCA heuristic, the PCA heuristic has been shown to perform significantly faster; however, it has lower accuracy than the cross-entropy heuristic, since the latter is certain to identify the best static tree available. As such, the PCA heuristic may be selected when the fastest possible compression level is desired, and where some accuracy can be sacrificed in favor of speed.
Building Custom Huffman Trees
In some embodiments, generating a custom prefix includes three phases. During a first phase, a symbol-count array is generated by taking the raw counts of the symbol frequencies and adding 1 to each of the raw counts, to account for each possible character. A check may be performed to ensure that there is a codeword for each character, in case there was an error when counting the symbol frequencies in the buffer. The incremented counts of the characters are then placed into a working structure, and the symbol frequencies saved in that structure are then sorted in ascending order.
During a second phase, a canonical prefix code without a length limit is generated, e.g., using a tree-building function. The resulting code is prefix-free, meaning that no codeword is a prefix of another codeword (thereby rendering the decoding process more efficient). The prefix length can be limited, for example to a length of 11, which restricts the size of the look-up table to 2,048 elements (2^11). A look-up table facilitates the rapid decoding of characters that have been encoded with a canonical prefix.
In some embodiments, the tree-building function allocates memory for all nodes of a tree, then initializes parent nodes to a sentinel value (e.g., a maximum 32-bit unsigned integer). In a main loop, two consecutive nodes are joined to form a parent node if they have a smaller symbol frequency than the current parent node; otherwise, a node is joined with a parent node, or two parent nodes form a new parent node with the sum of their respective symbol frequencies. The foregoing process may be repeated until a root node is created.
The tree-building function returns the maximum depth of the tree. Example source code is as follows:
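The following is a minimal Huffman-style builder sketch: it repeatedly joins the two lowest-frequency live nodes and returns the resulting maximum leaf depth. A simple O(n^2) selection is used for clarity; the production code described above works on presorted frequencies:

#define NSYM 256

static int build_tree_max_depth(const uint32_t freq[NSYM])
{
    uint64_t w[2 * NSYM];   /* node weights */
    int parent[2 * NSYM];   /* parent links; -1 marks a live root */
    int n = NSYM;
    for (int i = 0; i < NSYM; i++) { w[i] = freq[i]; parent[i] = -1; }

    int alive = NSYM;
    while (alive > 1) {
        int a = -1, b = -1;                 /* two smallest live nodes */
        for (int i = 0; i < n; i++) {
            if (parent[i] != -1) continue;
            if (a == -1 || w[i] < w[a]) { b = a; a = i; }
            else if (b == -1 || w[i] < w[b]) b = i;
        }
        w[n] = w[a] + w[b];                 /* join into a new parent */
        parent[n] = -1;
        parent[a] = parent[b] = n;
        n++; alive--;
    }

    int max_depth = 0;                      /* deepest leaf = longest codeword */
    for (int i = 0; i < NSYM; i++) {
        int d = 0;
        for (int j = i; parent[j] != -1; j = parent[j]) d++;
        if (d > max_depth) max_depth = d;
    }
    return max_depth;
}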
In a third phase, if the maximum depth is more than 11, the Kraft-McMillan inequality (“Kraft's inequality”) can be used to adjust the lengths down to the maximum. Kraft's inequality limits the lengths of codewords in a prefix code: summing 2^(-l) over the length l of each valid codeword yields a total measure less than or equal to one (i.e., \sum_i 2^{-l_i} <= 1). Kraft's inequality can be thought of in terms of a constrained budget to be spent on codewords, with shorter codewords being more expensive. Characters having extraneous lengths can be moved upwards within the tree (i.e., their symbol frequencies can be incremented to achieve higher positions in the tree).
Encoding with a Huffman Tree
In some embodiments, once a best tree has been selected for compressing the literals, the process of encoding literals is similar whether a custom generated prefix tree has been selected or a precomputed static tree has been selected. For custom prefix trees, the following additional steps may be performed: generating the symbols used for encoding, and saving the custom prefix encoder to the header struct. Static trees can be saved in a resource file, for example in the form of arrays that already include precomputed symbols, as follows:
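The following illustrates a plausible shape of such a precomputed entry: for each of the 256 byte values, the canonical codeword and its bit length. The values shown are placeholders, not real tree contents:

typedef struct { uint16_t code; uint8_t bits; } symbol_t;

static const symbol_t static_tree_0[256] = {
    { 0x000, 4 },  /* symbol 0x00 */
    { 0x001, 4 },  /* symbol 0x01 */
    /* ... remaining 254 symbols ... */
};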
In one or more embodiments, the process of encoding the literals is performed using a Huffman encoding technique. To utilize out-of-order execution, four streams can be generated. The following example code illustrates the implementation of a main encoding loop:
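A sketch of the four-stream loop, reusing symbol_t from the array sketch above. The bitstream_t helper, the LSB-first bit order, and the way the buffer is split into four quarters are assumptions of this sketch:

typedef struct { uint8_t *out; uint64_t acc; int nbits; } bitstream_t;

static void put_bits(bitstream_t *bs, uint32_t code, int bits)
{
    bs->acc |= (uint64_t)code << bs->nbits;
    bs->nbits += bits;
    while (bs->nbits >= 8) {                 /* flush whole bytes */
        *bs->out++ = (uint8_t)bs->acc;
        bs->acc >>= 8;
        bs->nbits -= 8;
    }
}

static void encode_literals(const uint8_t *lit, size_t n,
                            const symbol_t lut[256], bitstream_t bs[4])
{
    size_t quarter = n / 4;
    for (size_t i = 0; i < quarter; i++) {
        for (int s = 0; s < 4; s++) {        /* one symbol per stream: the
                                                four lookups are independent,
                                                so they can execute OoO */
            uint8_t c = lit[s * quarter + i];
            put_bits(&bs[s], lut[c].code, lut[c].bits);
        }
    }
    for (size_t i = 4 * quarter; i < n; i++) /* tail goes to the last stream */
        put_bits(&bs[3], lut[lit[i]].code, lut[lit[i]].bits);
}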
The loop procedure shown above can be applied to both static trees and custom prefix trees. The saved positions in the output buffer facilitate out-of-order decoding of literals (described further below). Another advantage of using static trees is that the look-up tables used for decoding literals may be precomputed, and thus ready to use (as described herein).
Bubble Up Encoding
In some implementations, the static tree performance is similar to the performance of the custom prefix tree, for example when the size of the literal buffer is small (e.g., less than a predefined threshold, for example 2,000) and/or when the difference between the result size from the static tree and that from the custom prefix tree is no more than twice the threshold.
In some embodiments, a “bubble up” process, as described herein, can be used to correct the result for the static tree, such that the corrected result is closer to an optimal entropy, by replacing or changing the positions of the characters in the sorted encoding length array. To facilitate bubble up corrections, a sorted array, sorted codewords, and rank tables for the symbols are generated at the same time that the static trees header is automatically generated. To apply bubble up, a first step is generating an inverted index array, which can be performed using a dedicated function, example code for which is as follows:
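A sketch of the inverted index construction: for each symbol, record its rank in the sorted codeword table, so that rank lookups during bubble up are O(1). sorted_symbols[] (symbols ordered by ascending code length) is assumed to come from the static-tree header:

static void build_inverted_index(const uint8_t sorted_symbols[256],
                                 uint16_t inverted[256])
{
    for (int rank = 0; rank < 256; rank++)
        inverted[sorted_symbols[rank]] = (uint16_t)rank;
}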
During bubble up encoding, as symbols are encoded using the look-up table, the rank of each symbol encountered in the literal buffer is updated. Consequently, the higher the frequency of a symbol, the shorter its length. Below is example code implementing the rank update:
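A sketch of the rank update: after a symbol is encoded, it swaps one place toward the front of the sorted table if its running count overtakes its neighbor's, so frequent symbols drift toward shorter codes. counts[], sorted_symbols[], and inverted[] are the working tables from the previous step:

static void bubble_up(uint8_t sym, uint32_t counts[256],
                      uint8_t sorted_symbols[256], uint16_t inverted[256])
{
    counts[sym]++;
    uint16_t r = inverted[sym];
    if (r > 0) {
        uint8_t prev = sorted_symbols[r - 1];
        if (counts[sym] > counts[prev]) {   /* overtake the neighbor */
            sorted_symbols[r - 1] = sym;
            sorted_symbols[r]     = prev;
            inverted[sym]  = r - 1;
            inverted[prev] = r;
        }
    }
}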
The foregoing process can be replicated during decoding (as further described herein), as both the rank table for characters and the sorted codewords are available for each static tree.
Custom Tree Reuse
In some embodiments, to accommodate large files, an internal chunking process is used, in which the input buffer is divided into chunks of a given size (e.g., 128 KB each). As the custom prefix tree takes 128 bytes to store in the header, while the chunks of a specific file tend to have similar frequencies of symbols/characters, the custom prefix encoder arrays can be saved/stored for reuse, thereby saving the space that would otherwise be spent on another custom prefix.
The saved custom prefix trees can be included in cross-entropy calculations. If one of the previously saved prefixes is determined to be preferable to a custom prefix, it can be reused, and then only its index saved in the header. On the decoding end, the custom look-up tables can also be saved in the same order.
Custom tree reuse can be beneficial to both encoding and decoding speeds. For encoding, custom tree reuse can save time and computing resources that would otherwise be spent on building a custom prefix tree, as the size of the custom prefix tree is only estimated. For decoding, custom tree reuse has the advantage of having a precomputed look-up table. The computational overhead associated with calculating cross-entropy for additional trees is negligible in comparison.
Automatic Static Tree Generator
Static trees described herein can be generated using a standalone tree generator application. In some embodiments, a tree generator application can be used to improve the compression ratio for a specific user-defined dataset. For example, the tree generator application may generate an appropriate set of static trees for the specific file types the user is compressing, and store them in a resource file. As the static trees would thus be more tailored to the needs of the user, they may offer higher compression and decompression speeds, at a ratio very close to that of the custom prefix trees.
Triad Encoding
In some embodiments, data to be encoded is treated as having two distinct parts—small integers in a lower range (e.g., the range [0-100]), and large integers in an upper range (e.g., the range [100-2^20]). In the triad structure, the small integers can represent the lengths and numbers of literals smaller than 100, while the large integers can represent all of the offsets, as well as the lengths/numbers of literals larger than 100.
Small Integer Encoding can be performed using the same approach used for literal encoding, with a static tree selector and a custom tree builder (see “Byte Literal Encoding”).
Large Integer Encoding: representing a large integer using a variable-length code word, and achieving as close as possible to the Shannon entropy, are major challenges of entropy coding. Known methods for encoding large integers use pre-set intervals defined by the logarithm base two of the large integers:
interval = log2(integer)
reduced_integer = integer - 2^interval
encoded_integer = interval variable code word + reduced_integer
The interval variable code word is typically generated by a Huffman tree builder and decoded by a lookup table. Since the intervals are fixed, they may not be optimal, and a lookup table is also needed to decode the interval number.
To address this problem, and according to some embodiments, a novel Large Integer Encoding (LIE) technique, inspired by the Riemann Sum, is used.
In some embodiments, the single-pass and double-pass modeling record the frequencies of the offsets that are large integers, and produce a frequency curve by using a triple lookup table with different accuracies:
- [0-4,095]: using an accuracy of ±16 (High)
- [4,096-65,535]: using an accuracy of ±256 (Medium)
- [65,536-1,048,575]: using an accuracy of ±4,096 (Low)
The amount of memory required is 3 * 256 * log2(input_size) bytes. Compare this to a classic frequency table, which requires input_size * log2(input_size) bytes. In terms of memory usage, the triple LUT is therefore advantageous whenever the input size to encode is larger than 3 * 256 = 768 bytes (which covers all standard chunk sizes).
The following is example code to implement the function that embodies this process:
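A sketch of the frequency-recording step: three 256-entry tables at coarsening accuracies, selected by the magnitude of the offset. The bucket is the offset shifted by the accuracy and masked to 8 bits, consistent with the shift-and-mask description later in this section:

static uint32_t freq_high[256], freq_med[256], freq_low[256];

static void record_offset(uint32_t offset)
{
    if (offset < 4096)            /* ±16 accuracy   */
        freq_high[(offset >> 4) & 0xFF]++;
    else if (offset < 65536)      /* ±256 accuracy  */
        freq_med[(offset >> 8) & 0xFF]++;
    else                          /* ±4,096 accuracy, up to 2^20 - 1 */
        freq_low[(offset >> 12) & 0xFF]++;
}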
As discussed above, LIE can be used to encode large integers. The following is an example of a frequency curve approximation performed using the Riemann Sum (N = 2^20 / accuracy):
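In standard (right-endpoint) form, with \Delta x equal to the selected accuracy, the area under the frequency curve f is approximated as:

S \approx \sum_{i=1}^{N} f(x_i) \, \Delta x, \quad x_i = i \cdot \Delta x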
From the size range [size1-sizeN], a Riemann compression table can be generated, and the large integers can be encoded. However, the foregoing approach is limited by a predefined range, and may not compress with an entropy that is close to the optimal entropy. As such, LIE uses a reversed approach: making the range a variable and the integration of the curve a constant based on the preferred entropy, to add granularity, for example as follows:
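A sketch of the constraint: the range boundaries r_k are chosen so that each range carries an approximately equal share of the curve's total mass, with the range width r_{k+1} - r_k becoming the variable:

\sum_{x = r_k}^{r_{k+1} - 1} f(x) \approx \frac{1}{N_{ranges}} \sum_{x} f(x)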
Here, the range is not predefined and is computed based on an approximation, constrained only by the accuracy selected, which provides a Riemann compression table much closer to the optimal entropy than is possible using known techniques.
The encoded large integer framework is:
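A sketch of the layout, assuming the 5-bit range descriptor discussed below:

[range index : fixed width (e.g., 5 bits)] [remainder : integer - range_start, using the bit count assigned to that range]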
Since the interval number is a fixed size, no lookup table is needed at the decoder.
The LIE table can be generated, for example, using a table-building function. The following example code can be used to implement the main loop that builds the ranges:
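A sketch of the range-building loop: it accumulates the bucketed frequency curve and closes a range whenever the accumulated mass reaches the target integral (total / MAX_RANGES). The remainder bit count per range is the base-2 ceiling of the range width. For brevity, the sketch uses a single bucket array with a uniform width, rather than the three arrays described above:

#define MAX_RANGES 32

typedef struct { uint32_t start, end, rbits; } lie_range_t;

static uint32_t ceil_log2(uint32_t x)       /* smallest b with 2^b >= x */
{
    uint32_t b = 0;
    while ((1u << b) < x) b++;
    return b;
}

static int build_lie_ranges(const uint32_t *buckets, int nbuckets,
                            uint32_t bucket_width, uint64_t total,
                            lie_range_t ranges[MAX_RANGES])
{
    uint64_t target = total / MAX_RANGES;   /* preferred integral per range */
    uint64_t acc = 0;
    int nr = 0;
    uint32_t start = 0;

    for (int b = 0; b < nbuckets && nr < MAX_RANGES - 1; b++) {
        acc += buckets[b];
        uint32_t end = (uint32_t)(b + 1) * bucket_width;
        if (acc >= target) {                /* close this range */
            ranges[nr].start = start;
            ranges[nr].end   = end;
            ranges[nr].rbits = ceil_log2(end - start);
            nr++;
            start = end;
            acc = 0;
        }
    }
    ranges[nr].start = start;               /* final catch-all range to 2^20 */
    ranges[nr].end   = 1u << 20;
    ranges[nr].rbits = ceil_log2((1u << 20) - start);
    return nr + 1;
}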
Since all of the ranges are base-2, each time the integration reaches a preferred value, a check can be performed to determine whether the previous range was preferable. If so, the previous range can be selected, and the preferred value recomputed based on the previous range. As the frequency curve is split across the three arrays, if one range spans two arrays, the range-building function will respect the accuracy of both arrays by applying a correction to the range.
In some embodiments, once the LIE table is built, the triple lookup table used to store the frequencies is reused and repopulated with range indexes, for use during encoding. The following example code can be used to implement one of the three loops used to populate the array:
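A sketch of one of the three population loops, following the structures above: every bucket in the high-accuracy table is mapped to the LIE range containing it, so that encoding an offset later becomes a single lookup:

static uint8_t range_lut_high[256];

static void populate_high_lut(const lie_range_t *ranges, int nranges)
{
    int r = 0;
    for (int b = 0; b < 256; b++) {
        uint32_t offset = (uint32_t)b << 4;      /* bucket start, ±16 table */
        while (r < nranges - 1 && offset >= ranges[r].end)
            r++;                                 /* advance to covering range */
        range_lut_high[b] = (uint8_t)r;
    }
}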
In some embodiments, there are two ways in which an LIE tree can be encoded: (1) encoding the ranges using 5 bits to describe each range, or (2) using the delta coder, discussed below, to describe the difference between two consecutive ranges.
To determine which technique to use for the encoding of the tree, the number of bits to use to encode the biggest delta can be calculated, as shown below:
delta_size = log2(max_delta)

with max_delta being the biggest difference between two consecutive ranges.
If delta_size is greater than 4 bits, method (1) may be used. If delta_size is less than or equal to 4 bits, the delta coder can be used, for example as follows:
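A sketch of the delta coder for the LIE table header, reusing the bitstream helpers from the encoding-loop sketch above. The first range's bit count is written raw (5 bits), then each subsequent range is stored as a delta_size-bit difference from its predecessor; the zig-zag mapping of signed deltas is an assumption of this sketch:

static void encode_lie_table_delta(bitstream_t *bs, const lie_range_t *ranges,
                                   int nranges, int delta_size)
{
    put_bits(bs, ranges[0].rbits, 5);            /* first range, raw */
    for (int r = 1; r < nranges; r++) {
        int32_t delta = (int32_t)ranges[r].rbits - (int32_t)ranges[r - 1].rbits;
        /* zig-zag map the signed delta to unsigned before writing */
        uint32_t zz = delta < 0 ? (uint32_t)(-delta) * 2 - 1
                                : (uint32_t)delta * 2;
        put_bits(bs, zz, delta_size);
    }
}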
In some embodiments, the LIE records large integer frequencies using variable ranges based on the chunk size, rather than saving the frequencies in a static triple LUT.
The following table presents example accuracy values, with the number of bits used by the different LUTs. When the size of the LUT is 256, to update the frequency of one offset, a shift may be performed by the accuracy specified, and the next 8 bits (2^8 = 256) masked.
The following example code can be used to implement the function that selects an accuracy based on the chunk size:
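A sketch of the accuracy selection: the shift (accuracy = 1 << shift) for each of the three LUTs is derived from the chunk size, so that the largest expressible offset still maps into 256 buckets. The spacing between the three shifts is an illustrative assumption; for a 2^20 chunk it reproduces the ±16/±256/±4,096 accuracies listed above:

static void select_accuracies(uint32_t chunk_size,
                              uint32_t *hi_shift, uint32_t *med_shift,
                              uint32_t *low_shift)
{
    uint32_t bits = 0;                       /* log2 of chunk size */
    while ((1u << bits) < chunk_size)
        bits++;

    *low_shift = bits > 8 ? bits - 8 : 0;    /* coarsest: covers the full range */
    *med_shift = *low_shift > 4 ? *low_shift - 4 : 0;
    *hi_shift  = *med_shift > 4 ? *med_shift - 4 : 0;
}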
The frequency-recording function is also updated to use the new variable accuracies.
In some embodiments, to better scale to the frequency curve, the number of ranges can be variable.
Repeated Large Integer Encoding
In some embodiments, in LZ offsets, large integers can be repeated sequentially. It can be advantageous to track repeated offsets, for example up to the fourth position back from a current offset. These repeated offsets can be encoded as numbers from 0 to 3, and can be assigned ranges according to the ratios of their counts to the total number of offsets. For example, if repeated offsets account for more than 50% of all offsets, half of the ranges will be assigned to repeated offsets.
A range-assignment function assigns the ranges for all the repeated offsets, taking into account boundary conditions. For example, at least one range may be desired for each existing repeated offset, and a condition may be enforced such that the existing repeated offset(s) do not overlap with regular offset ranges.
Encoding Triad in Blocks
In some embodiments, once Riemann encoding of one or more offsets has been performed, Huffman trees are generated for both the length and the number of literals, in a manner similar to that of encoding the literals buffer. In the case of files that include more than one block of 128 KB, these trees can be reused, as they are saved in a special structure. Triads can be encoded in blocks having a size that is defined in stealth_structure.h (e.g., set to 128). The length buffer, number-of-literals buffer, and offset buffer can be encoded separately, in chunks of the block size, for example to facilitate buffered decoding that relies on stack memory. The foregoing procedure was shown to be faster on decoding, especially on smaller chunks, with the length, number-of-literals, and offset blocks laid out sequentially, block by block, in an output_buffer.
In some embodiments, all of the triad elements are encoded using an unrolled for-loop, to utilize Out-of-Order compiler optimization.
Byte Literal Decoding
Custom Tree Decoding
In some embodiments, decoding literals with a custom tree includes rebuilding the code-lengths array from a saved custom prefix. The custom prefix can be read from the header for each chunk, and the following example code can be used to implement a loop that creates the look-up table for decoding:
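A sketch of the standard canonical-prefix table construction, limited to 11-bit codes (2,048 entries) per the text: each codeword of length b fills 2^(MAX_BITS - b) consecutive LUT slots with the symbol and its size. MSB-first code alignment is an assumption of this sketch:

#define MAX_BITS 11

typedef struct { uint8_t sym; uint8_t bits; } lut_entry_t;

static void rebuild_decode_lut(const uint8_t lengths[256],
                               lut_entry_t lut[1 << MAX_BITS])
{
    uint32_t code = 0;
    for (int b = 1; b <= MAX_BITS; b++) {        /* canonical order */
        for (int s = 0; s < 256; s++) {
            if (lengths[s] != b) continue;
            uint32_t span = 1u << (MAX_BITS - b);
            for (uint32_t i = 0; i < span; i++) {
                lut[(code << (MAX_BITS - b)) + i].sym  = (uint8_t)s;
                lut[(code << (MAX_BITS - b)) + i].bits = (uint8_t)b;
            }
            code++;
        }
        code <<= 1;                              /* next code length */
    }
}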
During decoding, the compressed buffer can be read into a 64-bit queue, and a logical AND operation can be performed to extract an index into the look-up table, from which the original symbol values and sizes are retrieved. Afterwards, the queue is shifted by the number of bits saved in the size field of the look-up table entry. This process can also be performed using out-of-order execution, to increase decoding speed. The values of the block sizes saved during the encoding process are used to determine the location of the starting points for each of the four streams. The following example code can be used to implement the main decoding loop, whether decoding with a custom tree or a static tree:
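A sketch of the four-stream decoding loop, reusing lut_entry_t from the sketch above: four independent 64-bit queues, each refilled from its saved starting point, with one table lookup per symbol. MSB-first bit order and the omission of tail handling are simplifications of this sketch:

static void decode_literals(const uint8_t *streams[4], uint8_t *out, size_t n,
                            const lut_entry_t lut[1 << MAX_BITS])
{
    uint64_t q[4]  = { 0, 0, 0, 0 };
    int      qn[4] = { 0, 0, 0, 0 };
    const uint8_t *src[4] = { streams[0], streams[1], streams[2], streams[3] };
    size_t quarter = n / 4;            /* tail symbols omitted for brevity */

    for (size_t i = 0; i < quarter; i++) {
        for (int s = 0; s < 4; s++) {               /* four independent streams */
            while (qn[s] <= 56) {                   /* refill the 64-bit queue */
                q[s] |= (uint64_t)(*src[s]++) << (56 - qn[s]);
                qn[s] += 8;
            }
            lut_entry_t e = lut[q[s] >> (64 - MAX_BITS)];
            out[s * quarter + i] = e.sym;
            q[s] <<= e.bits;                        /* consume the codeword */
            qn[s] -= e.bits;
        }
    }
}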
The process of decoding literals encoded with a static tree can be faster than decoding literals encoded with a custom tree, since a LUT is not generated. In some embodiments, headers containing static trees also contain precomputed look-up tables for decoding. The decoding process can proceed in a manner similar to that of a custom prefix, with the exception that a copy of the precomputed table into a working buffer is performed.
Bubble-Up Decoding
In some embodiments, decoding literals encoded using bubble up encoding includes performing additional steps beyond those performed for either custom tree or static tree decoding. For example, the ordered prefix lengths may first be copied from the sorted array, using the index of the static tree saved in the header. Next, the inverted index array is reconstructed using the same function that was used during encoding. After each symbol has been decoded, the rank of that symbol is updated so that it reflects the bubble up rank increase. While a rank-update function is used during encoding, during decoding a counterpart function reverses the encoding and retrieves each original symbol from the table. The following example code can be used to implement that function:
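A sketch of the counterpart function, the reverse of bubble_up( ) above: decode a rank via the LUT, map it through the sorted-symbol table to the original symbol, then apply the same rank update the encoder made, so both sides stay in sync:

static uint8_t bubble_down(uint16_t rank, uint32_t counts[256],
                           uint8_t sorted_symbols[256], uint16_t inverted[256])
{
    uint8_t sym = sorted_symbols[rank];               /* original symbol */
    bubble_up(sym, counts, sorted_symbols, inverted); /* mirror the encoder */
    return sym;
}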
In some embodiments, a number of literals and/or a length of a match can be encoded using a static tree or a custom prefix that, in turn, is saved in the header. A primary difference between custom prefixes for literals and custom prefixes for triad fields is that custom prefixes for triad fields have a dynamic size, since only elements up to the last non-zero element in the triad arrays are saved. There may also be extra values that do not fit in the original range. These extra values are encoded as “extras.” The number of extras and their respective encoding, if any, can be saved after the encoded literal buffer.
Similarly, as in decoding literals, a first step, in the case of a custom prefix, is reconstructing the code-lengths array, and subsequently reconstructing the look-up table. In the case of static trees, the look-up tables are available to copy from the header, and as such, the LUT reconstruction step is skipped, resulting in a speed advantage for the static tree approach. If a file is sufficiently large, the decoding look-up tables can also be reused, thereby further speeding up the decoding.
In some embodiments, triads are decoded using Out-of-Order execution. The triad block sizes saved during the triad encoding process are read from the header to establish the starting points for each stream. The triad elements are decoded separately, in a manner similar to the manner in which they were encoded, in blocks having a size that is determined by a block-size constant. Such an approach offers the advantage of using only stack memory when decoding triads. An example main decoding loop includes the following four steps:
- Decoding the length—similar to decoding literals from a literal buffer, the length is decoded using an LZ length lookup table.
- Decoding the number of literals—similar to decoding the length.
- Decoding offsets (described below)—using LIE ranges obtained using functions described in the following section, the offsets are decoded by reading their headers and their values, and rebuilding the offsets.
- Rebuilding the output buffer using decoded blocks of length, number of literals, and offsets—involves copying the match and the literals that were not part of the match, using the values of match length, number of literals, and match offset, as shown in the sketch following this list.
Each block is decoded with an unrolled loop, utilizing Out-of-Order optimization, in a manner similar to that of decoding literals. For files encoded using a dictionary, another version of the rebuild function can be used that includes copying matches from the dictionary.
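A sketch of the rebuild step, reusing triad_t from the earlier sketch: for each decoded triad, the literals preceding the match are copied from the literal buffer, then the match itself is copied from earlier in the output (byte by byte, so overlapping matches are handled correctly):

static size_t rebuild_block(const triad_t *triads, size_t ntriads,
                            const uint8_t *literals, uint8_t *out)
{
    size_t lit_pos = 0, out_pos = 0;
    for (size_t t = 0; t < ntriads; t++) {
        /* copy the literals that precede this match */
        for (uint8_t i = 0; i < triads[t].num_literals; i++)
            out[out_pos++] = literals[lit_pos++];
        /* copy the match from earlier in the output (may overlap) */
        size_t from = out_pos - triads[t].offset;
        for (uint8_t i = 0; i < triads[t].length; i++)
            out[out_pos++] = out[from + i];
    }
    return out_pos;
}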
Offset Decoding
Offset is a special case, as it covers wide ranges of numbers. As such, in some embodiments, the offset is encoded and decoded using LIE. For example, a first function reads the LIE table from the header, and a second function then recreates the look-up table using LIE (see “Triad Encoding”).
Prefix Dictionaries
Since Lempel-Ziv parsing operates by finding and mapping redundant byte sequences, looking backward in the bitstream, compression performance is directly correlated to input data size. In other words, compressing a small file (e.g., 4 KB) will be substantially more difficult than compressing a larger (e.g., 64 KB) file. This behavior can be explained by the fact that smaller files have a smaller history in which to find matches, and are therefore by nature less redundant. To account for this, the use of prefix dictionaries has long been explored in data compression. Prefix dictionaries contain data that is statistically likely to occur in the data to be compressed, allowing the LZ parser to find matches with the dictionary, and significantly improving compression performance in small data applications.
While the use of prefix dictionaries is well-established, these dictionaries have inherent flaws. Since all data is different, there is no such thing as a “universal” dictionary. For instance, a dictionary built on English text will not be of any use when compressing Chinese text. To remedy this issue, independent dictionaries must be constructed on a variety of data. The present application demonstrates a method to do this, and to automatically select the most appropriate dictionary for the given data to be compressed.
Dictionary Builder
In some embodiments, an encoder includes or is configured to interact with a dictionary builder (implemented in software and/or hardware), which can create/generate a prefix dictionary from a source file. The dictionary builder can generate a prefix dictionary, for example, by dividing the source file into sample and test buffers. The sample buffer is hashed, and possible prefixes for each hash value are saved in a hash table, creating context for choosing segments. Next, for each context, the sample buffer is divided into epochs whose size depends on the size of the segment chosen. A segment is chosen from each epoch according to its score, which is the sum of the frequencies of the prefixes included in the segment. After the segment with the highest score is chosen from an epoch, its frequencies are zeroed, to reduce redundancy. A temporary dictionary is built from the chosen segments, and is then tested by using it to encode test buffers. Subsequently, the segments array is sorted, and a new dictionary built from the sorted segments is also tested with encoding. The best dictionary (i.e., the one which achieves the lowest compression ratio on the test buffer) is picked as the final dictionary.
In some embodiments, once the best prefix dictionary has been picked, the dictionary builder saves the prefix dictionary to an output file, for example in a literal array format (e.g., “{0, 23, 1, . . . ,}”). In picking the segments, redundancy is avoided by stripping zero-frequency prefixes from both the head and the tail of the best segment in an epoch.
Example Functions of the Dictionary Builder
A sampling function uses a reservoir sampling method to generate a random sample from the input. First, the function generates a random set of indexes, using the sample_indexes array and the variables blocks_number and blocks_to_pick, which depend on the size of the input file. After picking the required blocks, the sample_indexes array is sorted. The function then copies the blocks from the input file into the test or sample buffer, using the indexes saved in sample_indexes. The following example code can be used to implement the function:
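A sketch of the index-picking step, using classic reservoir sampling (Algorithm R) followed by a sort so the copy pass reads the input sequentially. The names mirror those in the text (sample_indexes, blocks_number, blocks_to_pick); the use of rand( ) is an illustrative simplification:

#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

static void pick_sample_indexes(uint32_t *sample_indexes,
                                uint32_t blocks_to_pick, uint32_t blocks_number)
{
    /* classic reservoir: keep the first k indexes, then replace with
       decreasing probability */
    for (uint32_t i = 0; i < blocks_to_pick; i++)
        sample_indexes[i] = i;
    for (uint32_t i = blocks_to_pick; i < blocks_number; i++) {
        uint32_t j = (uint32_t)(rand() % (i + 1));
        if (j < blocks_to_pick)
            sample_indexes[j] = i;
    }
    qsort(sample_indexes, blocks_to_pick, sizeof *sample_indexes, cmp_u32);
}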
The create_context( ) function can be used to create a hash table for a given prefix size, by iterating through the sample buffer. The create_context( ) function can read a prefix of 4, 6, or 8 bytes, hash the bytes into a 2-byte value, then either save the prefix as a 64-bit value, if it is being seen for the first time, or add to its count, using the hash as an index.
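A sketch of create_context( ), with the bucket type it shares with add_to_seen_hashes( ) (sketched below); the Fibonacci-style 64-bit hash constant is an assumption:

#define MAX_PREFIXES 160

typedef struct {
    uint64_t prefix[MAX_PREFIXES];
    uint32_t count[MAX_PREFIXES];
    uint32_t used;
} hash_bucket_t;

static void add_to_seen_hashes(hash_bucket_t *table, uint16_t hash,
                               uint64_t prefix);   /* sketched below */

static void create_context(const uint8_t *sample, size_t n, size_t prefix_size,
                           hash_bucket_t *table)
{
    for (size_t i = 0; i + prefix_size <= n; i++) {
        uint64_t prefix = 0;
        memcpy(&prefix, sample + i, prefix_size);  /* 4, 6, or 8 bytes */
        uint16_t h = (uint16_t)((prefix * 0x9E3779B97F4A7C15ULL) >> 48);
        add_to_seen_hashes(table, h, prefix);
    }
}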
The add_to_seen_hashes( ) function receives a hash value and the prefix value. As the size of the sample is usually large (e.g., >5 MB), the hashing process can generate many hash collisions. Therefore, the hash table uses the hash as an index for a struct that saves a number (e.g., 160) of different prefix values along with their respective counts. If a particular prefix has not been previously seen, its value is appended to the prefixes table for the given hash value. Otherwise, its count is simply incremented.
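A sketch of add_to_seen_hashes( ), using the hash_bucket_t type from the create_context( ) sketch above; the linear scan within a bucket is an illustrative simplification:

static void add_to_seen_hashes(hash_bucket_t *table, uint16_t hash,
                               uint64_t prefix)
{
    hash_bucket_t *b = &table[hash];
    for (uint32_t i = 0; i < b->used; i++) {
        if (b->prefix[i] == prefix) {        /* previously seen: bump count */
            b->count[i]++;
            return;
        }
    }
    if (b->used < MAX_PREFIXES) {            /* first sighting: append */
        b->prefix[b->used] = prefix;
        b->count[b->used]  = 1;
        b->used++;
    }
}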
A segment-picking function finds the best segment within an epoch, using a copy of the hash table generated by create_context( ). The copy of the hash table is used so the original table can be reused for a different segment size. The function first creates a table of scores for each prefix in the epoch. It then calculates the score for each segment using a rolling score method (i.e., utilizing the fact that consecutive segments differ only by their first and last prefixes), such that the score of the next segment is the score of the previous segment, minus the score of its first prefix, plus the score of the next prefix. Once the best segment is found, zero-frequency prefixes are stripped from both its beginning and its end. The following example code might be used to implement picking the segment from the epoch:
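A sketch of the rolling-score scan: the score of a window is the sum of its per-position prefix scores, and sliding by one position subtracts the outgoing score and adds the incoming one, so each epoch is scanned in O(n). scores[] is assumed to hold the per-position prefix frequencies built in the first step:

static uint32_t pick_best_segment(const uint32_t *scores, uint32_t epoch_len,
                                  uint32_t segment_len)
{
    uint64_t score = 0;
    for (uint32_t i = 0; i < segment_len; i++)    /* initial window */
        score += scores[i];

    uint64_t best_score = score;
    uint32_t best_pos = 0;
    for (uint32_t pos = 1; pos + segment_len <= epoch_len; pos++) {
        score -= scores[pos - 1];                 /* outgoing prefix */
        score += scores[pos + segment_len - 1];   /* incoming prefix */
        if (score > best_score) { best_score = score; best_pos = pos; }
    }
    return best_pos;
}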
A dictionary-building function can be used to generate a temporary prefix dictionary by adding the segments saved by the segment picker. The function adds the segments, up to the dictionary size, starting from the end of the dictionary buffer, so that the most likely matches get the smallest offset values (i.e., are located closest to the buffer used for future data compression). Once the prefix dictionary has been generated, it is then tested to assess its efficiency.
The function can be used to test the temporary dictionary, assessing its efficiency as measured by the compression ratio achieved. It uses the test buffer and stealth functions to get the compressed file size. To amplify the differences between temporary dictionaries, it splits the test buffer into 4 KB chunks, since chunks of this size are the most affected by dictionary efficiency. The function returns the total compressed size across all the chunks, which enables comparison between different dictionaries.
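The following sketch illustrates the chunked testing loop; compress_with_dict( ) stands in for the internal compression entry point and is an assumption of this illustration, not a defined library call:

#include <stddef.h>

#define TEST_CHUNK 4096

/* Assumed internal entry point: compresses src against the
   dictionary and returns the compressed size (not defined here). */
size_t compress_with_dict(const unsigned char *src, size_t src_len,
                          const unsigned char *dict, size_t dict_len);

/* Return the total compressed size of the test buffer, split into
   4 KB chunks; a lower total indicates a more efficient dictionary. */
size_t test_dictionary(const unsigned char *test_buf, size_t test_len,
                       const unsigned char *dict, size_t dict_len)
{
    size_t total = 0;
    for (size_t off = 0; off < test_len; off += TEST_CHUNK) {
        size_t n = test_len - off;
        if (n > TEST_CHUNK)
            n = TEST_CHUNK;
        total += compress_with_dict(test_buf + off, n, dict, dict_len);
    }
    return total;
}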
The function uses the segment-picking and dictionary-testing functions described above to find the dictionary that achieves the most efficient compression, as measured by the size of the compressed test buffer. The winning dictionary is then copied out as the final dictionary.
Additional details pertaining to dictionary construction can be found in U.S. patent application No. 16/806,338, filed Mar. 2, 2020 and titled “System and Method for Statistics-Based Pattern Searching of Compressed Data and Encrypted Data,” the entire contents of which are incorporated by reference herein for all purposes.
Dictionary Selection
With a library of prefix dictionaries constructed, methods for selecting the appropriate or adequate dictionary for each compression scenario are desired. Such methods are described herein.
Example Dictionary-Related Functions
The function receives a chunk of data to be compressed and identifies and selects the most appropriate dictionary by choosing, from the input, a sample whose size depends on the input size (e.g., 1024 bytes for a 4 KB input). The sample blocks are chosen pseudo-randomly.
The frequencies of characters in the sample blocks are then counted, and the cross-entropy of the sample is compared with that of each prefix dictionary to find the lowest value. The cross-entropy is calculated similarly to the cross-entropy used for finding a static tree for encoding byte literals. To simplify the calculations, the unnormalized value (scaled by the file size) is used for the comparison, as this avoids unnecessary division. The following example code might be used to implement dictionary selection:
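In this illustrative sketch, the per-dictionary byte-probability tables (dict_prob) are assumed to be precomputed from the dictionaries' flat files, and zero-probability symbols are simply skipped (a real implementation would smooth them):

#include <math.h>
#include <stddef.h>

/* Return the index of the dictionary with the lowest cross-entropy
   against the sample's character distribution. dict_prob[d][c] is
   the probability of byte c under dictionary d. */
int select_dictionary(const unsigned char *sample, size_t sample_len,
                      const double dict_prob[][256], int num_dicts)
{
    size_t freq[256] = {0};
    for (size_t i = 0; i < sample_len; i++)
        freq[sample[i]]++;

    int best = 0;
    double best_ce = INFINITY;
    for (int d = 0; d < num_dicts; d++) {
        double ce = 0.0;
        for (int c = 0; c < 256; c++)
            if (freq[c] > 0 && dict_prob[d][c] > 0.0)
                /* Raw counts are used (no division by sample_len):
                   the ranking is unchanged and a division is saved. */
                ce -= (double)freq[c] * log2(dict_prob[d][c]);
        if (ce < best_ce) {
            best_ce = ce;
            best = d;
        }
    }
    return best;
}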
The prefix trees for the dictionaries can be built from the dictionaries saved as separate flat files during dictionary header generation. As the dictionaries are relatively small in size, these prefix trees can provide accurate calculations of the similarity between the input file and the dictionaries.
The function receives, as an argument, a dictionary number, and loads the appropriate dictionary and its size into the corresponding struct.
Additional details regarding compression and decompression techniques can be found in U.S. Pat. No. 10,491,240, titled “Systems and Methods for Variable Length Codeword Based, Hybrid Data Encoding and Decoding Using Dynamic Memory Allocation,” and in U.S. patent application Ser. No. 16/274,417, titled “Systems and Methods for Variable Length Codeword Based Data Encoding and Decoding Using Dynamic Memory Allocation,” the entire contents of each of which are incorporated by reference herein in their entireties.
In some implementations, the memory stores a plurality of maximum-length matched byte sequences including the maximum-length matched byte sequence, and the memory stores a plurality of byte sequence literals including the byte sequence literal. The method can also include encoding data based on the plurality of maximum-length matched byte sequences and the plurality of byte sequence literals.
In some implementations, the one or more additional adjacent bytes from the input data stream occur subsequent to that byte sequence within the input data stream, such that the incrementally expanding each byte sequence includes adding a subsequent one or more additional adjacent bytes from the input data stream to that byte sequence.
In some implementations, the one or more additional adjacent bytes from the input data stream occur prior to that byte sequence within the input data stream, such that the incrementally expanding each byte sequence includes adding a preceding one or more additional adjacent bytes from the input data stream to that byte sequence.
In some implementations, the one or more additional adjacent bytes from the input data stream are a plurality of additional adjacent bytes, the plurality of additional adjacent bytes including at least one byte occurring prior to that byte sequence within the input data stream and at least one byte occurring subsequent to that byte sequence within the input data stream, such that the incrementally expanding each byte sequence includes adding at least one preceding byte and at least one subsequent byte from the input data stream to that byte sequence.
In some implementations, the generating the hash of each byte sequence from the plurality of byte sequences of the input data stream is performed concurrently. The comparing the hash of each byte sequence from the plurality of byte sequences of the input data stream to the hash table to determine whether a match exists can be performed sequentially.
In some implementations, the representation of the maximum-length matched byte sequence is a triad that includes a representation of a length of the maximum-length matched byte sequence, an offset between the maximum-length matched byte sequence and a current byte sequence, and a number of byte literals between the maximum-length matched byte sequence and a previous match associated with the input data stream.
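By way of illustration only, such a triad might be represented as the following C struct; the field widths shown are assumptions rather than a definitive format:

#include <stdint.h>

struct lz_triad {
    uint16_t length;    /* length of the maximum-length match       */
    uint32_t offset;    /* offset back to the matched byte sequence */
    uint16_t literals;  /* byte literals since the previous match   */
};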
In some implementations, the comparing the hash of each byte sequence from the plurality of byte sequences of the input data stream to the hash table to determine whether a match exists is performed in response to detecting that that hash has a length that is greater than a minimum match size.
In some embodiments, a system includes a processor and a memory that is operably coupled to the processor and that stores instructions that, when executed by the processor, cause the processor to perform a method. The method includes, for each byte sequence from a plurality of byte sequences of an input data stream, generating a hash of that byte sequence and comparing the hash to a hash table to determine whether a match exists. The hash table is stored in a memory operably coupled to the processor. If a match exists, that byte sequence is incrementally expanded to include one or more additional adjacent bytes from the input data stream, to produce a plurality of expanded byte sequences. Each expanded byte sequence from the plurality of expanded byte sequences is compared to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences. A representation of the maximum-length matched byte sequence is stored in the memory. If a match does not exist, a representation of that byte sequence is stored as a byte sequence literal in the memory.
In some embodiments, a non-transitory, processor-readable medium stores instructions to perform a method. The method includes, for each byte sequence from a plurality of byte sequences of an input data stream, generating a hash of that byte sequence and comparing the hash to a hash table to determine whether a match exists. The hash table is stored in a memory operably coupled to the processor. If a match exists, that byte sequence is incrementally expanded to include one or more additional adjacent bytes from the input data stream, to produce a plurality of expanded byte sequences. Each expanded byte sequence from the plurality of expanded byte sequences is compared to the hash table to identify a maximum-length matched byte sequence from a set that includes the byte sequence and the plurality of expanded byte sequences. A representation of the maximum-length matched byte sequence is caused to be stored in the memory. If a match does not exist, a representation of that byte sequence as a byte sequence literal is caused to be stored in the memory.
In some implementations, a size of the array is equal to a size of the input bit stream.
In some implementations, the identifying the plurality of potential matches includes performing a predefined number of searches of the array.
In some implementations, the predefined number of searches is based on a predefined parsing quality.
In some implementations, the memory is a first memory and the plurality of byte sequences is a first plurality of byte sequences, the method further comprising storing, in a second memory different from the first memory, a representation of a second plurality of byte sequences that has not been matched to the hash table.
In some implementations, the identifying the plurality of potential matches is performed without reference to the input data stream.
In some implementations, the input data stream is an input bit stream.
In some implementations, the identifying the plurality of potential matches includes iteratively incrementing at least one position from the plurality of positions.
In some embodiments, a system includes a processor and a memory, operably coupled to the processor and storing instructions that, when executed by the processor, cause the processor to perform a method. The method includes generating, for each byte sequence from a plurality of byte sequences of the input data stream, a hash of that byte sequence, to define a plurality of hashes. The method also includes storing, in the memory, an array that includes (1) a plurality of positions, each position from the plurality of positions being a position within the input data stream of a hash from the plurality of hashes, and (2) a last observed position of each hash from the plurality of hashes. The method also includes identifying a plurality of potential matches between the plurality of byte sequences and a hash table based on the array, and calculating a score, from a plurality of scores, for each potential match from the plurality of potential matches. The method also includes selecting a subset of potential matches from the plurality of potential matches, based on the plurality of scores, and storing a representation of the selected subset of potential matches in the memory.
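A minimal sketch of this bookkeeping, assuming 16-bit hashes and illustrative array names, is shown below; for each position it records the previous position at which the same hash was observed, together with the last observed position of each hash, so that chains of potential matches can later be walked and scored:

#include <stdint.h>
#include <stddef.h>

#define HASH_BITS 16
#define UNSEEN    UINT32_MAX   /* sentinel: hash not yet observed */

/* prev_pos[i] records where hashes[i] was last observed before
   position i; last_seen[h] tracks the most recent position of
   hash h. Together they describe chains of potential matches. */
void build_position_arrays(const uint16_t *hashes, size_t n,
                           uint32_t *prev_pos,   /* n entries        */
                           uint32_t *last_seen)  /* 1 << HASH_BITS   */
{
    for (size_t h = 0; h < (size_t)1 << HASH_BITS; h++)
        last_seen[h] = UNSEEN;

    for (size_t i = 0; i < n; i++) {
        prev_pos[i] = last_seen[hashes[i]];   /* chain to prior match */
        last_seen[hashes[i]] = (uint32_t)i;   /* update last observed */
    }
}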
In some implementations, the method also includes encoding the second subset of byte sequences using an encoder having the encoding type.
In some implementations, the method also includes selecting the predefined encoding strategy based on a frequency of occurrence of each character from a plurality of characters of the second subset of byte sequences. Optionally, the method also includes determining the frequency of occurrence of each character from a plurality of characters of the second subset of byte sequences using an unrolled loop.
In some implementations, the selecting the static Huffman tree is performed using principal components analysis (PCA). The predefined encoding strategy can be to prioritize speed of compression.
In some implementations, the selecting the static Huffman tree is performed using a cross-entropy heuristic.
In some implementations, the predefined encoding strategy is to prioritize accuracy over speed of compression.
All combinations of the foregoing concepts and additional concepts discussed here (provided such concepts are not mutually inconsistent) are contemplated as being part of the subject matter disclosed herein. The terminology explicitly employed herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.
The skilled artisan will understand that the drawings primarily are for illustrative purposes, and are not intended to limit the scope of the subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).
To address various issues and advance the art, the entirety of this application (including the Cover Page, Title, Headings, Background, Summary, Brief Description of the Drawings, Detailed Description, Embodiments, Abstract, Figures, Appendices, and otherwise) shows, by way of illustration, various embodiments in which the embodiments may be practiced. The advantages and features of the application are of a representative sample of embodiments only, and are not exhaustive and/or exclusive. Rather, they are presented to assist in understanding and teach the embodiments, and are not representative of all embodiments. As such, certain aspects of the disclosure have not been discussed herein. That alternate embodiments may not have been presented for a specific portion of the innovations or that further undescribed alternate embodiments may be available for a portion is not to be considered to exclude such alternate embodiments from the scope of the disclosure. It will be appreciated that many of those undescribed embodiments incorporate the same principles of the innovations and others are equivalent. Thus, it is to be understood that other embodiments may be utilized and functional, logical, operational, organizational, structural and/or topological modifications may be made without departing from the scope and/or spirit of the disclosure. As such, all examples and/or embodiments are deemed to be non-limiting throughout this disclosure.
Also, no inference should be drawn regarding those embodiments discussed herein relative to those not discussed herein other than it is as such for purposes of reducing space and repetition. For instance, it is to be understood that the logical and/or topological structure of any combination of any program components (a component collection), other components and/or any present feature sets as described in the figures and/or throughout are not limited to a fixed operating order and/or arrangement, but rather, any disclosed order is exemplary and all equivalents, regardless of order, are contemplated by the disclosure.
Various concepts may be embodied as one or more methods, of which at least one example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments. Put differently, it is to be understood that such features may not necessarily be limited to a particular order of execution, but rather, any number of threads, processes, services, servers, and/or the like that may execute serially, asynchronously, concurrently, in parallel, simultaneously, synchronously, and/or the like in a manner consistent with the disclosure. As such, some of these features may be mutually contradictory, in that they cannot be simultaneously present in a single embodiment. Similarly, some features are applicable to one aspect of the innovations, and inapplicable to others.
In addition, the disclosure may include other innovations not presently described. Applicant reserves all rights in such innovations, including the right to embody such innovations, file additional applications, continuations, continuations-in-part, divisionals, and/or the like thereof. As such, it should be understood that advantages, embodiments, examples, functional, features, logical, operational, organizational, structural, topological, and/or other aspects of the disclosure are not to be considered limitations on the disclosure as defined by the embodiments or limitations on equivalents to the embodiments. Depending on the particular desires and/or characteristics of an individual and/or enterprise user, database configuration and/or relational model, data type, data transmission and/or network framework, syntax structure, and/or the like, various embodiments of the technology disclosed herein may be implemented in a manner that enables a great deal of flexibility and customization as described herein.
All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
As used herein, in particular embodiments, the terms “about” or “approximately” when preceding a numerical value indicates the value plus or minus a range of 10%. Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the disclosure. That the upper and lower limits of these smaller ranges can independently be included in the smaller ranges is also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure.
The indefinite articles “a” and “an,” as used herein in the specification and in the embodiments, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the embodiments, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the embodiments, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the embodiments, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the embodiments, shall have its ordinary meaning as used in the field of patent law.
As used herein in the specification and in the embodiments, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
In the embodiments, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
While specific embodiments of the present disclosure have been outlined above, many alternatives, modifications, and variations will be apparent to those skilled in the art. Accordingly, the embodiments set forth herein are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of the disclosure.
Claims
1. (canceled)
2. A non-transitory, processor-readable medium storing instructions to cause a processor to:
- for each byte sequence from a plurality of byte sequences of an input data stream: compare a hash of that byte sequence to a hash table to determine whether a match exists, the hash table stored in a memory operably coupled to the processor; in response to a match existing: incrementally expand that byte sequence to produce a plurality of expanded byte sequences, and compare each expanded byte sequence from the plurality of expanded byte sequences to the hash table to identify a maximum-length matched byte sequence from a set of matched byte sequences that includes the byte sequence and the plurality of expanded byte sequences; and in response to the match not existing: cause storage of a representation of that byte sequence as a byte sequence literal in the memory.
3. The non-transitory, processor-readable medium of claim 2, further storing instructions to cause the processor to encode data based on a plurality of maximum-length matched byte sequences and a plurality of byte sequence literals.
4. The non-transitory, processor-readable medium of claim 2, wherein the instructions to incrementally expand each byte sequence from the plurality of byte sequences include instructions to add, to that byte sequence, a subsequent one or more additional adjacent bytes from the input data stream.
5. The non-transitory, processor-readable medium of claim 2, wherein a representation of the maximum-length matched byte sequence is stored as a triad that includes a representation of a length of the maximum-length matched byte sequence, an offset between the maximum-length matched byte sequence and a current byte sequence, and a number of byte literals between the maximum-length matched byte sequence and a previous match associated with the input data stream.
6. The non-transitory, processor-readable medium of claim 2, wherein the instructions to compare the hash of each byte sequence from the plurality of byte sequences of the input data stream to the hash table to determine whether a match exists include instructions to compare the hash of each byte sequence from the plurality of byte sequences of the input data stream to the hash table in response to detecting that that hash has a length that is greater than a minimum match size.
7. A system, comprising:
- a processor; and
- a memory, operably coupled to the processor and storing instructions that, when executed by the processor, cause the processor to: generate, for each byte sequence from a plurality of byte sequences of an input data stream, a hash of that byte sequence, to define a plurality of hashes; store, in the memory, an array that includes a last observed position within the input data stream of each hash from the plurality of hashes; identify, based on the array, a plurality of potential matches between the plurality of byte sequences and a hash table; calculate a score, from a plurality of scores, for each potential match from the plurality of potential matches; select a subset of potential matches from the plurality of potential matches, based on the plurality of scores; and store, in the memory, a representation of the selected subset of potential matches.
8. The system of claim 7, wherein a size of the array is equal to a size of the input byte stream.
9. The system of claim 7, wherein the memory is a first memory and the plurality of byte sequences is a first plurality of byte sequences, the first memory further storing instructions to cause the processor to store, in a second memory different from the first memory, a representation of a second plurality of byte sequences that has not been matched to the hash table.
10. The system of claim 7, wherein the instructions to identify the plurality of potential matches between the plurality of byte sequences and the hash table based on the array include instructions to identify the plurality of potential matches between the plurality of byte sequences and the hash table at least one of:
- by performing a predefined number of searches of the array;
- without reference to the input data stream; or
- by iteratively incrementing at least one position from the plurality of positions.
11. A non-transitory, processor-readable medium storing instructions to cause a processor to:
- compare each hash from a plurality of hashes to a hash table to identify a plurality of matched hashes associated with a first subset of byte sequences from a plurality of byte sequences, a second subset of byte sequences from the plurality of byte sequences including byte sequences that are not associated with a matched hash from the plurality of matched hashes;
- select a static Huffman tree to encode the second subset of byte sequences, based on a predefined encoding strategy;
- determine whether a result size associated with the selected static Huffman tree is within a predefined percentage of a number of byte sequences in the second subset of byte sequences;
- when the result size is within the predefined percentage of the number of byte sequences in the second subset of byte sequences, set an encoding type to static encoding; and
- when the result size is not within the predefined percentage of the number of byte sequences in the second subset of byte sequences: when the number of byte sequences in the second subset of byte sequences is less than a predefined first threshold value and the result size is less than the predefined first threshold value, set the encoding type to an encoding procedure that is performed based on an inverted index array and a rank table; when at least one of: (1) the number of byte sequences in the second subset of byte sequences is not less than the predefined first threshold value, or (2) the result size is not less than the predefined first threshold value: when a custom prefix is preferable to the selected static Huffman tree, set the encoding type to custom; and when the custom prefix is not preferable to the selected static Huffman tree, set the encoding type to static encoding.
12. The non-transitory, processor-readable medium of claim 11, further storing instructions to cause the processor to encode the second subset of byte sequences using an encoder having the encoding type.
13. The non-transitory, processor-readable medium of claim 11, further storing instructions to cause the processor to select the predefined encoding strategy based on a frequency of occurrence of each character from a plurality of characters of the second subset of byte sequences.
14. The non-transitory, processor-readable medium of claim 11, further storing instructions to cause the processor to:
- select the predefined encoding strategy based on a frequency of occurrence of each character from a plurality of characters of the second subset of byte sequences; and
- determine the frequency of occurrence of each character from a plurality of characters of the second subset of byte sequences using an unrolled loop.
15. The non-transitory, processor-readable medium of claim 11, wherein the instructions to cause the processor to select the static Huffman tree include instructions to select the static Huffman tree using principal components analysis (PCA).
16. The non-transitory, processor-readable medium of claim 11, wherein the predefined encoding strategy is to prioritize speed of compression.
17. The non-transitory, processor-readable medium of claim 11, wherein the instructions to cause the processor to select the static Huffman tree include instructions to select the static Huffman tree using principal components analysis (PCA), and the predefined encoding strategy is to prioritize speed of compression.
18. The non-transitory, processor-readable medium of claim 11, wherein the instructions to cause the processor to select the static Huffman tree include instructions to select the static Huffman tree using a cross-entropy heuristic.
19. The non-transitory, processor-readable medium of claim 11, wherein the predefined encoding strategy is to prioritize accuracy over speed of compression.
20. A method, comprising:
- identifying, via a processor, a first subset of an input data and a second subset of the input data;
- selecting, via the processor, a Huffman tree from a plurality of Huffman trees, based on a predefined encoding level;
- encoding the first subset of the input data using the selected Huffman tree;
- generating a frequency curve for the second subset of the input data, using one of (1) a Riemann Sum or (2) a triple lookup table having a plurality of accuracies; and
- encoding the second subset of the input data based on the frequency curve.
Type: Application
Filed: Mar 14, 2022
Publication Date: Feb 9, 2023
Applicant: Cyborg Inc. (New York, NY)
Inventors: Nicolas Thomas Mathieu DUPONT (New York, NY), Alexandre HELLE (Forest Hills, NY), Glenn Lawrence CASH (Matawan, NJ), Alicja TEXLER (Brooklyn, NY)
Application Number: 17/693,872