Abstract: The Prefix Burrows-Wheeler Transform (“PBWT”) is described, which supports data operations on a data set even after the data set has been compressed. Techniques to set up a PBWT, including an offset table and a prefix table, and techniques to apply data operations to data sets transformed by the PBWT are also described. Data operations include k-Mer substring search. General applications of PBWT techniques, such as plagiarism searches and open source clearance, are described. Bioinformatics applications of the PBWT, such as genomic analysis and genomic tagging, are also described.
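To make the idea of k-Mer substring search over a transformed data set concrete, the following is a minimal sketch that uses a plain Burrows-Wheeler Transform with backward search. It is not the prefix variant named in the abstract: the offset table and prefix table are not reproduced here, and the names `bwt_index` and `count_kmer`, the `$` sentinel, and the naive suffix sort are all assumptions made for illustration.

```python
def bwt_index(text):
    """Build a plain BWT and the lookup tables used by backward search.

    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    text += "$"
    # Naive suffix array: fine for a small example, too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT symbol for each suffix is the character that precedes it (cyclically).
    bwt = "".join(text[i - 1] for i in sa)

    # first[c]: number of symbols in text that sort strictly before c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def count_kmer(index, kmer):
    """Count occurrences of kmer by narrowing a suffix-array interval."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval covering every suffix
    for c in reversed(kmer):  # backward search consumes the k-mer right to left
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("GATTACAGATTACA")
print(count_kmer(index, "ATTA"), count_kmer(index, "TTT"))
```

The search never reconstructs the original text. It only consults the counting tables, which is the property that lets this family of transforms answer substring queries on compressed data.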
Abstract: The current document is directed to a method and system for data processing in cloud-computing environments and other distributed-computing environments. Implementations of a merge sort suitable for sorting data within cloud-computing environments and other distributed-computing environments are disclosed. These implementations take advantage of the massive parallelism available in cloud-computing environments and take into account numerous constraints on data-storage and data-retrieval operations in a cloud-computing environment. The implementations provide a data-sorting method and system that iteratively carries out highly parallel merge-sort operations and that can be applied effectively over a range of data-set sizes, up to extremely large data sets.
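The iterative, parallel structure described above can be sketched on a single machine as follows. This is an illustrative stand-in under stated assumptions, not the disclosed implementation: a thread pool plays the role of the cloud workers, in-memory lists play the role of stored data blocks, and the function names, `chunk_size`, and `workers` parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def merge_pair(pair):
    """Merge two already-sorted runs into one sorted run."""
    left, right = pair
    return list(merge(left, right))


def iterative_parallel_merge_sort(records, chunk_size=4, workers=4):
    """Sort records by sorting chunks in parallel, then merging in rounds."""
    if not records:
        return []
    # Split the input into fixed-size chunks (stand-ins for stored blocks).
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Round 0: every chunk is sorted independently.
        runs = list(pool.map(sorted, chunks))
        # Later rounds: adjacent runs are merged pairwise, all pairs at once.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1])
                     for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(merge_pair, pairs))
            if len(runs) % 2:  # an unpaired run is carried to the next round
                merged.append(runs[-1])
            runs = merged
    return runs[0]


data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0, 11]
print(iterative_parallel_merge_sort(data))
```

Each round halves the number of runs, so the number of rounds grows logarithmically with the number of chunks, while the work inside a round is independent across pairs.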
Abstract: Systems and methods to create a merged lexeme set from a first lexeme set and a second lexeme set, such that an existential lexeme search may be performed via the merged lexeme set on data originally from the first lexeme set and on data originally from the second lexeme set, and wherein the merged lexeme set includes information identifying the lexeme set from which each lexeme originated. Specifically, Prefix Burrows-Wheeler Transform (“PBWT”) systems and techniques are applied to the scenario where a plurality of lexeme sets are merged into a single merged lexeme set. Additionally, applications of PBWT systems and techniques to genome sequence data and k-Mer searches are disclosed.
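The behavior described above, a single merged structure that answers existence queries and reports where each lexeme came from, can be illustrated with an ordinary dictionary. This sketch deliberately does not use a PBWT; it only shows the query contract. The function names, the treatment of k-mers as lexemes, and the set labels are assumptions made for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def merge_lexeme_sets(named_sets):
    """Merge lexeme sets into one mapping: lexeme -> labels of origin sets."""
    merged = {}
    for label, lexemes in named_sets.items():
        for lexeme in lexemes:
            merged.setdefault(lexeme, set()).add(label)
    return merged


def origins(merged, lexeme):
    """Existential search: an empty result means the lexeme is absent."""
    return merged.get(lexeme, set())


merged = merge_lexeme_sets({
    "first": kmers("GATTACA", 3),
    "second": kmers("TTACAGG", 3),
})
print(sorted(origins(merged, "TAC")), sorted(origins(merged, "GAT")))
```

A lexeme present in both inputs maps to both labels, so one lookup answers both "does it exist?" and "in which set did it originate?".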
Abstract: The current document is directed to automated methods and processor-controlled systems for assembling short read symbol sequences into longer assembled symbol sequences that are aligned and compared to a reference symbol sequence in order to determine differences between the longer assembled symbol sequences and the reference sequence. These methods and systems are applied to process electronically stored symbol-sequence data. While the symbol-sequence data may represent genetic-code data, the automated methods and processor-controlled systems may be more generally applied to various different symbol-sequence data. In certain implementations, redundancy in read symbol sequences is used to preprocess the read symbol sequences to identify and correct symbol errors.