PROGRAMS AND FUNCTIONS IN DNA-BASED DATA STORAGE
Systems and methods are provided herein for encoding and storing information in nucleic acids. Encoded information is partitioned and stored in nucleic acids having native key-value pairs that allow for storage of metadata or other data objects. Computation on the encoded information is performed by chemical implementation of if-then-else operations. Numerical data is stored in nucleic acids by producing samples in which the copy counts of nucleic acid sequences correspond to the numerical data. Data objects of a dataset are encoded by partitioning bytes into parts and encoding the parts along distinct libraries of nucleic acids. These libraries can be used as inputs for computation on the dataset.
This application is a continuation of U.S. patent application Ser. No. 17/317,547 filed on May 11, 2021 (allowed), which claims priority to and the benefit of U.S. Provisional Patent Application No. 63/023,071, filed on May 11, 2020 (expired), and entitled “BRANCHING PROGRAMS IN DNA-BASED DATA STORAGE”; U.S. Provisional Patent Application No. 63/023,342, filed on May 12, 2020 (expired), and entitled “BRANCHING PROGRAMS IN DNA-BASED DATA STORAGE”; U.S. Provisional Patent Application No. 63/066,628, filed on Aug. 17, 2020 (expired), and entitled “PROGRAMS AND FUNCTIONS IN DNA-BASED DATA STORAGE”; and U.S. Provisional Patent Application No. 63/165,559, filed on Mar. 24, 2021 (pending), and entitled “PROGRAMS AND FUNCTIONS IN DNA-BASED DATA STORAGE”. The entire contents of the above-referenced applications are incorporated herein by reference.
BACKGROUND
Nucleic acid digital data storage is a stable approach for encoding and storing information for long periods of time, with data stored at higher densities than magnetic tape or hard drive storage systems. Additionally, digital data stored in nucleic acid molecules that are stored in cold and dry conditions can be retrieved as long as 60,000 years later or longer.
One way to access the digital data stored in nucleic acid molecules is to sequence them. As such, nucleic acid digital data storage may be an ideal method for storing data that is not frequently accessed but may have a high volume of information to be stored or archived for long periods of time.
Existing methods of storing data in nucleic acid molecules rely on encoding the digital information (e.g., binary code) into base-by-base nucleic acid sequences, such that the base-to-base relationship in the sequence directly translates into the digital information (e.g., binary code). However, such de novo base-by-base nucleic acid synthesis is error prone and expensive. Moreover, when data is encoded at single base resolution by these existing methods, certain functions cannot be performed on the stored data without first translating the entire set of molecules into the digital information. For example, such functions include basic tasks that are commonly performed when the data is stored on a disk, such as logic functions, addition, subtraction, and query search, including whether or not a specific query pattern occurs in a data set, the number of occurrences, and the location of each occurrence.
SUMMARY
The systems, devices, and methods described herein generally relate to methods for encoding data in nucleic acids. Encoding schemes, data structures, access methods, computational programs, and chemical implementations thereof are provided herein. These techniques allow for encoding of and computation on larger datasets than conventional techniques of nucleic acid-based data storage.
In a first aspect, provided herein is a method for storing digital information into nucleic acid molecules. The method comprises partitioning identifier nucleic acid sequences into blocks, each identifier nucleic acid sequence comprising component nucleic acid sequences, at least a portion of which are configured to bind to one or more probes; allocating a string of symbols to a block of the blocks; mapping the string of symbols to a plurality of identifier nucleic acid sequences within the block; and constructing individual identifier nucleic acid molecules of the plurality of identifier nucleic acid sequences.
In some implementations, said mapping is performed using a codebook that maps words to codewords, wherein said words each comprise one or more symbols of the string, and wherein said codewords each comprise one or more identifier nucleic acid sequences of the identifier nucleic acid sequences to which the string is mapped. There may be a fixed number of codewords per block. A range of identifiers encoding the gth instance of a block can be calculated as the range encoding codeword instances (g−1)*c+1 through g*c.
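The range calculation above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the conversion from codeword instances to identifier ranks assumes that every codeword occupies a fixed, contiguous run of `ids_per_codeword` identifier ranks, which the text does not state.

```python
def codeword_instance_range(g: int, c: int) -> tuple[int, int]:
    """Return the 1-indexed (first, last) codeword instances of the g-th block,
    given a fixed number c of codewords per block."""
    if g < 1 or c < 1:
        raise ValueError("g and c must be positive integers")
    return (g - 1) * c + 1, g * c


def identifier_rank_range(g: int, c: int, ids_per_codeword: int) -> tuple[int, int]:
    """Return the 0-indexed (first, last) identifier ranks of the g-th block.

    Assumption (not from the source): each codeword instance maps to a
    contiguous run of ids_per_codeword identifier ranks.
    """
    first_cw, last_cw = codeword_instance_range(g, c)
    return (first_cw - 1) * ids_per_codeword, last_cw * ids_per_codeword - 1


# Third block, 4 codewords per block, 6 identifiers per codeword.
print(codeword_instance_range(3, 4))
print(identifier_rank_range(3, 4, 6))
```

Because the number of codewords per block is fixed, the location of any block follows from arithmetic on g alone, without a lookup table.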
In some implementations, a block can be accessed with an access program. A block may be associated with a location, wherein said location comprises information for accessing said block. A block may contain information about the location of another block. A block may represent a node in a graph. Said graph may be a tree. For example, said tree is a suffix tree or B-tree. A block may be an element of an inverted index. Multiple blocks may form a linked list wherein a first block points to a second block to accommodate a large string of symbols.
In some implementations, the identifier nucleic acid sequences of each block exclusively share the same component nucleic acid sequences in a specified set of key layers. Said key layers may encode a data object. Said data object may be a native key. The native key may be configured such that each symbol corresponds to a component nucleic acid sequence of a corresponding key layer. A range query over several keys sharing common symbol values at a specified set of positions may be performed by accessing all identifiers containing their corresponding components. A query for all keys that satisfy a finite automaton, or regular expression, may be performed by an access program.
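A native-key range query of the kind described above can be modeled in software as a filter over identifiers. The sketch below is illustrative only: it assumes an identifier is a tuple with one component per layer, that the first three layers are key layers, and that a component is named by the symbol it encodes. The helper names `build_pool` and `range_query` are hypothetical.

```python
from itertools import product


def build_pool(layers):
    """Return every identifier that takes exactly one component from each layer."""
    return set(product(*layers))


def range_query(pool, constraints):
    """Keep identifiers whose component at each constrained key layer matches.

    constraints maps a key-layer index to the required component, standing in
    for a probe that binds that component.
    """
    return {
        ident
        for ident in pool
        if all(ident[layer] == comp for layer, comp in constraints.items())
    }


# Three key layers over symbols {a, b} and one value layer with three components.
layers = [["a", "b"], ["a", "b"], ["a", "b"], ["v0", "v1", "v2"]]
pool = build_pool(layers)

# All keys of the form a?b, regardless of the middle symbol or the stored value.
hits = range_query(pool, {0: "a", 2: "b"})
print(len(pool), len(hits))
```

Leaving a key position unconstrained corresponds to omitting the probe for that layer, so a single access returns every key in the range.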
In some implementations, the method further comprises storing a value in the associated block of each key. The value can be retrieved using its corresponding key. The key may be a function of its value. For example, the key is a hash, bloom filter, structural array, classifier, signature, record, or fingerprint derived from its corresponding value. An access program may be designed to access all keys that are similar to, or share symbol values in common with, a query key derived from a reference value.
In some implementations, a count of the number of keys is determined by measuring DNA concentration. DNA concentration may be measured by qPCR assay, plate reader assay, fluorimetry, or spectrophotometry. DNA concentration may be normalized by a standard to determine the number of identifier sequences in a sample. A relative number of keys containing a particular symbol may be determined by performing qPCR with a probe for the corresponding component.
In some implementations, an identifier sequence for a reference key is created. The identifier sequence may be used as a hybridization probe in a hybridization reaction to search for all keys that are similar. The temperature or pH of the hybridization reaction may be used to control the stringency of the similarity search.
In a second aspect, provided herein is a method for operating on digital information stored in nucleic acid molecules. The method comprises (a) obtaining a first pool of identifier nucleic acid molecules, the pool having powder, liquid, or solid form, each identifier nucleic acid molecule in the first pool comprising component nucleic acid molecules, at least a portion of which are configured to bind to one or more probes, wherein the identifier nucleic acid molecules represent input strings of symbols; (b) screening the identifier nucleic acid molecules in the first pool by targeting at least one of the component nucleic acid molecules with a probe, to create an intermediate pool comprising a subset of identifier nucleic acid molecules from said first pool, wherein the intermediate pool represents a result of an if-then-else operation performed on the input strings; and (c) repeating step (b) wherein the intermediate pool replaces the first pool at every subsequent step until a final pool of identifier nucleic acid molecules is created that represents at least a portion of an output string of symbols.
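Steps (a) through (c) can be modeled as repeated set filtering. The following is a sketch under stated assumptions rather than a description of the chemistry: identifiers are tuples with one component per layer, a probe is represented by a hypothetical `(layer, component, keep_matches)` triple, and `keep_matches=False` stands in for negative selection.

```python
def screen(pool, layer, component, keep_matches=True):
    """One screening step: return the subset of the pool selected by a probe.

    keep_matches=True keeps identifiers containing the component (as in
    enrichment); keep_matches=False keeps the rest (as in negative selection).
    """
    matches = {ident for ident in pool if ident[layer] == component}
    return matches if keep_matches else set(pool) - matches


def run_screens(first_pool, steps):
    """Apply the screening steps in order; each intermediate pool replaces the previous one."""
    pool = set(first_pool)
    for layer, component, keep_matches in steps:
        pool = screen(pool, layer, component, keep_matches)
    return pool


first_pool = {("A1", "B1"), ("A1", "B2"), ("A2", "B1")}
# Keep identifiers carrying A1 in layer 0, then discard those carrying B2 in layer 1.
final_pool = run_screens(first_pool, [(0, "A1", True), (1, "B2", False)])
print(final_pool)
```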
In some implementations, each identifier nucleic acid molecule in the first pool comprises a distinct component nucleic acid sequence from each of M layers, wherein each layer comprises a set of component nucleic acid sequences. Each identifier nucleic acid molecule may represent a data object. A component nucleic acid sequence of the identifier nucleic acid molecule may represent an operand of the data object.
In some implementations, the probe comprises identifier nucleic acid molecules in a pool that include a specific component nucleic acid molecule. In some implementations, the probe is a polymerase chain reaction (PCR) primer, and the if-then-else operation is executed with PCR. In some implementations, the probe is an affinity tagged oligonucleotide, and the if-then-else operation is executed with an affinity pull down assay.
In some implementations, two or more if-then-else operations are performed on one or more pools of identifier nucleic acid molecules in parallel. In some implementations, the method further comprises splitting at least one of the first pool, the intermediate pool, or the final pool into at least two duplicate pools. The method may include replicating the at least one of the first pool, the intermediate pool, or the final pool prior to splitting. For example, replicating is executed with polymerase chain reaction (PCR). The method may include combining at least two intermediate pools of identifier nucleic acid molecules to form a new intermediate pool of identifier nucleic acid molecules or a second pool of identifier nucleic acid molecules.
In some implementations, the repetition of if-then-else operations of (b) in (c) represents execution of a graph program, the output of which places an identifier nucleic acid molecule, representing a data object, into the final pool representing the at least a portion of the output string. Said graph program may represent a function on said data object. In some implementations, the at least a portion of the output string is an output of the function on the data object. The final pool into which an identifier nucleic acid molecule is placed, according to the graph program, may determine the output of the function on the corresponding data object.
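A graph program of this kind can be sketched as a decision graph whose internal nodes are if-then-else tests and whose leaves are final pools. The representation below is an assumption made for illustration: `program` maps a node name to a hypothetical `(layer, component, then_node, else_node)` tuple, any name absent from `program` is treated as a final pool, and the graph is assumed to be acyclic.

```python
def run_graph_program(pool, program, start):
    """Route every identifier through the graph and return {final_pool_name: identifiers}."""
    final_pools = {}
    work = [(start, set(pool))]
    while work:
        node, members = work.pop()
        if not members:
            continue
        if node not in program:
            # Leaf: pools arriving along different paths are combined here.
            final_pools.setdefault(node, set()).update(members)
            continue
        layer, component, then_node, else_node = program[node]
        hit = {ident for ident in members if ident[layer] == component}
        work.append((then_node, hit))
        work.append((else_node, members - hit))
    return final_pools


# Identifiers are (key, bit0, bit1); the program computes bit0 AND bit1.
program = {
    "n0": (1, "1", "n1", "out0"),
    "n1": (2, "1", "out1", "out0"),
}
pool = {("k0", "0", "0"), ("k1", "1", "0"), ("k2", "1", "1"), ("k3", "0", "1")}
result = run_graph_program(pool, program, "n0")
print(sorted(result["out1"]), len(result["out0"]))
```

In this model the name of the final pool that receives an identifier is the value of the function on the corresponding data object, and every identifier in the input pool is evaluated in the same pass.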
In a third aspect, provided herein is a method for storing numerical data in nucleic acids. The method comprises determining an expected copy count of an identifier nucleic acid sequence based on the numerical data and a proportionality constant; and generating a sample containing an actual number of identifier nucleic acid molecules each having the identifier nucleic acid sequence, wherein the actual number approximates the expected copy count.
In some implementations, the numerical data is a number and the expected copy count is proportional to the number. At least a portion of each identifier nucleic acid molecule may be configured to bind to one or more probes. The method may further comprise inputting the sample to an operation to produce an output sample. In some implementations, the operation comprises multiplying the number by a power of 2 by performing a polymerase chain reaction (PCR) with primers that bind to common regions on an edge of the identifier nucleic acid sequence to form the output sample containing a PCR product. For example, the power of 2 corresponds to a number of PCR cycles. In some implementations, the operation comprises multiplying the number by a fraction by performing an aliquot that isolates a fractional volume of the sample to form the output sample. In some implementations, the operation comprises adding the number as a first number to a second number in a second input sample by a mixing operation that combines the sample and the second input sample to form the output sample. The method may further comprise inputting the output sample to a second operation.
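The three operations above compose into simple arithmetic on copy counts. The sketch below is an idealized model with assumptions labeled in the comments: PCR is treated as exact doubling per cycle, aliquoting as exact scaling, and sampling noise is ignored, so the "actual" count equals the expected count. The function names and the constant `PROPORTIONALITY` are illustrative.

```python
PROPORTIONALITY = 1000  # assumed copies per unit of numerical value


def expected_copy_count(number, k=PROPORTIONALITY):
    """Expected copy count is the number times the proportionality constant."""
    return number * k


def pcr(sample, cycles):
    """Idealized PCR: every cycle doubles every copy count (multiply by 2**cycles)."""
    return {seq: copies * 2 ** cycles for seq, copies in sample.items()}


def aliquot(sample, fraction):
    """Idealized aliquot: keep a fractional volume (multiply by fraction)."""
    return {seq: copies * fraction for seq, copies in sample.items()}


def mix(sample_a, sample_b):
    """Mixing two samples adds their copy counts per sequence."""
    out = dict(sample_a)
    for seq, copies in sample_b.items():
        out[seq] = out.get(seq, 0) + copies
    return out


def decode(sample, seq, k=PROPORTIONALITY):
    """Recover the stored number from a copy count."""
    return sample.get(seq, 0) / k


first = {"ID": expected_copy_count(3)}
second = {"ID": expected_copy_count(5)}
# (3 * 2**2) * 0.5 + 5
output = mix(aliquot(pcr(first, 2), 0.5), second)
result = decode(output, "ID")
print(result)  # 11.0
```

In practice amplification efficiency and pipetting error make each step approximate, which is why the method speaks of an actual number that approximates the expected copy count.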
In some implementations, the number is a first element of a vector. In some implementations, the method further comprises determining a second expected copy count of the identifier nucleic acid sequence based on a second element of the vector and the proportionality constant; and generating a second sample containing a second actual number of identifier nucleic acid molecules each having the identifier nucleic acid sequence, wherein the second actual number approximates the second expected copy count. The method may further include performing a linear function on the vector by at least one of PCR, aliquoting, and mixing. This may involve converting a binary vector to a unary value in an output sample by performing the linear function. The linear function may be a scoring function. For example, the scoring function computes a higher output value for target vectors than for non-target vectors, such that copy counts for identifier sequences corresponding to target vectors are enriched in the output sample. Identifier sequences corresponding to target vectors may be determined by sequencing the output sample.
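A linear function on such vectors can be sketched by giving each element sample its own weight and then mixing. This is a model under assumptions, not the disclosed protocol: each weight is realized as a hypothetical `(pcr_cycles, aliquot_fraction)` pair, PCR is idealized as exact doubling, and an element equal to 0 is represented by the identifier being absent from that element's sample.

```python
def linear_function(samples, weights):
    """Return the mixed output sample for score = sum_i w_i * x_i.

    samples[i] maps identifier -> copy count for element i of each vector.
    weights[i] = (pcr_cycles, aliquot_fraction), so w_i = 2**pcr_cycles * aliquot_fraction.
    """
    output = {}
    for sample, (cycles, fraction) in zip(samples, weights):
        for ident, copies in sample.items():
            output[ident] = output.get(ident, 0) + copies * 2 ** cycles * fraction
    return output


COPIES_PER_UNIT = 100  # assumed proportionality constant

# Binary vectors: idA = [1, 0, 1], idB = [0, 1, 0].
samples = [
    {"idA": COPIES_PER_UNIT},   # element 0
    {"idB": COPIES_PER_UNIT},   # element 1
    {"idA": COPIES_PER_UNIT},   # element 2
]
# Weights 1, 2, 4 convert each binary vector to a unary value.
weights = [(0, 1), (1, 1), (2, 1)]
output = linear_function(samples, weights)
print(output)  # {'idA': 500, 'idB': 200}
```

Here idA scores 1 + 4 = 5 and idB scores 2, so the higher-scoring identifier ends up with more copies in the output sample and is more likely to be observed on sequencing.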
In some implementations, a ratio of copy counts between two identifier sequences in the output sample is increased by using a double-stranded DNA selection operation to form a new output sample where identifier sequences corresponding to target vectors are even more enriched in the output sample. The operation, or a repeated application thereof, may correspond to an activation function in a neural network or a quadratic function. The output sample may be allowed to reach equilibrium prior to the double stranded DNA selection operation. The method may involve changing the temperature or adding cofactors to the output sample prior to double stranded DNA selection operation. The double stranded DNA selection operation may involve at least one of chromatography, gel electrophoresis, mass spectrometry, flow cytometry, fluorescent-activated sorting, membrane capture, silica column capture, silica bead capture, or affinity capture.
In some implementations, the vector is a compressed representation of a larger data object, the compressed representation being a hash, a bloom filter, a signature, a structural array, or a fingerprint of the larger data object. The larger data object may be retrieved using the corresponding identifier nucleic acid sequence as a key.
In a fourth aspect, provided herein is a method for preparing a plurality of nucleic acid libraries encoding a dataset. The method comprises providing a dataset comprising at least one data object, each data object having an object-rank and comprising at least one byte-value; dividing each data object into a plurality of parts, wherein the plurality of parts is ranked such that each part of a respective data object comprises: (1) a respective byte-value of the at least one byte-value from the corresponding data object of the part, and (2) a part-rank indicating a position of the respective byte-value of the at least one byte-value of the corresponding data object of the part; and mapping the dataset to a plurality of nucleic acid libraries having a solid, liquid, or powder form, wherein each nucleic acid library has a library-rank and comprises a plurality of nucleic acid molecules that encode parts having the same part-rank from different data objects, wherein the same part-rank corresponds to a respective library-rank, and wherein each nucleic acid molecule comprises a key encoding the respective object-rank and an operand encoding the respective byte-value.
In some implementations, each nucleic acid molecule comprises L components, each component selected from C possible components of a distinct layer of L layers, wherein M of the L components encode the key, and wherein N of the L components encode the operand, such that M+N≤L. The byte-value of the part encoded by each nucleic acid molecule may be stored in the operand of the nucleic acid molecule. In some implementations, each part contains no more than K byte-values, wherein each data object comprises at most T byte-values, and wherein each data object is divided into at most P=⌈T/K⌉ parts. K may be any value less than L×⌊log2 C⌋/8. In some implementations, the dataset is encoded using W nucleic acid libraries, each library containing nucleic acid molecules each having an identifier-rank of 1, . . . , R, wherein the dataset comprises D data objects, wherein R≥D. The jth part of the P parts of the rth data object of the D data objects is encoded in the operand of a nucleic acid molecule of rank in interval [C^N·r, C^N·(r+1)−1] in the jth library.
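The partitioning of the fourth aspect can be sketched in software as follows. The sketch rests on two readings of the text that should be treated as assumptions: that the number of parts is the ceiling of T/K, and that the rank interval for the rth data object runs from C to the power N times r through C to the power N times (r+1), minus 1. Object-ranks and part-ranks are 0-indexed here, and the function names are hypothetical.

```python
import math


def partition(data_objects, K):
    """Split each data object into parts of at most K bytes and group parts by part-rank.

    Returns libraries, where libraries[j] lists (object_rank, part_bytes) for
    every data object that has a j-th part.
    """
    T = max(len(obj) for obj in data_objects)   # largest object size in bytes
    P = math.ceil(T / K)                        # assumed part count
    libraries = [[] for _ in range(P)]
    for r, obj in enumerate(data_objects):
        for j in range(P):
            part = obj[j * K:(j + 1) * K]
            if part:
                libraries[j].append((r, part))
    return libraries


def rank_interval(r, C, N):
    """Identifier ranks reserved for the r-th data object when N operand layers
    of C components each follow the key (assumed interval bounds)."""
    return C ** N * r, C ** N * (r + 1) - 1


libraries = partition([b"abcde", b"xy"], K=2)
print(libraries)
print(rank_interval(2, C=4, N=3))
```

Each library holds the same part-rank from every data object, so an operation applied to one library acts on the corresponding bytes of the whole dataset at once.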
In a fifth aspect, provided herein is a method of retrieving a subset of a data object of a dataset encoded in a plurality of nucleic acid libraries according to the method of the fourth aspect. The method comprises providing a target nucleic acid library and a query nucleic acid library, each library comprising nucleic acid molecules each comprising a key and an operand; extracting keys from nucleic acid molecules in the query nucleic acid library; matching nucleic acid molecules in the target nucleic acid library having keys that match the keys extracted from the query nucleic acid library; and selecting and outputting the matched nucleic acid molecules.
In some implementations, extracting comprises converting each nucleic acid molecule in the query nucleic acid library to a single-stranded nucleic acid molecule, and matching comprises hybridizing nucleic acid molecules in the target nucleic acid library to complementary single-stranded keys. Selecting may involve applying an enzyme that selectively degrades single-stranded nucleic acids. For example, the enzyme is P1.
In some implementations, extracting comprises digesting each nucleic acid molecule in the query nucleic acid library using a sequence-specific enzyme that recognizes a specific sequence found in each nucleic acid molecule. Extracting may further comprise size-selecting keys present in double-stranded form.
In some implementations, extracting comprises introducing a nick between the key and the operand of each nucleic acid molecule using a sequence-specific nicking enzyme, incorporating a labeled nucleotide at the nick, and capturing the labeled nucleotides to retain keys in a single-stranded form. The labeled nucleotide may include a biotin label, in which case capturing comprises affinity capture with streptavidin-coated beads.
In some implementations, extracting comprises selectively amplifying via PCR and purifying the keys from the query nucleic acid library, the keys being flanked by universal sequences that serve as primer binding sites for the PCR.
In some implementations, matching comprises converting the target nucleic acid library and the extracted keys to single-stranded forms and hybridizing single-stranded extracted keys to complementary keys in the target nucleic acid library. Selecting may involve selectively degrading remaining single-stranded molecules. Selecting may alternatively involve gel electrophoresis.
In some implementations, the query nucleic acid library encodes a first set of parts of the dataset and the target nucleic acid library encodes a second set of parts of the dataset. In this case, the extracting step acts as a first if-then-else operation on the dataset, and the matching step and the selecting and outputting step act as a second if-then-else operation on the dataset.
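The extract, match, and select steps of the fifth aspect resemble a join on keys. The following sketch models each nucleic acid molecule as a hypothetical `(key, operand)` pair and ignores the chemistry (hybridization, degradation, size selection) entirely.

```python
def retrieve(target_library, query_library):
    """Return target molecules whose key also appears in the query library."""
    # Extracting: collect the keys present in the query library.
    query_keys = {key for key, _operand in query_library}
    # Matching and selecting: keep target molecules with a matching key.
    return [(key, operand) for key, operand in target_library if key in query_keys]


# The query library holds the objects that survived an earlier selection;
# the target library holds a different part of every object.
query_library = [("obj1", 0x41), ("obj3", 0x41)]
target_library = [("obj1", 0x10), ("obj2", 0x20), ("obj3", 0x30)]
matched = retrieve(target_library, query_library)
print(matched)
```

Here the contents of the query library reflect a condition on one part of each object, and the join carries that condition over to a different part of the same objects, which is how two if-then-else operations chain across libraries.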
In some aspects, provided herein is a system configured to perform any of the methods described herein. The system may be a printer-finisher system configured to dispense DNA components at discrete locations (e.g., reaction compartments) on a substrate, dispense reagents that provide optimal conditions for the ligation reaction, and pool all of the DNA identifiers that comprise a library. The system may store and manipulate nucleic acid molecules in containers (e.g., via automated liquid handling). The system may dispense probes into compartments or containers to access subsets of nucleic acid molecules. The system may be configured to aliquot and replicate pools of nucleic acid molecules.
In some aspects, provided herein is a composition including nucleic acid molecules representing digital information according to any of the methods described herein. The composition includes identifier nucleic acid molecules comprising component nucleic acid molecules. Identifier nucleic acid molecules may be collected in a pool and mapped to digital information. For example, the presence of an identifier indicates a particular bit or symbol value in a string of symbols, and the absence of an identifier indicates another bit or symbol value in the string of symbols.
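The presence/absence mapping in the example above can be sketched for a bit string as follows. The sketch assumes that the identifier at rank i represents position i of the string, that presence means '1', and that absence means '0'; other mappings are equally possible.

```python
def encode_bits(bits):
    """Return the set of identifier ranks to construct: one per '1' bit."""
    return {position for position, bit in enumerate(bits) if bit == "1"}


def decode_bits(pool, length):
    """Rebuild the bit string from which identifier ranks are present in the pool."""
    return "".join("1" if position in pool else "0" for position in range(length))


pool = encode_bits("01101")
print(sorted(pool))           # [1, 2, 4]
print(decode_bits(pool, 5))   # 01101
```

The string length has to be known to the decoder, since trailing zeros leave no identifiers behind.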
The foregoing and other objects and advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
To provide an overall understanding of the systems, methods, and devices described herein, certain illustrative embodiments will be described. Although the embodiments and features described herein are specifically described for use in connection with particular encoding schemes utilizing nucleic acid molecules, it should be understood that certain methods, techniques, or schemes may be applied to other applications of nucleic acid-based data storage or other chemical-based systems.
Base-by-base synthesis of nucleic acids for encoding digital information can be costly and time consuming, because it generally requires a de novo base-by-base synthesis of distinct nucleic acid sequences (e.g., phosphoramidite synthesis) for every new information storage request. The present disclosure relates to systems and methods that do not rely on base-by-base or de novo synthesis, and instead encode the digital information in a plurality of identifiers, or nucleic acid sequences, that include combinatorial arrangements of components (or component nucleic acid sequences). In this manner, the systems and methods of the present disclosure improve the efficiency and commercial viability of digital information storage.
The present disclosure describes methods that produce a first set of distinct nucleic acid sequences (or components) for the first request of information storage, and can thereafter reuse the same nucleic acid sequences (or components) for subsequent information storage requests. These approaches significantly reduce the cost of DNA-based information storage by reducing the role of de novo synthesis of nucleic acid sequences in the information-to-DNA encoding and writing process.
Moreover, unlike implementations of base-by-base synthesis, such as phosphoramidite chemistry-based or template-free polymerase-based nucleic acid elongation, which use cyclical delivery of each base to each elongating nucleic acid, the systems and methods of the present disclosure relate to information-to-DNA writing using identifier construction from components, which is a highly parallelizable process that does not necessarily use cyclical nucleic acid elongation. Thus, the present disclosure increases the speed of writing digital information to DNA compared to other methods. Various systems and methods of writing digital information into nucleic acid molecules are described in U.S. Pat. No. 10,650,312 entitled “NUCLEIC ACID-BASED DATA STORAGE”, filed Dec. 21, 2017 (describing encoding digital information in DNA); U.S. application Ser. No. 16/461,774 entitled “SYSTEMS FOR NUCLEIC ACID-BASED DATA STORAGE”, filed May 16, 2019 and published as U.S. Publication No. 2019/0362814 (describing encoding schemes for DNA-based data storage); U.S. application Ser. No. 17/012,909, entitled “CHEMICAL METHODS FOR NUCLEIC ACID-BASED DATA STORAGE”, filed Sep. 4, 2020 and published as U.S. Publication No. 2021/0079382 (describing chemical techniques and instruments for implementing various encoding schemes); U.S. application Ser. No. 16/414,758 entitled “COMPOSITIONS AND METHODS FOR NUCLEIC ACID-BASED DATA STORAGE”, filed May 16, 2019 and published as U.S. Publication No. 2020/0193301 (describing encoding schemes, partitioning, and logic gates); U.S. application Ser. No. 16/414,752 entitled “PRINTER-FINISHER SYSTEM FOR DATA STORAGE IN DNA”, filed May 16, 2019 and published as U.S. Publication No. 2019/0351673 (describing an assembly for producing encoded nucleic acid libraries); U.S. application Ser. No. 16/532,077 entitled “SYSTEMS AND METHODS FOR STORING AND READING NUCLEIC ACID-BASED DATA WITH ERROR PROTECTION”, filed Aug. 5, 2019 and published as U.S. Publication No. 2020/0185057 (describing data structures and error protection and correction); and U.S. application Ser. No. 16/872,129 entitled “DATA STRUCTURES AND OPERATIONS FOR SEARCHING, COMPUTING, AND INDEXING IN DNA-BASED DATA STORAGE”, filed May 11, 2020 and published as U.S. Publication No. 2020/0357483 (describing advanced data structures and protocols for access, rank, count, search, and extract operations), each of which is hereby incorporated by reference in its entirety.
The following description begins with an overview of various systems and methods for encoding data in nucleic acid molecules, and describes various writing and archival systems configured to print and store nucleic acid molecules that encode digital data, as described in relation to
Nucleic acids are also advantageous for physical storage of numerical values by approximation with copy counts of nucleic acid sequences, as described in relation to
Generally, the present disclosure encodes data (which is represented by a string of one- or zero-bits, or by a string of symbols, where each symbol is selected from a set of more than two symbol values) into a set of identifier nucleic acid sequences (or identifier sequence), where each unique identifier sequence has a corresponding bit or symbol in the string. The identifier sequence encodes the bit or symbol's position in the string, its value, or both the position and value. One way to implement the systems and methods of the present disclosure is to create each identifier nucleic acid molecule (or identifier molecule), which is represented by an identifier sequence, by ligating premade DNA component molecules (represented by component sequences) in an ordered manner that is based on defined layers, as is discussed in relation to
Generally, a component nucleic acid sequence is configured to bind one or more probes that can be used to select for all identifiers comprising said sequence. For example, a component may comprise a target sequence of 20 bases and a probe may comprise a complementary 20 base oligonucleotide for binding the target sequence. As described in the present disclosure, the composition of identifier nucleic acid sequences from components, each of which is capable of binding a unique probe, offers beneficial features when it comes to accessing and operating on the stored data. Though the methods of generating identifiers presented herein are especially configured to generate identifiers comprising components, it should be understood that such identifier nucleic acid molecules may be formed through a number of alternative methods. For example, de novo synthesis that generates nucleic acid sequences of length 100 bases can be used to create identifier nucleic acid sequences wherein each identifier comprises five components of 20 bases each. If all combinations of bases are available for synthesis, there may be up to 4^20 possible sequences for each component.
The term “symbol,” as used herein, generally refers to a representation of a unit of digital information. Digital information may be divided or translated into a string of symbols. In an example, a symbol may be a bit and the bit may have a value of ‘0’ or ‘1’.
The term “distinct,” or “unique,” as used herein, generally refers to an object that is distinguishable from other objects in a group. For example, a distinct, or unique, nucleic acid sequence may be a nucleic acid sequence that does not have the same sequence as any other nucleic acid sequence. A distinct, or unique, nucleic acid molecule may not have the same sequence as any other nucleic acid molecule. The distinct, or unique, nucleic acid sequence or molecule may share regions of similarity with another nucleic acid sequence or molecule.
The term “component,” as used herein, generally refers to a nucleic acid sequence or nucleic acid molecule. A component may comprise a distinct nucleic acid sequence. A component may be concatenated or assembled with one or more other components to generate other nucleic acid sequence or molecules.
The term “layer,” as used herein, generally refers to a group or pool of components. Each layer may comprise a set of distinct components such that the components in one layer are different from the components in another layer. Components from one or more layers may be assembled to generate one or more identifiers.
The term “identifier,” as used herein, generally refers to a nucleic acid molecule or a nucleic acid sequence that represents the position and value of a bit-string within a larger bit-string. More generally, an identifier may refer to any object that represents or corresponds to a symbol in a string of symbols. In some implementations, identifiers may comprise one or multiple concatenated components.
The term “combinatorial space,” as used herein, generally refers to the set of all possible distinct identifiers that may be generated from a starting set of objects, such as components, and a permissible set of rules for how to modify those objects to form identifiers. The size of a combinatorial space of identifiers made by assembling or concatenating components may depend on the number of layers of components, the number of components in each layer, and the particular assembly method used to generate the identifiers.
The term “identifier rank,” as used herein, generally refers to a relation that defines the order of identifiers in a set.
The term “identifier library,” as used herein, generally refers to a collection of identifiers corresponding to the symbols in a symbol string representing digital information. In some implementations, the absence of a given identifier in the identifier library may indicate a symbol value at a particular position. One or more identifier libraries may be combined in a pool, group, or set of identifiers, for example, having solid, liquid, or powder form. Each identifier library may include a unique barcode that identifies the identifier library.
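As an illustration of these two definitions, the sketch below enumerates a combinatorial space and assigns ranks. It assumes one particular assembly rule that the definitions do not mandate: each identifier takes exactly one component from each layer in a fixed layer order, and identifiers are ranked lexicographically by component index. The function names are hypothetical.

```python
from itertools import product


def combinatorial_space(layers):
    """List every identifier formed by taking one component from each layer, in order."""
    return list(product(*layers))


def rank_of(identifier, layers):
    """Rank of an identifier under lexicographic order (a mixed-radix number
    whose digits are the component indices within each layer)."""
    rank = 0
    for component, layer in zip(identifier, layers):
        rank = rank * len(layer) + layer.index(component)
    return rank


layers = [["a0", "a1"], ["b0", "b1", "b2"]]
space = combinatorial_space(layers)
print(len(space))                       # 6 = 2 * 3
print(rank_of(("a1", "b2"), layers))    # 5
```

Under this rule the size of the space is the product of the layer sizes, so a modest number of premade components addresses a large number of identifiers.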
The term “probe,” as used herein, generally refers to an agent that binds a target sequence on an identifier nucleic acid molecule. The target sequence can be a portion of a component. The probe may comprise a sequence that matches or is the complement of its target sequence. The probe may be further used to isolate all identifier nucleic acid molecules comprising said target sequence. For example, the probe may be a primer in a PCR reaction that enriches all identifier nucleic acid molecules comprising a target sequence. Alternatively, the probe may be an affinity tagged oligonucleotide molecule that can be used to select all identifier nucleic acid molecules with a sequence that corresponds to said oligonucleotide. Probes may also be used for negative selection. For example, an affinity tagged probe can be used to remove all identifiers containing a particular target sequence. Alternatively, a probe may comprise an active nuclease, such as Cas9, that cleaves or digests all identifiers containing a particular target sequence.
The term “nucleic acid,” as used herein, generally refers to deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a variant thereof. A nucleic acid may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof. A nucleotide can include A, C, G, T, or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be A, C, G, T, or U, or any other subunit that may be specific to one or more complementary A, C, G, T, or U, or complementary to a purine (i.e., A or G, or variant thereof) or pyrimidine (i.e., C, T, or U, or variant thereof). In some examples, a nucleic acid may be single-stranded or double-stranded. In some cases, a nucleic acid is circular.
The terms “nucleic acid molecule” or “nucleic acid sequence,” as used herein, generally refer to a polymeric form of nucleotides, or polynucleotide, that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof. The term “nucleic acid sequence” refers to the alphabetical representation of a polynucleotide that defines the order of nucleotides; the term “nucleic acid molecule” refers to a physical instance of the polynucleotide itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for mapping nucleic acid sequences or nucleic acid molecules to symbols, or bits, encoding digital information. Nucleic acid sequences or oligonucleotides may include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
An “oligonucleotide”, as used herein, generally refers to a single-stranded nucleic acid sequence, and is typically composed of a specific sequence of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T), or uracil (U) in place of thymine when the polynucleotide is RNA.
Examples of modified nucleotides include, but are not limited to, diaminopurine, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)uracil, (acp3)w, 2,6-diaminopurine and the like. Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone. Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxy succinimide esters (NHS).
The term “primer,” as used herein, generally refers to a strand of nucleic acid that serves as a starting point for nucleic acid synthesis, such as polymerase chain reaction (PCR). In an example, during replication of a DNA sample, an enzyme that catalyzes replication starts replication at the 3′-end of a primer attached to the DNA sample and copies the opposite strand.
The term “polymerase” or “polymerase enzyme,” as used herein, generally refers to any enzyme capable of catalyzing a polymerase reaction. Examples of polymerases include, without limitation, a nucleic acid polymerase. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3′ to 5′ exonuclease activity, and variants, modified products and derivatives thereof.
Digital information, such as computer data, in the form of binary code can comprise a sequence or string of symbols. A binary code may encode or represent text or computer processor instructions using, for example, a binary number system having two binary symbols, typically 0 and 1, referred to as bits. Digital information may be represented in the form of non-binary code which can comprise a sequence of non-binary symbols. Each encoded symbol can be re-assigned to a unique bit string (or “byte”), and the unique bit string or byte can be arranged into strings of bytes or byte streams. A bit value for a given bit can be one of two symbols (e.g., 0 or 1). A byte, which can comprise a string of N bits, can have a total of 2^N unique byte-values. For example, a byte comprising 8 bits can produce a total of 2^8 or 256 possible unique byte-values, and each of the 256 bytes can correspond to one of 256 possible distinct symbols, letters, or instructions which can be encoded with the bytes. Raw data (e.g., text files and computer instructions) can be represented as strings of bytes or byte streams. Zip files, or compressed data files comprising raw data, can also be stored in byte streams. These files can be stored as byte streams in a compressed form, and then decompressed into raw data before being read by the computer.
It is to be understood that the terms “index” and “position” are used interchangeably in the present disclosure, and it is to be understood that both terms are used to refer to a specific element or entity of an ordered collection, such as a list or a string. For example, an index or position may be used to specify an element in an array, vector, string, or data structure. Index/position notation uses a numbering scheme to assign nominal numbers to each entry/entity. The examples in the present disclosure often use a first index/position of 0, known in the art as zero-based numbering. The first position (also referred to as the zero-th position) of an array/string is denoted by 0 for purposes of computation involving the specific position. A set of length n would have a numbering scheme of 0, 1, . . . , n−1. It is to be understood that other numbering schemes may be used in the systems and methods described herein. For example, a numbering scheme may start at 1 and continue to n for a set of length n.
The present disclosure describes methods in relation to the figures of the application. It is to be understood that these methods, including computational steps, are configured to be performed in DNA. Methods and systems of the present disclosure may be used to encode computer data or information in a plurality of identifiers, each of which may represent one or more bits of the original information. In some examples, methods and systems of the present disclosure encode data or information using identifiers that each represent two bits of the original information.
Writing Digital Data into Nucleic Acids
Identifiers (nucleic acid molecules) have nucleic acid sequences that can be used to encode digital information, such as a string of symbols. Identifiers are formed by assembling components (nucleic acid molecules). Components may be configured to bind a probe (as described in the foregoing), and these components configured as such are “addressable components.” All components described herein may be addressable components.
In some embodiments, the identifiers may be comprised entirely of addressable components. The addressable components may be assembled to form an identifier or they may be introduced into an identifier sequence through subtractive or substitution approaches. Alternatively, they may be incorporated into a nucleic acid identifier by de novo synthesis. Different writing methods vary in speed and cost. They can also vary in the number of possible components that can be incorporated into an identifier.
Enzymatic reactions may be used to assemble components from the different layers or sets. Assembly can occur in a one pot reaction because components (e.g., nucleic acid sequences) of each layer have specific hybridization or attachment regions for components of adjacent layers. For example, a nucleic acid sequence (e.g., component) X1 from layer X, a nucleic acid sequence Y1 from layer Y, and a nucleic acid sequence Z1 from layer Z may form the assembled nucleic acid molecule (e.g., identifier) X1Y1Z1. Additionally, multiple nucleic acid molecules (e.g., identifiers) may be assembled in one reaction by including multiple nucleic acid sequences from each layer. The one reaction may involve self-assembly of components into identifiers.
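The one-pot assembly described above can be modeled in software as a Cartesian product over the components supplied from each layer. The following is a minimal sketch under that assumption; the component labels (X1, Y1, Z1, and so on) stand in for nucleic acid sequences, and the function name assemble is hypothetical.

```python
from itertools import product

def assemble(components_per_layer):
    """Return every identifier formed by joining one component from each layer, in layer order."""
    return ["".join(parts) for parts in product(*components_per_layer)]

# One component per layer yields a single identifier.
single = assemble([["X1"], ["Y1"], ["Z1"]])

# Several components in a layer yield multiple identifiers in one reaction.
multiple = assemble([["X1", "X2"], ["Y1"], ["Z1", "Z2"]])

print(single)
print(multiple)
```

In this model, the number of identifiers produced by one reaction is the product of the number of components supplied from each layer.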
Identifiers may be constructed in accordance with the product scheme using overlap extension polymerase chain reaction (OEPCR), as illustrated in
Identifiers may be assembled in accordance with the product scheme using sticky end ligation, as illustrated in
Identifiers may be assembled in accordance with the product scheme using site specific recombination, as illustrated in
Identifiers may be constructed in accordance with the product scheme using template directed ligation (TDL), as shown in
For affinity-tag based access, a process which may be referred to as nucleic acid capture, the components that constitute the identifiers in a pool may share complementarity with one or more probes. The one or more probes may bind or hybridize to the identifiers to be accessed. The probe may comprise an affinity tag. The affinity tags may bind to a bead, generating a complex comprising a bead, at least one probe, and at least one identifier. The beads may be magnetic, and together with a magnet, the beads may collect and isolate the identifiers to be accessed. The identifiers may be removed from the beads under denaturing conditions prior to reading. Alternatively, or in addition, the beads may collect the non-targeted identifiers and sequester them away from the rest of the pool, which can be washed into a separate vessel and read. The affinity tag may bind to a column. The identifiers to be accessed may bind to the column for capture. Column-bound identifiers may subsequently be eluted or denatured from the column prior to reading. Alternatively, the non-targeted identifiers may be selectively targeted to the column while the targeted identifiers may flow through the column. Accessing the targeted identifiers may comprise applying one or more probes to a pool of identifiers simultaneously or applying one or more probes to a pool of identifiers sequentially.
For degradation based access, the components that constitute the identifiers in a pool may share complementarity with one or more degradation-targeting probes. The probes may bind to or hybridize with distinct components on the identifiers. The probe may be a target for a degradation enzyme, such as an endonuclease. In an example, one or more identifier libraries may be combined. A set of probes may hybridize with one of the identifier libraries. The set of probes may comprise RNA and the RNA may guide a Cas9 enzyme. A Cas9 enzyme may be introduced to the one or more identifier libraries. The identifiers hybridized with the probes may be degraded by the Cas9 enzyme. The identifiers to be accessed may not be degraded by the degradation enzyme. In another example, the identifiers may be single-stranded and the identifier library may be combined with a single-strand specific endonuclease(s), such as the S1 nuclease, that selectively degrades identifiers that are not to be accessed. Identifiers to be accessed may be hybridized with a complementary set of identifiers to protect them from degradation by the single-strand specific endonuclease(s). The identifiers to be accessed may be separated from the degradation products by size selection, such as size selection chromatography (e.g., agarose gel electrophoresis). Alternatively, or in addition, identifiers that are not degraded may be selectively amplified (e.g., using PCR) such that the degradation products are not amplified. The non-degraded identifiers may be amplified using primers that hybridize to each end of the non-degraded identifiers and therefore not to each end of the degraded or cleaved identifiers.
With each iteration of PCR-based access on an identifier library, the identifiers may become shorter as primers are designed to bind components iteratively further inward from each edge. For example, an identifier library may comprise identifiers of the form A-B-C-D-E-F-G, where A, B, C, D, E, F, and G are layers. Upon amplifying with primers that bind particular components, for example, A1 and G1 in layers A and G respectively, the amplified portion of the identifier library may take on the form A1-B-C-D-E-F-G1. Upon further amplifying with primers that bind particular components, for example, B1 and F1 in layers B and F respectively, the amplified portion of the identifier library may take on the form B1-C-D-E-F1, where it may be assumed that these shorter amplified sequences correspond to full identifiers that further comprise component A1 in the position of layer A and G1 in the position of layer G.
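The shortening effect of iterative PCR-based access can be illustrated with a small simulation. This is a sketch only: identifiers are modeled as tuples of component labels, and the function pcr_access and the example library are hypothetical stand-ins for the chemical process.

```python
def pcr_access(library, fwd, rev):
    """Keep molecules containing both primer-binding components, trimmed to the amplified span."""
    amplified = []
    for molecule in library:
        if fwd in molecule and rev in molecule:
            start, end = molecule.index(fwd), molecule.index(rev)
            if start < end:
                # Only the region between (and including) the two primer sites is copied.
                amplified.append(molecule[start:end + 1])
    return amplified

library = [
    ("A1", "B1", "C1", "D1", "E1", "F1", "G1"),
    ("A1", "B2", "C1", "D1", "E1", "F1", "G1"),
    ("A2", "B1", "C1", "D1", "E1", "F1", "G1"),
    ("A1", "B1", "C2", "D1", "E1", "F2", "G1"),
]

round1 = pcr_access(library, "A1", "G1")  # form A1-B-C-D-E-F-G1, full length
round2 = pcr_access(round1, "B1", "F1")   # form B1-C-D-E-F1, two layers shorter
print(round2)
```

Each molecule in round2 is understood to have carried A1 and G1 before the second amplification, even though those components are no longer physically present.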
Each identifier in a combinatorial space can comprise a fixed number of N components, where each component comes from a distinct layer in a set of N layers and is one of a set of possible components in said layer. Each component can be specified by a coordinate (j, Xj) where j is the label of the layer and Xj is the label of the component within the layer. For said scheme with N layers, j is an element of the set {1, 2, . . . , N} and Xj is an element of the set {1, 2, . . . , Mj}, where Mj is the number of components in layer j. A logical order to the layers can be defined. A logical order to each component within each layer can also be defined. This labeling can be used to define a logical ordering to all possible identifiers in the combinatorial space. For example, first the identifiers can be sorted according to the order of the components in layer 1, and then subsequently according to the order of the components in layer 2, and so on, as shown by example in
The logical ordering of the identifiers can be further used to allocate and order digital information. Digital information can be encoded in nucleic acids that comprise each identifier, or it can be encoded in the presence or absence of the identifiers themselves. For example, a codebook can be created that encodes 4 bits of information in every contiguous grouping of 4 identifiers. In this example, the codebook could map each possible string of 4 bits to a unique combination of 4 identifiers (since there are 16 possible combinations of 4 identifiers, it is possible to store up to log2(16)=4 bits of data). As another example, a codebook can be created that encodes 6 bits of data in every contiguous group of 8 identifiers. In this example, the codebook could map each possible string of 6 bits to a unique subset of 4 out of the 8 identifiers (since there are 8 choose 4=70 such subsets, it is possible to store up to floor(log2(70))=6 bits of data). These identifier combinations may be referred to as codewords, and the data that they encode may be referred to as words. Adjacent words within data may be stored in adjacent codewords among the logically ordered identifiers.
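The logical ordering and the second codebook example can be sketched as follows. The rank function assumes 1-based component labels sorted by layer 1 first, then layer 2, and so on. The codebook assigns 6-bit words to 4-of-8 subsets in lexicographic order, which is one hypothetical assignment among many; the function names are illustrative and not taken from the disclosure.

```python
from itertools import combinations, product
from math import comb, floor, log2

def identifier_rank(coords, sizes):
    """Zero-based rank of an identifier from its 1-based component label in each layer."""
    rank = 0
    for x, m in zip(coords, sizes):
        rank = rank * m + (x - 1)
    return rank

GROUP, WEIGHT = 8, 4                                # 8 identifiers per codeword, 4 present
SUBSETS = list(combinations(range(GROUP), WEIGHT))  # 8 choose 4 = 70 subsets
assert len(SUBSETS) == comb(GROUP, WEIGHT)
BITS = floor(log2(len(SUBSETS)))                    # 6 bits per codeword

def encode_word(bits):
    """Map a 6-bit word to the offsets (0..7) of the identifiers to include in the codeword."""
    return SUBSETS[int(bits, 2)]

def decode_word(present):
    """Recover the 6-bit word from the offsets of the identifiers found present."""
    index = SUBSETS.index(tuple(sorted(present)))
    if index >= 2 ** BITS:
        raise ValueError("subset is not assigned to any word in this codebook")
    return format(index, f"0{BITS}b")

# Ranks follow the layer-by-layer sort order for layers of size 2, 3, and 2.
sizes = (2, 3, 2)
ranks = [identifier_rank(c, sizes) for c in product(range(1, 3), range(1, 4), range(1, 3))]
print(ranks)
print(encode_word("101101"), decode_word(encode_word("101101")))
```

Only 64 of the 70 subsets are used; the remaining 6 are left unassigned in this sketch and could, for example, serve to flag errors on decoding.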
Prior to encoding into codewords, ordered identifiers can be partitioned into blocks, where each block can contain multiple codewords. If multiple physical containers are used to store the nucleic acids, for example test tubes or PCR tubes, then multiple blocks can be partitioned using the same identifier space but in separate containers.
For example, one can choose to partition an identifier space into adjacent blocks of fixed size c codewords. In this way, the range of identifiers encoding the g-th instance of a block can be inferred as the range encoding codeword instances (g−1)*c+1 through g*c. These particular identifiers can then be readily accessed through chemical access programs which access only identifiers that share specified sets of components in common. These access programs work by selectively targeting and subsequently enriching, or selecting, identifiers with said specified sets of components using probes, for example with primers and PCR or affinity tagged oligonucleotides and affinity pull-down assays. Probes can be applied in a series of selection reactions, as in
Even though blocks of fixed length have finite storage capacity, the information contained within a block can refer to a subsequent block or multiple blocks from which to receive additional data. In this way, linked lists can be implemented in blocks of identifiers. Moreover, blocks can represent trees or graphs. The structure of the tree or graph can be implicit by prescribing blocks that belong to different nodes of said tree or graph. For example, a first block can be reserved for a root node in a B-tree, a second and third block for the second level nodes, and so on. The value of each node can be encoded in the block. In this way, one can traverse a path in a tree to satisfy a query by accessing a block, decoding its node information, and using said information to calculate a subsequent block to access and decode (i.e., a subsequent node to which to travel). Alternatively, the tree may have no prescribed structure besides a block for a starting node. From this block, and each block thereafter, the location of all blocks encoding neighboring nodes may be encoded along with the information within each node.
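As an illustration of fixed-size blocks that point to one another, the sketch below computes the codeword range of the g-th block using the (g−1)*c+1 through g*c rule and follows a chain of blocks as a linked list. The block contents, the block size of 16 codewords, and the function names are hypothetical; in practice each lookup would correspond to a chemical access followed by decoding.

```python
def codeword_range(g, c):
    """Codeword instances (1-based, inclusive) covered by the g-th block of c codewords."""
    return (g - 1) * c + 1, g * c

# Hypothetical decoded block contents: block number -> (payload, next block or None).
BLOCKS = {
    1: ("first ", 4),
    4: ("second ", 2),
    2: ("third", None),
}

def read_linked_blocks(start, c=16):
    """Follow the next-block pointers, recording which codeword ranges are accessed."""
    data, accessed, g = "", [], start
    while g is not None:
        payload, next_block = BLOCKS[g]
        accessed.append(codeword_range(g, c))
        data += payload
        g = next_block
    return data, accessed

data, accessed = read_linked_blocks(1)
print(data)
print(accessed)
```

The same pattern extends to trees: instead of a single next-block pointer, each decoded block would carry the locations of its child blocks, and the query would choose which one to access next.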
This structure can be used to create a trie, for example a suffix tree. A suffix tree can be used to look for patterns within a string of data. In a suffix tree, the blocks are configured to represent nodes of a suffix tree. The suffix tree is a trie, where every path from the root node to a leaf represents a suffix of a symbol string S. The edges of the trie represent substrings of symbols that comprise each path. A suffix tree can be represented in an identifier nucleic acid library where every block corresponds to a node in a suffix tree and contains information about its daughter nodes. For example, information about daughter nodes comprises the substrings of symbols that comprise the edges leading to each daughter node, and the locations of the blocks containing those daughter nodes. The root node can be a prescribed block, like the first block. Querying for the membership, count, or location of a pattern involves following a path along the suffix tree by accessing identifiers of the block corresponding to the root node, decoding the information contained therein, determining the location of the next block based on the information contained therein and the query pattern, and continuing the process until either no downstream blocks (or nodes) are left that will satisfy the corresponding query or until a leaf node is reached. In the former case, the query pattern does not exist in the string S. In the latter case, the blocks corresponding to the leaf node can be configured to contain the count or locations of the query pattern.
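The block-by-block query described above can be sketched with a simplified structure. The example below builds an uncompressed suffix trie (one symbol per edge, rather than the substring-labeled edges of a compact suffix tree) and, as a further simplifying assumption, stores the matching start positions in every block rather than only at leaves. All names are illustrative.

```python
def build_blocks(s):
    """Build an uncompressed suffix trie as a list of blocks; blocks[0] is the root block."""
    blocks = [{"children": {}, "positions": []}]
    for start in range(len(s)):
        node = 0
        for symbol in s[start:]:
            children = blocks[node]["children"]
            if symbol not in children:
                blocks.append({"children": {}, "positions": []})
                children[symbol] = len(blocks) - 1
            node = children[symbol]
            # Record that the suffix starting at 'start' passes through this node.
            blocks[node]["positions"].append(start)
    return blocks

def query(blocks, pattern):
    """Return the start positions of pattern in s, accessing one block per pattern symbol."""
    node = 0
    for symbol in pattern:
        next_block = blocks[node]["children"].get(symbol)
        if next_block is None:
            return []  # no downstream block satisfies the query
        node = next_block
    return blocks[node]["positions"]

blocks = build_blocks("banana")
print(query(blocks, "ana"), len(query(blocks, "ana")))
```

Membership, count, and location queries all follow the same path; they differ only in which part of the final block's information is reported.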
Other approaches for looking for patterns include FM-index approaches and inverted indexes, each of which has block-based implementations using identifier nucleic acids. In an inverted index, there can be a block for each possible substring of a fixed length over a symbol alphabet that comprises a string S of symbols. Each block can contain information about the starting positions of the corresponding substring within S. The block IDs can correspond to the substrings, or they can correspond to the positions of the sorted substrings, such that the location of a block corresponding to a substring can be ascertained without additional information. A query pattern can be mapped to substrings of the inverted index that comprise it, and the blocks can be accessed and decoded. The positional information contained therein can be used to determine the count and locations of the query pattern within S. The inverted index need not be limited to substrings of fixed length. For example, it can also be used to represent the positions of words in a document, or across multiple documents.
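A fixed-length inverted index of the kind described above can be sketched as a mapping from each length-k substring to its starting positions, with a query answered by intersecting position lists shifted by each substring's offset within the pattern. The dictionary stands in for the blocks, and the function names and the choice of covering substrings are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(s, k):
    """Map each length-k substring of s to the list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return dict(index)

def find(index, k, pattern):
    """Locate a pattern of length >= k by intersecting offset-adjusted position lists."""
    if len(pattern) < k:
        raise ValueError("pattern is shorter than the indexed substring length")
    # Cover the pattern with substrings at offsets 0, k, 2k, ... plus one ending at its last symbol.
    offsets = list(range(0, len(pattern) - k + 1, k))
    if offsets[-1] != len(pattern) - k:
        offsets.append(len(pattern) - k)
    candidates = None
    for off in offsets:
        starts = {p - off for p in index.get(pattern[off:off + k], [])}
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates)

index = build_inverted_index("abracadabra", 2)
print(find(index, 2, "abra"))
```

Only the blocks for the substrings that cover the pattern need to be accessed and decoded, and the count of the pattern is the length of the returned list.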
Suitable data structures and search methods, such as the FM-index approach, are described in U.S. application Ser. No. 16/532,077 entitled “SYSTEMS AND METHODS FOR STORING AND READING NUCLEIC ACID-BASED DATA WITH ERROR PROTECTION”, filed Aug. 5, 2019 and published as U.S. Publication No. 2020/0185057 (describing data structures and error protection and correction); and U.S. application Ser. No. 16/872,129 entitled “DATA STRUCTURES AND OPERATIONS FOR SEARCHING, COMPUTING, AND INDEXING IN DNA-BASED DATA STORAGE”, filed May 11, 2020 and published as U.S. Publication No. 2020/0357483 (describing advanced data structures and protocols for access, rank, count, search, and extract operations), each of which is hereby incorporated by reference in its entirety.
Native Key-Value Stores with Parallel Queries
In some embodiments, the information needed to access a block of interest may be natively configured to a data object, such that the container of the block and the access program needed to retrieve it may be directly inferred from said data object without the necessity of accessing additional blocks. The data object may be referred to as a native key, the information contained within the corresponding block as a value, and their combination as a native key-value (nkv) pair. In other words, a native key is the information encoded by an identifier sequence that may be used to access the identifier sequence, whereas the value, in this context, is additional data that may be associated with the key and accessed with it.
The native key may be configured such that it is a symbol string with one position for each key layer, and one symbol value for each component in each key layer. For example, if there are 8 key layers, and they each comprise 2 components, then the native key can be a string of 8 bits. As another example, if there are 16 key layers of 256 components each, then the native key can be a string of 16 ASCII characters. Each symbol in the native key is an operand because it maps to an addressable component. A query formulation over the keys can be specified as a regular expression composed of logical AND and OR operators and a primitive match-component operator, which is orchestrated in the form of a directed acyclic graph (DAG), the access program. The access program will specify the order in which chemical reactions will be executed over various containers containing the identifiers. For example, a range query can be performed for all keys containing a particular symbol value in the 2nd and 4th position by performing a series of two probe-based reactions, one selecting for the corresponding component in the 2nd key layer and one selecting for the corresponding component in the 4th key layer. The accessed identifier blocks will contain only the values associated with the valid native keys. In this way, one can build a native key-value store where query operations are performed over all keys simultaneously, without necessitating additional index information.
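The query formulation above can be mimicked in software to show how AND, OR, and match-component compose. In this sketch, a query is a nested tuple (a tree, which is a simple special case of a DAG), AND is modeled as selections applied in series, and OR as selections run in parallel on the same input and then pooled. The representation and the function run are hypothetical, and layer positions are zero-based.

```python
def run(query, library):
    """Evaluate a query over a library given as a dict of native key -> value."""
    op = query[0]
    if op == "match":   # one probe-based selection on one key layer
        _, layer, symbol = query
        return {k: v for k, v in library.items() if k[layer] == symbol}
    if op == "and":     # selections applied in series
        for sub in query[1:]:
            library = run(sub, library)
        return library
    if op == "or":      # selections applied in parallel, outputs pooled
        pooled = {}
        for sub in query[1:]:
            pooled.update(run(sub, library))
        return pooled
    raise ValueError(f"unknown operator: {op}")

library = {"0101": "a", "1100": "b", "0011": "c", "1111": "d"}

# Keys with symbol 1 in the 2nd and 4th positions (zero-based layers 1 and 3).
both = run(("and", ("match", 1, "1"), ("match", 3, "1")), library)
# Keys with symbol 1 in the 1st or 3rd position.
either = run(("or", ("match", 0, "1"), ("match", 2, "1")), library)
print(sorted(both), sorted(either))
```

No index over the keys is consulted at any point; every selection acts on all keys in the library at once, which mirrors the parallel nature of the chemical operations.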
This method may be especially useful for storing native keys that are high dimensional and wherein common queries may be formulated as symbol selection operations. Such queries over such keys would be difficult to satisfy on conventional media, as the high dimensionality of the keys would make them impractical to index. But in identifier nucleic acids, architected as described above, the query could be fast and low cost. Such a method could be useful, for example, if the key is a Bloom filter, fingerprint, hash, or structural array of its corresponding data object and the query is one of symbol selection, or similarity or membership with respect to a reference object.
In one embodiment, the native keys could be records in a data table, the key layers can correspond to columns in the table, and the components of each layer can each correspond to a particular column value. Records can be queried using SQL-type commands to be compiled into an access program to be performed in DNA and select all identifier sequences that comprise a specified set of components. In some instances, the identifier sequence may be comprised entirely of key layers. In some applications, queries may involve counting the number of native keys that satisfy a certain property. Counting can be performed by sequencing, but it can be performed more natively with bulk readout methods such as fluorescence or absorbance. For example, one could determine the total amount of identifier molecules in a sample by doing qPCR with common edge primers or by using spectrophotometry, fluorimetry, gel electrophoresis, or plate reader assays. Because the expected number of identifier molecules per identifier sequence should be uniform, the bulk readout can be normalized by a standard sample (e.g., a sample comprised of a known number of identifier sequences) to find the number of unique identifier sequences (or the count), which can in turn be used to calculate a count of a number of native keys.
In some embodiments, the absolute count may not be of interest, and instead a relative count may be desired. For example, qPCR with a component-specific primer (or probe) can be performed to find the relative fraction of identifier sequences in a sample that contain a particular component. The fraction may be relative to the entire population of identifier sequences in the sample or it may be relative to another subpopulation of identifier sequences that contain a different component. This type of data may be especially useful if the native keys are records, and the components of corresponding identifier sequences correspond to field values that comprise the records. One could then perform quick and effective queries of the form “find the number of records that contain data value x” or, as a more complex example, “of the records that contain data value x, find the fraction that contain data value y.” For example: of all the car accidents that happened in 1998, what fraction occurred in September? Bulk readout methods for absolute and relative counting may be faster and cheaper than DNA sequencing.
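The normalization behind absolute and relative counting reduces to simple ratios once a bulk readout has been converted to a quantity proportional to the number of molecules. The sketch below assumes such a linear signal and a uniform number of molecules per identifier sequence; the function names and the numerical values are hypothetical and not measured data.

```python
def absolute_count(sample_signal, standard_signal, standard_count):
    """Estimate the number of unique identifier sequences in a sample from a bulk readout."""
    signal_per_identifier = standard_signal / standard_count
    return round(sample_signal / signal_per_identifier)

def relative_fraction(subpopulation_signal, reference_signal):
    """Fraction of the reference population that contains a particular component."""
    return subpopulation_signal / reference_signal

# A standard with 1000 known identifier sequences reads 50.0; the sample reads 12.5.
count = absolute_count(12.5, 50.0, 1000)

# Readout for 1998 records versus the subset of those that also carry the September component.
fraction = relative_fraction(0.5, 8.0)
print(count, fraction)
```

The relative query needs no standard at all, since both readouts come from the same sample and their ratio cancels the per-identifier signal.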
Because native DNA keys, for example records, may comprise components that represent data values, it follows that native keys that are semantically similar (share data in common) will also be similar in sequence. Such a property may be useful for performing similarity queries. For example, a query may be specified to find all records in a database that are similar to a reference record. The reference record may be converted into its corresponding identifier sequence and used as a probe in an identifier library to find all similar identifier sequences (corresponding to similar records). The reference identifier sequence may contain an affinity tag (e.g., biotin) such that it (and its binding partners) can be retrieved with an affinity pull-down assay (e.g., with immobilized streptavidin). The reference identifier sequence will be more likely to bind to other identifier sequences that share more sequences in common. The stringency of this binding, and therefore the stringency of the similarity query, can be controlled by using temperature and additives, such as salts. Increased temperature will increase the stringency and make it less likely that less similar identifier sequences will be returned by the query. The retrieved identifier sequences can be sequenced to determine their identity. Alternatively, the retrieved identifier sequences can be counted using the bulk readout methods described above. Such native similarity search may be less precise than conventional methods, but may be a lot faster and cheaper, especially for large databases of keys.
More broadly, any finite automata computation can be performed over native keys using access programs. This is because finite automata can be represented as regular expressions, and regular expressions can be reduced to the logical operators that comprise our query formulation. Generally, native key symbols can correspond to operands, and any logic can be performed over those operands with an access program. In an embodiment, native key symbols can correspond to numbers, and arithmetic can be performed over those numbers with an access program. For example, two 8-bit numbers that comprise native keys can be multiplied, and keys can be selected such that the product is between 412 and 500. Though this type of computation enables querying native keys using finite automata, it is not limited to this application, nor is it limited to Boolean output. Presented in the following section is a broader framework for multi-output computation over data objects comprised of operands.
Computing with Graph Programs (If-Then-Else)
An identifier library can be used to perform computations in parallel across multiple data objects in a data set. A computation derives the outputs related to the given inputs using a specified set of derivation rules. The set of all correctly related input-output pairs may be defined by some function, say ƒ, which maps inputs, say X and Y, to their correct outputs, say Z and W. In this way, a computation implements a function ƒ if and only if when given some inputs (X, Y), it derives the same outputs as ƒ(X, Y).
To translate this tabular specification of ƒ into a ruleset amenable to DNA computing, a directed graph program is constructed and will direct the chemical operations to be performed on identifiers. Knuth's “The Art of Computer Programming Volume 4A: Combinatorial Algorithms Part 1” (which is hereby incorporated by reference in its entirety) provides theoretical details of directed graph programs, but here the process is adapted to the present disclosure's nucleic acid-based approach. The nodes in this graph are identifier libraries and arrows emanating from a node represent an operation performed on the library. Each node is labeled first by a bit string and later in the process by an input symbol. Each arrow is labeled first by an input of the function and later by a symbol of the alphabet. The graph program is constructed in two steps.
In
The rule is reapplied to the new node 01011010, yielding two new nodes, 0101 and 1010. The first new node 0101, however, factors into 01 and 01, so the “square” rule is applied and the node labeled 01 is created instead. Because this is the third level of node creation, the arrows are labeled with X3 as shown. Similar rules are applied to the node labeled 1010 to arrive at the final graph shown for output L. Using the same ruleset, graphs are constructed for all three outputs C, H, and L as shown in
The ITE gate operates as follows. The gate is presented with an input library of identifiers. These identifiers are tested by the gate by examining the component in layer t of each identifier. If the component in layer t of an identifier is c, and c is defined by the programmer to be a member of the set Ljt, then that identifier is sorted into the output library labeled Ljt. In this way, an ITE gate testing layer t classifies identifiers in the input library into output libraries based on the component in layer t of each identifier.
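The sorting behavior of an ITE gate can be illustrated with a short software model. This is a sketch under assumptions: identifiers are represented as tuples of component indices (one per layer), the programmer-defined sets are passed as a dictionary from output-library label to a set of components, and the names `ite_gate` and `branch_sets` are hypothetical.

```python
from collections import defaultdict


def ite_gate(library, layer, branch_sets):
    """Sort identifiers into output libraries by the component found in `layer`.

    branch_sets maps an output-library label to the set of components
    that the programmer routes to that library.
    """
    outputs = defaultdict(list)
    for identifier in library:
        component = identifier[layer]
        for label, members in branch_sets.items():
            if component in members:
                outputs[label].append(identifier)
    return dict(outputs)


library = [(0, 1, 2), (1, 1, 0), (2, 0, 1)]
print(ite_gate(library, 0, {"L1": {0, 2}, "L2": {1}}))
```

An identifier whose tested component belongs to none of the sets appears in no output library in this model.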
The computation proceeds in the following way. Consider the graph program constructed to compute output L in
From the output libraries constructed by the three graph programs shown in
Data Storage and Computation with Copy Number Manipulation
In the preceding, strategies were presented for storing data using identifier sequences and computing on the data by manipulating the presence or absence of identifier sequences. Alternatively, or in addition to this method of data storage and compute, data can also be stored in the number of molecules, or copy number, of a particular identifier sequence in an identifier library. For example, an entire identifier sequence can correspond to a key and the value of the key can correspond to the copy count of the identifier sequence in one or more identifier libraries.
Copy count may be difficult to control and may be inherently imprecise depending on the method used to encode it, manipulate it, and measure it. To circumvent these limitations, one may encode data values with expected copy count (ECC) instead of actual copy count and accept that the representation is approximate or fuzzy. Although copy count is discrete, ECC is continuous and thus enables representation of more values within a range of copy counts. An analog mapping can be defined such that numerical data values are represented by a proportional ECC. For example, if the constant of proportionality is 100, then the value 2 would be represented by an ECC of 200, the value 2.5 would be represented by an ECC of 250, the value 2.755 would be represented by an ECC of 275.5, and so on. Such an encoding scheme is referred to as “analog encoding.”
The benefit of analog encoding is that it enables computation with native sample operations such as mixing, aliquoting, and PCR. Table 1 provides an overview of these sample operations and their logic. Mixing two samples, s1 and s2, performs addition where the two numbers being added are the respective numerical values of an identifier sequence in both samples. Aliquoting from a sample, s1, performs multiplication by a number less than one (a “fraction”) where the fraction corresponds to the fractional volume of the aliquot (the volume of the aliquot divided by the volume of the sample s1 from which it was derived). PCR performs multiplication by a power of 2 where the power of 2 corresponds to the number of PCR cycles. For example, one PCR cycle corresponds to multiplying by 2, two PCR cycles correspond to multiplying by 4, three PCR cycles correspond to multiplying by 8, and so on.
For samples containing multiple identifier sequences, the logic of each sample operation is performed over all identifier sequences in parallel. Moreover, the sample operations described above are composable as they take samples as inputs and produce samples as outputs that can be further operated on. For example, multiplication by an arbitrary positive number can be accomplished by performing PCR for y cycles and aliquoting a fractional volume z, since any positive number can be represented by z·2^y for some fraction z and some integer y. Because of this, any arbitrary linear function with positive coefficients can be represented by composing these sample operations.
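The composition of sample operations can be sketched numerically. The model below is an idealization under assumptions: a sample is a dictionary from identifier to expected copy count (ECC), PCR is treated as exact doubling per cycle, and the helper names (`mix`, `aliquot`, `pcr`, `scale`) are hypothetical. The decomposition of a coefficient into a fraction and a power of 2 uses `math.frexp`, which is one convenient choice among many valid ones.

```python
import math


def mix(s1, s2):
    """Mixing adds the ECCs of each identifier across the two samples."""
    return {k: s1.get(k, 0.0) + s2.get(k, 0.0) for k in set(s1) | set(s2)}


def aliquot(sample, fraction):
    """Taking a fractional volume multiplies every ECC by that fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return {k: v * fraction for k, v in sample.items()}


def pcr(sample, cycles):
    """Idealized PCR doubles every ECC once per cycle."""
    return {k: v * 2 ** cycles for k, v in sample.items()}


def scale(sample, c):
    """Multiply every ECC by a positive number c, written as z * 2**y."""
    if c <= 0:
        raise ValueError("only positive multipliers are supported")
    z, y = math.frexp(c)          # c == z * 2**y with 0.5 <= z < 1
    if y < 0:                     # c < 0.5: no PCR needed, aliquot only
        z, y = c, 0
    return aliquot(pcr(sample, y), z)


sample = {"id_A": 200.0}          # value 2 at a proportionality constant of 100
print(scale(sample, 5))           # 3 PCR cycles, then a 0.625 aliquot
```

Because every operation maps a sample to a sample, the functions can be chained in any order, mirroring the composability described above.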
In an embodiment, one can encode a vector x=[x1, x2, . . . , xN] by storing the value of each element in a different sample, but in the same identifier sequence across each sample. A linear function (or linear combination) c1x1+c2x2+ . . . +cNxN can be applied to the vector by computing each multiplication term with a combined aliquot-PCR operation over the respective sample, and then adding the terms by mixing all of the output samples together.
Not only can each multiplicative term and each addition in a linear function be performed in parallel, but the function itself is performed over multiple vectors simultaneously if multiple identifier sequences exist in the samples, each encoding their respective vector. For example, with a trillion identifier sequences, one can encode up to a trillion vectors. If each vector is 100 elements, then this can be encoded across 100 samples.
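The parallel evaluation of a linear function over many vectors can be sketched as follows. This is an illustrative model, assuming that sample i holds the ECC of element xi for every identifier, that the per-sample multiplication stands in for the combined aliquot-PCR operation, and that accumulation stands in for mixing. The function name `linear_function` is hypothetical.

```python
def linear_function(samples, coeffs):
    """Return identifier -> ECC of sum_i coeffs[i] * x_i.

    samples[i] maps each identifier to the ECC encoding element x_i of the
    vector stored under that identifier.
    """
    if len(samples) != len(coeffs):
        raise ValueError("one coefficient is needed per sample")
    output = {}
    for sample, c in zip(samples, coeffs):
        if c < 0:
            raise ValueError("sample operations realize only non-negative coefficients")
        for identifier, ecc in sample.items():
            # scaling models aliquot + PCR; the running sum models mixing
            output[identifier] = output.get(identifier, 0.0) + c * ecc
    return output


# Two identifiers, each encoding a 3-element vector across 3 samples.
samples = [
    {"v1": 100.0, "v2": 0.0},
    {"v1": 0.0, "v2": 100.0},
    {"v1": 100.0, "v2": 100.0},
]
print(linear_function(samples, [2, 0.5, 1]))
```

Every identifier present in the samples is scored in the same pass, which corresponds to the function being applied to all encoded vectors simultaneously.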
The method of digital to analog conversion illustrated in
The sample operations above cannot perform linear functions with negative coefficients. If the data is represented in binary, then a term with a negative coefficient in a linear function can be converted to a term with a positive coefficient using the following formula: −cixi=ci(1−xi)−ci. The (1−xi) is equivalent to x′i, the complement of xi when xi is a bit. So it follows that any negative term in a linear function (over binary data) can be converted to a positive term by using the complement of the binary digit in the term. For example, c1x1−c2x2−c3x3 can be converted to c1x1+c2x′2+c3x′3−(c2+c3). It follows that any mixed-sign linear function can be converted to a positive sign linear function minus a constant. In order to perform this linear function conversion on a binary dataset (without prior knowledge of which terms in the linear function will be negative), the complement of the data must be stored along with the original data. In other words, for each sample s created with the presence of a set of identifier sequences A and the absence of another set of identifier sequences B there must be a complement sample s′ created as well that contains the presence of the set of identifier sequences in set B and the absence of identifier sequences in set A. Once the linear function is in a form with positive sign terms, it can be executed using sample operations described above (see Table 1 and
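The conversion of a mixed-sign linear function over bits into a positive-sign function minus a constant can be checked with a short sketch. The helper names are hypothetical, and the code is a software check of the identity −c·x = c·(1−x) − c rather than a model of the chemistry.

```python
from itertools import product


def to_positive_form(coeffs):
    """Return (positive coefficients, complement flags, constant to subtract)."""
    positive = [abs(c) for c in coeffs]
    use_complement = [c < 0 for c in coeffs]
    constant = sum(-c for c in coeffs if c < 0)
    return positive, use_complement, constant


def evaluate(coeffs, bits):
    """Evaluate the original mixed-sign linear function."""
    return sum(c * x for c, x in zip(coeffs, bits))


def evaluate_positive(positive, use_complement, constant, bits):
    """Evaluate the converted form, using the complement bit where flagged."""
    total = sum(p * ((1 - x) if flag else x)
                for p, flag, x in zip(positive, use_complement, bits))
    return total - constant


coeffs = [3, -2, -5]
form = to_positive_form(coeffs)
for bits in product((0, 1), repeat=len(coeffs)):
    assert evaluate(coeffs, bits) == evaluate_positive(*form, bits)
print(form)
```

For the coefficients 3, −2, and −5, the subtracted constant is 7, the sum of the magnitudes of the negative coefficients.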
The linear functions applied to the data can be models designed to identify target vectors that have a certain property. For example, a linear function can be designed to convert target vectors into a high score and non-target vectors into a low score. Because those scores are represented by ECCs, identifier sequences encoding higher scores will be more abundant in an output sample than identifier sequences encoding lower scores, and therefore random sampling from the output sample with DNA sequencing will be more likely to return identifier sequences whose vectors score higher in the model. The vectors themselves may be compressed representations of larger data objects that are stored elsewhere. For example, they may be hashes, bloom filters, signatures, fingerprints, or structural arrays derived from the original data objects. The original data objects for high scoring vectors can be retrieved using the associated identifier sequence (or a derivative thereof) as a key.
Because the operations described herein perform linear functions, they may be limited in their ability to enrich target identifier sequences (representing target data objects) over non-target identifier sequences. Subsequent application of non-linear functions, such as activation functions, may further enrich target sequences. This is akin to how neural networks function to create complex behavior, such as classification. Generally in neural networks, a layer of “neurons” calculates a weighted sum of inputs (equivalent to linear functions), and then those values (the weighted sums at each neuron in the layer) are added together after applying a non-linear function to each of them. The non-linear function in this context is called an activation function as it “activates” the signal from certain neurons. The non-linearity is crucial for activation because it effectively suppresses signal from neurons with output values that fall below a certain threshold.
The formation of double stranded DNA from complementary single strands is a dimerization process and therefore has a super-linear dependence on the concentration of DNA molecules. Eqns. 1-8 provide a model for dimerization of single stranded DNA (species Z) to form double stranded DNA (species Y). Mass action kinetics are used to derive the concentration of double stranded DNA at steady state ([Y]) as a function of the total amount of DNA [X].
Eqn. 1 represents the equilibrium reaction of dimerization, governed by the rate of double stranded DNA degradation into single strands (δ) and the rate of duplexing (α). Eqn. 2 represents the rate of change in the concentration of double stranded DNA [Y] based on the kinetics of Eqn. 1. At steady state, Eqn. 2 is equal to zero as shown. At steady state, solving Eqn. 2 for the concentration of double stranded DNA [Y] yields Eqn. 3. The parameter K is the ratio of the rate of double stranded DNA degradation into single strands (δ) to the rate of duplexing (α) (in short, K=δ/α). The total amount of DNA [X] is equal to the sum of the amounts of single stranded DNA [Z] and double stranded DNA [Y], as shown in Eqn. 4. Substituting Eqn. 4 into Eqn. 3 for [Z] gives Eqn. 5. Expanding Eqn. 5 produces the quadratic equation Eqn. 6 in standard quadratic form, which can be solved for [Y] to get Eqn. 7. The quadratic has two solutions, but one solution has [Y]>[X], which is not possible, so only the feasible solution is shown in Eqn. 7.
Eqn. 7 is non-linear but has different behaviors in different regimes. For [X]>>K, [Y] is approximately linear in [X]. When the total amount of DNA [X] is much less than the constant K ([X]<<K), [Y] is quadratic in [X] and Eqn. 7 may be approximated by Eqn. 8 by using a second order approximation.
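One set of equations consistent with the description above is written out below. It is a reconstruction from the prose, assuming the reaction Z + Z ⇌ Y with duplexing rate α and degradation rate δ, and assuming the numbering matches Eqns. 1-8 as referenced in the text.

```latex
\begin{align}
Z + Z \;&\underset{\delta}{\overset{\alpha}{\rightleftharpoons}}\; Y \tag{1}\\
\frac{d[Y]}{dt} &= \alpha [Z]^2 - \delta [Y] = 0 \tag{2}\\
[Y] &= \frac{[Z]^2}{K}, \qquad K = \frac{\delta}{\alpha} \tag{3}\\
[X] &= [Z] + [Y] \tag{4}\\
[Y] &= \frac{\left([X] - [Y]\right)^2}{K} \tag{5}\\
[Y]^2 - \left(2[X] + K\right)[Y] + [X]^2 &= 0 \tag{6}\\
[Y] &= [X] + \frac{K}{2} - \frac{1}{2}\sqrt{K^2 + 4K[X]} \tag{7}\\
[Y] &\approx \frac{[X]^2}{K} \qquad \left([X] \ll K\right) \tag{8}
\end{align}
```

The other root of Eqn. 6 takes the plus sign before the square root and therefore exceeds [X], which is why it is discarded. Eqn. 8 follows from expanding the square root in Eqn. 7 to second order in [X]/K, where the constant and linear terms cancel.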
This non-linearity can be exploited to increase the copy count difference between identifier sequences in a sample. It can favorably enrich identifier sequences with higher copy counts, thus effectively suppressing identifier sequences with lower copy counts, much like an activation function in a neural network. This in turn will make identifier sequences with higher copy counts more likely to be sequenced if the sequencing process only covers a small sample size of the total sample. The non-linear function can also be applied repeatedly. Literature (such as Livni, R. et al. “On the Computational Efficiency of Training Neural Networks”, arXiv:1410.1141 [cs.LG]; which is hereby incorporated by reference in its entirety) has shown that quadratic equations like the one demonstrated in Eqn. 7 can be used effectively as activation functions in a neural network when applied repeatedly. The quadratic approximation equation from Eqn. 8 or the entire exact equation can be used to train a neural network model. This model can then be applied to data stored in DNA using the operations described above.
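The enrichment effect can be illustrated numerically. The sketch below assumes the steady-state model described above (Z + Z ⇌ Y with [X] = [Z] + [Y]), from which the feasible root of the quadratic is [Y] = [X] + K/2 − ½·sqrt(K² + 4K[X]); it also assumes each identifier sequence can be treated independently. The function names and the numerical values are hypothetical.

```python
import math


def duplex(x, K):
    """Steady-state double-stranded amount for total amount x (feasible root)."""
    return x + K / 2 - math.sqrt(K * K + 4 * K * x) / 2


def duplex_quadratic(x, K):
    """Second-order approximation, valid when x is much smaller than K."""
    return x * x / K


K = 1000.0
low, high = 10.0, 20.0            # two identifiers differing 2-fold in amount
ratio_before = high / low
ratio_after = duplex(high, K) / duplex(low, K)
print(ratio_before, ratio_after)  # the ratio grows to nearly 4-fold
```

In the regime where the total amount is far below K, a 2-fold difference in copy count becomes close to a 4-fold difference after one round of double-stranded selection, and repeated rounds would compound the effect in this idealized model.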
In order to faithfully exploit the non-linear relationship in Eqn. 7, it is necessary to perform a sample operation that selectively extracts double stranded DNA from a sample. In one embodiment this may be accomplished with chromatography strategies like gel electrophoresis. In another implementation, this may be accomplished by mass spectrometry. In another implementation, this may be accomplished by flow cytometry or another fluorescent sorting technique using double-stranded DNA specific dye. In another implementation, this may be accomplished with membrane capture or affinity capture techniques, such as with silica beads or columns. Prior to double stranded DNA selection the sample must be given time to equilibrate (reach steady state) as per the model assumptions in Equations 1-8. The non-linear regime occurs with [X]<<K, which can be forced by diluting the sample to make [X] smaller. Alternatively, one could make K larger, for example by increasing the temperature of the sample or adding substances that interfere with duplexing.
Encoding Via Longitudinal Striping
In
In this section, identifiers, as described in relation to
This aspect enables retrieval of any substripe of any subset of data objects, given any query substripe of the desired subset. An example of this retrieval method is shown in
- 1. Key extraction: The key layers of each identifier in query X are isolated from the operand layers of the identifier.
- 2. Key matching: Identifiers from a target library Lj whose key layers match the key layers isolated from X are selected and output as substripe Y.
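The two steps above can be modeled in software to make the intended selection explicit. This is a sketch under assumptions: identifiers are represented as tuples of components, the positions of the key layers are supplied as indices, and the function names are hypothetical. It models only the logical outcome, not the hybridization or nuclease chemistry.

```python
def extract_keys(query, key_layers):
    """Key extraction: keep only the key-layer components of each identifier."""
    return {tuple(identifier[i] for i in key_layers) for identifier in query}


def match_keys(target, keys, key_layers):
    """Key matching: select target identifiers whose key layers match a key."""
    return [identifier for identifier in target
            if tuple(identifier[i] for i in key_layers) in keys]


# Layers 0 and 1 hold the key; layer 2 holds the operand.
key_layers = (0, 1)
query_x = [(0, 1, 7), (2, 2, 4)]
target_lj = [(0, 1, 3), (1, 1, 9), (2, 2, 5)]

keys = extract_keys(query_x, key_layers)
substripe_y = match_keys(target_lj, keys, key_layers)
print(substripe_y)
```

The output substripe contains the operands stored in the target library for exactly those data objects that appear in the query.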
In some implementations, key extraction is performed by first converting each identifier in X to a single-stranded form. The nucleic acids corresponding to all possible components that can occur in the key layers are obtained in a single stranded form complementary to that of X, and are hybridized to the single-stranded form of X. This ensures that the key layers are double-stranded whereas the operand layers are in single-stranded form. This mixture is subjected to a single-stranded DNA degrading nuclease such as P1 (as shown in
In some implementations, key extraction is performed by digesting each identifier in X with a restriction endonuclease that recognizes a specific sequence that is found in all identifiers between the key and operand layers, followed by size selection to retain the key layers of the substripe X in double-stranded form.
In some implementations, key extraction is performed by introducing a nick between the key and operand layers with a sequence-specific nicking enzyme, incorporating labeled nucleotides with DNA polymerase I at the nick site, melting the strands to produce single-stranded DNA, and performing affinity capture against the labeled nucleotides to retain the key layers of the substripe X in single-stranded form. In one implementation, the label may be biotin and affinity capture performed with streptavidin-coated beads.
In some implementations, key extraction is performed by selectively amplifying and purifying the key layers. Here, the identifiers are designed such that the key layers are flanked by universal sequences that serve as primer binding sites for PCR. The amplified DNA is purified to remove operand sequences, resulting in the retention of key layers of the substripe X in double-stranded form.
In some implementations, key matching is performed by first converting the target library L1 and the keys extracted into complementary single-stranded forms. The nucleic acids corresponding to all possible components that can occur in the operand layers of the target library are obtained in single-stranded form complementary to that of the target library, and together with the single-stranded keys, are hybridized to the complementary single-stranded target library. This ensures that identifiers in the target library whose key layers match one of the extracted keys will be in double-stranded form, whereas an identifier whose key layers do not match any of the extracted keys will be in a partially single-stranded form. Its key layers will be partially or completely single stranded. This mixture is subjected to a single-stranded nucleic acid degrading nuclease such as P1. This degrades the key layers of non-matching identifiers in the target library, leaving only matching identifiers in their full-length double-stranded form (as shown in
In some implementations, key matching is performed by attaching affinity groups to the extracted key layers, where key layers are complementary to matching key layers in the target library, such that molecules of the target library hybridize to the key molecules and can be indirectly captured using affinity methods targeting those key molecules.
The full workflow shown in
The graph programming model described in the foregoing (see
Provided below are example implementations of the various aspects described in the foregoing.
Partitioning Information Stored in Nucleic Acids
As discussed in the foregoing, information stored in nucleic acids may be partitioned to provide for easier access and reading of said information.
As described previously, mapping (in step 2108) may be performed using a codebook that maps words (portions of digital information to be encoded, such as the string of symbols) to codewords (groups of identifier nucleic acid sequences). Each block comprises one or more codewords. In some implementations, there is a fixed number of codewords per block. For example, a range of identifiers encoding the gth instance of a block can be calculated as the range encoding codeword instances (g−1)*c+1 through g*c.
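The block-to-codeword range calculation can be written as a one-line helper. The sketch assumes that c denotes the fixed number of codewords per block and that both blocks and codeword instances are numbered from 1; the function name is hypothetical.

```python
def block_codeword_range(g, c):
    """Return the inclusive, 1-based range of codeword instances in block g.

    g: 1-based block instance; c: fixed number of codewords per block.
    """
    if g < 1 or c < 1:
        raise ValueError("g and c must be positive integers")
    return (g - 1) * c + 1, g * c


print(block_codeword_range(3, 4))  # third block, four codewords per block
```

The identifiers encoding a block are then those that encode the codeword instances in the returned range.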
In some implementations, a block can be accessed with an access program, such as those described in the foregoing. In some implementations, a block is associated with a location, said location comprising information for accessing said block. In some implementations, a block contains information about a location of another block.
In some implementations, a block represents a node in a graph. The graph may be a tree, such as a suffix tree or a B-tree. In some implementations, a block is an element of an inverted index. In some implementations, multiple blocks form a linked list, wherein a first block points to a second block to accommodate a large string of symbols that is mapped to at least the first block and the second block.
In some implementations, the identifier sequences of each block share the same component sequences in a set of key layers. Said key layers may store a data object, such as a native key (as described in the foregoing). The native key comprises one or more symbols and may be configured such that each symbol corresponds to a component sequence of a corresponding key layer. The method may further comprise performing a range query over several keys sharing common symbol values at a set of positions by accessing all identifier molecules containing their corresponding component sequences. A query for all keys that satisfy a finite automaton or regular expression may be performed by an access program.
The method may further comprise storing a value in the identifier sequences of the block associated with each key. Said value may be retrieved by accessing the corresponding key. The key may be a function of the value. The key may be a hash (e.g., a hashed value resulting from applying a hash function to the value), a bloom filter, a structural array, a classifier, a signature, a record, or a fingerprint derived from the corresponding value. An access program may be designed to access all keys that are similar to, or share symbol values in common with, a query key. Said query key may be derived from a reference value.
A count of a number of keys (e.g., across all the blocks or identifier molecules) may be determined by measuring DNA concentration. For example, DNA concentration may be measured by fluorescence (e.g., qPCR, plate reader assay, Qubit fluorimeter) or by absorbance (e.g., spectrophotometry assay). The measured DNA concentration may be normalized by a standard to determine the number of identifier sequences in a sample. A relative number of keys containing a particular symbol may be determined by performing qPCR with a probe for a key layer component corresponding to the particular symbol.
In some implementations, the method further comprises creating an identifier sequence for a reference key. Said reference key may be used for performing a query. The identifier sequence may be used as a hybridization probe in a hybridization reaction to search for or extract all keys that are similar or the same. Conditions, such as pH or temperature, of the hybridization reaction may be used to control the stringency of the search or extraction. For example, adjusting temperature allows for control of how similar the keys must be to the reference key in order to be accepted.
Operating on Information Stored in Nucleic Acids
As discussed in the foregoing, information stored in nucleic acids may be used for computation by applying if-then-else (ITE) gates to select for certain values, symbols, or sequences.
In some implementations, each identifier molecule in the first pool comprises a distinct component sequence from each of M layers. Each layer comprises a set of component sequences.
In some implementations, an identifier molecule represents a data object. A component molecule of the identifier molecule may represent an operand of the data object. In some implementations, the if-then-else operation comprises using a probe pool of identifier molecules having a specific component molecule as probes, for example, to select for identifier molecules in the first pool having the same specific component molecule. For example, the identifier molecules of the probe pool are single-stranded and at least partially hybridize to single-stranded identifier molecules in the first pool which contain the same specific component molecule. In some implementations, the probes are PCR primers, and the if-then-else operation is executed with PCR. In some implementations, the probes are affinity tagged oligonucleotides, and the if-then-else operation is executed with an affinity pull down assay.
In some implementations, two or more if-then-else operations are performed on one or more pools of identifier molecules in parallel. In some implementations, method 2200 further comprises splitting at least one of the first pool, the intermediate pool, or the final pool into at least two duplicate pools. The at least one of the first pool, the intermediate pool, or the final pool may be replicated (e.g., by PCR) prior to splitting. Method 2200 may further comprise combining at least two intermediate pools of identifier molecules to form a new intermediate pool of identifier molecules or a second pool of identifier molecules.
In some implementations, the if-then-else operation (or repeated application thereof via step 2206) represents execution of a graph program, the output of which places an identifier molecule, representing a data object, into the final pool representing the at least a portion of the output string (e.g., the outputted identifier represents a bit of the output string). Said graph program may represent a function on said data object. The at least a portion of the output string may be an output of the function on the data object. In some implementations, the final pool into which an identifier molecule is placed, according to the graph program, determines the output of the function on the corresponding data object encoded by the identifier molecule.
Storing Numerical Data in Nucleic Acids
As discussed in the foregoing, numerical data may be stored in identifier molecules by generating an expected copy count.
Method 2300 may further comprise inputting the sample to an operation to produce an output sample. For example, the operation may involve multiplying the number by a power of 2 by performing a polymerase chain reaction (PCR) with primers that bind to common regions on an edge of the identifier nucleic acid sequence to form the output sample containing a PCR product. The power of 2 may correspond to the number of PCR cycles performed. The operation may involve adding the number as a first number to a second number in a second input sample by a mixing operation that combines the sample and the second input sample to form the output sample. The output sample may be input to a second operation.
In some implementations, the number is a first element of a vector. Method 2300 may further comprise determining a second expected copy count of the identifier nucleic acid sequence based on a second element of the vector and the proportionality constant. Method 2300 may then further comprise generating a second sample containing a second actual number of identifier nucleic acid molecules each having the identifier nucleic acid sequence, wherein the second actual number approximates the second expected copy count.
Method 2300 may further comprise performing a linear function on the vector by at least one of PCR, aliquoting, and mixing. The linear function may be used to convert a binary vector to a unary value in an output sample. The linear function may be a scoring function. For example, the scoring function may compute a higher output value for target vectors than for non-target vectors, such that copy counts for identifier sequences corresponding to target vectors are enriched in the output sample. Identifier sequences corresponding to target vectors may be determined by sequencing the output sample. The ratio of copy counts between two identifier sequences in the output sample may be increased by using a double-stranded DNA selection operation to form a new output sample where identifier sequences corresponding to target vectors are even more enriched in the output sample. The operation (or repeated application thereof) may correspond to an activation function in a neural network or a quadratic function. In some implementations, the output sample is allowed to reach equilibrium prior to the double stranded DNA selection operation. Also prior to the operation, the temperature may be changed, or cofactors may be added to the output sample. The operation may be performed by at least one of chromatography, gel electrophoresis, mass spectrometry, flow cytometry, fluorescence-activated sorting, membrane capture, silica column capture, silica bead capture, or affinity capture.
In some implementations, the vector is a compressed representation of a larger data object (larger than the vector). The compressed representation may be a hash, a bloom filter, a signature, a structural array, or a fingerprint of the larger data object. The larger data object may be retrieved using the corresponding identifier nucleic acid sequence as a key.
Preparing Stripe Libraries Encoding a Dataset
As discussed in the foregoing, a dataset may be encoded by division into parts and encoding of its parts across stripe libraries.
In some implementations, each nucleic acid molecule comprises L components, each component selected from C possible components of a distinct layer of L layers. M of the L components encode the key, and N of the L components encode the operand, such that the sum of M and N is less than or equal to L. The byte-value of the part encoded by each nucleic acid molecule may be stored in the operand of the nucleic acid molecule. In some implementations, each part contains no more than K byte-values, and each data object comprises at most T byte-values, such that each data object is divided into at most P parts, where P=⌈T/K⌉. K may be any value less than L*⌊log2 C⌋/8.
In some implementations, the dataset is encoded using W nucleic acid libraries, each library containing nucleic acid molecules having an identifier rank of 1, . . . , R. The dataset comprises D data objects, such that R is greater than or equal to D. A rule may be that the jth part of the P parts of the rth data object of the D data objects is encoded in the operand of a nucleic acid molecule of rank in the interval [C^N·r, C^N·(r+1)−1] in the jth library.
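The partitioning arithmetic can be sketched as follows. The code assumes that the number of parts P is the ceiling of T/K and that the rank interval for the rth data object runs from C^N·r through C^N·(r+1)−1, where N is the number of operand components and C is the number of possible components per layer; both readings are interpretations of the notation above, and the function names are hypothetical.

```python
def num_parts(T, K):
    """Number of parts needed for a data object of at most T byte-values."""
    return -(-T // K)  # integer ceiling of T / K


def split_parts(data, K):
    """Divide a byte string into consecutive parts of at most K byte-values."""
    return [data[i:i + K] for i in range(0, len(data), K)]


def rank_interval(r, C, N):
    """Inclusive rank interval holding the operand for the r-th data object."""
    width = C ** N  # number of distinct operands expressible in N components
    return width * r, width * (r + 1) - 1


data = b"ABCDEFG"
parts = split_parts(data, 3)   # part j would be encoded in the j-th library
print(num_parts(len(data), 3), parts, rank_interval(2, 4, 3))
```

Each part produced by `split_parts` would be written to its own stripe library, at a rank drawn from the interval that `rank_interval` returns for the data object in question.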
Also provided herein is a method of retrieving a subset of a data object of a dataset encoded in a plurality of nucleic acid libraries according to method 2400. This method comprises providing a target nucleic acid library and a query nucleic acid library, each comprising nucleic acid molecules each comprising a key and an operand. The method then involves extracting keys from nucleic acid molecules in the query nucleic acid library. The method then involves matching nucleic acid molecules in the target nucleic acid library having keys that match the keys extracted from the query nucleic acid library. Finally, the method involves selecting and outputting the matched nucleic acid molecules.
The extraction step may comprise converting each nucleic acid molecule in the query library to a single-stranded molecule, and matching may comprise hybridizing nucleic acid molecules in the target library to complementary single-stranded keys. Selecting may comprise applying an enzyme that selectively degrades single-stranded nucleic acids after the hybridization. For example, the enzyme may be P1.
The extraction step may comprise digesting each nucleic acid molecule in the query library using a sequence-specific enzyme that recognizes the specific sequence found in each nucleic acid molecule. This extraction may further comprise size-selecting keys present in double-stranded form.
The extraction step may comprise introducing a nick between the key and the operand of each molecule using a sequence-specific nicking enzyme, incorporating a labeled nucleotide at the nick, and capturing the labeled nucleotides to retain keys in a single-stranded form. For example, the labeled nucleotide has a biotin label, and capturing involves affinity capture with streptavidin-coated beads.
The extraction step may comprise selectively amplifying via PCR and purifying the keys from the query library, the keys being flanked by universal sequences that serve as primer binding sites for the PCR.
Matching may involve converting the target library and extracted keys to single-stranded forms and hybridizing single-stranded extracted keys to complementary keys in the target library. Selecting may involve gel electrophoresis.
In some implementations, the query library encodes a first set of parts of the dataset, and the target library encodes a second set of parts of the dataset (e.g., according to method 2400 of
The foregoing is merely illustrative of the principles of the disclosure, and the apparatuses can be practiced by other than the described embodiments, which are presented for purposes of illustration and not of limitation. It is to be understood that the apparatuses disclosed herein, while shown for use in nucleic acid-based data storage, may be applied to applications involving data archival and storage or chemical data science.
Variations and modifications will occur to those of skill in the art after reviewing this disclosure. The disclosed features may be implemented, in any combination and subcombination (including multiple dependent combinations and subcombinations), with one or more other features described herein. The various features described or illustrated above, including any components thereof, may be combined or integrated in other systems. Moreover, certain features may be omitted or not implemented.
The systems and methods described may be implemented locally on a printer-finisher system, such as that described in U.S. application Ser. No. 16/414,752 entitled “PRINTER-FINISHER SYSTEM FOR DATA STORAGE IN DNA”, filed May 16, 2019 and published as U.S. Publication No. 2019/0351673, which is hereby incorporated by reference in its entirety. The printer-finisher system may include a data processing apparatus. The systems and methods described herein may be implemented remotely on a separate data processing apparatus. The separate data processing apparatus may be connected directly or indirectly to the printer-finisher system through cloud applications. The printer-finisher system may communicate with the separate data processing apparatus in real-time (or near real-time).
In general, embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices.
Examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the scope of the information disclosed herein. All references cited herein are incorporated by reference in their entirety and made part of this application.
Claims
1-47. (canceled)
48. A method for storing numerical data in nucleic acids, the method comprising:
- determining an expected copy count of an identifier nucleic acid sequence based on the numerical data and a proportionality constant; and
- generating a sample containing an actual number of identifier nucleic acid molecules each having the identifier nucleic acid sequence, wherein the actual number approximates the expected copy count.
49. The method of claim 48, wherein the numerical data is a number and the expected copy count is proportional to the number.
50. The method of claim 49, further comprising inputting the sample to an operation to produce an output sample.
51. The method of claim 50, wherein the operation comprises multiplying the number by a power of 2 by performing a polymerase chain reaction (PCR) with primers that bind to common regions on an edge of the identifier nucleic acid sequence to form the output sample containing a PCR product.
52. The method of claim 51, wherein the power of 2 corresponds to a number of PCR cycles.
53. The method of claim 50, wherein the operation comprises multiplying the number by a fraction by performing an aliquot that isolates a fractional volume of the sample to form the output sample.
54. The method of claim 50, wherein the operation comprises adding the number as a first number to a second number in a second input sample by a mixing operation that combines the sample and the second input sample to form the output sample.
55. The method of claim 50, further comprising inputting the output sample to a second operation.
56. The method of claim 49, wherein the number is a first element of a vector.
57. The method of claim 56, further comprising:
- determining a second expected copy count of the identifier nucleic acid sequence based on a second element of the vector and the proportionality constant; and
- generating a second sample containing a second actual number of identifier nucleic acid molecules each having the identifier nucleic acid sequence, wherein the second actual number approximates the second expected copy count.
58. The method of claim 57, further comprising performing a linear function on the vector by at least one of PCR, aliquoting, and mixing.
59. The method of claim 58, further comprising converting a binary vector to a unary value in an output sample by performing the linear function.
60. The method of claim 58, wherein the linear function is a scoring function.
61. The method of claim 60, wherein the scoring function computes a higher output value for target vectors than for non-target vectors, such that copy counts for identifier sequences corresponding to target vectors are enriched in the output sample.
62. The method of claim 61, wherein identifier sequences corresponding to target vectors are determined by sequencing the output sample.
63. The method of claim 61, wherein a ratio of copy counts between two identifier sequences in the output sample is increased by using a double-stranded DNA selection operation to form a new output sample where identifier sequences corresponding to target vectors are even more enriched in the output sample.
64. The method of claim 63, wherein the operation, or a repeated application thereof, corresponds to an activation function in a neural network.
65. The method of claim 63, wherein the operation corresponds to a quadratic function.
66. The method of claim 63, further comprising letting the output sample go to equilibrium prior to the double-stranded DNA selection operation.
67. The method of claim 66, further comprising changing the temperature or adding cofactors to the output sample prior to the double-stranded DNA selection operation.
68. The method of claim 63, wherein the double-stranded DNA selection operation is at least one of chromatography, gel electrophoresis, mass spectrometry, flow cytometry, fluorescence-activated sorting, membrane capture, silica column capture, silica bead capture, or affinity capture.
69. The method of claim 58, wherein the vector is a compressed representation of a larger data object, the compressed representation being a hash, a bloom filter, a signature, a structural array, or a fingerprint of the larger data object.
70. The method of claim 69, wherein the larger data object is retrieved using the corresponding identifier nucleic acid sequence as a key.
71. The method of claim 48, wherein at least a portion of each identifier nucleic acid molecule is configured to bind to one or more probes.
72-92. (canceled)
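By way of illustration only, the arithmetic recited in claims 48 to 55 can be modeled in software as operations on expected copy counts. The following Python sketch is a minimal idealized model and not the claimed chemical process itself: it assumes every PCR cycle doubles each copy count exactly, that an aliquot scales every count by the stated fraction, and that mixing sums counts per identifier. The names `encode`, `pcr`, `aliquot`, `mix`, and `decode`, the identifier label `"ID_A"`, and the proportionality constant of 1000 are hypothetical and chosen only for this example.

```python
from typing import Dict

# A sample is modeled as a map from identifier sequence to expected copy count.
Sample = Dict[str, float]


def encode(identifier: str, number: float, k: float) -> Sample:
    """Expected copy count = proportionality constant k times the number."""
    return {identifier: k * number}


def pcr(sample: Sample, cycles: int) -> Sample:
    """Idealized PCR (assumption): each cycle doubles every copy count."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return {ident: count * 2 ** cycles for ident, count in sample.items()}


def aliquot(sample: Sample, fraction: float) -> Sample:
    """Isolate a fractional volume: every copy count scales by the fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return {ident: count * fraction for ident, count in sample.items()}


def mix(a: Sample, b: Sample) -> Sample:
    """Combine two samples: copy counts of shared identifiers add."""
    out = dict(a)
    for ident, count in b.items():
        out[ident] = out.get(ident, 0.0) + count
    return out


def decode(sample: Sample, identifier: str, k: float) -> float:
    """Recover the number from an expected copy count."""
    return sample.get(identifier, 0.0) / k


# Example: compute 4 * 3 + 0.5 * 5 with two PCR cycles, one aliquot, one mix.
K = 1000.0  # hypothetical proportionality constant
first = encode("ID_A", 3, K)
second = encode("ID_A", 5, K)
output = mix(pcr(first, 2), aliquot(second, 0.5))
result = decode(output, "ID_A", K)  # 14.5 under the idealized model
```

In a physical sample the actual copy number only approximates the expected copy count, so a decoded value would carry sampling and amplification error that this model does not capture.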
Type: Application
Filed: Mar 14, 2022
Publication Date: Feb 2, 2023
Inventors: Nathaniel Roquet (Charlestown, MA), Swapnil P. Bhatia (Charlestown, MA), Michael Norsworthy (Charlestown, MA), Sarah Flickinger (Charlestown, MA), Tracy Kambara (Charlestown, MA)
Application Number: 17/693,705