SYNTHETIC GENOME
The current invention provides a synthetic prokaryotic genome comprising 5 or fewer occurrences of one or more sense codons; and/or a synthetic prokaryotic genome derived from a parent genome, wherein the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of one or more sense codons, relative to the parent genome; and/or a synthetic prokaryotic genome comprising 100 or more, 200 or more, or 1000 or more genes with no occurrences of one or more sense codons.
The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on May 15, 2023, is named “51689-008003_Sequence_Listing.xml” and is 8,555,414 bytes in size.
FIELD OF THE INVENTION
The present invention relates to synthetic genomes and methods of their production.
BACKGROUND TO THE INVENTION
The design and synthesis of genomes provide a powerful approach for understanding and engineering biology. Genome synthesis has the potential to accelerate metabolic engineering. In particular, genome synthesis has the potential to elucidate synonymous codon function and to facilitate genetically encoded unnatural polymer synthesis (Wang, K., et al., 2016. Nature, 539(7627), 59-64).
The standard genetic code encodes the 20 canonical amino acids using 61 sense codons, and eighteen of the twenty amino acids are encoded by more than one synonymous codon. Nature chooses one sense codon, from up to six synonyms, to encode each amino acid at each position in a gene. Synonymous codon choice can influence mRNA folding, transcriptional and translational regulatory sequences, translation rate, co-translational folding, protein levels, and has emerging and yet to be understood roles (Wang, K., et al., 2016. Nature, 539(7627), 59-64; and Cambray, G., et al., 2018. Nature biotechnology, 36(10), 1005-1015).
Genome-wide replacement of a target codon with synonymous codons (synonymous codon compression) may provide a foundation for reassigning sense codons to non-canonical amino acids (or other monomers) to facilitate the in vivo biosynthesis of genetically encoded non-canonical biopolymers (Chin, J. W., 2017. Nature, 550(7674), 53-60).
Site-directed mutagenesis approaches have been used to replace up to 321 amber stop codons in the E. coli genome (Mukai, T., et al., 2015. Scientific reports, 5, p. 9699). However, sense codons are commonly orders of magnitude more abundant than stop codons, and genome synthesis, rather than mutagenesis, may be the preferred route to tackling sense codon removal in many cases.
Genome synthesis has enabled the creation of Mycoplasma with synthetic genomes (Gibson, D. G., et al., 2010. Science, 329(5987), 52-56) and the creation of nine strains of S. cerevisiae in which the DNA for one or two of the sixteen chromosomes is replaced by synthetic DNA (Zhang, W., et al., 2017. Science, 355(6329), eaaf3981; and Richardson, S. M., et al., 2017. Science, 355(6329), 1040-1044). These experiments have replaced up to 1 Mb of DNA (0.99 Mb, yeast; 1.08 Mb, Mycoplasma) in individual strains. Replicon excision for enhanced genome engineering through programmed recombination (REXER) has been reported for replacing >100 kb of the E. coli genome with synthetic DNA in a single step. Moreover, it has been shown that REXER can be iterated via genome stepwise interchange synthesis (GENESIS) to replace 220 kb of the E. coli genome with 230 kb of synthetic DNA (Wang, K., et al., 2016. Nature, 539(7627), 59-64; WO 2018/020248).
Genome synthesis has been used to alter synonymous codons in individual genes (Napolitano, M. G., et al., 2016. PNAS, 113(38), E5588-E5597), genomic regions and essential operons (Wang, K., et al., 2016. Nature, 539(7627), 59-64; and Lau, Y. H., et al. 2017. Nucleic acids research, 45(11), 6971-6980). For instance, Wang et al. used defined ‘recoding schemes’ to replace a 20 kb region of the E. coli genome rich in both essential genes and target codons.
However, these studies have mutated only a small fraction (up to 4.7%) of targeted sense codons in the genome of a single strain. Consequently, it is not known whether the application of these methods to genome-wide synonymous codon compression will be able to produce viable genomes. For instance, it is not known whether the defined recoding schemes tested in Wang et al. can be applied genome-wide to create an organism in which a reduced number of sense codons are used to encode the 20 canonical amino acids.
Thus, there is a demand for synthetic genomes, wherein one or more sense codon has been removed. There is also a demand for improved methods to produce synthetic genomes.
SUMMARY OF THE INVENTION
The inventors have surprisingly found that a viable synthetic prokaryotic genome may be produced, wherein one or more sense codon has been removed. In particular, they produced a viable synthetic genome in which the number of codons used to encode cellular protein is reduced from 64 to 61, by genome-wide recoding of two sense codons and one stop codon. They also produced an E. coli host cell comprising said synthetic genome.
The inventors have also surprisingly found that defined recoding and refactoring schemes can enable genome-wide synonymous codon compression for more than 99.9% of target codons. They found that alternative recoding and refactoring at non-tolerated positions enabled genome-wide synonymous codon compression.
The inventors have also surprisingly found that recombination-mediated genetic engineering (e.g. REXER and/or GENESIS) may be combined with directed conjugation to efficiently produce synthetic genomes. In particular, they found, for example, that at least about 4 Mb of DNA can be efficiently replaced by said method and that said method allows failures in the design of synthetic DNA (non-tolerated positions) to be identified at codon-level resolution.
Accordingly, in one aspect the present invention provides a synthetic prokaryotic genome comprising 5 or fewer occurrences of one or more sense codons. In some embodiments the synthetic prokaryotic genome comprises 4 or fewer, 3 or fewer, 2 or fewer, 1 or fewer, or no occurrences of one or more sense codons. In some embodiments the one or more sense codons consist of one sense codon or two sense codons, preferably two sense codons. In some embodiments the synthetic prokaryotic genome comprises no occurrences of two or more sense codons, preferably two sense codons, and no occurrences of one stop codon, preferably the amber stop codon (TAG).
The synthetic prokaryotic genome may be a synthetic bacterial genome, preferably a synthetic Escherichia coli, Salmonella enterica, or Shigella dysenteriae genome. In some embodiments the synthetic prokaryotic genome is 100 kb to 10 Mb, or 1 Mb to 10 Mb, or 2 Mb to 6 Mb in size. The synthetic prokaryotic genome may be viable. In some embodiments the synthetic prokaryotic genome comprises 100 or more, 200 or more, or 1000 or more genes, optionally wherein the genes have no occurrences of the one or more sense codons, preferably wherein the genes are essential genes.
In some embodiments the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA, preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA, more preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, TTG, TTA, GCG and GCA, most preferably the one or more sense codons are TCG and/or TCA.
In some embodiments the synthetic prokaryotic genome comprises 10 or fewer, 5 or fewer, or no occurrences of the amber stop codon (TAG).
In a further aspect the present invention provides a synthetic prokaryotic genome comprising 100 or more, 200 or more, or 1000 or more genes, wherein the genes collectively comprise 5 or fewer occurrences of one or more sense codons, preferably wherein the genes are essential genes. In some embodiments the genes collectively comprise 4 or fewer, 3 or fewer, 2 or fewer, 1 or fewer, or no occurrences of one or more sense codons. In some embodiments the one or more sense codons consist of one sense codon or two sense codons, preferably two sense codons.
The synthetic prokaryotic genome may be a synthetic bacterial genome, preferably a synthetic Escherichia coli, Salmonella enterica, or Shigella dysenteriae genome. In some embodiments the synthetic prokaryotic genome is 100 kb to 10 Mb, or 1 Mb to 10 Mb, or 2 Mb to 6 Mb in size. The synthetic prokaryotic genome may be viable.
In some embodiments the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA, preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA, more preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, TTG, TTA, GCG and GCA, most preferably the one or more sense codons are TCG and/or TCA.
In some embodiments the synthetic prokaryotic genome comprises 10 or fewer, 5 or fewer, or no occurrences of the amber stop codon (TAG).
In a further aspect the present invention provides a synthetic prokaryotic genome derived from a parent prokaryotic genome, wherein the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of one or more sense codons, relative to the parent prokaryotic genome, or wherein the synthetic prokaryotic genome comprises no occurrences of one or more sense codons. In some embodiments the one or more sense codons consist of one sense codon or two sense codons, preferably two sense codons.
The synthetic prokaryotic genome may be a bacterial genome, preferably an Escherichia coli, Salmonella enterica, or Shigella dysenteriae genome. In some embodiments the synthetic prokaryotic genome is 100 kb to 10 Mb, or 1 Mb to 10 Mb, or 2 Mb to 6 Mb in size. The synthetic prokaryotic genome may be viable.
In some embodiments the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA, preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA, more preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, TTG, TTA, GCG and GCA, most preferably the one or more sense codons are TCG and/or TCA, optionally wherein TCG and/or TCA are replaced with synonymous sense codons.
Preferably 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of the one or more sense codons in the parent prokaryotic genome are replaced with synonymous sense codons. In some embodiments 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of TCG and/or TCA in the parent prokaryotic genome are replaced with AGC and/or AGT, most preferably 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of TCG in the parent prokaryotic genome are replaced with AGC and/or 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of TCA in the parent prokaryotic genome are replaced with AGT.
In some embodiments the synthetic prokaryotic genome comprises 10 or fewer, 5 or fewer, or no occurrences of the amber stop codon (TAG), preferably wherein 90% or more, 95% or more, 98% or more, 99% or more, or all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA.
In some embodiments 99.9% or more, or 100% of the occurrences of two or more sense codons, preferably two sense codons, in the parent prokaryotic genome are replaced with synonymous sense codons, and all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA.
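By way of illustration only, the in-frame replacement described above may be sketched as follows. The sketch assumes the TCG to AGC, TCA to AGT, and TAG to TAA scheme named in the text; the function name `recode_orf` and the example sequence are hypothetical and are not taken from the disclosure. It handles a single open reading frame and does not address overlapping genes or regulatory sequences.

```python
# Recoding scheme named in the text: TCG -> AGC, TCA -> AGT, TAG -> TAA.
RECODING_SCHEME = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}


def recode_orf(orf, scheme=RECODING_SCHEME):
    """Replace target codons in-frame within one ORF (hypothetical helper)."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Split into in-frame codons so out-of-frame matches are left untouched.
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    return "".join(scheme.get(codon, codon) for codon in codons)


# Hypothetical ORF: ATG TCG GCA TCA TAG
recoded = recode_orf("ATGTCGGCATCATAG")
# "TCG" appears here only out of frame (A-TCG-AA), so nothing is changed.
unchanged = recode_orf("ATGATCGAA")
```

Splitting on codon boundaries rather than using a plain string substitution is the relevant design choice: only in-frame occurrences of a target codon are replaced.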
One or more pairs of genes which share an overlapping region comprising the one or more sense codons in the parent prokaryotic genome may be refactored, preferably wherein the one or more pairs of genes are those in which replacement of one or more of the sense codons with synonymous sense codons would change the encoded protein sequence of both or either of the pair of genes.
In some embodiments for pairs of genes in opposite orientations, a synthetic insert is inserted between the genes, wherein the synthetic insert comprises the overlapping region; and/or for pairs of genes in the same orientation, a synthetic insert is inserted between the genes, wherein the synthetic insert comprises: (i) a stop codon; (ii) about 20-200 bp from upstream of the overlapping region; and (iii) the overlapping region.
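As a first step towards refactoring, candidate pairs of overlapping genes and their relative orientation may be identified from annotation coordinates. The following is a minimal sketch under stated assumptions: the `Gene` record, the 0-based half-open coordinates, and the gene names are hypothetical. It only lists overlapping pairs; it does not decide whether a synonymous replacement would change the encoded protein of either gene, and it does not build the synthetic insert.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def overlapping_pairs(genes):
    """Return (gene_a, gene_b, (overlap_start, overlap_end), orientation)."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so no later gene can overlap a
            overlap = (b.start, min(a.end, b.end))
            orientation = "same" if a.strand == b.strand else "opposite"
            pairs.append((a.name, b.name, overlap, orientation))
    return pairs


# Hypothetical annotation for illustration only.
genes = [
    Gene("geneA", 0, 300, "+"),
    Gene("geneB", 296, 600, "+"),
    Gene("geneC", 590, 900, "-"),
    Gene("geneD", 1000, 1200, "+"),
]
pairs = overlapping_pairs(genes)
```

Pairs reported as "same" or "opposite" would then be candidates for the respective refactoring approaches described above, subject to checking whether the overlap contains a target codon.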
In a further aspect the present invention provides a polynucleotide comprising twenty or more, thirty or more, forty or more, fifty or more, 100 or more essential genes with no occurrences of one or more sense codons. In some embodiments the one or more sense codons consist of one sense codon or two sense codons, preferably two sense codons.
In some embodiments the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA, preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA, more preferably the one or more sense codons are selected from TCG, TCA, AGT, AGC, TTG, TTA, GCG and GCA, most preferably the one or more sense codons are TCG and/or TCA.
The occurrences of the one or more sense codons in the genes may be replaced with synonymous sense codons, preferably TCG codons are replaced with AGC and/or TCA codons are replaced with AGT.
The essential genes may comprise essential genes selected from one or more of the list consisting of: ribF, lspA, ispH, dapB, folA, imp, yabQ, ftsL, ftsI, murE, murF, mraY, murD, ftsW, murG, murC, ftsQ, ftsA, ftsZ, lpxC, secM, secA, can, folK, hemL, yadR, dapD, map, rpsB, tsf, pyrH, frr, dxr, ispU, cdsA, yaeL, yaeT, lpxD, fabZ, lpxA, lpxB, dnaE, accA, tilS, proS, yafF, hemB, secD, secF, ribD, ribE, thiL, dxs, ispA, dnaX, adk, hemH, lpxH, cysS, folD, entD, mrdB, mrdA, nadD, holA, rlpB, leuS, lnt, ginS, fldA, cydA, infA, cydC, ftsK, lolA, serS, rpsA, msbA, lpxK, kdsB, mukF, mukE, mukB, asnS, fabA, mviN, me, fabD, fabG, acpP, tmk, holB, lolC, lolD, lolE, purB, minE, minD, pth, prsA, ispE, lolB, hemA, prfA, prmC, kdsA, topA, ribA, fabI, tyrS, ribC, ydiL, pheT, pheS, rplT, infC, thrS, nadE, gapA, yeaZ, aspS, argS, pgsA, yefM, metG, folE, yejM, gyrA, nrdA, nrdB, folC, accD, fabB, gltX, ligA, zipA, dapE, dapA, der, hisS, ispG, suhB, tadA, acpS, era, rnc, lepB, rpoE, pssA, yfiO, rplS, trmD, rpsP, ffh, grpE, csrA, ispF, ispD, ftsB, eno, pyrG, chpR, lgt, fbaA, pgk, yqgD, metK, yqgF, plsC, ygiT, parE, ribB, cca, ygjD, tdcF, yraL, yhbV, infB, nusA, ftsH, obgE, rpmA, rplU, ispB, murA, yrbB, yrbK, yhbN, rpsI, rplM, degS, mreD, mreC, mreB, accB, accC, yrdC, def, fmt, rplQ, rpoA, rpsD, rpsK, rpsM, secY, rplO, rpmD, rpsE, rplR, rplF, rpsH, rpsN, rplE, rplX, rplN, rpsQ, rpmC, rplP, rpsC, rplV, rpsS, rplB, rplW, rplD, rplC, rpsJ, fusA, rpsG, rpsL, trpS, yrfF, asd, rpoH, ftsX, ftsE, ftsY, yhhQ, bcsB, glyQ, gpsA, rfaK, kdtA, coaD, rpmB, dfp, dut, gmk, spoT, gyrB, dnaN, dnaA, rpmH, rnpA, yidC, tnaB, glmS, glmU, wzyE, hemD, hemC, yigP, ubiB, ubiD, hemG, yihA, ftsN, murI, murB, birA, secE, nusG, rplJ, rplL, rpoB, rpoC, ubiA, plsB, lexA, dnaB, ssb, alsK, groS, psd, orn, yjeE, rpsR, chpS, ppa, valS, yjgP, yjgQ, and dnaC.
In a further aspect the present invention provides a prokaryotic host cell comprising a synthetic prokaryotic genome according to the present invention or a polynucleotide according to the present invention.
The prokaryotic host cell may be viable. The prokaryotic host cell may be a bacterial cell, preferably an Escherichia coli, Salmonella enterica, or Shigella dysenteriae cell. Preferably the host cell is suitable for use in production of polypeptides comprising one or more non-proteinogenic amino acids, preferably two or more non-proteinogenic amino acids, most preferably three or more non-proteinogenic amino acids.
In a further aspect the present invention provides use of a prokaryotic host cell according to the present invention for producing polypeptides comprising one or more non-proteinogenic amino acids, preferably two or more non-proteinogenic amino acids, most preferably three or more non-proteinogenic amino acids.
In a further aspect the present invention provides a method for producing a synthetic genome comprising:
- (a) providing a parent genome;
- (b) carrying out one or more rounds of recombination-mediated genetic engineering on the parent genome, to produce two or more different partially synthetic genomes; and
- (c) carrying out one or more rounds of directed conjugation with the two or more different partially synthetic genomes to produce a synthetic genome;
wherein the partially synthetic genomes each comprise a synthetic region that has 50 or fewer, 20 or fewer, 10 or fewer, 5 or fewer, or 0 occurrences of each of one or more sense codons; or wherein the partially synthetic genomes each comprise a synthetic region that has less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of each of one or more sense codons, relative to the corresponding region in the parent genome.
The synthetic regions may collectively cover 90% or greater, 95% or greater, 99% or greater or 100% of the parent genome. In some embodiments the synthetic regions are 10-1000 kb, 50-1000 kb, 100-1000 kb, or 100-500 kb in size.
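The collective coverage of the parent genome by the synthetic regions may be estimated by merging overlapping regions and summing their lengths. The sketch below is illustrative only and assumes half-open coordinates on a linearised genome, with no region wrapping around the origin; the function name `coverage_fraction` and the coordinates are hypothetical.

```python
def coverage_fraction(regions, genome_length):
    """Fraction of the genome covered by the union of (start, end) regions."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(regions):
        if current_end is None or start > current_end:
            # Close the previous merged region and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent region: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Hypothetical regions of roughly 100 kb with a 10 kb overlap between two.
fraction = coverage_fraction(
    [(0, 100_000), (90_000, 200_000), (300_000, 400_000)],
    genome_length=400_000,
)
```

Merging before summing avoids counting overlapping homology regions twice.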
The method may further comprise testing the viability of the partially synthetic genomes after each round of recombination-mediated genetic engineering and/or after each round of directed conjugation.
The two or more different partially synthetic genomes may comprise at least one partially synthetic donor genome and at least one partially synthetic recipient genome. In some embodiments the at least one partially synthetic donor genome comprises a synthetic region and a first selectable marker flanked by two homology regions immediately downstream of an origin of transfer; and the at least one partially synthetic recipient genomes comprise a second selectable marker flanked by two corresponding homology regions, optionally wherein the first selectable marker comprises a positive selectable marker, and/or the second selectable marker comprises a negative selectable marker. In some embodiments the synthetic region present in the at least one partially synthetic recipient genomes is outside the region flanked by the homology regions. In some embodiments the method further comprises one or more rounds of selection for the selectable markers.
The one or more rounds of recombination-mediated genetic engineering may comprise one or more rounds of replicon excision for enhanced genome engineering through programmed recombination (REXER).
The synthetic genome may be a synthetic prokaryotic genome according to the present invention.
In a further aspect the present invention provides a synthetic prokaryotic genome produced by the method of the present invention.
REXER (replicon excision for enhanced genome engineering through programmed recombination) utilizes CRISPR/Cas9 and lambda-red mediated recombination to replace genomic DNA with synthetic DNA provided from an episome (BAC). This enables large regions of the genome (>100 kb) to be replaced by synthetic DNA (Wang, K., et al., 2016. Nature, 539(7627), 59-64; WO 2018/020248). The black triangles denote the location of CRISPR protospacers, which are cleaved by Cas9 to liberate the synthetic DNA (pink) cassette from the BAC flanked by homology regions (HRs). Homology regions 1 and 2 (HR1, HR2) program the location of recombination into the E. coli genome. Selection cassette −1/+1 ensures the integration of the synthetic DNA, while selection cassette −2/+2 on the genome ensures the removal of the corresponding wt DNA. In the example shown in the figure, +1 is KanR, −1 is rpsL, +2 is CmR, −2 is sacB.
Iterative cycles of REXER (see
Synthetic genomic sections from multiple, individual partially-recoded genomes were assembled into a single, fully-recoded genome via conjugation (Ma, N. J., et al., 2014. Nat Protoc 9, 2285-2300). The donor (d) and recipient (r) strains harbour unique recoded genomic sections; recoded overlapping homology regions (3 kb to 400 kb) were utilized to seamlessly recombine the strains. Small homology regions ranging from 3-5 kb are denoted with an asterisk (*). Conjugations for which we used greater than 5 kb homology (HR) are indicated with text. For assembly, the recoded genomic content from the donor was conjugated in a clockwise manner to replace the corresponding wt genomic section in the recipient. The origin of strain AB and H is described in detail in
28 sense codons are highlighted in grey, along with the amber stop codon. The genome wide removal of these sense codons, but not other sense codons, would enable all their cognate tRNA to be deleted without removing the ability to decode one or more sense codons remaining in the genome. This is necessary but not sufficient for the reassignment of sense codons to unnatural monomers. Serine, leucine and alanine codon boxes are highlighted because the endogenous aminoacyl-tRNA synthetases for these amino acids do not recognize the anticodons of their cognate tRNAs. This may facilitate the assignment of codons within these boxes to new amino acids through the introduction of tRNAs bearing cognate anticodons that do not direct mis-aminoacylation by endogenous synthetases. The number of total codon counts for all 64 triplet codons in the MDS42 genome (GenBank accession no. AP012306), all known codon-anticodon interactions through both Watson-Crick base-pairing and wobbling, base modification on tRNA anticodons, tRNA genes, and measured in vivo tRNA relative abundance are reported. This analysis identifies 10 codons from the serine, leucine, and alanine groups (serine codons TCG, TCA, AGT, AGC; leucine codons CTG, CTA, TTG, TTA; and alanine codons GCG, GCA) that satisfy both the codon-anticodon interaction and aminoacyl-tRNA synthetase recognition criteria for codon reassignment.
A version of the E. coli MDS42 genome in which the serine codons TCG and TCA and the stop codon TAG in open reading frames (ORFs) are systematically replaced by their synonyms AGC, AGT, and TAA, respectively. Using the defined rules for synonymous codon compression and refactoring a genome is designed in which all 18,218 target codons are recoded to their target synonyms.
Sequence of E. coli Syn61, in which all 1.8×10⁴ target codons in the genome are recoded. The synthesis of our recoded genome introduced only eight non-programmed mutations (Table 6); four of these mutations arose during the preparation of the 100 kb BACs, and four during the recoding process.
The terms “comprising”, “comprises” and “comprised of” as used herein are synonymous with “including” or “includes”; or “containing” or “contains”, and are inclusive or open-ended and do not exclude additional, non-recited members, elements or steps. The terms “comprising”, “comprises” and “comprised of” also include the term “consisting of”.
Synthetic Genomes
Genomes
As used herein, a “genome” is the genetic material of an organism, including both genes and non-coding DNA. As used herein, a “synthetic genome” is a synthetically-built genome. Typically a synthetic genome will be produced by genetic modification of a pre-existing (i.e. “parent”) genome. Thus, a synthetic genome may be derived from a parent genome, i.e. identical to a parent genome, except comprising one or more genetic modifications. The skilled person will be able to readily identify the parent genome on which a synthetic genome is based and the genetic modifications carried out. As used herein, a “parent genome” may be any naturally-occurring, commercially-available, deposited, catalogued or otherwise well-known genome, or derivative thereof.
The synthetic genome of the present invention is a synthetic prokaryotic genome. A prokaryote is a unicellular organism that lacks a membrane-bound nucleus, mitochondria, or any other membrane-bound organelle. Prokaryotes are divided into two domains, Archaea and Bacteria. The genome of prokaryotic organisms generally is a circular, double-stranded piece of DNA, multiple copies of which may exist at any time.
Preferably, the synthetic genome of the present invention is a synthetic bacterial genome. Preferably the synthetic bacterial genome is suitable for heterologous protein production, in particular the production of polypeptides comprising one or more non-proteinogenic amino acids (for instance those described by Ferrer-Miralles, N. and Villaverde, A., 2013. Microbial Cell Factories, 12:113). Suitable bacterial genomes include: escherichia (e.g. Escherichia coli), caulobacteria (e.g. Caulobacter crescentus), phototrophic bacteria (e.g. Rhodobacter sphaeroides), cold adapted bacteria (e.g. Pseudoalteromonas haloplanktis, Shewanella sp. strain Ac10), pseudomonads (e.g. Pseudomonas fluorescens, Pseudomonas putida, Pseudomonas aeruginosa), halophilic bacteria (e.g. Halomonas elongata, Chromohalobacter salexigens), streptomycetes (e.g. Streptomyces lividans, Streptomyces griseus), nocardia (e.g. Nocardia lactamdurans), mycobacteria (e.g. Mycobacterium smegmatis), coryneform bacteria (e.g. Corynebacterium glutamicum, Corynebacterium ammoniagenes, Brevibacterium lactofermentum), bacilli (e.g. Bacillus subtilis, Bacillus brevis, Bacillus megaterium, Bacillus licheniformis, Bacillus amyloliquefaciens), and lactic acid bacteria (e.g. Lactococcus lactis, Lactobacillus plantarum, Lactobacillus casei, Lactobacillus reuteri, Lactobacillus gasseri) genomes. In some embodiments the synthetic genome is a synthetic gram-negative bacterial genome.
Bacterial genomes can range in size anywhere from about 130 kb to over 14 Mb. Thus, in some embodiments the synthetic prokaryotic genome of the present invention is 100 kb to 20 Mb, or 130 kb to 15 Mb, or 200 kb to 15 Mb, or 300 kb to 15 Mb, or 500 kb to 15 Mb, or 1 Mb to 15 Mb, or 1 Mb to 10 Mb, or 1 Mb to 8 Mb, or 1 Mb to 6 Mb, or 2 Mb to 6 Mb, or 2 Mb to 5 Mb, or 3 Mb to 5 Mb, or about 4 Mb in size. The synthetic prokaryotic genome may comprise 100 or more, 200 or more, 300 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, 1000 or more, 1500 or more, or 2000 or more genes, preferably 1000 or more genes. The synthetic prokaryotic genome may comprise 100 or more, 200 or more, 300 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, 1000 or more, 1500 or more, or 2000 or more genes for which there is evidence of translation and/or of the predicted protein product, preferably 1000 or more genes. Preferably the synthetic prokaryotic genome comprises 100 or more, 200 or more, 300 or more, 400 or more, 500 or more essential genes, preferably 300 or more essential genes.
Preferably, the synthetic genome of the present invention is a synthetic Escherichia coli, Salmonella enterica, or Shigella dysenteriae genome. These are phylogenetically related species as disclosed by Lukjancenko, O., et al., 2010. Microbial ecology, 60(4), pp. 708-720; and Karberg, K. A., et al., 2011. PNAS, 108(50), pp. 20154-20159.
More preferably, the synthetic genome of the present invention is a synthetic E. coli genome. The parent genome may be any suitable E. coli genome including MDS42, K-12, MG1655, BL21, BL21(DE3), AD494, Origami, HMS174, BLR(DE3), HMS174(DE3), Tuner(DE3), Origami2(DE3), Rosetta2(DE3), Lemo21(DE3), NiCo21(DE3), T7 Express, SHuffle Express, C41(DE3), C43(DE3), and m15 pREP4 or derivatives thereof (Rosano, G. L. and Ceccarelli, E. A., 2014. Frontiers in microbiology, 5, p. 172). Most preferably, the parent genome is MDS42, MG1655, or BL21 or a derivative thereof. MG1655 is considered the wild-type strain of E. coli. The GenBank ID of the genomic sequence of this strain is U00096. BL21 is widely available commercially. For example, it can be purchased from New England BioLabs with catalog number C2530H (https://www.neb.com/products/c2530-bl21-competent-e-coli).
In some embodiments the synthetic genome is a reduced synthetic genome or a minimal synthetic genome. A “reduced genome” is one in which the size of the parent genome has been reduced by removing non-essential genes and/or non-coding regions. A “minimal genome” is a genome which has been reduced to its minimal size whilst remaining viable e.g. by deletion of all non-essential regions of the genome.
The synthetic genome of the present invention may be a viable genome. As used herein, a “viable genome” refers to a genome that contains nucleic acid sequences sufficient to cause and/or sustain viability of a cell, e.g., those encoding molecules required for replication, transcription, translation, energy production, transport, production of membranes and cytoplasmic components, and cell division.
Preferably one or more tRNA or release factors may be deleted from the synthetic genome and the synthetic genome may remain viable. For example, a tRNA which decodes only the one or more sense codons that have been replaced (or deleted) may be dispensable. Similarly, a tRNA which decodes the one or more sense codons that have been replaced (or deleted) may be dispensable if the remaining sense codons that it decodes may also be decoded by an alternative tRNA. For example, serT, encoding tRNASerUGA, is the only tRNA that decodes TCA codons in E. coli, and is therefore normally essential. However, if the synthetic genome does not contain TCA codons then serT may be dispensable.
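The dispensability reasoning above may be sketched as follows. The decoding table is a simplified, illustrative assumption and not an authoritative account of E. coli codon-anticodon interactions; the function name `dispensable_trnas` is hypothetical. A tRNA gene is treated as dispensable if every codon it decodes has either been removed from the genome or can be decoded by at least one other tRNA.

```python
def dispensable_trnas(decoding, removed_codons):
    """Return tRNA genes whose remaining codons are all read by another tRNA."""
    removed = set(removed_codons)
    result = []
    for trna, codons in decoding.items():
        remaining = set(codons) - removed
        covered = all(
            any(codon in other_codons
                for other, other_codons in decoding.items()
                if other != trna)
            for codon in remaining
        )
        if covered:
            result.append(trna)
    return sorted(result)


# Toy decoding table (assumption for illustration only).
toy_decoding = {
    "serT": {"TCA", "TCG", "TCT"},
    "serU": {"TCG"},
    "serW": {"TCC", "TCT"},
    "serV": {"AGC", "AGT"},
}

before = dispensable_trnas(toy_decoding, removed_codons=[])
after = dispensable_trnas(toy_decoding, removed_codons=["TCG", "TCA"])
```

Each tRNA is assessed individually against the full table, so the sketch does not check whether two tRNAs that cover for one another could be deleted together.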
Sense Codons
The current invention provides a synthetic prokaryotic genome comprising 5 or fewer occurrences of one or more sense codons; and/or a synthetic prokaryotic genome derived from a parent genome, wherein the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of one or more sense codons, relative to the parent genome; and/or a synthetic prokaryotic genome comprising 100 or more, 200 or more, or 1000 or more genes with no occurrences of one or more sense codons.
The one or more sense codons may consist of one, two, three, four, five, six, seven, or eight sense codons. Preferably, the one or more sense codons consist of one sense codon or two sense codons, most preferably two sense codons.
The synthetic prokaryotic genome may comprise 5 or fewer (e.g. 5, 4, 3, 2, 1), or no occurrences of one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons. In some embodiments the synthetic prokaryotic genome comprises 5 or fewer (e.g. 5, 4, 3, 2, 1, 0) of each of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons. In other embodiments the synthetic prokaryotic genome comprises 5 or fewer (e.g. 5, 4, 3, 2, 1, 0) of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons combined (i.e. in total). In preferred embodiments the synthetic prokaryotic genome comprises no occurrences of one sense codon. In other preferred embodiments the synthetic prokaryotic genome comprises no occurrences of two sense codons.
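The distinction between the "each" and "combined" embodiments may be illustrated with the following sketch. The function names and the example counts are hypothetical; the limit of 5 is the one recited above.

```python
def meets_each(counts, targets, limit=5):
    """True if every target codon individually occurs at most `limit` times."""
    return all(counts.get(codon, 0) <= limit for codon in targets)


def meets_combined(counts, targets, limit=5):
    """True if the target codons occur at most `limit` times in total."""
    return sum(counts.get(codon, 0) for codon in targets) <= limit


# Hypothetical residual codon counts in a synthetic genome.
residual = {"TCG": 3, "TCA": 4}
each_ok = meets_each(residual, ["TCG", "TCA"])          # 3 <= 5 and 4 <= 5
combined_ok = meets_combined(residual, ["TCG", "TCA"])  # 3 + 4 = 7 > 5
```

A genome may therefore satisfy the per-codon criterion while failing the combined criterion, as in this example.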
The synthetic prokaryotic genome may be derived from a parent genome and comprise 5 or fewer (e.g. 5, 4, 3, 2, 1), or no occurrences of one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) native sense codons. In some embodiments the synthetic prokaryotic genome comprises 5 or fewer (e.g. 5, 4, 3, 2, 1, 0) of each of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) native sense codons. In other embodiments the synthetic prokaryotic genome comprises 5 or fewer (e.g. 5, 4, 3, 2, 1, 0) of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) native sense codons combined (i.e. in total). In preferred embodiments the synthetic prokaryotic genome is derived from a parent genome and comprises no occurrences of one native sense codon. In other preferred embodiments the synthetic prokaryotic genome is derived from a parent genome and comprises no occurrences of two native sense codons.
In some embodiments the synthetic prokaryotic genome comprises 100 or more, 200 or more, 300 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, 1000 or more, 1500 or more, or 2000 or more genes, preferably 1000 or more genes. In some embodiments the genes are those for which there is evidence of translation and/or of the predicted protein product. For example, the synthetic prokaryotic genome may comprise 100 or more, 200 or more, 300 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, 1000 or more, 1500 or more, or 2000 or more genes, preferably 1000 or more genes for which there is evidence of translation and/or of the predicted protein product. Preferably the synthetic prokaryotic genome comprises 100 or more, 200 or more, 300 or more, 400 or more, 500 or more essential genes, preferably 300 or more essential genes. Preferably the (essential) genes have no occurrences of the one or more sense codons.
The synthetic prokaryotic genome may comprise less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons, relative to the parent genome. In some embodiments the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of each of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons, relative to the parent genome. In other embodiments the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons combined, relative to the parent genome. In preferred embodiments the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of one sense codon, relative to the parent genome. In other preferred embodiments the synthetic prokaryotic genome comprises less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of two sense codons, relative to the parent genome.
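By way of non-limiting illustration only, the percentage of occurrences of a sense codon retained relative to the parent genome may be computed as sketched below. The function names and toy sequences are hypothetical, and ORFs are assumed to be in-frame coding-strand DNA strings.

```python
def codon_count(orfs, codon):
    """Count in-frame occurrences of `codon` across a list of ORF strings."""
    return sum(
        orf[i:i + 3] == codon
        for orf in orfs
        for i in range(0, len(orf) - 2, 3)
    )

def retained_percent(parent_orfs, synthetic_orfs, codon):
    """Occurrences in the synthetic genome as a percentage of those in the parent."""
    parent_n = codon_count(parent_orfs, codon)
    if parent_n == 0:
        raise ValueError("codon does not occur in the parent ORFs")
    return 100.0 * codon_count(synthetic_orfs, codon) / parent_n

# Hypothetical toy data: 20 TCG codons in the parent, 1 remaining after recoding.
parent = ["ATG" + "TCG" * 20 + "TAA"]
synthetic = ["ATG" + "AGC" * 19 + "TCG" + "TAA"]
pct = retained_percent(parent, synthetic, "TCG")
below_10 = pct < 10   # satisfies "less than 10%"
below_5 = pct < 5     # does not satisfy "less than 5%"
```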
The synthetic prokaryotic genome may comprise 100 or more, 200 or more, or 1000 or more genes with no occurrences of one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons. Preferably, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) sense codons. In preferred embodiments, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of one sense codon. In other preferred embodiments, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of two sense codons. By “substantially all” is meant that 10 or fewer (e.g. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, or 0) genes comprise occurrences of the one or more sense codons.
The synthetic prokaryotic genome may comprise 100 or more, 200 or more, or 1000 or more genes with no occurrences of one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) native sense codons. Preferably, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of the one or more (e.g. 1, 2, 3, 4, 5, 6, 7, or 8) native sense codons. In preferred embodiments, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of one native sense codon. In other preferred embodiments, all or substantially all the genes in the synthetic prokaryotic genome have no occurrences of two native sense codons. By “substantially all” is meant that 10 or fewer (e.g. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, or 0) genes comprise occurrences of the one or more native sense codons.
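By way of non-limiting illustration only, the gene-level criterion (genes with no occurrences of the one or more sense codons, with 10 or fewer exceptions) may be sketched as follows. The gene names, sequences and function names are hypothetical placeholders.

```python
def genes_containing(genes, targets):
    """Return the names of genes whose ORF contains any target codon in frame."""
    targets = set(targets)
    hits = []
    for name, orf in genes.items():
        codons = {orf[i:i + 3] for i in range(0, len(orf) - 2, 3)}
        if codons & targets:
            hits.append(name)
    return hits

def substantially_all_free(genes, targets, max_exceptions=10):
    """True if at most `max_exceptions` genes still contain a target codon."""
    return len(genes_containing(genes, targets)) <= max_exceptions

# Hypothetical toy genes, for illustration only.
toy_genes = {
    "geneA": "ATGAGCAAATAA",  # no TCG/TCA in frame
    "geneB": "ATGTCGAAATAA",  # contains TCG in frame
    "geneC": "ATGGCTAGTTAA",  # no TCG/TCA in frame
}
remaining = genes_containing(toy_genes, {"TCG", "TCA"})
```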
Preferably the genes encode proteins (e.g. the genes are those for which there is evidence of translation and/or of the predicted protein product) and/or the genes are essential genes. Thus, in more preferred embodiments the synthetic prokaryotic genome comprises 100 or more, 200 or more, or 1000 or more protein-encoding and/or 100 or more, 200 or more, or 300 or more essential genes with no occurrences of one or two sense codons. In other more preferred embodiments all or substantially all the protein-encoding and/or essential genes in the synthetic prokaryotic genome comprise no occurrences of one or two sense codons.
In preferred embodiments no proteins are translated from any of the remaining occurrences of the one or more sense codons and/or genes comprising the remaining occurrences of the one or more sense codons are putative or non-coding genes. In some embodiments the translation of the genes comprising the remaining occurrences of the one or more sense codons is reduced and/or prevented (e.g. the genes may comprise stop codons in the 5′ sequence).
Any remaining occurrences of the sense codons may be necessary to ensure that the synthetic prokaryotic genome is viable. For example, one or more, preferably all, of the remaining occurrences of the one or more sense codons in the synthetic prokaryotic genome may be present in the regulatory elements of essential genes; and/or one or more, preferably all, of the remaining occurrences of the one or more sense codons may be in genes in which there is no evidence for translation or the predicted protein product (i.e. putative or non-coding genes).
As used herein, a “sense codon” is a nucleotide triplet that codes for an amino acid. Thus, sense codons may be identified in a genome by gene prediction, i.e. by identifying regions of the genome that code for proteins (i.e. genes) and the corresponding open reading frames (ORFs). Typically, genomes naturally comprise 61 sense codons: GCT, GCC, GCA, GCG, CGT, CGC, CGA, CGG, AGA, AGG, AAT, AAC, GAT, GAC, TGT, TGC, CAA, CAG, GAA, GAG, GGT, GGC, GGA, GGG, CAT, CAC, ATT, ATC, ATA, TTA, TTG, CTT, CTC, CTA, CTG, AAA, AAG, ATG, TTT, TTC, CCT, CCC, CCA, CCG, TCT, TCC, TCA, TCG, AGT, AGC, ACT, ACC, ACA, ACG, TGG, TAT, TAC, GTT, GTC, GTA, and GTG (read from 5′ to 3′ on the coding strand of DNA). The standard genetic code encodes the 20 canonical amino acids using the 61 triplet codons. 18 of the 20 amino acids are encoded by more than one synonymous codon (see
The 61 sense codons in DNA are transcribed into corresponding mRNA and subsequently decoded by one or more tRNAs. tRNAs carry an amino acid to a ribosome as directed by the sense codons in the mRNA. The tRNAs can recognise one or more sense codons via a complementary anticodon. A sequence of sense codons is subsequently translated into a polypeptide (i.e. a sequence of amino acids). Codon and anticodon interactions in the E. coli genome are shown in
Preferably, the genome wide removal of the one or more sense codons, but not other sense codons, enables all the cognate tRNA corresponding to said one or more sense codons to be deleted without removing the ability to decode the one or more sense codons remaining in the genome. Thus, the one or more sense codons may be selected from: TCG, TCA, AGT, AGC, GCG, GCA, GTG, GTA, CTG, CTA, TTG, TTA, ACG, ACA, CCG, CCA, CGG, CGA, CGT, CGC, AGG, AGA, GGG, GGA, GGT, GGC, ATT, and ATC.
Aminoacyl-tRNA synthetases for serine, leucine and alanine do not recognize the anticodons of their cognate tRNAs. This may facilitate the assignment of codons within these boxes to new amino acids through the introduction of tRNAs bearing cognate anticodons that do not direct mis-aminoacylation by endogenous synthetases. Thus, the one or more sense codons may be selected from: TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA.
Preferably, the one or more sense codons fulfill both these criteria, thus the one or more sense codons may be selected from: TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA. More preferably, the one or more sense codons are selected from TCG, TCA, AGT, AGC, TTG, TTA, GCG and GCA. Most preferably, the one or more sense codons are TCG and/or TCA.
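By way of non-limiting illustration only, the codons fulfilling both criteria correspond to the intersection of the two codon lists given above, which may be reproduced as sketched below. The variable names are hypothetical; the codon sets are copied from the lists above.

```python
# Codons whose genome-wide removal may allow deletion of the cognate tRNA
# without loss of decoding of the remaining codons (first list above).
TRNA_DELETABLE = {
    "TCG", "TCA", "AGT", "AGC", "GCG", "GCA", "GTG", "GTA", "CTG", "CTA",
    "TTG", "TTA", "ACG", "ACA", "CCG", "CCA", "CGG", "CGA", "CGT", "CGC",
    "AGG", "AGA", "GGG", "GGA", "GGT", "GGC", "ATT", "ATC",
}

# Serine, alanine and leucine codons, whose synthetases do not recognize
# the anticodon of the cognate tRNA (second list above).
ANTICODON_INDEPENDENT = {
    "TCG", "TCA", "TCT", "TCC", "AGT", "AGC", "GCG", "GCA", "GCT", "GCC",
    "CTG", "CTA", "CTT", "CTC", "TTG", "TTA",
}

# Codons fulfilling both criteria.
both_criteria = TRNA_DELETABLE & ANTICODON_INDEPENDENT
print(sorted(both_criteria))
```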
Preferably, one or more sense codons are removed such that the genome is compatible with codon reassignment to non-proteinogenic amino acids. Thus, the one or more sense codons may comprise one or more of TCA, CTA, or TTA. Alternatively, two or more sense codons are removed, wherein the two or more sense codons comprise one or more of the sense codon pairs, selected from the group consisting of: GCG and GCA; GCT and GCC; TCG and TCA; AGT and AGC; TCT and TCC; CTG and CTA; TTG and TTA; and CTT and CTC. Preferably, two or more sense codons are removed, wherein the two or more sense codons comprise one or more of the sense codon pairs, selected from the group consisting of: GCG and GCA; TCG and TCA; AGT and AGC; CTG and CTA; and TTG and TTA. More preferably, the two or more sense codons comprise TCG and TCA.
To achieve removal of sense codons they may be replaced with synonymous sense codons. This is preferable to ensure that the encoded protein sequence is not changed. For instance, the present invention provides a synthetic prokaryotic genome wherein 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of one or more sense codons in the parent genome are replaced with synonymous sense codons. The person skilled in the art is able to deduce suitable synonymous sense codon replacements. For example, in E. coli, typically TCG, TCA, TCT, TCC, AGT and AGC all encode serine; typically GCG, GCA, GCT and GCC all encode alanine; typically CTG, CTA, CTT, CTC, TTG and TTA all encode leucine.
In some embodiments, the replacement is a defined replacement, i.e. one sense codon is replaced with a single synonymous sense codon. Preferably, 90% or more, 95% or more, 98% or more, 99% or more, 99.5% or more, 99.6% or more, 99.7% or more, 99.8% or more, 99.9% or more, or 100% of the occurrences of one or more sense codons in the parent genome are replaced with a defined (i.e. single) synonymous sense codon.
For example, the defined replacement may be: GCG replaced with either GCT or GCC; GCA replaced with either GCT or GCC; TCG replaced with any one of TCT, TCC, AGT, or AGC; TCA replaced with any one of TCT, TCC, AGT, or AGC; AGT replaced with any one of TCG, TCA, TCT, or TCC; AGC replaced with any one of TCG, TCA, TCT, or TCC; CTG replaced with any one of CTT, CTC, TTG or TTA; CTA replaced with any one of CTT, CTC, TTG or TTA; TTG replaced with any one of CTG, CTA, CTT or CTC; or TTA replaced with any one of CTG, CTA, CTT or CTC. Preferably the one or more defined sense codon replacements are selected from one or more of: GCG to either GCT or GCC; GCA to either GCT or GCC; TCG to either AGT or AGC; TCA to either AGT or AGC; AGT to either TCA or TCT; AGC to either TCG or TCC or TCA; TTG to CTT; and TTA to CTC. More preferably, TCG and/or TCA are replaced with AGC and/or AGT. Most preferably, TCG is replaced with AGC and/or TCA is replaced with AGT.
Preferably, the defined replacement is such that the genome is compatible with codon reassignment to non-proteinogenic amino acids. For example: (i) GCG may be replaced with either GCT or GCC, and GCA may be replaced with either GCT or GCC; (ii) TCG may be replaced with any of TCT, TCC, AGT, or AGC, and TCA may be replaced with any of TCT, TCC, AGT, or AGC; (iii) AGT may be replaced with any of TCG, TCA, TCT, or TCC, and AGC may be replaced with any of TCG, TCA, TCT, or TCC; (iv) CTG may be replaced with any of CTT, CTC, TTG or TTA, and CTA may be replaced with any of CTT, CTC, TTG or TTA; or (v) TTG may be replaced with any of CTG, CTA, CTT or CTC, and TTA may be replaced with any of CTG, CTA, CTT or CTC.
Preferably, the defined replacement scheme is one or more of those listed in the table below:
Preferably, none of these codon replacements affect ribosomal binding sites (AGGAGG), which are highly conserved regulatory sequences in E. coli. The selected codon replacements may be tested on a small test region (e.g. a 20 kb region of the genome rich in both essential target genes and target codons) to assess viability. If the codon replacements are not viable on the small test region they may be disregarded.
When replacement of one or more sense codons in the parent genome with defined replacement synonymous sense codons does not result in a viable genome, alternative replacement synonymous sense codons may be used. For instance, 99.9% of the occurrences of one or more sense codons in the parent genome may be replaced with a defined (i.e. single) synonymous sense codon, and the remaining 0.1% with alternative synonymous sense codons. For example, 99.9% of the occurrences of TCG may be replaced with AGC and 0.1% replaced with TCT, TCC, AGT or AGC; and/or 99.9% of the occurrences of TCA may be replaced with AGT and 0.1% replaced with TCT, TCC, AGT or AGC.
As used herein, a “stop codon” is a nucleotide triplet that codes for termination of translation into proteins. Typically, genomes naturally comprise 3 stop codons: TAA (“ochre”), TGA (“opal” or “umber”) and TAG (“amber”).
In some embodiments the synthetic prokaryotic genome further comprises 10 or fewer, 5 or fewer, or no occurrences of one or two stop codons, preferably 10 or fewer, 5 or fewer, or no occurrences of the amber stop codon (TAG). Preferably wherein 90% or more, 95% or more, 98% or more, 99% or more, or all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA (the ochre stop codon). In preferred embodiments the synthetic prokaryotic genome comprises no occurrences of the amber stop codon (TAG), optionally wherein all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA (the ochre stop codon).
Accordingly, in preferred embodiments the synthetic prokaryotic genome of the present invention comprises no occurrences of one or more, or two or more sense codons and no occurrences of one stop codon, preferably the amber stop codon (TAG). In more preferred embodiments the synthetic prokaryotic genome of the present invention comprises no occurrences of two sense codons, preferably TCG and TCA, and no occurrences of the amber stop codon (TAG), optionally wherein TCG, TCA and TAG in the parent prokaryotic genome are replaced with synonymous codons, for example 99.9% or more of the occurrences of TCG in the parent prokaryotic genome are replaced with AGC, 99.9% or more of the occurrences of TCA in the parent prokaryotic genome are replaced with AGT and all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA.
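By way of non-limiting illustration only, the defined replacement of TCG with AGC, TCA with AGT and TAG with TAA may be sketched in Python as follows. The function names and the toy ORF are hypothetical; the codon table is the standard genetic code, and the sketch treats a single in-frame ORF in isolation, ignoring overlapping ORFs and regulatory sequences.

```python
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Defined (single) synonymous replacement for each target codon.
SCHEME = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}

def split_codons(orf):
    """Split an in-frame ORF into consecutive three-base codons."""
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]

def translate(orf):
    """Translate an in-frame ORF using the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in split_codons(orf) if len(c) == 3)

def recode(orf, scheme=SCHEME):
    """Replace every in-frame target codon with its defined synonym."""
    return "".join(scheme.get(c, c) for c in split_codons(orf))

# Hypothetical toy ORF: ATG TCG AAA TCA GCT TAG
toy_orf = "ATGTCGAAATCAGCTTAG"
recoded = recode(toy_orf)

# The encoded protein sequence must be unchanged by the replacement.
unchanged = translate(toy_orf) == translate(recoded)
```

The final comparison reflects the requirement that synonymous replacement leaves the encoded protein sequence unchanged.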
In some embodiments the synthetic prokaryotic genome comprises a polynucleotide sequence which is at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.8%, or 99.9% identical to SEQ ID NO:1 or SEQ ID NO:2.
The invention provides a synthetic prokaryotic genome which is at least 98%, 98.5%, 99%, 99.5%, 99.6%, 99.7%, 99.8%, 99.9%, 99.95% or 100% identical to SEQ ID NO:1 or SEQ ID NO:2. Sequence comparisons can be conducted by eye, or more usually, with the aid of readily available sequence comparison programs. These publicly and commercially available computer programs can calculate sequence identity between two or more sequences.
Sequence identity may be calculated over contiguous sequences, i.e. one sequence is aligned with the other sequence and each amino acid in one sequence directly compared with the corresponding amino acid in the other sequence, one residue at a time. This is called an “ungapped” alignment. Typically, such ungapped alignments are performed only over a relatively short number of residues (for example less than 50 contiguous amino acids).
Although this is a very simple and consistent method, it fails to take into consideration that, for example, in an otherwise identical pair of sequences, one insertion or deletion will cause the following amino acid residues to be put out of alignment, thus potentially resulting in a large reduction in % homology when a global alignment is performed. Consequently, most sequence comparison methods are designed to produce optimal alignments that take into consideration possible insertions and deletions without penalising unduly the overall homology score. This is achieved by inserting “gaps” in the sequence alignment to try to maximise local homology.
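By way of non-limiting illustration only, an ungapped identity calculation, and its sensitivity to a single insertion, may be sketched as follows. The function name and toy sequences are hypothetical, and the sketch is not a substitute for the alignment programs referred to herein.

```python
def ungapped_identity(seq_a, seq_b):
    """Percentage of positions that match when two equal-length sequences
    are compared residue by residue, without gaps."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

reference = "ACGTACGTAC"
substituted = "ACGTACGAAC"  # one substituted residue
shifted = "AACGTACGTA"      # one residue inserted at the start, then
                            # truncated to the same length

identity_substituted = ungapped_identity(reference, substituted)
identity_shifted = ungapped_identity(reference, shifted)
```

In this toy example a single substitution leaves 9 of 10 positions matching, whereas a single insertion shifts every following residue out of register so that only 1 of 10 positions matches.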
However, these more complex methods assign “gap penalties” to each gap that occurs in the alignment so that, for the same number of identical amino acids, a sequence alignment with as few gaps as possible (reflecting higher relatedness between the two compared sequences) will achieve a higher score than one with many gaps. “Affine gap costs” are typically used that charge a relatively high cost for the existence of a gap and a smaller penalty for each subsequent residue in the gap. This is the most commonly used gap scoring system. High gap penalties will of course produce optimised alignments with fewer gaps. Most alignment programs allow the gap penalties to be modified. However, it is preferred to use the default values when using such software for sequence comparisons. For example when using the GCG Wisconsin Bestfit package (see below) the default gap penalty for amino acid sequences is −12 for a gap and −4 for each extension.
Calculation of maximum % sequence identity therefore firstly requires the production of an optimal alignment, taking into consideration gap penalties. A suitable computer program for carrying out such an alignment is the GCG Wisconsin Bestfit package (University of Wisconsin, U.S.A; Devereux et al., 1984, Nucleic Acids Research 12:387). Examples of other software that can perform sequence comparisons include, but are not limited to, the BLAST package (see Ausubel et al., 1999 ibid, Chapter 18), FASTA (Altschul et al., 1990, J. Mol. Biol., 403-410) and the GENEWORKS suite of comparison tools. Both BLAST and FASTA are available for offline and online searching (see Ausubel et al., 1999 ibid, pages 7-58 to 7-60). However, it is preferred to use the GCG Bestfit program.
Suitably, the sequence identity may be determined across the entirety of the sequence. Suitably, the sequence identity may be determined across the entirety of the candidate sequence being compared to a sequence recited herein.
Although the final sequence identity can be measured in terms of identity, the alignment process itself is typically not based on an all-or-nothing pair comparison. Instead, a scaled similarity score matrix is generally used that assigns scores to each pairwise comparison based on chemical similarity or evolutionary distance. An example of such a matrix commonly used is the BLOSUM62 matrix (the default matrix for the BLAST suite of programs). GCG Wisconsin programs generally use either the public default values or a custom symbol comparison table if supplied (see user manual for further details). Preferably, the public default values for the GCG package, or in the case of other software the default matrix, such as BLOSUM62, are used.
Once the software has produced an optimal alignment, it is possible to calculate % sequence identity. The software typically does this as part of the sequence comparison and generates a numerical result.
Refactoring
Genomes contain numerous overlapping open reading frames (ORFs), which can be classified as 3′, 3′ (between ORFs in opposite orientations) or 5′, 3′ (between ORFs in the same orientation). The one or more sense codons (i.e. those to be replaced) may be found within both classes of overlap in the parent genome.
If the replacement of the one or more sense codons of each ORF within an overlap can be achieved without changing the encoded protein sequence of either ORF (i.e. by introducing synonymous codon(s)) then it may not be necessary to edit (e.g. refactor) the parent genome. However, when the encoded protein sequence is changed by the replacement of the one or more sense codons, (i.e. one or more synonymous sense codons are not introduced into one or both of the ORFs), then it may be necessary to edit (e.g. refactor) the parent genome.
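By way of non-limiting illustration only, the test of whether a codon replacement within an overlap changes the encoded protein sequence of either ORF may be sketched as follows. The region, coordinates and function names are hypothetical, and the sketch considers only two ORFs in the same orientation on the forward strand, translated with the standard genetic code.

```python
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(orf):
    """Translate an in-frame ORF using the standard genetic code."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))

def replacement_is_silent(region, orfs, position, new_codon):
    """True if replacing the codon at `position` leaves every ORF's protein
    unchanged. `orfs` holds (start, end) half-open forward-strand coordinates."""
    edited = region[:position] + new_codon + region[position + 3:]
    return all(
        translate(region[s:e]) == translate(edited[s:e]) for s, e in orfs
    )

# Hypothetical toy region: the upstream ORF (0-15) reads ATG AAA TCA TGG TAA;
# the downstream ORF (8-23) starts at the ATG that spans TCA and TGG.
REGION = "ATGAAATCATGGTAACCGGCTGA"
ORFS = [(0, 15), (8, 23)]

# TCA -> AGT is synonymous upstream but destroys the downstream start codon.
overlap_ok = replacement_is_silent(REGION, ORFS, 6, "AGT")
# AAA -> AAG lies outside the overlap and is silent for both ORFs.
outside_ok = replacement_is_silent(REGION, ORFS, 3, "AAG")
```

In this toy example the replacement inside the overlap is not silent for the downstream ORF, which is the situation in which editing (e.g. refactoring) of the parent genome may be necessary.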
Thus, in some embodiments one or more pairs of genes which share an overlapping region comprising the one or more sense codons in the parent genome are refactored. “Refactored” means that the genes are reorganised to prevent changes to the encoded protein sequences. Preferably, the pairs of genes are those in which sense codon replacements (e.g. defined synonymous codon replacements) would change the encoded protein sequence of both or either of the pair of genes. Most preferably, all pairs of genes which share an overlapping region comprising the one or more sense codons in the parent genome are refactored, wherein the pairs of genes are those in which sense codon replacements (e.g. defined synonymous codon replacements) would change the encoded protein sequence of both or either of the pair of genes.
For 3′,3′ overlaps (i.e. pairs of genes in opposite orientations) a synthetic insert may be inserted between the genes. For 3′,3′ overlaps the synthetic insert may comprise the overlapping region.
For 5′, 3′ overlaps (i.e. pairs of genes in the same orientation, comprising an upstream gene and a downstream gene) a synthetic insert may be inserted between the genes. For 5′,3′ overlaps the synthetic insert may comprise: (i) a stop codon; (ii) about 20-200 bp, or 20-100 bp, or 20-50 bp, from upstream of the overlapping region; and (iii) the overlapping region. Preferably, the synthetic insert comprises: (i) a stop codon; (ii) about 20 bp from upstream of the overlapping region; and (iii) the overlapping region. This preserves the sequence of the RBS for the downstream ORF and the distance between this RBS and its start codon.
In preferred embodiments the stop codon is in frame with the original start site for the downstream gene. Preferably the stop codon is TAA.
Aside from the specific mutations described above, i.e. those aimed at reducing the amount of one or more sense codons (e.g. replacements of one or more sense codons and/or refactoring) and those aimed at reducing the amount of amber stop codons, the synthetic prokaryotic genome may comprise 1000 or fewer, 100 or fewer, 50 or fewer, 20 or fewer, 10 or fewer additional (i.e. non-programmed) mutations relative to the parent genome. Preferably the synthetic prokaryotic genome comprises 2×10⁻⁴ or fewer additional or non-programmed mutations per target codon (i.e. per occurrence of the one or more sense codons in the parent genome).
Polynucleotides
The invention provides polynucleotides comprising one or more genes with no occurrences of one or more sense codons. The polynucleotides may comprise two or more, three or more, four or more, five or more, ten or more, twenty or more, thirty or more, forty or more, fifty or more, 100 or more, 200 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, 1000 or more, 1500 or more, or 2000 or more genes with no occurrences of one or more sense codons. Preferably, the polynucleotides comprise 100 or more genes with no occurrences of one or more sense codons. More preferably, the polynucleotides comprise 1000 or more genes with no occurrences of one or more sense codons.
The one or more sense codons may consist of one, two, three, four, five, six, seven, or eight sense codons. Preferably, the one or more sense codons consist of one sense codon or two sense codons, most preferably two sense codons. Thus, in preferred embodiments the polynucleotides comprise 100 or more genes with no occurrences of one or two sense codons. In other preferred embodiments the polynucleotides comprise 1000 or more genes with no occurrences of one or two sense codons.
The one or more sense codons may be selected from: TCG, TCA, AGT, AGC, GCG, GCA, GTG, GTA, CTG, CTA, TTG, TTA, ACG, ACA, CCG, CCA, CGG, CGA, CGT, CGC, AGG, AGA, GGG, GGA, GGT, GGC, ATT, and ATC. Alternatively, the one or more sense codons may be selected from: TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA. Preferably, the one or more sense codons are selected from: TCG, TCA, AGT, AGC, GCG, GCA, CTG, CTA, TTG, and TTA. More preferably, the one or more sense codons are selected from TCG, TCA, TTG, TTA, GCG and GCA. Most preferably, the one or more sense codons are TCG and/or TCA.
The one or more sense codons in the genes may be replaced with synonymous sense codons. Preferably, the replacement is a defined replacement, i.e. one sense codon is replaced with a single synonymous sense codon.
For example, GCG may be replaced with GCT or GCC; GCA may be replaced with GCT or GCC; TCG may be replaced with TCT, TCC, AGT, or AGC; TCA may be replaced with TCT, TCC, AGT, or AGC; AGT may be replaced with TCG, TCA, TCT, or TCC; AGC may be replaced with TCG, TCA, TCT, or TCC; CTG may be replaced with CTT, CTC, TTG or TTA; CTA may be replaced with CTT, CTC, TTG or TTA; TTG may be replaced with CTG, CTA, CTT or CTC; or TTA may be replaced with CTG, CTA, CTT or CTC. Preferably the one or more defined sense codon replacements are selected from: GCG to GCT or GCC; GCA to GCT or GCC; TCG to AGT or AGC; TCA to AGT or AGC; AGT to TCA or TCT; AGC to TCG or TCC or TCA; TTG to CTT; and TTA to CTC. More preferably, TCG and/or TCA are replaced with AGC and/or AGT. Most preferably, TCG is replaced with AGC and/or TCA is replaced with AGT.
In some embodiments the genes are those for which there is evidence of translation and/or of the predicted protein product.
In preferred embodiments the genes are essential genes. The essential genes may be selected from one or more of the list consisting of: ribF, lspA, ispH, dapB, folA, imp, yabQ, ftsL, ftsI, murE, murF, mraY, murD, ftsW, murG, murC, ftsQ, ftsA, ftsZ, lpxC, secM, secA, can, folK, hemL, yadR, dapD, map, rpsB, tsf, pyrH, frr, dxr, ispU, cdsA, yaeL, yaeT, lpxD, fabZ, lpxA, lpxB, dnaE, accA, tilS, proS, yafF, hemB, secD, secF, ribD, ribE, thiL, dxs, ispA, dnaX, adk, hemH, lpxH, cysS, folD, entD, mrdB, mrdA, nadD, holA, rlpB, leuS, lnt, glnS, fldA, cydA, infA, cydC, ftsK, lolA, serS, rpsA, msbA, lpxK, kdsB, mukF, mukE, mukB, asnS, fabA, mviN, rne, fabD, fabG, acpP, tmk, holB, lolC, lolD, lolE, purB, minE, minD, pth, prsA, ispE, lolB, hemA, prfA, prmC, kdsA, topA, ribA, fabI, tyrS, ribC, ydiL, pheT, pheS, rplT, infC, thrS, nadE, gapA, yeaZ, aspS, argS, pgsA, yefM, metG, folE, yejM, gyrA, nrdA, nrdB, folC, accD, fabB, gltX, ligA, zipA, dapE, dapA, der, hisS, ispG, suhB, tadA, acpS, era, rnc, lepB, rpoE, pssA, yfiO, rplS, trmD, rpsP, ffh, grpE, csrA, ispF, ispD, ftsB, eno, pyrG, chpR, lgt, fbaA, pgk, yqgD, metK, yqgF, plsC, ygiT, parE, ribB, cca, ygjD, tdcF, yraL, yhbV, infB, nusA, ftsH, obgE, rpmA, rplU, ispB, murA, yrbB, yrbK, yhbN, rpsI, rplM, degS, mreD, mreC, mreB, accB, accC, yrdC, def, fmt, rplQ, rpoA, rpsD, rpsK, rpsM, secY, rplO, rpmD, rpsE, rplR, rplF, rpsH, rpsN, rplE, rplX, rplN, rpsQ, rpmC, rplP, rpsC, rplV, rpsS, rplB, rplW, rplD, rplC, rpsJ, fusA, rpsG, rpsL, trpS, yrfF, asd, rpoH, ftsX, ftsE, ftsY, yhhQ, bcsB, glyQ, gpsA, rfaK, kdtA, coaD, rpmB, dfp, dut, gmk, spoT, gyrB, dnaN, dnaA, rpmH, rnpA, yidC, tnaB, glmS, glmU, wzyE, hemD, hemC, yigP, ubiB, ubiD, hemG, yihA, ftsN, murI, murB, birA, secE, nusG, rplJ, rplL, rpoB, rpoC, ubiA, plsB, lexA, dnaB, ssb, alsK, groS, psd, orn, yjeE, rpsR, chpS, ppa, valS, yjgP, yjgQ, and dnaC.
Preferably, the essential genes may be selected from one or more of the list consisting of: ribF, lspA, ispH, dapB, folA, imp, yabQ, lpxC, secM, secA, can, folK, hemL, yadR, dapD, map, rpsB, tsf, pyrH, frr, dxr, ispU, cdsA, yaeL, yaeT, lpxD, fabZ, lpxA, lpxB, dnaE, accA, tilS, proS, yafF, hemB, secD, secF, ribD, ribE, thiL, dxs, ispA, dnaX, adk, hemH, lpxH, cysS, folD, entD, mrdB, mrdA, nadD, holA, rlpB, leuS, lnt, glnS, fldA, cydA, infA, cydC, ftsK, lolA, serS, rpsA, msbA, lpxK, kdsB, mukF, mukE, mukB, asnS, fabA, mviN, rne, fabD, fabG, acpP, tmk, holB, lolC, lolD, lolE, purB, minE, minD, pth, prsA, ispE, lolB, hemA, prfA, prmC, kdsA, topA, ribA, fabI, tyrS, ribC, ydiL, pheT, pheS, rplT, infC, thrS, nadE, gapA, yeaZ, aspS, argS, pgsA, yefM, metG, folE, yejM, gyrA, nrdA, nrdB, folC, accD, fabB, gltX, ligA, zipA, dapE, dapA, der, hisS, ispG, suhB, tadA, acpS, era, rnc, lepB, rpoE, pssA, yfiO, rplS, trmD, rpsP, ffh, grpE, csrA, ispF, ispD, ftsB, eno, pyrG, chpR, lgt, fbaA, pgk, yqgD, metK, yqgF, plsC, ygiT, parE, ribB, cca, ygjD, tdcF, yraL, yhbV, infB, nusA, ftsH, obgE, rpmA, rplU, ispB, murA, yrbB, yrbK, yhbN, rpsI, rplM, degS, mreD, mreC, mreB, accB, accC, yrdC, def, fmt, rplQ, rpoA, rpsD, rpsK, rpsM, secY, rplO, rpmD, rpsE, rplR, rplF, rpsH, rpsN, rplE, rplX, rplN, rpsQ, rpmC, rplP, rpsC, rplV, rpsS, rplB, rplW, rplD, rplC, rpsJ, fusA, rpsG, rpsL, trpS, yrfF, asd, rpoH, ftsX, ftsE, ftsY, yhhQ, bcsB, glyQ, gpsA, rfaK, kdtA, coaD, rpmB, dfp, dut, gmk, spoT, gyrB, dnaN, dnaA, rpmH, rnpA, yidC, tnaB, glmS, glmU, wzyE, hemD, hemC, yigP, ubiB, ubiD, hemG, yihA, ftsN, murI, murB, birA, secE, nusG, rplJ, rplL, rpoB, rpoC, ubiA, plsB, lexA, dnaB, ssb, alsK, groS, psd, orn, yjeE, rpsR, chpS, ppa, valS, yjgP, yjgQ, and dnaC.
Accordingly, the invention provides polynucleotides comprising one or more essential genes with no TCG codons and/or TCA codons, wherein the one or more essential genes is selected from the list consisting of: ribF, lspA, ispH, dapB, folA, imp, yabQ, lpxC, secM, secA, can, folK, hemL, yadR, dapD, map, rpsB, tsf, pyrH, frr, dxr, ispU, cdsA, yaeL, yaeT, lpxD, fabZ, lpxA, lpxB, dnaE, accA, tilS, proS, yafF, hemB, secD, secF, ribD, ribE, thiL, dxs, ispA, dnaX, adk, hemH, lpxH, cysS, folD, entD, mrdB, mrdA, nadD, holA, rlpB, leuS, lnt, glnS, fldA, cydA, infA, cydC, ftsK, lolA, serS, rpsA, msbA, lpxK, kdsB, mukF, mukE, mukB, asnS, fabA, mviN, rne, fabD, fabG, acpP, tmk, holB, lolC, lolD, lolE, purB, minE, minD, pth, prsA, ispE, lolB, hemA, prfA, prmC, kdsA, topA, ribA, fabI, tyrS, ribC, ydiL, pheT, pheS, rplT, infC, thrS, nadE, gapA, yeaZ, aspS, argS, pgsA, yefM, metG, folE, yejM, gyrA, nrdA, nrdB, folC, accD, fabB, gltX, ligA, zipA, dapE, dapA, der, hisS, ispG, suhB, tadA, acpS, era, rnc, lepB, rpoE, pssA, yfiO, rplS, trmD, rpsP, ffh, grpE, csrA, ispF, ispD, ftsB, eno, pyrG, chpR, lgt, fbaA, pgk, yqgD, metK, yqgF, plsC, ygiT, parE, ribB, cca, ygjD, tdcF, yraL, yhbV, infB, nusA, ftsH, obgE, rpmA, rplU, ispB, murA, yrbB, yrbK, yhbN, rpsI, rplM, degS, mreD, mreC, mreB, accB, accC, yrdC, def, fmt, rplQ, rpoA, rpsD, rpsK, rpsM, secY, rplO, rpmD, rpsE, rplR, rplF, rpsH, rpsN, rplE, rplX, rplN, rpsQ, rpmC, rplP, rpsC, rplV, rpsS, rplB, rplW, rplD, rplC, rpsJ, fusA, rpsG, rpsL, trpS, yrfF, asd, rpoH, ftsX, ftsE, ftsY, yhhQ, bcsB, glyQ, gpsA, rfaK, kdtA, coaD, rpmB, dfp, dut, gmk, spoT, gyrB, dnaN, dnaA, rpmH, rnpA, yidC, tnaB, glmS, glmU, wzyE, hemD, hemC, yigP, ubiB, ubiD, hemG, yihA, ftsN, murI, murB, birA, secE, nusG, rplJ, rplL, rpoB, rpoC, ubiA, plsB, lexA, dnaB, ssb, alsK, groS, psd, orn, yjeE, rpsR, chpS, ppa, valS, yjgP, yjgQ, and dnaC.
Preferably, the polynucleotides comprise two or more, three or more, four or more, five or more, ten or more, twenty or more, thirty or more, forty or more, fifty or more, 100 or more, or 200 or more essential genes with no TCG codons and/or TCA codons.
In some embodiments the polynucleotide comprises a polynucleotide sequence which is at least 80%, 85%, 90%, 95%, 98%, 99%, 99.5%, 99.8%, or 99.9%, or 100% identical to SEQ ID NO:1 or SEQ ID NO:2 or to any fragment of SEQ ID NO:1 or SEQ ID NO:2, preferably wherein the fragment is at least 10 kb, 20 kb, 50 kb, 100 kb, or 500 kb in length.
Preferably the polynucleotide is viable, i.e. the polynucleotide may be incorporated into a genome such that the genome is a viable genome. Preferably, the polynucleotide may replace a corresponding region of the parent genome and retain viability of said genome. As used herein, a “viable genome” refers to a genome that contains nucleic acid sequences sufficient to cause and/or sustain viability of a cell, e.g., those encoding molecules required for replication, transcription, translation, energy production, transport, production of membranes and cytoplasmic components, and cell division. Thus, the present invention also provides a viable synthetic prokaryotic genome (e.g. a viable synthetic E. coli genome) comprising the polynucleotide of the present invention.
The invention provides a polynucleotide which is at least 98%, 98.5%, 99%, 99.5%, 99.6%, 99.7%, 99.8%, 99.9%, 99.95% or 100% identical to SEQ ID NO:1 or SEQ ID NO:2 or to any fragment of SEQ ID NO:1 or SEQ ID NO:2, preferably wherein the fragment is at least 10 kb, 20 kb, 50 kb, 100 kb, or 500 kb in length.
Host Cells and Uses Thereof
Host Cells
The invention also provides a host cell comprising the synthetic prokaryotic genome or the polynucleotide of the invention. The host cell may be an isolated host cell.
The host cell of the present invention is a prokaryotic cell. More preferably, the host cell is a bacterial cell. Preferably the bacterial host cell is suitable for heterologous protein production, in particular the production of polypeptides comprising one or more non-proteinogenic amino acids (for instance those described by Ferrer-Miralles, N. and Villaverde, A., 2013. Microbial Cell Factories, 12:113). Suitable bacterial host cells include: escherichia (e.g. Escherichia coli), caulobacteria (e.g. Caulobacter crescentus), phototrophic bacteria (e.g. Rhodobacter sphaeroides), cold adapted bacteria (e.g. Pseudoalteromonas haloplanktis, Shewanella sp. strain Ac10), pseudomonads (e.g. Pseudomonas fluorescens, Pseudomonas putida, Pseudomonas aeruginosa), halophilic bacteria (e.g. Halomonas elongata, Chromohalobacter salexigens), streptomycetes (e.g. Streptomyces lividans, Streptomyces griseus), nocardia (e.g. Nocardia lactamdurans), mycobacteria (e.g. Mycobacterium smegmatis), coryneform bacteria (e.g. Corynebacterium glutamicum, Corynebacterium ammoniagenes, Brevibacterium lactofermentum), bacilli (e.g. Bacillus subtilis, Bacillus brevis, Bacillus megaterium, Bacillus licheniformis, Bacillus amyloliquefaciens), and lactic acid bacteria (e.g. Lactococcus lactis, Lactobacillus plantarum, Lactobacillus casei, Lactobacillus reuteri, Lactobacillus gasseri). In some embodiments the bacterial host cell is a gram-negative bacterium.
Preferably, the host cell is an Escherichia coli, Salmonella enterica, or Shigella dysenteriae.
More preferably, the host cell is an E. coli. Suitable E. coli host cells include MDS42, K-12, MG1655, BL21, BL21(DE3), AD494, Origami, HMS174, BLR(DE3), HMS174(DE3), Tuner(DE3), Origami2(DE3), Rosetta2(DE3), Lemo21(DE3), NiCo21(DE3), T7 Express, SHuffle Express, C41(DE3), C43(DE3), and M15 pREP4 or derivatives thereof (Rosano, G. L. and Ceccarelli, E. A., 2014. Frontiers in microbiology, 5, p. 172). Most preferably, the host cell is MDS42, MG1655, or BL21 or a derivative thereof. MG1655 is considered the wild-type strain of E. coli. The GenBank ID of the genomic sequence of this strain is U00096. BL21 is widely available commercially. For example, it can be purchased from New England BioLabs with catalog number C2530H.
The host cell may preferably be the same as that from which the synthetic prokaryotic genome or polynucleotide is from (or derived from). For example, if the synthetic prokaryotic genome is a synthetic E. coli genome then the host cell is preferably an E. coli. When the parent genome of a cell has been modified to produce the synthetic prokaryotic genome of the present invention, the host cell is preferably the same cell, i.e. preferably the host cell comprising the synthetic prokaryotic genome is the same as the host cell of the parent genome (the parent host cell).
The host cell may be viable, i.e. able to grow and replicate.
When the genome of a cell has been modified to produce the synthetic prokaryotic genome of the present invention, the synthetic prokaryotic genome is preferably one which, when present in the parent host cell, does not substantially decrease the growth rate. Thus, preferably the host cell comprising the synthetic prokaryotic genome does not have a substantially decreased growth rate relative to the host cell comprising the parent genome. In some embodiments the host cell comprising the synthetic prokaryotic genome has a doubling time less than 4 times, 3 times, 2 times, or about 1.6 times, slower than the host cell comprising the parent genome. The doubling time can be determined by any method known to those of skill in the art. In some embodiments the doubling time is determined at 37° C., 25° C. or 42° C., in LB media.
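By way of a non-limiting sketch, one common way to estimate a doubling time is to fit a straight line to the natural logarithm of optical density against time during exponential growth, and to take ln(2) divided by the slope. The example below assumes the readings are from exponential phase only; the function name and the optical density values are hypothetical and do not represent measured data.

```python
import math

def doubling_time(times_h, od_values):
    """Doubling time (hours) from a least-squares fit of ln(OD) against time."""
    logs = [math.log(v) for v in od_values]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
             / sum((t - mean_t) ** 2 for t in times_h))
    return math.log(2) / slope

# Hypothetical exponential-phase readings (not measured data).
times = [0.0, 0.5, 1.0, 1.5, 2.0]
parent_od = [0.05 * 2 ** (t / 0.5) for t in times]      # doubles every 0.5 h
synthetic_od = [0.05 * 2 ** (t / 0.8) for t in times]   # doubles every 0.8 h

ratio = doubling_time(times, synthetic_od) / doubling_time(times, parent_od)
print(round(ratio, 2))  # 1.6
```

The ratio of the two doubling times is the quantity compared against the thresholds recited above (e.g. less than 2 times, or about 1.6 times).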
When the genome of a cell has been modified to produce the synthetic prokaryotic genome of the present invention, the synthetic prokaryotic genome is preferably one which, when present in the parent host cell, does not cause any substantial phenotypical changes. Thus, preferably the host cell comprising the synthetic prokaryotic genome does not have any substantial phenotypical changes relative to the host cell comprising the parent genome. In some embodiments the host cell comprising the synthetic prokaryotic genome has a mean cell length less than 100%, 50%, or about 20% greater than the host cell comprising the parent genome. For example, the cell length may be about 1.5 to 3 microns. The cell length can be determined by any method known to those of skill in the art. In some embodiments the host cell comprising the synthetic prokaryotic genome has a proteome that is not substantially different from the proteome of the host cell comprising the parent genome. The proteome can be determined by any method known to those of skill in the art.
Reassignment to Alternative Canonical Amino Acids
In some embodiments the one or more sense codons (i.e. those removed from the parent genome) are reassigned to encode alternative canonical amino acids. For example, if TCG and TCA have been removed, one or both may be reassigned to encode a canonical amino acid other than serine (e.g. alanine).
For instance, the synthetic prokaryotic genome of the present invention substantially or completely lacks one or more sense codons. Therefore, one or more tRNA or release factors may be deleted from the synthetic genome. For instance, a tRNA which decodes the one or more sense codons that have been replaced (or deleted) may be deleted from the synthetic prokaryotic genome. A tRNA which decodes one or more sense codons that have been replaced (or deleted) may be deleted and the synthetic prokaryotic genome will remain viable if the tRNA decodes only the one or more sense codons that have been replaced (or deleted); or alternatively if the tRNA decodes one or more sense codons that have been replaced (or deleted) and one or more sense codons that have not been replaced (or deleted), if the tRNA is dispensable for the one or more sense codons that have not been replaced (or deleted) (i.e. the one or more remaining sense codons which the tRNA decodes are decoded by one or more alternative tRNAs). For example, if the synthetic prokaryotic genome lacks TCA sense codons, serT, encoding tRNASerUGA, may be deleted and/or if the synthetic prokaryotic genome lacks TCG sense codons, serU, encoding tRNASerCGA, may be deleted. The deletion of one or more tRNAs may be used, for instance, in combination with a reassigned, endogenous tRNA or an orthogonal aminoacyl-tRNA synthetase/tRNA pair to reassign the one or more sense codons to an alternative amino acid.
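The dispensability condition described above may be expressed, purely as a sketch, as a set comparison: a tRNA is a candidate for deletion if every codon it decodes has either been removed from the genome or is also decoded by at least one other tRNA. The tRNA labels and codon assignments below are hypothetical and deliberately simplified; they are not a statement of the actual decoding properties of any E. coli tRNA.

```python
def deletable_trnas(decoding, removed):
    """Return tRNAs whose remaining codons are all covered by other tRNAs.

    decoding: dict mapping a tRNA label to the set of codons it decodes.
    removed:  set of codons absent from the synthetic genome.
    Each tRNA is assessed individually; deleting several tRNAs at once
    would need a joint check, which this sketch does not perform.
    """
    candidates = []
    for name, codons in decoding.items():
        remaining = codons - removed
        covered_by_others = set().union(
            *(c for other, c in decoding.items() if other != name)
        )
        if remaining <= covered_by_others:
            candidates.append(name)
    return candidates

# Hypothetical, simplified decoding map (illustration only).
toy_decoding = {
    "tRNA_1": {"TCA"},          # decodes only a removed codon
    "tRNA_2": {"TCG", "TCC"},   # TCC is also decoded by tRNA_3
    "tRNA_3": {"TCC", "TCT"},   # sole decoder of TCT
}
print(deletable_trnas(toy_decoding, removed={"TCA", "TCG"}))
# ['tRNA_1', 'tRNA_2']
```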
For example, if TCG and TCA have been removed from the synthetic prokaryotic genome, serT, encoding tRNASerUGA, and serU, encoding tRNASerCGA, may be deleted from the synthetic prokaryotic genome, and either the tRNACGA can be reassigned (e.g. to tRNAAlaCGA) or an orthogonal aminoacyl-tRNA synthetase/tRNACGA pair may be introduced to the host cell (e.g. by a heterologous nucleic acid or by incorporation into the synthetic prokaryotic genome) to reassign TCG to an alternative canonical amino acid. Thus, in some embodiments, the host cell of the present invention further comprises one or more reassigned tRNAs and/or one or more heterologous nucleotides (e.g. plasmids) encoding one orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair. In some embodiments the host cell of the present invention further comprises a plasmid encoding an orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair. Alternatively, the orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair may be introduced into the host cell by incorporation into the synthetic prokaryotic genome. Thus, in some embodiments the synthetic prokaryotic genome encodes an orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair, preferably wherein the gene encoding the native tRNA has been deleted from the parent prokaryotic genome. In preferred embodiments the host cell of the present invention further comprises one or more reassigned tRNAs. Methods for reassigning tRNAs will be well known to those of skill in the art.
The reassignment to encode alternative canonical amino acids may increase biosafety. Thus, in some embodiments the host cell of the present invention has increased biosafety. Accordingly, the present invention provides host cells with improved biosafety.
For example, the reassignment to encode alternative canonical amino acids may render the host cell comprising the synthetic prokaryotic genome resistant to bacteriophage infection. One or more bacteriophage genes will typically comprise the one or more sense codons, thus when the one or more bacteriophage genes are translated an alternative canonical amino acid may be incorporated into the corresponding bacteriophage proteins. The incorporation of an alternative canonical amino acid may destabilise, disrupt or reduce the activity of said proteins, thus reducing the infectivity of the bacteriophage and rendering the host cell resistant to bacteriophage infection.
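The effect described above can be illustrated with a toy translation, assuming for the sake of example that TCG has been reassigned from serine to alanine in the host. The open reading frame below is hypothetical and does not correspond to any bacteriophage gene; the codon table is the standard genetic code.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built in TCAG order (DNA codons).
STANDARD_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds: str, code: dict) -> str:
    """Translate a DNA coding sequence in frame 0 using the given code."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(code[cds[i:i + 3]] for i in range(0, usable, 3))

# Assumed reassignment for illustration: TCG read as alanine.
REASSIGNED_CODE = dict(STANDARD_CODE, TCG="A")

toy_orf = "ATGTCGGGTTCGTAA"  # hypothetical ORF containing two TCG codons
print(translate(toy_orf, STANDARD_CODE))    # MSGS*
print(translate(toy_orf, REASSIGNED_CODE))  # MAGA*
```

A gene that retains the reassigned codon is therefore translated into a different protein in the recoded host, whereas genes of the synthetic genome, which lack that codon, are unaffected.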
Thus, in some embodiments the host cell of the present invention is resistant to phage infection. For example, when the genome of a cell has been modified to produce the synthetic prokaryotic genome of the present invention, the synthetic prokaryotic genome may be one which, when present in the parent host cell, increases resistance to phage infection. Thus, in some embodiments the host cell comprising the synthetic prokaryotic genome has increased phage resistance relative to the host cell comprising the parent genome.
Accordingly, the present invention provides phage-resistant host cells and host cells with increased phage resistance.
The reassignment to encode alternative canonical amino acids may also allow genetic material, e.g. antibiotic resistance genes, to be designed such that they are functional in the recoded strain, but not in wild type strains. For example, the genetic material may be incorporated into the host cell of the present invention (e.g. by a heterologous nucleic acid or by incorporation into the synthetic prokaryotic genome) such that the host cell will grow in certain conditions (e.g. in the presence of an antibiotic), but other host cells (e.g. the parent host cell) will not. Thus, in some embodiments the host cell of the present invention may render a composition comprising the host cell more resistant to contamination by other host cells (e.g. other prokaryotes).
Reassignment to Non-Proteinogenic Amino Acids
In some embodiments the one or more sense codons (i.e. those removed from the parent genome) are reassigned to encode non-canonical amino acids (non-proteinogenic amino acids).
Thus, the present invention provides for use of a host cell according to the present invention for producing polypeptides comprising one or more non-proteinogenic amino acids, preferably two or more non-proteinogenic amino acids, most preferably three or more non-proteinogenic amino acids.
The present invention also provides polypeptides obtained or obtainable by using a host cell according to the present invention. In some embodiments, the polypeptides comprise one or more non-proteinogenic amino acids, preferably two or more non-proteinogenic amino acids, most preferably three or more non-proteinogenic amino acids. Thus, the present invention also provides polypeptides comprising two or more non-proteinogenic amino acids and polypeptides comprising three or more non-proteinogenic amino acids.
As used herein, “non-proteinogenic amino acids” (also known as “non-coded amino acids” or “noncanonical amino acids”) are amino acids that are not naturally encoded or found in the genetic code. Despite the use of only 22 amino acids by the translational machinery to assemble proteins (the proteinogenic amino acids—20 in the standard genetic code and an additional 2 that can be incorporated by special translation mechanisms), over 140 amino acids are known to occur naturally in proteins and thousands more may occur in nature or be synthesized in the laboratory. Thus, non-proteinogenic amino acids may comprise any amino acid excluding L-alanine, L-cysteine, L-aspartic acid, L-glutamic acid, L-phenylalanine, glycine, L-histidine, L-isoleucine, L-lysine, L-leucine, L-methionine, L-asparagine, L-proline, L-glutamine, L-arginine, L-serine, L-threonine, L-valine, L-tryptophan and L-tyrosine, and optionally L-pyrrolysine and L-selenocysteine.
In some embodiments, the non-proteinogenic amino acids are unnatural amino acids (UAAs).
The non-proteinogenic amino acid or UAA is not particularly limited. Suitable non-proteinogenic amino acids and UAAs will be well known to those of skill in the art, for example those disclosed in Neumann, H., 2012. FEBS letters, 586(15), pp. 2057-2064; and Liu, C. C. and Schultz, P. G., 2010. Annual review of biochemistry, 79, pp. 413-444. In some embodiments the non-proteinogenic amino acid and/or UAAs are selected from one or more of: p-Acetylphenylalanine, m-Acetylphenylalanine, O-allyltyrosine, Phenylselenocysteine, p-Propargyloxyphenylalanine, p-Azidophenylalanine, p-Boronophenylalanine, O-methyltyrosine, p-Aminophenylalanine, p-Cyanophenylalanine, m-Cyanophenylalanine, p-Fluorophenylalanine, p-Iodophenylalanine, p-Bromophenylalanine, p-Nitrophenylalanine, L-DOPA, 3-Aminotyrosine, 3-Iodotyrosine, p-Isopropylphenylalanine, 3-(2-Naphthyl)alanine, Biphenylalanine, Homoglutamine, D-tyrosine, p-Hydroxyphenyllactic acid, 2-Aminocaprylic acid, Bipyridylalanine, HQ-alanine, p-Benzoylphenylalanine, o-Nitrobenzylcysteine, o-Nitrobenzylserine, 4,5-Dimethoxy-2-nitrobenzylserine, o-Nitrobenzyllysine, o-Nitrobenzyltyrosine, 2-Nitrophenylalanine, Dansylalanine, p-Carboxymethylphenylalanine, 3-Nitrotyrosine, Sulfotyrosine, Acetyllysine, Methylhistidine, 2-Aminononanoic acid, 2-Aminodecanoic acid, Pyrrolysine, Cbz-lysine, Boc-lysine and Allyloxycarbonyllysine.
Prokaryotes, e.g. E. coli, are not typically able to incorporate most eukaryotic post-translational modifications, such as ubiquitination, glycosylation and phosphorylation, nor are they typically capable of other eukaryotic maturation processes, including proteolytic protein maturation. In addition, correct disulphide bond formation and lipopolysaccharide contamination can be troublesome (see Ovaa, H., 2014. Frontiers in chemistry, 2, p. 15). However, therapeutic proteins, such as antibodies, enzymes and cytokines, commonly carry post-translational modifications and disulphide bonds, and often require proteolytic maturation to attain their correctly folded state. Thus, the majority of therapeutic proteins are produced in eukaryotic and mammalian cell systems. However, expression in prokaryotic host cells, e.g. E. coli, is in general cheaper, more amenable to genetic modification, more versatile with regard to mutant library development, and suitable for industrial scale fermentation (Ovaa, H., 2014. Frontiers in chemistry, 2, p. 15).
Thus, in some embodiments the polypeptides are therapeutic polypeptides, preferably wherein mammalian protein modifications have been introduced via one or more non-proteinogenic amino acids. For example, amber codon suppression has previously been used to incorporate one or more non-proteinogenic amino acids (i.e. mammalian protein modifications) into therapeutic polypeptides. The present invention allows two or more non-proteinogenic amino acids to be incorporated. Thus, the present invention provides a therapeutic polypeptide comprising two or more non-proteinogenic amino acids.
The synthetic prokaryotic genome of the present invention substantially or completely lacks one or more sense codons, therefore one or more tRNA or release factors may be deleted from the synthetic genome. For example, a tRNA which decodes only the one or more sense codons that have been replaced (or deleted) may be deleted from the synthetic prokaryotic genome. For example, if the synthetic prokaryotic genome lacks TCA sense codons, serT, encoding tRNASerUGA, may be deleted and/or if the synthetic prokaryotic genome lacks TCG sense codons, serU, encoding tRNASerCGA, may be deleted. The synthetic prokaryotic genome may then be used (in conjunction with an orthogonal aminoacyl-tRNA synthetase-tRNA pair) to direct the incorporation of non-proteinogenic amino acids into proteins.
Genetic code expansion uses an orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair to direct the incorporation of non-proteinogenic amino acids into proteins, in response to an unassigned codon (e.g. the amber stop codon, UAG) introduced at the desired site in a gene of interest. The orthogonal synthetase does not recognize endogenous tRNAs, and specifically aminoacylates an orthogonal cognate tRNA (which is not an efficient substrate for endogenous synthetases) with the non-proteinogenic amino acids provided to (or synthesized by) the cell (Chin, J. W., 2017. Nature, 550(7674), 53-60). The person skilled in the art would be able to identify and/or generate suitable orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pairs (e.g. Elliott, T. S. et al., 2014. Nat Biotechnol 32, 465-472; Elliott, T. S., et al., 2016. Cell Chem Biol 23, 805-815; and Krogager, T. P. et al., 2018. Nat Biotechnol 36, 156-159). Thus, in some embodiments, the host cell of the present invention further comprises one or more heterologous nucleotides (e.g. plasmids) encoding one orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair. In preferred embodiments the host cell of the present invention further comprises a plasmid encoding an orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair. Alternatively, the orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair may be introduced into the host cell by incorporation into the synthetic prokaryotic genome. Thus, in some embodiments the synthetic prokaryotic genome encodes an orthogonal aminoacyl-tRNA synthetase (aaRS)-tRNA pair, preferably wherein the gene encoding the native tRNA has been deleted from the parent prokaryotic genome.
Thus, in some embodiments the host cell of the present invention further comprises one or more heterologous nucleotides (e.g. plasmids) which comprise one or more genes comprising said sense codons. In preferred embodiments the host cell further comprises a plasmid comprising a gene comprising said sense codons. The one or more sense codons may be present in a desired site in the gene, preferably wherein the desired site allows incorporation of one or more non-proteinogenic amino acids (i.e. mammalian protein modifications) into polypeptides, preferably therapeutic polypeptides.
In other embodiments said sense codons may be present in one or more genes in the synthetic prokaryotic genome (for example, the heterologous nucleotide may be incorporated into the synthetic prokaryotic genome). The one or more sense codons may be present in a desired site in the gene, preferably wherein the desired site allows incorporation of one or more non-proteinogenic amino acids (i.e. mammalian protein modifications) into polypeptides, preferably therapeutic polypeptides.
For example, if TCG and TCA have been removed from the synthetic prokaryotic genome, serT, encoding tRNASerUGA, and serU, encoding tRNASerCGA, may be deleted from the synthetic prokaryotic genome, and an orthogonal aminoacyl-tRNA synthetase/tRNACGA pair may be used in combination with (heterologous) genes comprising the TCG codon, to encode polypeptides comprising one or more non-proteinogenic amino acids. Thus, the host cell of the present invention may, for instance, further comprise: (i) a plasmid encoding an orthogonal aminoacyl-tRNA synthetase/tRNACGA pair; and (ii) a plasmid comprising a gene comprising one or more TCG codons. Similarly, if AGT and AGC are removed, serV, encoding tRNASerGCU, may be deleted from the synthetic prokaryotic genome, and an orthogonal aminoacyl-tRNA synthetase/tRNAACU pair and/or an orthogonal aminoacyl-tRNA synthetase/tRNAGCU pair may be used. Similarly, if CTG and CTA are removed, leuP,Q,T,V, encoding tRNALeuCAG, and leuW, encoding tRNALeuUAG, may be deleted from the synthetic prokaryotic genome, and an orthogonal aminoacyl-tRNA synthetase/tRNACAG pair may be used. Similarly, if TTG and TTA are removed, leuX, encoding tRNALeuCAA, and leuZ, encoding tRNALeuUAA, may be deleted from the synthetic prokaryotic genome, and an orthogonal aminoacyl-tRNA synthetase/tRNACAA pair and/or an orthogonal aminoacyl-tRNA synthetase/tRNAUAA pair may be used. Similarly, if GCG and GCA are removed, alaT,U,V, encoding tRNAAlaUGC, may be deleted from the synthetic prokaryotic genome, and an orthogonal aminoacyl-tRNA synthetase/tRNACGC pair may be used.
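The anticodons named in the examples above follow from the removed codons by simple base pairing: the anticodon, written 5′ to 3′ in RNA, is the reverse complement of the codon. The sketch below shows only this Watson-Crick relationship and, as a stated simplification, ignores wobble pairing and base modifications; the function name is hypothetical.

```python
# DNA base -> complementary RNA base.
_COMPLEMENT_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}

def anticodon_for(codon_dna: str) -> str:
    """Watson-Crick anticodon (RNA, 5'->3') for a DNA codon written 5'->3'."""
    return "".join(_COMPLEMENT_RNA[base] for base in reversed(codon_dna.upper()))

for codon in ["TCG", "TCA", "AGT", "AGC", "CTG", "CTA", "TTG", "TTA", "GCG", "GCA"]:
    print(codon, "->", anticodon_for(codon))
# TCG -> CGA
# TCA -> UGA
# AGT -> ACU
# AGC -> GCU
# CTG -> CAG
# CTA -> UAG
# TTG -> CAA
# TTA -> UAA
# GCG -> CGC
# GCA -> UGC
```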
In some embodiments the synthetic prokaryotic genome lacks genes encoding release factors (e.g. RF1) and/or the host cell lacks release factors (e.g. RF1) to increase the efficiency of incorporation of non-proteinogenic amino acids.
Method for Producing a Synthetic Genome
In one aspect, the invention provides a method for producing a synthetic genome comprising:
- (a) providing a parent genome;
- (b) carrying out one or more rounds of recombination-mediated genetic engineering on the parent genome, to produce two or more different partially synthetic genomes; and
- (c) carrying out one or more rounds of directed conjugation with the two or more different partially synthetic genomes to produce a synthetic genome.
Recombination-Mediated Genetic Engineering
Preferably one or more rounds of recombination-mediated genetic engineering are used to edit 10-1000 kb, 50-1000 kb, 100-1000 kb, or 100-500 kb of the parent genome to provide two or more different partially synthetic genomes. Thus, in preferred embodiments each round of recombination-mediated genetic engineering inserts or replaces 10 kb or more, 50 kb or more, 100 kb or more, or about 100 kb of DNA in the parent genome.
As used herein, the term “recombination-mediated genetic engineering” (also known as “recombineering”) is a method for genetic engineering (i.e. editing genomes) based on homologous recombination systems. Typically recombineering is based on homologous recombination in Escherichia coli mediated by bacteriophage proteins, either RecE/RecT from Rac prophage or Redαβγ from bacteriophage lambda. Any suitable method of recombination-mediated genetic engineering may be used. Methods for recombination-mediated genetic engineering will be well known to those of skill in the art.
In “classical recombination” (exemplified by lambda red mediated recombination in E. coli), short regions of synthetic DNA may be inserted into the genome or used to replace genomic DNA in a two-step process: i) transformation of cells with linear double stranded DNA (dsDNA) carrying a stretch of synthetic DNA, coupled with a positive selection marker, and flanked by a homology region (HR) to the target region of the genome on each end, and ii) recombination mediated by the homologous regions, followed by selection for genomic integration by virtue of the positive selection marker. This approach can be used to insert or replace 2-3 kb of genomic DNA. Thus, if classical recombination is used, many rounds of recombination-mediated genetic engineering would be required to edit 100-500 kb of the parent genome.
Thus, in preferred embodiments the one or more rounds of recombination-mediated genetic engineering comprise one or more rounds of replicon excision for enhanced genome engineering through programmed recombination (REXER).
REXER is described in WO 2018/020248 (herein incorporated by reference). Each round of REXER may be used to insert or replace about 50 kb to 250 kb, or about 100 kb of DNA in the parent genome.
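The difference in scale can be made concrete with simple arithmetic, using the figures given above (2-3 kb per round of classical recombination, about 100 kb per round of REXER). The sketch assumes each round edits a fixed, non-overlapping length of DNA, which is a simplification; the function name is hypothetical.

```python
import math

def rounds_needed(region_kb: float, per_round_kb: float) -> int:
    """Minimum number of rounds to cover a region at a fixed size per round."""
    return math.ceil(region_kb / per_round_kb)

# Classical recombination, 2-3 kb per round.
print(rounds_needed(100, 3))    # 34
print(rounds_needed(500, 2))    # 250

# REXER, about 100 kb per round.
print(rounds_needed(100, 100))  # 1
print(rounds_needed(500, 100))  # 5
```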
Thus, the one or more rounds of recombination-mediated genetic engineering may comprise:
- i) providing a host cell (e.g. E. coli), wherein the host cell comprises an episomal replicon (e.g. a plasmid or a bacterial artificial chromosome) and a target nucleic acid (e.g. the genome), wherein the episomal replicon comprises a donor nucleic acid sequence (i.e. a synthetic region), wherein the donor nucleic acid sequence comprises in order: 5′—homologous recombination sequence 1—sequence of interest—homologous recombination sequence 2-3′, wherein the sequence of interest comprises a positive selectable marker, and wherein the target nucleic acid comprises in order: 5′—homologous recombination sequence 1—negative selectable marker—homologous recombination sequence 2-3′;
- ii) providing helper protein(s) capable of supporting nucleic acid recombination in said host cell (e.g. lambda Red proteins);
- iii) providing helper protein(s) and/or RNAs capable of supporting nucleic acid excision in said host cell (e.g. CRISPR/Cas9 proteins/RNAs);
- iv) inducing excision of said donor nucleic acid sequence;
- v) incubating to allow recombination between the excised donor nucleic acid and said target nucleic acid; and
- vi) selecting for recombinants having incorporated said donor nucleic acid into said target nucleic acid.
Suitably selecting for recombinants having incorporated said donor nucleic acid into said target nucleic acid comprises selection for gain of the positive selectable marker of the donor nucleic acid and loss of the negative selectable marker of the target nucleic acid. Suitably selection for gain of the positive selectable marker of the donor nucleic acid and loss of the negative selectable marker of the target nucleic acid is carried out simultaneously. Suitably said sequence of interest comprises both a positive selectable marker and a negative selectable marker. Suitably the negative selectable marker is selected from the group consisting of sacB (sucrose sensitivity), rpsL (S12 ribosomal protein, streptomycin sensitivity), or pheST251A_A294G (4-chlorophenylalanine sensitivity). Suitably the positive selectable marker is selected from the group consisting of CmR (chloramphenicol resistance), KanR (kanamycin resistance), HygR (hygromycin resistance), GentamycinR (gentamycin resistance), or tetracyclineR (tetracycline resistance). Suitably the step of selecting for recombinants comprises sequential selection for said positive and negative markers, or sequential selection for said negative and positive markers. Suitably the step of selecting for recombinants comprises simultaneous selection for said positive and negative markers.
Suitably said method as described above further comprises the step of inducing at least one double stranded break in the target nucleic acid sequence, wherein said double stranded break is between said homologous recombination sequence 1 and said homologous recombination sequence 2. Suitably at least two double stranded breaks are induced in the target nucleic acid sequence, wherein each said double stranded break is between said homologous recombination sequence 1 and said homologous recombination sequence 2.
Suitably said excised donor nucleic acid begins with said homologous recombination sequence 1 and ends with said homologous recombination sequence 2.
Suitably said episomal replicon comprises a negative selectable marker independent of the donor nucleic acid sequence. Suitably said method comprises the further step of selecting for loss of the episomal replicon by selecting for loss of said negative selectable marker independent of the donor nucleic acid sequence. Suitably said episomal replicon comprises in order: excision cut site 1—donor nucleic acid sequence—excision cut site 2. Suitably said target nucleic acid possesses its own origin of replication capable of functioning within said host cell. Suitably said episomal replicon is a plasmid nucleic acid. Suitably said episomal replicon is a bacterial artificial chromosome (BAC). Suitably said target nucleic acid is the host cell genome.
The episomal replicon (e.g. BAC) may be assembled by homologous recombination, for example in S. cerevisiae, as described in Kouprina, N., et al., 2004. Methods Mol Biol 255, 69-89. The assembly may combine: 7-14 stretches of synthetic DNA, each 6-13 kb in length; a selection construct (comprising a negative selection marker and/or a positive selection marker); and a BAC shuttle vector backbone. The stretches of synthetic DNA may collectively correspond to the donor nucleic acid sequence (i.e. the synthetic region) in the episomal replicon, wherein each stretch comprises 80-200 bp of overlapping DNA sequence with each other, and wherein the overlap regions are free of any recoding targets. The stretches may be supplied in pSC101 or pST vectors flanked by suitable restriction sites (e.g. BsaI, AvrII, SpeI, or XbaI). Thus, during assembly the synthetic DNA stretches may be excised by digestion with the corresponding restriction enzymes. Assembly of the episomal replicon may be verified by sequencing.
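As a non-limiting sketch, a proposed set of synthetic DNA stretches could be checked against the ranges given above before assembly. The example assumes each stretch is described by start and end coordinates (in bp, end exclusive) on the donor sequence, listed in order, and that the positions of the recoding targets are known; the function name, coordinate convention and example values are hypothetical.

```python
def check_assembly_plan(fragments, target_positions):
    """Return a list of problems found in a fragment plan (empty if none).

    fragments:        ordered list of (start, end) coordinates in bp,
                      end exclusive, on the donor sequence.
    target_positions: bp coordinates of recoding targets.
    """
    problems = []
    if not 7 <= len(fragments) <= 14:
        problems.append(f"{len(fragments)} stretches (expected 7-14)")
    for i, (start, end) in enumerate(fragments):
        if not 6_000 <= end - start <= 13_000:
            problems.append(f"stretch {i} is {end - start} bp (expected 6-13 kb)")
    for i in range(len(fragments) - 1):
        ov_start, ov_end = fragments[i + 1][0], fragments[i][1]
        overlap = ov_end - ov_start
        if not 80 <= overlap <= 200:
            problems.append(f"overlap {i}/{i + 1} is {overlap} bp (expected 80-200 bp)")
        elif any(ov_start <= p < ov_end for p in target_positions):
            problems.append(f"overlap {i}/{i + 1} contains a recoding target")
    return problems

# Hypothetical plan: ten stretches of 10,100 bp with 100 bp overlaps.
plan = [(i * 10_000, i * 10_000 + 10_100) for i in range(10)]
print(check_assembly_plan(plan, target_positions=[5_000]))   # []
print(check_assembly_plan(plan, target_positions=[10_050]))
# ['overlap 0/1 contains a recoding target']
```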
Suitably the two homology regions may be 30-100 bp, or 40-50 bp, or about 50 bp in length.
CRISPR/Cas9 machinery may be used for excision. In some embodiments the CRISPR/Cas9 machinery comprises Cas9, tracrRNA and two spacer RNAs, wherein the spacer RNAs target the two homology regions for excision. In preferred embodiments, the spacer RNAs are linear double stranded spacers. In other embodiments, the CRISPR/Cas9 machinery comprises Cas9 and two sgRNAs, wherein the sgRNAs target the two homology regions for excision.
Lambda red recombination machinery may be used for recombination. The lambda red recombination machinery may comprise lambda alpha/beta/gamma.
The method may comprise performing one or more rounds of REXER, i.e. the steps as described above with a first donor nucleic acid sequence, choosing further donor sequence(s) contiguous with said first donor nucleic acid sequence, and repeating said steps with said further donor nucleic acid sequence(s) until the partially synthetic genome has been assembled. This is known as genome stepwise interchange synthesis (GENESIS), described in Wang, K. et al., 2016. Nature 539, 59-64 and is shown schematically in
In preferred embodiments the donor sequence(s) correspond to regions of the synthetic genome according to the present invention and/or to polynucleotides according to the present invention.
Thus, the donor sequence(s) (i.e. synthetic region) may comprise 20 or fewer occurrences of one or more sense codons; and/or the donor sequence(s) may comprise 10 or more, 20 or more, or 100 or more genes with no occurrences of one or more sense codons.
The donor sequence(s) (i.e. synthetic region) may be identical to sequences (i.e. non-synthetic regions) of the parent genome except that they have 50 or fewer, 20 or fewer, 10 or fewer, 5 or fewer, or 0 occurrences of each of one or more sense codons; and/or comprise less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of each of one or more sense codons, relative to the corresponding region in the parent genome; and/or comprise 10 or more, 20 or more, or 100 or more genes with no occurrences of one or more sense codons.
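By way of illustration, a donor coding sequence that is identical to the parent except for the absence of the target codons can be produced by in-frame synonymous replacement and then checked for an unchanged protein sequence. The sketch below assumes, for the purposes of the example only, that TCG is replaced by AGC and TCA by AGT (both serine codons in the standard genetic code); the toy sequence is hypothetical, and the sketch does not handle overlapping genes, which are addressed by refactoring.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built in TCAG order (DNA codons).
STANDARD_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codons(cds: str):
    """Split a coding sequence into in-frame codons."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def translate(cds: str) -> str:
    return "".join(STANDARD_CODE[c] for c in codons(cds))

def recode(cds: str, replacements: dict) -> str:
    """Replace target codons in frame, leaving all other codons unchanged."""
    return "".join(replacements.get(c, c) for c in codons(cds))

# Assumed replacement scheme for illustration.
SCHEME = {"TCG": "AGC", "TCA": "AGT"}

parent = "ATGTCGAAATCATAA"  # hypothetical parent CDS
donor = recode(parent, SCHEME)
print(donor)                                   # ATGAGCAAAAGTTAA
print(translate(parent) == translate(donor))   # True
print(sum(c in SCHEME for c in codons(donor))) # 0
```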
The donor sequence(s) (i.e. synthetic region) may also be refactored relative to the sequences (i.e. non-synthetic regions) of the parent genome. For 3′,3′ overlaps (i.e. pairs of genes in opposite orientations) a synthetic insert may be inserted between the genes. For 3′,3′ overlaps the synthetic insert may comprise the overlapping region. For 5′, 3′ overlaps (i.e. pairs of genes in the same orientation) a synthetic insert may be inserted between the genes. For 5′, 3′ overlaps the synthetic insert may comprise: (i) a stop codon; (ii) about 20-200 bp, or 20-100 bp, or 20-50 bp, from upstream of the overlapping region; and (iii) the overlapping region. Preferably, the synthetic insert comprises: (i) a stop codon; (ii) about 20 bp from upstream of the overlapping region; and (iii) the overlapping region. In preferred embodiments the stop codon is in frame with the original start site for the downstream gene. Preferably the stop codon is TAA.
Preferably the donor sequence(s) (i.e. synthetic region) are collectively 50-10000 kb, 100-5000 kb, 100-2000 kb, 100-1000 kb, or 100-500 kb in size. Preferably each donor sequence is 50-300 kb, 100-200 kb, or about 100 kb in size.
Accordingly, the donor sequences may each be about 100 kb in size and identical to corresponding sequences of the parent genome, except they comprise no occurrences of one or more sense codons and all pairs of genes which share an overlapping region comprising the one or more sense codons in the parent genome are refactored, wherein the pairs of genes are those in which sense codon replacements would change the encoded protein sequence of both or either of the pair of genes.
In preferred embodiments the viability of the genome is tested after each round of recombination-mediated genetic engineering. In some embodiments the sequence of the genome is verified after each round of recombination-mediated genetic engineering.
Partially Synthetic Genomes
The present invention provides two or more different partially synthetic genomes.
As used herein a “partially synthetic genome” is a genome in which one or more contiguous regions of the parent genome have been edited (i.e. the partially synthetic genomes comprise one or more synthetic regions), wherein one or more contiguous (synthetic) regions do not cover the whole of the parent genome. Preferably, the partially synthetic genomes of the present invention have one contiguous (synthetic) region. In contrast, a “synthetic genome” may comprise genome edits which cover substantially all of the parent genome.
The partially synthetic genomes of the present invention may be prokaryotic genomes. Preferably, the partially synthetic genomes of the present invention are bacterial genomes. More preferably, the partially synthetic genomes of the present invention are Escherichia coli, Salmonella enterica, or Shigella dysenteriae genomes. Most preferably, the partially synthetic genomes of the present invention are E. coli genomes. In some embodiments the partially synthetic genomes are reduced or minimal partially synthetic genomes. In preferred embodiments, the partially synthetic genomes are viable genomes.
In some embodiments the partially synthetic genomes of the present invention are 100 kb to 20 Mb, or 130 kb to 15 Mb, or 200 kb to 15 Mb, or 300 kb to 15 Mb, or 500 kb to 15 Mb, or 1 Mb to 15 Mb, or 1 Mb to 10 Mb, or 1 Mb to 8 Mb, or 1 Mb to 6 Mb, or 2 Mb to 6 Mb, or 2 Mb to 5 Mb, or 3 Mb to 5 Mb, or about 4 Mb in size.
The partially synthetic genomes may comprise a synthetic region that has 50 or fewer, 20 or fewer, 10 or fewer, 5 or fewer, or 0 occurrences of each of one or more sense codons; or the partially synthetic genomes may comprise a synthetic region that has less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of each of one or more sense codons, relative to the corresponding region in the parent genome.
Preferably, the synthetic regions are 50-10000 kb, 100-5000 kb, or 100-500 kb in size.
Thus, the partially synthetic genomes may comprise one or more contiguous regions of 100-5000 kb that have 10 or fewer, 5 or fewer, or no occurrences of each of one or more sense codons; and/or the partially synthetic genomes may comprise one or more contiguous regions of 100-5000 kb that have less than 10%, 5%, 2%, 1%, 0.5%, 0.1% of the occurrences of each of one or more sense codons, relative to the corresponding region in the parent genome; and/or the partially synthetic genomes may comprise one or more contiguous regions of 100-5000 kb that have 10 or more, 20 or more, or 100 or more genes with no occurrences of one or more sense codons.
The remainder of the partially synthetic genome (i.e. the non-synthetic region(s)) may have un-altered sense codons. Thus, the partially synthetic genomes may comprise one or more non-synthetic region(s) that have 100% or 99% of the occurrences of each sense codon, relative to the corresponding region in the parent genome; and/or the partially synthetic genomes may comprise one or more non-synthetic region(s) that have 100 or more genes with occurrences of each sense codon. The non-synthetic regions may be 500 kb to 20 Mb, or 500 kb to 10 Mb, or 500 kb to 5 Mb, or about 3.5 Mb in size.
For example, the partially synthetic genomes may comprise one contiguous region (i.e. a synthetic region) of 100-5000 kb that has 10 or more, 20 or more, or 100 or more genes with no occurrences of one or more sense codons and one contiguous region of 500 kb-10000 kb (i.e. a non-synthetic region) that has 100 or more genes with occurrences of each sense codon.
The two or more different partially synthetic genomes may be derived from the same parent genome, i.e. comprise substantially the same sequences, e.g. the two or more different partially synthetic genomes may share 90%, 95%, 99%, or 99.5% sequence identity.
The two or more different partially synthetic genomes may comprise one or more synthetic regions, such that the synthetic regions collectively cover 90% or greater, 95% or greater, 99% or greater or 100% of the parent genome. Preferably, the two or more different partially synthetic genomes each comprise one or more synthetic regions, wherein the synthetic regions do not substantially overlap (e.g. the overlap between synthetic regions is 10 kb or less, preferably about 3-4 kb). Thus, the two or more different partially synthetic genomes may each comprise one unique or substantially unique synthetic region.
Thus, in preferred embodiments the two or more different partially synthetic genomes each comprise one contiguous synthetic region of 100-5000 kb that has 10 or more, 20 or more, or 100 or more genes with no occurrences of one or more sense codons and one non-synthetic contiguous region of 500 kb-10000 kb that has 100 or more genes with occurrences of each sense codon; wherein the synthetic regions collectively cover substantially all of the parent genome and wherein the synthetic regions do not substantially overlap.
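By way of non-limiting illustration, the collective coverage of the parent genome by the synthetic regions, and the largest overlap between them, may be computed as sketched below. The sketch assumes that each synthetic region is given as a (start, end) pair in kb on a linearised coordinate system; the circularity of the genome is ignored and the function name `coverage_and_overlap` is hypothetical.

```python
from typing import List, Tuple


def coverage_and_overlap(
    regions: List[Tuple[int, int]], genome_len: int
) -> Tuple[float, int]:
    """Return (fraction of genome covered, largest overlap with earlier regions)."""
    covered = 0
    max_overlap = 0
    reach = 0  # right-most coordinate covered so far
    for start, end in sorted(regions):
        if start < reach:
            # Part of this region is already covered by an earlier region.
            max_overlap = max(max_overlap, min(end, reach) - start)
        covered += max(0, end - max(start, reach))
        reach = max(reach, end)
    return covered / genome_len, max_overlap


# Toy example: three regions with 3 kb overlaps covering a 4000 kb genome.
example_regions = [(0, 503), (500, 1003), (1000, 4000)]
example_coverage, example_overlap = coverage_and_overlap(example_regions, 4000)
```

In this toy example the regions cover the whole genome with overlaps of 3 kb, which would fall within the preferred range of about 3-4 kb stated above.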
The two or more different partially synthetic genomes may be suitable for directed conjugation. Thus, in preferred embodiments the two or more different partially synthetic genomes comprise at least one partially synthetic donor genome and at least one partially synthetic recipient genome. The method of the invention may comprise a further step of one or more rounds of recombination-mediated genetic engineering, preferably lambda red mediated genetic engineering (prior to directed conjugation) to provide at least one partially synthetic donor genome and at least one partially synthetic recipient genome. The method may further comprise one or more rounds of selection for the at least one partially synthetic donor genome and at least one partially synthetic recipient genome.
The at least one partially synthetic donor genome may comprise a synthetic region and a first selectable marker flanked by two homology regions immediately downstream of an origin of transfer; and the at least one partially synthetic recipient genome may comprise a second selectable marker flanked by two corresponding homology regions, optionally wherein the first selectable marker comprises a positive selectable marker, and/or the second selectable marker comprises a negative selectable marker.
Suitably the negative selectable marker is selected from the group consisting of sacB (sucrose sensitivity), rpsL (S12 ribosomal protein—streptomycin sensitivity), or pheST251A_A294G (4-chlorophenylalanine sensitivity). Suitably the positive selectable marker is selected from the group consisting of CmR (chloramphenicol resistance), KanR (kanamycin resistance), HygR (hygromycin resistance), GentamycinR (gentamycin resistance), or tetracyclineR (tetracycline resistance). The selectable markers may be different to those in the one or more steps of recombination-mediated genetic engineering.
Preferably the synthetic region present in the at least one partially synthetic recipient genomes is outside the region flanked by the homology regions, i.e. the synthetic regions do not substantially overlap. Preferably the homology regions are 3 kb to 500 kb in length, most preferably about 3-5 kb.
Directed Conjugation
One or more rounds of directed conjugation may be carried out on the two or more different partially synthetic genomes of the present invention to produce a synthetic genome.
Each round of directed conjugation may be used to provide partially synthetic genomes with larger contiguous synthetic regions. For example, after the one or more rounds of recombination-mediated genetic engineering there may be 8 partially synthetic genomes, each with a contiguous synthetic region of about 500 kb. After a first round of directed conjugation, two of the partially synthetic genomes may be combined to provide 6 partially synthetic genomes, each with a contiguous synthetic region of about 500 kb, and 1 partially synthetic genome with a contiguous synthetic region of about 1 Mb. A second round may provide either 5 partially synthetic genomes, each with a contiguous synthetic region of about 500 kb, and 1 partially synthetic genome with a contiguous synthetic region of about 1.5 Mb; or 4 partially synthetic genomes, each with a contiguous synthetic region of about 500 kb, and 2 partially synthetic genomes, each with a contiguous synthetic region of about 1 Mb. After several rounds of directed conjugation a completely synthetic genome (i.e. one with a contiguous synthetic region of about 4 Mb) may be provided. An example is shown schematically in
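By way of non-limiting illustration, the arithmetic of the worked example above may be reproduced with the following sketch. It assumes that each round combines exactly two partially synthetic genomes and that the size of the combined synthetic region is the sum of the two (homology overlaps are ignored); the function name `conjugate` is hypothetical.

```python
from typing import List


def conjugate(regions_kb: List[int], donor: int, recipient: int) -> List[int]:
    """Combine the synthetic regions of two strains into one larger region."""
    if donor == recipient:
        raise ValueError("donor and recipient must be different strains")
    merged = regions_kb[donor] + regions_kb[recipient]
    rest = [
        size for k, size in enumerate(regions_kb) if k not in (donor, recipient)
    ]
    return sorted(rest + [merged])


start = [500] * 8                      # 8 strains, about 500 kb each
round_1 = conjugate(start, 0, 1)       # 6 x 500 kb and 1 x 1 Mb
round_2a = conjugate(round_1, 0, 6)    # 5 x 500 kb and 1 x 1.5 Mb
round_2b = conjugate(round_1, 0, 1)    # 4 x 500 kb and 2 x 1 Mb

# Keep combining pairs until a single synthetic region remains.
regions, rounds = start, 0
while len(regions) > 1:
    regions = conjugate(regions, 0, 1)
    rounds += 1
```

Under these assumptions, eight starting strains require seven pairwise conjugations in total, irrespective of the route taken.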
Any suitable method of directed conjugation may be used. Methods of directed conjugation will be well known to those of skill in the art, for instance as described by Ma, N. J., Moonan, D. W. and Isaacs, F. J., 2014. Nature Protocols, 9(10), p. 2285. The route to the synthetic genome is not limited.
Thus, the one or more rounds of directed conjugation may comprise:
- i) providing a first host cell comprising a partially synthetic recipient genome, and a second host cell comprising a partially synthetic donor genome and a conjugative plasmid;
- ii) a step of conjugation of the partially synthetic recipient genome and partially synthetic donor genome; and
- iii) selecting for recombinants having incorporated the synthetic region of the donor genome into the partially synthetic recipient genome.
The partially synthetic donor genome may comprise a synthetic region and a first selectable marker flanked by two homology regions immediately downstream of an origin of transfer; and the partially synthetic recipient genomes may comprise a second selectable marker flanked by two corresponding homology regions, optionally wherein the first selectable marker comprises a positive selectable marker, and/or the second selectable marker comprises a negative selectable marker. Thus, step (iii) may comprise selection for said selectable markers, i.e. selection for gain of the first selectable marker and loss of the second selectable marker.
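By way of non-limiting illustration, the selection logic of step (iii), i.e. gain of the first selectable marker and loss of the second selectable marker, may be expressed as the following sketch. The clone names and marker assignments are invented for the example, and the function name `select_recombinants` is hypothetical.

```python
from typing import Dict, List, Set


def select_recombinants(
    clones: Dict[str, Set[str]], positive: str, negative: str
) -> List[str]:
    """Keep clones that gained the positive marker and lost the negative one."""
    return [
        name
        for name, markers in clones.items()
        if positive in markers and negative not in markers
    ]


# Hypothetical post-conjugation clones and the markers each one carries.
clones = {
    "clone_1": {"CmR"},            # desired recombinant
    "clone_2": {"CmR", "sacB"},    # gained donor marker, kept recipient marker
    "clone_3": {"sacB"},           # unmodified recipient
    "clone_4": set(),              # lost recipient marker without donor marker
}
recombinants = select_recombinants(clones, positive="CmR", negative="sacB")
```

In practice both conditions are imposed by growth on selective medium rather than by inspection; the sketch only makes the two criteria explicit.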
Suitably the negative selectable marker is selected from the group consisting of sacB (sucrose sensitivity), rpsL (S12 ribosomal protein—streptomycin sensitivity), or pheST251A_A294G (4-chlorophenylalanine sensitivity). Suitably the positive selectable marker is selected from the group consisting of CmR (chloramphenicol resistance), KanR (kanamycin resistance), HygR (hygromycin resistance), GentamycinR (gentamycin resistance), or tetracyclineR (tetracycline resistance). The selectable markers may be different to those in the one or more steps of recombination-mediated genetic engineering.
Preferably the homology regions are 3 kb to 500 kb in length, most preferably about 3-5 kb. Preferably, the homology regions are 50 kb to 500 kb when the step of directed conjugation is the final step of directed conjugation.
Step (ii) may comprise incubating the first host cell and the second host cell. For example, the first host cell and the second host cell may be mixed, transferred onto a suitable medium (e.g. agar plates) and incubated at about 37° C. for about 1-3 hours.
The conjugative plasmid may be an F plasmid, preferably wherein the conjugative plasmid does not comprise an origin of transfer. (e.g.
In preferred embodiments the viability of the genome is tested after each round of directed conjugation. Advantageously, this verifies that the genome edits (e.g. sense codon replacements) result in a viable genome, and allows for non-permitted edits to be corrected.
In some embodiments the sequence of the genome is verified after each round of directed conjugation.
The skilled person will understand that they can combine all features of the invention disclosed herein without departing from the scope of the invention as disclosed.
Preferred features and embodiments of the invention will now be described by way of non-limiting examples.
The practice of the present invention will employ, unless otherwise indicated, conventional techniques of chemistry, biochemistry, molecular biology, microbiology and immunology, which are within the capabilities of a person of ordinary skill in the art. Such techniques are explained in the literature. See, for example, Sambrook, J., Fritsch, E. F. and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual, 2nd Edition, Cold Spring Harbor Laboratory Press; Ausubel, F. M. et al. (1995 and periodic supplements) Current Protocols in Molecular Biology, Ch. 9, 13 and 16, John Wiley & Sons; Roe, B., Crabtree, J. and Kahn, A. (1996) DNA Isolation and Sequencing: Essential Techniques, John Wiley & Sons; Polak, J. M. and McGee, J. O'D. (1990) In Situ Hybridization: Principles and Practice, Oxford University Press; Gait, M. J. (1984) Oligonucleotide Synthesis: A Practical Approach, IRL Press; and Lilley, D. M. and Dahlberg, J. E. (1992) Methods in Enzymology: DNA Structures Part A: Synthesis and Physical Analysis of DNA, Academic Press.
EXAMPLES
Example 1—Design of a Genome with Synonymous Codon Compression
We first designed a version of the E. coli MDS42 genome (Uniprot accession number AP012306.1) in which the serine codons TCG and TCA and the stop codon TAG in open reading frames (ORFs) are systematically replaced by their synonyms AGC, AGT, and TAA, respectively (
E. coli contains numerous overlapping open reading frames (ORFs), and we classify the overlaps as 3′, 3′ (between ORFs in opposite orientations) or 5′, 3′ (between ORFs in the same orientation). Targeted codons are found within both classes of overlap. If the recoding of each ORF within a 3′, 3′ overlap could be achieved without changing the encoded protein sequence of either ORF (i.e. by introducing synonymous codon(s)), then the overlap structure was maintained and the sequences were directly recoded. However, when this was not possible we duplicated the overlapping region, and individually recoded each ORF (
For 5′, 3′ overlaps we separated the ORFs by duplicating both the region of overlap between the ORFs and the 20 bp sequence upstream of the overlap. This refactoring allows us to recode each ORF independently (
Using the defined rules for synonymous codon compression and refactoring we designed a genome in which all 18,218 target codons are recoded to their target synonyms (
We performed a retrosynthesis, analogous to that commonly used for designing synthetic routes to small molecules, on the designed genome (
We assembled BACs for REXER (
We initiated genome replacement in seven distinct strains, via REXER. The start point for REXER in each strain corresponds to the beginning of sections A, C, D, E, F, G or H (
In each strain, the positive and negative selection markers that are introduced in the first REXER provide a template for the next round of REXER, enabling genome stepwise interchange synthesis (GENESIS) (
Following each REXER we sequenced the resulting genomes to identify cells that were fully recoded over the targeted region of the genome (Table 4). In parallel, we carried out a large number of single step REXERs (Table 4) to rapidly identify 100 kb regions of the genome that may be challenging to recode, before we arrived at them through GENESIS. For 35 of the 38 steps, including all of sections A, C, D, E, F and G, we were able to completely recode the targeted genomic sequence by GENESIS. We only observed incomplete replacement of the corresponding genomic region by synthetic DNA for fragment 9, in section B, and for fragments 37a and 1, in section H, (Table 4).
Sequencing several clones following REXER allows us to score the frequency with which each target codon is recoded and thereby compile a recoding landscape for the genomic region. From the recoding landscape with fragment 1 we directly identified the fourth codon (Ser4, TCA) in map, an essential gene encoding methionine aminopeptidase, as recalcitrant to recoding by our defined scheme (
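By way of non-limiting illustration, a recoding landscape of the kind described above may be compiled as sketched below. The sketch assumes that the sequence of each clone has already been aligned to the designed sequence so that all strings have the same coordinates, and that target codons are identified by their 0-based start positions; the function name `recoding_landscape` and the toy sequences are hypothetical.

```python
from typing import Dict, List, Sequence


def recoding_landscape(
    designed: str, clones: Sequence[str], target_positions: Sequence[int]
) -> Dict[int, float]:
    """Fraction of clones carrying the designed codon at each target position."""
    if not clones:
        raise ValueError("at least one sequenced clone is required")
    landscape = {}
    for pos in target_positions:
        designed_codon = designed[pos:pos + 3]
        hits = sum(1 for clone in clones if clone[pos:pos + 3] == designed_codon)
        landscape[pos] = hits / len(clones)
    return landscape


# Toy example: the codon at position 6 is never recoded in any clone.
designed_seq = "ATGAGCAGTTAA"
clone_seqs = ["ATGAGCTCATAA", "ATGAGCTCATAA"]
landscape = recoding_landscape(designed_seq, clone_seqs, [3, 6])
recalcitrant: List[int] = [pos for pos, freq in landscape.items() if freq == 0.0]
```

Positions with a frequency of zero across many clones are candidates for codons that are recalcitrant to the recoding scheme.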
From the post-REXER recoding landscape for fragment 9 we identified a 26 kb genomic region that was never recoded (
yceQ and yaaY both encode ‘predicted proteins’, multiple insertions in yceQ are viable, and there is no evidence of mRNA production and/or protein synthesis from these predicted genes (Pundir, S., et al., 2017. Methods Mol Biol 1558, 41-55). Notably, the codons that are recalcitrant to recoding within yceQ and yaaY all lie within the 5′ untranslated regions (UTRs) of essential genes. We suggest that the sequence changes introduced by recoding yceQ and yaaY negatively affect the regulation of the adjacent essential genes. Indeed, the target codons in yceQ map to RNA secondary structures and promoter elements within the 5′UTR of rne (encoding the essential ribonuclease RNase E) (
We fixed fragment 9 by introducing a stop codon into the 5′ sequence of yceQ; this minimizes any potential translation but retains the native sequence for regulating rne transcription (
Having pinpointed and fixed all the initially problematic sequences we completed the assembly of a strain in which sections A and B are fully recoded (
We developed a conjugation-based strategy (Isaacs, F. J. et al., 2011. Science 333, 348-353; Ma, N. J., et al., 2014. Nat Protoc 9, 2285-2300; and Lederberg, J. & Tatum, E. L., 1946. Nature 158, 558) to assemble the recoded sections into a single genome. Our strategy assembles the recoded genome in a clockwise manner by conjugating recoded ‘donor’ sections, containing the origin of transfer (oriT), into their adjacent recoded ‘recipient’ sections, that have been extended to provide homology to the donor (
We initiated conjugation by mixing donor and recipient cells, and varied the time and conditions of conjugation to control the extent of genome transfer from the donor to the recipient. Following conjugation between the donor and the recipient cells, we selected for recipient cells, and then for those recipients that had gained the positive marker at the end of the recoded sequence from the donor, and lost the negative marker at the end of the extension in the recipient (
We performed a convergent synthesis of a genome recoded through sections A-E (
To create a completely recoded genome we first created a recipient strain by introducing 37a and 37b into A-G to create A-G-37ab (providing a 115 kb homology region with the final donor). We created the final donor strain by conjugation between strain H and strain AB, which yielded strain H-A-09, in which H, A and fragment 9 from section B are recoded (
Syn61 doubled only 1.6 times slower than MDS42 in LB plus glucose at 37° C., and this ratio increased at 25° C., and decreased at 42° C. (
We have created E. coli in which we have replaced the entire 4 Mb genome with synthetic DNA; the scale of genomic replacement in our experiments is approximately 4 times larger than previously reported for genome replacement in mycoplasma or chromosome replacement in a single strain of S. cerevisiae (
We have demonstrated the genome-wide removal of all 1.8×10^4 known target codons (two sense codons, TCG and TCA, and the amber codon, TAG) in a single strain of E. coli. Our work removes 60 times more codons than experiments removing the amber stop codons by site-directed mutagenesis (
Our synthetic genome contains only 2×10^-4 non-programmed mutations per target codon (
Our final synthetic genome was recoded using defined refactoring and recoding schemes; using a recoding rule we previously determined on just 83 (0.43%) of the target codons in the genome (Wang, K. et al. 2016. Nature 539, 59-64). The recoding rule worked at 99.9% of the 1.8×10^4 target codons in the genome, while the refactoring rules worked at 99% of overlaps.
Corrections to our initial recoding scheme were necessary at just seven of the 1.8×10^4 target codons in the whole genome. While one of these codons was in an essential gene, the other six were within the 5′ UTRs of essential genes. Thus, all but one of the changes to our defined recoding scheme correct for unintended alterations to the 5′ UTRs of essential genes, rather than for direct effects of altered synonyms on translation.
The strategies we have developed for disconnecting a designed genome into sections, fragments, and stretches, and realizing the design through the convergent, seamless and robust integration of REXER, GENESIS and directed conjugation, provide a blueprint for future genome syntheses. In future work we will further characterize the consequences of synonymous codon compression in E. coli Syn61, and test additional recoding schemes in E. coli and other organisms. In addition we will test sense codon reassignment for non-canonical biopolymer synthesis.
Example 7—Methods
Recoded Genome Design
We based our synthetic genome design on the sequence of the E. coli MDS42 genome (accession number AP012306.1, released 7 Oct. 2016), which has 3547 annotated CDS. We manually curated the starting genome annotation to remove three CDS and add another twelve. The three predicted CDS removed were htgA, ybbV, and yzfA; there is no evidence that these sequences encode proteins (Pundir, S., et al., 2017. Methods Mol Biol 1558, 41-55), and these sequences completely or largely overlap with better characterised genes, which would make it difficult to recode them without disrupting their overlapping genes or creating large repetitive regions. Conversely, the pseudogenes ydeU, ygaY, pbl, yghX, yghY, agaW, yhiK, yhjQ, rph, ysdC, glvG, and cybC were promoted to CDS. To enable negative selection with rpsL, we mutated the genomic copy of rpsL to rpsLK43R. Finally, deep sequencing of our in-house MDS42 revealed a 51 bp insertion between mrcB and hemL which had not been reported in AP012306.1. We manually introduced and annotated this insertion in our starting genome sequence.
We produced a custom Python script that i) identifies and recodes all target codons, and ii) identifies and resolves overlapping gene sequences that contain target codons. From our curated MDS42 starting sequence, we used the script to generate a new synthetic genome in which all TCG, TCA and TAG codons were replaced with AGC, AGT and TAA respectively. The script reported 91 CDS with overlaps containing target codons. In 33 instances, genes were overlapping tail-to-tail (3′, 3′) (Table 1); 12 of these could be recoded by introducing a silent mutation in the overlapping gene, while the remaining 21 were duplicated to separate the genes (
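By way of non-limiting illustration, the codon replacement step may be sketched as follows. This is not the custom script referred to above; it is a minimal sketch, assuming that each coding sequence is supplied as an in-frame DNA string on its coding strand, and it does not attempt to identify or resolve overlapping genes. The names `RECODING` and `recode_orf` are hypothetical.

```python
# Target codons and their synonymous replacements, as stated in the text.
RECODING = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}


def recode_orf(orf: str) -> str:
    """Replace every in-frame target codon of an ORF with its synonym."""
    seq = orf.upper()
    if len(seq) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Codons that are not targets are passed through unchanged.
    return "".join(RECODING.get(codon, codon) for codon in codons)


# Toy example: ATG TCG TCA TAG becomes ATG AGC AGT TAA.
recoded = recode_orf("ATGTCGTCATAG")
# A TCG that is out of frame (ATG ATC GAA TAA) is left untouched.
unchanged = recode_orf("ATGATCGAATAA")
```

Because the replacement operates on codons read in frame, target triplets that merely span two adjacent codons are not altered.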
Retrosynthesis of Recoded Stretches
We divided the designed genome into 37 fragments of between 91 and 136 kb. We chose the boundary sequences that delimit these fragments so that: i) they consist of a 5′-NGG-3′ PAM to allow REXER4 to be used for integration if necessary, ii) the PAM does not sit within 50 bp of a target codon, iii) the PAM is in-between non-essential genes and iv) the PAM does not disturb any annotated features such as promoters. We called the regions ˜50-100 bp upstream and downstream of these boundaries ‘landing sites’, and these are annotated as Lxx, where xx is the number of the upstream fragment, e.g. L01 is the landing site between fragment 1 and 2. In our design, a landing site sequence is contained in the 3′ end of a fragment and the 5′ end of the next—as a result all 37 fragments contain overlapping homologies of 54-155 bp with their neighbouring fragment.
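By way of non-limiting illustration, criteria i) and ii) for the boundary sequences may be screened as sketched below. The sketch assumes that the positions of the target codons are known as 0-based start coordinates and measures distance between start coordinates, which is a simplification; criteria iii) and iv) depend on genome annotation and are not modelled. The function name `candidate_pams` is hypothetical.

```python
from typing import List, Sequence


def candidate_pams(
    seq: str, target_positions: Sequence[int], min_distance: int = 50
) -> List[int]:
    """Start positions of 5'-NGG-3' sites at least min_distance from targets."""
    seq = seq.upper()
    candidates = []
    for i in range(len(seq) - 2):
        # N may be any base; only the two following bases must be G.
        if seq[i + 1:i + 3] != "GG":
            continue
        # Distance is measured between start coordinates (simplification).
        if all(abs(i - t) >= min_distance for t in target_positions):
            candidates.append(i)
    return candidates


# Toy example: a single TGG site at position 100.
toy_seq = "A" * 100 + "TGG" + "A" * 100
far_target = candidate_pams(toy_seq, [10])    # target codon 90 bp away
near_target = candidate_pams(toy_seq, [80])   # target codon 20 bp away
```

Sites returned by such a screen would still need to be checked against the annotation before being accepted as fragment boundaries.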
Each fragment was further broken down to 7-14 stretches of 4-15 kb. We designed the stretches so that they contain overlaps of 80-200 bp with each other, and the overlap regions were defined at intergenic regions free of any recoding targets. A total of 409 stretches were synthesised (GENEWIZ, USA) and supplied in pSC101 or pST vectors flanked by BsaI, AvrII, SpeI, or XbaI restriction sites. The synthetic stretches naturally did not contain at least one of these restriction sites.
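By way of non-limiting illustration, the division of a fragment into overlapping stretches may be sketched as follows. The sketch assumes a fixed stretch length and a fixed overlap; in the design described above the overlaps were instead placed at intergenic regions free of recoding targets, which is not modelled here. The function name `split_into_stretches` is hypothetical.

```python
from typing import List, Tuple


def split_into_stretches(
    length: int, stretch_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Tile [0, length) with half-open intervals that overlap their neighbours."""
    if length <= 0 or not 0 <= overlap < stretch_len:
        raise ValueError("need length > 0 and 0 <= overlap < stretch_len")
    intervals = []
    start = 0
    while True:
        end = min(start + stretch_len, length)
        intervals.append((start, end))
        if end == length:
            break
        # The next stretch begins inside the current one to create the overlap.
        start = end - overlap
    return intervals


# Toy example: a 100 kb fragment, 10 kb stretches, 100 bp overlaps.
stretches = split_into_stretches(100_000, 10_000, 100)
```

With these toy parameters the final stretch is shorter than the others; an actual design would adjust the boundaries so that every stretch falls within the intended size range.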
Construction of Selection Cassettes and Plasmids for REXER/GENESIS
The cloning procedures described in this section were performed in E. coli DH10b, which is resistant to streptomycin by virtue of an rpsLK43R mutation. The plasmid pKW20_CDFtet_pAraRedCas9_tracrRNA used throughout this study encodes Cas9 and the lambda-red recombination components alpha/beta/gamma under the control of an arabinose-inducible promoter, as well as a tracrRNA under its native promoter, as previously described (Wang, K. et al., 2016. Nature 539, 59-64).
The protospacers for REXER are encoded in the plasmid pKW1_MB1Amp_Spacer (
For each REXER step, a derivative of one of these three plasmids was constructed to harbour a protospacer/direct repeat array containing 2 (REXER2) or 4 (REXER4) protospacers, corresponding to the target sequences for cutting the BAC and genome. The different protospacer arrays were constructed from overlapping oligos through multiple rounds of PCR; the products were inserted by Gibson assembly between restriction sites AccI and EcoRI in the backbone of pKW1_MB1Amp_Spacer, pKW3_MB1Amp_TracrK_Spacer or pKW5_MB1Amp_TracrK_Cas9_Spacer. The protospacer arrays resulting from each assembly were verified to be mutation-free by Sanger sequencing.
The positive-negative selection cassettes used in REXER and GENESIS are −1/+1 (rpsL-KanR), −2/+2 (sacB-CmR) and −3/+3 (pheST251A_A294G-HygR). −1/+1 and −2/+2 are as previously described (Wang, K. et al., 2016. Nature 539, 59-64). In −3/+3, pheST251A_A294G is dominant lethal in the presence of 4-chlorophenylalanine, and HygR confers resistance to hygromycin. Both proteins are expressed polycistronically under control of the EM7 promoter. The −3/+3 cassette was synthesised de novo. The −3/+3 cassette is also referred to as pheS*/HygR.
Construction of E. coli Strains Containing Double Selection Cassettes at Genomic Landing Sites.
According to our design, each region of the genome that is targeted for replacement by a synthetic fragment is flanked by an upstream landing site and a downstream landing site; these genomic landing site sequences are the same as the landing site sequences described above. Initiation of REXER/GENESIS requires the insertion of a double selection cassette in the upstream genomic landing site. We inserted double selection cassettes at the landing sites through lambda-red mediated recombination. Briefly, either the sacB-CmR or the rpsL-KanR cassettes were PCR amplified with primers containing homology regions to the genomic landing sites of interest. For recombination experiments, we prepared electrocompetent cells as described previously (Wang, K. et al., 2016. Nature 539, 59-64) and electroporated 3 μg of the purified PCR product into 100 μL of MDS42rpsLK43R cells harbouring the pKW20_CDFtet_pAraRedCas9_tracrRNA plasmid expressing the lambda-red alpha/beta/gamma genes. The recombination machinery was induced, under control of the arabinose promoter (pAra), with L-arabinose added at 0.5% for 1 hour starting at OD600=0.2. Pre-induced cells were electroporated and then recovered for 1 hour at 37° C. in 4 mL of super optimal broth (SOB) medium. Cells were then diluted into 100 mL of LB medium with 10 μg/mL tetracycline and grown for 4 hours at 37° C., 200 rpm. The cells were subsequently spun down, resuspended in 4 mL of H2O, serially diluted, plated and incubated overnight at 37° C. on LB agar plates containing 10 μg/mL tetracycline, 18 μg/mL chloramphenicol (for sacB-CmR) or 50 μg/mL kanamycin (for rpsL-KanR).
BAC Assembly and Delivery
We constructed Bacterial Artificial Chromosome (BAC) shuttle vectors that contained 97-136 kb of synthetic DNA. On the 5′ side, the synthetic DNA was flanked by a region of homology to the genome (HR1), and a Cas9 cut site. On the 3′ side the synthetic DNA was flanked by a double selection cassette, a region of homology to the genome (HR2), and a second Cas9 cut site. The BAC also contained a negative selection marker, a BAC origin, a URA marker and YAC origin (CEN6 centromere fused to an autonomously replicating sequence (CEN/ARS)) (
BACs were assembled by homologous recombination in S. cerevisiae. Each assembly combined i) 7-14 stretches of synthetic DNA, each 6-13 kb in length, with ii) a selection construct (see below) and iii) a BAC shuttle vector backbone (
Synthetic DNA stretches were excised from their source vectors provided by GENEWIZ by digestion at BsaI, AvrII, SpeI, or XbaI restriction sites. In the case of AvrII, SpeI, and XbaI, restriction digests were followed by Mung Bean nuclease treatment to remove sticky ends.
Selection constructs contained a region of homology to the 3′ most stretch of the fragment, a double selection cassette (sacB-CmR or rpsL-KanR), a region of homology (HR2) to the targeted genomic locus, a negative selection marker (rpsL, sacB or pheS*-HygR) and YAC. For specific double selection cassettes, negative selection markers, and homology region sequences see
The episomal versions were designed so that restriction digestion with BsaI yielded a DNA fragment for BAC assembly.
The BAC backbone containing a BAC origin and a URA3 marker was amplified by PCR using a previously described BAC (Wang, K. et al., 2016. Nature 539, 59-64) as a template, and the PCR product used for BAC assembly. The primers used for these PCR assemblies are listed in
To assemble the stretches, selection construct, and BAC backbone, 30-50 fmol of each piece of DNA was transformed into S. cerevisiae spheroplasts; these were prepared as previously described (Kouprina, N., et al., 2004. Methods Mol Biol 255, 69-89). Following assembly we identified yeast clones potentially harbouring correctly assembled BACs by colony PCR at the junctions of overlapping fragments and vector-insert junctions. Clones that appeared correct by colony PCR were sequence verified by NGS after transformation into E. coli, as described below.
The assembled BACs were extracted from yeast with the Gentra Puregene Yeast/Bact. Kit (Qiagen) following the manufacturer's instructions. MDS42rpsLK43R cells were transformed with the assembled BAC by electroporation. Due to the large size of the BACs we sometimes observed inefficient electroporation into target cells. Consequently, we introduced an oriT-Apramycin cassette provided as a PCR product with 50 bp homology regions by lambda-red-mediated recombination (as described above) into some BACs post assembly (
Synthesis of Recoded Sections by REXER and GENESIS
We used various genomic and plasmid selection markers for sequential REXER experiments (GENESIS) (Table 4). We used an rpsL-KanR(−1/+1) or sacB-CmR (−2/+2) cassette at genomic landing sites for selection. We used rpsL-KanR-sacB (−1/+1,−2), rpsL-KanR-pheS*-HygR (−1/+1,−3/+3) or sacB-CmR-rpsL (−2/+2,−1) cassettes as episomal selection markers.
For each REXER, MDS42rpsLK43R cells containing pKW20_CDFtet_pAraRedCas9_tracrRNA and a double selection cassette at the relevant upstream genomic landing site were transformed with the relevant BAC. We plated cells on LB agar supplemented with 2% glucose, 5 μg/ml tetracycline and antibiotic selecting for the BAC (i.e. 18 μg/ml chloramphenicol or 50 μg/ml kanamycin). We inoculated individual colonies into LB medium with 5 μg/ml tetracycline and the BAC specific antibiotic and grew cells overnight at 37° C., 200 rpm. The overnight culture was diluted in LB medium with 5 μg/ml tetracycline, and the BAC specific antibiotic, to OD600=0.05 and grown at 37° C. with shaking for about 2 h, until OD600≈0.2. To induce lambda-red expression we added arabinose powder to the culture to a final concentration of 0.5% and then incubated the culture for one additional hour at 37° C. with shaking. We harvested the cells at OD600=0.6, and made the cells electro-competent as described previously (Wang, K. et al., 2016. Nature 539, 59-64).
For each REXER experiment a linear dsDNA protospacer array was PCR amplified from pKW1_MB1Amp_Spacers using universal primers (
We spun down the culture and resuspended it in 4 ml Milli-Q filtered water and spread in serial dilutions on selection plates of LB agar with 5 μg/ml tetracycline, an agent selecting against the negative selection marker and an antibiotic selecting for the positive marker originating from the BAC. The plates were incubated at 37° C. overnight. Multiple colonies were picked, resuspended in Milli-Q filtered water, and arrayed on several LB agar plates supplemented with 50 μg/ml kanamycin, 18 μg/ml chloramphenicol, 200 μg/ml streptomycin, 7.5% sucrose or 2.5 mM 4-chloro-phenylalanine. Colony PCR was also performed from resuspended colonies using both a primer pair flanking the genomic locus of the landing site and the position of the newly integrated selection cassette from the BAC. REXER-mediated recombination results in an approximately 500 bp band at the upstream genomic locus with a 2.5 kb (rK-landing site) or 3.5 kb (sC-landing site) band for the control MDS42rk/MDS42sC strain indicating successful removal of the landing site from the genome. Primer pairs flanking the 3′ end of the replaced DNA generate an approximately 2.5 kb (rK selection cassette on pBAC) or 3.5 kb (sC selection cassette on pBAC) band and a 500 bp band for the control MDS42rk/MDS42sC strain indicating successful integration of the selection markers.
If a plasmid-based circular protospacer array was used in the previous REXER experiment, the plasmid had to be lost before the next experiment. Thus, a successful clone from the first REXER experiment was grown in LB supplemented with 2% glucose, 5 μg/mL tetracycline and antibiotic selecting for the positive marker in the genome to a dense culture at 37° C. with shaking. 2 μL of the culture were then streaked out on an LB agar plate with the same supplements and incubated at 37° C. overnight. Several colonies were arrayed in replica on an LB agar plate and an LB agar plate supplemented with 100 μg/mL ampicillin to screen for the loss of the plasmid.
BAC Editing
When encountering loss-of-function mutations in a selection cassette on BACs in E. coli, the faulty cassette was replaced with a suitable double selection cassette provided (
Changes in the synthetic, recoded sequence of a BAC, either to correct spontaneous mutations or change recoded codons, were introduced by a two-step replacement approach. For BACs containing the selection cassettes −2/+2 and −1 in the end of the recoded sequence, the −3/+3 cassette was provided as a PCR-product flanked by 50 bp-homology regions targeting the desired locus and integrated by lambda-red-mediated recombination followed by selection for +3. Due to the homology between the recoded DNA and the genome, some of the resulting clones would contain −3/+3 on the BAC and some on the genome. To identify clones with the cassette on the BAC, clones were plated in replica on agar plates selecting (1) for +3, (2) against −3, and (3) for +2 and against −3. Only clones surviving on plate (1) and (2) but not on (3) have the −3/+3 cassette integrated on the BAC. The location of the cassette was verified by purifying the BAC using QIAprep Spin Miniprep Kit followed by genotyping. In a second step, the −3/+3 cassette was replaced by providing a PCR-product of the desired sequence flanked by 50 bp-homology regions and integrated by lambda-red-mediated recombination followed by selection for +2 and against −3. The BAC was genotyped as above and sequence-verified by NGS.
Preparing a Non-Transferable F′ Plasmid and Conjugative Transfer of Episomes
We created the version of the F′ plasmid used for conjugation of genomic DNA, as well as transfer of BACs between strains, to enable transfer of sequences bearing oriT without transfer of the F′ plasmid itself (
Transfer of episomal DNA containing oriT was performed by conjugation (Isaacs, F. J. et al., 2011. Science 333, 348-353; and Ma, N. J., et al. 2014. Nat Protoc 9, 2285-2300). A donor strain was double transformed with pJF146 and an assembled BAC with oriT (see above). A recipient strain was transformed with pKW20. 5 ml of donor and recipient culture were grown to saturation overnight in selective LB media and subsequently washed 3 times with LB media without antibiotics. The resuspended donor and recipient strains were combined in a 4:1 ratio, spotted on TYE agar plates and incubated for 1 h at 37° C. The cells were washed off the plate and spread in serial dilutions on LB agar plates with 2% glucose, 5 μg/ml tetracycline selecting for the recipient strain and antibiotic selecting for the BAC. Successful transfer of the BAC was confirmed by colony PCR of the BAC-vector insert junctions.
Assembling a Synthetic Genome from Recoded Sections
Transfer of genomic DNA was combined with subsequent recBCD-mediated recombination to assemble partially synthetic E. coli genomes into a synthetic genome. In preparation of the donor and recipient strains a rpsL-HygR-oriT or GmR-oriT cassette was supplied as PCR product and integrated into the donor strain genome via lambda-red-mediated recombination (
For conjugation, donor and recipient strains were grown to saturation overnight in LB medium with 2% glucose, 5 μg/ml tetracycline and 50 μg/ml kanamycin or 20 μg/ml chloramphenicol (donor) and 50 μg/ml apramycin and 200 μg/mL hygromycin B (recipient). The overnight cultures were diluted 1:10 in the same selective LB medium and grown to OD600=0.5. 50 ml of both donor and recipient culture were washed 3 times with LB medium with 2% glucose and then each resuspended in 400 μl LB medium with 2% glucose. 320 μl of donor was mixed with 80 μl of recipient, spotted on TYE agar plates and incubated at 37° C. The incubation time depended on the length of transferred synthetic DNA and doubling time of the recipient strain and varied from 1 h to 3 h. Cells were washed off the plate and transferred into 100 ml LB medium with 2% glucose and 5 μg/ml tetracycline and incubated at 37° C. for 2 h with shaking. Subsequently 50 μg/ml kanamycin or 20 μg/ml chloramphenicol (selecting for the transferred positive selection marker of the donor) was added, followed by another 2 h incubation at 37° C. The culture was spun down and resuspended in 4 ml Milli-Q filtered water and spread in serial dilutions on selection plates of LB agar with 2% glucose, 5 μg/ml tetracycline, 2.5 mM 4-chloro-phenylalanine and 50 μg/ml kanamycin or 20 μg/ml chloramphenicol. Successful DNA transfer and recombination was determined by colony PCR for the loss of the pheS*-HygR cassette, integration of the donor's selection cassette and absence of the GmR-oriT cassette.
Preparation of Whole-Genome and BAC Libraries for Next-Generation Sequencing
E. coli genomic DNA was purified using the DNeasy Blood and Tissue Kit (Qiagen) as per manufacturer's instructions. BACs were extracted from cells with the QIAprep Spin Miniprep Kit (Qiagen) as per manufacturer's instructions. We found that this kit was suitable for purification of BACs in excess of 130 kb. We avoided vigorous shaking of the samples throughout purification so as to reduce DNA shearing.
Paired-end Illumina sequencing libraries were prepared using the Illumina Nextera XT Kit as per manufacturer's instructions. Sequencing data was obtained on an Illumina MiSeq, running 2×300 or 2×75 cycles with the MiSeq Reagent kit v3.
Sequencing Data Analysis
The standard workflow for sequence analysis in this work is compiled in the iSeq package. In short, sequencing reads were aligned to a reference recoded or wild-type genome using bowtie2 with soft-clipping activated (Langmead, B. & Salzberg, S. L., 2012. Nat Methods 9, 357-359). Aligned reads were sorted and indexed with samtools (Li, H. et al., 2009. Bioinformatics 25, 2078-2079). A customised Python script combines functionalities of samtools and igvtools to yield a variant calling summary. This script was used to assess mutations, indels and structural variations, in combination with visual analysis in the Integrative Genomics Viewer (Thorvaldsdottir, H., et al., 2013. Brief Bioinform 14, 178-192).
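The alignment, sorting and indexing steps described above can be sketched as a short Python helper that assembles the corresponding command lines. This is an illustrative sketch only and is not the iSeq package: the function name, the index prefix and the file names are hypothetical, and the choice of bowtie2's `--local` mode is an assumption about how soft-clipping was activated.

```python
import shlex


def build_alignment_commands(index_prefix, reads_1, reads_2, sample):
    """Return command lines for local alignment, sorting and indexing.

    All file names are hypothetical placeholders chosen for illustration.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # bowtie2 local mode allows read ends to be soft-clipped
        ["bowtie2", "--local", "-x", index_prefix,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # coordinate-sort the alignments into a BAM file
        ["samtools", "sort", "-o", bam, sam],
        # index the sorted BAM for random access and viewing in IGV
        ["samtools", "index", bam],
    ]


commands = build_alignment_commands(
    "recoded_reference", "clone_R1.fastq.gz", "clone_R2.fastq.gz", "clone")
command_lines = [shlex.join(cmd) for cmd in commands]
```

Each entry of `commands` could be passed to `subprocess.run(cmd, check=True)` on a machine where bowtie2 and samtools are installed; the sketch builds the commands without executing them.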
We produced a custom Python script to generate recoding landscapes across a target genomic region. Briefly, the script takes a BAM alignment file, a reference in FASTA format and a GenBank annotation file as inputs. It identifies the target codons for recoding, and compiles the reads that align to these target codons in the alignment file. It then outputs the frequency of recoding at each target codon, and plots these frequencies across the length of the genomic region of interest.
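The core of such a recoding-landscape calculation can be sketched in plain Python. This is a simplified illustration and not the script used in this work: reads are represented as hypothetical `(start, sequence)` pairs of ungapped alignments rather than BAM records, only forward-strand coding sequences are handled, plotting is omitted, and the codon replacement scheme in the example is assumed for illustration.

```python
def find_target_codons(reference, cds_list, scheme):
    """List (position, target, replacement) for in-frame target codons.

    cds_list holds (start, end) pairs, 0-based and end-exclusive,
    for forward-strand coding sequences only (a simplifying assumption).
    """
    sites = []
    for start, end in cds_list:
        for pos in range(start, end - 2, 3):
            codon = reference[pos:pos + 3]
            if codon in scheme:
                sites.append((pos, codon, scheme[codon]))
    return sites


def recoding_frequencies(reference, cds_list, reads, scheme):
    """Fraction of covering reads that carry the replacement codon."""
    freqs = {}
    for pos, _target, replacement in find_target_codons(
            reference, cds_list, scheme):
        covering = recoded = 0
        for read_start, read_seq in reads:
            offset = pos - read_start
            # count only reads that span the whole codon
            if offset >= 0 and offset + 3 <= len(read_seq):
                covering += 1
                if read_seq[offset:offset + 3] == replacement:
                    recoded += 1
        freqs[pos] = recoded / covering if covering else None
    return freqs


# Hypothetical example: one short gene with three target codons.
scheme = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}  # assumed scheme
reference = "ATGTCGAAATCATAG"
reads = [(0, "ATGAGCAAAAGTTAA"), (0, "ATGAGCAAATCATAG"), (3, "AGCAAA")]
landscape = recoding_frequencies(reference, [(0, 15)], reads, scheme)
```

The resulting dictionary maps each target codon position to its recoding frequency, which is the quantity a landscape plot would show along the genomic region.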
Growth Rate Measurement and Analysis
Bacterial clones were grown overnight at 37° C. in LB with 2% glucose and 100 μg/mL streptomycin. Overnight cultures were diluted 1:50 and monitored for growth while varying temperature (25° C., 37° C., or 42° C.) and media conditions (LB, LB with 2% glucose, M9 minimal media, 2×TY). Measurements of OD600 were taken every 5 min for 18 h on a Biomek automated workstation platform with high speed linear shaking.
To determine doubling times, the growth curves were log2-transformed. At a linear phase of the curve during exponential growth, the first derivative was determined (d(log2(x))/dt) and ten consecutive time-points with the maximal log2-derivatives were used to calculate the doubling time for each replicate. A total of 10 independently grown biological replicates were measured for the recoded Syn61 strain and wt MDS42rpsLK43R. The mean doubling time and standard deviation from the mean were calculated for all n=10 replicates.
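The doubling-time calculation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis code used here: the derivative is approximated by finite differences between consecutive measurements, the "ten consecutive time-points" are taken as the window of ten consecutive derivative values with the highest mean, and the function name and synthetic data are hypothetical.

```python
import math


def doubling_time(times_min, od600, window=10):
    """Estimate doubling time (min) from the steepest log2 growth window."""
    log2_od = [math.log2(x) for x in od600]
    # finite-difference estimate of d(log2(OD600))/dt
    slopes = [
        (log2_od[i + 1] - log2_od[i]) / (times_min[i + 1] - times_min[i])
        for i in range(len(log2_od) - 1)
    ]
    if len(slopes) < window:
        raise ValueError("not enough time points for the chosen window")
    # mean slope of the steepest run of consecutive derivative values
    best = max(
        sum(slopes[i:i + window]) / window
        for i in range(len(slopes) - window + 1)
    )
    if best <= 0:
        raise ValueError("no growth detected")
    # a slope of one log2 unit per T minutes means one doubling per T minutes
    return 1.0 / best


# Synthetic curve: 30 min doubling time, saturating at OD600 = 1.0,
# sampled every 5 min for 18 h as in the measurement protocol.
times = [5 * i for i in range(217)]
ods = [min(0.01 * 2 ** (t / 30), 1.0) for t in times]
estimate = doubling_time(times, ods)
```

Applying the function to each replicate and then taking the mean and standard deviation of the ten values would mirror the per-strain summary described above.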
Microscopy and Cell Size Measurement
Cells were grown with shaking in LB supplemented with 100 μg/mL streptomycin to approximately OD600=0.2. A thin layer of bacteria was sandwiched between an agarose pad and a coverslip. A standard microscope slide was prepared with a 1% agarose pad (Sigma-Aldrich A4018-5G). A sample of 2 μl to 4 μl of bacterial culture was dropped onto the top of the pad. This was covered by a #1 coverslip supported on either side by a glass spacer matched to the ~1 mm height of the pad. Samples were imaged on an upright Zeiss Axiophot phase contrast microscope using a 63×1.25NA Plan Neofluar phase objective (Zeiss UK, Cambridge, UK). Images were taken using an IDS ueye monochrome camera under control of ueye cockpit software (IDS Imaging Development Systems GmbH, Obersulm, Germany). 10 fields were taken of each sample. Images were loaded in Nikon NIS Elements software for further quantitation (Nikon Instruments Surrey UK). The General analysis tool was used to apply an intensity threshold to segment the bacteria. A one micron lower size limit was imposed to remove background particulates and dust. Length measurements were subsequently made on the segmented bacteria using the General Analysis quantification tools.
Mass Spectrometry
Three biological replicates were performed for each strain. Proteins from each Escherichia coli lysate were solubilized in a buffer containing 6 M urea in 50 mM ammonium bicarbonate, reduced with 10 mM DTT, and alkylated with 55 mM iodoacetamide. After alkylation, proteins were diluted to 1 M urea with 50 mM ammonium bicarbonate, digested with Lys-C (Promega, UK) at a protein to enzyme ratio of 1:50 for 2 hours at 37° C., followed by digestion with Trypsin (Promega, UK) at a protein to enzyme ratio of 1:100 for 12 hours at 37° C. The resulting peptide mixtures were acidified by the addition of formic acid to a final concentration of 2% v/v. The digests were analysed in duplicate (1 μg initial protein/injection) by nano-scale capillary LC-MS/MS using an Ultimate U3000 HPLC (ThermoScientific Dionex, San Jose, USA) to deliver a flow of approximately 300 nL/min. A C18 Acclaim PepMap100 5 μm, 100 μm×20 mm nanoViper (ThermoScientific Dionex, San Jose, USA), trapped the peptides prior to separation on a C18 Acclaim PepMap100 3 μm, 75 μm×250 mm nanoViper (ThermoScientific Dionex, San Jose, USA). Peptides were eluted with a 100 minute gradient of acetonitrile (2% to 60%). The analytical column outlet was directly interfaced via a nano-flow electrospray ionisation source, with a hybrid dual pressure linear ion trap mass spectrometer (Orbitrap Velos, ThermoScientific, San Jose, USA). Data dependent analysis was carried out, using a resolution of 30,000 for the full MS spectrum, followed by ten MS/MS spectra in the linear ion trap. MS spectra were collected over a m/z range of 300-2000. MS/MS scans were collected using a threshold energy of 35 for collision induced dissociation. All raw files were processed with MaxQuant 1.5.5.1 using standard settings and searched against an Escherichia coli strain K-12 database with the Andromeda search engine integrated into the MaxQuant software suite. Enzyme search specificity was Trypsin/P for both endoproteinases. Up to two missed cleavages for each peptide were allowed. Carbamidomethylation of cysteines was set as fixed modification with oxidized methionine and protein N-acetylation considered as variable modifications. The search was performed with an initial mass tolerance of 6 ppm for the precursor ion and 0.5 Da for CID MS/MS spectra. The false discovery rate was fixed at 1% at the peptide and protein level. Statistical analysis was carried out using the Perseus (1.5.5.3) module of MaxQuant. Prior to statistical analysis, peptides mapped to known contaminants, reverse hits and protein groups only identified by site were removed. Only protein groups identified with at least two peptides, one of which was unique, and two quantitation events were considered for data analysis. For proteins quantified at least once in each strain, the average abundance of each protein across replicates of Syn61 was divided by the abundance in MDS42 replicates, and then log2-transformed. A P-value for the difference in abundance between strains was calculated by two-sample T-test (Perseus).
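The abundance comparison can be illustrated with a short Python sketch. It is not the Perseus implementation: the sketch assumes the MDS42 abundance is also averaged across replicates, uses a pooled-variance (equal variance) two-sample t statistic as one common form of the test, and stops at the t statistic rather than converting it to a P-value. The function names and abundance values are hypothetical.

```python
import math
from statistics import mean, variance


def log2_abundance_ratio(syn61, mds42):
    """log2 of mean Syn61 abundance over mean MDS42 abundance."""
    return math.log2(mean(syn61) / mean(mds42))


def two_sample_t(a, b):
    """Two-sample t statistic with pooled variance (assumed test form)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical abundances for one protein, three replicates per strain.
syn61_abundance = [200.0, 220.0, 180.0]
mds42_abundance = [100.0, 110.0, 90.0]
ratio = log2_abundance_ratio(syn61_abundance, mds42_abundance)
t_stat = two_sample_t(syn61_abundance, mds42_abundance)
```

A log2 ratio of 1 corresponds to a two-fold higher mean abundance in Syn61; the t statistic would be compared against a t distribution with `na + nb - 2` degrees of freedom to obtain a P-value.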
Toxicity of CYPK incorporation using orthogonal aminoacyl-tRNA synthetase/tRNAXXX pairs (Elliott, T. S. et al., 2014. Nat Biotechnol 32, 465-472; Elliott, T. S., et al., 2016. Cell Chem Biol 23, 805-815; and Krogager, T. P. et al., 2018. Nat Biotechnol 36, 156-159)
Electrocompetent MDS42 and Syn61 cells were transformed with plasmid pKW1_MmPylS_PylTXXX for expression of PylRS and tRNAPylXXX, where XXX is the indicated anticodon. Three variants of this plasmid were used, with the anticodon of tRNAPyl mutated to CGA (pKW1_MmPylS_PylTCGA), UGA (pKW1_MmPylS_PylTUGA) or GCU (pKW1_MmPylS_PylTGCU). Cells were grown overnight in LB medium with 75 μg/ml spectinomycin. Overnight cultures were diluted 1:100 into LB supplemented with Nε-(((2-methylcycloprop-2-en-1-yl) methoxy) carbonyl)-L-lysine (CYPK) at 0 mM, 0.5 mM, 1 mM, 2.5 mM and 5 mM and growth was measured as described above. “% Max Growth” was determined as the final OD600 in the presence of the indicated concentration of CYPK divided by the final OD600 in the absence of CYPK. Final OD600s were determined after 600 min.
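The “% Max Growth” metric is a simple normalisation, sketched below in Python. The function name and the OD600 readings are hypothetical and serve only to illustrate the arithmetic; expressing the ratio as a percentage is an assumption based on the name of the metric.

```python
def percent_max_growth(final_od_by_conc):
    """Final OD600 at each CYPK concentration relative to 0 mM, in percent.

    final_od_by_conc maps CYPK concentration (mM) to final OD600 and
    must include the 0 mM reference condition.
    """
    baseline = final_od_by_conc[0.0]
    return {conc: 100.0 * od / baseline
            for conc, od in final_od_by_conc.items()}


# Hypothetical final OD600 readings after 600 min for one strain/plasmid.
readings = {0.0: 1.2, 0.5: 1.2, 1.0: 0.9, 2.5: 0.6, 5.0: 0.3}
growth = percent_max_growth(readings)
```

Values close to 100 at every concentration would indicate that CYPK incorporation is not toxic for that strain and anticodon, whereas a decline with increasing concentration would indicate toxicity.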
Deletion of prfA, serU and serT by Homologous Recombination
Recoded versions of the pheS-HygR and rpsL-KanR cassettes, according to the recoding scheme described in
All publications mentioned in the above specification are herein incorporated by reference. Various modifications and variations of the disclosed methods, cells, compositions and uses of the invention will be apparent to the skilled person without departing from the scope and spirit of the invention. Although the invention has been disclosed in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the disclosed modes for carrying out the invention, which are obvious to the skilled person are intended to be within the scope of the following claims.
Claims
1. A synthetic prokaryotic genome comprising 5 or fewer occurrences of one or more sense codons.
2. The synthetic prokaryotic genome according to claim 1, wherein the synthetic prokaryotic genome comprises 100 or more genes.
3. The synthetic prokaryotic genome according to claim 1, wherein:
- (i) the synthetic prokaryotic genome is a synthetic bacterial genome;
- (ii) the one or more sense codons consist of one sense codon or two sense codons;
- (iii) the synthetic prokaryotic genome comprises no occurrences of two or more sense codons;
- (iv) the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA; and/or
- (v) the synthetic prokaryotic genome comprises 10 or fewer occurrences, or no occurrences, of the amber stop codon (TAG).
4. The synthetic prokaryotic genome according to claim 1, wherein the synthetic prokaryotic genome is viable.
5. A synthetic prokaryotic genome derived from a parent prokaryotic genome, wherein the synthetic prokaryotic genome comprises less than 10% of the occurrences of one or more sense codons, relative to the parent prokaryotic genome, or wherein the synthetic prokaryotic genome comprises no occurrences of one or more sense codons.
6. The synthetic prokaryotic genome according to claim 5, wherein:
- (i) the synthetic prokaryotic genome is a synthetic bacterial genome;
- (ii) the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA;
- (iii) 90% or more of the occurrences of the one or more sense codons in the parent prokaryotic genome are replaced with synonymous sense codons;
- (iv) the synthetic prokaryotic genome comprises 10 or fewer occurrences, or no occurrences, of the amber stop codon (TAG);
- (v) 99.9% or more of the occurrences of two or more sense codons in the parent prokaryotic genome are replaced with synonymous sense codons, and wherein all of the occurrences of TAG in the parent prokaryotic genome are replaced with TAA;
- (vi) one or more pairs of genes which share an overlapping region comprising the one or more sense codons in the parent prokaryotic genome are refactored; and/or
- (vii) the synthetic prokaryotic genome is 100 kb to 10 Mb in size.
7. The synthetic prokaryotic genome according to claim 6, wherein for pairs of genes in opposite orientations, a synthetic insert is inserted between the genes, wherein the synthetic insert comprises the overlapping region; and/or
- wherein for pairs of genes in the same orientation, a synthetic insert is inserted between the genes, wherein the synthetic insert comprises: (i) a stop codon; (ii) about 20-200 bp from upstream of the overlapping region; and (iii) the overlapping region.
8. A polynucleotide comprising twenty or more essential genes with no occurrences of one or more sense codons.
9. The polynucleotide according to claim 8, wherein:
- (i) the one or more sense codons consist of one sense codon or two sense codons;
- (ii) the one or more sense codons are selected from TCG, TCA, TCT, TCC, AGT, AGC, GCG, GCA, GCT, GCC, CTG, CTA, CTT, CTC, TTG, and TTA;
- (iii) the occurrences of the one or more sense codons in the genes are replaced with synonymous sense codons; and/or
- (iv) the essential genes comprise essential genes selected from one or more of the list consisting of: ribF, lspA, ispH, dapB, folA, imp, yabO, ftsL, ftsI, murE, murF, mraY, murD, ftsW, murG, murC, ftsQ, ftsA, ftsZ, lpxC, secM, secA, can, folK, hemL, yadR, dapD, map, rpsB, tsf, pyrH, frr, dxr, ispU, cdsA, yaeL, yaeT, lpxD, fabZ, lpxA, lpxB, dnaE, accA, tilS, proS, yafF, hemB, secD, secF, ribD, ribE, thiL, dxs, ispA, dnaX, adk, hemH, lpxH, cysS, folD, entD, mrdB, mrdA, nadD, holA, rlpB, leuS, lnt, glnS, fldA, cydA, infA, cydC, ftsK, lolA, serS, rpsA, msbA, lpxK, kdsB, mukF, mukE, mukB, asnS, fabA, mviN, rne, fabD, fabG, acpP, tmk, holB, lolC, lolD, lolE, purB, minE, minD, pth, prsA, ispE, lolB, hemA, prfA, prmC, kdsA, topA, ribA, fabI, tyrS, ribC, ydiL, pheT, pheS, rplT, infC, thrS, nadE, gapA, yeaZ, aspS, argS, pgsA, yefM, metG, folE, yejM, gyrA, nrdA, nrdB, folC, accD, fabB, gltX, ligA, zipA, dapE, dapA, der, hisS, ispG, suhB, tadA, acpS, era, rnc, lepB, rpoE, pssA, yfiO, rplS, trmD, rpsP, ffh, grpE, csrA, ispF, ispD, ftsB, eno, pyrG, chpR, lgt, fbaA, pgk, yqgD, metK, yqgF, plsC, ygiT, parE, ribB, cca, ygjD, tdcF, yraL, yhbV, infB, nusA, ftsH, obgE, rpmA, rplU, ispB, murA, yrbB, yrbK, yhbN, rpsI, rplM, degS, mreD, mreC, mreB, accB, accC, yrdC, def, fmt, rplQ, rpoA, rpsD, rpsK, rpsM, secY, rplO, rpmD, rpsE, rplR, rplF, rpsH, rpsN, rplE, rplX, rplN, rpsQ, rpmC, rplP, rpsC, rplV, rpsS, rplB, rplW, rplD, rplC, rpsJ, fusA, rpsG, rpsL, trpS, yrfF, asd, rpoH, ftsX, ftsE, ftsY, yhhQ, bcsB, glyQ, gpsA, rfaK, kdtA, coaD, rpmB, dfp, dut, gmk, spoT, gyrB, dnaN, dnaA, rpmH, rnpA, yidC, tnaB, glmS, glmU, wzyE, hemD, hemC, yigP, ubiB, ubiD, hemG, yihA, ftsN, murI, murB, birA, secE, nusG, rplJ, rplL, rpoB, rpoC, ubiA, plsB, lexA, dnaB, ssb, alsK, groS, psd, orn, yjeE, rpsR, chpS, ppa, valS, yjgP, yjgQ, and dnaC.
10. A prokaryotic host cell comprising the synthetic prokaryotic genome according to claim 1.
11. The prokaryotic host cell according to claim 10, wherein:
- (i) the prokaryotic host cell is viable; and/or
- (ii) the prokaryotic host cell is a bacterial cell.
12. The prokaryotic host cell according to claim 11, wherein the cell is an Escherichia coli, Salmonella enterica, or Shigella dysenteriae cell.
13. A prokaryotic host cell comprising the polynucleotide according to claim 8.
14. A method for production of polypeptides comprising one or more non-proteinogenic amino acids, the method comprising culturing the prokaryotic host cell according to claim 10 under conditions and for a time sufficient for production of polypeptides comprising one or more non-proteinogenic amino acids.
15. A method for producing a synthetic genome comprising:
- (a) providing a parent genome;
- (b) carrying out one or more rounds of recombination-mediated genetic engineering on the parent genome, to produce two or more different partially synthetic genomes; and
- (c) carrying out one or more rounds of directed conjugation with the two or more different partially synthetic genomes to produce a synthetic genome; wherein the partially synthetic genomes each comprise a synthetic region that has 50 or fewer occurrences, or 0 occurrences, of each of one or more sense codons; or wherein the partially synthetic genomes each comprise a synthetic region that has less than 10% of the occurrences of each of one or more sense codons, relative to the corresponding region in the parent genome.
16. The method according to claim 15, wherein:
- (i) the synthetic regions collectively cover 90% or greater of the parent genome;
- (ii) the synthetic regions are 10-1000 kb in size;
- (iii) the viability of the partially synthetic genomes is tested after each round of recombination-mediated genetic engineering and/or after each round of directed conjugation;
- (iv) the two or more different partially synthetic genomes comprise at least one partially synthetic donor genome and at least one partially synthetic recipient genome; and/or
- (v) the one or more rounds of recombination-mediated genetic engineering comprise one or more rounds of replicon excision for enhanced genome engineering through programmed recombination (REXER).
17. The method according to claim 16, wherein the at least one partially synthetic donor genome comprises a synthetic region and a first selectable marker flanked by two homology regions immediately downstream of an origin of transfer; and wherein the at least one partially synthetic recipient genomes comprise a second selectable marker flanked by two corresponding homology regions.
18. The method according to claim 17, wherein the synthetic region present in the at least one partially synthetic recipient genomes is outside the region flanked by the homology regions.
19. The method according to claim 17, wherein the method further comprises one or more rounds of selection for the selectable markers.
20. A synthetic prokaryotic genome produced by the method according to claim 15.
Type: Application
Filed: May 22, 2023
Publication Date: Dec 7, 2023
Inventors: Julius FREDENS (Swindon), Kaihang WANG (Swindon), Daniel DE LA TORRE (Swindon), Louise F. H. FUNKE (Swindon), Wesley E. ROBERTSON (Swindon), Jason W. CHIN (Swindon)
Application Number: 18/321,475