Abstract: Methods and apparatus that use a mixture of representation modalities, including natural language, protein sequence, protein structure, property-vector, and small molecule drug representations, to jointly train a neural network. The network accepts mixed modality queries as input and produces mixed modality output responses, including representations of proteins for synthesis and of small molecule drugs for manufacture. In one embodiment of the invention, multicapitate transformers are used, wherein each decoder head has a distinct loss function and represents a distinct modality. Modality-specific embeddings are implemented for the mixed modality input query, and an autoregressive process yields the output protein for synthesis or small molecule drug for manufacture.
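As a rough illustration of a joint training objective in which each decoder head carries its own loss, the sketch below sums per-head losses over two modalities. The head names, the choice of cross-entropy for the sequence head and mean squared error for the property-vector head, and the optional weights are all assumptions made for illustration; the abstract does not specify which loss belongs to which head or how the losses are combined.

```python
import math
from typing import Callable, Dict, Optional, Sequence


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of one target token under a softmax over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between predicted and target vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical mapping from decoder head (modality) to its own loss function.
HEAD_LOSSES: Dict[str, Callable[..., float]] = {
    "protein_sequence": cross_entropy,      # assumed: discrete token prediction
    "property_vector": mean_squared_error,  # assumed: continuous regression
}


def joint_loss(
    head_outputs: Dict[str, Sequence[float]],
    targets: Dict[str, object],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-head losses; a head with no listed weight gets 1.0."""
    total = 0.0
    for modality, output in head_outputs.items():
        weight = 1.0 if weights is None else weights.get(modality, 1.0)
        total += weight * HEAD_LOSSES[modality](output, targets[modality])
    return total
```

In a real system each head would emit a full sequence of predictions and the losses would be backpropagated through shared transformer layers; here a single prediction per head keeps the example small.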
Abstract: Methods and apparatus for obtaining representations of proteins and small molecule drugs for synthesis, wherein input queries into trained mixed modality protein and natural language models are augmented with relevant query-related documents. In one embodiment, the relevant query-related documents are obtained by maximum inner product search over an embedding latent vector space into which the query and the documents are projected. The top-k documents most relevant to the query are then combined with the query as input into the trained mixed modality language model. In one embodiment, the mixed modality model is an autoregressive multicapitate transformer whose decoder output heads correspond to the represented modalities. The method returns mixed modality output representations of proteins or small molecule drugs for synthesis or manufacture.
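The retrieval step described in the abstract above can be sketched as an exhaustive maximum inner product search followed by concatenation of the top-k documents with the query. The function names, the brute-force scan, and the newline-joined prompt format are assumptions for illustration only; the abstract does not state how the embeddings are produced, how the search is indexed, or how documents and query are combined.

```python
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_mips(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Indices of the k documents with the largest inner product with the query.

    Exhaustive scan; ties are broken by the lower document index.
    """
    scored = [(inner_product(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def augment_query(query: str, documents: Sequence[str], doc_ids: Sequence[int]) -> str:
    """Prepend the retrieved documents to the query (hypothetical prompt format)."""
    return "\n".join([documents[i] for i in doc_ids] + [query])


# Toy two-dimensional embeddings standing in for a learned latent space.
documents = ["doc about kinases", "doc about solubility", "doc about binding"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_vec = [1.0, 0.2]

retrieved = top_k_mips(query_vec, doc_vecs, k=2)
prompt = augment_query("design a kinase binder", documents, retrieved)
```

At scale, the exhaustive scan would normally be replaced by an approximate nearest-neighbor index, but the ranking criterion (largest inner product first) stays the same.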