Patents by Inventor Marc Najork

Marc Najork has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

System for information extraction from form-like documents

Patent number: 12354396

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
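
The scoring and assignment steps in this abstract can be illustrated with a minimal sketch. It is not the patented method: the `Candidate` class, the hard-coded embeddings, the per-field embedding, the sigmoid-of-dot-product score, and the 0.5 threshold are all assumptions made for illustration. The abstract only says that each candidate receives an embedding and a score and that assignment is based on the scores.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Candidate:
    text: str
    embedding: Sequence[float]  # stand-in for a learned candidate embedding


def score(candidate: Candidate, field_embedding: Sequence[float]) -> float:
    """Assumed scoring rule: sigmoid of the dot product with a field vector."""
    dot = sum(c * f for c, f in zip(candidate.embedding, field_embedding))
    return 1.0 / (1.0 + math.exp(-dot))


def assign_fields(
    candidates_by_field: Dict[str, List[Candidate]],
    field_embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off
) -> Dict[str, str]:
    """Assign the highest-scoring candidate to each field type in the schema."""
    assignments: Dict[str, str] = {}
    for field, candidates in candidates_by_field.items():
        scored = [(score(c, field_embeddings[field]), c.text) for c in candidates]
        if not scored:
            continue
        best_score, best_text = max(scored)
        if best_score >= threshold:
            assignments[field] = best_text
    return assignments


candidates = {
    "invoice_date": [
        Candidate("2023-10-19", [1.0, 0.0]),
        Candidate("2023-11-30", [-1.0, 0.5]),
    ],
    "total": [Candidate("$10", [-1.0, 0.0])],
}
fields = {"invoice_date": [2.0, 0.0], "total": [1.0, 0.0]}
result = assign_fields(candidates, fields)
```

In this toy run only `invoice_date` is assigned, because the single `total` candidate scores below the threshold.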

Type: Grant

Filed: October 19, 2023

Date of Patent: July 8, 2025

Assignee: GOOGLE LLC

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
Systems and Methods for Machine-Learned Prediction of Semantic Similarity Between Documents

Publication number: 20250209277

Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.
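
The pipeline in this abstract (parse into blocks, encode, compare encodings) can be sketched as follows. This is an illustration under stated assumptions: the hashed bag-of-words `encode_block` is a stand-in and not a machine-learned model, and splitting on blank lines, mean pooling of block vectors, and cosine similarity are choices made for this example, not details taken from the disclosure.

```python
import math
import re
import zlib
from typing import List

DIM = 64  # size of the toy encoding; an arbitrary choice for illustration


def parse_blocks(document: str) -> List[str]:
    """Split a document into textual blocks at blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", document) if b.strip()]


def encode_block(block: str) -> List[float]:
    """Stand-in encoder: hashed bag of words, NOT a machine-learned model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", block.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def encode_document(document: str) -> List[float]:
    """Average the block encodings into one document encoding."""
    blocks = parse_blocks(document)
    if not blocks:
        return [0.0] * DIM
    vectors = [encode_block(b) for b in blocks]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the two document encodings."""
    a, b = encode_document(doc_a), encode_document(doc_b)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


first = "Invoice for cloud services.\n\nPayment is due in 30 days."
second = "Meeting notes.\n\nDiscussed the cloud services invoice."
print(similarity(first, second))
```

A real system would replace `encode_block` with a trained encoder; the surrounding structure is the only part this sketch is meant to show.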

Type: Application

Filed: December 20, 2024

Publication date: June 26, 2025

Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
Systems and methods for machine-learned prediction of semantic similarity between documents

Patent number: 12210837

Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.

Type: Grant

Filed: May 22, 2023

Date of Patent: January 28, 2025

Assignee: GOOGLE LLC

Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
System for Information Extraction from Form-Like Documents

Publication number: 20240046684

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.

Type: Application

Filed: October 19, 2023

Publication date: February 8, 2024

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
System for information extraction from form-like documents

Patent number: 11830269

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.

Type: Grant

Filed: July 18, 2022

Date of Patent: November 28, 2023

Assignee: GOOGLE LLC

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
Systems and Methods for Machine-Learned Prediction of Semantic Similarity Between Documents

Publication number: 20230297783

Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.

Type: Application

Filed: May 22, 2023

Publication date: September 21, 2023

Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
Systems and Methods for Using Document Activity Logs to Train Machine-Learned Models for Determining Document Relevance

Publication number: 20230267277

Abstract: Systems and methods of the present disclosure are directed to a method for training a machine-learned semantic matching model. The method can include obtaining a first and second document and a first and second activity log. The method can include determining, based on the first document activity log and the second document activity log, a relation label indicative of whether the documents are related. The method can include inputting the documents into the model to receive a semantic similarity value representing an estimated semantic similarity between the first document and the second document. The method can include evaluating a loss function that evaluates a difference between the relation label and the semantic similarity value. The method can include modifying values of parameters of the model based on the loss function.
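
The training loop in this abstract can be sketched in a few lines. Everything concrete here is an assumption for illustration: the `(user, timestamp)` log format, the rule that two documents are related when any user appears in both logs, the two-parameter logistic model over token overlap, and the binary cross-entropy loss. The abstract itself only specifies a relation label derived from the logs, a similarity value from the model, a loss comparing the two, and a parameter update.

```python
import math
from typing import List, Tuple

Event = Tuple[str, float]  # (user id, timestamp); a hypothetical log format
Weights = Tuple[float, float]  # (weight on overlap, bias) of the toy model


def relation_label(log_a: List[Event], log_b: List[Event]) -> int:
    """Assumed rule: documents are related if some user touched both."""
    users_a = {user for user, _ in log_a}
    users_b = {user for user, _ in log_b}
    return 1 if users_a & users_b else 0


def token_overlap(doc_a: str, doc_b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def predict(weights: Weights, doc_a: str, doc_b: str) -> float:
    """Toy semantic matching model: logistic function of token overlap."""
    w, b = weights
    return 1.0 / (1.0 + math.exp(-(w * token_overlap(doc_a, doc_b) + b)))


def loss(label: int, similarity: float) -> float:
    """Binary cross-entropy between the relation label and the similarity."""
    eps = 1e-12
    return -(label * math.log(similarity + eps)
             + (1 - label) * math.log(1.0 - similarity + eps))


def train_step(weights: Weights, doc_a: str, doc_b: str,
               label: int, lr: float = 0.5) -> Weights:
    """One gradient-descent update of the model parameters."""
    w, b = weights
    x = token_overlap(doc_a, doc_b)
    error = predict(weights, doc_a, doc_b) - label  # d(loss)/d(logit)
    return (w - lr * error * x, b - lr * error)


doc_a, doc_b = "quarterly budget report", "budget report draft"
label = relation_label([("alice", 1.0)], [("alice", 2.0), ("bob", 3.0)])
before = loss(label, predict((0.0, 0.0), doc_a, doc_b))
after = loss(label, predict(train_step((0.0, 0.0), doc_a, doc_b, label), doc_a, doc_b))
```

After one update the loss on this pair drops, which is the behavior the abstract's final step describes.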

Type: Application

Filed: June 15, 2020

Publication date: August 24, 2023

Inventors: Weize Kong, Michael Bendersky, Marc Najork, Rama Kumar Pasumarthi, Zhen Qin, Rolf Jagerman
Systems and methods for machine-learned prediction of semantic similarity between documents

Patent number: 11694034

Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.

Type: Grant

Filed: October 23, 2020

Date of Patent: July 4, 2023

Assignee: GOOGLE LLC

Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
Systems and methods for active learning

Patent number: 11526752

Abstract: Provided are computing systems and methods directed to active learning and may provide advantages or improvements to active learning applications for skewed data sets. A challenge in training and developing high-quality models for many supervised learning scenarios is obtaining labeled training examples. Provided are systems and methods for active learning on a training dataset that includes both labeled and unlabeled datapoints. In particular, the systems and methods described herein can select (e.g., at each of a number of iterations) a number of the unlabeled datapoints for which labels should be obtained to gain additional labeled datapoints on which to train a machine-learned model (e.g., machine-learned classifier model). Generally, provided are cost-effective methods and systems for selecting data to improve machine-learned models in applications such as the identification of content items in text, images, and/or audio.
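
The selection step in this abstract can be illustrated with a generic technique. The sketch below uses plain uncertainty sampling, which is a swapped-in textbook strategy and not necessarily the selection rule claimed in the patent; the function name, the probability dictionary, and the batch size are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(positive_probability: Dict[str, float], k: int) -> List[str]:
    """Pick the k unlabeled datapoints whose predicted probability is
    closest to 0.5, i.e. where the current classifier is least certain."""
    ranked = sorted(positive_probability,
                    key=lambda item: abs(positive_probability[item] - 0.5))
    return ranked[:k]


# Classifier scores for four unlabeled datapoints (made-up values).
unlabeled = {"doc1": 0.97, "doc2": 0.55, "doc3": 0.08, "doc4": 0.40}
to_label = select_for_labeling(unlabeled, k=2)
```

At each iteration the selected items would be sent for labeling, added to the labeled set, and the model retrained before the next selection round.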

Type: Grant

Filed: January 23, 2020

Date of Patent: December 13, 2022

Assignee: GOOGLE LLC

Inventors: Qi Zhao, Abbas Kazerouni, Sandeep Tata, Jing Xie, Marc Najork
System for Information Extraction from Form-Like Documents

Publication number: 20220375245

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.

Type: Application

Filed: July 18, 2022

Publication date: November 24, 2022

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
System for information extraction from form-like documents

Patent number: 11393233

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.

Type: Grant

Filed: June 2, 2020

Date of Patent: July 19, 2022

Assignee: GOOGLE LLC

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
Systems and Methods for Machine-Learned Prediction of Semantic Similarity Between Documents

Publication number: 20220129638

Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.

Type: Application

Filed: October 23, 2020

Publication date: April 28, 2022

Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
System for Information Extraction from Form-Like Documents

Publication number: 20210374395

Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.

Type: Application

Filed: June 2, 2020

Publication date: December 2, 2021

Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
Systems and Methods for Active Learning

Publication number: 20200250527

Abstract: The present disclosure provides computing systems and methods directed to active learning and may provide advantages or improvements to active learning applications for skewed data sets. A challenge in training and developing high-quality models for many supervised learning scenarios is obtaining labeled training examples. This disclosure provides systems and methods for active learning on a training dataset that includes both labeled and unlabeled datapoints. In particular, the systems and methods described herein can select (e.g., at each of a number of iterations) a number of the unlabeled datapoints for which labels should be obtained to gain additional labeled datapoints on which to train a machine-learned model (e.g., machine-learned classifier model). Generally, the disclosure provides cost-effective methods and systems for selecting data to improve machine-learned models in applications such as the identification of content items in text, images, and/or audio.

Type: Application

Filed: January 23, 2020

Publication date: August 6, 2020

Inventors: Qi Zhao, Abbas Kazerouni, Sandeep Tata, Jing Xie, Marc Najork
Social network recommended content and recommending members for personalized search results

Patent number: 8949232

Abstract: Architecture that provides a data structure to facilitate personalized ranking over recommended content (e.g., documents). The data structure approximates the social distance of the searching user to the content at query time. A graph is created of content recommended by members of the social network, where the nodes of the graph include content nodes (for the content) and recommending member nodes (for members of the social network who recommended the content). If a member recommends content, an edge is created between the member node and the content node. If a member is a “friend” (tagged as related in some way) of another member, an edge is created between the two member nodes. Each node is converted to a lower dimensional feature set. Feature sets of the content are indexed and the feature set of the searching user is utilized to match and rank the search results at query time.

Type: Grant

Filed: October 4, 2011

Date of Patent: February 3, 2015

Assignee: Microsoft Corporation

Inventors: Timothy Harrington, Rajesh Shenoy, Marc Najork, Rina Panigrahy
Automatically Creating Training Data For Language Identifiers

Publication number: 20150006148

Abstract: Example apparatus and methods concern automatically creating labeled training data for automatic language identifiers. One embodiment includes logic to produce a predicted language classification for a post from geographic data associated with the post. The post may be associated with a micro-blog, a social media site, or other electronic communication service that traffics in short messages having frequent colloquialisms, non-standard spelling, emoticons, and unique usages of characters to convey meaning. The embodiment includes logic to produce an actual language classification for the post using a base language classifier. The embodiment includes logic to selectively add the post and a language label for the post to an automatically generated labeled training data upon determining that the predicted language classification matches the actual language classification.
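
The agreement filter in this abstract can be sketched as follows. The country-to-language table, the function names, and the keyword-based `toy_classifier` are hypothetical stand-ins; a real system would use actual geographic signals and a real base language classifier.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical mapping from a post's country code to its likely language.
COUNTRY_TO_LANGUAGE: Dict[str, str] = {"BR": "pt", "JP": "ja", "DE": "de"}


def predict_from_geo(country_code: str) -> Optional[str]:
    """Predicted language classification derived from geographic data."""
    return COUNTRY_TO_LANGUAGE.get(country_code)


def build_training_data(
    posts: List[Tuple[str, str]],            # (text, country code)
    base_classifier: Callable[[str], str],   # returns a language code
) -> List[Tuple[str, str]]:
    """Keep a post only when the geo prediction and the classifier agree."""
    labeled: List[Tuple[str, str]] = []
    for text, country in posts:
        predicted = predict_from_geo(country)
        actual = base_classifier(text)
        if predicted is not None and predicted == actual:
            labeled.append((text, actual))
    return labeled


def toy_classifier(text: str) -> str:
    """Keyword stand-in for a base language classifier."""
    lowered = text.lower()
    if "obrigado" in lowered:
        return "pt"
    if "danke" in lowered:
        return "de"
    return "und"  # undetermined


posts = [("Obrigado!", "BR"), ("Danke schön", "BR"), ("danke", "DE"), ("hello", "JP")]
training_data = build_training_data(posts, toy_classifier)
```

Only the first and third posts survive the filter here; the other two are dropped because the two classifications disagree.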

Type: Application

Filed: July 17, 2013

Publication date: January 1, 2015

Inventors: Moises Goldszmidt, Marc Najork, Stelios Paparizos
Social Network Recommended Content and Recommending Members for Personalized Search Results

Publication number: 20130086057

Abstract: Architecture that provides a data structure to facilitate personalized ranking over recommended content (e.g., documents). The data structure approximates the social distance of the searching user to the content at query time. A graph is created of content recommended by members of the social network, where the nodes of the graph include content nodes (for the content) and recommending member nodes (for members of the social network who recommended the content). If a member recommends content, an edge is created between the member node and the content node. If a member is a “friend” (tagged as related in some way) of another member, an edge is created between the two member nodes. Each node is converted to a lower dimensional feature set. Feature sets of the content are indexed and the feature set of the searching user is utilized to match and rank the search results at query time.

Type: Application

Filed: October 4, 2011

Publication date: April 4, 2013

Applicant: Microsoft Corporation

Inventors: Timothy Harrington, Rajesh Shenoy, Marc Najork, Rina Panigrahy
Incremental update scheme for hyperlink database

Publication number: 20070250480

Abstract: A database of hyperlinks, stored in a hyperlink store or distributed across multiple machines such as a scalable hyperlink store, may be incrementally updated. When data is added, instead of modifying an existing data store, a hierarchy of data stores is built. The data stores are merged together, such that a new store is a suffix on an old store. Additions and updates go into new stores, which are relatively small. Lookups consult new stores first. A background thread merges adjacent stores. For example, a batch of updates is collected and incorporated into a new store and then the store is sealed. Subsequent updates are added to yet another new store. Stores are merged occasionally to prevent the chain of stores from becoming too long. Once the batch has been integrated, the new stores are sealed and are used to answer subsequent queries.
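
The store hierarchy in this abstract can be sketched as a chain of sealed stores. This is a simplified, single-process illustration: the `StoreChain` class, its method names, and the `max_stores` limit are hypothetical, and the merge runs synchronously here rather than in a background thread as the abstract describes.

```python
from typing import Dict, List, Optional, Set


class StoreChain:
    """Toy chain of sealed stores; the newest store is last in the list."""

    def __init__(self, max_stores: int = 4) -> None:
        self.stores: List[Dict[str, Set[str]]] = []
        self.max_stores = max_stores  # hypothetical bound on chain length

    def add_batch(self, updates: Dict[str, Set[str]]) -> None:
        """Seal a batch of updates as a new store instead of modifying old ones."""
        self.stores.append({url: set(links) for url, links in updates.items()})
        while len(self.stores) > self.max_stores:
            self.merge_oldest_pair()

    def merge_oldest_pair(self) -> None:
        """Merge the two oldest adjacent stores; newer entries win."""
        if len(self.stores) < 2:
            return
        merged = dict(self.stores[0])
        merged.update(self.stores[1])
        self.stores[:2] = [merged]

    def lookup(self, url: str) -> Optional[Set[str]]:
        """Consult the newest stores first and return the first match."""
        for store in reversed(self.stores):
            if url in store:
                return store[url]
        return None


chain = StoreChain(max_stores=2)
chain.add_batch({"a.example": {"b.example"}})
chain.add_batch({"a.example": {"c.example"}, "d.example": {"a.example"}})
chain.add_batch({"e.example": {"a.example"}})
```

After the third batch the two oldest stores are merged, so the chain holds two stores and a lookup of `a.example` returns the newer link set.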

Type: Application

Filed: April 19, 2006

Publication date: October 25, 2007

Applicant: Microsoft Corporation

Inventor: Marc Najork
Fault tolerance scheme for distributed hyperlink database

Publication number: 20070220064

Abstract: Fault tolerance is provided for a database of hyperlinks distributed across multiple machines, such as a scalable hyperlink store. The fault tolerance enables the distributed database to continue operating, with brief interruptions, even when some of the machines in the cluster have failed. A primary database is provided for normal operation, and a secondary database is provided for operation in the presence of failures.

Type: Application

Filed: March 17, 2006

Publication date: September 20, 2007

Applicant: Microsoft Corporation

Inventor: Marc Najork
Domain-based spam-resistant ranking

Publication number: 20070067282

Abstract: A domain-based spam-resistant ranking architecture that computes trust in a domain based on web-servers on which a domain is hosted and a set of other domains that link to the domain. The ranks of pages are computed based on how much trust there is in each domain and which pages link to it. Web documents are ranked in a spam-resistant manner by assigning uniform significance to each IP address of a network location and then assigning trust values to domains hosted on those IP addresses. Then, based on a domain graph, the invention constructs a domain-rank which is an estimate of how authoritative the domain is. The domain ranks are then used to assign a minimum rank to each document.

Type: Application

Filed: September 20, 2005

Publication date: March 22, 2007

Applicant: Microsoft Corporation

Inventors: Amit Prakash, Michael Narayan, Darren Shakib, Marc Najork
