Patents by Inventor Marc A. Najork
Marc A. Najork has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20240046684
Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
Type: Application
Filed: October 19, 2023
Publication date: February 8, 2024
Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
-
Patent number: 11830269
Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
Type: Grant
Filed: July 18, 2022
Date of Patent: November 28, 2023
Assignee: GOOGLE LLC
Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
-
Publication number: 20230297783
Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.
Type: Application
Filed: May 22, 2023
Publication date: September 21, 2023
Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
-
Publication number: 20230267277
Abstract: Systems and methods of the present disclosure are directed to a method for training a machine-learned semantic matching model. The method can include obtaining a first and second document and a first and second activity log. The method can include determining, based on the first document activity log and the second document activity log, a relation label indicative of whether the documents are related. The method can include inputting the documents into the model to receive a semantic similarity value representing an estimated semantic similarity between the first document and the second document. The method can include evaluating a loss function that evaluates a difference between the relation label and the semantic similarity value. The method can include modifying values of parameters of the model based on the loss function.
Type: Application
Filed: June 15, 2020
Publication date: August 24, 2023
Inventors: Weize Kong, Michael Bendersky, Marc Najork, Rama Kumar Pasumarthi, Zhen Qin, Rolf Jagerman
-
Patent number: 11694034
Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.
Type: Grant
Filed: October 23, 2020
Date of Patent: July 4, 2023
Assignee: GOOGLE LLC
Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
-
Patent number: 11526752
Abstract: Provided are computing systems and methods directed to active learning and may provide advantages or improvements to active learning applications for skewed data sets. A challenge in training and developing high-quality models for many supervised learning scenarios is obtaining labeled training examples. Provided are systems and methods for active learning on a training dataset that includes both labeled and unlabeled datapoints. In particular, the systems and methods described herein can select (e.g., at each of a number of iterations) a number of the unlabeled datapoints for which labels should be obtained to gain additional labeled datapoints on which to train a machine-learned model (e.g., machine-learned classifier model). Generally, provided are cost-effective methods and systems for selecting data to improve machine-learned models in applications such as the identification of content items in text, images, and/or audio.
Type: Grant
Filed: January 23, 2020
Date of Patent: December 13, 2022
Assignee: GOOGLE LLC
Inventors: Qi Zhao, Abbas Kazerouni, Sandeep Tata, Jing Xie, Marc Najork
-
Publication number: 20220375245
Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
Type: Application
Filed: July 18, 2022
Publication date: November 24, 2022
Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
-
Patent number: 11393233
Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
Type: Grant
Filed: June 2, 2020
Date of Patent: July 19, 2022
Assignee: GOOGLE LLC
Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
-
Publication number: 20220129638
Abstract: Systems and methods of the present disclosure are directed to a method for predicting semantic similarity between documents. The method can include obtaining a first document and a second document. The method can include parsing the first document into a plurality of first textual blocks and the second document into a plurality of second textual blocks. The method can include processing each of the plurality of first textual blocks and the second textual blocks with a machine-learned semantic document encoding model to obtain a first document encoding and a second document encoding. The method can include determining a similarity metric descriptive of a semantic similarity between the first document and the second document based on the first document encoding and the second document encoding.
Type: Application
Filed: October 23, 2020
Publication date: April 28, 2022
Inventors: Liu Yang, Marc Najork, Michael Bendersky, Mingyang Zhang, Cheng Li
-
Publication number: 20210374395
Abstract: The present disclosure is directed to extracting text from form-like documents. In particular, a computing system can obtain an image of a document that contains a plurality of portions of text. The computing system can extract one or more candidate text portions for each field type included in a target schema. The computing system can generate a respective input feature vector for each candidate for the field type. The computing system can generate a respective candidate embedding for the candidate text portion. The computing system can determine a respective score for each candidate text portion for the field type based at least in part on the respective candidate embedding for the candidate text portion. The computing system can assign one or more of the candidate text portions to the field type based on the respective scores.
Type: Application
Filed: June 2, 2020
Publication date: December 2, 2021
Inventors: Sandeep Tata, Bodhisattwa Prasad Majumder, Qi Zhao, James Bradley Wendt, Marc Najork, Navneet Potti
-
Publication number: 20200250527
Abstract: The present disclosure provides computing systems and methods directed to active learning and may provide advantages or improvements to active learning applications for skewed data sets. A challenge in training and developing high-quality models for many supervised learning scenarios is obtaining labeled training examples. This disclosure provides systems and methods for active learning on a training dataset that includes both labeled and unlabeled datapoints. In particular, the systems and methods described herein can select (e.g., at each of a number of iterations) a number of the unlabeled datapoints for which labels should be obtained to gain additional labeled datapoints on which to train a machine-learned model (e.g., machine-learned classifier model). Generally, the disclosure provides cost-effective methods and systems for selecting data to improve machine-learned models in applications such as the identification of content items in text, images, and/or audio.
Type: Application
Filed: January 23, 2020
Publication date: August 6, 2020
Inventors: Qi Zhao, Abbas Kazerouni, Sandeep Tata, Jing Xie, Marc Najork
-
Patent number: 8949232
Abstract: Architecture that provides a data structure to facilitate personalized ranking over recommended content (e.g., documents). The data structure approximates the social distance of the searching user to the content at query time. A graph is created of content recommended by members of the social network, where the nodes of the graph include content nodes (for the content) and recommending member nodes (for members of the social network who recommended the content). If a member recommends content, an edge is created between the member node and the content node. If a member is a “friend” (tagged as related in some way) of another member, an edge is created between the two member nodes. Each node is converted to a lower dimensional feature set. Feature sets of the content are indexed and the feature set of the searching user is utilized to match and rank the search results at query time.
Type: Grant
Filed: October 4, 2011
Date of Patent: February 3, 2015
Assignee: Microsoft Corporation
Inventors: Timothy Harrington, Rajesh Shenoy, Marc Najork, Rina Panigrahy
-
Publication number: 20150006148
Abstract: Example apparatus and methods concern automatically creating labeled training data for automatic language identifiers. One embodiment includes logic to produce a predicted language classification for a post from geographic data associated with the post. The post may be associated with a micro-blog, a social media site, or other electronic communication service that traffics in short messages having frequent colloquialisms, non-standard spelling, emoticons, and unique usages of characters to convey meaning. The embodiment includes logic to produce an actual language classification for the post using a base language classifier. The embodiment includes logic to selectively add the post and a language label for the post to an automatically generated labeled training data upon determining that the predicted language classification matches the actual language classification.
Type: Application
Filed: July 17, 2013
Publication date: January 1, 2015
Inventors: Moises Goldszmidt, Marc Najork, Stelios Paparizos
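
The agreement filter described in this abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `build_training_data`, the post layout (a dict with `"text"` and `"geo"` keys), and the two callables standing in for the geographic predictor and the base language classifier are all hypothetical.

```python
def build_training_data(posts, predict_from_geo, base_classifier):
    """Keep only posts whose geo-predicted language matches the classifier output.

    posts: iterable of dicts with "text" and "geo" keys (assumed layout).
    predict_from_geo: callable mapping geographic data to a language code or None.
    base_classifier: callable mapping post text to a language code.
    """
    labeled = []
    for post in posts:
        predicted = predict_from_geo(post["geo"])
        actual = base_classifier(post["text"])
        # A post is added only when both signals agree on the language.
        if predicted is not None and predicted == actual:
            labeled.append((post["text"], actual))
    return labeled
```

Requiring agreement between two independent signals trades training-set size for label quality: posts on which the signals disagree are simply left out.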
-
Patent number: 8856112
Abstract: User accounts in a social networking application are divided into highly-connected accounts and regular accounts. A mapping of the highly-connected accounts to their friends, and a mapping of accounts to documents endorsed by the users associated with the accounts are stored on index servers of a search engine. When a query is received by a front-end server of the search engine, the front-end server determines if an account associated with the query is a highly-connected account. If it is, only an identifier of the account is sent to the index servers with the query. If it is not, however, then identifiers of all of the accounts that are friends with the account are sent with the query. The index servers then determine the documents that are endorsed by the friends of the account, and consider the determined documents when selecting documents that are responsive to the query.
Type: Grant
Filed: August 26, 2011
Date of Patent: October 7, 2014
Assignee: Microsoft Corporation
Inventors: Marc A. Najork, Rina Panigrahy, Rajesh K. Shenoy
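
As a rough illustration of the routing described in this abstract, the Python sketch below splits the work between a front-end function and an index-server function. The function names, the dict-based payload, and the in-memory mappings are assumptions made for the example and do not come from the patent.

```python
def frontend_payload(account, friends_of, highly_connected):
    """Decide what the front-end attaches to the query for this account."""
    if account in highly_connected:
        # Highly-connected account: send only its identifier; the index
        # servers already hold its friend list.
        return {"account": account}
    # Regular account: send the identifiers of all of its friends.
    return {"friends": sorted(friends_of[account])}


def endorsed_documents(payload, hc_friends, endorsements):
    """On an index server, collect documents endorsed by the account's friends."""
    if "account" in payload:
        friends = hc_friends[payload["account"]]
    else:
        friends = payload["friends"]
    docs = set()
    for friend in friends:
        docs |= endorsements.get(friend, set())
    return docs
```

The point of the split is that a long friend list never travels with the query: it is either short enough to send, or already stored where it is needed.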
-
Patent number: 8666920
Abstract: Sketches are generated for each node in a graph. For undirected graphs, each sketch for a node may include an indicator of a node from a seed set of nodes and the shortest distance between the node and the indicated node. When a request is received for the shortest distance between two nodes of the graph, the sketches for each of the two nodes are retrieved, and nodes that are indicated in both of the sketches are determined. The distances between each of the two nodes and a determined node as indicated in the sketches is summed for each of the determined nodes, and the sum having the least distance is selected as the estimated shortest distance between the two nodes.
Type: Grant
Filed: February 15, 2010
Date of Patent: March 4, 2014
Assignee: Microsoft Corporation
Inventors: Marc A. Najork, Sreenivas Gollapudi, Rina Panigrahy, Atish Das Sarma
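
The query step in this abstract (sum the two sketch distances for each shared node, take the minimum) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented method: the graph is an unweighted, undirected adjacency dict, every seed is treated as its own seed set of size one, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_sketches(graph, seeds):
    """For each node, record its distance to every seed it can reach."""
    sketches = {node: {} for node in graph}
    for seed in seeds:
        for node, d in bfs_distances(graph, seed).items():
            sketches[node][seed] = d
    return sketches


def estimate_distance(sketches, a, b):
    """Smallest d(a, s) + d(s, b) over seeds s that appear in both sketches."""
    common = sketches[a].keys() & sketches[b].keys()
    if not common:
        return None  # no shared seed, so no estimate is available
    return min(sketches[a][s] + sketches[b][s] for s in common)
```

By the triangle inequality the estimate is never below the true shortest distance, so it acts as an upper bound; it is exact whenever some shared seed lies on a shortest path between the two nodes.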
-
Publication number: 20130086057
Abstract: Architecture that provides a data structure to facilitate personalized ranking over recommended content (e.g., documents). The data structure approximates the social distance of the searching user to the content at query time. A graph is created of content recommended by members of the social network, where the nodes of the graph include content nodes (for the content) and recommending member nodes (for members of the social network who recommended the content). If a member recommends content, an edge is created between the member node and the content node. If a member is a “friend” (tagged as related in some way) of another member, an edge is created between the two member nodes. Each node is converted to a lower dimensional feature set. Feature sets of the content are indexed and the feature set of the searching user is utilized to match and rank the search results at query time.
Type: Application
Filed: October 4, 2011
Publication date: April 4, 2013
Applicant: Microsoft Corporation
Inventors: Timothy Harrington, Rajesh Shenoy, Marc Najork, Rina Panigrahy
-
Patent number: 8392366
Abstract: The number of machines in a cluster of computers running a distributed database, such as a scalable hyperlink datastore or a distributed hyperlink database, may be changed such that machines may be added or removed. The data is not repartitioned all at once. Instead, only new and merged data stores are mapped to the changed set of machines. A database update mechanism may be leveraged to change the number of machines in a distributed database.
Type: Grant
Filed: August 29, 2006
Date of Patent: March 5, 2013
Assignee: Microsoft Corporation
Inventor: Marc A. Najork
-
Publication number: 20130054640
Abstract: User accounts in a social networking application are divided into highly-connected accounts and regular accounts. A mapping of the highly-connected accounts to their friends, and a mapping of accounts to documents endorsed by the users associated with the accounts are stored on index servers of a search engine. When a query is received by a front-end server of the search engine, the front-end server determines if an account associated with the query is a highly-connected account. If it is, only an identifier of the account is sent to the index servers with the query. If it is not, however, then identifiers of all of the accounts that are friends with the account are sent with the query. The index servers then determine the documents that are endorsed by the friends of the account, and consider the determined documents when selecting documents that are responsive to the query.
Type: Application
Filed: August 26, 2011
Publication date: February 28, 2013
Applicant: Microsoft Corporation
Inventors: Marc A. Najork, Rina Panigrahy, Rajesh K. Shenoy
-
Publication number: 20120299925
Abstract: A graph is generated based on a social networking application with a node for each user account, and one or more edges representing the social networking relationships between the user accounts (e.g., friends). A sketch is generated for each node in iterations where edges are removed from the graph and a set of reachable nodes is determined for the node. A representative node is then selected from the set of reachable nodes and added to the sketch as a dimension. The generated sketches for two nodes are used to calculate an affinity score between the accounts associated with each of the two nodes.
Type: Application
Filed: May 23, 2011
Publication date: November 29, 2012
Applicant: Microsoft Corporation
Inventors: Marc A. Najork, Rina Panigrahy
-
Patent number: 8209305
Abstract: A database of hyperlinks, stored in a hyperlink store or distributed across multiple machines such as a scalable hyperlink store, may be incrementally updated. When data is added, instead of modifying an existing data store, a hierarchy of data stores is built. The data stores are merged together, such that a new store is a suffix on an old store. Additions and updates go into new stores, which are relatively small. Lookups consult new stores first. A background thread merges adjacent stores. For example, a batch of updates is collected and incorporated into a new store and then the store is sealed. Subsequent updates are added to yet another new store. Stores are merged occasionally to prevent the chain of stores from becoming too long. Once the batch has been integrated, the new stores are sealed and are used to answer subsequent queries.
Type: Grant
Filed: April 19, 2006
Date of Patent: June 26, 2012
Assignee: Microsoft Corporation
Inventor: Marc A. Najork
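
The chain-of-stores update scheme in this abstract can be sketched as a small in-memory Python class. Everything here is illustrative and assumed rather than taken from the patent: the class name `LayeredLinkStore`, the `batch_size` and `max_stores` parameters, merging synchronously instead of on a background thread, and letting lookups also see the batch that is still being collected.

```python
class LayeredLinkStore:
    """Toy chain of sealed stores; newer stores shadow older ones."""

    def __init__(self, batch_size=2, max_stores=3):
        self.batch_size = batch_size  # updates collected before sealing (assumed knob)
        self.max_stores = max_stores  # chain length that triggers a merge (assumed knob)
        self.sealed = []              # sealed stores, oldest first; never modified in place
        self.open = {}                # batch currently being collected

    def update(self, url, links):
        """Record the outgoing links of url in the open batch."""
        self.open[url] = tuple(links)
        if len(self.open) >= self.batch_size:
            self._seal()

    def _seal(self):
        """Turn the open batch into a new sealed store at the end of the chain."""
        self.sealed.append(dict(self.open))
        self.open = {}
        if len(self.sealed) > self.max_stores:
            self._merge_oldest_pair()

    def _merge_oldest_pair(self):
        """Merge the two oldest adjacent stores; the newer entry wins on conflict."""
        older, newer = self.sealed[0], self.sealed[1]
        self.sealed[0:2] = [{**older, **newer}]

    def lookup(self, url):
        """Consult the newest data first and fall back to older stores."""
        if url in self.open:
            return self.open[url]
        for store in reversed(self.sealed):
            if url in store:
                return store[url]
        return None
```

Because existing stores are only ever replaced by a merged copy, each update touches a small new store, and the occasional merge keeps the number of stores a lookup has to consult bounded.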