Patents Assigned to Databricks Inc.
  • Patent number: 12287698
    Abstract: A system for monitoring job execution includes an interface and a processor. The interface is configured to receive an indication to start a cluster processing job. The processor is configured to determine whether processing a data instance associated with the cluster processing job satisfies a watchdog criterion; and in the event that processing the data instance satisfies the watchdog criterion, cause the processing of the data instance to be killed.
    Type: Grant
    Filed: May 22, 2023
    Date of Patent: April 29, 2025
    Assignee: Databricks, Inc.
    Inventors: Alicja Luszczak, Srinath Shankar, Shi Xin
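    Illustrative sketch (not taken from the patent): one way to picture the watchdog described in the abstract is a supervisor that inspects progress statistics after each batch and stops the work once a criterion is met. The `Stats` fields, the output-to-input row ratio used as the criterion, and all function names below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Generator


@dataclass
class Stats:
    """Hypothetical progress counters reported by a running task."""
    rows_read: int = 0
    rows_emitted: int = 0


def run_with_watchdog(
    task: Generator[Stats, None, None],
    criterion: Callable[[Stats], bool],
) -> str:
    """Drive the task batch by batch; kill it if the criterion is satisfied."""
    for stats in task:
        if criterion(stats):
            task.close()  # stop the generator, i.e. "kill" the processing
            return "killed"
    return "completed"


def ratio_exceeds(max_ratio: float) -> Callable[[Stats], bool]:
    """Assumed criterion: emitted rows per input row is above max_ratio."""
    return lambda s: s.rows_read > 0 and s.rows_emitted / s.rows_read > max_ratio


def exploding_task(batches: int) -> Generator[Stats, None, None]:
    stats = Stats()
    for _ in range(batches):
        stats.rows_read += 10
        stats.rows_emitted += 10_000  # output grows far faster than input
        yield stats


def steady_task(batches: int) -> Generator[Stats, None, None]:
    stats = Stats()
    for _ in range(batches):
        stats.rows_read += 10
        stats.rows_emitted += 10
        yield stats


if __name__ == "__main__":
    print(run_with_watchdog(exploding_task(5), ratio_exceeds(100)))
    print(run_with_watchdog(steady_task(5), ratio_exceeds(100)))
```

    A real system would terminate a remote task rather than close a local generator; the generator only keeps the example self-contained.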
  • Patent number: 12277237
    Abstract: The present application discloses a method, system, and computer system for providing access to information stored on a system for data storage. The method includes receiving a data request from a user, determining data corresponding to the data request, determining whether the user has requisite permissions to access the data, and in response to determining that the user has requisite permissions to access the data: determining a manner by which to provide access to the data, wherein the data comprises a filtered subset of stored data, and generating a token based at least in part on the user and the manner by which access to the data is to be provided.
    Type: Grant
    Filed: October 29, 2021
    Date of Patent: April 15, 2025
    Assignee: Databricks, Inc.
    Inventors: Matei Zaharia, David Lewis, Cheng Lian, Yuchen Huo, Ali Ghodsi
  • Patent number: 12260003
    Abstract: A data processing service facilitates the creation and processing of data processing pipelines that process data processing jobs defined with respect to a set of tasks in a sequence and with data dependencies associated with each separate task such that the output from one task is used as input for a subsequent task. In various embodiments, the set of tasks include at least one cleanroom task that is executed in a cleanroom station and at least one non-cleanroom task executed in an execution environment of a user where each task is configured to read one or more input datasets and transform the one or more input datasets into one or more output datasets.
    Type: Grant
    Filed: September 26, 2023
    Date of Patent: March 25, 2025
    Assignee: Databricks, Inc.
    Inventors: William Chau, Abhijit Chakankar, Stephen Michael Mahoney, Daniel Seth Morris, Itai Shlomo Weiss
  • Patent number: 12248818
    Abstract: The present application discloses a method, system, and computer system for starting up and maintaining a cluster in a warmed up state, and/or allocating clusters from a warmed up state. The method includes instantiating a set of virtual machines, wherein instantiating the set of virtual machines includes setting a temporary security credential for each virtual machine of the set of virtual machines, receiving a virtual machine allocation request associated with a workspace, a customer, or a tenant, and, in response to the virtual machine allocation request, allocating a virtual machine, wherein allocating the virtual machine comprises replacing the temporary security credential with a security credential associated with the workspace, the customer, or the tenant.
    Type: Grant
    Filed: October 29, 2021
    Date of Patent: March 11, 2025
    Assignee: Databricks, Inc.
    Inventors: Yandong Mao, Aaron Daniel Davidson
  • Patent number: 12242441
    Abstract: The present application discloses a method, system, and computer system for managing lineage data for data entities. The method includes generating lineage data, and storing and indexing, in a data structure, the lineage data in association with the selected data entity. Generating the lineage data includes selecting a selected data entity, obtaining a query tree that was used to generate the selected data entity, and determining lineage data for the selected data entity based at least in part on the query tree.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: March 4, 2025
    Assignee: Databricks, Inc.
    Inventors: Tao Feng, Menglei Sun, Zhuoying Wang
  • Patent number: 12242485
    Abstract: Disclosed herein is a method, system, or non-transitory computer readable medium for evaluating a query on a columnar dataset comprising one or more dictionaries associated with columns in the dataset. The method includes receiving a request to perform a query comprising at least an operator and a request to return information about a value of interest in a columnar dataset stored on cloud storage. At least one column in the columnar dataset is based on a dictionary. The dictionary maps one or more values for a column to one or more respective identifiers. The method determines whether to perform dictionary filtering for the query by calculating a metric based on one or more factors. Responsive to the metric being below a threshold, which may be predetermined, the method performs the dictionary filtering.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: March 4, 2025
    Assignee: Databricks, Inc.
    Inventors: Utkarsh Agarwal, Shoumik Palkar, Alexander Behm, Sriram Krishnamurthy
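    Illustrative sketch (not taken from the patent): the abstract does not say which factors feed the metric, so the example below assumes a simple one, the ratio of dictionary size to row count, and applies dictionary filtering only when that ratio is below a threshold. The function names and the default threshold are hypothetical.

```python
from typing import List


def should_filter_on_dictionary(dict_size: int, num_rows: int,
                                threshold: float = 0.1) -> bool:
    """Assumed metric: distinct values per row; small means the dictionary is cheap to check."""
    return num_rows > 0 and dict_size / num_rows < threshold


def find_rows(dictionary: List[str], ids: List[int], value: str,
              threshold: float = 0.1) -> List[int]:
    """Return the row positions whose decoded value equals `value`."""
    if should_filter_on_dictionary(len(dictionary), len(ids), threshold):
        # Dictionary filtering: consult the dictionary before touching rows.
        if value not in dictionary:
            return []  # the value cannot occur, so the column scan is skipped
        target = dictionary.index(value)
        return [row for row, i in enumerate(ids) if i == target]
    # Fallback: decode every row and compare.
    return [row for row, i in enumerate(ids) if dictionary[i] == value]


if __name__ == "__main__":
    dictionary = ["us", "de", "fr"]   # identifier -> value
    ids = [0, 1, 0, 0, 2, 1] * 10     # 60 encoded rows
    print(find_rows(dictionary, ids, "jp"))
    print(find_rows(dictionary, ids, "fr")[:3])
```

    Both branches return the same rows; the metric only decides whether the dictionary lookup is worth doing first.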
  • Patent number: 12229169
    Abstract: The disclosed configurations provide a method (and/or a computer-readable medium or system) for determining, from a table schema describing keys of a data table, one or more clustering keys that can be used to cluster data files of a data table. The method includes generating features for the data table, generating tokens from the features, generating a prediction for each token by applying to the token a machine-learned transformer model trained to predict a likelihood that the key associated with the token is a clustering key for the data table, determining clustering keys based on the predictions, and clustering data records of the data table into data files based on key-values for the clustering keys.
    Type: Grant
    Filed: November 3, 2023
    Date of Patent: February 18, 2025
    Assignee: Databricks, Inc.
    Inventors: Terry Kim, Lin Ma, Rahul Shivu Mahadev, Rahul Potharaju
  • Patent number: 12229137
    Abstract: A system performs efficient startup of executors of a distributed computing engine used for processing queries, for example, database queries. The system starts an executor node and processes a set of queries using the executor node to warm up the executor node. The system performs a checkpoint of the warmed-up executor node to create an image. The image is restored in the target executor nodes. The system may store a checkpoint image for each configuration of an executor node. The configuration is determined based on various factors including the hardware of the executor node, memory allocation of the processes, and so on. The use of restore based on checkpoint images improves the efficiency of executor node startup.
    Type: Grant
    Filed: January 12, 2024
    Date of Patent: February 18, 2025
    Assignee: Databricks, Inc.
    Inventors: Xinyang Ge, Lixiang Ao, Haonan Jing, Aaron Daniel Davidson
  • Patent number: 12210521
    Abstract: A cluster computing system maintains a first set of queues for short queries and a second set of queues for longer queries. The first set is allocated a majority of the cluster's processing resources and processes queries on a first in first out basis. The second set is allocated a minority of the cluster's processing resources which are shared among queries in the second set. Accordingly, the system assigns each query to the first set of queues for a fixed amount of resource time. While a query is processing, the system monitors the query's resource time and reassigns the query to the second set of queues if the query has not completed within the allotted amount of resource time. Thus, short queries receive the necessary resources to complete quickly without getting stuck behind longer queries while ensuring that longer queries continue to make progress.
    Type: Grant
    Filed: April 27, 2023
    Date of Patent: January 28, 2025
    Assignee: Databricks, Inc.
    Inventors: Venkata Sai Akhil Gudesa, Herman Rudolf Petrus Catharina van Hövell tot Westerflier, Supun Chathuranga Nakandala
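    Illustrative sketch (not taken from the patent): the scheduling idea in the abstract can be simulated with two queues and a tick loop. The single-queue-per-set simplification, the per-tick resource shares (3 units for the fast queue, 1 for the slow queue), and the round-robin sharing in the slow queue are all assumptions chosen to keep the example small.

```python
from collections import deque
from typing import Deque, List, Tuple


def simulate(queries: List[Tuple[str, int]], budget: int,
             fast_share: int = 3, slow_share: int = 1) -> List[str]:
    """Return query names in completion order.

    queries: (name, total work units). Every query starts in the fast queue
    and is demoted to the slow queue once it has used `budget` units.
    """
    fast: Deque[Tuple[str, int, int]] = deque((n, w, 0) for n, w in queries)
    slow: Deque[Tuple[str, int]] = deque()
    done: List[str] = []

    while fast or slow:
        if fast:  # FIFO: only the head of the fast queue runs this tick
            name, remaining, used = fast.popleft()
            step = min(fast_share, remaining, budget - used)
            remaining -= step
            used += step
            if remaining == 0:
                done.append(name)
            elif used >= budget:
                slow.append((name, remaining))  # budget exhausted: demote
            else:
                fast.appendleft((name, remaining, used))
        if slow:  # round-robin: slow queries take turns on the minority share
            name, remaining = slow.popleft()
            remaining -= min(slow_share, remaining)
            if remaining == 0:
                done.append(name)
            else:
                slow.append((name, remaining))
    return done


if __name__ == "__main__":
    print(simulate([("long", 20), ("short1", 2), ("short2", 3)], budget=4))
```

    In this run the long query arrives first but is demoted after using its budget, so both short queries finish ahead of it while it keeps progressing in the slow queue.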
  • Patent number: 12210528
    Abstract: Disclosed herein is a method, system, or non-transitory computer readable medium for evaluating a query on a columnar dataset comprising one or more dictionaries associated with columns in the dataset. The method includes receiving a request to perform a query comprising at least an operator for a columnar dataset on cloud storage. At least one column in the dataset is based on a dictionary, and the dictionary maps one or more values for a column to one or more respective identifiers. The method evaluates the operator on one or more values of the dictionary to generate an updated dictionary comprising updated values. The method may decode the updated dictionary into an updated column comprising updated data values.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: January 28, 2025
    Assignee: Databricks, Inc.
    Inventors: Utkarsh Agarwal, Shoumik Palkar, Alexander Behm, Sriram Krishnamurthy
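    Illustrative sketch (not taken from the patent): the benefit of evaluating an operator on the dictionary is that it runs once per distinct value rather than once per row. The helper names and the list-based dictionary layout (list index serves as the identifier) are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary-encode a column: returns (dictionary, per-row identifiers)."""
    value_to_id: Dict[str, int] = {}
    ids: List[int] = []
    for v in values:
        if v not in value_to_id:
            value_to_id[v] = len(value_to_id)
        ids.append(value_to_id[v])
    # dicts preserve insertion order, so position in the list equals the identifier
    return list(value_to_id), ids


def apply_on_dictionary(dictionary: List[str],
                        op: Callable[[str], str]) -> List[str]:
    """Evaluate the operator once per distinct value, not once per row."""
    return [op(v) for v in dictionary]


def decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Expand identifiers back into a plain column."""
    return [dictionary[i] for i in ids]


if __name__ == "__main__":
    column = ["us", "de", "us", "us", "fr", "de"]
    dictionary, ids = encode(column)
    updated = apply_on_dictionary(dictionary, str.upper)  # 3 calls for 6 rows
    print(decode(updated, ids))
```

    The identifiers are left untouched; only the dictionary changes, and decoding is deferred until a plain column is actually needed.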
  • Patent number: 12204510
    Abstract: Disclosed is a configuration for managing the organization of data tables in cloud-based storage. The configuration receives metrics for data processing operations on the data table. Metrics include at least one of a size of the data table, a size of each file in the data table, and metadata describing the data table. The configuration automatically executes a cost-benefit analysis based on the one or more metrics for each candidate maintenance operation in a plurality of candidate maintenance operations. The configuration automatically selects a maintenance operation from the candidate maintenance operations to automate based on the cost-benefit analysis of the one or more candidate maintenance operations. The selected maintenance operation is automated and scheduled on the data table.
    Type: Grant
    Filed: May 8, 2023
    Date of Patent: January 21, 2025
    Assignee: Databricks, Inc.
    Inventors: Vijayan Prabhakaran, Himanshu Raja, Rahul Potharaju, Naga Raju Bhanoori, Lin Ma, Rajesh Parangi Sharabhalingappa, Jintian Liang, Zachary Vaughn Schuermann, Kam Cheung Ting
  • Patent number: 12204523
    Abstract: A system for retrieving and caching metadata from a remote data source is described. The system may receive a request from a client device. The request is to perform a query operation on a set of data objects stored in the remote data source. The system may access a metadata cache storing metadata information on one or more data objects of the remote data source and identify metadata corresponding to the set of data objects for the query operation in the metadata cache. The system may determine whether the identified metadata for the set of data objects meets an update condition. In response to the identified metadata meeting the update condition, the system may fetch updated metadata for at least the set of data objects from the remote data source, and store the updated metadata in the metadata cache.
    Type: Grant
    Filed: April 14, 2023
    Date of Patent: January 21, 2025
    Assignee: Databricks, Inc.
    Inventors: Zhaoxing Li, Rayman Preet Singh, Fuat Can Efeoglu, Daniel Tenedorio, Sarah Cai
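    Illustrative sketch (not taken from the patent): the abstract leaves the update condition open, so the example below assumes the simplest one, a maximum age for cached entries. The class name, the `fetch` callback standing in for the remote data source, and the injectable clock are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Caches per-object metadata and refreshes entries that are missing or too old."""

    def __init__(self, fetch: Callable[[str], dict], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # stand-in for a call to the remote source
        self._max_age = max_age_seconds  # assumed update condition: entry age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, object_name: str) -> dict:
        now = self._clock()
        entry = self._entries.get(object_name)
        if entry is None or now - entry[0] > self._max_age:
            metadata = self._fetch(object_name)  # update condition met: refetch
            self._entries[object_name] = (now, metadata)
            return metadata
        return entry[1]                          # fresh enough: serve from cache


if __name__ == "__main__":
    fetches = []

    def fake_remote(name: str) -> dict:
        fetches.append(name)
        return {"name": name, "version": len(fetches)}

    cache = MetadataCache(fake_remote, max_age_seconds=60)
    cache.get("sales")
    cache.get("sales")   # second call is served from the cache
    print(len(fetches))
```

    Other update conditions, such as a version check against the remote source, would slot into the same `if` test.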
  • Patent number: 12197400
    Abstract: A data processing service receives a request from a first collaborator to create a clean room for data sharing collaboration with at least a second collaborator. In response, the data processing service creates an execution environment separate from the data environment of the first collaborator and the second collaborator. The first and second collaborators can then add content into the clean room in the form of data tables and executable notebooks. Approval from each collaborator is required before a notebook can be executed using any data table shared into the clean room. Upon receiving notebook approval from each collaborator, the data processing service creates a notebook job to execute the notebook on one or more cluster computing resources of the data processing service to generate an output.
    Type: Grant
    Filed: September 25, 2023
    Date of Patent: January 14, 2025
    Assignee: Databricks, Inc.
    Inventors: William Chau, Abhijit Chakankar, Stephen Michael Mahoney, Daniel Seth Morris, Itai Shlomo Weiss
  • Patent number: 12189628
    Abstract: The present application discloses a method, system, and computer system for parsing files. The method includes receiving an indication that a first file is to be processed, determining to begin processing the first file using a first processing engine based at least in part on one or more predefined heuristics, indicating to process the first file using the first processing engine, determining whether a particular error in processing the first file using the first processing engine has been detected, in response to determining that the particular error has been detected, indicating to stop processing the first file using the first processing engine and indicating to continue processing using a second processing engine, and storing in memory information obtained based on processing the first file by one or more of the first processing engine and the second processing engine.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: January 7, 2025
    Assignee: Databricks, Inc.
    Inventors: Prashanth Menon, Alexander Behm, Sriram Krishnamurthy
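    Illustrative sketch (not taken from the patent): the fallback pattern in the abstract can be shown with two toy parsers. The engines, the `FastPathError` exception standing in for the "particular error", and the record format are invented for the example; the heuristic that chooses the first engine is omitted.

```python
from typing import List, Tuple


class FastPathError(Exception):
    """The particular error that signals the fast engine cannot handle the input."""


def fast_parse(line: str) -> int:
    """Hypothetical fast engine: accepts plain digit strings only."""
    if not line.isdigit():
        raise FastPathError(line)
    return int(line)


def robust_parse(line: str) -> int:
    """Hypothetical second engine: also tolerates thousands separators."""
    return int(line.strip().replace(",", ""))


def parse_file(lines: List[str]) -> Tuple[List[int], str]:
    """Start with the fast engine; on FastPathError, continue with the robust one."""
    results: List[int] = []
    use_fast = True
    for line in lines:
        if use_fast:
            try:
                results.append(fast_parse(line))
                continue
            except FastPathError:
                use_fast = False  # stop using the first engine from here on
        results.append(robust_parse(line))
    return results, "fast" if use_fast else "robust"


if __name__ == "__main__":
    print(parse_file(["1", "2", "1,000", "4"]))
```

    Records parsed before the error are kept, so the second engine picks up at the failing record instead of restarting the file.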
  • Patent number: 12189607
    Abstract: A system includes an interface and a processor. The interface is configured to receive a table indication of a data table and to receive a transaction indication to perform a transaction. The processor is configured to determine a current position N in a transaction log; determine a current state of the metadata; determine a read set associated with a transaction; attempt to write an update to the transaction log associated with a next position N+1; in response to a transaction determination that a simultaneous transaction associated with the next position N+1 already exists, determine a set of updated files; and in response to a determination that there is not an overlap between the read set associated with the current transaction and the set of updated files associated with the simultaneous transaction, attempt to write the update to the transaction log associated with a further position N+2.
    Type: Grant
    Filed: August 22, 2023
    Date of Patent: January 7, 2025
    Assignee: Databricks, Inc.
    Inventors: Michael Paul Armbrust, Shixiong Zhu, Burak Yavuz
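    Illustrative sketch (not taken from the patent): the commit procedure in the abstract resembles optimistic concurrency control over an append-only log. The in-memory `TransactionLog`, the `ConflictError` exception, and the loop that keeps retrying past N+2 are assumptions; the abstract itself describes only the N+1 and N+2 attempts.

```python
from typing import Dict, Iterable, Set


class ConflictError(Exception):
    """Raised when a concurrent commit touched files this transaction read."""


class TransactionLog:
    """In-memory stand-in for a log with one entry per position."""

    def __init__(self) -> None:
        self.entries: Dict[int, Set[str]] = {}  # position -> files updated

    def try_write(self, position: int, updated_files: Iterable[str]) -> bool:
        """Write only if the position is still free (put-if-absent)."""
        if position in self.entries:
            return False
        self.entries[position] = set(updated_files)
        return True


def commit(log: TransactionLog, current_position: int,
           read_set: Iterable[str], updated_files: Iterable[str]) -> int:
    """Try position N+1; if taken and non-conflicting, move on to the next one."""
    reads = set(read_set)
    position = current_position + 1
    while not log.try_write(position, updated_files):
        winner_files = log.entries[position]
        if winner_files & reads:
            raise ConflictError(f"position {position} overlaps the read set")
        position += 1  # no overlap: retry at the following position
    return position


if __name__ == "__main__":
    log = TransactionLog()
    print(commit(log, 0, read_set={"x.parquet"}, updated_files={"a.parquet"}))
    # A second writer that also started from position 0 but read other files:
    print(commit(log, 0, read_set={"b.parquet"}, updated_files={"c.parquet"}))
```

    A writer whose read set includes `a.parquet` and who also started from position 0 would get a `ConflictError` instead of a new position.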
  • Patent number: 12189625
    Abstract: A multi-cluster computing system which includes a query result caching system is presented. The multi-cluster computing system may include a data processing service and client devices communicatively coupled over a network. The data processing service may include a control layer and a data layer. The control layer may be configured to receive and process requests from the client devices and manage resources in the data layer. The data layer may be configured to include instances of clusters of computing resources for executing jobs. The data layer may include a data storage system, which further includes a remote query result cache store. The query result cache store may include a cloud storage query result cache which stores data associated with results of previously executed requests. As such, when a cluster encounters a previously executed request, the cluster may efficiently retrieve the cached result of the request from the in-memory query result cache or the cloud storage query result cache.
    Type: Grant
    Filed: July 14, 2023
    Date of Patent: January 7, 2025
    Assignee: Databricks, Inc.
    Inventors: Bogdan Ionut Ghit, Saksham Garg, Christian Stuart, Christopher Stevens
  • Patent number: 12182292
    Abstract: The present application discloses a method, system, and computer system for providing access to data. The method includes receiving, by a data manager service from a data requesting service, a request using an identifier for a high-level data object to access a set of data associated with the high-level data object, determining, by the data manager service, low-level data object(s) corresponding to the set of data based on the identifier for the high-level data object, determining whether a user associated with the request has permission to access at least a subset of the low-level data object(s), and in response to determining that the user associated with the request has permission to access the at least the subset of the low-level data object(s), generating, by the data manager service, a uniform resource locator (URL) via which the at least the subset of the one or more low-level data objects is accessible by the user.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: December 31, 2024
    Assignee: Databricks, Inc.
    Inventors: Matei Zaharia, Shixiong Zhu, Xiaotong Sun, Ramesh Chandra, Michael Paul Armbrust, Ali Ghodsi
  • Patent number: 12153558
    Abstract: A system includes a plurality of computing units. A first computing unit of the plurality of computing units comprises: a communication interface configured to receive an indication to roll up data in a data table; and a processor coupled to the communication interface and configured to: build a preaggregated hash table based at least in part on a set of columns and the data table by aggregating input rows of the data table; for each preaggregated hash table entry of the preaggregated hash table: provide the preaggregated hash table entry to a second computing unit of the plurality of computing units based at least in part on a distribution hash value; receive a set of received entries from computing units of the plurality of computing units; and build an aggregation hash table based at least in part on the set of received entries by aggregating the set of received entries.
    Type: Grant
    Filed: January 31, 2023
    Date of Patent: November 26, 2024
    Assignee: Databricks, Inc.
    Inventors: Alexander Behm, Ankur Dave
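    Illustrative sketch (not taken from the patent): the two-phase aggregation in the abstract can be simulated in one process, with each inner list standing in for the rows held by one computing unit. Summation as the aggregate, CRC32 as the distribution hash, and all names are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (grouping key, value to sum)


def aggregate(rows: List[Row]) -> Dict[str, int]:
    """Build a hash table of per-key sums."""
    table: Dict[str, int] = defaultdict(int)
    for key, amount in rows:
        table[key] += amount
    return dict(table)


def owner(key: str, num_units: int) -> int:
    """Distribution hash: every unit sends a given key to the same owner."""
    return zlib.crc32(key.encode("utf-8")) % num_units


def rollup(partitions: List[List[Row]]) -> List[Dict[str, int]]:
    """Phase 1: preaggregate locally. Phase 2: shuffle by hash and aggregate again."""
    num_units = len(partitions)
    inboxes: List[List[Row]] = [[] for _ in range(num_units)]
    for local_rows in partitions:
        for key, partial in aggregate(local_rows).items():
            inboxes[owner(key, num_units)].append((key, partial))
    return [aggregate(inbox) for inbox in inboxes]


if __name__ == "__main__":
    partitions = [
        [("a", 1), ("b", 2), ("a", 3)],  # rows held by unit 0
        [("a", 10), ("c", 5)],           # rows held by unit 1
    ]
    for unit, table in enumerate(rollup(partitions)):
        print(unit, table)
```

    Preaggregating before the shuffle means unit 0 sends one partial sum for key `a` instead of two rows, which is where the saving in transferred data comes from.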
  • Patent number: 12147555
    Abstract: The present application discloses a method, system, and computer system for providing access to data. The method includes receiving, by a data manager service from a data requesting service, a request using an identifier for a high-level data object to access a set of data associated with the high-level data object, determining, by the data manager service, low-level data object(s) corresponding to the set of data based on the identifier for the high-level data object, determining whether a user associated with the request has permission to access at least a subset of the low-level data object(s), and in response to determining that the user associated with the request has permission to access the at least the subset of the low-level data object(s), generating, by the data manager service, a uniform resource locator (URL) via which the at least the subset of the one or more low-level data objects is accessible by the user.
    Type: Grant
    Filed: April 29, 2022
    Date of Patent: November 19, 2024
    Assignee: Databricks, Inc.
    Inventors: Matei Zaharia, Shixiong Zhu, Xiaotong Sun, Ramesh Chandra, Michael Paul Armbrust, Ali Ghodsi
  • Patent number: 12147412
    Abstract: A disclosed configuration receives a first indication that a first transaction is committed to update a first subset of records in a data table at a first version to generate a second version of the data table and receives a second indication to commit a second transaction to update a second subset of records in a data file of the data table at the first version. The configuration determines a logical prerequisite based on whether the first subset of records changes content of one or more records in the second subset of records and determines a physical prerequisite based on whether the second subset of records corresponds to respective data records in data files of the second version of the data table. The configuration commits the second transaction to generate a third version of the data table by updating elements of a deletion vector if the prerequisites are satisfied.
    Type: Grant
    Filed: January 18, 2023
    Date of Patent: November 19, 2024
    Assignee: Databricks, Inc.
    Inventors: Bart Samwel, Christos Stavrakakis