Patents Assigned to Cloudera, Inc.
  • Patent number: 11899937
    Abstract: Systems and methods of a memory allocation buffer to reduce heap fragmentation. In one embodiment, the memory allocation buffer structures a memory arena dedicated to a target region that is one of a plurality of regions in a server in a database cluster such as an HBase cluster. The memory area has a chunk size (e.g., 2 MB) and an offset pointer. Data objects in write requests targeted to the region are received and inserted to the memory arena at a location specified by the offset pointer. When the memory arena is filled, a new one is allocated. When a MemStore of the target region is flushed, the entire memory arenas for the target region are freed up. This reduces heap fragmentation that is responsible for long and/or frequent garbage collection pauses.
    Type: Grant
    Filed: March 3, 2020
    Date of Patent: February 13, 2024
    Assignee: Cloudera, Inc.
    Inventor: Todd Lipcon
  • Patent number: 11768739
    Abstract: Scalable architectures, systems, and services are provided herein for creating manifest-based snapshots in distributed computing environments. In some embodiments, responsive to receiving a request to create a snapshot of a data object, a master node identifies multiple slave nodes on which a data object is stored in the cloud-computing platform and creates a snapshot manifest representing the snapshot of the data object. The snapshot manifest comprises a file including a listing of multiple file names in the snapshot manifest and reference information for locating the multiple files in the distributed database system. The snapshot can be created without disrupting I/O operations, e.g., in an online mode by various region servers as directed by the master node. Additionally, a log roll approach to creating the snapshot is also disclosed in which log files are marked. The replaying of log entries can reduce the probability of causal consistency in the snapshot.
    Type: Grant
    Filed: July 30, 2020
    Date of Patent: September 26, 2023
    Assignee: Cloudera, Inc.
    Inventors: Jonathan Ming-Cyn Hsieh, Matteo Bertozzi
  • Patent number: 11758029
    Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
    Type: Grant
    Filed: June 9, 2022
    Date of Patent: September 12, 2023
    Assignee: Cloudera, Inc.
    Inventors: David Alves, Todd Lipcon
  • Patent number: 11726743
    Abstract: A technique is described for merging multiple lists of ordinal elements such as keys into a sorted output. In an example embodiment, a merge window is defined, based on the bounds of the multiple lists of ordinal elements, that is representative of a portion of an overall element space associated with the multiple lists. Lists of elements to be sorted can be placed into one of at least two different heaps based on whether they overlap the merge window. For example, lists that overlap the merge window may be placed into an active or “hot” heap, while lists that do not overlap the merge window may be placed into a separate inactive or “cold” heap. A sorted output can then be generated by iteratively processing the active heap. As the processing of the active heap progresses, the merge window advances, and lists may move between the active and inactive heaps.
    Type: Grant
    Filed: April 11, 2022
    Date of Patent: August 15, 2023
    Assignee: Cloudera, Inc.
    Inventors: Adar Lieber-Dembo, Todd Lipcon
  • Patent number: 11663257
    Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
    Type: Grant
    Filed: July 2, 2021
    Date of Patent: May 30, 2023
    Assignee: Cloudera, Inc.
    Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula
  • Patent number: 11663213
    Abstract: The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.
    Type: Grant
    Filed: November 25, 2020
    Date of Patent: May 30, 2023
    Assignee: Cloudera, Inc.
    Inventors: Alexander Behm, Mostafa Mokhtar
  • Patent number: 11663033
    Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
    Type: Grant
    Filed: February 18, 2021
    Date of Patent: May 30, 2023
    Assignee: Cloudera, Inc.
    Inventors: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She
  • Patent number: 11645294
    Abstract: Systems and methods for very fast grouping of “similar” SQL queries according to user-supplied similarity criteria. The user-supplied similarity criteria include a threshold quantifying the degree of similarity between SQL queries and common artifacts included in the queries. A similarity-characterizing data structure allows for the very fast grouping of “similar” SQL queries. Because the computation is distributed among multiple compute nodes, a small cluster of compute nodes takes a short time to compute the similarity-characterizing data on a workload of tens of millions of queries. The user can supply the similarity criteria through a UI or a command line tool. Furthermore, the user can adjust the degree of similarity by supplying new similarity criteria. Accordingly, the system can display in real time or near real time, updated SQL groupings corresponding to the newly supplied similarity criteria using the originally computed similarity-characterizing data structure.
    Type: Grant
    Filed: March 20, 2020
    Date of Patent: May 9, 2023
    Assignee: Cloudera, Inc.
    Inventors: Rituparna Agrawal, Anupam Singh, Prithviraj Pandian
  • Patent number: 11630830
    Abstract: A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
    Type: Grant
    Filed: July 6, 2020
    Date of Patent: April 18, 2023
    Assignee: Cloudera Inc.
    Inventors: Marcel Kornacker, Justin Erickson, Nong Li, Lenni Kuff, Henry Noel Robinson, Alan Choi, Alex Behm
  • Patent number: 11567956
    Abstract: A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
    Type: Grant
    Filed: July 6, 2020
    Date of Patent: January 31, 2023
    Assignee: Cloudera, Inc.
    Inventors: Marcel Kornacker, Justin Erickson, Nong Li, Lenni Kuff, Henry Noel Robinson, Alan Choi, Alex Behm
  • Patent number: 11388271
    Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
    Type: Grant
    Filed: October 6, 2020
    Date of Patent: July 12, 2022
    Assignee: Cloudera, Inc.
    Inventors: David Alves, Todd Lipcon
  • Patent number: 11341134
    Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.
    Type: Grant
    Filed: August 10, 2020
    Date of Patent: May 24, 2022
    Assignee: Cloudera, Inc.
    Inventors: Anjali Betawadkar-Norwood, Priyank Patel
  • Patent number: 11301210
    Abstract: A technique is described for merging multiple lists of ordinal elements such as keys into a sorted output. In an example embodiment, a merge window is defined, based on the bounds of the multiple lists of ordinal elements, that is representative of a portion of an overall element space associated with the multiple lists. Lists of elements to be sorted can be placed into one of at least two different heaps based on whether they overlap the merge window. For example, lists that overlap the merge window may be placed into an active or “hot” heap, while lists that do not overlap the merge window may be placed into a separate inactive or “cold” heap. A sorted output can then be generated by iteratively processing the active heap. As the processing of the active heap progresses, the merge window advances, and lists may move between the active and inactive heaps.
    Type: Grant
    Filed: January 28, 2020
    Date of Patent: April 12, 2022
    Assignee: Cloudera, Inc.
    Inventors: Adar Lieber-Dembo, Todd Lipcon
  • Patent number: 11151135
    Abstract: A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.
    Type: Grant
    Filed: August 5, 2016
    Date of Patent: October 19, 2021
    Assignee: Cloudera, Inc.
    Inventor: Douglas J. Cameron
  • Patent number: 11146668
    Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
    Type: Grant
    Filed: June 8, 2020
    Date of Patent: October 12, 2021
    Assignee: Cloudera, Inc.
    Inventors: David Alves, Todd Lipcon
  • Patent number: 11108661
    Abstract: A machine has a bus and a network interface circuit to receive different data streams from a network. The network interface circuit is connected to the network and the bus. A processor is connected to the bus. A memory is connected to the bus. The memory stores instructions executed by the processor to continuously increment aggregate functions associated with data parameters within the different data streams. Visualizations of the different data streams are periodically updated on different client devices connected to the network.
    Type: Grant
    Filed: February 20, 2019
    Date of Patent: August 31, 2021
    Assignee: Cloudera, Inc.
    Inventors: Charu Anchlia, Sushil Thomas
  • Patent number: 11099892
    Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling, to address this resource fragmentation and to improve cluster utilization and job throughput. In some embodiments a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.
    Type: Grant
    Filed: February 21, 2020
    Date of Patent: August 24, 2021
    Assignee: Cloudera, Inc.
    Inventor: Karthik Kambatla
  • Patent number: 11086917
    Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
    Type: Grant
    Filed: February 26, 2020
    Date of Patent: August 10, 2021
    Assignee: Cloudera, Inc.
    Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula
  • Patent number: 11016947
    Abstract: A system has a distributed database with database partitions distributed across worker nodes connected by a network. An analytical view recommendation engine defines an analytical view comprising attributes and measures defined prior to the receipt of a query. The analytical view is maintained as a data unit separate from the distributed database. The analytical view recommendation engine includes instructions executed by a processor to identify a poorly performing report, evaluate queries associated with the poorly performing report, create a recommended analytical view to enhance the performance of the poorly performing report, and deploy the recommended analytical view.
    Type: Grant
    Filed: December 20, 2016
    Date of Patent: May 25, 2021
    Assignee: Cloudera, Inc.
    Inventors: Priyank Patel, Anjali Betawadkar-Norwood, Douglas J. Cameron, Shant Hovsepian, Sushil Thomas
  • Patent number: 11003642
    Abstract: Columnar storage provides many performance and space saving benefits for analytic workloads, but previous mechanisms for handling single row update transactions in column stores suffer from poor performance. A columnar data layout facilitates both low-latency random access capabilities together with high-throughput analytical access capabilities, simplifying Hadoop architectures for use cases involving real-time data. In disclosed embodiments, mutations within a single row are executed atomically across columns and do not necessarily include the entirety of a row. This allows for faster updates without the overhead of reading or rewriting larger columns.
    Type: Grant
    Filed: January 26, 2018
    Date of Patent: May 11, 2021
    Assignee: Cloudera, Inc.
    Inventor: Todd Lipcon