Patents Assigned to Cloudera, Inc.

Apparatus and method for utilizing pre-computed results for query processing in a distributed database

Patent number: 11151135

Abstract: A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.

Type: Grant

Filed: August 5, 2016

Date of Patent: October 19, 2021

Assignee: Cloudera, Inc.

Inventor: Douglas J. Cameron
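The signature check described in the abstract above can be illustrated with a minimal sketch. It assumes a simple additive query (a sum over per-source partial results); the class `PrecomputedStore`, the `signature` helper, and the metadata fields `mtime` and `size` are hypothetical names chosen for illustration and do not come from the patent.

```python
import hashlib

def signature(metadata: dict) -> str:
    """Hash a source's metadata (hypothetical fields) into a signature."""
    canonical = repr(sorted(metadata.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

class PrecomputedStore:
    """Keeps one partial result and one metadata signature per source."""

    def __init__(self, compute):
        self.compute = compute      # per-source partial computation
        self.signatures = {}        # source name -> metadata signature
        self.results = {}           # source name -> pre-computed partial result

    def precompute(self, sources: dict) -> None:
        for name, src in sources.items():
            self.signatures[name] = signature(src["metadata"])
            self.results[name] = self.compute(src["rows"])

    def answer(self, sources: dict):
        """Re-compute only sources whose signature changed; reuse the rest."""
        recomputed = []
        partials = []
        for name, src in sources.items():
            current = signature(src["metadata"])
            if self.signatures.get(name) != current:
                self.results[name] = self.compute(src["rows"])
                self.signatures[name] = current
                recomputed.append(name)
            partials.append(self.results[name])
        return sum(partials), recomputed

# Illustrative data: two sources, one of which changes after pre-computation.
sources = {
    "sales_2015": {"metadata": {"mtime": 100, "size": 3}, "rows": [1, 2, 3]},
    "sales_2016": {"metadata": {"mtime": 200, "size": 2}, "rows": [10, 20]},
}
store = PrecomputedStore(compute=sum)
store.precompute(sources)
sources["sales_2016"] = {"metadata": {"mtime": 250, "size": 3}, "rows": [10, 20, 30]}
total, recomputed = store.answer(sources)
```

In this sketch only `sales_2016` is re-computed, because only its metadata signature differs from the stored one.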

Ensuring properly ordered events in a distributed computing environment

Patent number: 11146668

Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.

Type: Grant

Filed: June 8, 2020

Date of Patent: October 12, 2021

Assignee: Cloudera, Inc.

Inventors: David Alves, Todd Lipcon
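The timestamp adjustment in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the function name `event_time` is hypothetical, timestamps are plain integers, the alternate time is taken to be the received time plus one, and equal timestamps are treated like the "later" case.

```python
from typing import Optional

def event_time(local_time: int, received_time: Optional[int] = None) -> int:
    """Return the timestamp to record for an event.

    If the triggering message carries a time that is not earlier than the
    local clock reading, record an alternate time based on the received time
    (assumption: received_time + 1) so the caused event sorts after its cause.
    """
    if received_time is not None and received_time >= local_time:
        return received_time + 1
    return local_time

# Computer A's clock runs ahead of computer B's clock.
first = event_time(local_time=1005)                         # event on A
second = event_time(local_time=1000, received_time=first)   # event on B, caused by A

# A third system comparing the two recorded times sees the correct order.
in_order = first < second
```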

Apparatus and method for processing streaming data and forming visualizations thereof

Patent number: 11108661

Abstract: A machine has a bus and a network interface circuit to receive different data streams from a network. The network interface circuit is connected to the network and the bus. A processor is connected to the bus. A memory is connected to the bus. The memory stores instructions executed by the processor to continuously increment aggregate functions associated with data parameters within the different data streams. Visualizations of the different data streams are periodically updated on different client devices connected to the network.

Type: Grant

Filed: February 20, 2019

Date of Patent: August 31, 2021

Assignee: Cloudera, Inc.

Inventors: Charu Anchlia, Sushil Thomas

Utilization-aware resource scheduling in a distributed computing cluster

Patent number: 11099892

Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling that addresses resource fragmentation and improves cluster utilization and job throughput. In some embodiments a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.

Type: Grant

Filed: February 21, 2020

Date of Patent: August 24, 2021

Assignee: Cloudera, Inc.

Inventor: Karthik Kambatla

Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters

Patent number: 11086917

Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.

Type: Grant

Filed: February 26, 2020

Date of Patent: August 10, 2021

Assignee: Cloudera, Inc.

Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula

Apparatus and method for recommending and maintaining analytical views

Patent number: 11016947

Abstract: A system has a distributed database with database partitions distributed across worker nodes connected by a network. An analytical view recommendation engine defines an analytical view comprising attributes and measures defined prior to the receipt of a query. The analytical view is maintained as a data unit separate from the distributed database. The analytical view recommendation engine includes instructions executed by a processor to identify a poorly performing report, evaluate queries associated with the poorly performing report, create a recommended analytical view to enhance the performance of the poorly performing report, and deploy the recommended analytical view.

Type: Grant

Filed: December 20, 2016

Date of Patent: May 25, 2021

Assignee: Cloudera, Inc.

Inventors: Priyank Patel, Anjali Betawadkar-Norwood, Douglas J. Cameron, Shant Hovsepian, Sushil Thomas

Mutations in a column store

Patent number: 11003642

Abstract: Columnar storage provides many performance and space saving benefits for analytic workloads, but previous mechanisms for handling single row update transactions in column stores suffer from poor performance. A columnar data layout facilitates both low-latency random access capabilities together with high-throughput analytical access capabilities, simplifying Hadoop architectures for use cases involving real-time data. In disclosed embodiments, mutations within a single row are executed atomically across columns and do not necessarily include the entirety of a row. This allows for faster updates without the overhead of reading or rewriting larger columns.

Type: Grant

Filed: January 26, 2018

Date of Patent: May 11, 2021

Assignee: Cloudera, Inc.

Inventor: Todd Lipcon

Automated discovery, profiling, and management of data assets across distributed file systems through machine learning

Patent number: 10983963

Abstract: Embodiments for locating, identifying and categorizing data-assets through advanced machine learning algorithms implemented by profiler components across Hadoop and Hadoop Compatible File Systems, databases and in-memory objects automatically and periodically to provide a visual representation of the category of data infrastructure distributed across data-centers and multiple clusters, for the purposes of enriching data quality, enabling data discovery and improving outcomes from downstream systems.

Type: Grant

Filed: September 24, 2018

Date of Patent: April 20, 2021

Assignee: Cloudera, Inc.

Inventors: Srikanth Venkatasubramanian, Babu Prakash Rao, Hemanth Yamijala, Rohit Choudhary, Raghumitra Kandikonda

Secure service deployment and access layer spanning multi-cluster environments

Patent number: 10951604

Abstract: Embodiments for deploying services to multiple Hadoop clusters and providing user access to these services in a secure manner. A process allows authorized users to select a service, validate its entitlement to the organization and then install distributed components of the service onto multiple hosts on different Hadoop clusters. In order to enable this deployment and secure access of this service, an identity federation mechanism is used to ensure the user identity of the system is propagated to distributed clusters in a secure fashion thereby ensuring authorized access to clusters or services is provided in a seamless fashion.

Type: Grant

Filed: September 24, 2018

Date of Patent: March 16, 2021

Assignee: Cloudera, Inc.

Inventors: Srikanth Venkatasubramanian, Hemanth Yamijala, Abhishek Kumar, Ashwin Rajeev, Lawrence J McCay, III

Design-time information based on run-time artifacts in a distributed computing cluster

Patent number: 10929173

Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.

Type: Grant

Filed: October 29, 2019

Date of Patent: February 23, 2021

Assignee: Cloudera, Inc.

Inventors: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She

Extensible framework for managing multiple Hadoop clusters

Patent number: 10922284

Abstract: Embodiments for managing data in a large-scale computer network coupling one or more client computers to a server and having multiple clusters having respective applications, by: encoding web-based data of services to a web browser of a client computer; forwarding requests from the web browser to a cluster access subsystem that wraps the requests in a security protocol interaction that preserves an identity of a user of the client computer; deploying the applications using an application descriptor for each application of the deployed applications; and storing data about how each application can be accessed through service endpoints including a network address and port identifier for access by queries by any other component, application, or service in the network.

Type: Grant

Filed: September 24, 2018

Date of Patent: February 16, 2021

Assignee: Cloudera, Inc.

Inventors: Srikanth Venkatasubramanian, Babu Prakash Rao, Hemanth Yamijala, Rohit Choudhary, Ram Venkatesh

Apparatus and method for sampling large data sets in a distributed data storage system

Patent number: 10866874

Abstract: A system includes a distributed data storage system disseminated across worker machines connected by a network. A distributed data storage management module has instructions executed by a processor to utilize data block identifiers to track data block accesses to the distributed data storage system. A sampling module with instructions executed by the processor receives a new sample request from a client machine connected to the network. Initial data block samples are gathered from the distributed data storage system during a first time period. A revised sample request is received from the client machine during the first time period. The initial data block samples are gathered. New data block samples are collected from the distributed data storage system. The initial data block samples and the new data block samples are combined to form cumulative data block sample results. The cumulative data block sample results are supplied to the client machine.

Type: Grant

Filed: June 27, 2019

Date of Patent: December 15, 2020

Assignee: Cloudera, Inc.

Inventors: Shaun Ahmadian, Sushil Thomas

Distinct value estimation for query planning

Patent number: 10853368

Abstract: The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes 1) gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.

Type: Grant

Filed: April 2, 2018

Date of Patent: December 1, 2020

Assignee: Cloudera, Inc.

Inventors: Alexander Behm, Mostafa Mokhtar
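The four steps listed in the abstract above can be sketched roughly as follows. Several choices here are assumptions made for illustration and are not taken from the patent: exact distinct counts on each sample stand in for the intermediate probabilistic estimates, the fitted function is a power law d = a * n^b obtained by least squares in log-log space, and the names `fit_power_law` and `estimate_distinct` are hypothetical.

```python
import math
import random

def fit_power_law(sizes, estimates):
    """Least-squares fit of log(d) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(d) for d in estimates]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def estimate_distinct(values, sample_sizes, total, rng):
    """Estimate the distinct count for `total` values from small samples."""
    # Steps 1-2: one intermediate estimate per sample size
    # (assumption: an exact count on the sample replaces a probabilistic sketch).
    estimates = [len(set(rng.sample(values, n))) for n in sample_sizes]
    # Step 3: fit a function to the (sample size, estimate) points.
    a, b = fit_power_law(sample_sizes, estimates)
    # Step 4: extrapolate to the total size, capped at the total itself.
    return min(float(total), a * total ** b)

# Illustrative data: 10,000 values drawn from 50 distinct ones.
values = [i % 50 for i in range(10_000)]
estimate = estimate_distinct(values, [100, 200, 400], len(values), random.Random(0))
```

The sample sizes must differ from one another, otherwise the fit is undefined. A power law tends to overshoot when the samples already contain nearly every distinct value, so the choice of function matters in practice.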

APPARATUS AND METHOD FOR ACCELERATED QUERY PROCESSING USING EAGER AGGREGATION AND ANALYTICAL VIEW MATCHING

Publication number: 20200372029

Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.

Type: Application

Filed: August 10, 2020

Publication date: November 26, 2020

Applicant: Cloudera, Inc.

Inventors: Anjali Betawadkar-Norwood, Priyank Patel
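The eager aggregation step in the abstract above, moving an aggregation ahead of a join, can be illustrated with a small sketch. The tables, column meanings, and function names are hypothetical, and analytical view matching is not modeled here.

```python
from collections import defaultdict

# Hypothetical star schema: fact rows are (product_id, amount);
# the dimension table maps product_id to a category attribute.
fact = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 1.0), (2, 3.0)]
dim = {1: "books", 2: "games", 3: "books"}

def join_then_aggregate(fact, dim):
    """Baseline plan: join every fact row to its dimension row, then sum."""
    totals = defaultdict(float)
    for product_id, amount in fact:
        totals[dim[product_id]] += amount
    return dict(totals)

def aggregate_then_join(fact, dim):
    """Eager plan: pre-aggregate facts by join key, then join the smaller result."""
    partial = defaultdict(float)
    for product_id, amount in fact:
        partial[product_id] += amount
    totals = defaultdict(float)
    for product_id, subtotal in partial.items():
        totals[dim[product_id]] += subtotal
    return dict(totals)

baseline = join_then_aggregate(fact, dim)
eager = aggregate_then_join(fact, dim)
```

Both plans produce the same totals per category; the eager plan joins one row per distinct `product_id` instead of one row per fact.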

Manifest-based snapshots in distributed computing environments

Patent number: 10776217

Abstract: Scalable architectures, systems, and services are provided herein for creating manifest-based snapshots in distributed computing environments. In some embodiments, responsive to receiving a request to create a snapshot of a data object, a master node identifies multiple slave nodes on which a data object is stored in the cloud-computing platform and creates a snapshot manifest representing the snapshot of the data object. The snapshot manifest comprises a file including a listing of multiple file names in the snapshot manifest and reference information for locating the multiple files in the distributed database system. The snapshot can be created without disrupting I/O operations, e.g., in an online mode by various region servers as directed by the master node. Additionally, a log roll approach to creating the snapshot is also disclosed in which log files are marked. The replaying of log entries can reduce the probability of causal inconsistency in the snapshot.

Type: Grant

Filed: May 25, 2017

Date of Patent: September 15, 2020

Assignee: Cloudera, Inc.

Inventors: Jonathan Ming-Cyn Hsieh, Matteo Bertozzi

Apparatus and method for accelerated query processing using eager aggregation and analytical view matching

Patent number: 10740333

Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.

Type: Grant

Filed: June 27, 2018

Date of Patent: August 11, 2020

Assignee: Cloudera, Inc.

Inventors: Anjali Betawadkar-Norwood, Priyank Patel

Background format optimization for enhanced SQL-like queries in Hadoop

Patent number: 10706059

Abstract: A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.

Type: Grant

Filed: October 12, 2016

Date of Patent: July 7, 2020

Assignee: Cloudera, Inc.

Inventors: Marcel Kornacker, Justin Erickson, Nong Li, Lenni Kuff, Henry Noel Robinson, Alan Choi, Alex Behm

Ensuring properly ordered events in a distributed computing environment

Patent number: 10681190

Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.

Type: Grant

Filed: November 21, 2018

Date of Patent: June 9, 2020

Assignee: Cloudera, Inc.

Inventors: David Alves, Todd Lipcon

Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters

Patent number: 10635700

Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.

Type: Grant

Filed: April 2, 2018

Date of Patent: April 28, 2020

Assignee: Cloudera, Inc.

Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula

Memory allocation buffer for reduction of heap fragmentation

Patent number: 10613762

Abstract: Systems and methods of a memory allocation buffer to reduce heap fragmentation. In one embodiment, the memory allocation buffer structures a memory arena dedicated to a target region that is one of a plurality of regions in a server in a database cluster such as an HBase cluster. The memory arena has a chunk size (e.g., 2 MB) and an offset pointer. Data objects in write requests targeted to the region are received and inserted to the memory arena at a location specified by the offset pointer. When the memory arena is filled, a new one is allocated. When a MemStore of the target region is flushed, the entire memory arenas for the target region are freed up. This reduces heap fragmentation that is responsible for long and/or frequent garbage collection pauses.

Type: Grant

Filed: January 20, 2017

Date of Patent: April 7, 2020

Assignee: Cloudera, Inc.

Inventor: Todd Lipcon
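The arena scheme in the abstract above (fixed chunk size, offset pointer, new arena when full, all arenas released on flush) can be sketched as follows. The class names `Arena` and `RegionBuffer` are hypothetical, and a Python `bytearray` merely stands in for a preallocated heap chunk; this is not the HBase implementation.

```python
class Arena:
    """One fixed-size chunk with a bump-style offset pointer."""

    def __init__(self, chunk_size: int):
        self.buf = bytearray(chunk_size)
        self.offset = 0

    def try_allocate(self, data: bytes):
        """Copy data at the offset pointer; return its start, or None if full."""
        end = self.offset + len(data)
        if end > len(self.buf):
            return None
        start = self.offset
        self.buf[start:end] = data
        self.offset = end
        return start

class RegionBuffer:
    """Arenas dedicated to one target region (hypothetical wrapper)."""

    def __init__(self, chunk_size: int = 2 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.arenas = []

    def insert(self, data: bytes):
        """Store data and return (arena index, offset within that arena)."""
        if len(data) > self.chunk_size:
            raise ValueError("object larger than one chunk")
        start = self.arenas[-1].try_allocate(data) if self.arenas else None
        if start is None:
            # Current arena is missing or full: allocate a new one.
            self.arenas.append(Arena(self.chunk_size))
            start = self.arenas[-1].try_allocate(data)
        return len(self.arenas) - 1, start

    def flush(self) -> None:
        """Release every arena for the region at once."""
        self.arenas = []

# A tiny chunk size makes the rollover to a second arena easy to see.
region = RegionBuffer(chunk_size=8)
locations = [region.insert(b"abcde"), region.insert(b"fgh"), region.insert(b"ij")]
```

Because objects for a region live in a few large chunks that are dropped together, freeing them leaves large contiguous holes rather than many small ones.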
