Patents Assigned to Cloudera, Inc.
-
Patent number: 11151135
Abstract: A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.
Type: Grant
Filed: August 5, 2016
Date of Patent: October 19, 2021
Assignee: Cloudera, Inc.
Inventor: Douglas J. Cameron
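The signature comparison in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `PrecomputedResults` class, the SHA-256 hash over JSON metadata, and the use of a sum as the final answer are all assumptions made for illustration.

```python
import hashlib
import json


def metadata_signature(metadata: dict) -> str:
    """Hash a source's metadata into a signature (SHA-256 is an assumption)."""
    canonical = json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PrecomputedResults:
    """Hypothetical store of per-source partial results and metadata signatures."""

    def __init__(self, compute):
        self.compute = compute   # per-source computation, e.g. sum
        self.signatures = {}     # source name -> stored metadata signature
        self.results = {}        # source name -> pre-computed partial result

    def precompute(self, sources: dict) -> None:
        # sources maps a name to a (metadata, data) pair.
        for name, (metadata, data) in sources.items():
            self.signatures[name] = metadata_signature(metadata)
            self.results[name] = self.compute(data)

    def answer(self, sources: dict):
        """Recompute only sources whose signature changed; reuse the rest."""
        recomputed = []
        partials = []
        for name, (metadata, data) in sources.items():
            current = metadata_signature(metadata)
            if self.signatures.get(name) != current:
                self.results[name] = self.compute(data)
                self.signatures[name] = current
                recomputed.append(name)
            partials.append(self.results[name])
        # Assumption: the final answer is the sum of the partial results.
        return sum(partials), recomputed
```

Only a source whose metadata changed since pre-computation is recomputed; every other source contributes its stored result.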
-
Patent number: 11146668
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
Type: Grant
Filed: June 8, 2020
Date of Patent: October 12, 2021
Assignee: Cloudera, Inc.
Inventors: David Alves, Todd Lipcon
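The timestamp substitution described in this abstract can be sketched in a few lines of Python. The `Node` class and its method names are hypothetical, and modeling "based on the first time" as the sender's time plus one tick is an assumption, not something the abstract specifies.

```python
class Node:
    """A computer that timestamps events with a possibly skewed local clock."""

    def __init__(self, read_clock):
        self.read_clock = read_clock  # hypothetical callable returning local time

    def local_event(self) -> int:
        """Timestamp an event that originates on this node."""
        return self.read_clock()

    def event_from_message(self, sender_time: int) -> int:
        """Timestamp an event triggered by a message carrying the sender's time."""
        local_time = self.read_clock()
        if sender_time > local_time:
            # Clock error: the cause appears later than the effect.
            # Assumption: the alternate time is sender_time + 1.
            return sender_time + 1
        return local_time


def first_happened_before_second(first_time: int, second_time: int) -> bool:
    """What a third system would do: compare the two recorded times."""
    return first_time < second_time
```

If the first node reads 100 and the second node's clock reads 90, the second event is recorded at 101 rather than 90, so a later comparison still orders the events correctly.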
-
Patent number: 11108661
Abstract: A machine has a bus and a network interface circuit to receive different data streams from a network. The network interface circuit is connected to the network and the bus. A processor is connected to the bus. A memory is connected to the bus. The memory stores instructions executed by the processor to continuously increment aggregate functions associated with data parameters within the different data streams. Visualizations of the different data streams are periodically updated on different client devices connected to the network.
Type: Grant
Filed: February 20, 2019
Date of Patent: August 31, 2021
Assignee: Cloudera, Inc.
Inventors: Charu Anchlia, Sushil Thomas
-
Patent number: 11099892
Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling, to address this resource fragmentation and to improve cluster utilization and job throughput. In some embodiments a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.
Type: Grant
Filed: February 21, 2020
Date of Patent: August 24, 2021
Assignee: Cloudera, Inc.
Inventor: Karthik Kambatla
-
Patent number: 11086917
Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
Type: Grant
Filed: February 26, 2020
Date of Patent: August 10, 2021
Assignee: Cloudera, Inc.
Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula
-
Patent number: 11016947
Abstract: A system has a distributed database with database partitions distributed across worker nodes connected by a network. An analytical view recommendation engine defines an analytical view comprising attributes and measures defined prior to the receipt of a query. The analytical view is maintained as a data unit separate from the distributed database. The analytical view recommendation engine includes instructions executed by a processor to identify a poorly performing report, evaluate queries associated with the poorly performing report, create a recommended analytical view to enhance the performance of the poorly performing report, and deploy the recommended analytical view.
Type: Grant
Filed: December 20, 2016
Date of Patent: May 25, 2021
Assignee: Cloudera, Inc.
Inventors: Priyank Patel, Anjali Betawadkar-Norwood, Douglas J. Cameron, Shant Hovsepian, Sushil Thomas
-
Patent number: 11003642
Abstract: Columnar storage provides many performance and space saving benefits for analytic workloads, but previous mechanisms for handling single row update transactions in column stores suffer from poor performance. A columnar data layout facilitates both low-latency random access capabilities together with high-throughput analytical access capabilities, simplifying Hadoop architectures for use cases involving real-time data. In disclosed embodiments, mutations within a single row are executed atomically across columns and do not necessarily include the entirety of a row. This allows for faster updates without the overhead of reading or rewriting larger columns.
Type: Grant
Filed: January 26, 2018
Date of Patent: May 11, 2021
Assignee: Cloudera, Inc.
Inventor: Todd Lipcon
-
Patent number: 10983963
Abstract: Embodiments for locating, identifying and categorizing data-assets through advanced machine learning algorithms implemented by profiler components across Hadoop and Hadoop Compatible File Systems, databases and in-memory objects automatically and periodically to provide a visual representation of the category of data infrastructure distributed across data-centers and multiple clusters, for the purposes of enriching data quality, enabling data discovery and improving outcomes from downstream systems.
Type: Grant
Filed: September 24, 2018
Date of Patent: April 20, 2021
Assignee: Cloudera, Inc.
Inventors: Srikanth Venkatasubramanian, Babu Prakash Rao, Hemanth Yamijala, Rohit Choudhary, Raghumitra Kandikonda
-
Patent number: 10951604
Abstract: Embodiments for deploying services to multiple Hadoop clusters and providing user access to these services in a secure manner. A process allows authorized users to select a service, validate its entitlement to the organization and then install distributed components of the service onto multiple hosts on different Hadoop clusters. In order to enable this deployment and secure access of this service, an identity federation mechanism is used to ensure the user identity of the system is propagated to distributed clusters in a secure fashion thereby ensuring authorized access to clusters or services is provided in a seamless fashion.
Type: Grant
Filed: September 24, 2018
Date of Patent: March 16, 2021
Assignee: Cloudera, Inc.
Inventors: Srikanth Venkatasubramanian, Hemanth Yamijala, Abhishek Kumar, Ashwin Rajeev, Lawrence J McCay, III
-
Patent number: 10929173
Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
Type: Grant
Filed: October 29, 2019
Date of Patent: February 23, 2021
Assignee: Cloudera, Inc.
Inventors: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She
-
Patent number: 10922284
Abstract: Embodiments for managing data in a large-scale computer network coupling one or more client computers to a server and having multiple clusters having respective applications, by: encoding web-based data of services to a web browser of a client computer; forwarding requests from the web browser to a cluster access subsystem that wraps the requests in a security protocol interaction that preserves an identity of a user of the client computer; deploying the applications using an application descriptor for each application of the deployed applications; and storing data about how each application can be accessed through service endpoints including a network address and port identifier for access by queries by any other component, application, or service in the network.
Type: Grant
Filed: September 24, 2018
Date of Patent: February 16, 2021
Assignee: Cloudera, Inc.
Inventors: Srikanth Venkatasubramanian, Babu Prakash Rao, Hemanth Yamijala, Rohit Choudhary, Ram Venkatesh
-
Patent number: 10866874
Abstract: A system includes a distributed data storage system disseminated across worker machines connected by a network. A distributed data storage management module has instructions executed by a processor to utilize data block identifiers to track data block accesses to the distributed data storage system. A sampling module with instructions executed by the processor receives a new sample request from a client machine connected to the network. Initial data block samples are gathered from the distributed data storage system during a first time period. A revised sample request is received from the client machine during the first time period. The initial data block samples are gathered. New data block samples are collected from the distributed data storage system. The initial data block samples and the new data block samples are combined to form cumulative data block sample results. The cumulative data block sample results are supplied to the client machine.
Type: Grant
Filed: June 27, 2019
Date of Patent: December 15, 2020
Assignee: Cloudera, Inc.
Inventors: Shaun Ahmadian, Sushil Thomas
-
Patent number: 10853368
Abstract: The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes 1) gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.
Type: Grant
Filed: April 2, 2018
Date of Patent: December 1, 2020
Assignee: Cloudera, Inc.
Inventors: Alexander Behm, Mostafa Mokhtar
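The four steps in this abstract can be sketched with the Python standard library. This is an illustration under stated assumptions, not the patented method: it counts distinct values exactly within each sample instead of using a probabilistic estimator, and it fits a power law `a * n**b` in log-log space, a function family chosen here only for simplicity. The function names and the sample fractions are hypothetical.

```python
import math
import random


def fit_power_law(sizes, estimates):
    """Least-squares fit of estimate = a * size**b in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in estimates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def estimate_distinct(values, total_count, fractions=(0.01, 0.02, 0.05, 0.1), seed=0):
    """Extrapolate per-sample distinct counts to the full dataset size."""
    rng = random.Random(seed)
    sizes, estimates = [], []
    for fraction in fractions:
        k = max(1, int(len(values) * fraction))
        sample = rng.sample(values, k)
        sizes.append(k)
        # Assumption: exact count per sample stands in for a probabilistic estimate.
        estimates.append(len(set(sample)))
    a, b = fit_power_law(sizes, estimates)
    # There can never be more distinct values than values.
    return min(total_count, a * total_count ** b)
```

The fit needs at least two different sample sizes, which the default fractions provide for any dataset of a few hundred values or more.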
-
Publication number: 20200372029
Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.
Type: Application
Filed: August 10, 2020
Publication date: November 26, 2020
Applicant: Cloudera, Inc.
Inventors: Anjali Betawadkar-Norwood, Priyank Patel
-
Patent number: 10776217
Abstract: Scalable architectures, systems, and services are provided herein for creating manifest-based snapshots in distributed computing environments. In some embodiments, responsive to receiving a request to create a snapshot of a data object, a master node identifies multiple slave nodes on which a data object is stored in the cloud-computing platform and creates a snapshot manifest representing the snapshot of the data object. The snapshot manifest comprises a file including a listing of multiple file names in the snapshot manifest and reference information for locating the multiple files in the distributed database system. The snapshot can be created without disrupting I/O operations, e.g., in an online mode by various region servers as directed by the master node. Additionally, a log roll approach to creating the snapshot is also disclosed in which log files are marked. The replaying of log entries can reduce the probability of causal inconsistency in the snapshot.
Type: Grant
Filed: May 25, 2017
Date of Patent: September 15, 2020
Assignee: Cloudera, Inc.
Inventors: Jonathan Ming-Cyn Hsieh, Matteo Bertozzi
-
Patent number: 10740333
Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.
Type: Grant
Filed: June 27, 2018
Date of Patent: August 11, 2020
Assignee: Cloudera, Inc.
Inventors: Anjali Betawadkar-Norwood, Priyank Patel
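The plan rewrite named in this abstract, moving an aggregation ahead of a join, can be shown on in-memory tables. The sketch below is a toy illustration only: the table shapes (`facts` as `(product_id, amount)` pairs, `dim` as a `product_id` to category mapping) and both function names are hypothetical, and the analytical view step is not modeled.

```python
from collections import defaultdict


def join_then_aggregate(facts, dim):
    """Original plan: join every fact row to its dimension row, then sum."""
    totals = defaultdict(int)
    for product_id, amount in facts:
        totals[dim[product_id]] += amount
    return dict(totals)


def aggregate_then_join(facts, dim):
    """Eager plan: pre-aggregate facts by join key, then join the smaller result."""
    partial = defaultdict(int)
    for product_id, amount in facts:
        partial[product_id] += amount
    totals = defaultdict(int)
    # The join now touches one row per distinct product_id, not one per fact.
    for product_id, subtotal in partial.items():
        totals[dim[product_id]] += subtotal
    return dict(totals)
```

Both plans return the same totals per category because summation can be split into partial sums; the eager plan simply performs fewer dimension lookups when many fact rows share a join key.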
-
Patent number: 10706059
Abstract: A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
Type: Grant
Filed: October 12, 2016
Date of Patent: July 7, 2020
Assignee: Cloudera, Inc.
Inventors: Marcel Kornacker, Justin Erickson, Nong Li, Lenni Kuff, Henry Noel Robinson, Alan Choi, Alex Behm
-
Patent number: 10681190
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
Type: Grant
Filed: November 21, 2018
Date of Patent: June 9, 2020
Assignee: Cloudera, Inc.
Inventors: David Alves, Todd Lipcon
-
Patent number: 10635700
Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
Type: Grant
Filed: April 2, 2018
Date of Patent: April 28, 2020
Assignee: Cloudera, Inc.
Inventors: Sudhanshu Arora, Mark Donsky, Guang Yao Leng, Naren Koneru, Chang She, Vikas Singh, Himabindu Vuppula
-
Patent number: 10613762
Abstract: Systems and methods of a memory allocation buffer to reduce heap fragmentation. In one embodiment, the memory allocation buffer structures a memory arena dedicated to a target region that is one of a plurality of regions in a server in a database cluster such as an HBase cluster. The memory arena has a chunk size (e.g., 2 MB) and an offset pointer. Data objects in write requests targeted to the region are received and inserted to the memory arena at a location specified by the offset pointer. When the memory arena is filled, a new one is allocated. When a MemStore of the target region is flushed, the entire memory arenas for the target region are freed up. This reduces heap fragmentation that is responsible for long and/or frequent garbage collection pauses.
Type: Grant
Filed: January 20, 2017
Date of Patent: April 7, 2020
Assignee: Cloudera, Inc.
Inventor: Todd Lipcon
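The arena behavior in this abstract (fixed-size chunk, offset pointer, bulk release on flush) can be modeled with a short Python sketch. It only illustrates the bookkeeping: Python does not expose the JVM heap behavior the patent targets, and the `RegionArena` class, its method names, and the reference tuple format are assumptions.

```python
class RegionArena:
    """Bump allocator for one region: fixed-size chunks plus an offset pointer."""

    def __init__(self, chunk_size: int = 2 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.chunks = [bytearray(chunk_size)]
        self.offset = 0  # next free byte in the current (last) chunk

    def insert(self, data: bytes):
        """Copy data into the current chunk; start a new chunk when it is full."""
        if len(data) > self.chunk_size:
            raise ValueError("object larger than one chunk")
        if self.offset + len(data) > self.chunk_size:
            self.chunks.append(bytearray(self.chunk_size))
            self.offset = 0
        index = len(self.chunks) - 1
        start = self.offset
        self.chunks[index][start:start + len(data)] = data
        self.offset += len(data)
        return index, start, len(data)  # hypothetical reference to the stored object

    def read(self, ref) -> bytes:
        index, start, length = ref
        return bytes(self.chunks[index][start:start + length])

    def flush(self) -> None:
        """Drop every chunk at once, as when the region's MemStore is flushed."""
        self.chunks = [bytearray(self.chunk_size)]
        self.offset = 0
```

Because objects for one region sit side by side in a few large chunks and are released together, freeing them leaves large contiguous gaps instead of many small scattered ones.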