Patents Assigned to Cloudera, Inc.
-
Patent number: 10599664
Abstract: Systems and methods for very fast grouping of “similar” SQL queries according to user-supplied similarity criteria. The user-supplied similarity criteria include a threshold quantifying the degree of similarity between SQL queries and common artifacts included in the queries. A similarity-characterizing data structure allows for the very fast grouping of “similar” SQL queries. Because the computation is distributed among multiple compute nodes, a small cluster of compute nodes takes a short time to compute the similarity-characterizing data on a workload of tens of millions of queries. The user can supply the similarity criteria through a UI or a command line tool. Furthermore, the user can adjust the degree of similarity by supplying new similarity criteria. Accordingly, the system can display in real time or near real time, updated SQL groupings corresponding to the newly supplied similarity criteria using the originally computed similarity-characterizing data structure.
Type: Grant
Filed: April 24, 2017
Date of Patent: March 24, 2020
Assignee: Cloudera, Inc.
Inventors: Rituparna Agrawal, Anupam Singh, Prithviraj Pandian
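The idea of computing similarity data once and regrouping under a new threshold can be illustrated with a small sketch. This is not the patented method: it assumes Jaccard similarity over each query's set of table names, a plain list of pairwise scores as the precomputed structure, and union-find for grouping. The function names `build_similarity` and `group` are hypothetical, and the all-pairs comparison would not scale to tens of millions of queries.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_similarity(artifacts):
    """Precompute (similarity, i, j) for every pair of queries, once."""
    return [
        (jaccard(artifacts[i], artifacts[j]), i, j)
        for i, j in combinations(range(len(artifacts)), 2)
    ]


def group(n, edges, threshold):
    """Group query indices whose similarity meets the threshold.

    Reuses the precomputed edges, so changing the threshold does not
    require recomputing any similarity score.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sim, i, j in edges:
        if sim >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for q in range(n):
        groups.setdefault(find(q), []).append(q)
    return sorted(groups.values())


# Hypothetical workload: each query is reduced to the tables it touches.
artifacts = [
    {"orders", "customers"},
    {"orders", "customers", "items"},
    {"logs"},
]
edges = build_similarity(artifacts)
loose = group(len(artifacts), edges, 0.5)
strict = group(len(artifacts), edges, 0.9)
```

Only `group` runs again when the user changes the threshold, which is the property the abstract attributes to the similarity-characterizing data structure.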
-
Patent number: 10572306
Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling, to address this resource fragmentation and to improve cluster utilization and job throughput. In some embodiments a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.
Type: Grant
Filed: May 15, 2017
Date of Patent: February 25, 2020
Assignee: Cloudera, Inc.
Inventor: Karthik Kambatla
-
Patent number: 10514948
Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
Type: Grant
Filed: November 9, 2017
Date of Patent: December 24, 2019
Assignee: Cloudera, Inc.
Inventors: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She
-
Patent number: 10346432
Abstract: A compaction policy imposing soft limits to optimize system efficiency is used to select various rowsets on which to perform compaction, each rowset storing keys within an interval called a keyspace. For example, the disclosed compaction policy results in a decrease in a height of the tablet, removes overlapping rowsets, and creates smaller sized rowsets. The compaction policy is based on the linear relationship shared between the keyspace height and the cost associated with performing an operation (e.g., an insert operation) in that keyspace. Accordingly, various factors determining which rowsets are to be compacted, how large the compacted rowsets are to be made, and when to perform the compaction, are considered within the disclosed compaction policy. Furthermore, a system and method for performing compaction on the selected datasets in a log-structured database is also provided.
Type: Grant
Filed: March 17, 2016
Date of Patent: July 9, 2019
Assignee: Cloudera, Inc.
Inventor: Todd Lipcon
-
Patent number: 10255335
Abstract: Techniques are described for analyzing usage of data stored in a data storage system without accessing the stored data. In some embodiments, workload data indicative of queries executed at the data storage system on stored data is received. This workload data can include query logs generated during execution of the queries. The workload data is processed to identify data elements such as tables, columns, and views associated with the stored data as well as information regarding usage of the identified data elements. Usage can include operations performed on the data elements during execution of the queries. Based on this processing, relationships between the identified data elements can be inferred and visualizations generated that convey information regarding usage of the data stored at the data storage system.
Type: Grant
Filed: November 7, 2016
Date of Patent: April 9, 2019
Assignee: Cloudera, Inc.
Inventor: Yihua Ding
-
Patent number: 10187461
Abstract: Methods for configuring a system to collect and aggregate datasets are disclosed. One embodiment includes, identifying a data source in the system from where dataset is to be collected, configuring a machine in the system that generates the dataset to be collected, to send the dataset to the data source, identifying an arrival location where the dataset that is collected is to be aggregated or written, and/or configuring an agent node by specifying a source for the agent node as the data source in the system and specifying a sink for the agent node as the arrival location.
Type: Grant
Filed: April 13, 2016
Date of Patent: January 22, 2019
Assignee: Cloudera, Inc.
Inventors: Jonathan Ming-Cyn Hsieh, Henry Noel Robinson
-
Patent number: 10171635
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
Type: Grant
Filed: August 18, 2014
Date of Patent: January 1, 2019
Assignee: Cloudera, Inc.
Inventors: David Alves, Todd Lipcon
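The timestamp adjustment described in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the patented method: times are integer ticks, the alternate second time is taken to be the first time plus one tick, and equal timestamps are treated the same as a later first time. The function name `event_time` is hypothetical.

```python
def event_time(local_clock_time: int, sender_time: int) -> int:
    """Return the time to record for an event triggered by a message.

    If the sender's timestamp is not earlier than the local clock reading
    (clock error), use an alternate time derived from the sender's
    timestamp so that the triggered event still orders after its cause.
    """
    if sender_time >= local_clock_time:
        return sender_time + 1  # assumed increment of one tick
    return local_clock_time


# The first event happened at tick 105 on computer A; computer B's clock
# reads 100 when the message arrives, so B records 106 instead of 100.
first_time = 105
second_time = event_time(local_clock_time=100, sender_time=first_time)
```

A third system comparing `first_time` with `second_time` then sees the events in causal order even though the second computer's clock was behind.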
-
Patent number: 10120904
Abstract: Systems and methods are disclosed for resource management in a distributed computing environment. In some embodiments, a resource manager for a large distributed cluster needs to be able to provide resource responses very quickly. But each query may also not be accurate in initial resource request and will often have to come back to the resource manager multiple times. An artifact may provide low latency query responses by using resource request caching that can handle re-requests of resources. According to some embodiments, a queuing mechanism may take into account resources currently expended and any resource requirement estimates available in order to make queuing decisions that meet policies set by an administrator. In some embodiments, scheduling decisions are distributed across a cluster of computing systems while still maintaining approximate compliance with resource management policies set by an administrator.
Type: Grant
Filed: December 31, 2014
Date of Patent: November 6, 2018
Assignee: Cloudera, Inc.
Inventor: Jairam Ranganathan
-
Patent number: 10007864
Abstract: An image processing system involves a camera, at least one processor associated with the camera, non-transitory storage, a lexical database of terms and image classification software. The image processing system uses the image classification software to assign hyponyms and associated probabilities to an image and then builds a subset hierarchical tree of hypernyms from the lexical database of terms. The processor then scores the hypernyms and identifies at least one hypernym for the image that has a score that is calculated to have a value that is greater than one of: a pre-specified threshold score, or all other calculated level scores within the subset hierarchical tree. The associated methods are also disclosed.
Type: Grant
Filed: February 8, 2018
Date of Patent: June 26, 2018
Assignee: Cloudera, Inc.
Inventors: Micha Gorelick, Hilary Mason, Grant Custer
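A rough sketch of hypernym scoring follows. It is an illustration only and rests on several assumptions not stated in the abstract: the lexical hierarchy is a tiny hand-written `PARENT` map rather than a real lexical database, a hypernym's score is the sum of the probabilities of the hyponyms beneath it, and the label chosen is the most specific hypernym whose score exceeds the threshold. All names are hypothetical.

```python
# Hypothetical child -> parent (hypernym) links standing in for a lexical database.
PARENT = {
    "tabby": "cat",
    "siamese": "cat",
    "beagle": "dog",
    "cat": "animal",
    "dog": "animal",
}


def hypernym_scores(leaf_probs):
    """Sum each hyponym's probability into all of its ancestors."""
    scores = {}
    for leaf, prob in leaf_probs.items():
        node = PARENT.get(leaf)
        while node is not None:
            scores[node] = scores.get(node, 0.0) + prob
            node = PARENT.get(node)
    return scores


def depth(node):
    """Number of hypernym links between a node and the root."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d


def pick_label(leaf_probs, threshold):
    """Return the deepest hypernym scoring above the threshold, or None."""
    scores = hypernym_scores(leaf_probs)
    candidates = [n for n, s in scores.items() if s > threshold]
    return max(candidates, key=depth) if candidates else None


# Hypothetical classifier output: no single breed is confident on its own.
probs = {"tabby": 0.4, "siamese": 0.35, "beagle": 0.25}
label = pick_label(probs, threshold=0.7)
```

With these numbers no single hyponym passes the threshold, but the combined evidence for the hypernym above the two cat breeds does, so the sketch falls back to the more general term.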
-
Patent number: 9990399
Abstract: A low latency query engine for APACHE HADOOP™ that provides real-time or near real-time, ad hoc query capability, while completing batch-processing of MapReduce. In one embodiment, the low latency query engine comprises a daemon that is installed on data nodes in a HADOOP™ cluster for handling query requests and all internal requests related to query execution. In a further embodiment, the low latency query engine comprises a daemon for providing name service and metadata distribution. The low latency query engine receives a query request via client, turns the request into collections of plan fragments and coordinates parallel and optimized execution of the plan fragments on remote daemons to generate results at a much faster speed than existing batch-oriented processing frameworks.
Type: Grant
Filed: May 13, 2016
Date of Patent: June 5, 2018
Assignee: Cloudera, Inc.
Inventors: Marcel Kornacker, Justin Erickson, Nong Li, Lenni Kuff, Henry Noel Robinson, Alan Choi, Alex Behm
-
Patent number: 9977826
Abstract: A computerized method for generating and evaluating natural language-generated text involves receiving, in a computer, data input by a user, generating, using a natural language generation technique, multiple instances of text stories based upon both contents of a corpus and the received data; analyzing the multiple instances of text stories as a weighted combination of computed geographic scores, distance scores, information content scores, replacement scores and extra aspect scores, providing a ranked set of the generated text stories to a user, receiving a selection of one of the text stories in the ranked set, and storing the selected story.
Type: Grant
Filed: October 21, 2015
Date of Patent: May 22, 2018
Assignee: Cloudera, Inc.
Inventors: Micha Gorelick, Hilary Mason, Grant Custer
-
Patent number: 9946958
Abstract: An image processing system involves a camera, at least one processor associated with the camera, non-transitory storage, a lexical database of terms and image classification software. The image processing system uses the image classification software to assign hyponyms and associated probabilities to an image and then builds a subset hierarchical tree of hypernyms from the lexical database of terms. The processor then scores the hypernyms and identifies at least one hypernym for the image that has a score that is calculated to have a value that is greater than one of: a pre-specified threshold score, or all other calculated level scores within the subset hierarchical tree. The associated methods are also disclosed.
Type: Grant
Filed: October 14, 2016
Date of Patent: April 17, 2018
Assignee: Cloudera, Inc.
Inventors: Micha Gorelick, Hilary Mason, Grant Custer
-
Patent number: 9934382
Abstract: Embodiments of the present disclosure include systems and methods for encrypting a virtual machine image and accessing an encrypted virtual machine image. According to some embodiments an encryption module can encrypt a virtual machine image and place an encryption boot loader. The encryption boot loader may be extracted from the encrypted virtual machine image, be transmitted to, and stored at a key storage system. Upon a request to boot an operating system associated with the encrypted virtual machine image, a pre-boot execution environment may communicate with an image service to retrieve the encryption boot loader from the remote key storage system. The virtual machine image may therefore be decrypted using the encryption boot loader, which may allow booting of the operating system.
Type: Grant
Filed: October 28, 2014
Date of Patent: April 3, 2018
Assignee: Cloudera, Inc.
Inventor: Eduardo Garcia
-
Patent number: 9842126
Abstract: Systems and methods for checking for region consistency and table integrity problems and automatically repairing a corrupted HBase cluster. The methods and systems operate in a diagnostic mode and a diagnostic and repair mode. The methods include fixing table integrity problems, such as backwards table regions, table region holes, table region overlap, and the like to restore table integrity invariant. Once the table integrity has been restored, each row key resolves to exactly one region. The methods further include fixing region inconsistencies, such as bad region assignment, no region present in the meta table, region information not in the Hadoop Distributed File System (HDFS), and the like to restore region consistency invariant. The information in the HDFS is taken as ground truth and any meta table or assignment problems that are inconsistent with the HDFS are deemed wrong and removed.
Type: Grant
Filed: March 15, 2013
Date of Patent: December 12, 2017
Assignee: Cloudera, Inc.
Inventor: Jonathan Ming-Cyn Hsieh
-
Patent number: 9819491
Abstract: Embodiments of the present disclosure include systems and methods for secure release of secret information over a network. The server can be configured to receive a request from a client to access the deposit of secret information, send an authorization request to at least one designated trustee in the set of designated trustees for the deposit of secret information, receive responses over the network from one or more of the designated trustees in the set of designated trustees and apply a trustee policy to the responses from the one or more designated trustees in the set of trustees to determine if the request is authorized. If the request is authorized, the server can send the secret information to the client. If the request is not authorized, the server denies access by the client to the secret information.
Type: Grant
Filed: May 9, 2016
Date of Patent: November 14, 2017
Assignee: Cloudera, Inc.
Inventors: Dustin C. Kirkland, Eduardo Garcia
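Applying a trustee policy to the collected responses might look like the sketch below. The abstract does not say what the policy is, so this assumes a simple quorum rule: the request is authorized when at least a required number of designated trustees approve. The function name `is_authorized` and the trustee names are hypothetical.

```python
def is_authorized(responses, trustees, required_approvals):
    """Apply an assumed quorum policy to trustee responses.

    responses maps a responder's name to True (approve) or False (deny).
    Responses from anyone outside the designated trustee set are ignored.
    """
    approvals = sum(
        1 for name, approved in responses.items()
        if name in trustees and approved
    )
    return approvals >= required_approvals


# Hypothetical deposit with three designated trustees and a 2-of-3 policy.
trustees = {"alice", "bob", "carol"}
responses = {"alice": True, "bob": False, "carol": True, "mallory": True}
release = is_authorized(responses, trustees, required_approvals=2)
```

The server would send the secret to the client only when `release` is true, and deny access otherwise.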
-
Patent number: 9817859
Abstract: Systems and methods of collecting and aggregating log data with fault tolerance are disclosed. One embodiment includes, one or more devices that generate log data, the one or more machines each associated with an agent node to collect the log data, wherein, the agent node generates a batch comprising multiple messages from the log data and assigns a tag to the batch. In one embodiment, the agent node further computes a checksum for the batch of multiple messages. The system may further include a collector device, the collector device being associated with a collector tier having a collector node to which the agent sends the log data; wherein, the collector determines the checksum for the batch of multiple messages received from the agent node.
Type: Grant
Filed: June 1, 2016
Date of Patent: November 14, 2017
Assignee: Cloudera, Inc.
Inventors: Jonathan Ming-Cyn Hsieh, Henry Noel Robinson
-
Patent number: 9817867
Abstract: Systems and methods of dynamically processing an event using an extensible data model are disclosed. One embodiment includes, specifying attributes of the event in a data model; the data model being extensible to add properties to the event as the dataset is streamed from the source to the sink.
Type: Grant
Filed: November 16, 2015
Date of Patent: November 14, 2017
Assignee: Cloudera, Inc.
Inventors: Jonathan Ming-Cyn Hsieh, Henry Noel Robinson
-
Patent number: 9753954
Abstract: Systems and methods for data node fencing in a distributed file system to prevent data inconsistencies and corruptions are disclosed. An embodiment includes implementing a protocol whereby data nodes detect a failover and determine an active name node based on transaction identifiers associated with transaction requests. The data nodes also provide to the active name node block location information and an acknowledgment. The embodiment further includes a protocol whereby a name node refrains from issuing invalidation requests to the data nodes until the name node receives acknowledgments from all data nodes that are functional.
Type: Grant
Filed: September 11, 2013
Date of Patent: September 5, 2017
Assignee: Cloudera, Inc.
Inventors: Todd Lipcon, Aaron T. Myers, Eli Collins
-
Patent number: 9747333
Abstract: A sysSQL technology for querying operating system states of multiple hosts in a cluster using a Structured Query Language (SQL) query is disclosed. An administrator of a cluster can use a graphical or text-based user interface to submit an SQL query to determine the operating system states of multiple hosts in parallel. The technology parses the SQL query to determine the datasets needed to execute the SQL query and aggregates those datasets from the multiple hosts. The technology then creates a temporary database to execute the SQL query and provides the results from the SQL query for display on the user interface.
Type: Grant
Filed: October 8, 2014
Date of Patent: August 29, 2017
Assignee: Cloudera, Inc.
Inventor: Philip Zeyliger
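The last step, loading aggregated host data into a temporary database and running the query there, can be sketched with Python's standard `sqlite3` module. This is an assumption-laden illustration rather than the patented system: the `processes` table, its columns, and the sample rows are hypothetical, and the parsing and parallel collection steps are replaced by a hard-coded list.

```python
import sqlite3


def run_sys_query(host_rows, sql):
    """Load per-host process rows into a temporary database and query it."""
    conn = sqlite3.connect(":memory:")  # temporary, discarded on close
    try:
        conn.execute(
            "CREATE TABLE processes "
            "(host TEXT, pid INTEGER, name TEXT, rss_kb INTEGER)"
        )
        conn.executemany(
            "INSERT INTO processes VALUES (?, ?, ?, ?)", host_rows
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Hypothetical rows, as if already gathered from two hosts.
rows = [
    ("h1", 1, "java", 500),
    ("h2", 2, "java", 700),
    ("h1", 3, "bash", 10),
]
result = run_sys_query(
    rows,
    "SELECT host, SUM(rss_kb) FROM processes GROUP BY host ORDER BY host",
)
```

Using an in-memory database keeps the aggregated snapshot isolated from any persistent store, which matches the abstract's description of a database created only to execute the query.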
-
Patent number: 9716624
Abstract: Systems and methods for centralized configuration of a distributed computing cluster are disclosed. One embodiment of the disclosed technology provides a user environment that facilitates a selection of a service to be run on hosts in the distributed computing cluster and configuration of the service or hosts in the distributed computer cluster. The disclosed technology can further configure each of the hosts in the distributed computing cluster to run the service based on a set of configuration settings.
Type: Grant
Filed: October 8, 2014
Date of Patent: July 25, 2017
Assignee: Cloudera, Inc.
Inventors: Philip Zeyliger, Philip Lee Langdale, Patrick David Hunt