System and Method for Code and Data Versioning in Computerized Data Modeling and Analysis
Code and data versioning allow developers to work on code and data without affecting production code and data and without affecting the development activities of other developers. Code and data being worked on by a developer are associated with a task. The system automatically determines the dataset to use for a given development task from among a production dataset, a latest dataset, or a temporary dataset associated with the development task so that development code does not have to be modified to read from a specific dataset.
This application is a continuation of, and therefore claims priority from, U.S. patent application Ser. No. 15/629,342 entitled SYSTEM AND METHOD FOR CODE AND DATA VERSIONING IN COMPUTERIZED DATA MODELING AND ANALYSIS filed on Jun. 21, 2017 (issuing as U.S. Pat. No. 11,175,910 on Nov. 16, 2021), which is a continuation-in-part of, and therefore claims priority from, U.S. patent application Ser. No. 15/388,388 entitled SYSTEM AND METHOD FOR RAPID DEVELOPMENT AND DEPLOYMENT OF REUSABLE ANALYTIC CODE FOR USE IN COMPUTERIZED DATA MODELING AND ANALYSIS filed on Dec. 22, 2016 (now U.S. Pat. No. 10,394,532 issued Aug. 27, 2019), which claims the benefit of U.S. Provisional Application No. 62/271,041 filed on Dec. 22, 2015; each of which patent applications is hereby incorporated herein by reference in its entirety.
This application also may be related to one or more of the following commonly-owned patent applications filed on Jun. 21, 2017, each of which is hereby incorporated herein by reference in its entirety:
U.S. patent application Ser. No. 15/629,316 entitled SYSTEM AND METHOD FOR INTERACTIVE REPORTING IN COMPUTERIZED DATA MODELING AND ANALYSIS (U.S. Pat. No. 10,275,502 issued Apr. 30, 2019); and
U.S. patent application Ser. No. 15/629,328 entitled SYSTEM AND METHOD FOR OPTIMIZED QUERY EXECUTION IN COMPUTERIZED DATA MODELING AND ANALYSIS (U.S. Pat. No. 10,268,753 issued Apr. 23, 2019).
FIELD OF THE DISCLOSURE
The present disclosure relates generally to computer-based tools for developing and deploying analytic computer code. More specifically, the present disclosure relates to a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis.
BACKGROUND
In today's information technology world, there is an increased interest in processing “big” data to develop insights (e.g., better analytical insight, better customer understanding, etc.) and business advantages (e.g., in enterprise analytics, data management processes, etc.). Customers leave an audit trail or digital log of their interactions, purchases, inquiries, and preferences through online interactions with an organization. Discovering and interpreting audit trails within big data provides a significant advantage to companies looking to realize greater value from the data they capture and manage every day. Structured, semi-structured, and unstructured data points are being generated and captured at an ever-increasing pace, thereby forming big data, which is typically defined in terms of velocity, variety, and volume. Big data is fast-flowing, ever-growing, heterogeneous, and has exceedingly noisy input, and as a result transforming data into signals is critical. As more companies (e.g., airlines, telecommunications companies, financial institutions, etc.) focus on real-world use cases, the demand for continually refreshed signals will continue to increase.
Due to the depth and breadth of available data, data science (and data scientists) is required to transform complex data into simple digestible formats for quick interpretation and understanding. Thus, data science, and in particular, the field of data analytics, focuses on transforming big data into business value (e.g., helping companies anticipate customer behaviors and responses). The current analytic approach to capitalize on big data starts with raw data and ends with intelligence, which is then used to solve a particular business need so that data is ultimately translated into value.
However, a data scientist tasked with a well-defined problem (e.g., rank customers by probability of attrition in the next 90 days) is required to expend a significant amount of effort on tedious manual processes (e.g., aggregating, analyzing, cleansing, preparing, and transforming raw data) in order to begin conducting analytics. In such an approach, significant effort is spent on data preparation (e.g., cleaning, linking, processing), and less is spent on analytics (e.g., business intelligence, visualization, machine learning, model building).
Further, usually the intelligence gathered from the data is not shared across the enterprise (e.g., across use cases, business units, etc.) and is specific to solving a particular use case or business scenario. In this approach, whenever a new use case is presented, an entirely new analytics solution needs to be developed, such that there is no reuse of intelligence across different use cases. Each piece of intelligence that is derived from the data is developed from scratch for each use case that requires it, which often means that it is recreated multiple times for the same enterprise. There are no natural economies of scale in the process, and there are not enough data scientists to tackle the growing number of business opportunities while relying on such techniques. This can result in inefficiencies and waste, including lengthy use case execution and missed business opportunities.
Currently, to conduct analytics on “big” data, data scientists are often required to develop large quantities of software code. Often, such code is expensive to develop, is highly customized, and is not easily adapted for other uses in the analytics field. Minimizing redundant costs and shortening development cycles requires significantly reducing the amount of time that data scientists spend managing and coordinating raw data. Further, optimizing this work can allow data scientists to improve their effectiveness by honing signals and ultimately improving the foundation that drives faster results and business responsiveness. Thus, there is a need for a system for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis.
SUMMARY
The present disclosure relates to a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis. The system includes a centralized, continually updated environment to capture pre-processing steps used in analyzing big data, such that the complex transformations and calculations become continually fresh and accessible to those investigating business opportunities. This centralized, continually refreshed system provides a data-centric competitive advantage for users (e.g., to serve customers better, reduce costs, etc.), as it provides the foresight to anticipate future problems and reuses development efforts. The system incorporates deep domain expertise as well as ongoing expertise in data science, big data architecture, and data management processes. In particular, the system allows for rapid development and deployment of analytic code that can easily be re-used in various data analytics applications, and on multiple computer systems.
Benefits of the system include a faster time to value as data scientists can now assemble pre-existing ETL (extract, transform, and load) processes as well as signal generation components to tackle new use cases more quickly. The present disclosure is a technological solution for coding and developing software to extract information for “big data” problems. The system design allows for increased modularity by integrating with various other platforms seamlessly. The system design also incorporates a new technological solution for creating “signals” which allows a user to extract information from “big data” by focusing on high-level issues in obtaining the data the user desires and not having to focus on the low-level minutia of coding big data software as was required by previous systems. The present disclosure allows for reduced software development complexity, quicker software development lifecycle, and reusability of software code.
In accordance with one embodiment of the invention, a computer-implemented method, system, and computer program product are provided for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files. The computer-implemented method, system, and computer program product perform processes including maintaining a storage system for storing the data files and code files; receiving a request by a given user to modify a given production code file; establishing a task for the user; placing a lock on the given production code file; storing a modified version of the given production code file in a logical partition of the storage system associated with the task; applying the modified version of the given production code file to a specified data file to create a modified version of the specified data file; assigning a first unique version identifier for the modified version of the specified data file; and storing the modified version of the specified data file in the logical partition of the storage system associated with the task in a manner accessible using the first unique version identifier, such that the modified code file is isolated from the production code files and code files of other users, and the modified data file is isolated from the production data files and data files of other users.
In various alternative embodiments, the specified data file may be a production data file or a modified version of a production data file. The logical partition may be a folder. The storage associated with the task may include an append-only file system, wherein the modified version of the specified data file includes append-only files representing changes relative to the specified data file.
Additionally or alternatively, the processes may further include committing the modified version of the given production code file such that the modified version of the given production code file is designated as the latest version of the given production code file; and committing the modified version of the specified data file such that the modified version of the specified data file is designated as the latest version of the specified data file among the set of production data files. Data label features and a plurality of configuration files may be used to allow the user to publish and use the latest version of analytic code. The user's workspace may be isolated from previous versions of analytic code so that the user does not encounter interruptions from new versions of the analytic code.
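The task-scoped versioning workflow described in this embodiment can be illustrated with a minimal sketch. The following is an assumed, simplified in-memory model (all class and method names are hypothetical and are not part of the claimed embodiment): a task is established for a user, a lock is placed on a production code file, a modified version is stored in a logical partition associated with the task under a unique version identifier, and committing promotes the task's files to the latest version while releasing the lock.

```python
import uuid

class VersioningStore:
    """Minimal sketch of task-scoped code/data versioning (hypothetical API)."""

    def __init__(self):
        self.production = {}   # file name -> production contents
        self.latest = {}       # file name -> committed latest version
        self.locks = {}        # file name -> task id holding the lock
        self.partitions = {}   # task id -> {file name: (version id, contents)}

    def start_task(self, user):
        # Establish a task for the user; its partition isolates the user's work.
        task_id = f"{user}-{uuid.uuid4().hex[:8]}"
        self.partitions[task_id] = {}
        return task_id

    def check_out(self, task_id, code_file):
        # Place a lock on the production code file so other tasks cannot modify it.
        if self.locks.get(code_file) not in (None, task_id):
            raise RuntimeError(f"{code_file} is locked by another task")
        self.locks[code_file] = task_id

    def store_modified(self, task_id, file_name, contents):
        # Store the modified file in the task's logical partition with a
        # unique version identifier.
        version_id = uuid.uuid4().hex
        self.partitions[task_id][file_name] = (version_id, contents)
        return version_id

    def commit(self, task_id):
        # Designate the task's modified files as the latest versions,
        # release the task's locks, and terminate the task.
        for file_name, (_version_id, contents) in self.partitions[task_id].items():
            self.latest[file_name] = contents
        for f, t in list(self.locks.items()):
            if t == task_id:
                del self.locks[f]
        del self.partitions[task_id]
```

Because each task writes only into its own partition, a user's modified files remain isolated from production files and from other users' files until the commit step.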
In accordance with another embodiment of the invention, a computer-implemented method, system, and computer-program product are provided for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files. The computer-implemented method, system, and computer program product perform processes including maintaining a first view having a production version of a dataset; creating a task for a developer, the task being associated with a second view; associating a first code file with the task for the first view, the first code file including code that modifies the dataset; creating a temporary version of the dataset in the first view; associating the temporary version of the dataset with the task; associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task.
In various alternative embodiments, locks may be placed on the first code file in the first view and the second code file in the second view. Placing locks on the first and second code files may involve checking out the first and second code files from a source control system. The processes may further include receiving, from the developer, a request to commit changes made to the first and second code files; checking the first and second code files into a source control system; changing the temporary dataset to be a latest dataset; and terminating the task.
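The automatic dataset selection described above (choosing among a production dataset, a latest dataset, or a temporary dataset tied to the development task, so that code need not name a specific version) can be sketched as a simple resolution function. This is an assumed illustration, not the claimed implementation; the precedence order shown (task-scoped temporary, then latest, then production) follows the behavior described in the summary.

```python
def resolve_dataset(name, task, temp_datasets, latest_datasets, production_datasets):
    """Pick the version of a dataset for a read that names no specific version.

    Assumed precedence: a temporary version associated with the reader's
    task wins, then the committed latest version, then production.
    """
    if task is not None and (task, name) in temp_datasets:
        return temp_datasets[(task, name)]   # task-scoped temporary dataset
    if name in latest_datasets:
        return latest_datasets[name]         # committed latest dataset
    return production_datasets[name]         # production dataset
```

A developer's code simply reads the dataset by name; the same read returns the temporary version inside the developer's task and the latest or production version everywhere else.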
Additional embodiments may be disclosed and claimed.
The foregoing features of the disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
Disclosed herein is a system and method for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis, as discussed in detail below in connection with
As used herein, the terms “signal” and “signals” refer to the data elements, patterns, and calculations that have, through scientific experimentation, been proven valuable in predicting a particular outcome. Signals can be generated by the system using analytic code that can be rapidly developed, deployed, and reused. Signals carry useful information about behaviors, events, customers, systems, interactions, and attributes, and can be used to predict future outcomes. In effect, signals capture underlying drivers and patterns to create useful, accurate inputs that are capable of being processed by a machine into algorithms. High-quality signals are necessary to distill the relationships among all the entities surrounding a problem and across all the attributes (including their time dimension) associated with these entities. For many problems, high-quality signals are as important in generating an accurate prediction as the underlying machine-learning algorithm that acts upon these signals in creating the prescriptive action.
The system of the present disclosure is referred to herein as “Signal Hub.” Signal Hub enables transforming data into intelligence as analytic code and then maintaining the intelligence as signals in a computer-based production environment that allows an entire organization to access and exploit the signals for value creation. In a given domain, many signals can be similar and reusable across different use cases and models. This signal-based approach enables data scientists to “write once and reuse everywhere,” as opposed to the traditional approach of “write once and reuse never.” The system provides signals (and the accompanying analytic code) in the fastest, most cost-effective method available, thereby accelerating the development of data science applications and lowering the cost of internal development cycles. Signal Hub allows ongoing data management tasks to be performed by systems engineers, shifting more mundane tasks away from scarce data scientists.
Signal Hub integrates data from a variety of sources, which enables the process of signal creation and utilization by business users and systems. Signal Hub provides a layer of maintained and refreshed intelligence (e.g., Signals) on top of the raw data that serves as a repository for scientists (e.g., data scientists) and developers (e.g., application developers) to execute analytics. This spares users from having to go back to the raw data for each new use case; instead, they can benefit from existing signals stored in Signal Hub. Signal Hub continually extracts, stores, refreshes, and delivers the signals needed for specific applications, such that application developers and data scientists can work directly with signals rather than raw data. As the number of signals grows, the model development time shrinks. In this “bow tie” architecture, model developers concentrate on creating the best predictive models with expedited time to value for analytics. Signal Hub is highly scalable in terms of processing large amounts of data as well as supporting the implementation of a myriad of use cases. Signal Hub could be enterprise-grade, which means that in addition to supporting industry-standard scalability and security features, it is easy to integrate with existing systems and workflows. Signal Hub can also have a data flow engine that is flexible to allow processing in different computing environments, languages, and frameworks. A multi-target system data flow compiler can generate code to deploy on different target data flow engines utilizing different computing environments, languages, and frameworks. For applications with hard return on investment (ROI) metrics (e.g., churn reduction), faster time to value can equate to millions of dollars earned. Additionally, the system could lower development costs as data science project timelines potentially shrink, such as from 1 year to 3 months (e.g., a 75% improvement).
Shorter development cycles and lower development costs could result in increased accessibility of data science to more parts of the business. Further, the system could reduce the total costs of ownership (TCO) for big data analytics.
The system 10 could be web-based and remotely accessible such that the system 10 communicates through a network 20 with one or more of a variety of computer systems 22 (e.g., personal computer system 26a, a smart cellular telephone 26b, a tablet computer 26c, or other devices). Network communication could be over the Internet using standard TCP/IP communications protocols (e.g., hypertext transfer protocol (HTTP), secure HTTP (HTTPS), file transfer protocol (FTP), electronic data interchange (EDI), etc.), through a private network connection (e.g., wide-area network (WAN) connection, emails, electronic data interchange (EDI) messages, extensible markup language (XML) messages, file transfer protocol (FTP) file transfers, etc.), or any other suitable wired or wireless electronic communications format. Further, the system 10 could be in communication through a network 20 with one or more third party servers 28. These servers 28 could be disparate “compute” servers on which analytics could be performed (e.g., Hadoop, etc.). The Hadoop system can manage resources (e.g., split workload and/or automatically optimize how and where computation is performed). For example, the system could be fully or partially executed on Hadoop, a cloud-based implementation, or a stand-alone implementation on a single computer. More specifically, for example, system development could be executed on a laptop, and production could be on Hadoop, where Hadoop could be hosted in a data center.
Signals are key ingredients to solving an array of problems, including classification, regression, clustering (segmentation), forecasting, natural language processing, intelligent data design, simulation, incomplete data, anomaly detection, collaborative filtering, optimization, etc. Signals can be descriptive, predictive, or a combination thereof. For instance, Signal Hub can identify high-yield customers who have a high propensity to buy a discounted ticket to destinations that are increasing in popularity. Descriptive signals are those which use data to evaluate past behavior. Predictive signals are those which use data to predict future behavior. Signals become more powerful when the same data is examined over a (larger) period of time, rather than just an instance.
Descriptive signals could include purchase history, usage patterns, service disruptions, browsing history, time-series analysis, etc. As an example, an airline trying to improve customer satisfaction may want to know about the flying experiences of its customers, and it may be important to find out if a specific customer had his/her last flight cancelled. This is a descriptive signal that relies on flight information as it relates to customers. In this example, a new signal can be created to look at the total number of flight cancelations a given customer experienced over the previous twelve months. Signals can measure levels of satisfaction by taking into account how many times a customer was, for instance, delayed or upgraded in the last twelve months.
Descriptive signals can also look across different data domains to find information that can be used to create attractive business deals and/or to link events over time. For example, a signal may identify a partner hotel a customer tends to stay with so that a combined discounted deal (e.g., including the airline and the same hotel brand) can be offered to encourage the customer to continue flying with the same airline. This also allows for airlines to benefit from and leverage the customer's satisfaction level with the specific hotel partner. In this way, raw input data is consolidated across industries to create a specific relationship with a particular customer. Further, a flight cancelation followed by a hotel stay could indicate that the customer got to the destination but with a different airline or a different mode of transportation.
Predictive signals allow for an enterprise to determine what a customer will do next or how a customer will respond to a given event and then plan appropriately. Predictive signals could include customer fading, cross-sell/up-sell, propensity to buy, price sensitivity, offer personalization, etc. A predictive signal is usually created with a use case in mind. For example, a predictive signal could cluster customers that tend to fly on red-eye flights, or compute the propensity level a customer has for buying a business class upgrade.
Signals can be categorized into classes including sentiment signals, behavior signals, event/anomaly signals, membership/cluster signals, and correlation signals. Sentiment signals capture the collective prevailing attitude about an entity (e.g., consumer, company, market, country, etc.) given a context. Typically, sentiment signals have discrete states, such as positive, neutral, or negative (e.g., current sentiment on X corporate bonds is positive). Behavior signals capture an underlying fundamental behavioral pattern for a given entity or a given dataset (e.g., aggregate money flow into ETFs, number of “30 days past due” in last year for a credit card account, propensity to buy a given product, etc.). These signals are most often a time series and depend on the type of behavior being tracked and assessed. Event/Anomaly signals are discrete in nature and are used to trigger certain actions or alerts when a certain threshold condition is met (e.g., ATM withdrawal that exceeds three times the daily average, bond rating downgrade by a rating agency, etc.). Membership/Cluster signals designate where an entity belongs, given a dimension. For example, gaming establishments create clusters of their customers based on spending (e.g., high rollers, casual gamers, etc.), or wealth management firms can create clusters of their customers based on monthly portfolio turnover (e.g., frequent traders, buy and hold, etc.). Correlation signals continuously measure the correlation of various entities and their attributes throughout a time series of values between 0 and 1 (e.g., correlation of stock prices within a sector, unemployment and retail sales, interest rates and GDP, home prices and interest rates, etc.).
Signals have attributes based on their representation in time or frequency domains. In a time domain, a Signal can be continuous (e.g., output from a blood pressure monitor) or discrete (e.g., daily market close values of the Dow Jones Index). Within the frequency domain, signals can be defined as high or low frequency (e.g., asset allocation trends of a brokerage account can be measured every 15 minutes, daily, and monthly). Depending on the frequency of measurement, a signal derived from the underlying data can be fast-moving or slow-moving.
Signals are organized into signal sets that describe (e.g., relate to) specific business domains (e.g. customer management). Signal sets are industry-specific and cover domains including customer management, operations, fraud and risk management, maintenance, network optimization, digital marketing, etc. Signal Sets could be dynamic (e.g., continually updated as source data is refreshed), flexible (e.g., adaptable for expanding parameters and targets), and scalable (e.g., repeatable across multiple use cases and applications).
The platform architecture provides great deployment flexibility. It can be implemented on a single server as a single process (e.g., a laptop), or it can run on a large-scale Hadoop cluster with distributed processing, without modifying any code. It could also be implemented on a standalone computer. This allows scientists to develop code on their laptops and then move it into a Hadoop cluster to process large volumes of data. The Signal Hub Server architecture addresses the industry need for large-scale production-ready analytics, a need that popular tools such as SAS and R cannot fulfill even today, as their basic architecture is fundamentally main-memory-limited.
Signal Hub components include signal sets, ETL processing, dataflow engine, signal-generating components (e.g., signal-generation processes), APIs, centralized security, model execution, and model monitoring. The more use cases that are executed using Signal Hub 60, the less time it takes to actually implement them over time because the answers to a problem may already exist inside Signal Hub 60 after a few rounds of signal creation and use case implementation. Signals are hierarchical, such that within Signal Hub 60, a signal array might include simple signals that can be used by themselves to predict behavior (e.g., customer behavior powering a recommendation) and/or can be used as inputs into more sophisticated predictive models. These models, in turn, could generate second-order, highly refined signals, which could serve as inputs to business-process decision points.
The design of the system and Signal Hub 60 allows users to use a single simple expression that represents multiple expressions at different levels of data aggregation. For example, suppose there is a dataset with various IDs. Each ID could be associated with an ID type, which could in turn be associated with occurrences of an event. One level of aggregation could determine, for each ID and each ID type, the number of occurrences of the event. A second level of aggregation could determine, for each ID, the most common ID type based on the number of occurrences of the event. The system of the present disclosure allows this determination, based on multiple layers of aggregation, to be expressed as a single scalar expression returning one expected output at a time. For example, using the code category_histogram(col), the system will create a categorical histogram for a given column, with each unique value in the column being considered a category. Using the code “mode(histogram, n=1),” the system returns the category with the highest number of entries. If n>1, the n'th most common value (2nd, 3rd, . . . ) is retrieved; if n<0, the least common value (n=−1), second least common value (n=−2), etc., is retrieved. In the event several keys have equal frequencies, the smallest (if keys are numerical) or earliest (if keys are alphabetical) is returned. The following is an example of a sample input and output based on the foregoing example.
Input:
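The behavior of category_histogram and mode described above can be sketched as follows. This is a minimal illustrative implementation of the described semantics (the sample input values are hypothetical, since the original input/output table is not reproduced here): n=1 returns the most common category, n>1 the n'th most common, n<0 the n'th least common, with ties broken in favor of the smallest (numerical) or earliest (alphabetical) key.

```python
from collections import Counter

def category_histogram(col):
    """Build a categorical histogram: each unique value in the column
    is treated as a category, mapped to its number of entries."""
    return dict(Counter(col))

def mode(histogram, n=1):
    """Return the n'th most common category (n >= 1), or for n < 0 the
    n'th least common (n=-1 is least common, n=-2 second least, etc.).
    Ties are broken by the smallest (numerical) or earliest (alphabetical) key."""
    if n == 0:
        raise ValueError("n must be non-zero")
    if n > 0:
        # Most-common order: higher count first, smaller/earlier key on ties.
        ranked = sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[n - 1][0]
    # Least-common order: lower count first, smaller/earlier key on ties.
    ranked = sorted(histogram.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[-n - 1][0]
```

For example, for the column ["a", "b", "a", "c", "b", "a"], category_histogram yields {"a": 3, "b": 2, "c": 1}; mode of that histogram with n=1 returns "a", with n=2 returns "b", and with n=−1 returns "c".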
The user interface of the Workbench could include components such as a tree view 72, an analytic code development window 74, and a supplementary display portion 76. The tree view 72 displays each collection of raw data files (e.g., indicated by “Col” 73a), logical data views (e.g., indicated by “Vw” 73b), and any third-party code called as user-defined functions (e.g., Python, R, etc.). The analytic code development window 74 has a plurality of tabs including Design 78, Run 80, and Results 82. The Design tab 78 provides a space where analytic code can be written by the developer. The Run tab 80 allows the developer to run the code and generate signal sets. Finally, the Results tab 82 allows the developer to view the data produced by the operations defined in the Run tab 80.
The supplementary display portion 76 could include additional information including schemas 84 and dependencies 86. Identifying, extracting, and calculating signals at scale from noisy big data requires a set of predefined signal schemas and a variety of algorithms. A signal schema is a specific type of template used to transform data into signals. Different types of schema may be used, depending on the nature of the data, the domain, and/or the business environment. Initial signal discovery could fall into one or more of a variety of problem classes (e.g., regression, classification, clustering, forecasting, optimization, simulation, sparse data inference, anomaly detection, natural language processing, intelligent data design, etc.). Solving these problem classes could require one or more of a variety of modeling techniques and/or algorithms (e.g., ARMA, CART, CIR++, compression nets, decision trees, discrete time survival analysis, D-Optimality, ensemble model, Gaussian mixture model, genetic algorithm, gradient boosted trees, hierarchical clustering, Kalman filter, k-means, KNN, linear regression, logistic regression, Monte Carlo simulation, multinomial logistic regression, neural networks, optimization (LP, IP, NLP), Poisson mixture model, Restricted Boltzmann Machine, sensitivity trees, SVD, A-SVD, SVD++, SVM, projection on latent structures, spectral graph theory, etc.).
Advantageously, the Workbench 70 provides access to pre-defined libraries of such algorithms, so that they can be easily accessed and included in analytic code being generated. The user then can re-use analytic code in connection with various data analytics projects. Both data models and schemas can be developed within the Workbench 70 or imported from popular third-party data modeling tools (e.g., CA Erwin). The data models and schemas are stored along with the code and can be governed and maintained using modern software lifecycle tools. Typically, at the beginning of a Signal Hub project, the Workbench 70 is used by data scientists for profiling and schema discovery of unfamiliar data sources. Signal Hub provides tools that can discover schema (e.g., data types and column names) from a flat file or a database table. It also has built-in profiling tools, which automatically compute various statistics on each column of the data such as missing values, distribution parameters, frequent items, and more. These built-in tools accelerate the initial data load and quality checks.
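The built-in profiling described above (automatically computing per-column statistics such as missing values, distribution parameters, and frequent items) can be sketched in simplified form. This is an assumed illustration of the kind of statistics mentioned, not the platform's actual profiling code; the function name and output keys are hypothetical.

```python
from collections import Counter
import math

def profile_column(values, top_k=3):
    """Compute simple per-column profile statistics of the kind described:
    missing-value count, frequent items, and basic distribution parameters."""
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": missing,
        # Frequent items: the top_k most common non-missing values.
        "frequent_items": Counter(present).most_common(top_k),
    }
    # Distribution parameters only make sense for purely numeric columns.
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric and len(numeric) == len(present):
        mean = sum(numeric) / len(numeric)
        variance = sum((v - mean) ** 2 for v in numeric) / len(numeric)
        profile["mean"] = mean
        profile["stddev"] = math.sqrt(variance)
    return profile
```

Running such a profiler over every column of a newly loaded flat file or database table gives the quick data-quality overview that accelerates the initial load described above.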
Once data is loaded and discovered, it needs to be transformed from its raw form into a standard representation that will be used to feed the signals in the signal layer. Using the Workbench 70, data scientists can build workflows composed of “views” that transform the data and apply data quality checks and statistical measures. The Signal Hub platform can continuously execute these views as new data appears, thus keeping the signals up to date.
The dependencies tab 86 could display a dependency diagram (e.g., a graph) of all the activities comprising the analytic project, as discussed below in more detail. A bottom bar 88 could include compiler information, such as the number of errors and warnings encountered while processing views and signal sets.
Once the signals are selected, then in step 216, solutions and models are developed based on the signals selected. In step 218, results are evaluated and if necessary, signals (e.g., created and/or selected) and/or solutions/models are revised accordingly. Then in step 220, the solutions/models are deployed. In step 222, results are monitored and feedback gathered to incorporate back into the signals and/or solutions/models.
The Signal Hub Server is able to perform large-scale processing of terabytes of data across thousands of Signals. It follows a data-flow architecture for processing on a Hadoop cluster (e.g., Hadoop 2.0). Hadoop 2.0 introduced YARN (a large-scale, distributed operating system for big data applications), which allows many different data processing frameworks to coexist and establishes a strong ecosystem for innovating technologies. With YARN, Signal Hub Server solutions are native certified Hadoop applications that can be managed and administered alongside other applications. Signal Hub users can leverage their investment in Hadoop technologies and IT skills and run Signal Hub side-by-side with their current Hadoop applications.
Raw data is stored in the raw data database 258 of the Hadoop Data Lake 256. In step 260, Hadoop/Yarn and Signal Hub 254 process the raw data 258 with ETL (extract, transform, and load) modules, data quality management modules, and standardization modules. The results of step 260 are then stored in a staging database 262 of the Hadoop Data Lake. In step 264, Hadoop/Yarn and Signal Hub 254 process the staging data 262 with signal calculation modules, data distribution modules, and sampling modules. The results of step 264 are then stored in the Signals and Model Input database 266. In step 268, the model development and validation module 268 of the model building tools 252 processes the signals and model input data 266. The results of step 268 are then stored in the model information and parameters database 270. In step 272, the model execution module 272 of the Hadoop/Yarn and Signal Hub 254 processes signals and model input data 266 and/or model information and parameters data 270. The results of step 272 are then stored in the model output database 274. In step 276, the Hadoop/Yarn and Signal Hub 254 processes the model output data 274 with a business rules execution and output transformation for the business intelligence and case management user interface. The results of step 276 are then stored in the final output database 278. Enterprise applications 280 and business intelligence systems 282 access the final output data 278, and can provide feedback to the system, which could be integrated into the raw data 258, the staging data 262, and/or the signals and model input 266.
The Signal Hub Server automates the processing of inputs to outputs. Because of its data flow architecture, it has a speed advantage. The Signal Hub Server has multiple capabilities to automate server management. It can detect data changes within raw file collections and then trigger a chain of processing jobs to update existing Signals with the relevant data changes without transactional system support.
Signal Hub 304 allows companies to absorb information from various data sources 302 to be able to address many types of problems. More specifically, Signal Hub 304 can ingest both internal and external data as well as structured and unstructured data. As part of the Hadoop ecosystem, the Signal Hub Server can be used together with tools such as Sqoop or Flume to ingest data after it arrives in the Hadoop system. Alternatively, the Signal Hub Server can directly access any JDBC (Java Database Connectivity) compliant database or import various data formats transferred (via FTP, SFTP, etc.) from source systems.
Signal Hub 304 can incorporate existing code 318 coded in various (often non-compatible) languages (e.g., Python, R, Unix Shell, etc.), called from the Signal Hub platform as user defined functions. Signal Hub 304 can further communicate with modeling tools 320 (e.g., SAS, SPSS, etc.), such as via flat file, PMML (Predictive Model Markup Language), etc. The PMML format is a file format describing a trained model. A model developed in SAS, R, SPSS, or other tools can be consumed and run within Signal Hub 304 via the PMML standard. Advantageously, such a solution allows existing analytic code that may be written in various, non-compatible languages (e.g., SAS, SPSS, Python, R, etc.) to be seamlessly converted and integrated for use together within the system, without requiring that the existing code be re-written. Additionally, Signal Hub 304 can create tests and reports as needed. Through the Workbench, descriptive signals can be exported into a flat file for the training of predictive models outside Signal Hub 304. When the model is ready, it can then be brought back to Signal Hub 304 via the PMML standard. This feature is very useful if a specific machine-learning technique is not yet part of the model repertoire available in Signal Hub 304. It also allows Signal Hub 304 to ingest models created by clients in third-party analytic tools (including R, SAS, SPSS). The use of PMML allows Signal Hub users to benefit from a high level of interoperability among systems where models built in any PMML-compliant analytics environment can be easily consumed.
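To make the PMML exchange concrete, the following is a minimal sketch (not Signal Hub code; the PMML fragment is a simplified, hypothetical example that omits the XML namespaces and header a real PMML file carries) of parsing a regression model and scoring a record:

```python
import xml.etree.ElementTree as ET

# A minimal (hypothetical) PMML fragment describing a linear regression
# model trained outside Signal Hub, e.g., in R or SAS.
PMML_DOC = """<PMML version="4.3">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="2.5">
      <NumericPredictor name="amt" coefficient="0.8"/>
      <NumericPredictor name="tenure" coefficient="-0.1"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

def score(pmml_text, record):
    """Parse the regression table from the PMML text and score one record."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//RegressionTable")
    result = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        result += float(pred.get("coefficient")) * record[pred.get("name")]
    return result

print(score(PMML_DOC, {"amt": 10.0, "tenure": 5.0}))  # 2.5 + 8.0 - 0.5 = 10.0
```

Because the model is exchanged as data rather than code, the same fragment could be produced by any PMML-compliant tool and consumed without rewriting the original training code.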
In other words, because the system can automatically convert existing (legacy) analytic code modules/libraries into a common format that can be executed by the system (e.g., by automatically converting such libraries into PMML-compliant libraries that are compatible with other similarly compliant libraries), the system thus permits easy integration and re-use of legacy analytic code, interoperably with other modules throughout the system.
Signal Hub 304 integrates seamlessly with a variety of front-end systems 322 (e.g., use-case specific apps, business intelligence, customer relationship management (CRM) system, content management system, campaign execution engine, etc.). More specifically, Signal Hub 304 can communicate with front end systems 322 via a staging database (e.g., MySQL, HIVE, Pig, etc.). Signals are easily fed into visualization tools (e.g. Pentaho, Tableau), CRM systems, and campaign execution engines (e.g. Hubspot, ExactTarget). Data is transferred in batches, written to a special data landing zone, or accessed on-demand via APIs (application programming interfaces). Signal Hub 304 could also integrate with existing analytic tools, pre-existing code, and models. Client code can be loaded as an external library and executed within the server. All of this ensures that existing client investments in analytics can be reused with no need for recoding.
The Workbench 306 could include a workflow to process signals that includes loading 330, data ingestion and preparation 332, descriptive signal generation 334, predictive signal generation 336, use case building 338, and sending 340. In the loading step 330, source data is loaded into the Workbench 306 in any of a variety of formats (e.g., SFTP, JDBC, Sqoop, Flume, etc.). In the data ingestion and preparation step 332, the Workbench 306 provides the ability to process a variety of big data (e.g., internal, external, structured, unstructured, etc.) in a variety of ways (e.g., delta processing, profiling, visualizations, ETL, DQM, workflow management, etc.). In the descriptive signal generation step 334, a variety of descriptive signals could be generated (e.g., mathematical transformations, time series, distributions, pattern detection, etc.). In the predictive signal generation step 336, a variety of predictive signals could be generated (e.g., linear regression, logistic regression, decision tree, Naïve Bayes, PCA, SVM, deep autoencoder, etc.). In the use case building step 338, use cases could be created (e.g., reporting, rules engine, workflow creator, visualizations, etc.). In the sending step 340, the Workbench 306 electronically transmits the output to downstream connectors (e.g., APIs, SQL, batch file transfer, etc.).
As for predictive signals, training and testing of models can easily be done in the Workbench through its intuitive and interactive user interface. Current techniques available for modeling and dimensionality reduction include SVMs, k-means, decision trees, association rules, linear and logistic regression, neural networks, RBMs (restricted Boltzmann machines), PCA, and Deep AutoEncoders, which allow data scientists to train and score deep-learning nets. Some of these advanced machine-learning techniques (e.g., Deep AutoEncoder and RBM) project data from a high-dimensional space into a lower-dimensional one. These techniques are then used together with clustering algorithms to understand customer behavior.
Multiple features of the Knowledge Center facilitate accessing and consuming intelligence. The first is its filtering and searching capabilities. When signals are created, they are tagged based on metadata and organized around a taxonomy. The Knowledge Center empowers business users to explore the signals through multiple filtering and searching mechanisms.
Key components of the metadata in each signal include the business description, which explains what the signal is (e.g., number of times a customer sat in the middle seat on a long-haul flight in the past three years). Another key component of the metadata in each signal is the taxonomy, which shows each signal's classification based on its subject, object, relationship, time window, and business attributes (e.g., subject=customer, object=flight, relationship=count, time window=single period, and business attributes=long haul and middle seat).
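As an illustration (with hypothetical signal names and values), metadata following the taxonomy fields described above might be represented and filtered as follows:

```python
# Hypothetical metadata records for two signals, following the taxonomy
# fields described above (subject, object, relationship, time window,
# business attributes) plus a business description.
SIGNALS = [
    {
        "name": "middle_seat_count_3y",
        "description": "Number of times a customer sat in the middle seat "
                       "on a long-haul flight in the past three years",
        "subject": "customer",
        "object": "flight",
        "relationship": "count",
        "time_window": "single period",
        "business_attributes": {"long haul", "middle seat"},
    },
    {
        "name": "upgrade_spend_1y",
        "description": "Total amount a customer spent on upgrades "
                       "in the past year",
        "subject": "customer",
        "object": "upgrade",
        "relationship": "sum",
        "time_window": "single period",
        "business_attributes": {"premium"},
    },
]

def filter_signals(signals, **criteria):
    """Return signals whose taxonomy fields match all given criteria."""
    return [s for s in signals
            if all(s.get(k) == v for k, v in criteria.items())]

# Filtering on taxonomy fields narrows the catalog for a business user.
print([s["name"] for s in filter_signals(SIGNALS, object="flight")])
```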
The Knowledge Center facilitates exploring and identifying signals based on this metadata when executing use cases by using filtering and free-text searching. The Knowledge Center also allows for a complete visualization of all the elements involved in the analytical solution. Users can visualize how data sources connect to models through a variety of descriptive signals, which are grouped into Signal Sets depending on a pre-specified and domain-driven taxonomy. The same interface also allows users to drill into specific signals. Visualization tools can also allow a user to visualize end-to-end analytics solution components from the data, to the signals, and finally to the use cases. The system can automatically detect the high-level lineage between the data, signals, and use cases when the user hovers over specific items. The system can also allow a user to drill down further into specific data, signals, and use cases by predefined metadata, which likewise allows a user to view the high-level lineage.
The Signal Hub platform 600 also displays all the data sources that are fed into the signals of the category chosen. For example, for the “route” category, the data sources include event master, customer, clickthrough, hierarchy, car destination, ticket coupon, non-flight delivery item, booking master, holiday hotel destination, customer, ancillary master, customer membership, ref table: station pair, table: city word cloud, web session level, ref table: city info, ref table: country code, web master, redemption flight items, email notification, gold guest list, table: station pair info, customer account tcns, service recovery master, etc. A user can then choose one or more of these data sources to further filter the signals (and/or to navigate to those data sources for additional information).
The Signal Hub platform 600 also displays all the models that utilize the signals of the category chosen. For example, for the “route” category, the predictive signals within that category include hotel propensity, destination propensity, pay-for-seat propensity, upgrade propensity, etc. A user can then choose one or more of these predictive signals.
The system can further facilitate collaboration by allowing a single library to be developed by a single developer at any one point in time, which reduces code merging issues. Furthermore, the system can use source control to manage code modifications. A user can update when she wants to receive changes from her team members, and commit when she wants her changes to be visible to other developers. Each developer at a point in time can be responsible for specific views and their data assets. The owner of a view can be responsible for creating new versions of its data while other developers can only read the data that has been made public to them. Ownership can change between developers or even to a common shared user. A dedicated workspace can be created in the shared cluster which is read-only for other developers; only the owner of the workspace can write and update data. When new code and data are developed, the developer can commit the changes to the source control and publish the new data in the cluster to the other developers. This allows the other developers to see the code changes and determine if they would like to integrate them with their current work.
In most cases, users can “own” a piece of code, either independently or as a team. They can be responsible for updating and testing the code, upgrading the inputs to their code, and releasing versions to be consumed by other users downstream. Thus, if the team maintaining a given set of code needs an input upgraded, they can contact the team responsible for that code and request the relevant changes and new release. If the team upstream is not able to help, the user can change the “libraryOutputPaths” for the necessary code to a directory in which they have permissions. It involves no code changes past the small change in the environment file. If the upstream team is able to help, they can make the release. This allows collaboration with minimum disruption.
The functionality provided by the present disclosure could be provided by a Signal Hub program/engine 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single- or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the signal hub program 106 (e.g., Intel processor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
In certain exemplary embodiments, an interactive reporting tool allows the user to define, modify, and selectively execute a sequence of queries (which also may be referred to as query steps of an overall query) in an interactive manner. Generally speaking, such an interactive reporting tool provides user interface screens including an interactive query code table into which the user can configure a sequence of queries and also including a data table in which the results of a particular query or sequence of queries are displayed. The interactive query code table may be configured to include a plurality of rows, with each row representing a query. In certain exemplary embodiments, the interactive query code table may be configured to include a plurality of cells organized into rows and columns, with each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), and with each row representing a query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the query. Thus, an interactive query code table generally includes n rows representing a sequence of n queries identifiable as queries 1 through n. When the user selects a given row i or a cell in a given row i (where i is between 1 and n, inclusive), then the data table is updated to show the results of the queries 1 through i, such that the user effectively can step through the queries in any order (i.e., sequential or non-sequential, forward or backward) to see the results of each step.
Furthermore, if the user changes the query in a given row (e.g., by adding a new query parameter or changing an existing query parameter in a column/cell of the row), then the prior results from the queries associated with rows i through n may be invalidated, and at least the changed query (and typically all of the queries 1 through i) is executed in order to update the data table to include results from the changed query. Embodiments may cache the results of various queries so that unchanged queries need not be re-executed. Embodiments additionally or alternatively may place a temporary lock on a set of queries in the interactive query code table so as to limit the types of changes that can be made to the interactive query code table, e.g., allowing only the addition of new queries while the temporary lock is in place. The lock can be enforced, for example, by prompting the user upon receiving a user input that would change an existing query in the interactive query code table.
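The row-selection, caching, and invalidation behavior described above can be sketched as follows (illustrative only; each query step is modeled as a simple filter over in-memory records rather than a database query):

```python
class InteractiveQueryTable:
    """Sketch of the interactive query table: selecting row i runs query
    steps 1..i (reusing cached step results), and editing a row
    invalidates the cached results for that row and all rows after it."""

    def __init__(self, data, queries):
        self.data = data
        self.queries = list(queries)  # queries[i] is query step i+1
        self.cache = {}               # step index -> cached step result

    def select(self, i):
        """Return the result of applying query steps 0..i."""
        for step in range(i + 1):
            if step not in self.cache:
                prev = self.cache[step - 1] if step else self.data
                self.cache[step] = self.queries[step](prev)
        return self.cache[i]

    def edit(self, i, new_query):
        """Changing query i invalidates cached results for steps i..n."""
        self.queries[i] = new_query
        for step in list(self.cache):
            if step >= i:
                del self.cache[step]

data = [{"amt": 5}, {"amt": 12}, {"amt": 30}]
table = InteractiveQueryTable(data, [
    lambda rows: [r for r in rows if r["amt"] > 4],   # query step 1
    lambda rows: [r for r in rows if r["amt"] < 20],  # query step 2
])
print(len(table.select(1)))  # 2 records survive both steps
```

Because step results are cached, stepping back to an earlier row returns instantly, and only the edited step and those after it are re-executed.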
In the example shown in
With reference again to
If the user then selected cell 3512 in query 3410 as in
Thus, as discussed above, the user can switch between different query steps by moving the mouse cursor and view the result for corresponding step(s). For example, in
In order to allow for such interactive database reporting in a dynamic system in which query parameters (signals) may change from time to time, the Signal Hub platform 600 may store temporary copies of query parameters (signals) that are used in the queries so that the results can be replicated and/or manipulated using a baseline of query parameters (signals). Also, results from one or more of the queries may be cached so that, for example, when the user steps from one row/query to another row/query, the resulting data table can be produced quickly from cached data. It should be noted that the interactive database reporting tool can be implemented using virtually any database or query processing engine and therefore is not limited to any particular database or query processing engine.
In any database query system, including embodiments described above, queries are often executed against big data sets, and each “pass” made over the data set to execute such queries can take many hours to complete depending on the data set size and the number and types of queries. It is helpful, then, to minimize the number of “passes” made over the data set when executing the queries in order to optimize query execution.
Therefore, in certain exemplary embodiments, a sequence of queries is divided into stages, where each stage involves one pass over the data, such that the sequence of queries can be executed using the minimum number of passes over the data. Specifically, in certain exemplary embodiments, the sequence of queries is processed into a functional dependency graph that represents the relationships between query parameters (signals) and query operations, and the functional dependency graph is then processed to divide the queries into a number of successive stages such that each stage includes queries that can be executed based on data that exists prior to execution of that stage. A sequence of queries may, and often does, require that one or more intermediate values or datasets be generated using an aggregate function (i.e., a function that needs to pass over the entire dataset in order to compute a value) for use or consumption by another query. A query that produces a given intermediate result must be executed at a stage that is prior to execution of any query that consumes the intermediate data set. It is possible for multiple intermediate results to be created at a given stage.
By way of example, the following set of expressions involves generation and consumption of an intermediate value when converted into database queries:
sum_discount=sum(discount);
avg_amt=avg(amt);
ct_large_amt=count( ) when amt>avg_amt;
In this example, the value “avg_amt” (average amount) needs to be computed before a count can be made of the number of records having an “amt” (amount) that is greater than the average amount “avg_amt”. To express this in SQL, the developer would have to plan this in two stages.
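For instance, assuming a hypothetical orders table, the hand-planned nested form might look like the following sketch (run here against SQLite purely for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (amt REAL, discount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(10, 1), (20, 2), (30, 3), (40, 4)])

# The developer must nest the avg(amt) computation (stage 1) inside the
# query that counts records above the average (stage 2).
row = con.execute("""
    SELECT sum(discount) AS sum_discount,
           (SELECT count(*) FROM orders
             WHERE amt > (SELECT avg(amt) FROM orders)) AS ct_large_amt
    FROM orders
""").fetchone()
print(row)  # (10.0, 2): avg(amt) = 25, and two records exceed it
```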
As the number of expressions increases, it becomes more difficult for the developer to form the necessary nested queries. For example, the following set of expressions involves generation and consumption of two intermediate values when converted into database queries:
sum_discount=sum(discount);
avg_amt=avg(amt);
ct_large_amt=count( ) when amt>avg_amt;
ct_prod_with_large_amt=count_unique(prod_id) when ct_large_amt>=10;
In this example, ct_large_amt is an aggregate function that cannot be executed until avg_amt is determined, and ct_prod_with_large_amt is an aggregate function that cannot be executed until ct_large_amt is determined. To express this in SQL, the developer would have to plan this in three stages.
Therefore, in certain exemplary embodiments, a functional dependency graph is first produced from the set of expressions.
The functional dependency graph is then traversed using a specially-modified breadth-first search traversal to assign each expression to an execution stage. Breadth-first search traversal of graphs/trees is well-known for tracking each node in the graph/tree that is reached and outputting a list of nodes in breadth-first order, i.e., outputting all nodes at a given level of the graph/tree before outputting any nodes at the next level down in the graph/tree. In exemplary embodiments, when evaluating nodes at a given level of the graph/tree associated with a given execution stage, the specially-modified breadth-first search traversal will not associate a particular lower-level node with the given execution stage if that node is associated with an aggregate function. Instead, the lower-level node associated with an aggregate function is placed in a later execution stage in accordance with the specially-modified breadth-first search traversal. The following is an algorithmic description for assigning each expression to an execution stage based on the functional dependency graph, in accordance with one exemplary embodiment:
- 1. Make each input column an input node in the graph
- 2. For each given expression, transform the expression into the functional dependency graph and connect it to the inputs or other expressions.
- 3. Set PhaseID to 0
- 4. Perform the specially-modified breadth-first search (BFS) graph traversal in accordance with Steps 5 through 21.
- 5. Breadth-First-Search—Aggregates Expressions (Graph, root):
- 6.
- 7. create empty set S
- 8. create empty queue Q
- 9.
- 10. add root to S
- 11. Q.enqueue(root)
- 12.
- 13. while Q is not empty:
- 14. current=Q.dequeue( )
- 15. if current is the goal:
- 16. return current
- 17. for each node n that is adjacent to current:
- 18. if n is not in S AND n is not Aggregate function:
- 19. add n to S
- 20. n.parent=current
- 21. Q.enqueue(n)
- 22. Add emitted nodes to the current phase.
- 23. Remove nodes emitted in Step 22 from the graph, increase phaseID by 1, and repeat Step 4 until no more nodes remain in the graph.
Steps 5 through 21 listed above are based on a standard breadth-first search traversal methodology published in WIKIPEDIA but with the addition of the conditional “AND n is not Aggregate function” in order to omit lower-level aggregate functions from being included in a given execution stage. The goal of the standard BFS traversal methodology is to queue/output the nodes from the graph in breadth-first order, i.e., the root, followed by all nodes one layer down from the root, followed by all nodes two layers down from the root, and so on. Literally applying this specially-modified BFS traversal methodology to the graph shown in
Thus, this particular specially-modified BFS traversal methodology is presented for example only, and implementations may use a variation of this methodology or other, suitably-modified, BFS traversal methodology.
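One such variation can be sketched as follows (an illustrative rendering, not the literal pseudocode above): each expression node records its dependencies and whether it is an aggregate; a scalar expression runs in the same stage as its latest dependency, while an aggregate must wait one stage beyond it. The expression names below restate the running example, with the intermediate scalar comparisons (amt_gt_avg, large_amt_flag) made explicit as hypothetical nodes.

```python
# A variation on the staged traversal: stages are derived from the
# functional dependencies rather than by a literal BFS over the graph.
def assign_stages(nodes):
    """nodes: {name: (deps, is_aggregate)}; raw input columns have no
    entry and are treated as stage 0."""
    stages = {}

    def stage_of(name):
        if name not in nodes:          # raw input column
            return 0
        if name not in stages:
            deps, is_agg = nodes[name]
            base = max((stage_of(d) for d in deps), default=0)
            # An aggregate needs a full pass over data that must already
            # exist, so it lands one stage after its latest dependency.
            stages[name] = base + 1 if is_agg else base
        return stages[name]

    for name in nodes:
        stage_of(name)
    return stages

EXPRESSIONS = {
    "sum_discount":           (["discount"], True),
    "avg_amt":                (["amt"], True),
    "amt_gt_avg":             (["amt", "avg_amt"], False),
    "ct_large_amt":           (["amt_gt_avg"], True),
    "large_amt_flag":         (["ct_large_amt"], False),
    "ct_prod_with_large_amt": (["prod_id", "large_amt_flag"], True),
}

# sum_discount/avg_amt: stage 1; ct_large_amt: stage 2;
# ct_prod_with_large_amt: stage 3 -- three passes, as in the text.
print(assign_stages(EXPRESSIONS))
```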
Once the expressions are assigned to different stages, database queries are automatically generated in accordance with known methods so that the expressions can be executed in accordance with the stages. The following is example SQL code for executing the staged expressions from the example above:
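By way of an illustrative sketch (assuming a hypothetical orders table grouped by cust_id, and run against SQLite rather than a Hadoop engine), the staged expressions from the example above might be materialized as flat queries joined on the grouping key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders "
            "(cust_id TEXT, prod_id INT, amt REAL, discount REAL)")
rows = [("A", i % 3, float(i), 0.5) for i in range(1, 21)]
rows += [("B", i, float(i), 1.0) for i in range(1, 5)]
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# Phase 1: aggregates computable directly from the inputs.
con.execute("""CREATE TABLE phase1 AS
    SELECT cust_id, sum(discount) AS sum_discount, avg(amt) AS avg_amt
    FROM orders GROUP BY cust_id""")

# Phase 2: joins the phase 1 result back to the inputs on the grouping key.
con.execute("""CREATE TABLE phase2 AS
    SELECT o.cust_id, count(*) AS ct_large_amt
    FROM orders o JOIN phase1 p ON o.cust_id = p.cust_id
    WHERE o.amt > p.avg_amt GROUP BY o.cust_id""")

# Phase 3: consumes the phase 2 intermediate.
result = con.execute("""
    SELECT o.cust_id, count(DISTINCT o.prod_id) AS ct_prod_with_large_amt
    FROM orders o JOIN phase2 p ON o.cust_id = p.cust_id
    WHERE p.ct_large_amt >= 10 GROUP BY o.cust_id""").fetchall()
print(result)  # [('A', 3)]: only customer A has 10+ above-average amounts
```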
Here, each phase is a single flat query with the same grouping key as the original grouping. The result of each phase is joined with the inputs using the grouping key and made available for the next phase. Any relational engine can then execute these queries in a pipeline using the relational schema illustrated by the SQL syntax above. It should be noted that this algorithm is applicable not only to databases that support SQL, but to any relational engine or data flow engine such as Apache Spark or Apache Tez.
While the example above uses simple aggregate functions such as sum, count, count_unique, and simple scalar functions such as “>=” and “>”, this schema can be easily expanded to any aggregate and scalar functions, where aggregate functions are functions that need to pass over the entire data in order to compute a value and scalar functions can be computed on each record. Using this designation, functions can be categorized appropriately and the schema can work on any set of expressions. Thus, starting with a simple non-nested set of expressions that have inter-dependency, a relational nested representation of the computation is produced.
In certain exemplary embodiments, the functional dependency graph is produced by processing each expression as follows. An output parameter and a set of input parameters associated with the scalar expression are identified, as in block 4012. A node for the output parameter is created in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph, as in block 4014. For each input parameter in the set of input parameters, a distinct node is created in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph, as in block 4016. An association is established in the functional dependency graph between the output parameter node and each of the distinct input parameter nodes, as in block 4018.
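A minimal sketch of this graph-construction procedure (illustrative; the block numbers in the comments refer to the description above, and the graph is represented simply as a dict of adjacency sets):

```python
# Build the functional dependency graph from (output, inputs) pairs.
def build_dependency_graph(expressions):
    """expressions: iterable of (output_name, [input_names]) pairs."""
    graph = {}
    for output, inputs in expressions:          # blocks 4012-4018
        graph.setdefault(output, set())         # block 4014: output node
        for name in inputs:
            graph.setdefault(name, set())       # block 4016: input node
            graph[output].add(name)             # block 4018: association
    return graph

g = build_dependency_graph([
    ("avg_amt", ["amt"]),
    ("ct_large_amt", ["amt", "avg_amt"]),
])
print(sorted(g["ct_large_amt"]))  # ['amt', 'avg_amt']
```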
In certain specific exemplary embodiments, the Signal Hub server 600 may be configured to divide a sequence of queries into stages such that the sequence of queries can be executed using the minimum number of passes over the data set. For example, the Workbench discussed herein may provide user interface screens through which the user can enter a sequence of expressions that are then divided into stages and converted into database queries (e.g., SQL queries), as discussed above. Additionally or alternatively, the Signal Hub server 600 may apply query execution optimization as discussed herein to any set of queries, such as, for example, queries executed as part of the interactive reporting tool discussed above.
As discussed above, e.g., with reference to
For example, with reference again to
In typical coding environments, a source control system is used to allow developers to make code changes without affecting each other's work and without affecting the production code, until changes are ready to be committed for production. Source control systems allow each developer to work on a separate version of the code, but when development is done and the changes can be shared with others, the code can be committed into the shared version. However, in data processing environments, while source control systems can be used to allow for source control, versioning, and collaboration on the code, large datasets cannot be versioned efficiently via the source control system. In addition, because of the dependency graph shown in
To resolve these challenges, certain exemplary embodiments provide code and data versioning based on the concept of “views” and “workflows,” where a view is defined by relational processing logic (e.g., a query) that consumes data from other views or data sources and persists the processed data into files (e.g., on the append-only file system), and a workflow is a directed graph of views where the nodes are views and the edges between views represent that one view reads from another view (e.g., an edge from a view v1 to a later view v2 describes that v2 reads from v1). A workflow engine can traverse a workflow graph and execute each view in topological order.
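Such a workflow traversal can be sketched as follows (illustrative only; view execution is reduced to a callback, and the dependency bookkeeping uses Python's standard graphlib):

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(reads_from, execute):
    """reads_from: {view: set of views it reads from};
    execute: callable invoked once per view, in topological order,
    so every view runs only after the views it reads from."""
    for view in TopologicalSorter(reads_from).static_order():
        execute(view)

# v2 reads from v1; v3 reads from both v1 and v2.
order = []
run_workflow({"v2": {"v1"}, "v3": {"v1", "v2"}, "v1": set()},
             order.append)
print(order)  # v1 executes before v2, which executes before v3
```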
The code and data versioning method can be summarized as follows:
- 1. The source code of the views is managed by a source control system.
- 2. Each developer has a separate workspace that contains a snapshot of the code from the source control system.
- 3. Production workflows will share the same workspace.
- 4. When a developer makes a code change, the code change will have to be committed before other workspaces can be updated (via a separate action) to reflect the code change.
- 5. Before a developer can make a code change, a lock is placed on the view, allowing only a single developer at a time to work on a given dataset.
- 6. From the moment a developer makes a code change until the time the code changes are committed, the developer is associated with a Task that describes the activity for which the code changes are made.
These steps involve standard source control operations that are supported in one way or another by various source control systems (e.g., SVN, Git), such as Start Task, Finish Task, and Update Workspace. When a user starts a new task, any change to the code will invoke a lock on the code file with the changes and prevent other users from changing this code file. Once a user finishes a task, all code is committed into the shared version and all locks are released. The Update Workspace operation updates the developer workspace files from the shared version.
In order to support code and data versioning using shared datasets, the following additional steps are involved:
- 7. Each View has a write operation and read operation. These operations must support adding a version while producing the output data. If a new version name is given to the write operation, the entire data will be output to a new sub-directory with this version name. If the same version name is given, then the data in that version will be overwritten by the write operation.
- 8. Production workflows will all use a “no version” value for all read/write operations.
- 9. When a developer runs a definition from a workspace, if the definition includes changes, the version will receive the Task name that is currently associated with the changes. If the definition is unchanged and represents the latest version of the code in the source control system, then the data version will be referred to as “latest”.
- 10. The system will maintain, for each developer workspace, a version map that stores which views have been executed for the developer and what was the version value for that view. Any read operation on the view while executing subsequent views from the developer workspace will be instrumented by the version value. In addition, views that are not executed from the workspace will be read without a version and therefore will be read from the production data. When finishing a task and committing the code, the map can be cleared.
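Steps 7 through 10 can be sketched as follows (a simplified, hypothetical rendering: datasets are directories named by version, and a per-workspace version map decides which version a read resolves to, falling back to the unversioned production data):

```python
import os
import tempfile

class VersionedStore:
    """Sketch of versioned view data: a write records the version a view
    produced in the workspace's version map, and a read resolves to that
    version or falls back to production (Step 10). Names illustrative."""

    def __init__(self, root):
        self.root = root
        self.version_map = {}   # per-workspace: view -> version written

    def write(self, view, version, data):
        # Step 7: a versioned write outputs to a sub-directory named
        # after the version; "no version" writes go to production.
        path = os.path.join(self.root, view, version or "production")
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "data.txt"), "w") as f:
            f.write(data)
        if version:
            self.version_map[view] = version

    def read(self, view):
        # A view executed in this workspace is read at its recorded
        # version; otherwise it is read from the production data.
        version = self.version_map.get(view, "production")
        with open(os.path.join(self.root, view, version, "data.txt")) as f:
            return f.read()

    def finish_task(self, view):
        # Committing renames the task version to "latest" and clears
        # the version map entry (Step 10).
        task = self.version_map.pop(view)
        os.rename(os.path.join(self.root, view, task),
                  os.path.join(self.root, view, "latest"))

store = VersionedStore(tempfile.mkdtemp())
store.write("V1", None, "production data")
store.write("V1", "Task1", "task data")
print(store.read("V1"))   # task data (workspace ran V1 under Task1)
store.finish_task("V1")
print(store.read("V1"))   # production data again (version map cleared)
```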
In order to accomplish such code and data versioning, the Start Task, Finish Task, and Update Workspace operations are updated as follows. Once a user starts a new task, any change to the code will invoke a lock on the file with the changes and prevent other users from changing this code file, and locked changed code will also change the version as described in Step 7 while a task is active. Once a user finishes a task, all code is committed into the shared version and all locks are released, and then the version for changed definitions will switch back to “latest.” The Update Workspace operation will update the developer workspace files from the shared version.
By way of example, imagine that there is a view V1 containing the production code and data. A production process in view V2 includes code that contains an instruction to read from view V1. When the production process in view V2 is executed with no version specified, the system automatically reads from the “no version” production data in view V1, as shown schematically in
Imagine instead that view V2 is associated with a developer who has made code changes in view V2 but no changes to code or data associated with view V1. The code in view V2 again contains an instruction to read from view V1. A temporary data set is created in view V1 for use by the developer in view V2 (i.e., so that any production changes that occur do not affect the developer's work, and any changes made by the developer do not affect the production data). When the development process in view V2 is executed with no version specified, the system automatically reads from the temporary dataset rather than from the production dataset, even though the code was not modified to read from the temporary dataset (i.e., the system automatically correlates the temporary dataset with the development view). When the developer checks in the code changes from view V2, the temporary dataset in view V1 is renamed to be the “latest,” as shown schematically in
Imagine instead that view V2 is associated with a developer assigned to “Task 1” who has made code changes in view V2 and also in view V1 (e.g., to add a new column to, or change a column definition in, a table in view V1). In this case, a lock is placed on the code file associated with view V1 to prevent other developers from making conflicting changes, and a temporary dataset is created in view V1 for use by Task 1. The temporary dataset is named Task 1 so that it is correlated with the Task 1 development view, i.e., view V2. When the development process in view V2 is executed with no version specified, the system automatically reads from the Task 1 temporary dataset rather than from the production dataset, as shown schematically in
One major advantage of this code and data versioning methodology is that developers do not have to modify code to read from a specific dataset. Rather, the system automatically determines the dataset to use for a given development task from among a production dataset, a latest dataset, or a temporary dataset associated with the development task.
In certain exemplary embodiments, code and data versioning, as well as other data storage operations, utilize an append-only file system such as the Hadoop distributed file system (HDFS). In order to support a relational database, insert, update and delete record operations must be supported over a table of records. Typically, this requires the database to update a file stored within the underlying file system on which the database is persisted. However, append-only file systems, such as HDFS, generally do not provide update operations, i.e., one cannot update an existing file in the file system. For append-only file systems, the only change operations allowed are create file, append to file, delete file, create directory, and delete directory. For this reason, databases (e.g., Apache Hive) that are implemented over the append-only file system generally do not support update operations.
Another problem that typically occurs is the need to perform a recovery or a rollback when an error is found after a change to a table. Database systems typically support transactions to abort changes performed to tables (e.g., akin to an “undo” function). However, such transactions generally have a short lifespan (e.g., must be performed soon after the unwanted change is made) and are not adequate for large changes on multiple tables. Such transactions are generally also built on file system update support that allows the database system to update “dirty” records and clear them when the records are committed, which is generally not possible with append-only file systems.
Therefore, certain exemplary embodiments provide support for update and delete operations in an append-only file system. Specifically, a folder is created for each view, and a new subfolder is created for each update to the view. Each subfolder is associated with a timestamp indicating the time the update was made to the view. The following is an example of a folder structure for a view named “View 1”:
Directory “View 1”
- Directory “timestamp1”
  - File “datafile1”
  - File “datafile2”
- Directory “timestamp2”
  - File “datafile4”
  - File “datafile7”
- Directory “timestamp3”
  - File “datafile6”
  - File “datafile9”
For every execution of the view, a new timestamp is generated, and new data is inserted into files in the new directory. This structure on its own supports appending new records, since they go into new files, but still does not support update or delete operations.
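The layout above can be sketched with ordinary directories standing in for the append-only file system: each execution writes its records into data files under a fresh timestamp directory, and no existing file is ever modified. Paths, file names, and the JSON record format are illustrative assumptions.

```python
import json
import os
import tempfile

def append_records(view_dir, records, timestamp):
    """Write `records` as a new data file under a new timestamp directory.

    Only create operations are used -- existing files are never updated,
    mirroring the append-only constraint described in the text.
    """
    ts_dir = os.path.join(view_dir, timestamp)
    os.makedirs(ts_dir, exist_ok=True)
    path = os.path.join(ts_dir, "datafile1")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return ts_dir

view = os.path.join(tempfile.mkdtemp(), "View 1")
append_records(view, [{"customerID": 1, "active": True}], "timestamp1")
append_records(view, [{"customerID": 2, "active": True}], "timestamp2")

# Each execution produced its own timestamp directory; nothing was rewritten.
assert sorted(os.listdir(view)) == ["timestamp1", "timestamp2"]
```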
Therefore, in certain exemplary embodiments, update and delete operations always work on a primary unique key that can identify the unique logical record being changed. For example, a table of customers may have a unique customer identifier (customerID) representing the customer entity. When an update operation needs to change a certain field in a record, it provides the key for the operation, the field to change, and the new value for the field. The following is an example of such an update operation:
Update Customer where customerID=1, Set Active=False
This operation changes the value of the field named ‘Active’ to “False” for the record that belongs to the Customer with ID 1.
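Under the append-only constraint, the update above can be expressed as appending a full replacement record for the same logical key rather than modifying the stored record. The following sketch (function name and log representation are illustrative assumptions) shows this translation; field names follow the Customer example in the text.

```python
def apply_update(log, key_field, key, field, value):
    """Append a new version of the record instead of updating in place.

    `log` is the ordered list of appended records, oldest first. The most
    recent record for the key is copied, the field is changed, and the
    result is appended; the old record remains in the log untouched.
    """
    current = {key_field: key}
    for rec in log:
        if rec[key_field] == key:
            current = dict(rec)  # latest version seen so far
    current[field] = value
    log.append(current)  # append-only: no record is modified in place

# Update Customer where customerID=1, Set Active=False
log = [{"customerID": 1, "active": True}]
apply_update(log, "customerID", 1, "active", False)

assert log[-1] == {"customerID": 1, "active": False}  # new appended version
assert len(log) == 2  # the original record is still present in the log
```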
Due to the lack of update operations in the append-only file system, exemplary embodiments use the append operation to save the update, appending a new record representing the updated record. For example, assume that the value of the field named ‘Active’ is initially set to “True” at Time 1 and is later changed to “False” at Time 3 for the record that belongs to the Customer with ID 1. A data file containing the record with Active=True would be added to a subfolder associated with the timestamp for Time 1 and a separate data file containing the record with Active=False would be added to a subfolder associated with the timestamp for Time 3. The following is an example of the folder structure containing the updated record:
Directory “View 1”
- Directory “timestamp1”
  - File “datafile1”
  - File “datafile2” (Customer ID=1, active=True: first appearance of this record)
- Directory “timestamp2”
  - File “datafile4”
  - File “datafile7”
- Directory “timestamp3”
  - File “datafile6”
  - File “datafile9” (Customer ID=1, active=False: second appearance of this record)
A read operation that reads only the last timestamp will only have the updated records of the last update, which represents only partial data. On the other hand, reading all of the files from all timestamps will contain duplicate records with inconsistent data for the same logical key. In order to solve this, exemplary embodiments read all data from all timestamps and then de-duplicate the records by taking the “latest” record associated with each logical key. This operation can be done efficiently if the data is partitioned by logical key, and the de-duplication can be done in memory with only slight overhead over normal read operations.
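The de-duplicating read just described can be sketched as follows: scan the records from all timestamp directories in timestamp order and keep only the latest record for each logical key. The function name and the (timestamp, record) representation are illustrative assumptions.

```python
def read_latest(timestamped_records, key_field):
    """De-duplicate by logical key, keeping the latest record per key.

    `timestamped_records` is a list of (timestamp, record) pairs gathered
    from all timestamp directories of the view.
    """
    latest = {}
    for _, rec in sorted(timestamped_records, key=lambda tr: tr[0]):
        latest[rec[key_field]] = rec  # later timestamps overwrite earlier
    return list(latest.values())

data = [
    ("timestamp1", {"customerID": 1, "active": True}),
    ("timestamp1", {"customerID": 2, "active": True}),
    ("timestamp3", {"customerID": 1, "active": False}),  # update of customer 1
]
result = read_latest(data, "customerID")

assert {"customerID": 1, "active": False} in result  # latest version wins
assert len(result) == 2  # exactly one record per logical key
```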
In order to support a delete operation, a ‘delete record’ is appended into the data in a manner similar to the update operation, and the read operation method is modified. A delete record is a record with the same key as the record to be deleted but with a special field called ‘delete’ that is set to true. The read operation now reads all data from all timestamps, de-duplicates the records by taking the “latest” record associated with each logical key, and then filters the resulting records based on the ‘delete’ field to keep or utilize only records with ‘delete’=false. The following is an example of the folder structure containing the deleted record:
Directory “View 1”
- Directory “timestamp1”
  - File “datafile1”
  - File “datafile2” (Customer ID=1, active=True: first appearance of this record)
- Directory “timestamp2”
  - File “datafile4”
  - File “datafile7”
- Directory “timestamp3”
  - File “datafile6”
  - File “datafile9” (Customer ID=1, active=False: second appearance of this record)
- Directory “timestamp4”
  - File “datafile10”
  - File “datafile11” (Customer ID=1, delete=True: third appearance of this record)
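The delete-aware read can be sketched by extending the de-duplicating read with a final filter: a ‘delete record’ is simply an appended record carrying delete=True, and any logical key whose latest record is a delete record is filtered out of the result. The function name and record representation are illustrative assumptions.

```python
def read_with_deletes(timestamped_records, key_field):
    """De-duplicate by key, then drop keys whose latest record is a delete."""
    latest = {}
    for _, rec in sorted(timestamped_records, key=lambda tr: tr[0]):
        latest[rec[key_field]] = rec  # later timestamps overwrite earlier
    # Keep only records whose latest version is not a delete record.
    return [r for r in latest.values() if not r.get("delete", False)]

data = [
    ("timestamp1", {"customerID": 1, "active": True}),
    ("timestamp3", {"customerID": 1, "active": False}),
    ("timestamp4", {"customerID": 1, "delete": True}),  # delete record
    ("timestamp4", {"customerID": 2, "active": True}),
]
result = read_with_deletes(data, "customerID")

# Customer 1's latest record is a delete record, so only customer 2 survives.
assert result == [{"customerID": 2, "active": True}]
```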
It should be noted that a periodic “compression” can be run over the folder in order to eliminate old updates and deleted records by writing a new snapshot that contains all logical keys in a single timestamp directory. One method to achieve this is to read and process the entire table as discussed above, write the resulting records back into a new timestamp directory, and delete all previous directories. The following is an example of the folder structure following compression:
Directory “View 1”
- Directory “timestamp5”
  - File “datafile12”
  - File “datafile13” (excludes record with Customer ID=1)
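The periodic compression can be sketched as: resolve the full table (de-duplicate and drop deleted keys), write the surviving records into one new timestamp directory, and discard the old directories. The function name and representation are illustrative assumptions.

```python
def compact(timestamped_records, key_field, new_timestamp):
    """Collapse all timestamp directories into one snapshot directory.

    Returns the snapshot as (new_timestamp, record) pairs; in a real system
    these would be written out and all prior directories deleted.
    """
    latest = {}
    for _, rec in sorted(timestamped_records, key=lambda tr: tr[0]):
        latest[rec[key_field]] = rec  # keep latest record per logical key
    survivors = [r for r in latest.values() if not r.get("delete", False)]
    return [(new_timestamp, rec) for rec in survivors]

data = [
    ("timestamp1", {"customerID": 1, "active": True}),
    ("timestamp3", {"customerID": 1, "active": False}),
    ("timestamp4", {"customerID": 1, "delete": True}),
    ("timestamp4", {"customerID": 2, "active": True}),
]
snapshot = compact(data, "customerID", "timestamp5")

# The snapshot holds one record per surviving key, all under timestamp5;
# the record for the deleted Customer ID=1 is gone entirely.
assert snapshot == [("timestamp5", {"customerID": 2, "active": True})]
```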
From time to time, it might be necessary or desirable to roll back a workflow to an earlier view. Consider a workflow graph in which view V2 reads from view V1 and view V3 reads from view V2, for example, as depicted in
It should be noted that these described methods for updating and deleting versions in an append-only file system can be implemented in any data flow engine or database query processor that implements read/write relational data operations and uses an append-only file system.
Various embodiments of the present invention may be characterized by the potential claims listed in the paragraphs following this paragraph (and before the actual claims provided at the end of this application). These potential claims form a part of the written description of this application. Accordingly, subject matter of the following potential claims may be presented as actual claims in later proceedings involving this application or any application claiming priority based on this application. Inclusion of such potential claims should not be construed to mean that the actual claims do not cover the subject matter of the potential claims. Thus, a decision to not present these potential claims in later proceedings should not be construed as a donation of the subject matter to the public.
Without limitation, potential subject matter that may be claimed (prefaced with the letter “P” so as to avoid confusion with the actual claims presented below) includes:
P1. A computer-implemented method for interactive database query reporting, the method comprising:
causing display, by a server, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;
receiving, by the server, a first user input associated with a given row i of the interactive query code table; and
displaying, by the server, a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.
P2. The method of claim P1, further comprising:
receiving, by the server, a second user input making a change to the database query associated with the given row i;
executing, by the server, at least the changed database query associated with the given row i; and
displaying, by the server, a second data table including the results from execution of the changed database query i.
P3. The method of claim P2, further comprising:
invalidating, by the server, prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.
P4. The method of claim P2, wherein executing at least the changed database query associated with the given row comprises:
executing, by the server, the queries associated with rows 1 through i including the changed database query associated with the given row i.
P5. The method of claim P2, further comprising:
placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and
allowing the user to override the temporary lock to provide the second user input.
P6. The method of claim P5, further comprising:
allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.
P7. The method of claim P1, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and
the first user input includes a selection of a given cell in the given row.
P8. The method of claim P2, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;
the first user input includes a selection of a given cell in the given row; and
the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.
P9. A system for interactive database query reporting, the system comprising:
a computer system having stored thereon and executing interactive database query reporting computer processes comprising:
causing display, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;
receiving a first user input associated with a given row i of the interactive query code table; and
displaying a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.
P10. The system of claim P9, wherein the interactive database query reporting computer processes further comprise:
receiving a second user input making a change to the database query associated with the given row i;
executing at least the changed database query associated with the given row i; and
displaying a second data table including the results from execution of the changed database query i.
P11. The system of claim P10, wherein the interactive database query reporting computer processes further comprise:
invalidating prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.
P12. The system of claim P10, wherein executing at least the changed database query associated with the given row comprises:
executing the queries associated with rows 1 through i including the changed database query associated with the given row i.
P13. The system of claim P10, wherein the interactive database query reporting computer processes further comprise:
placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and
allowing the user to override the temporary lock to provide the second user input.
P14. The system of claim P13, wherein the interactive database query reporting computer processes further comprise:
allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.
P15. The system of claim P9, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and
the first user input includes a selection of a given cell in the given row.
P16. The system of claim P10, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;
the first user input includes a selection of a given cell in the given row; and
the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.
P17. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for interactive database query reporting, which, when run on a computer system, causes the computer system to execute interactive database query reporting computer processes comprising:
causing display, on a display screen of a computer of a user, of a first graphical user interface screen including an interactive query code table having a plurality of rows, each row representing a database query, the interactive query code table including n rows identifiable as rows 1 through n and representing a sequence of n database queries identifiable as database queries 1 through n, wherein n is greater than one;
receiving a first user input associated with a given row i of the interactive query code table; and
displaying a first data table including results from execution of queries 1 through i, such that the server enables the user to select any given database query in the interactive query code table to view results through the given database query.
P18. The computer program product of claim P17, wherein the interactive database query reporting computer processes further comprise:
receiving a second user input making a change to the database query associated with the given row i;
executing at least the changed database query associated with the given row i; and
displaying a second data table including the results from execution of the changed database query i.
P19. The computer program product of claim P18, wherein the interactive database query reporting computer processes further comprise:
invalidating prior results from the database queries associated with rows i through n prior to executing at least the changed database query associated with the given row and displaying the second data table.
P20. The computer program product of claim P18, wherein executing at least the changed database query associated with the given row comprises:
executing the queries associated with rows 1 through i including the changed database query associated with the given row i.
P21. The computer program product of claim P18, wherein the interactive database query reporting computer processes further comprise:
placing a temporary lock on the n rows of the interactive query code table to prevent the user from making inadvertent changes; and
allowing the user to override the temporary lock to provide the second user input.
P22. The computer program product of claim P21, wherein the interactive database query reporting computer processes further comprise:
allowing the user to enter additional queries into additional rows of the interactive query code table but not change existing queries when the temporary lock is in place.
P23. The computer program product of claim P17, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query; and
the first user input includes a selection of a given cell in the given row.
P24. The computer program product of claim P18, wherein:
the interactive query code table includes a plurality of cells organized into rows and columns, each column representing a distinct query parameter (signal) from among a plurality of query parameters (signals), each row representing a database query involving at least one of the distinct query parameters (signals) by way of a query operator in the column/cell corresponding to each distinct query parameter (signal) involved in the database query;
the first user input includes a selection of a given cell in the given row; and
the second user input includes a change to the contents of the given cell to add a new query operator to the given cell or change a prior query operator in the given cell associated with query i.
P25. A computer-implemented method for converting scalar expressions to relational database queries applied to a dataset, the method comprising:
producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;
assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and
converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.
P26. The method of claim P25, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.
P27. The method of claim P25, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.
P28. The method of claim P25, wherein producing the functional dependency graph comprises, for each scalar expression:
identifying an output parameter and a set of input parameters associated with the scalar expression;
creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;
creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and
establishing, in the functional dependency graph, an association between the output parameter node and each of the distinct input parameter nodes.
P29. The method of claim P28, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.
P30. The method of claim P25, where assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:
traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower-level of the functional dependency graph;
assigning each expression to a given stage based on the breadth first traversal.
P31. The method of claim P25, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:
generating a sequence of structured query language commands.
P32. The method of claim P25, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.
P33. A system for converting scalar expressions to relational database queries for application to a dataset, the system comprising:
a computer system having stored thereon and executing computer processes comprising:
producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;
assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and
converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.
P34. The system of claim P33, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.
P35. The system of claim P33, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.
P36. The system of claim P33, wherein producing the functional dependency graph comprises, for each scalar expression:
identifying an output parameter and a set of input parameters associated with the scalar expression;
creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;
creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and
establishing, in the functional dependency graph, an association between the output parameter node and each of the distinct input parameter nodes.
P37. The system of claim P36, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.
P38. The system of claim P33, where assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:
traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower-level of the functional dependency graph;
assigning each expression to a given stage based on the breadth first traversal.
P39. The system of claim P33, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:
generating a sequence of structured query language commands.
P40. The system of claim P33, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.
P41. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for converting scalar expressions to relational database queries for application to a dataset, which, when run on a computer system, causes the computer system to execute interactive database query reporting computer processes comprising:
producing a functional dependency graph representing the scalar expressions and any interdependencies between the scalar expressions;
assigning each scalar expression to one of a plurality of successive execution stages identifiable as stages 1 through n based on the functional dependency graph such that each execution stage includes at least one of the scalar expressions, wherein the at least one scalar expression associated with any given stage does not require results from a subsequent stage of the plurality of successive stages; and
converting the scalar expressions at each stage into one or more relational database queries to create a sequence of relational database queries, wherein each execution stage involves at most one pass through the dataset.
P42. The computer program product of claim P41, wherein at least one stage involves generation of a temporary data set used in at least one subsequent stage.
P43. The computer program product of claim P41, wherein each stage includes the maximum number of possible expressions that can be executed at that stage.
P44. The computer program product of claim P41, wherein producing the functional dependency graph comprises, for each scalar expression:
identifying an output parameter and a set of input parameters associated with the scalar expression;
creating a node for the output parameter in the functional dependency graph if the node for the output parameter does not exist in the functional dependency graph;
creating, for each input parameter in the set of input parameters, a distinct node in the functional dependency graph if the distinct node for the input parameter does not exist in the functional dependency graph; and
establishing, in the functional dependency graph, an association between the output parameter node and each of the distinct input parameter nodes.
P45. The computer program product of claim P44, wherein the association includes an operator node when the output parameter node is dependent on a combination of two or more distinct input parameter nodes, the operator node representing an operator to be performed on the two or more distinct input parameters.
P46. The computer program product of claim P41, where assigning each scalar expression to one of the plurality of successive stages based on the functional dependency graph comprises:
traversing the functional dependency graph using a breadth first traversal configured to exclude, from each given execution stage, any aggregate expression at a lower-level of the functional dependency graph;
assigning each expression to a given stage based on the breadth first traversal.
P47. The computer program product of claim P41, wherein converting the nodes at each stage into one or more relational database queries to create a sequence of relational database queries comprises:
generating a sequence of structured query language commands.
P48. The computer program product of claim P41, wherein assigning each scalar expression to one of the plurality of successive execution stages based on the functional dependency graph ensures the minimum number of passes through the dataset for execution of the scalar expressions.
P49. A system for rapid development and deployment of reusable analytic code for use in computerized data modeling and analysis comprising:
a computer system having stored thereon and executing computer processes for implementing a signal hub, the computer processes comprising:
a signal hub engine configured to generate and monitor a set of named signals from a plurality of data sources to provide a reusable signal layer of maintained and refreshed named signals on top of the source data for consumption by analytic code applications; and
a graphical user interface configured to allow users to define signal categories and relationships used by the signal hub engine to generate the set of named signals, explore lineage and dependencies of the named signals in the signal layer, monitor and manage the signal layer including recovery from issues identified by monitoring of the named signals by the signal hub engine, and create and execute analytic code applications that utilize the named signals.
P50. The system of claim P49, wherein the set of named signals includes descriptive signals and predictive signals.
P51. The system of claim P49, wherein the signal hub engine is configured to generate the set of named signals based on combinations of signal categories including entity, transformation, attribute, and time frame.
P52. The system of claim P51, wherein the signal hub engine is configured to associate each named signal with a name that is automatically generated for the signal based on the source data used to generate the named signal.
P53. The system of claim P49, wherein the signal hub engine is further configured to store, for each named signal, metadata providing lineage information for the named signal, and to provide the metadata for consumption by analytic code applications.
P54. The system of claim P49, wherein the graphical user interface is configured to categorize a plurality of named signals based on taxonomies and allow users to search for named signals based on the taxonomies.
P55. The system of claim P49, wherein the signal hub engine is configured to automatically detect changes from the data sources and update the set of named signals based on relevant data changes without transactional system support.
P56. The system of claim P49, wherein the signal hub engine is configured to enable a named signal to be created from at least one other previously created named signal.
It should be understood by persons of ordinary skill in the art that the term “computer process” as used herein is the performance of a described function in a computer using computer hardware (such as a processor, field-programmable gate array or other electronic combinatorial logic, or similar device), which may be operating under control of software or firmware or a combination of any of these or operating outside control of any of the foregoing. All or part of the described function may be performed by active or passive electronic components, such as transistors or resistors. A “computer process” does not necessarily require a schedulable entity, or operation of a computer program or a part thereof, although, in some embodiments, a computer process may be implemented by such a schedulable entity, or operation of a computer program or a part thereof. Furthermore, unless the context otherwise requires, a “process” may be implemented using more than one processor or more than one (single- or multi-processor) computer.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure.
Claims
1. A computer-implemented method for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, the method comprising:
- maintaining a first view having a production version of a dataset;
- creating a task for a developer, the task being associated with a second view;
- associating a first code file with the task for the first view, the first code file including code that modifies the dataset;
- creating a temporary version of the dataset in the first view;
- associating the temporary version of the dataset with the task;
- associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and
- upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
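The resolution rule recited in claim 1 can be sketched in a few lines of Python: a read that does not name a specific version resolves to the caller's task-local temporary dataset when one exists, and to the production version otherwise. All class and variable names below are hypothetical:

```python
class VersionedStore:
    """Sketch of the claim-1 versioning rule: reads are resolved per
    task, so development code never has to name a specific version."""
    def __init__(self):
        self.production = {}   # dataset name -> production version
        self.temporary = {}    # (task, dataset name) -> task-local version

    def create_temp(self, task, dataset):
        # The temporary version starts as a copy of production and is
        # associated with the development task.
        self.temporary[(task, dataset)] = dict(self.production[dataset])

    def read(self, dataset, task=None):
        # Code running under a task with a temporary version sees (and
        # may modify) that version; all other readers see production.
        if task is not None and (task, dataset) in self.temporary:
            return self.temporary[(task, dataset)]
        return self.production[dataset]

store = VersionedStore()
store.production["signals"] = {"v": 1}
store.create_temp("T1", "signals")
store.read("signals", task="T1")["v"] = 2   # modifies only T1's temporary copy
```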
2. The method of claim 1, further comprising:
- placing locks on the first code file in the first view and the second code file in the second view.
3. The method of claim 2, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
4. The method of claim 1, further comprising:
- receiving, from the developer, a request to commit changes made to the first and second code files;
- checking the first and second code files into a source control system;
- changing the temporary dataset to be a latest dataset; and
- terminating the task.
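The commit sequence of claim 4 can be sketched as a single operation; the source control interface and dictionary shapes below are hypothetical stand-ins:

```python
def commit_task(scm, temporary, latest, task, files, dataset):
    """Claim-4 sketch: check the changed code files into source
    control, promote the task's temporary dataset to be the latest
    dataset, and terminate the task by dropping its association."""
    for path in files:
        scm.check_in(path)
    latest[dataset] = temporary.pop((task, dataset))

class RecordingSCM:
    """Stand-in source control system that records check-ins."""
    def __init__(self):
        self.checked_in = []
    def check_in(self, path):
        self.checked_in.append(path)

scm = RecordingSCM()
temporary = {("T1", "signals"): {"v": 2}}
latest = {}
commit_task(scm, temporary, latest, "T1", ["model.py", "etl.py"], "signals")
```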
5. A system for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, the system comprising:
- a computer system having stored thereon and executing computer processes comprising:
- maintaining a first view having a production version of a dataset;
- creating a task for a developer, the task being associated with a second view;
- associating a first code file with the task for the first view, the first code file including code that modifies the dataset;
- creating a temporary version of the dataset in the first view;
- associating the temporary version of the dataset with the task;
- associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and
- upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
6. The system of claim 5, wherein the computer processes further comprise:
- placing locks on the first code file in the first view and the second code file in the second view.
7. The system of claim 6, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
8. The system of claim 5, wherein the computer processes further comprise:
- receiving, from the developer, a request to commit changes made to the first and second code files;
- checking the first and second code files into a source control system;
- changing the temporary dataset to be a latest dataset; and
- terminating the task.
9. A computer program product comprising a tangible, non-transitory computer readable medium having stored thereon a computer program for code and data versioning for managing shared datasets in a collaborative data processing system including data files and code files, the data files including production data files, the code files including production code files that operate on data files, which, when run on a computer system, causes the computer system to execute computer processes comprising:
- maintaining a first view having a production version of a dataset;
- creating a task for a developer, the task being associated with a second view;
- associating a first code file with the task for the first view, the first code file including code that modifies the dataset;
- creating a temporary version of the dataset in the first view;
- associating the temporary version of the dataset with the task;
- associating a second code file with the task for the second view, the second code file including an instruction to read the dataset from the first view without identifying a specific version of the dataset from the first view; and
- upon execution of the second code file in the second view, automatically reading from the temporary dataset associated with the task based on the association of the temporary dataset with the task such that the second code file in the second view modifies the temporary dataset.
10. The computer program product of claim 9, wherein the computer processes further comprise:
- placing locks on the first code file in the first view and the second code file in the second view.
11. The computer program product of claim 10, wherein placing the locks on the first and second code files comprises checking out the first and second code files from a source control system.
12. The computer program product of claim 9, wherein the computer processes further comprise:
- receiving, from the developer, a request to commit changes made to the first and second code files;
- checking the first and second code files into a source control system;
- changing the temporary dataset to be a latest dataset; and
- terminating the task.
Type: Application
Filed: Nov 9, 2021
Publication Date: Mar 3, 2022
Inventors: Thejaswi Raje Gowda (Newton, MA), Amir Bar-Or (Newton, MA), Yan Ge (Waltham, MA)
Application Number: 17/522,528