SYSTEM AND METHOD FOR DISTRIBUTED, CO-LOCATED, AND SELF-ORGANIZING DATA STORAGE AND CLUSTER COMPUTATION FRAMEWORK FOR BATCH ALGORITHMS ON BIG DATASETS
A system and method for providing a distributed, co-located, self-organizing data storage and cluster computation framework for batch algorithms on big datasets. The system includes various subscriber devices reporting session information relevant to telecommunications companies to computing devices capable of combining records into trie organizational data structures for storage and optimization. Querying via user request and batch algorithms performed via parallel computing processes and index-on-index data keys enable users and algorithms to quickly determine subscribers of interest, based on numerous possible data points contained in subscriber records. The method performs data receipt, re-organization, trie formation, trie ingestion at a server, and trie structural optimization, enabling user and algorithmic querying of the data stored on the server.
To the full extent permitted by law, the present United States Non-Provisional Patent Application hereby claims priority to and the full benefit of, U.S. Provisional Application No. 63/116,425, filed Nov. 20, 2020, entitled “ROBUST, EFFICIENT, ADAPTIVE STREAMING REPLICATION APPLICATION PROTOCOL WITH DANCING RECOVERY FOR HIGH VOLUME DISTRIBUTED SUBSCRIBER DATASETS”, which is incorporated herein by reference in its entirety.
FIELD OF THE DISCLOSURE
The present disclosure is directed to computer-to-computer data streaming. More specifically, the disclosure is directed to the provision and maintenance of a distributed computing network in receipt of large amounts of streaming data, and the efficient storage thereof.
The present disclosure is not limited to any specific file management system, subscriber or customer type, database structure, physical computing infrastructure, enterprise resource planning (ERP) system/software, or computer code language.
BACKGROUND
Many large businesses have a large volume of customers and/or subscribers. To accommodate the large volume of data associated with transactions related to their customers or subscribers, they may use one or more data stores sufficient to store a large volume of data concerning their customers or subscribers, their customers' or subscribers' activity or purchases, and other relevant data about their customers or subscribers. Day-to-day interactions and transactions may be recorded or collected, stored, processed, managed, or used to generate insights about the customers or subscribers. These data stores may often be repositories of information and data upon which business and marketing operations may base their actions. For instance, an accounts receivable department may access a list of subscribers, each subscriber's invoice date, each subscriber's subscription rate, and each subscriber's method of payment on file in order to manually or automatically invoice all subscribers during a typical billing cycle. In another instance, a marketing department may access a list of subscribers and the length each subscriber has been a customer of the business in order to reward certain customers for their length of patronage. In yet another instance, if a customer acquisition and retention department wants to determine whether the department's customer acquisition and retention initiatives have been effective, it may query the data store(s) for the number of new customers over a period of time, the number of cancelled accounts over a period of time, and possibly the number of overall customers to calculate the overall churn rate of their subscriber base over a period of time. In yet another instance, data, and the underlying insights it can provide, may benefit from being available quickly upon the collection of the data. For instance, a cellular network provider may desire to offer promotions to local participating businesses upon a customer's or subscriber's arrival at a number of destinations. Making such an offer instantly upon arrival, or shortly thereafter, may better encourage a subscriber to be influenced by a promotion or advertisement.
In general, such data may be stored and even analyzed using an Enterprise Resource Planning (ERP) system or platform. Over the years, ERP systems and platforms have evolved to either include or interface with various business platforms such as Customer Relationship Managers (CRMs), subscriber usage monitors, accounting software, distribution platforms, and business intelligence solutions. The data store and corresponding ERP system or platform may function as a transactional system, as online transaction processing databases, as an operational database management system, as a distributed database system offering similar functionality, and/or a combination of the like, whereby the transaction itself may be performed utilizing the ERP system or platform and the resulting data need not be stored on, recorded on, or otherwise copied to or from a separate centralized data store. The data store and corresponding ERP system or platform may often, but not always, be stored in a relational database or table on a server connected to a network.
Since these databases usually store highly valuable and even business-critical data, it is important that the data is also saved redundantly somewhere else (i.e., backed up) and readily accessible to numerous business units within a company. IT best practices usually indicate that data be housed in at least two locations, geographically separated, and be backed up routinely, on a schedule (e.g., daily, weekly, etc.). It should be noted that, generally, backup procedures can require the active machine and/or backup machine to go “offline” during a backup process, making it inaccessible during such period of time. This generally means backups of active machines must be scheduled for a period of inactivity (e.g., evenings or weekends) or other provisions must be made to sustain access to the active machine or its equivalent, e.g., redundant master database(s). When dealing with data that is customer and/or subscriber created, rather than employee or agent created, any downtime of a machine may mean decreases in quality of service, inaccessibility, or lost customer/subscriber data. This usually means these systems feature multiple active machines that may shift resources during downtime of any single machine. However, “catching up” a machine after a period of downtime may present other challenges, especially in cases where machines are geographically separated or mediated by an unreliable network and the machines thereon.
A continuous challenge to information technology professionals designing and implementing systems and methods of provisioning such utility to businesses has been the establishment of a system to receive such streaming data at a data store and efficiently organize it for later use. Since the volume of streaming data for large companies having large user and/or subscriber bases may range into the billions of records per hour, additional challenges are presented when distributing streaming data across a distributed network (i.e., a network of data stores having applications which enable the seamless access and control of the entire network as if it were a single data store) as well as co-located systems integrated within a distributed network. The processing power required to receive, store, organize, retrieve, transmit, the like and/or combinations thereof, becomes excessively large at high data volumes, thereby requiring such distributed networks in order to balance the power load requirements across the network. Additionally relevant to the management of storage for large-volume data streaming is that these files may need to be stored for long periods, meaning that over time, one method of organization may suddenly no longer be efficient for receiving and storing incoming streams. For instance, a music streaming service that receives information relevant to its subscribers' listening activities, including subscriber ID, time, duration, content, and geographic location, may decide to organize its subscriber data using a hierarchy. This hierarchy may be sorted by geolocation, then time, then duration, then content, etc., down to subscriber ID. This hierarchy may be useful later when gathering information relevant to where certain content is consumed, as well as other business insights into user/subscriber activity. Such a hierarchy may also be chosen due to other decisions, such as transmitting, receiving, and storing only low-resolution geolocation (e.g., region-based rather than a more precise geolocation such as latitude and longitude). During periods of significant travel, such as Thanksgiving in the United States or Chinese New Year in China, such a storage regime may suddenly skew data relative to normal user/subscriber activity. Reorganizing such data structures in response to such migrations may be highly cumbersome and even impossible without significant labor, system resources, and downtime. Other challenges exist with other hierarchies, as will be understood by those having ordinary skill in the art. Failing to include a hierarchy in the storage and organization of streaming large volumes of data may consume lower amounts of resources upon intake, but may fail to prove useful due to the system resources required to generate useful observations based on the data stored. While many other regimes may exist to address these concerns, no known method does so with the use of the self-organizing systems and methods for initial storage, efficiency assessment(s), and re-organization thereof as described herein.
Therefore, a need exists for a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets, bearing some or all of the features identified as present in the general arts of mathematical, computer, and data science, though not actually implemented in such an arrangement to provide highly organized and highly accessible user data on said distributed and/or co-located databases of user-relevant information/data. The instant disclosure may be designed to address at least certain aspects of the problems or needs discussed above by providing such a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets.
SUMMARY
The present disclosure may solve the aforementioned limitations of the currently available systems and methods of storing large data streams on intake, assessing their organizational storage, and re-organizing the data on demand by providing a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets. By building an efficient, self-optimizing layout for storing and processing multi-dimensional structured datasets centered around an entity (such as a subscriber or user), such a system may be able to handle the streaming workloads necessary for large organizational data. These workloads may exceed one billion records per hour, and the storage thereof may persist for long tenures, which may exceed several months. Additionally, such a system may be capable of co-locating data and organizing it in an efficient manner through use of trie (i.e., digital tree or prefix tree) encoding, thus eliminating redundant data and/or duplicate data, which may provide a framework for running parallel batch algorithms alongside pointed data access streaming using minimal inputs/outputs per second (IOPS). Each component and embodiment of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may be recognized as a significant improvement over currently known methods of receiving, storing, and organizing large-volume data streams, as may be understood by those having ordinary skill in the art.
Some aspects of the present system and method may relate or include the formation of tuples, either upon transmission, receipt, or ingestion and storage of user-relevant information. Tuples are a finite ordered list or sequence. In mathematics, tuples are normally written as a list of numbers between parentheses. A simple tuple example could include the side lengths of a quadrilateral such as (2, 4, 2, 4), which may equate to a 4×2 rectangle. Obviously, tuples can represent any sequence of data having “n” discrete values, where “n” is any non-negative integer. A 0-tuple exists in one form and is commonly referred to as an “empty tuple.” While it is known that tuples are widely used in computer and data science, ingesting user-relevant data as tuples is not known to be a common practice to those having ordinary skill in the art. Many programming languages offer an alternative to tuples, known as record types, which may instead feature unordered elements accessed by label. A few programming languages combine ordered tuple product types and unordered record types into a single construct, as in C structs and Haskell records. Relational databases may formally identify their rows (records) as tuples, or at least be recognized as such by those having ordinary skill in the art. Tuples of ordered pairs, nested sets, and functions also exist and may also be relevant to the utility of tuples to computer and data science. Since computer languages usually comprise in their most basic sense a binary code, segmented binary codes may contain information or data when read by machines configured to read binary using applications which impart meaning into binary language. This means that, using an example of user-relevant data, n types of data can be individual components of an n-tuple. A user's geography, subscriber ID, session duration, and activity may be plotted as (G, S, D, A), respectively, and an example quadruple (4-tuple) could be (latitude/longitude, subscriber ID, time in minutes, web/social/video), and/or their binary equivalents.
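By way of a non-limiting illustration, the following sketch shows how such a 4-tuple might be formed in practice; the field names, ordering, and values below are assumptions chosen purely for demonstration and are not mandated by the disclosure:

```python
# Illustrative sketch only: the (G, S, D, A) layout and encodings are
# assumed for demonstration, not required by the disclosure.
record = ((40.7128, -74.0060),  # G: geography as latitude/longitude
          "subscriber-001",     # S: subscriber ID
          27,                   # D: session duration in minutes
          "video")              # A: activity class (web/social/video)

# Tuples are ordered and fixed-length, so each position carries meaning.
geography, subscriber_id, duration, activity = record
print(subscriber_id, activity)  # -> subscriber-001 video
```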
Related to the utility of tuples, and necessary for a full understanding of the disclosed system and method, may be tries. In computer science, a trie, or a digital tree or prefix tree, is a type of search tree, which is a tree-like data structure used for locating specific keys from within a set. These keys are often represented as strings distributed along the links between nodes. Tries are best understood using characters in a written language like English. For instance, a top node may be the letters “Pa”, and strings may be “t” and “rrot.” This way, instead of storing “Pat” and “Parrot” in a database to access later, they may be stored as a trie with “Pa” being the top node and “t” and “rrot” being strings, which when read would form the full words. Importantly, this both reduces the space required to store such data and preserves, and even may be useful in identifying, relationships between those words. Abstracting this concept to user-relevant data can not only reduce the space and/or bandwidth required to store, receive, or transmit user-relevant data upon transmission, receipt, or ingestion/storage, but may also be useful in categorizing users based on the factors identified in the most efficient trie structure. And while one trie arrangement may prove useful, most organized, or most efficient for one set of data, the same trie arrangement may fall short should different data be present or should the data change over time. Those skilled in the art may recognize features of these mathematical, data, and computer science schema, including tuple formation, trie formation, and trie optimization, as potential candidates for user-relevant database formation, in order to efficiently manage system and network resources while simultaneously offering useful information based on said trie formation and trie optimization, as well as ease of query/access to that useful information.
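A minimal sketch of such a character-keyed trie follows; the disclosed system may apply the same prefix-sharing principle to arbitrary record fields rather than letters, so the class and functions below are illustrative assumptions only:

```python
# Minimal character-keyed trie sketch; the disclosed tries may instead
# key on record fields, but the prefix-sharing idea is identical.
class TrieNode:
    def __init__(self):
        self.children = {}     # edge label -> child node
        self.terminal = False  # True if a complete key ends here

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.terminal

root = TrieNode()
insert(root, "pat")
insert(root, "parrot")  # shares the "pa" prefix with "pat"
print(contains(root, "parrot"), contains(root, "par"))  # True False
```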
In a potentially preferred exemplary embodiment, several improvements to an existing live streaming storage and organization regime may be required to receive the full benefit of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets. Briefly, these may include simple two-level hash-based data ingestion, high speed data ingestion, co-location of data inside tracks, self-optimizing layout systems and methods, and parallel batch processing algorithms. Alone or in combination, each of these improvements may contribute to improved archival performance, which will be better understood by those skilled in the art after a full review of this Summary along with the Drawings and Detailed Description below.
In one aspect, the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may include simple two-level hash-based data ingestion. Such simple two-level hash-based data ingestion may unambiguously map incoming records to a sector and a track within that sector. It may then perform heuristics-based clustering of records accrued within a track and organize the data therein as a “forest-of-tries”, which may span multiple entities, such as users and/or subscribers. This may result in a semi-optimal “data bag” that is smaller in size when compared to storage/organization methods such as storing incoming records laid next to each other. Simple two-level hash-based data ingestion will be understood in greater detail by those having skill in the art after a more thorough review of the Drawings and Detailed Description.
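By way of a non-limiting sketch, such a two-level mapping might be realized as follows; the 256×256 sector/track geometry, the hash function, and the entity key are all assumptions chosen for illustration:

```python
import hashlib
from collections import defaultdict

SECTORS, TRACKS = 256, 256  # assumed geometry, not mandated

def locate(entity_id):
    """Unambiguously map a record's entity to a (sector, track) pair."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    sector = int.from_bytes(digest[0:2], "big") % SECTORS  # level 1
    track = int.from_bytes(digest[2:4], "big") % TRACKS    # level 2
    return sector, track

# Records hashing to the same track accrue together, where they may later
# be clustered heuristically into a forest-of-tries spanning entities.
staging = defaultdict(list)
for rec in [("alice", "video", 27), ("bob", "web", 4), ("alice", "web", 9)]:
    staging[locate(rec[0])].append(rec)
```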
In another aspect, the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may include high-speed data ingestion. By using parallel sequential writes of each staged forest-of-tries into pre-designated file locations derived from sector and track addresses, such high-speed data ingestion may be accomplished with limited organizational tradeoffs. In a related aspect, a method of co-locating data inside each track may further increase the efficiency of the disclosed system. In this related aspect, by merging forests-of-tries containing data belonging to multiple different entities/users/subscribers with a larger forest-of-tries organized around individual entities/users/subscribers, and then by re-assessing the clustering scheme and attempting to organize the data using more dense tries, storage efficiency may be further improved without significant processing power tradeoffs and/or increases. Each of the disclosed high-speed data ingestion and co-location regimes, including the mechanisms, applications, systems, and machines involved, will be understood in greater detail by those having ordinary skill in the art after a more thorough review of the Drawings and Detailed Description.
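The following sketch suggests what such parallel sequential writes into pre-designated, per-track files might look like; the directory scheme, serialization format, and thread pool are illustrative assumptions rather than requirements of the disclosure:

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor

BASE = "/var/data/tries"  # hypothetical storage root

def track_path(sector, track):
    """Pre-designated file location derived from sector/track addresses."""
    return os.path.join(BASE, f"sector-{sector:03d}", f"track-{track:03d}.bag")

def flush_track(sector, track, forest):
    """Append one staged forest-of-tries with a single sequential write."""
    path = track_path(sector, track)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:  # sequential append, no random seeks
        f.write(pickle.dumps(forest))

# Because each track owns an independent file, writes may proceed in
# parallel without contending for the same on-disk location.
staging = {(17, 203): [("alice", "video", 27)], (54, 9): [("bob", "web", 4)]}
with ThreadPoolExecutor() as pool:
    for (sector, track), forest in staging.items():
        pool.submit(flush_track, sector, track, forest)
```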
In yet another aspect, the system and method of the disclosure may include a self-optimizing layout. Such a layout may require gradual and continuous exploration for higher-density trie encoding at the entity/user/subscriber level. The coined term “shake-and-settle” may be understood, after a thorough review of the Drawings and Detailed Description herein, to refer to an algorithm having the capability of gradually “settling in” entire datasets through continuously “shaking” constituent entities. While purely metaphorical, the shaking of a jar of lake water or other water containing sediment, dirt, clay, sand, etc., which may cause layers of each to form through settling, may be helpful to the understanding of such shake-and-settle methods of data organization, which will be further understood by those having ordinary skill in the art after a more thorough review of the Drawings and Detailed Description.
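A simplified sketch conveying the spirit of such a shake-and-settle pass follows; the permutation search, the encoded-size density metric, and the budget parameter are all assumptions made for illustration:

```python
import itertools
import pickle
import random

def encoded_size(records, field_order):
    """Assumed density metric: bytes needed by a prefix-shared (trie)
    encoding of the records when keyed in the given field order."""
    trie = {}
    for rec in records:
        node = trie
        for i in field_order:
            node = node.setdefault(rec[i], {})
    return len(pickle.dumps(trie))

def shake_and_settle(records, current_order, budget=8):
    """'Shake' an entity's records through alternative layouts within a
    resource budget; 'settle' on any layout that encodes more densely."""
    best_order = current_order
    best_size = encoded_size(records, current_order)
    candidates = list(itertools.permutations(range(len(records[0]))))
    for order in random.sample(candidates, min(budget, len(candidates))):
        size = encoded_size(records, order)
        if size < best_size:  # a denser layout was discovered
            best_order, best_size = order, size
    return best_order

records = [("US", "video", 27), ("US", "video", 9), ("US", "web", 4)]
print(shake_and_settle(records, current_order=(2, 1, 0)))
```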
In a potentially final aspect of the potentially preferred embodiment, the disclosed system and method may further include parallel batch processing algorithm(s). Such parallel batch processing algorithm(s) may possess on-demand pointed access to any related or unrelated entity of interest by relying on a two-level index yielding data in very few IOPS (~2 IOPS: 1 for index, 1 for data). Such an aspect of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may assist in running a class of link-analysis workloads at much faster speeds and with greater efficiency, while enabling analysis of multiple entities/subscribers/users in parallel at an incremental pro-rata cost with little-to-no upfront cost to the machines of the system(s) of the disclosure. This aspect/feature will also be understood by those having ordinary skill in the art when viewing the Drawings in light of the Detailed Description.
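A non-limiting sketch of such two-level, pointed access follows; the fixed-width (offset, length) index entry and the file layout are illustrative assumptions:

```python
import struct

ENTRY = struct.Struct(">QQ")  # assumed fixed-width entry: (offset, length)

def read_entity(index_path, data_path, slot):
    """Pointed access in ~2 IOPS: one read of the index entry mapping an
    entity slot to its data extent, then one read of the extent itself."""
    with open(index_path, "rb") as idx:
        idx.seek(slot * ENTRY.size)  # IOP 1: the index
        offset, length = ENTRY.unpack(idx.read(ENTRY.size))
    with open(data_path, "rb") as dat:
        dat.seek(offset)             # IOP 2: the data
        return dat.read(length)

# Tiny demonstration with one indexed entity in throwaway files.
with open("data.bin", "wb") as d:
    d.write(b"alice-trie-bytes")
with open("index.bin", "wb") as i:
    i.write(ENTRY.pack(0, 16))
print(read_entity("index.bin", "data.bin", slot=0))  # b'alice-trie-bytes'
```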
As a whole, the disclosed system and method may be thought of as a regime, technique, or system for the efficient ingestion, initial organization, storage, and reorganization of data streams in order to provision a high-efficiency query and/or analysis data structure for ease of use for data relevant to an organization. While the disclosed systems and methods may be most applicable to organizations of sufficient size and userbase to require such efficient ingestion, initial organization, storage, reorganization, etc., such implementations of the disclosed systems and methods may be of use to smaller streaming datasets as well.
Benefits of the disclosed system and method, which may be recognized by those having ordinary skill in the art, may include reduction in net data volume during ingestion, streamlined ingestion of data, gradual optimization, and on-demand access to data for analysis. Reduction in net data volume (as measured in bytes) for ingestion may be accomplished through exploiting data repetition in records belonging to unrelated entities. This may be accomplished on-the-fly by encoding such ingested data into forests-of-tries. Insertion of data may be streamlined by leveraging efficient sequential I/O and by achieving high throughput through issuing multiple writes in parallel on independent files, which may thus enable the full capability of the underlying I/O system(s)/subsystem(s). By gradually optimizing already co-located data at the level of individual entities/users/subscribers through exploration of higher-density trie layouts in existing data, then reorganizing said data upon discovery of a better layout, the disclosed system further benefits users accessing the data for analysis. This aspect possesses the further benefit of being a non-disruptive and continuous operation that may be capable of running in parallel on independent entities/users/subscribers at rest without increasing load on other processes and/or equipment. Further increasing this utility, each iteration of each reorganization may be run according to a budget of prior-marked resources for discovery of denser layouts—the greater the resource budget, the higher the chances of discovering a better trie layout. Those skilled in the art may understand the obvious benefit of such a reorganization technique, as known low-streaming periods may be scheduled for additional reorganization resources and/or observed low-streaming periods may trigger the allocation of additional reorganization resources. Finally, another of the numerous features and benefits of the disclosed systems and methods may be the enablement of on-demand access to data belonging to related entities/users/subscribers of interest by incurring a fixed, predictable I/O cost from batch processing algorithms working entity-after-entity. These benefits, as well as other benefits which may be present but not fully described herein, may be better understood by those having ordinary skill in the art after a thorough review of the included Drawings and Detailed Description.
The foregoing illustrative summary, as well as other exemplary objectives and/or advantages of the disclosure, and the manner in which the same are accomplished, are further explained within the following Detailed Description and its accompanying Drawings.
The present disclosure will be better understood by reading the Detailed Description with reference to the accompanying drawings, which are not necessarily drawn to scale, and in which like reference numerals denote similar structure and refer to like elements throughout, and in which:
It is to be noted that the drawings presented are intended solely for the purpose of illustration and that they are, therefore, neither desired nor intended to limit the disclosure to any or all of the exact details of construction shown, except insofar as they may be deemed essential to the claimed disclosure.
DETAILED DESCRIPTION
Referring now to
The present disclosure solves the aforementioned limitations of the currently available devices and methods of ingestion, initial storage, optimization, and retrieval by providing a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets.
In describing the exemplary embodiments of the present disclosure, as illustrated in
As will be appreciated by one of skill in the art, the present disclosure may be embodied as a method, data processing system, software as a service (SaaS) or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product on a computer-readable storage medium having computer-readable program code means embodied in the medium. Any suitable computer readable medium may be utilized, including hard disks, ROM, RAM, CD-ROMs, electrical, optical, magnetic storage devices and the like.
The present disclosure is described below with reference to flowchart illustrations of methods, apparatus (systems) and computer program products according to embodiments of the present disclosure. It will be understood that each block or step of the flowchart illustrations, and combinations of blocks or steps in the flowchart illustrations, can be implemented by computer program instructions or operations. These computer program instructions or operations may be loaded onto a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions or operations, which execute on the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks/step or steps.
These computer program instructions or operations may also be stored in a computer-usable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions or operations stored in the computer-usable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks/step or steps. The computer program instructions or operations may also be loaded onto a computer or other programmable data processing apparatus (processor) to cause a series of operational steps to be performed on the computer or other programmable apparatus (processor) to produce a computer implemented process such that the instructions or operations which execute on the computer or other programmable apparatus (processor) provide steps for implementing the functions specified in the flowchart block or blocks/step or steps.
Accordingly, blocks or steps of the flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions, and program instruction means for performing the specified functions. It should also be understood that each block or step of the flowchart illustrations, and combinations of blocks or steps in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems, which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions or operations.
Computer programming for implementing the present disclosure may be written in various programming languages, database languages, and the like. However, it is understood that other source or object-oriented programming languages, and other conventional programming languages, may be utilized without departing from the spirit and intent of the present disclosure.
Referring now to
Processor 102 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in
Whether configured by hardware, firmware/software methods, or by a combination thereof, processor 102 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when processor 102 is embodied as an ASIC, FPGA or the like, processor 102 may comprise specifically configured hardware for conducting one or more operations described herein. As another example, when processor 102 is embodied as an executor of instructions, such as may be stored in memory 104, 106, the instructions may specifically configure processor 102 to perform one or more algorithms and operations described herein.
The plurality of memory components 104, 106 may be embodied on a single computing device 10 or distributed across a plurality of computing devices. In various embodiments, memory may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 104, 106 may be configured to store information, data, applications, instructions, or the like for enabling the computing device 10 to carry out various functions in accordance with example embodiments discussed herein. For example, in at least some embodiments, memory 104, 106 is configured to buffer input data for processing by processor 102. Additionally, or alternatively, in at least some embodiments, memory 104, 106 may be configured to store program instructions for execution by processor 102. Memory 104, 106 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by the computing device 10 during the course of performing its functionalities.
Many other devices or subsystems or other I/O devices 212 may be connected in a similar manner, including but not limited to devices such as a microphone, speakers, a flash drive, a CD-ROM player, a DVD player, a printer, a main storage device 214, such as a hard drive, and/or a modem, each connected via an I/O adapter. Also, although preferred, it is not necessary for all of the devices shown in
In some embodiments, some or all of the functionality or steps may be performed by processor 102. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 102. For example, non-transitory computer readable storage media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control processors of the components of system 201 to implement various operations, including the examples shown above. As such, a series of computer-readable program code portions may be embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce the machine-implemented processes discussed herein.
Any such computer program instructions and/or other type of code may be loaded onto a computer, processor, or other programmable apparatus' circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code may be the means for implementing various functions, including those described herein.
Referring now to
Similar to user system 220, server system 260 preferably includes a computer-readable medium, such as random-access memory, coupled to a processor. The processor executes program instructions stored in memory. Server system 260 may also include a number of additional external or internal devices, such as, without limitation, a mouse, a CD-ROM, a keyboard, a display, a storage device and other attributes similar to computer system 10 of
System 201 is capable of delivering and exchanging data between user system 220 and a server system 260 through communications link 240 and/or network 250. Through user system 220, users can preferably communicate over network 250 with each other user system 220, 222, 224, and with other systems and devices, such as server system 260, to electronically transmit, store, manipulate, and/or otherwise use data exchanged between the user system and the server system. Communications link 240 typically includes network 250 making a direct or indirect communication between the user system 220 and the server system 260, irrespective of physical separation. Examples of a network 250 include the Internet, cloud, analog or digital wired and wireless networks, radio, television, cable, satellite, and/or any other delivery mechanism for carrying and/or transmitting data or other information, such as to electronically transmit, store, manipulate, and/or otherwise modify data exchanged between the user system and the server system. The communications link 240 may include, for example, a wired, wireless, cable, optical or satellite communication system or another pathway. It is contemplated herein that RAM 104, main storage device 214, and database 270 may be referred to herein as storage device(s) or memory device(s).
Referring now specifically to
Referring now specifically to
Turning now to
Turning now to
Having processed second exemplary type-1 trie 331b and third exemplary type-1 trie 331c into exemplary Alice subscriber trie 341a, exemplary Bob subscriber trie 341b, other subscriber tries for other subscribers/users (represented as ellipses . . . ), and exemplary Zoe subscriber trie 341z, we turn now to
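A simplified sketch of this regrouping step follows, reusing the illustrative subscriber names above; the record fields and merge logic are assumptions for demonstration only:

```python
from collections import defaultdict

# A track-level forest-of-tries spans many subscribers; co-location
# regroups the same records around each individual subscriber.
track_records = [
    ("alice", "video", 27), ("bob", "web", 4),
    ("alice", "web", 9), ("zoe", "social", 12),
]

def merge_into_subscriber_tries(records):
    tries = defaultdict(dict)  # subscriber -> per-subscriber trie
    for subscriber, activity, minutes in records:
        node = tries[subscriber].setdefault(activity, {})
        node.setdefault(minutes, {})  # leaf per observed duration
    return tries

subscriber_tries = merge_into_subscriber_tries(track_records)
print(sorted(subscriber_tries))  # ['alice', 'bob', 'zoe']
```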
Having stored exemplary Alice subscriber trie 341a, exemplary Bob subscriber trie 341b, other subscriber tries for other subscribers/users, and exemplary Zoe subscriber trie 341z on database 270, we turn now to
Turning now to
Turning back to
Turning back to
Having described potential systems for and methods of distributed, co-located, and self-organizing data storage and cluster computation framework, now the description turns to performing queries and/or batch algorithms on big datasets thereof. In summary thus far, 265 files may contain 65536 sectors on-disk in a trie-layout. Thereon the on-disk layout may further be a first level index as illustrated therein
Turning to
Turning now to
Having described various features of, steps in, systems of, machines of, embodiments (preferred and otherwise) of, and methods to provide a distributed, co-located, self-organizing data storage schema and/or design which may offer a suitable cluster computational framework for batch algorithms on big datasets, it may be necessary to share with those skilled in the art various benefits of so organizing and processing big datasets in an established large-subscriber-volume data environment having incoming subscriber-data streams. In existing installations of the disclosed system and method, the established framework may be extensively used for sparse data-backed-profile-computation and distinct destination mapping for subscribers of telecommunications companies having millions (or tens/hundreds of millions) of subscribers. On a single, low-end machine having a Dual-Core Intel® i3 Processor with 2 disks and 12 GB of RAM, applicant has achieved co-location of 31 billion records, of which approximately 30 billion may persist as co-located type-1 and type-2 subscriber-based tries and 1 billion incoming records queued in RAM. This has been performed in approximately 2 hours and stored on a disk consuming only 360 GB of disk space. Throughput was calculated at over 135,000 records per second per node. This may scale horizontally to achieve any desired throughput. Based on projections and laboratory testing, 1.1 trillion XDRs over 100 million subscribers at a daily load of 3 billion records, amounting to 25 TB of stored data, may be indexed using only 500 MB of RAM and 0.5 TB of disk space using the indexing methods herein disclosed. Much of this efficiency and speed may be attributable to the dense trie organizational structure, which may allow increases in efficiency for parallel computing processes using batch algorithms to obtain relevant data with respect to a query or algorithm.
With respect to the above description, then, it is to be realized that the optimum methods, systems, and their relationships, including variations in systems, machines, size, materials, shape, form, position, function and manner of operation, assembly, order of operation, type of computing devices (mobile, server, desktop, etc.), type of network (LAN, WAN, internet, etc.), size and type of database, data-type stored therein, and use, are intended to be encompassed by the present disclosure.
It is contemplated herein that the system and method may be used to automate the creation and maintenance of a system and method for robust, efficient, adaptive streaming replication application protocol with dancing recovery for high-volume distributed live subscriber datasets, as well as automating a variety of other tasks. Furthermore, the systems and methods disclosed herein may be used to organize, retrieve, and otherwise manage a variety of data types, including but not limited to subscriber data, customer data, user data, employee data, sales data, lead data, logistic event data, order data, inventory information and data, the like and combinations thereof. It is further contemplated that a variety of computing devices may be deployed on the system and method by a variety of service providers, managers, users, owners, agents, companies and other enterprises. These may include but are not limited to personal computers, laptops, desktops, mobile devices, smart phones, tablets, servers, virtualized machines, the like and/or combinations thereof. It is further contemplated that specialized equipment may be deployed to further improve the disclosed system and method. This description includes all manners and forms of systems and methods for robust, efficient, adaptive streaming replication, in all manner of devices and software platforms capable of such adaptive streaming and replication of live databases in order to automate, streamline, decrease runtime, increase efficiency of, and/or make possible the dancing recovery protocol as herein described. While the implementation of the disclosed system and method may be most applicable and/or relevant to ERP databases, many platforms may benefit from the disclosed system and method, including but not limited to Customer Relationship Managers (CRMs), Information Technology (IT) support and implementation software, database management software, accounting software, and other business and consumer software products which may store data across multiple live databases, archival databases, the like and/or combinations thereof.
The illustrations described herein are intended to provide a general understanding of the structure of various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus, processors, and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the description. Thus, to the maximum extent allowed by law, the scope is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
In the specification and/or figures, typical embodiments of the disclosure have been disclosed. The present disclosure is not limited to such exemplary embodiments. The use of the term “and/or” includes any and all combinations of one or more of the associated listed items. The figures are schematic representations and so are not necessarily drawn to scale. Unless otherwise noted, specific terms have been used in a generic and descriptive sense and not for purposes of limitation.
The foregoing description and drawings comprise illustrative embodiments. Having thus described exemplary embodiments, it should be noted by those skilled in the art that the within disclosures are exemplary only, and that various other alternatives, adaptations, and modifications may be made within the scope of the present disclosure. Merely listing or numbering the steps of a method in a certain order does not constitute any limitation on the order of the steps of that method. Many modifications and other embodiments will come to mind to one skilled in the art to which this disclosure pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation. Accordingly, the present disclosure is not limited to the specific embodiments illustrated herein but is limited only by the following claims.
Claims
1. A method for providing a distributed, co-located, self-organizing data storage and a cluster computational framework for batch algorithms on big datasets, the method comprising:
- performing a two-level hash-based data ingestion, said two-level hash-based data ingestion comprising: unambiguously mapping a plurality of incoming records to a sector and a track within said sector of a memory; performing a heuristics-based clustering of said plurality of incoming records upon receipt and accumulating said plurality of incoming records within a track of said memory; and organizing said plurality of records as a forest-of-tries spanning a plurality of entities associated with said plurality of incoming records, resulting in a semi-optimal data arrangement that is smaller in size when compared to said plurality of incoming records laid next to each other;
- performing parallel sequential writes of each of said forest-of-tries into a plurality of pre-designated file locations using a high-speed data ingestion derived from an address of each of said sector and said track; and
- co-locating a data inside one of said track by merging said forests-of-tries containing said data belonging to different of said plurality of entities with a larger forest-of-tries organized around each of said plurality of entities.
2. The method of claim 1, further comprising a step of re-assessing a resultant clustering scheme and attempting to organize said data using a progressively more dense trie organizational structure.
3. The method of claim 2, further comprising a step of self-optimizing a layout, said layout becoming gradually and continuously denser via a successive trie organizational encoding.
4. The method of claim 3, wherein said successive trie organizational encoding is performed at a level of each of said plurality of entities.
5. The method of claim 4, wherein said successive trie organizational encoding does not affect a speed or a processing of any of said high-speed data ingestion and a plurality of read operations, thereby condensing gradually said plurality of data.
6. The method of claim 5, further comprising a step of parallel batch processing using an algorithm.
7. The method of claim 6, wherein said algorithm has an on-demand pointed access to all of a plurality of entities of interest.
8. The method of claim 7, wherein said plurality of entities of interest is a subset of said plurality of entities.
9. The method of claim 8, wherein said parallel batch processing via said algorithm relies on a two-level index yielding said plurality of data.
10. The method of claim 9, further comprising a step of running a class of link-analysis workloads by analyzing said plurality of entities of interest in parallel at an incremental pro-rata cost minimally above a nominal fixed upfront cost.
11. A system for providing a distributed, co-located, self-organizing data storage and a cluster computational framework for batch algorithms on big datasets, the system comprising:
- a client computing device having at least a first processor, a first memory, and a first connection to a network; and
- a server computing device having at least a second processor, a second memory, a first non-transitory computer readable medium, and a second connection to said network;
- wherein said client computing device is configured to: unambiguously map a plurality of incoming records to a sector and a track within said sector of said first memory; perform a heuristics-based clustering of said plurality of incoming records as they accrue within said track; and organize said plurality of incoming records as a forest-of-tries spanning a plurality of entities, each of said plurality of entities associated with said plurality of incoming records, thereby resulting in a semi-optimal data arrangement that is smaller in size than said plurality of incoming records laid next to each other.
12. The system of claim 11, wherein said client computing device is further configured to transmit said forest-of-tries via said first network connection to said server computing device via said second network connection.
13. The system of claim 12, wherein said server computing device, in receipt of said forest-of-tries is configured to perform parallel sequential writes of each of said forest-of-tries onto a plurality of pre-designated file locations using a high-speed data ingestion derived from an address of each of said sector and said track, said parallel sequential writes occur on said first non-transitory computer readable medium.
14. The system of claim 13, wherein said server computing device is further configured to co-locate a data inside one of said track by merging said forests of tries containing said data belonging to a second plurality of entities having a larger forest-of-tries organized around each of said second plurality of entities, thereby having generated said larger forest-of-tries comprising said plurality of entities and said second plurality of entities.
15. The system of claim 14, wherein said server computing device is further configured to perform a step of re-assessing a resultant clustering scheme and a step of attempting to organize said data using a progressively more dense trie organizational structure.
16. The system of claim 15, wherein said server computing device is further configured to self-optimize a layout, said layout is gradually and continuously denser via a successive trie organizational encoding.
17. The system of claim 16, wherein said server computing device is further configured to perform said successive trie organizational encoding at a level of each of a larger plurality of entities corresponding to said larger forest-of-tries and containing records related to said plurality of entities and said second plurality of entities.
18. The system of claim 17, wherein said successive trie organizational encoding as performed on said server computing device is further configured to operate without affecting a speed or a processing of any of said high-speed data ingestion and a plurality of read operations, thereby condensing gradually said plurality of data stored on said first non-transitory computer readable medium.
19. The system of claim 18, wherein said server computing device via said second processor is configured to batch process using an algorithm and a parallel computing process.
20. The system of claim 19, wherein said algorithm has an on-demand pointed access to all of a plurality of entities of interest, said plurality of entities of interest is a subset of said larger plurality of entities.
Type: Application
Filed: Nov 22, 2021
Publication Date: May 26, 2022
Inventors: Arun K. Krishna (Bangalore), Pramod K. Prabhakar (Bangalore)
Application Number: 17/532,040