SYSTEM AND METHOD FOR DISTRIBUTED, CO-LOCATED, AND SELF-ORGANIZING DATA STORAGE AND CLUSTER COMPUTATION FRAMEWORK FOR BATCH ALGORITHMS ON BIG DATASETS
A system and method for providing a distributed, co-located, self-organizing data storage and cluster computation framework for batch algorithms on big datasets. The system includes various subscriber devices reporting session information relevant to telecommunications companies to computing devices capable of combining records into trie organizational data structures for storage and optimization. Querying via user request and batch algorithms performed via parallel computing processes and index-on-index data keys enable users and algorithms to quickly determine subscribers of interest, based on numerous possible data points contained in subscriber records. The method performs data receipt, re-organization, trie formation, trie ingestion at a server, and trie structural optimization, enabling user and algorithmic querying of the data stored on the server.
To the full extent permitted by law, the present United States Non-Provisional Patent Application hereby claims priority to and the full benefit of, U.S. Provisional Application No. 63/116,425, filed Nov. 20, 2020, entitled “ROBUST, EFFICIENT, ADAPTIVE STREAMING REPLICATION APPLICATION PROTOCOL WITH DANCING RECOVERY FOR HIGH VOLUME DISTRIBUTED SUBSCRIBER DATASETS”, which is incorporated herein by reference in its entirety.
FIELD OF THE DISCLOSURE
The present disclosure is directed to computer-to-computer data streaming. More specifically, the disclosure is directed to the provision and maintenance of a distributed computing network in receipt of large amounts of streaming data, and the efficient storage thereof.
The present disclosure is not limited to any specific file management system, subscriber or customer type, database structure, physical computing infrastructure, enterprise resource planning (ERP) system/software, or computer code language.
BACKGROUND
Many large businesses have a large volume of customers and/or subscribers. To accommodate the large volume of data associated with transactions related to their customers or subscribers, they may use one or more data stores sufficient to store a large volume of data concerning their customers or subscribers, their customers' or subscribers' activity or purchases, and other relevant data about their customers or subscribers. Day-to-day interactions and transactions may be recorded or collected, stored, processed, managed, or used to generate insights about the customers or subscribers. These data stores may often be repositories of information and data upon which business and marketing operations may base their actions. For instance, an accounts receivable department may access a list of subscribers, each subscriber's invoice date, each subscriber's subscription rate, and each subscriber's method of payment on file in order to manually or automatically invoice all subscribers during a typical billing cycle. In another instance, a marketing department may access a list of subscribers and the length each subscriber has been a customer of the business in order to reward certain customers for their length of patronage. In yet another instance, if a customer acquisition and retention department wants to determine whether the department's customer acquisition and retention initiatives have been effective, it may query the data store(s) for the number of new customers over a period of time, the number of cancelled accounts over a period of time, and possibly the number of overall customers to calculate the overall churn rate of their subscriber base over a period of time. In yet another instance, data, and the underlying insights it can provide, may benefit from being available quickly upon the collection of the data. For instance, a cellular network provider may desire to offer promotions to local participating businesses upon a customer's or subscriber's arrival at a number of destinations. Making such an offer instantly upon arrival, or shortly thereafter, may better encourage a subscriber to be influenced by a promotion or advertisement.
In general, such data may be stored and even analyzed using an Enterprise Resource Planning (ERP) system or platform. Over the years, ERP systems and platforms have evolved to either include or interface with various business platforms such as Customer Relationship Managers (CRMs), subscriber usage monitors, accounting software, distribution platforms, and business intelligence solutions. The data store and corresponding ERP system or platform may function as a transactional system, as online transaction processing databases, as an operational database management system, as a distributed database system offering similar functionality, and/or a combination of the like, whereby the transaction itself may be performed utilizing the ERP system or platform and the resulting data need not be stored on, recorded on, or otherwise copied to or from a separate centralized data store. The data store and corresponding ERP system or platform may often, but not always, be stored in a relational database or table on a server connected to a network.
Since these databases usually store highly valuable and even business-critical data, it is important that the data is also saved redundantly somewhere else (i.e., backed up) and readily accessible to numerous business units within a company. IT best practices usually indicate that data be housed in at least two locations, geographically separated, and be backed up routinely, on a schedule (e.g., daily, weekly, etc.). It should be noted that, generally, backup procedures can require the active machine and/or backup machine to go “offline” during a backup process, making it inaccessible during such period of time. This generally means backups of active machines must be scheduled for a period of inactivity (e.g., evenings or weekends) or other provisions must be made to sustain access to the active machine or its equivalent, e.g., redundant master database(s). When dealing with data that is customer and/or subscriber created, rather than employee or agent created, any downtime of a machine may mean decreases in quality of service, inaccessibility, or lost customer/subscriber data. This usually means these systems feature multiple active machines that may shift resources during downtime of any single machine. However, “catching up” a machine after a period of downtime may present other challenges, especially in cases where machines are geographically separated or mediated by an unreliable network and the machines thereon.
A continuous challenge to information technology professionals designing and implementing systems and methods of provisioning such utility to businesses has been the establishment of a system to receive such streaming data at a data store and efficiently organize it for later use. Since the volume of streaming data for large companies having large user and/or subscriber bases may range into the billions of records per hour, additional challenges are presented when distributing streaming data across a distributed network (i.e., a network of data stores having applications which enable the seamless access and control of the entire network as if it were a single data store) as well as co-located systems integrated within a distributed network. The processing power required to receive, store, organize, retrieve, transmit, the like and/or combinations thereof, becomes excessively large at high data volumes, thereby requiring such distributed networks in order to balance the power load requirements across the network. Additionally relevant to the management of storage for large-volume data streaming is that these files may need to be stored for long periods, meaning that over time, one method of organization may suddenly no longer be efficient for receiving and storing incoming streams. For instance, a music streaming service that receives information relevant to its subscribers' listening activities, including subscriber ID, time, duration, content, and geographic location, may decide to organize its subscriber data using a hierarchy. This hierarchy may be sorted by geolocation, then time, then duration, then content, etc., down to subscriber ID. This hierarchy may be useful later when gathering information relevant to where certain content is consumed, as well as other business insights into user/subscriber activity. Such a hierarchy may also be chosen due to other decisions, such as transmitting, receiving, and storing only low-resolution geolocation (e.g., region-based rather than a more precise geolocation such as latitude and longitude). During periods of significant travel, such as Thanksgiving in the United States or Chinese New Year in China, such a storage regime may suddenly skew data relative to normal user/subscriber activity. Reorganizing such data structures in response to such migrations may be highly cumbersome and even impossible without significant labor, system resources, and downtime. Other challenges exist with other hierarchies, as will be understood by those having ordinary skill in the art. Failing to include a hierarchy in the storage and organization of streaming large volumes of data may consume lower amounts of resources upon intake, but may fail to prove useful due to the system resources required to generate useful observations based on the data stored. While many other regimes may exist to address these concerns, no known method does so with the use of the self-organizing systems and methods for initial storage, efficiency assessment(s), and re-organization thereof as described herein.
Therefore, a need exists for a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets, bearing some or all of the features identified as present in the general arts of mathematical, computer, and data science, though not actually implemented in such an arrangement to provide highly organized and highly accessible user data on said distributed and/or co-located databases of user-relevant information/data. The instant disclosure may be designed to address at least certain aspects of the problems or needs discussed above by providing such a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets.
SUMMARY
The present disclosure may solve the aforementioned limitations of the currently available systems and methods of storing large data streams on intake, assessing their organizational storage, and re-organizing the data on demand by providing a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets. By building an efficient, self-optimizing layout for storing and processing multi-dimensional structured datasets centered around an entity (such as a subscriber or user), such a system may be able to handle the streaming workloads necessary for large organizational data. These workloads may exceed one billion records per hour, and the storage thereof may persist for long tenures, which may exceed several months. Additionally, such a system may be capable of co-locating data and organizing it in an efficient manner through use of trie (i.e., digital tree or prefix tree) encoding, thus eliminating redundant data and/or duplicate data, which may provide a framework for running parallel batch algorithms alongside pointed data access streaming using minimal inputs/outputs per second (IOPS). Each component and embodiment of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may be recognized as a significant improvement over currently known methods of receiving, storing, and organizing large-volume data streams, as may be understood by those having ordinary skill in the art.
Some aspects of the present system and method may relate or include the formation of tuples, either upon transmission, receipt, or ingestion and storage of user-relevant information. Tuples are a finite ordered list or sequence. In mathematics, tuples are normally written as a list of numbers between parentheses. A simple tuple example could include the side lengths of a quadrilateral such as (2, 4, 2, 4), which may equate to a 4×2 rectangle. Obviously, tuples can represent any sequence of data having “n” discrete values, where “n” is any non-negative integer. A 0-tuple exists in one form and is commonly referred to as an “empty tuple.” While it is known that tuples are widely used in computer and data science, ingesting user-relevant data as tuples is not known to be a common practice to those having ordinary skill in the art. Many programming languages offer an alternative to tuples, known as record types, which may instead feature unordered elements accessed by label. A few programming languages combine ordered tuple product types and unordered record types into a single construct, as in C structs and Haskell records. Relational databases may formally identify their rows (records) as tuples, or at least be recognized as such by those having ordinary skill in the art. Tuples of ordered pairs, nested sets, and functions also exist and may also be relevant to the utility of tuples to computer and data science. Since computer languages usually comprise in their most basic sense a binary code, segmented binary codes may contain information or data when read by machines configured to read binary using applications which impart meaning into binary language. This means that, using an example of user-relevant data, n types of data can be individual components of an n-tuple. A user's geography, subscriber ID, session duration, and activity may be plotted as (G, S, D, A), respectively, and an example quadruple (4-tuple) could be (latitude/longitude, subscriber ID, time in minutes, web/social/video), and/or their binary equivalents.
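By way of a non-limiting illustration, the following sketch shows how such a 4-tuple might be formed in practice; the field names, ordering, and values below are assumptions chosen purely for demonstration and are not mandated by the disclosure:

```python
# Illustrative sketch only: the (G, S, D, A) layout and encodings are
# assumed for demonstration, not required by the disclosure.
record = ((40.7128, -74.0060),  # G: geography as latitude/longitude
          "subscriber-001",     # S: subscriber ID
          27,                   # D: session duration in minutes
          "video")              # A: activity class (web/social/video)

# Tuples are ordered and fixed-length, so each position carries meaning.
geography, subscriber_id, duration, activity = record
print(subscriber_id, activity)  # -> subscriber-001 video
```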
Related to the utility of tuples, and necessary for a full understanding of the disclosed system and method, may be tries. In computer science, a trie, or a digital tree or prefix tree, is a type of search tree, which is a tree-like data structure used for locating specific keys from within a set. These keys are often represented as strings distributed along the links between nodes. Tries are best understood using characters in a written language like English. For instance, a top node may be the letters “Pa”, and strings may be “t” and “rrot.” This way, instead of storing “Pat” and “Parrot” in a database to access later, they may be stored as a trie with “Pa” being the top node and “t” and “rrot” being strings, which when read would form the full words. Importantly, this both reduces the space required to store such data and preserves, and even may be useful in identifying, relationships between those words. Abstracting this concept to user-relevant data can not only reduce the space and/or bandwidth required to store, receive, or transmit user-relevant data upon transmission, receipt, or ingestion/storage, but may also be useful in categorizing users based on the factors identified in the most efficient trie structure. And while one trie arrangement may prove useful, most organized, or most efficient for one set of data, the same trie arrangement may fall short should different data be present or should the data change over time. Those skilled in the art may recognize features of these mathematical, data, and computer science schema, including tuple formation, trie formation, and trie optimization, as potential candidates for user-relevant database formation, in order to efficiently manage system and network resources while simultaneously offering useful information based on said trie formation and trie optimization, as well as ease of query/access to that useful information.
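A minimal sketch of such a character-keyed trie follows; the disclosed system may apply the same prefix-sharing principle to arbitrary record fields rather than letters, so the class and functions below are illustrative assumptions only:

```python
# Minimal character-keyed trie sketch; the disclosed tries may instead
# key on record fields, but the prefix-sharing idea is identical.
class TrieNode:
    def __init__(self):
        self.children = {}     # edge label -> child node
        self.terminal = False  # True if a complete key ends here

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.terminal

root = TrieNode()
insert(root, "pat")
insert(root, "parrot")  # shares the "pa" prefix with "pat"
print(contains(root, "parrot"), contains(root, "par"))  # True False
```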
In a potentially preferred exemplary embodiment, several improvements to an existing live streaming storage and organization regime may be required to receive the full benefit of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets. Briefly, these may include simple two-level hash-based data ingestion, high speed data ingestion, co-location of data inside tracks, self-optimizing layout systems and methods, and parallel batch processing algorithms. Alone or in combination, each of these improvements may contribute to improved archival performance, which will be better understood by those skilled in the art after a full review of this Summary along with the Drawings and Detailed Description below.
In one aspect, the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may include simple two-level hash-based data ingestion. Such simple two-level hash-based data ingestion may unambiguously map incoming records to a sector and a track within that sector. It may then perform heuristics-based clustering of records accrued within a track and organize the data therein as a “forest-of-tries”, which may span multiple entities, such as users and/or subscribers. This may result in a semi-optimal “data bag” that is smaller in size when compared to storage/organization methods such as storing incoming records laid next to each other. Simple two-level hash-based data ingestion will be understood in greater detail by those having skill in the art after a more thorough review of the Drawings and Detailed Description.
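By way of a non-limiting sketch, such a two-level mapping might be realized as follows; the 256×256 sector/track geometry, the hash function, and the entity key are all assumptions chosen for illustration:

```python
import hashlib
from collections import defaultdict

SECTORS, TRACKS = 256, 256  # assumed geometry, not mandated

def locate(entity_id):
    """Unambiguously map a record's entity to a (sector, track) pair."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    sector = int.from_bytes(digest[0:2], "big") % SECTORS  # level 1
    track = int.from_bytes(digest[2:4], "big") % TRACKS    # level 2
    return sector, track

# Records hashing to the same track accrue together, where they may later
# be clustered heuristically into a forest-of-tries spanning entities.
staging = defaultdict(list)
for rec in [("alice", "video", 27), ("bob", "web", 4), ("alice", "web", 9)]:
    staging[locate(rec[0])].append(rec)
```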
In another aspect, the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may include high-speed data ingestion. By using parallel sequential writes of each staged forest-of-tries into pre-designated file locations derived from sector and track addresses, such high-speed data ingestion may be accomplished with limited organizational tradeoffs. In a related aspect, a method of co-locating data inside each track may further increase the efficiency of the disclosed system. In this related aspect, by merging forests-of-tries containing data belonging to multiple different entities/users/subscribers with a larger forest-of-tries organized around individual entities/users/subscribers, and then by re-assessing the clustering scheme and attempting to organize the data using more dense tries, storage efficiency may be further improved without significant processing power tradeoffs and/or increases. Each of the disclosed high-speed data ingestion and co-location regimes, including the mechanisms, applications, systems, and machines involved, will be understood in greater detail by those having ordinary skill in the art after a more thorough review of the Drawings and Detailed Description.
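The following sketch suggests what such parallel sequential writes into pre-designated, per-track files might look like; the directory scheme, serialization format, and thread pool are illustrative assumptions rather than requirements of the disclosure:

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor

BASE = "/var/data/tries"  # hypothetical storage root

def track_path(sector, track):
    """Pre-designated file location derived from sector/track addresses."""
    return os.path.join(BASE, f"sector-{sector:03d}", f"track-{track:03d}.bag")

def flush_track(sector, track, forest):
    """Append one staged forest-of-tries with a single sequential write."""
    path = track_path(sector, track)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:  # sequential append, no random seeks
        f.write(pickle.dumps(forest))

# Because each track owns an independent file, writes may proceed in
# parallel without contending for the same on-disk location.
staging = {(17, 203): [("alice", "video", 27)], (54, 9): [("bob", "web", 4)]}
with ThreadPoolExecutor() as pool:
    for (sector, track), forest in staging.items():
        pool.submit(flush_track, sector, track, forest)
```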
In yet another aspect, the system and method of the disclosure may include a self-optimizing layout. Such a layout may require gradual and continuous exploration for higher-density trie encoding at the entity/user/subscriber level. The coined term “shake-and-settle” may be understood, after a thorough review of the Drawings and Detailed Description herein, to refer to an algorithm having the capability of gradually “settling in” entire datasets through continuously “shaking” constituent entities. While purely metaphorical, the shaking of a jar of lake water or other water containing sediment, dirt, clay, sand, etc., which may cause layers of each to form through settling, may be helpful to the understanding of such shake-and-settle methods of data organization, which will be further understood by those having ordinary skill in the art after a more thorough review of the Drawings and Detailed Description.
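A simplified sketch conveying the spirit of such a shake-and-settle pass follows; the permutation search, the encoded-size density metric, and the budget parameter are all assumptions made for illustration:

```python
import itertools
import pickle
import random

def encoded_size(records, field_order):
    """Assumed density metric: bytes needed by a prefix-shared (trie)
    encoding of the records when keyed in the given field order."""
    trie = {}
    for rec in records:
        node = trie
        for i in field_order:
            node = node.setdefault(rec[i], {})
    return len(pickle.dumps(trie))

def shake_and_settle(records, current_order, budget=8):
    """'Shake' an entity's records through alternative layouts within a
    resource budget; 'settle' on any layout that encodes more densely."""
    best_order = current_order
    best_size = encoded_size(records, current_order)
    candidates = list(itertools.permutations(range(len(records[0]))))
    for order in random.sample(candidates, min(budget, len(candidates))):
        size = encoded_size(records, order)
        if size < best_size:  # a denser layout was discovered
            best_order, best_size = order, size
    return best_order

records = [("US", "video", 27), ("US", "video", 9), ("US", "web", 4)]
print(shake_and_settle(records, current_order=(2, 1, 0)))
```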
In a potentially final aspect of the potentially preferred embodiment, the disclosed system and method may further include parallel batch processing algorithm(s). Such parallel batch processing algorithm(s) may possess on-demand pointed access to any related or unrelated entity of interest by relying on a two-level index yielding data in very few IOPS (~2 IOPS: 1 for index, 1 for data). Such an aspect of the disclosed system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets may assist in running a class of link-analysis workloads at much faster speeds and with greater efficiency, while enabling analysis of multiple entities/subscribers/users in parallel at an incremental pro-rata cost with little-to-no upfront cost to the machines of the system(s) of the disclosure. This aspect/feature will also be understood by those having ordinary skill in the art when viewing the Drawings in light of the Detailed Description.
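A non-limiting sketch of such two-level, pointed access follows; the fixed-width (offset, length) index entry and the file layout are illustrative assumptions:

```python
import struct

ENTRY = struct.Struct(">QQ")  # assumed fixed-width entry: (offset, length)

def read_entity(index_path, data_path, slot):
    """Pointed access in ~2 IOPS: one read of the index entry mapping an
    entity slot to its data extent, then one read of the extent itself."""
    with open(index_path, "rb") as idx:
        idx.seek(slot * ENTRY.size)  # IOP 1: the index
        offset, length = ENTRY.unpack(idx.read(ENTRY.size))
    with open(data_path, "rb") as dat:
        dat.seek(offset)             # IOP 2: the data
        return dat.read(length)

# Tiny demonstration with one indexed entity in throwaway files.
with open("data.bin", "wb") as d:
    d.write(b"alice-trie-bytes")
with open("index.bin", "wb") as i:
    i.write(ENTRY.pack(0, 16))
print(read_entity("index.bin", "data.bin", slot=0))  # b'alice-trie-bytes'
```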
As a whole, the disclosed system and method may be thought of as a regime, technique, or system for the efficient ingestion, initial organization, storage, and reorganization of data streams in order to provision a high-efficiency query and/or analysis data structure for ease of use for data relevant to an organization. While the disclosed systems and methods may be most applicable to organizations of sufficient size and userbase to require such efficient ingestion, initial organization, storage, reorganization, etc., such implementations of the disclosed systems and methods may be of use to smaller streaming datasets as well.
Benefits of the disclosed system and method, which may be recognized by those having ordinary skill in the art, may include reduction in net data volume during ingestion, streamlined ingestion of data, gradual optimization, and on-demand access to data for analysis. Reduction in net data volume (as measured in bytes) for ingestion may be accomplished through exploiting data repetition in records belonging to unrelated entities. This may be accomplished on-the-fly by encoding such ingested data into forests-of-tries. Insertion of data may be streamlined by leveraging efficient sequential I/O and by achieving high throughput through issuing multiple writes in parallel on independent files, which may thus enable the full capability of the underlying I/O system(s)/subsystem(s). By gradually optimizing already co-located data at the level of individual entities/users/subscribers through exploration of higher-density trie layouts in existing data, then reorganizing said data upon discovery of a better layout, the disclosed system further benefits users accessing the data for analysis. This aspect possesses the further benefit of being a non-disruptive and continuous operation that may be capable of running in parallel on independent entities/users/subscribers at rest without increasing load on other processes and/or equipment. Further increasing this utility, each iteration of each reorganization may be run according to a budget of prior-marked resources for discovery of denser layouts—the greater the resource budget, the higher the chances of discovering a better trie layout. Those skilled in the art may understand the obvious benefit of such a reorganization technique, as known low-streaming periods may be scheduled for additional reorganization resources and/or observed low-streaming periods may trigger the allocation of additional reorganization resources. Finally, another of the numerous features and benefits of the disclosed systems and methods may be the enablement of on-demand access to data belonging to related entities/users/subscribers of interest by incurring a fixed, predictable I/O cost from batch processing algorithms working entity-after-entity. These benefits, as well as other benefits which may be present but not fully described herein, may be better understood by those having ordinary skill in the art after a thorough review of the included Drawings and Detailed Description.
The foregoing illustrative summary, as well as other exemplary objectives and/or advantages of the disclosure, and the manner in which the same are accomplished, are further explained within the following Detailed Description and its accompanying Drawings.
The present disclosure will be better understood by reading the Detailed Description with reference to the accompanying drawings, which are not necessarily drawn to scale, and in which like reference numerals denote similar structure and refer to like elements throughout, and in which:
It is to be noted that the drawings presented are intended solely for the purpose of illustration and that they are, therefore, neither desired nor intended to limit the disclosure to any or all of the exact details of construction shown, except insofar as they may be deemed essential to the claimed disclosure.
DETAILED DESCRIPTION
Referring now to
The present disclosure solves the aforementioned limitations of the currently available devices and methods of ingestion, initial storage, optimization, and retrieval by providing a system and method for distributed, co-located, and self-organizing data storage and cluster computation framework for batch algorithms on big datasets.
In describing the exemplary embodiments of the present disclosure, as illustrated in
As will be appreciated by one of skill in the art, the present disclosure may be embodied as a method, data processing system, software as a service (SaaS) or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product on a computer-readable storage medium having computer-readable program code means embodied in the medium. Any suitable computer readable medium may be utilized, including hard disks, ROM, RAM, CD-ROMs, electrical, optical, magnetic storage devices and the like.
The present disclosure is described below with reference to flowchart illustrations of methods, apparatus (systems) and computer program products according to embodiments of the present disclosure. It will be understood that each block or step of the flowchart illustrations, and combinations of blocks or steps in the flowchart illustrations, can be implemented by computer program instructions or operations. These computer program instructions or operations may be loaded onto a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions or operations, which execute on the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks/step or steps.
These computer program instructions or operations may also be stored in a computer-usable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions or operations stored in the computer-usable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks/step or steps. The computer program instructions or operations may also be loaded onto a computer or other programmable data processing apparatus (processor) to cause a series of operational steps to be performed on the computer or other programmable apparatus (processor) to produce a computer implemented process such that the instructions or operations which execute on the computer or other programmable apparatus (processor) provide steps for implementing the functions specified in the flowchart block or blocks/step or steps.
Accordingly, blocks or steps of the flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions, and program instruction means for performing the specified functions. It should also be understood that each block or step of the flowchart illustrations, and combinations of blocks or steps in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems, which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions or operations.
Computer programming for implementing the present disclosure may be written in various programming languages, database languages, and the like. However, it is understood that other source or object-oriented programming languages, and other conventional programming languages, may be utilized without departing from the spirit and intent of the present disclosure.
Referring now to
Processor 102 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in
Whether configured by hardware, firmware/software methods, or by a combination thereof, processor 102 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when processor 102 is embodied as an ASIC, FPGA or the like, processor 102 may comprise specifically configured hardware for conducting one or more operations described herein. As another example, when processor 102 is embodied as an executor of instructions, such as may be stored in memory 104, 106, the instructions may specifically configure processor 102 to perform one or more algorithms and operations described herein.
The plurality of memory components 104, 106 may be embodied on a single computing device 10 or distributed across a plurality of computing devices. In various embodiments, memory may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 104, 106 may be configured to store information, data, applications, instructions, or the like for enabling the computing device 10 to carry out various functions in accordance with example embodiments discussed herein. For example, in at least some embodiments, memory 104, 106 is configured to buffer input data for processing by processor 102. Additionally, or alternatively, in at least some embodiments, memory 104, 106 may be configured to store program instructions for execution by processor 102. Memory 104, 106 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by the computing device 10 during the course of performing its functionalities.
Many other devices or subsystems or other I/O devices 212 may be connected in a similar manner, including but not limited to devices such as a microphone, speakers, a flash drive, a CD-ROM player, a DVD player, a printer, a main storage device 214, such as a hard drive, and/or a modem, each connected via an I/O adapter. Also, although preferred, it is not necessary for all of the devices shown in
In some embodiments, some or all of the functionality or steps may be performed by processor 102. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 102. For example, non-transitory computer readable storage media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control processors of the components of system 201 to implement various operations, including the examples shown above. As such, a series of computer-readable program code portions may be embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce the machine-implemented processes discussed herein.
Any such computer program instructions and/or other type of code may be loaded onto a computer, processor, or other programmable apparatus' circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code may be the means for implementing various functions, including those described herein.
Referring now to
Similar to user system 220, server system 260 preferably includes a computer-readable medium, such as random-access memory, coupled to a processor. The processor executes program instructions stored in memory. Server system 260 may also include a number of additional external or internal devices, such as, without limitation, a mouse, a CD-ROM, a keyboard, a display, a storage device and other attributes similar to computer system 10 of
System 201 is capable of delivering and exchanging data between user system 220 and a server system 260 through communications link 240 and/or network 250. Through user system 220, users can preferably communicate over network 250 with each other user system 220, 222, 224, and with other systems and devices, such as server system 260, to electronically transmit, store, manipulate, and/or otherwise use data exchanged between the user system and the server system. Communications link 240 typically includes network 250 making a direct or indirect communication between the user system 220 and the server system 260, irrespective of physical separation. Examples of a network 250 include the Internet, cloud, analog or digital wired and wireless networks, radio, television, cable, satellite, and/or any other delivery mechanism for carrying and/or transmitting data or other information, such as to electronically transmit, store, manipulate, and/or otherwise modify data exchanged between the user system and the server system. The communications link 240 may include, for example, a wired, wireless, cable, optical or satellite communication system or another pathway. It is contemplated herein that RAM 104, main storage device 214, and database 270 may be referred to herein as storage device(s) or memory device(s).
Referring now specifically to
Referring now specifically to
Turning now to
Turning now to
Having processed second exemplary type-1 trie 331b and third exemplary type-1 trie 331c into exemplary Alice subscriber trie 341a, exemplary Bob subscriber trie 341b, other subscriber tries for other subscribers/users (represented as ellipses . . . ), and exemplary Zoe subscriber trie 341z, we turn now to
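A simplified sketch of this regrouping step follows, reusing the illustrative subscriber names above; the record fields and merge logic are assumptions for demonstration only:

```python
from collections import defaultdict

# A track-level forest-of-tries spans many subscribers; co-location
# regroups the same records around each individual subscriber.
track_records = [
    ("alice", "video", 27), ("bob", "web", 4),
    ("alice", "web", 9), ("zoe", "social", 12),
]

def merge_into_subscriber_tries(records):
    tries = defaultdict(dict)  # subscriber -> per-subscriber trie
    for subscriber, activity, minutes in records:
        node = tries[subscriber].setdefault(activity, {})
        node.setdefault(minutes, {})  # leaf per observed duration
    return tries

subscriber_tries = merge_into_subscriber_tries(track_records)
print(sorted(subscriber_tries))  # ['alice', 'bob', 'zoe']
```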
Having stored exemplary Alice subscriber trie 341a, exemplary Bob subscriber trie 341b, other subscriber tries for other subscribers/users, and exemplary Zoe subscriber trie 341z on database 270, we turn now to
Turning now to
Turning back to
Turning back to
Having described potential systems for and methods of distributed, co-located, and self-organizing data storage and cluster computation framework, now the description turns to performing queries and/or batch algorithms on big datasets thereof. In summary thus far, 265 files may contain 65536 sectors on-disk in a trie-layout. Thereon the on-disk layout may further be a first level index as illustrated therein
Turning to
Turning now to
Having described various features of, steps in, systems of, machines of, embodiments (preferred and otherwise) of, and methods to provide a distributed, co-located, self-organizing data storage schema and/or design which may offer a suitable cluster computational framework for batch algorithms on big datasets, it may be necessary to share with those skilled in the art various benefits of so organizing and processing big datasets in an established large-subscriber-volume data environment having incoming subscriber-data streams. In existing installations of the disclosed system and method, the established framework may be extensively used for sparse data-backed-profile-computation and distinct destination mapping for subscribers of telecommunications companies having millions (or tens/hundreds of millions) of subscribers. On a single, low-end machine having a Dual-Core Intel® i3 Processor with 2 disks and 12 GB of RAM, applicant has achieved co-location of 31 billion records, of which approximately 30 billion may persist as co-located type-1 and type-2 subscriber-based tries and 1 billion incoming records queued in RAM. This has been performed in approximately 2 hours and stored on a disk consuming only 360 GB of disk space. Throughput was calculated at over 135,000 records per second per node. This may scale horizontally to achieve any desired throughput. Based on projections and laboratory testing, 1.1 trillion XDRs over 100 million subscribers at a daily load of 3 billion records, amounting to 25 TB of stored data, may be indexed using only 500 MB of RAM and 0.5 TB of disk space using the indexing methods herein disclosed. Much of this efficiency and speed may be attributable to the dense trie organizational structure, which may allow increases in efficiency for parallel computing processes using batch algorithms to obtain relevant data with respect to a query or algorithm.
With respect to the above description, then, it is to be realized that the optimum methods, systems, and their relationships, including variations in systems, machines, size, materials, shape, form, position, function and manner of operation, assembly, order of operation, type of computing devices (mobile, server, desktop, etc.), type of network (LAN, WAN, internet, etc.), size and type of database, data-type stored therein, and use, are intended to be encompassed by the present disclosure.
It is contemplated herein that the system and method may be used to automate the creation and maintenance of a system and method for robust, efficient, adaptive streaming replication application protocol with dancing recovery for high-volume distributed live subscriber datasets, as well as automating a variety of other tasks. Furthermore, the systems and methods disclosed herein may be used to organize, retrieve, and otherwise manage a variety of data types, including but not limited to subscriber data, customer data, user data, employee data, sales data, lead data, logistic event data, order data, inventory information and data, the like and combinations thereof. It is further contemplated that a variety of computing devices may be deployed on the system and method by a variety of service providers, managers, users, owners, agents, companies and other enterprises. These may include but are not limited to personal computers, laptops, desktops, mobile devices, smart phones, tablets, servers, virtualized machines, the like and/or combinations thereof. It is further contemplated that specialized equipment may be deployed to further improve the disclosed system and method. This description includes all manners and forms of systems and methods for robust, efficient, adaptive streaming replication, in all manner of devices and software platforms capable of such adaptive streaming and replication of live databases in order to automate, streamline, decrease runtime, increase efficiency of, and/or make possible the dancing recovery protocol as herein described. While the implementation of the disclosed system and method may be most applicable and/or relevant to ERP databases, many platforms may benefit from the disclosed system and method, including but not limited to Customer Relationship Managers (CRMs), Information Technology (IT) support and implementation software, database management software, accounting software, and other business and consumer software products which may store data across multiple live databases, archival databases, the like and/or combinations thereof.
The illustrations described herein are intended to provide a general understanding of the structure of various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus, processors, and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the description. Thus, to the maximum extent allowed by law, the scope is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
In the specification and/or figures, typical embodiments of the disclosure have been disclosed. The present disclosure is not limited to such exemplary embodiments. The use of the term “and/or” includes any and all combinations of one or more of the associated listed items. The figures are schematic representations and so are not necessarily drawn to scale. Unless otherwise noted, specific terms have been used in a generic and descriptive sense and not for purposes of limitation.
The foregoing description and drawings comprise illustrative embodiments. Having thus described exemplary embodiments, it should be noted by those skilled in the art that the within disclosures are exemplary only, and that various other alternatives, adaptations, and modifications may be made within the scope of the present disclosure. Merely listing or numbering the steps of a method in a certain order does not constitute any limitation on the order of the steps of that method. Many modifications and other embodiments will come to mind to one skilled in the art to which this disclosure pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation. Accordingly, the present disclosure is not limited to the specific embodiments illustrated herein but is limited only by the following claims.
Claims
1. A method for providing a distributed, co-located, self-organizing data storage and a cluster computational framework for batch algorithms on big datasets, the method comprising:
- performing a two-level hash-based data ingestion, said two-level hash-based data ingestion comprising: unambiguously mapping a plurality of incoming records to a sector and a track within said sector of a memory; performing a heuristics-based clustering of said plurality of incoming records upon receipt and accumulating said plurality of incoming records within a track of said memory; and organizing said plurality of records as a forest-of-tries spanning a plurality of entities associated with said plurality of incoming records, resulting in a semi-optimal data arrangement that is smaller in size when compared to said plurality of incoming records laid next to each other;
- performing parallel sequential writes of each of said forest-of-tries into a plurality of pre-designated file locations using a high-speed data ingestion derived from an address of each of said sector and said track; and
- co-locating a data inside one of said track by merging said forests-of-tries containing said data belonging to different of said plurality of entities with a larger forest-of-tries organized around each of said plurality of entities.
2. The method of claim 1, further comprising a step of re-assessing a resultant clustering scheme and attempting to organize said data using a progressively more dense trie organizational structure.
3. The method of claim 2, further comprising a step of self-optimizing a layout, said layout becoming gradually and continuously denser via a successive trie organizational encoding.
4. The method of claim 3, wherein said successive trie organizational encoding is performed at a level of each of said plurality of entities.
5. The method of claim 4, wherein said successive trie organizational encoding does not affect a speed or a processing of any of said high-speed data ingestion and a plurality of read operations, thereby condensing gradually said plurality of data.
6. The method of claim 5, further comprising a step of parallel batch processing using an algorithm.
7. The method of claim 6, wherein said algorithm has an on-demand pointed access to all of a plurality of entities of interest.
8. The method of claim 7, wherein said plurality of entities of interest is a subset of said plurality of entities.
9. The method of claim 8, wherein said parallel batch processing via said algorithm relies on a two-level index yielding said plurality of data.
10. The method of claim 9, further comprising a step of running a class of link-analysis workloads by analyzing said plurality of entities of interest in parallel at an incremental pro-rata cost minimally above a nominal fixed upfront cost.
11. A system for providing a distributed, co-located, self-organizing data storage and a cluster computational framework for batch algorithms on big datasets, the system comprising:
- a client computing device having at least a first processor, a first memory, and a first connection to a network; and
- a server computing device having at least a second processor, a second memory, a first non-transitory computer readable medium, and a second connection to said network;
- wherein said client computing device is configured to: unambiguously map a plurality of incoming records to a sector and a track within said sector of said first memory; perform a heuristics-based clustering of said plurality of incoming records as they accrue within said track; and organize said plurality of incoming records as a forest-of-tries spanning a plurality of entities, each of said plurality of entities associated with said plurality of incoming records, thereby resulting in a semi-optimal data arrangement that is smaller in size than said plurality of incoming records laid next to each other.
12. The system of claim 11, wherein said client computing device is further configured to transmit said forest-of-tries via said first network connection to said server computing device via said second network connection.
13. The system of claim 12, wherein said server computing device, in receipt of said forest-of-tries is configured to perform parallel sequential writes of each of said forest-of-tries onto a plurality of pre-designated file locations using a high-speed data ingestion derived from an address of each of said sector and said track, said parallel sequential writes occur on said first non-transitory computer readable medium.
14. The system of claim 13, wherein said server computing device is further configured to co-locate a data inside one of said track by merging said forests of tries containing said data belonging to a second plurality of entities having a larger forest-of-tries organized around each of said second plurality of entities, thereby having generated said larger forest-of-tries comprising said plurality of entities and said second plurality of entities.
15. The system of claim 14, wherein said server computing device is further configured to perform a step of re-assessing a resultant clustering scheme and a step of attempting to organize said data using a progressively more dense trie organizational structure.
16. The system of claim 15, wherein said server computing device is further configured to self-optimize a layout, said layout is gradually and continuously denser via a successive trie organizational encoding.
17. The system of claim 16, wherein said server computing device is further configured to perform said successive trie organizational encoding at a level of each of a larger plurality of entities corresponding to said larger forest-of-tries and containing records related to said plurality of entities and said second plurality of entities.
18. The system of claim 17, wherein said successive trie organizational encoding as performed on said server computing device is further configured to operate without affecting a speed or a processing of any of said high-speed data ingestion and a plurality of read operations, thereby condensing gradually said plurality of data stored on said first non-transitory computer readable medium.
19. The system of claim 18, wherein said server computing device via said second processor is configured to batch process using an algorithm and a parallel computing process.
20. The system of claim 19, wherein said algorithm has an on-demand pointed access to all of a plurality of entities of interest, said plurality of entities of interest is a subset of said larger plurality of entities.
Type: Application
Filed: Nov 22, 2021
Publication Date: May 26, 2022
Inventors: Arun K. Krishna (Bangalore), Pramod K. Prabhakar (Bangalore)
Application Number: 17/532,040