SYSTEM AND METHOD FOR ANALYZING AND CORRECTING RETAIL DATA
A computer system and method are disclosed that analyze and correct retail data. The system includes several client workstations and one or more servers coupled together over a network. A database stores various data used by the system. A business logic server uses competitive and complementary fusion to analyze and correct some of the data sources stored in the database server. The data fusion process itself is an iterative one, utilizing both competitive and complementary fusion methods. In competitive fusion, two or more data sources that provide overlapping attributes are compared against each other. More accurate/reliable sources are used to correct less accurate/reliable sources. In complementary fusion, relationships modeled where data sources overlap are projected to areas of the data framework in which fewer sources exist, enhancing the accuracy/reliability of those fewer sources even in the absence of the other sources upon which the models were based.
The present invention relates to computer software, and more particularly, but not exclusively, relates to systems and methods for analyzing and correcting retail data.
The measurement of sales in retail channels can be done via a variety of methods. Initially, sample-based audits of consumer purchases at check-out were extensively utilized—but were costly and subject to significant potential inaccuracies. With the advent and accuracy improvement in scanner-based point of sale (POS) data, tracking services such as those offered by Information Resources, Inc. (IRI), and A.C. Nielsen (ACN) are able to provide highly-granular (in terms of item, venue, and time), highly-accurate measurement of sales in several retail channels—including food/grocery, drug, mass merchandise, convenience, and military commissary. These POS-based offerings can be sample-based—i.e., rely on a statistically determined subset of the target population—or census-based—i.e., use all available data from all available venues.
While POS-based measurement offerings do an excellent job of reporting “what” sold, they provide little insight into “why” something sold—since they provide no consumer-level data. To fill this need, market research companies such as IRI and ACN have recruited national consumer panels—in which panelists report their households' purchases on a regular basis. This longitudinal sample allows the development of much deeper consumer insights (e.g., brand switching, trial and repeat, etc.).
However, consumer panels are not without their problems. As with any sample-based survey, consumer panels are subject to two types of errors (i.e., sampling errors and biases), where the total error is given by the sum: $(\text{Total Error})^2 = (\text{Sampling Error})^2 + (\text{Bias})^2$.
Sampling errors are those errors attributable to the normal (random) variation that would be expected due to the fact that, by the very act of sampling, measurements are not being taken from the entire population. Sampling errors can be reduced by increasing the sample size, since the standard deviation of the sampling distribution (often referred to as the “standard error”) decreases in inverse proportion to the square root of the sample size.
Biases are systematic errors that affect any sample taken by a particular sampling method. Because these errors are systematic, they are not affected by the size of the sample. Examples of panel biases include, but are not limited to:
- Recruitment bias—in which households recruited to participate in the panel are not representative of the target population (e.g., the overall population of the United States);
- Self-selection bias—in which households who choose to participate in the panel have slightly different buying habits than the average household (e.g., an orientation toward using promotions or adopting new products);
- Panelist turnover bias—in which the reporting effectiveness (accuracy and consistency) of panelists may vary over the time period in which they participate in the panel;
- Hereditary bias—in which individuals within a household share a tendency toward certain behaviors or medical conditions;
- Compliance bias—in which certain purchases or purchase occasions are consistently underreported by panelists;
- Item placement bias—in which panelists report products purchased that have not been accurately captured and/or classified in the hierarchy maintained by the data collector; and
- Projection bias—in which the weighting or projection system cannot fully adjust all geo-demographics or is stressed by over- or under-sampled segments of the target population.
While both bias and sampling error are present in consumer panel data, for panels of a size significant enough to be of use in tracking consumer purchases (e.g., the IRI and ACN panels), the vast majority of the error that is present is due to bias. Further, since bias is unaffected by sample size, the negative impact of bias relative to the negative impact of sampling error worsens as the panel size increases.
The negative impact of bias is substantially larger than that of sampling error for most products. Increasing the size of the sample (i.e., the size of the panel) will reduce only the sampling error and may, in fact, worsen any bias that may be present. Given the sizes of today's consumer panels, there is limited advantage to be gained by increasing the size of the panel—since over 90% of the total error is often due to non-sampling errors (i.e., bias).
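The relationship between panel size, sampling error, and bias described above can be illustrated numerically. The following is a minimal sketch only: the function names, the bias of 5.0, and the per-household standard deviation of 100.0 are hypothetical values chosen for illustration and do not come from any actual panel.

```python
import math

def total_error(bias: float, sigma: float, n: int) -> float:
    """Total error from (Total Error)^2 = (Sampling Error)^2 + (Bias)^2."""
    sampling_error = sigma / math.sqrt(n)  # standard error shrinks with sqrt(n)
    return math.sqrt(sampling_error ** 2 + bias ** 2)

def bias_share(bias: float, sigma: float, n: int) -> float:
    """Fraction of the squared total error that is attributable to bias."""
    sampling_error = sigma / math.sqrt(n)
    return bias ** 2 / (sampling_error ** 2 + bias ** 2)

# Hypothetical inputs: bias is held fixed while the panel size grows.
for n in (1_000, 10_000, 100_000):
    print(n, round(total_error(5.0, 100.0, n), 3), round(bias_share(5.0, 100.0, n), 3))
```

Under these assumed inputs, the total error approaches the bias from above as the panel grows, and the share of error due to bias rises toward 100%, which is why enlarging the panel alone offers limited benefit.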
There has been little progress in the area of developing a systematic method of identifying and quantifying these biases. Further advancements are needed in this area.
Another area of concern in retail sales measurement is “coverage”. Coverage includes both the number of channels in which measurements are reported and the business usefulness of those measurements. While Information Resources, Inc.'s (IRI's) point-of-sale (POS) based services provide excellent coverage of the Food/Grocery, Drug, Mass (excluding WALMART®), Convenience, and Military channels, these channels may account for only 50% of a manufacturer's sales—and as little as 20% of its sales growth. Non-tracked, growth channels—e.g., Club, Dollar, WALMART®—are, thus, becoming an increasingly important part of manufacturers' businesses while at the same time having little data available in the way of actionable sales measurement information. Further advancements are also needed in this area.
SUMMARY
One form of the present invention is a unique system for analyzing and correcting retail data.
Other forms include unique systems and methods to identify, quantify, and correct consumer panel biases. Yet another form includes unique systems and methods to model relationships where data sources overlap to project values in areas in which fewer sources exist.
Another form includes operating a computer system that has several client workstations and servers coupled together over a network. At least one server is a database server that stores sales data for various data sources, product identifier and attribute categorizations, calculated factors, and other data. External sources can be used to feed the data store on a scheduled or on-demand basis. At least one server contains business logic for analyzing and correcting some of the data sources stored in the database server. Some client workstations can be used to administer settings used in the process of analyzing and correcting the data sources. Other client workstations can be used to view the corrected and/or uncorrected data in a multi-dimensional format using a graphical user interface.
Another form includes providing a computer system that uses multiple data sources to support inferences that would not be feasible based upon any single data source when used alone. Sales are positioned along product, venue, and time dimension hierarchies. Characteristics of the data source determine the level of aggregation at which the data can be positioned in the framework. For example, POS data may be available weekly in a particular channel; however, direct store delivery (DSD) data may be available at a daily level, and still other measures may be available only at a monthly or quarterly level. The situation is similar along the product and venue dimensions—ranging from the specificity of the sale of a particular UPC-coded item at a particular store to the generality of total category sales within a channel (across all geographies).
Once this data framework is populated, the data fusion process itself is an iterative one, utilizing both competitive and complementary fusion methods. In “competitive fusion”, two or more data sources that provide overlapping measurements along at least one dimension are compared (“competed”) against each other at some level of aggregation along the product, venue, and time dimensions. More accurate/reliable sources are used to correct less accurate/reliable sources. In “complementary fusion”, relationships modeled where data sources overlap are projected to areas of the data framework in which fewer (or even a single) sources exist, enhancing the accuracy/reliability of those fewer (or single) sources even in domains where data from the other sources upon which the models were based do not exist. The process is iterative in that the competitive and complementary fusion methodologies can be repeated at varying levels of aggregation of the data framework.
Another form includes providing a method for identifying and quantifying biases in consumer panel data so that the inherent utility of the consumer panel data may be enhanced. This method is termed competitive fusion. At least two data sources are used, with at least one assumed to be more accurate than the other (e.g., scanner-based POS data and consumer panel purchase data). The data sources are aligned along a common framework (i.e., data model or hierarchy) along the dimensions of product (item), venue (channel and/or geography), and/or time, with aggregation along these dimensions as necessary. Attributes along which the framework may be characterized are then identified. The data sources are compared along these attributes, quantifying the impact of the attributes on the less-accurate data source.
After these biases have been identified and quantified, the usefulness of the consumer panel data may be enhanced. The effect of the biases may be corrected for via modeling; i.e., the raw data may be adjusted to reduce or eliminate the effect of the biases. Furthermore, as appropriate, panel management practices may be changed in order to remove or lessen the source of bias in the panel itself.
Yet another form of the present invention includes providing a method for using complementary fusion to “project” the results and relationships from the competitive fusion method onto consumer panel data in a channel with incomplete/less data than desired (e.g. data from WALMART®) to help enhance the accuracy of the Panel data source. At this point, competitive fusion may be used again in several possible ways and at several levels of aggregation along the venue, time, and/or product dimensions in order to develop independent estimates against which the complementary-fused estimate may be competed:
- Publicly available data about the incomplete channel (e.g., channel reports, reported sales and financials, store databases, geo-demographics, etc.) may be used to develop an independent venue (channel) estimate.
- Publicly available data about the category of interest (e.g., category studies, industry reports, reported sales/financials, etc.) may be used to develop an independent category estimate.
- Private data from manufacturer-partners (e.g., shipment data, delivery data, retailer-supplied data, etc.) may be used to develop independent channel and category estimates. Due to the potentially sensitive nature of some of these data sources, this competitive fusion may be performed inside a manufacturer's facility—as an auxiliary input to the baseline model.
- Private data from retailer-partners within a Collaborative Retail Exchange may be used in some venues to develop independent channel and category estimates.
Yet other forms, embodiments, objects, advantages, benefits, features, and aspects of the present invention will become apparent from the detailed description and drawings contained herein.
For the purposes of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.
One embodiment of the present invention includes a unique system for identifying, quantifying, and correcting consumer panel biases, and then using overlapping areas of the data sources to project values in areas where fewer or less complete sources exist.
Computers 21 include one or more processors or CPUs (50a, 50b, 50c, 50d, and 50e, respectively) and one or more types of memory (52a, 52b, 52c, 52d, and 52e, respectively). Each memory 52a, 52b, 52c, 52d, and 52e includes a removable memory device. Each processor may be comprised of one or more components configured as a single unit. Alternatively, when of a multi-component form, a processor may have one or more components located remotely relative to the others. One or more components of each processor may be of the electronic variety defining digital circuitry, analog circuitry, or both. In one embodiment, each processor is of a conventional, integrated circuit microprocessor arrangement, such as one or more PENTIUM III or PENTIUM 4 processors supplied by INTEL Corporation of 2200 Mission College Boulevard, Santa Clara, Calif. 95052, USA.
Each memory (removable or generic) is one form of computer-readable device. Each memory may include one or more types of solid-state electronic memory, magnetic memory, or optical memory, just to name a few. By way of non-limiting example, each memory may include solid-state electronic Random Access Memory (RAM), Sequentially Accessible Memory (SAM) (such as the First-In, First-Out (FIFO) variety or the Last-In, First-Out (LIFO) variety), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), or Electrically Erasable Programmable Read-Only Memory (EEPROM); an optical disc memory (such as a DVD or CD ROM); a magnetically encoded hard disc, floppy disc, tape, or cartridge media; or a combination of any of these memory types. Also, each memory may be volatile, nonvolatile, or a hybrid combination of volatile and nonvolatile varieties.
Although not shown in
Computer network 22 can be in the form of a wired or wireless Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN) such as the Internet, a combination of these, or such other network arrangement as would occur to those skilled in the art. The operating logic of system 20 can be embodied in signals transmitted over network 22, in programming instructions, dedicated hardware, or a combination of these. It should be understood that more or fewer computers 21 can be coupled together by computer network 22.
In one embodiment, system 20 operates at one or more physical locations where business logic server 24 is configured as a server that hosts and runs application business logic 33, database server 25 is configured as a database 34 that stores reference data 35 (e.g. product identifiers 36a, attributes 36b, and a dictionary 36c), at least two retail data sources (such as point-of-sale and panel data) 38, calculated factors 39, and other data 40. In one embodiment, external data 26 is imported to database server 25 from a mainframe extract file that is generated on a periodic basis. Various other scenarios are also possible for using and importing external data to database server 25. In another embodiment, external data sources are not used. In one embodiment, database 34 of database server 25 is a relational database and/or a data warehouse. Alternatively or additionally, database 34 can be a series of files, a combination of database tables and external files, calls to external web or other services that return data, and various other arrangements for accessing data for use in a program as would occur to one of ordinary skill in the art. Client workstations 30 are configured for providing one or more user interfaces to allow a user to modify settings used by business logic 33 and/or to view the retail data sources 38 of database 34 in a multi-dimensional format. Typical applications of system 20 would include more or fewer client workstations of this type at one or more physical locations, but three have been illustrated in
Referring also to
In one form, procedure 150 is at least partially implemented in the operating logic of system 20. Procedure 150 begins with business logic server 24 identifying at least two data sources, with at least one data source being more accurate than another (stage 152). At least one data source (see e.g. 36 in
Various examples herein illustrate using consumer panel purchase data as the target data source to be corrected. However, the current invention can be used with other data sources, such as sample-based or survey-based data sources whose overall accuracy is limited by the presence of biases, to name a few non-limiting examples.
The product characteristics of the data sources should ideally be available at the item level, where “item” is by UPC, SKU, or another unique product identifier. In terms of the venue characteristics of the data sources, they should ideally be available at the retailer and market level, where “retailer” is a store (or chain of stores) within a particular retail channel and “market” is a geographic construct (e.g., Chicago area). In terms of the time characteristics of the data sources, they should ideally be available at the weekly level (or even daily in some cases), although monthly data (or 4-week “quad” data) or various other time frames are also acceptable. Where these levels of granularity are not possible, more aggregated levels of the product (e.g., “brand”), venue (e.g., “food” or “mass” channel for retailer and/or “region” or “total U.S.” for market), and/or time (e.g., quarterly or annual data) dimensions may be used.
After the data sources have been identified (stage 152), they are next aligned along a common framework (stage 154), such as along the item, venue, and/or time dimensions. Depending upon the characteristics (and quality) of the data sources, some aggregation along these dimensions may be required in order for the alignment to be possible. For example, UPC-level POS data may need to be aggregated at the SKU or even brand level in order to be aligned with data from other sources (particularly in the cases in which venue-specific UPCs are involved). Similarly, store-level data may need to be aggregated at the local market or even regional level in order to be aligned with consumer panel purchase data. Finally, weekly (or even daily) POS data may need to be aggregated at the 4-week quad level in order to be aligned with shipment/delivery data. Various other arrangements for aligning the data along a common framework are also possible.
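The alignment step described above can be sketched as a simple aggregation. This is a minimal illustration under stated assumptions: the lookup tables `UPC_TO_BRAND` and `STORE_TO_MARKET`, the helper names, and the sample rows are all hypothetical, and weeks are assumed to be numbered from 1 so that weeks 1 through 4 form the first quad.

```python
from collections import defaultdict

# Hypothetical lookup tables; in practice these would come from reference data.
UPC_TO_BRAND = {"0001": "BrandA", "0002": "BrandA", "0003": "BrandB"}
STORE_TO_MARKET = {"S1": "Chicago", "S2": "Chicago", "S3": "Peoria"}

def quad_of(week: int) -> int:
    """Map a 1-based week number to its 4-week quad (weeks 1-4 -> quad 1)."""
    return (week - 1) // 4 + 1

def align(rows):
    """Aggregate (upc, store, week, volume) rows to (brand, market, quad) totals."""
    totals = defaultdict(float)
    for upc, store, week, volume in rows:
        key = (UPC_TO_BRAND[upc], STORE_TO_MARKET[store], quad_of(week))
        totals[key] += volume
    return dict(totals)

# Hypothetical UPC-level, store-level, weekly POS rows.
pos_rows = [
    ("0001", "S1", 1, 10.0),
    ("0002", "S2", 3, 5.0),
    ("0003", "S3", 5, 7.0),
]
aligned = align(pos_rows)
print(aligned)
```

Once every data source has been rolled up to the same (product, venue, time) keys in this way, values from different sources can be compared cell by cell.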
In one embodiment, the item structure is provided by a multiple-level hierarchy, in which UPCs are the lowest level and are aggregated along category-related characteristics. Venue structure is provided along both geographical and channel dimensions, with FIPS-code-level transactions being aligned along markets and regions and store locations being part of a sub-chain, chain, and parent store hierarchy. Time structure is presently provided at the weekly level at the lowest level of aggregation, with daily data being aggregated at the weekly level before placement into the structure, although a daily data compatible structure or other variation is also possible.
As a result of aligning the data sources along a common framework (stage 154), overlapping attribute segments of at least one dimension are available to use for data comparison and correction. Certain attributes associated with the data sources are identified along which more detailed comparisons may be made. In one embodiment, product attributes are available from reference data 35 of database 34. For example, one or more pieces of information from product identifier 36a, attributes 36b, and dictionary 36c references can be used to access or modify attributes, attribute hierarchies, and mappings. These attributes represent category-specific dimensions along which products in that category may be characterized (e.g., diet vs. regular in carbonated soft drinks, active ingredient in internal analgesics, product size in most categories). The term attribute used herein is meant in the generic sense to cover various types of descriptors.
Business logic server 24 compares the data sources and calculates factors for the attributes of at least one element of the common framework (stage 158). Each segment of a given attribute will have its own factor, as described in detail herein. The presence of attribute-related bias may be identified by comparison of the data sources. In the examples illustrated herein, volumetric comparisons are made (e.g., equivalent units); however, various other measures (e.g., dollar sales, actual units) could also be utilized, as long as the same type of measure is being used for the comparison. For example, it would not be useful to compare dollar sales to actual units, but it would be useful to compare dollars to dollars. The comparison itself is between the value of the target data source (e.g., projected panel volume) and that of the reference data source (e.g., POS data). This comparison can be by way of two-sample inference, regression analysis, or other statistical tests appropriate for determining whether any differences between the two data sources are associated with the attributes along which they have been characterized at a statistically significant level. Where such differences (biases) are identified, they are quantified, and factors are calculated for use in bias correction/adjustment.
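One simple way to quantify a per-segment difference is a volume ratio between the reference and target sources. The sketch below assumes that form of factor, which is an illustrative choice and not necessarily the calculation used by business logic 33; the function name and sample volumes are hypothetical, and the statistical significance testing mentioned above is omitted.

```python
def segment_factors(reference, target):
    """Return the reference/target volume ratio for each attribute segment.

    Assumption for illustration: the factor is a plain ratio of volumes.
    Segments with no target volume keep a neutral factor of 1.0.
    """
    factors = {}
    for segment, ref_volume in reference.items():
        tgt_volume = target.get(segment, 0.0)
        factors[segment] = ref_volume / tgt_volume if tgt_volume > 0 else 1.0
    return factors

# Hypothetical equivalent-unit volumes for one attribute (diet vs. regular).
pos_volume = {"diet": 120.0, "regular": 300.0}      # reference source
panel_volume = {"diet": 100.0, "regular": 375.0}    # target source
factors = segment_factors(pos_volume, panel_volume)
print(factors)
```

With these sample volumes the diet segment receives a factor of 1.2 (panel under-reports) and the regular segment a factor of 0.8 (panel over-reports). Both inputs use the same measure, consistent with the requirement that like be compared with like.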
The factors are used to correct bias in the less accurate data source (stage 160), which in this example is consumer panel data. By using the factors to correct the bias in the less accurate “target” data source, the effect of these biases is reduced or eliminated. These biases can be corrected by adjusting the raw data, or by way of post-adjustment.
In “complementary fusion”, the factors are also used to supplement the data that is incomplete in the less complete data source (stage 162), such as consumer panel data. Incomplete data is used in a general sense to mean that less data was provided than desired or that the data is less accurate than desired, to name a few non-limiting examples. Where highly accurate data (e.g. POS data) is not provided, less accurate data (e.g. panel data) becomes more important to analyze and correct. Relationships modeled where data sources overlap are projected to areas of the data framework in which fewer (or even a single) sources exist, enhancing the accuracy and reliability of those fewer (or single) sources even in domains where data from the other sources upon which the models were based do not exist.
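The projection idea can be sketched by applying factors learned in a channel where both sources exist to panel data in a channel where only panel data exists. This is a hedged illustration: the function name, the factor values, and the volumes are hypothetical, and treating unseen segments as unadjusted (factor 1.0) is an assumption made for the example.

```python
def project(panel_by_segment, factors):
    """Scale panel volumes by factors modeled where the data sources overlap.

    Assumption: segments with no modeled factor are left unadjusted (1.0).
    """
    return {
        segment: volume * factors.get(segment, 1.0)
        for segment, volume in panel_by_segment.items()
    }

# Hypothetical factors derived from a channel that has both POS and panel data.
overlap_factors = {"diet": 1.2, "regular": 0.8}

# Hypothetical panel volumes in a channel without POS data.
untracked_panel = {"diet": 50.0, "regular": 200.0, "caffeine-free": 10.0}

projected = project(untracked_panel, overlap_factors)
print(projected)
```

The resulting estimate could then be competed against independent channel or category estimates, as described for the iterative process.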
Users and/or reports can access database 34 from one of client workstations 30 to view/analyze the corrected and adjusted data (stage 164). Users and/or reports can also access database 34 from one of client workstations 30 to view and/or modify settings used by system 20 to make data corrections. The steps are repeated as desired (stage 166). The process then ends at stage 168.
In one embodiment, a parameter specification for the number of weeks used in calculating the factors is thirteen, and the minimum week range included in database 34 is then set to be thirteen weeks prior to the update week. Database 34 may be built and maintained using various data sources and can include various types of data, as would occur to one of ordinary skill in the art. In one embodiment, system 20 supports the option to pull the desired period (e.g. all thirteen weeks) of the data sources 38, append the recent period (e.g. four weeks) needed since the last factor update to the existing database 34, and/or be able to recreate the data a week at a time. In such a scenario, for space conservation, the system can optionally drop the same number of weeks from the start week of database 34 as were appended to the end week. For example, if the option was chosen to append the four weeks needed since the last factor update, the system should drop the four oldest weeks from the existing database 34 when appending the four new weeks.
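The append-and-drop behavior described above resembles a fixed-length rolling window. The sketch below uses Python's `collections.deque` with `maxlen` to illustrate it; representing each week as an integer label and the name `append_weeks` are assumptions made for the example, not a description of how database 34 is actually maintained.

```python
from collections import deque

WINDOW_WEEKS = 13  # number of weeks used in calculating the factors

def append_weeks(window, new_weeks):
    """Append new weeks; a deque with maxlen discards the oldest entries."""
    for week in new_weeks:
        window.append(week)
    return window

# Hypothetical week labels 1..13 already stored.
window = deque(range(1, 14), maxlen=WINDOW_WEEKS)

# Four new weeks arrive since the last factor update.
append_weeks(window, [14, 15, 16, 17])
print(list(window))
```

After the four new weeks are appended, the four oldest weeks have been dropped and the window still holds thirteen weeks.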
The received updates to reference data 35 and/or data sources 38 are stored in database 34 (stage 174). At some point in time, such as on a scheduled or as-requested basis, the system determines that data adjustments should be made to correct bias (decision point 175). Application business logic 33 ensures reference data 35 and data sources 38 are up to date, and if not, updates them accordingly (stage 176). Optionally, reference data 35 is reviewed to ensure that the default attributes for the current category will be appropriate for the client or scenario, and adjustments are made to reference data 35 as appropriate (stage 177). As one non-limiting example, attribute segments may be reviewed and translated to more succinct segmentations that better classify the product identifiers. Other variations are also possible.
A product-identifier-to-attribute-segment mapping is prepared for the product identifiers (e.g. UPC's) (stage 178). If the attributes are determined to be irrelevant, they can be removed from further consideration in this process. The attribute table 36b is a reference table that maps each product identifier 36a to a set of attribute variables. While UPC's are described as a common product identifier, other identifiers could also be used. For example, not every dataset has a UPC, but may have a product identifier at a higher, lower, or equivalent level. Rules are used to determine supportable attribute segments and relevant attributes. In one embodiment, if segment assignment is missing then the UPC is assigned to a new segment “not supportable.” All segments with less than a 5% share are assigned to “not supportable.” Furthermore, in one embodiment, if the final “not supportable” category accounts for >50% of the category share, then the attribute is designated as “irrelevant.” Other ways for determining relevance can also be used, or relevance can simply be ignored. Stage 178 can be repeated to arrive at the final level of segments to use (rolled-up or drilled-down) as appropriate.
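The segment rules of this embodiment can be sketched as follows. The sketch assumes that "share" means share of category volume, which the text does not state explicitly; the function name, the sample UPC labels, and the volumes are hypothetical.

```python
NOT_SUPPORTABLE = "not supportable"

def classify_segments(upc_segment, upc_volume, min_share=0.05, max_unsupported=0.50):
    """Apply the 5% and 50% rules to one attribute.

    Returns (final UPC-to-segment mapping, whether the attribute is relevant).
    Assumption: shares are computed on category volume, and total volume > 0.
    """
    total = sum(upc_volume.values())
    # A missing segment assignment places the UPC in "not supportable".
    assigned = {upc: (upc_segment.get(upc) or NOT_SUPPORTABLE) for upc in upc_volume}
    share = {}
    for upc, segment in assigned.items():
        share[segment] = share.get(segment, 0.0) + upc_volume[upc] / total
    # Segments below the minimum share are folded into "not supportable".
    small = {segment for segment, s in share.items() if s < min_share}
    final = {
        upc: (NOT_SUPPORTABLE if segment in small else segment)
        for upc, segment in assigned.items()
    }
    unsupported = sum(upc_volume[u] for u, s in final.items() if s == NOT_SUPPORTABLE)
    relevant = unsupported / total <= max_unsupported
    return final, relevant

# Hypothetical category: UPC "D" has no segment assignment, "C" is a tiny segment.
segments = {"A": "diet", "B": "regular", "C": "cherry", "D": None}
volumes = {"A": 60.0, "B": 30.0, "C": 4.0, "D": 6.0}
final, relevant = classify_segments(segments, volumes)
print(final, relevant)
```

In this sample, UPCs "C" and "D" end up in "not supportable", which accounts for 10% of category volume, so the attribute remains relevant.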
Continuing with
An initial factor of 1.0 is assigned to all attribute segments (stage 212). Continuing with
In one embodiment, if two or more segments for the current attribute were non-significant (stage 220), then the significant factors (that remain) will need to be re-aligned to account for non-significant segment factors being removed (stage 222). At the product identifier-level target (panel) data, each volume is multiplied by the factor for the corresponding segment (stage 224). Again, other mathematical variations could also be used. The factors for each attribute segment are then saved to factor data store 39 of database 34 (stage 226). If another attribute is present (decision point 228), the next attribute is made the current attribute (stage 230) and stages 214-226 are repeated. These stages are repeated until all attributes are processed. Continuing with
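The attribute-by-attribute loop can be sketched as shown below. This is an illustrative simplification under stated assumptions: factors are taken to be plain POS-to-panel volume ratios, the significance test and the re-alignment of stage 222 are omitted, and all names and sample volumes are hypothetical.

```python
def apply_attribute(panel, pos, upc_segment):
    """One pass for one attribute: compute segment factors and scale panel volumes."""
    pos_total, panel_total = {}, {}
    for upc, segment in upc_segment.items():
        pos_total[segment] = pos_total.get(segment, 0.0) + pos[upc]
        panel_total[segment] = panel_total.get(segment, 0.0) + panel[upc]
    factors = {segment: 1.0 for segment in pos_total}  # initial factor of 1.0
    for segment in factors:
        if panel_total[segment] > 0:
            # Assumed form of the factor: ratio of reference to target volume.
            factors[segment] = pos_total[segment] / panel_total[segment]
    adjusted = {upc: panel[upc] * factors[upc_segment[upc]] for upc in panel}
    return adjusted, factors

def correct(panel, pos, attributes):
    """Process each attribute in turn, keeping the factors computed for each."""
    saved = {}
    for name, mapping in attributes.items():
        panel, saved[name] = apply_attribute(panel, pos, mapping)
    return panel, saved

# Hypothetical product-identifier-level volumes and two attributes.
pos = {"u1": 100.0, "u2": 50.0, "u3": 150.0}
panel = {"u1": 80.0, "u2": 50.0, "u3": 200.0}
attributes = {
    "type": {"u1": "diet", "u2": "diet", "u3": "regular"},
    "size": {"u1": "small", "u2": "large", "u3": "large"},
}
adjusted, saved = correct(panel, pos, attributes)
print(adjusted, saved)
```

Because each pass starts from the volumes adjusted by the previous pass, the order in which attributes are processed affects the intermediate factors; the sketch simply uses the order in which the attributes are supplied.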
FIGS. 7A-7C are first, second, and third parts of a process flow diagram for the system of
Continuing with
Continuing with
A hypothetical example will now be described in
Turning to
As shown in
Alternatively or additionally, once data fusion has been performed as described herein, the updated data can be used by various systems, users, and/or reports as appropriate.
In one embodiment of the present invention, a method is disclosed comprising identifying a plurality of data sources, wherein at least a first data source is more accurate than a second data source; identifying a plurality of overlapping attribute segments to use for comparing the data sources; calculating a factor as a function of each of the plurality of overlapping attribute segments; and using the factors to update a first group of values in the second data source to reduce bias.
In another embodiment of the present invention, a method is disclosed comprising receiving point-of-sale data and panel data on a periodic basis; identifying a plurality of product identifiers and a plurality of attributes to analyze; retrieving and summarizing the point-of-sale data and the panel data by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a factor for each attribute segment of a particular attribute; and applying the factors for the particular attribute segment to the panel data to correct panel bias.
In yet another embodiment, a method is disclosed comprising receiving point-of-sale data and panel data on a periodic basis; identifying a plurality of product identifiers and a plurality of attributes to analyze; retrieving and summarizing the point-of-sale data and the panel data by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a plurality of factors, wherein one factor is calculated for each attribute segment of the plurality of attributes; applying the factors to the panel data to reduce bias; and applying the factors to the panel data to reduce incompleteness.
In yet a further embodiment, a method is disclosed comprising identifying a plurality of product identifiers and a plurality of attributes to analyze for at least two data sources, wherein at least a first data source is more accurate than a second data source; retrieving and summarizing the first data source and the second data source by the plurality of product identifiers, the plurality of attributes, and a plurality of corresponding attribute segments for a specified time period; calculating a plurality of factors, wherein one factor is calculated for each attribute segment of the plurality of attributes; applying the factors to the second data source to reduce bias; and applying the factors to a different or overlapping dataset of the second data source to reduce incompleteness.
In another embodiment, a system is disclosed that comprises one or more servers being operable to store retail data from at least two data sources, store product identifier and attribute categorizations, and store a plurality of factor calculations; wherein the at least two data sources includes a first data source that is more accurate than a second data source; and wherein one or more of said servers contains business logic that is operable to identify and retrieve a plurality of overlapping attribute segments to use for comparing the at least two data sources, compare each of the overlapping attribute segments, calculate a factor for each of the overlapping attribute segments, and use the factors to update a first group of values in the second data source to reduce bias.
In yet a further embodiment, an apparatus is disclosed that comprises a device encoded with logic executable by one or more processors to: identify and retrieve a plurality of overlapping attribute segments to use for comparing at least two data sources, wherein the at least two data sources includes a first data source that is more accurate than a second data source, compare each of the overlapping attribute segments, calculate a factor for each of the overlapping attribute segments, and use the factors to update a first group of values in the second data source to reduce bias.
A person of ordinary skill in the computer software art will recognize that the client and/or server arrangements, user interface screen content, and data layouts could be organized differently to include fewer or additional options or features than as portrayed in the illustrations and still be within the spirit of the invention.
While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes, and modifications that come within the spirit of the invention as described herein and/or by the following claims are desired to be protected.
Claims
1. A method comprising:
- identifying a plurality of data sources, wherein at least a first data source is more accurate than a second data source;
- identifying a plurality of overlapping attribute segments to use for comparing the data sources;
- calculating a factor as a function of each of the plurality of overlapping attribute segments; and
- using the factors to update a first group of values in the second data source to reduce bias.
Type: Application
Filed: Oct 29, 2007
Publication Date: Jul 3, 2008
Inventors: MICHAEL W. KRUGER (GLENVILLE, IL), CHERYL G. BERGEON (ARLINGTON HEIGHTS, IL), ARVID C. JOHNSON (FRANKFORT, IL)
Application Number: 11/926,347
International Classification: G06F 17/30 (20060101);