Method and System for Content Based Demographics Prediction for Websites
Systems and methods for predicting characteristics of a web user, determining a combination of websites to obtain a target demographic mix, determining a set of keywords to buy to obtain a target demographic mix, selecting websites from market research and designing websites to appeal to an audience with desired demographic characteristics. Systems and methods may include determining features of web-pages of ad-carrying, target websites, applying prediction models to the determined features of the ad-carrying, target websites to predict values of demographic attributes of the ad-carrying, target websites, receiving one or more inputs including a target demographic mix, receiving a number that indicates an amount of visitors of the ad-carrying, target websites, and determining a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the ad-carrying, target websites.
The present application claims priority under 35 U.S.C. 119(e) of U.S. Provisional Patent Application No. 61/139,422, filed on Dec. 19, 2008, and U.S. Provisional Patent Application No. 61/233,789, filed on Aug. 13, 2009, the disclosures of which are expressly incorporated by reference herein in their entirety.
BACKGROUND
Demographics play an important role in web advertising, web searching and generally the personalization of web applications. Applications like web search engines might adjust the ranking of search results based on the demographic attributes of a user like age, gender and occupation. Another important domain where demographics play an important role is online advertising. With the growth of web usage, online advertising has grown rapidly in recent years. In particular, contextual advertising is becoming popular. Behavior targeting using demographic attributes helps advertisers to target specific users with demographically relevant advertisements.
One approach to obtain demographics of a website is through panel studies similar to that of TV program rating. In this approach, panels with known demographic information are recruited and their browsing histories are recorded. These browsing histories of panels with different demographic attributes are used to compute demographics of websites. However, this approach requires impractically large sizes of panels to guarantee any reasonable coverage of websites. Additionally, if a site is not visited by any of the panels, then the demographics of the website cannot be estimated.
Another approach to obtain demographics of a particular website is by using information provided by that website's registered visitors or by asking some of its visitors to participate in online surveys. These techniques capture information only about the limited subset of visitors that have chosen to register and/or participate in the surveys. In addition, since not all segments of a website's visitors are equally likely to participate in the above activities, the resulting information is subject to a sampling bias. Furthermore, since each individual can potentially register and/or take the surveys multiple times, the demographics obtained via this approach may not be accurate. Additionally, since the information provided by the visitors during registration or during their participation in surveys can potentially be used to describe and/or identify them, its use for any purpose other than the one intended represents a potential intrusion upon a user's expectation of privacy.
Another approach is to build a computational or statistical model to predict a website's demographic information. The existing approaches for building such models use data obtained by tracking users' browsing behavior across different websites, information about the content of the web-pages that the users visit, and information associated with the users' profile. The profile of a user (or a group of users) is often constructed by integrating various elements across different websites and contains information related to any data provided during registration, web-pages viewed, products purchased, advertisement clicked, etc. With the growing concern regarding privacy on the Internet, people are reluctant to share their personal data, and therefore, the applicability of existing approaches relying on such personal data can be limited.
Due to the combination of the above factors, and other factors, the methods in use today for characterizing the audience characteristics of websites are limited in their accuracy, in their ability to cover a large number of websites with substantial audience traffic, and in their ability to protect a user's right to information privacy.
SUMMARY
Systems and methods provide many advantages over the prior art. Embodiments include a system and method of predicting characteristics of a user. Such a system or method may receive the current online session browsing history of a user. The browsing history identifies websites visited by the user during the current online session, and the identified websites include known websites and unknown websites. The system or method may further retrieve known demographic attributes data of the known websites included in the identified websites, determine features of web-pages of the unknown websites, and apply prediction models to the determined features of the unknown websites to predict values of unknown demographic attributes of the unknown websites.
Embodiments further include a system and method of determining a combination of websites to obtain a target demographic mix. The system or method may determine features of web-pages of target websites, apply prediction models to the determined features of the target websites to predict values of demographic attributes of the target websites, receive one or more inputs including a target demographic mix, receive a number that indicates an amount of visitors of the target websites, and determine a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the target websites.
Embodiments further include a system and method of determining a set of keywords to buy to obtain a target demographic mix. The system or method may receive one or more inputs including a target demographic mix, identify one or more sets of website combinations to reach the target demographic mix, and analyze a subset of web-pages of the one or more sets of website combinations to determine a set of terms that occur in the web-pages. The identifying identifies the one or more sets of website combinations using extracted features of web-pages to predict demographic attribute values.
Likewise, embodiments include a system and method of selecting websites from market research. The system or method may receive one or more inputs including a target demographic mix, determine features of web-pages of target websites, apply prediction models to the determined features of the target websites to predict values of demographic attributes of the target websites, receive a number that indicates an amount of visitors of the target websites, and determine a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the target websites.
Additionally, embodiments include a system and method of designing websites to appeal to an audience with desired demographic characteristics. The system or method may include receiving a set of desired values for one or more demographic attributes and determining a correlation between one or more features of web-pages and a set of training websites that have the desired values for the one or more demographic attributes. The determining includes identifying a combination of features that would result in a prediction of the desired values, and the method designs a website to include the identified combination of features.
The detailed description will refer to the following drawings, wherein like numerals refer to like elements, and wherein:
Described herein are systems for and methods of making content-based demographic predictions for websites. Embodiments predict demographic attributes of websites based solely on the content of the websites. As used herein in embodiments of the systems and methods, a website's content may include many features that may be extracted from the website's web-pages, including the textual features of the website's web-pages, the structural features of the website's web-pages, the type and category of the website, the intra- and inter-web-page and website linkage structure, the features of web-page(s) and website(s) linked to by the website's web-pages, the hyper-text markup language (“HTML”) of the website's web-pages, the HTML of a subset of the web-pages that link to the website's web-pages (in-links) (both from the same website or different websites), and the HTML of the web-pages that are linked by the website's web-pages (out-links). The predicted demographic attributes of a website are the expected demographic attributes of the users of the website, typically expressed as a percentage of users that have a particular demographic characteristic or fall within a particular demographic (e.g., a prediction that 55% of a website's users will be male, 45% female). Note, throughout this application, the persons that visit a website or a web-page are referred to as users, visitors, people, persons, audience, and in other manners. It is to be understood that these terms are used interchangeably and should be understood to mean the persons, whether individually or collectively and in the broadest sense, that access one or more web-pages of a website or a specific web-page, e.g., via navigating to the URL of the web-page(s) on an Internet browser on a computer, mobile device, etc.
Embodiments avoid disadvantages of the prior art, including without limitation the prior art disadvantages of relying on or requiring the use of data obtained directly or indirectly from the visitors of a website to predict the demographic attributes of the website. Instead, the systems and methods predict the demographic attributes of websites using only the content of the web-pages of the websites and without using the browsing behavior or browsing history of the websites' visitors or the visitors' click-through data.
Any demographic attribute may be predicted using embodiments of the systems and methods described herein. Gender, age distribution, income distribution, nationality, language, etc., are all examples of demographic attributes that may be predicted. Even though the systems and methods described herein may be used to predict a wide range of demographic attributes, examples provided herein focus on methods to predict the gender and age distribution of a website's audience/users. As used herein, the gender attribute specifies the male and female percentages of a website's audience, whereas the age attribute provides a break-down of a website's audience in different age groups. Table 1 below shows the five age groups that are used in the examples provided herein.
Embodiments of the system and method for predicting a website's demographic attributes follow a supervised learning framework. Within this framework, a set of websites with known demographic attributes is used as a training set, a set of features for these websites or for a subset of the web-pages therein is extracted, and a model is learned or developed to predict the demographic attributes of a website based on these features. Features for the training websites and for the websites or web-pages whose demographic attributes are being predicted (target websites or web-pages) are extracted from the content of the web-pages of these websites. The prediction model is applied to the extracted features, in effect comparing features of target websites or web-pages to features from training websites and predicting demographic attributes based thereon.
A key characteristic of the underlying prediction problem is that demographic attributes that need to be predicted are probability distributions that take a discrete set of values. This is different from most traditional value estimation problems that focus on building models to estimate a single value. Note that for those demographic attributes that take only two values (e.g., gender), the distribution prediction problem can be transformed to a single-value prediction problem, by predicting only one of the two values and estimating the other from that prediction. For example, if x % is the percentage of a website's audience that is male, then the percentage of the female audience can be estimated as (100-x) %.
Embodiments of the system and method may perform two overall activities. First, embodiments may use standard regression-based techniques to estimate each discrete value of a demographic attribute by treating the prediction problem as an independent single-value estimation problem. In embodiments, prediction models may be generated, e.g., using regression-based techniques, and then predictions generated by inputting target website content features into the prediction models. The prediction models may be generated using various techniques, including without limitation support vector regression, linear regression, logistic regression, non-linear regression, nonparametric regression, probabilistic estimations and Markov random fields.
Second, embodiments may use these individual predictions as input to a second learning problem whose goal is to estimate the overall distribution of the demographic attribute. In these embodiments, the individual models may be estimated using regression-based techniques (e.g., support vector regression), whereas the individual estimations may be coupled using an approach that is designed to predict a multi-dimensional vector such as the matrix approximation, as described below.
In embodiments described herein, the prediction models are generated using support vector regression (SVR). Support vector machines (SVMs) are an implementation of SVR that may be used to generate prediction models. A specific implementation of SVM, known as “SVMlight,” may be used to generate the prediction models and predict the demographic attribute values in embodiments. Such an implementation is described in, e.g., http://svmlight.joachims.org/. See also, e.g., Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features, In Proceedings of the 10th European Conference on Machine Learning (ECML), Chemnitz, Germany, 137-142 (1998), which is incorporated by reference herein, for a general description of SVM.
With reference now to
Method 10 identifies and selects websites with known demographic attributes, block 102. The websites with known demographic attributes may be thought of or referred to as training websites. The content of a subset of the web-pages from these training websites, and the training websites' demographic attributes data, are used by method 10 to develop the prediction model. The training websites may be identified based on input from a user, automated analysis of a set of websites with known demographic attributes received from commercial providers of such data for websites (including, for example, without limitation, Nielsen Online (see http://en-us.nielsen.com/tab/product_families/nielsen_netratings), Alexa (see www.alexa.com/topsites), Quantcast (see http://www.quantcast.com) and Comscore (see http://www.comscore.com)), a combination of these or other manners. The training websites for use in method 10 may be selected based on various factors, such as size of a website's audience, the website's gender or age (or other demographic) distribution (e.g., to attempt to achieve a balance of training websites), the reliability, if known, of demographic attributes data, etc. For example, method 10 may select a group of 450 websites, with a balanced distribution of gender and age demographics, of the top 2000 most visited websites provided by a commercial provider or providers of website demographic attributes data.
If not already gathered or obtained as part of selecting 102 the training websites (e.g., if training websites were selected at least in part based on actual demographic distributions, then such demographic attributes may have been fetched as part of selecting 102), the demographic attributes data of the identified and selected training websites is gathered or obtained, block 104. The demographic attributes data may be gathered or obtained from various sources; for example, the demographic attributes data may be obtained from commercial providers of demographic data for websites such as Nielsen Online, Alexa, Quantcast, and Comscore. The demographic attributes data may include data for just one demographic attribute, such as age, or for a plurality of demographic attributes.
Method 10 determines features of web-pages of the training websites from the content of the web-pages, block 106. In embodiments, determining 106 may obtain the content of the web-pages and then extract features from the obtained content. Different features may be extracted from the same content. As noted above, a website's content may encompass many features that may be extracted from the web-pages of the website.
In embodiments, web-page features include a plurality of types of web-page features that may be extracted from the training website web-pages. The determining 106 may determine features from all or a subset of the web-pages of the training websites. One such feature captures a web-page's textual content (e.g., terms), while another feature captures the web-page's structure (e.g., organization, style, sections (e.g., forums, FAQs, etc.)). Other features that may be determined include the type and category of the website (e.g., corporate site, entertainment site, issue site, shopping site, social networking site, blogging site, health site, etc.), the intra- and inter-web-page and website linkage structure (e.g., links between web-page sections, links between the web-page and another web-page for the same website, links between the web-page and a web-page of another website), the features of the web-page(s) and website(s) linked to by the web-page, the HTML of the website's web-pages, the HTML of a subset of the web-pages that link to the website's web-pages (in-links) (both from the same website or different websites), and the HTML of the web-pages that are linked by the website's web-pages (out-links). In embodiments, a web-graph including the training websites and websites with web-pages in-linked or out-linked from training website web-pages may be generated. A web graph is a set of vertices u and v and edges, the vertices corresponding to websites and one of the edges being a directed edge (u, v) between two websites if there are web-pages in the website corresponding to vertex u that link to web-pages in the website corresponding to vertex v. Websites that are linked to by training websites, or which link to training websites, may be referred to as web-graph neighbors of the training websites. 
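The web-graph definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in this application: the helper names (`site_of`, `build_web_graph`, `web_graph_neighbors`) are hypothetical, a website is identified simply by the host name of its URLs (an assumption), and links between pages of the same website are skipped because the definition speaks of an edge between two websites.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Map a page URL to its website; using the host name is an assumption."""
    return urlparse(url).netloc.lower()


def build_web_graph(page_links):
    """Return the set of directed website-level edges (u, v).

    page_links is an iterable of (source_page_url, target_page_url) pairs.
    An edge (u, v) exists if any page of website u links to a page of website v.
    """
    edges = set()
    for src, dst in page_links:
        u, v = site_of(src), site_of(dst)
        if u and v and u != v:  # skip links that stay inside one website
            edges.add((u, v))
    return edges


def web_graph_neighbors(edges, training_sites):
    """Websites that link to, or are linked to by, each training website."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u in training_sites:
            neighbors[u].add(v)
        if v in training_sites:
            neighbors[v].add(u)
    return dict(neighbors)


# Illustrative page-level links (made-up URLs).
links = [
    ("http://a.example/index.html", "http://b.example/news.html"),
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://c.example/blog.html", "http://a.example/index.html"),
]
edges = build_web_graph(links)
neighbors = web_graph_neighbors(edges, {"a.example"})
```

Here `a.example` ends up with two web-graph neighbors, one reached through an out-link and one through an in-link.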
In embodiments described herein, these features are determined entirely by analyzing the web-pages themselves and do not rely on any information about the users visiting the corresponding web-pages and websites. This is done by design, as one of the primary features of embodiments of the systems and methods is the accurate prediction of the demographic characteristics of a website's audience without relying on any data that directly or indirectly intrudes on the website's users' private information. The determination 106 of the content is described in more detail below.
The content of the web-pages may be obtained using a web-page crawler, robot or similar feature extraction tool or process, such as, e.g., the Heritrix Crawler (see crawler.archive.org). The same or other tools may extract the desired features from the obtained content. The content that may be obtained by the web-page crawler or other tool may include the in-linking and out-linking features described above. For example, a web-page crawler or other tool may identify links to other web-pages from a training website web-page, follow the links to the linked web-pages, and obtain the HTML from the linked web-pages and extract different features from this HTML. The web-page crawler may also obtain in-linking features in a similar manner.
Using the determined features and obtained demographic attributes, the prediction model is developed, block 108. In embodiments, SVR is used to develop a function ƒ (a prediction) given the following inputs: the determined features of a subset of the web-pages of the training websites, a web-graph that contains both the training websites and other websites that are not part of the training set, and the obtained/gathered demographic attribute(s) data of the training websites. The function ƒ will predict/estimate the demographic attribute(s) of a website (or web-page). During prediction, the function ƒ (the prediction model) will predict/estimate a discrete value demographic attribute of a target website (or web-page) based on the input of the determined content of the target website(s) (or web-page(s)). Developing the prediction model and using it to predict/estimate a demographic attribute are described in more detail below.
With continuing reference to
With reference now to
With reference to
Method 200 may apply a weight to the retrieved terms, block 206. Method 200 may use a standard Term Frequency Inverse Document Frequency (TF-IDF) term weighting scheme that assigns a weight to each term that is linearly related to the term's occurrence frequency in the web-page and inversely related to the number of web-pages in the website on which the term occurs, to weigh 206 the retrieved terms. The TF-IDF term weight is a statistical measure used to evaluate how important a word is to a document (e.g., a web-page) in a collection or corpus (e.g., all of the web-pages in the website). For each term, the term frequency is the number of times that word appears in a web-page, whereas the term's document frequency is the number of web-pages in which the term occurs. The importance increases proportionally to the term's term frequency but is offset by the term's document frequency (i.e., terms that appear in many web-pages become less important). In an embodiment of this method, the size of the document collection (i.e., web-pages) used in the IDF component when determining 106 the content of the training websites is equal to the number of web-pages across the entire set of training websites. In another embodiment, the size of the document collection is normalized so that each website contributes an equal weight to the overall collection. The normalization may be done by assigning a weight to each web-page of the ith website that is 1/ni, where ni is the number of web-pages from the ith website that exist in the collection.
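The two collection-size embodiments described above can be illustrated with a short Python sketch. The description only states that the weight grows linearly with term frequency and shrinks with document frequency, so the logarithmic form tf × log(N / df) used here is the common textbook variant and is an assumption, as are the function name `tfidf_vectors` and the example data.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(pages, normalize_by_site=False):
    """pages: list of (site, list_of_terms). Returns one {term: weight} dict per page.

    With normalize_by_site=True, each page of the i-th website counts 1/n_i
    toward the collection, so every website contributes equal total weight.
    """
    site_sizes = Counter(site for site, _ in pages)
    page_weight = [
        1.0 / site_sizes[site] if normalize_by_site else 1.0 for site, _ in pages
    ]
    collection_size = sum(page_weight)

    # Document frequency: (weighted) number of pages on which each term occurs.
    doc_freq = defaultdict(float)
    for (_, terms), weight in zip(pages, page_weight):
        for term in set(terms):
            doc_freq[term] += weight

    vectors = []
    for _, terms in pages:
        term_freq = Counter(terms)
        vectors.append(
            {
                term: count * math.log(collection_size / doc_freq[term])
                for term, count in term_freq.items()
            }
        )
    return vectors


# Illustrative data: two pages from one website, one page from another.
pages = [
    ("a.example", ["toys", "games", "toys"]),
    ("a.example", ["toys", "news"]),
    ("b.example", ["news", "finance"]),
]
plain = tfidf_vectors(pages)
per_site = tfidf_vectors(pages, normalize_by_site=True)
```

In the normalized variant the larger website no longer dominates the document-frequency counts, which is the stated purpose of the 1/ni page weights.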
With continuing reference to
A challenge associated with extracting the textual features of modern web-pages is that in addition to the portions of the web-pages that contain information specific to those web-pages, web-pages also contain additional information that is irrelevant to the information that they provide. Such examples include but are not limited to headers, footers, navigation panels, and advertisements. Quite often, the portion of a web-page's text and HTML elements that is directly related to the web-page's specific information is much smaller than that occupied by the irrelevant portions. To address this problem, embodiments of method 200 of determining the web-page textual features may identify a web-page's specific information by collectively analyzing the entire set of web-pages that were obtained from the same website, determining the irrelevant information or form HTML elements and removing the irrelevant information or form HTML elements from consideration.
With reference now to
A graphical representation of the DOM of the example table is:
After constructing 222 a DOM tree, similar to the DOM tree shown above, for all the web-pages on the website, the method may analyze the DOM trees, block 224, and eliminate all the paths from the leaves to the root of the DOM tree that occur in some defined number (e.g., at least ten (10)) or percentage (e.g., five percent (5%)) of the DOM trees (i.e., in the defined number of web-pages on the website), block 226. The motivation behind this approach is that elements of each web-page that are common across different web-pages will correspond to non-web-page specific content, such as web-page template terms, and, therefore, may be eliminated. By eliminating paths from the leaves to the root of the DOM tree, such text that is common and not web-page specific text may be eliminated. A sufficiently high defined number or percentage is used to avoid inadvertently eliminating relevant terms. The text associated with the leaf nodes of a web-page's DOM tree that are not pruned, and the terms within that text, may then be used to generate 204 the term vector T of the web-page's vector-space representation, as illustrated in
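The pruning step can be sketched with Python's standard `html.parser` module. This is a simplified illustration under stated assumptions: a "path" is taken to be the chain of tag names from the root to a text leaf together with the leaf's text (so that identical boilerplate, not merely identical markup, is what gets removed), the names `LeafPathParser`, `leaf_paths` and `prune_common_paths` are hypothetical, and only the count threshold (not the percentage variant) is shown.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags without a closing tag; they never become part of a path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LeafPathParser(HTMLParser):
    """Collect (root-to-leaf tag path, text) pairs for every text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.leaves.append((tuple(self.stack), text))


def leaf_paths(html):
    parser = LeafPathParser()
    parser.feed(html)
    parser.close()
    return parser.leaves


def prune_common_paths(pages_html, min_pages=10):
    """Drop leaf paths that occur on at least min_pages pages of the website."""
    per_page = [leaf_paths(html) for html in pages_html]
    # Count each distinct path once per page.
    counts = Counter(leaf for leaves in per_page for leaf in set(leaves))
    return [
        [leaf for leaf in leaves if counts[leaf] < min_pages] for leaves in per_page
    ]


# Three illustrative pages sharing a navigation link.
site_pages = [
    "<html><body><div><a>Home</a></div><p>Toy reviews for kids</p></body></html>",
    "<html><body><div><a>Home</a></div><p>Retirement planning tips</p></body></html>",
    "<html><body><div><a>Home</a></div><p>College football scores</p></body></html>",
]
kept = prune_common_paths(site_pages, min_pages=3)
```

With the threshold lowered to three pages for this tiny example, the shared "Home" navigation link is eliminated and only the page-specific paragraph text remains as input for the term vector.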
In addition to the above web-page-specific textual content, embodiments of method 200 of determining web-page features may also use the semi-structured nature of HTML documents to emphasize terms that occur in certain HTML tags on the web-pages. With reference now to
In addition to determining textual features of web-pages, determining web-page features according to embodiments of systems and methods of making content-based demographic predictions for websites may also include determining structural features of the web-pages. Specifically, systems and methods of making content-based demographic predictions for websites may also extract features from web-pages of a website that capture the web-page structure by focusing, among others, on characteristics that relate to the web-page's style and organization. In embodiments, the structure of each web-page may be measured in terms of the web-page's visual appearance. The visual appearance of a web-page greatly influences the way a user interacts with the web-page and the type of users that the web-page attracts. As a result, the existence of certain structural elements can provide valuable clues as to the demographics of a web-page's users (or its intended users).
Accordingly, with reference now to
Embodiments of the method of determining the structural features of a web-page extract these features from the entire web-page and not only from the portions of the web-page that were used to derive textual features (e.g., see
Other content features of a web-page may also be determined by embodiments described herein. For example, embodiments may determine the type and/or category of a web-page. Such information may be determined from third-party services that categorize websites or web-pages, from metadata, web-page title or other textual features on the web-page or through other techniques known to those of skill in the art. In addition, embodiments may determine additional features for a web-page by analyzing the content of the web-pages that link-to that web-page or the web-pages that are being linked-by the web-page. These features may be textual features extracted by using method 200 of
With reference now to
With reference now to
Developing 108 a prediction model, as seen in
The prediction model for the determined discrete demographic attribute value may be developed 306 using a regression approach, such as SVR, to estimate a regression model for the determined discrete demographic attribute value, as described above. The regression model may be estimated based on the content representations of the training websites (e.g., THS feature vectors or a subset thereof (e.g., only the T, H, or S vectors, or a combination of two of these vectors)). Likewise, embodiments of the systems and methods of making content-based demographic predictions may build 306 the prediction models using as training instances the training websites or the individual web-pages of the training websites. These two types of models will be referred to as website-level models and web-page-level models, respectively.
With reference now to
Embodiments that develop 306 website-level prediction models may compute the feature representation for a training website by combining the feature representations gathered, e.g., by embodiments described with reference to
For training websites, known demographic attribute(s) data for each website may be appended or otherwise linked to the feature vector Si for the website, block 358. For example, the attributes data for each training website may be placed in a vector and a matrix or table is generated with all of the attributes data vectors for the training websites, where the ith row (or column) of the table includes the demographic attributes for the ith website in the training set. In yet another embodiment, the prediction model may be only generated for one demographic attribute; accordingly, only the demographic attribute data for the one demographic attribute for which a training model is being developed may be appended or linked 358 to the feature vector for the training website.
It is also noted that certain training websites may be determined or thought to be more or less relevant for the prediction model. For example, a training website may be determined or thought to be more or less relevant because of the size of the training website's audience. In other words, a larger training website audience may make the website's content and demographic attribute data more relevant for the prediction model. Likewise, a training website may be considered more or less relevant to the determination of a specific demographic attribute. Moreover, a training website may be considered more or less relevant based on the number of other training websites or web-pages that link to it. Accordingly, the feature representation (e.g., feature vector Si) of a training website may be assigned a weight based on its relevancy, block 360. The weighting may affect how much the training website feature vector impacts the prediction model and may, therefore, e.g., be input into the regression process.
To complete building the website-level prediction model, the preceding blocks 352 to 360 may be repeated for each training website, block 362. Once the feature vectors Si for each training website have been generated, the prediction model for the discrete value of the demographic attribute may be developed using SVR with the feature vectors Si, the web graph and the linked 358 demographic attribute data, block 364. As noted above, the prediction model may be developed 364 using the SVMlight implementation of SVR.
Embodiments that develop 306 web-page-level prediction models may use the features extracted from a subset of the web-pages of each training website as the training instances of these web-page level models. During training, the value of the training website's demographic attribute under consideration is used as the value for that attribute for all of its web-pages (i.e., all web-pages are assigned the same value). Accordingly, in these embodiments, the feature vector of each web-page is linked to the row of the table of demographic attributes data of the corresponding website. Then, the prediction model is generated using the feature vector for each web-page and the linked demographic attribute data for that web-page. For example, for the prediction model used to predict the percentage of users that are kids (ages 3-12, Table 1), the value of the target variable for the SVR model for all the web-pages of a certain website will be the percentage of users that are kids for that website. During prediction, the SVR models are used to estimate the values for the different demographic attributes for all the web-pages of a website. These web-page-level predictions are then combined to obtain the prediction at the website level. For example, the percentage of users that are kids may be determined by averaging the corresponding prediction for all the web-pages of a website. Embodiments may also use information about the web-pages from the training website and/or other websites that link to the various web-pages of the target website in determining how the web-page-based predictions may be combined. In these embodiments, predictions of web-pages that are linked to by a larger number of other web-pages will be given a higher weight than other linked web-pages. For example, if {p1, . . . pk} are the predictions for the k web-pages of a website and ni is the number of in-links of the ith page, then the prediction p for the website may be given by
where δ is a constant to account for small sample sizes (e.g., it can be set to a small percentage of the number of training websites).
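By way of illustration only, the two combination strategies described above (plain averaging and in-link weighting) may be sketched as follows. The weighted form shown, a weighted mean in which the weight of the ith web-page is ni + δ, is an assumption adopted for this sketch and is not asserted to be the exact formula of any embodiment; the function names are likewise hypothetical.

```python
from typing import Sequence


def average_prediction(page_predictions: Sequence[float]) -> float:
    """Plain average of the web-page-level predictions of one website."""
    if not page_predictions:
        raise ValueError("at least one web-page prediction is required")
    return sum(page_predictions) / len(page_predictions)


def inlink_weighted_prediction(
    page_predictions: Sequence[float],
    inlink_counts: Sequence[int],
    delta: float = 1.0,
) -> float:
    """ASSUMED form: weighted mean with weight (n_i + delta) for page i.

    delta > 0 keeps pages with no in-links from being ignored entirely and
    dampens the effect of very small in-link counts.
    """
    if not page_predictions:
        raise ValueError("at least one web-page prediction is required")
    if len(page_predictions) != len(inlink_counts):
        raise ValueError("one in-link count is required per prediction")
    if delta <= 0:
        raise ValueError("delta must be positive")
    weights = [n + delta for n in inlink_counts]
    weighted_sum = sum(w * p for w, p in zip(weights, page_predictions))
    return weighted_sum / sum(weights)
```

Under this assumed form, a web-page with many in-links pulls the website-level prediction toward its own prediction, while a larger δ moves the result back toward the plain average.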
Embodiments of method 300 may implement a cascade learning system or similar learning system (see discussion of cascading classifiers below), to develop and refine the prediction model. Accordingly, in such embodiments, prediction models developed 306 as described above may be referred to as first-level models which may be further refined into second-level prediction models. With reference again to
These predictions p may be used to build a second-level prediction model, block 310. Building 310 the second-level prediction model may be based upon and be similar to approaches used to build cascading classifiers that are used extensively in bioinformatics. See, e.g., George Karypis, YASSPP: Better Kernels and Coding Schemes Lead to Improvements in Protein Secondary Structure Prediction, In Journal of Proteins, August 2006, Volume 64-3, pages 575-586, and Huzefa Rangwala and George Karypis, Building Multiclass Classifiers for Remote Homology Detection and Fold Recognition, In Journal of BMC Bioinformatics, 2006, vol. 7, page 455, which are hereby incorporated by reference. The second-level model may be built 310 in a similar manner as described above with reference to the developing 306. To build 310 the second-level prediction model, the following inputs, among others, may be used: the various features (e.g., as represented, for example, in feature vectors THS) used to develop the first-level model, the predicted discrete value p of the demographic attribute for each (or a subset) of the web-graph neighbors, and the known discrete demographic attribute value for each (or a subset) of the training websites. In other words, the second-level prediction model may be built from the same input used to develop the first-level model plus the predictions p generated 308 as described above, in effect utilizing a feedback loop to refine the first-level prediction model. Embodiments may repeat this feedback loop to further refine the prediction model. The second-level prediction model, by incorporating predicted demographic attribute value information from the neighboring websites, relies on the principle of homophily, as websites that cater to similar audiences will tend to be connected to each other.
Embodiments for building this second-level model may use regression-based techniques (e.g., SVR), relational estimation methods (e.g., graphical models, relational Markov networks, Markov random fields, relaxation labeling, iterative estimation), and others.
With continuing reference to
Method 300 produces a prediction model for each discrete value of a demographic attribute with more than two discrete values (e.g., age). When content features of web-pages of a target website (or content features of a target web-page) are input into each prediction model, the prediction model estimates a probability for the discrete value of the demographic attribute. In other words, the prediction model estimates the probability that a target website (or target web-page) visitor has the discrete value for the demographic attribute (e.g., probability that the visitor is a teenager—has an age that fits within the teenager discrete value (see Table 1)).
With continuing reference to
With reference now to
With reference again to
Once the prediction model (e.g., the first-level prediction model, second-level prediction model and prediction distribution model) is developed and acceptable, embodiments of the systems and methods of making content-based demographic predictions for websites may identify 112 the target website(s) (or web-page(s)) for prediction, obtain 114 the content of the target website(s) (or web-page(s)), and predict 116 the demographic(s) of the target website(s) (or web-page(s)). As noted above, obtaining 114 the content of a target website may be performed in accordance with the method 200 described in
With reference now to
Method 400 may apply the second-level prediction model to compute a prediction of the determined discrete demographic attribute value for each target website, block 406. The prediction p of the determined discrete demographic attribute value and the feature representations of the extracted features of the target website(s) are input into the second-level prediction model to compute a refined prediction p for the target website(s).
For demographic attributes with two discrete values (e.g., gender=male or female), method 400 may compute the value for the other discrete value, block 407, as described below. For demographic attributes with more than two discrete values (e.g., age=kid, teenager, young adult, adult, old), method 400 may repeat blocks 404-406 using prediction models for each of the other discrete values to output a prediction p for the remaining discrete demographic attribute values, block 408.
With continuing reference to
With continuing reference to
Note that, in embodiments, the above approach is only used for demographic attributes that take more than two values (e.g., the age demographic attribute). For variables that take only two values (e.g., the gender demographic attribute), the systems and methods described herein may train only a single SVR model that is designed to predict one of those values. If p1 is the prediction obtained by that model, then when 0≦p1≦1, the prediction for the other discrete value is p2=1−p1. When p1<0, {p1, p2}={0, 1} and when p1>1, {p1, p2}={1, 0}. Consequently, computing 407 p2 for the second discrete value may simply be performed by subtracting the first value prediction, p1, from 1.
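By way of illustration only, the two-value rule described above may be sketched as follows. The function name is hypothetical; the clamping and the subtraction from 1 follow the rule as stated.

```python
from typing import Tuple


def binary_distribution(p1: float) -> Tuple[float, float]:
    """Return (p1, p2) for a two-valued demographic attribute.

    The raw regression output p1 is clamped to [0, 1], so that p1 < 0
    yields (0, 1) and p1 > 1 yields (1, 0); otherwise p2 = 1 - p1.
    """
    p1 = min(1.0, max(0.0, p1))
    return p1, 1.0 - p1
```

For example, a raw SVR output of 1.3 for the "male" value would be reported as 100% male and 0% female under this rule.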
In embodiments, method 400 may output a mix of demographic attribute predictions. In other words, method 400 may output predictions for a plurality of different demographic attributes. Consequently, method 400 may repeat blocks 404-412 for additional demographic attributes to output a mix of demographic attribute predictions, block 414. Moreover, the prediction model and the predictions may be achieved at the web-page level, as described above. Accordingly, method 400 may repeat blocks 402-414 for each web-page of the target website, block 416. Method 400 may also include combining the web-page level predictions to produce target website predictions.
With reference now to
System 500 includes website identifier 502, feature extractor 504, prediction modeler 506, and audience demographic estimator 508. Website identifier 502 may identify and select training websites. Website identifier 502 may identify and select training websites as described above with reference to identifying 102 in
With continuing reference to
The following describes an experimental evaluation of embodiments of systems for and methods of making content-based demographic predictions for websites.
Training Website Data Set. The set of training websites was identified as follows. First, the top 2000 websites from Alexa's one million most visited domains were selected, and demographic information about their visitors as it relates to gender and age was obtained from Quantcast, which is a commercial provider of website demographic data. A subset of 450 websites was selected from that list so that the selected training websites would provide a balanced coverage of the age and gender distribution. For gender, this was done by dividing the male distribution into 10 equal sized buckets and picking an equal random sample from each bucket. For age, websites were sorted based on each of the age groups and an equal number of top sites were picked from each group. This list of 450 websites was then crawled using the open source Heritrix crawler, and a maximum of 1000 web-pages were fetched from each website in a breadth-first fashion. The set of crawled pages was subsequently pruned to eliminate web-pages with fewer than 100 words. Furthermore, any websites with fewer than 50 web-pages remaining were also eliminated from the set of training websites. Note that a website can have a small number of web-pages either because the crawler failed to fetch pages (e.g., pages generated by scripts that the crawler could not handle) or because the web-pages fetched contained a small number of words. These steps reduced the total number of websites to 128, which is the set of training websites used in the evaluation.
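By way of illustration only, the two pruning steps described above may be sketched as follows. The function name, the input layout (a mapping from website to the text of its crawled web-pages) and the use of whitespace splitting to count words are assumptions made for this sketch; only the thresholds of 100 words and 50 web-pages are taken from the description.

```python
from typing import Dict, List

MIN_WORDS_PER_PAGE = 100   # web-pages with fewer words are dropped
MIN_PAGES_PER_SITE = 50    # websites with fewer remaining pages are dropped


def prune_crawl(
    crawl: Dict[str, List[str]],
    min_words: int = MIN_WORDS_PER_PAGE,
    min_pages: int = MIN_PAGES_PER_SITE,
) -> Dict[str, List[str]]:
    """Drop short web-pages, then drop websites left with too few pages.

    `crawl` maps a website name to the text of each crawled web-page.
    Word counting by whitespace splitting is an assumption of this sketch.
    """
    pruned: Dict[str, List[str]] = {}
    for site, pages in crawl.items():
        kept = [text for text in pages if len(text.split()) >= min_words]
        if len(kept) >= min_pages:
            pruned[site] = kept
    return pruned
```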
Evaluation Methodology. For all evaluations, the training website data set was divided into five folds at the website level and a five-fold cross validation was performed. This website level partitioning ensures that the web-pages from a given website are never in both the training and the test sets.
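By way of illustration only, a website-level five-fold partition of the kind described above may be sketched as follows. The function names, the random shuffling and the fixed seed are assumptions of this sketch; the point illustrated is that folds are formed over websites, so that every web-page of a website falls on the same side of each split.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def website_level_folds(
    websites: Sequence[str], n_folds: int = 5, seed: int = 0
) -> List[List[str]]:
    """Shuffle the websites and deal them into n_folds disjoint folds."""
    shuffled = list(websites)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def train_test_splits(
    folds: List[List[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training websites, test websites) with each fold held out once."""
    for held_out, test_sites in enumerate(folds):
        train_sites = [
            site
            for index, fold in enumerate(folds)
            if index != held_out
            for site in fold
        ]
        yield train_sites, test_sites
```

Web-pages would then be assigned to the training or test set according to the website they belong to, rather than being shuffled individually.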
For the distribution prediction approaches based on the pseudo-inverse method (see above), matrix W was estimated from P by using a cross-validation approach during training. Specifically, the training set was itself split into five folds, and each subset of four of these folds was used to estimate an SVR prediction model and predict the left-out fold. The resulting set of predictions formed matrix P and was used to estimate W. During the actual prediction, a domain was then predicted using the five different SVR models that were estimated during the within-training five-fold cross-validation, the predictions of the five SVR models were averaged, and then matrix W was used to predict the final distribution.
An SVMlight implementation of SVR was used to perform the learning (generation of the prediction model) and prediction. The prediction model generation was performed using a linear kernel function. For the models that were trained on individual web-pages (see above), in order to ensure that each domain contributed equally during training, a mis-prediction weight of 1/ni was assigned to the individual web-pages of the ith domain, where ni is the number of web-pages of that domain. These weights ensured that the sum of the weights of the training instances for each domain was the same. The width of the regression tube in the SVR (w parameter in SVMlight) was set to 0.025, which was determined after performing a limited set of experiments using different values of w from the set {0.05, 0.025, 0.0125, 0.00625}.
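By way of illustration only, the per-page mis-prediction weights described above may be computed as sketched below. The function name and the input layout (one domain label per training web-page) are assumptions of this sketch; how the weights are passed to the learner is not shown.

```python
from collections import Counter
from typing import List, Sequence


def page_weights(page_domains: Sequence[str]) -> List[float]:
    """Return a weight of 1/n_i for every web-page of the i-th domain.

    `page_domains` lists the domain of each training web-page, so that the
    weights of the pages of any one domain sum to 1.
    """
    pages_per_domain = Counter(page_domains)
    return [1.0 / pages_per_domain[domain] for domain in page_domains]
```

With these weights, a domain contributing 1000 web-pages and a domain contributing 50 web-pages carry the same total weight during training.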
Evaluation Metrics. The evaluation used two different metrics to measure the performance of the predictions computed by the different methods (see below). The first measured the accuracy of the overall predicted discrete distribution, whereas the second measured the accuracy of the individual values of the discrete distribution. The accuracy of the distribution was measured using the root mean squared error (RMSE). The accuracy of the prediction for a specific value of a discrete distribution was measured using absolute error (AE). For all these metrics, the reported results corresponded to the averages over all the websites across the five-fold cross validation.
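By way of illustration only, the two metrics may be sketched as follows for a single website. The function names are hypothetical, and computing the RMSE over the discrete values of one predicted distribution (and then averaging over websites) is an assumption of this sketch about how the metric was applied.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two discrete distributions."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("distributions must be non-empty and equally long")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def absolute_error(predicted: float, actual: float) -> float:
    """Absolute error for one value of a discrete distribution."""
    return abs(predicted - actual)


def mean(values: Sequence[float]) -> float:
    """Average of per-website metric values, as used for reporting."""
    return sum(values) / len(values)
```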
Baseline Predictions. In order to get a better sense of the quality of the prediction results produced by embodiments described herein, another approach was evaluated in which the prediction for each variable (each demographic attribute) was computed as the average of the corresponding values in the training set. For example, the percentage of users that belong to the teen group (Table 1) was obtained by computing the average percentage of users that belong to the teen group in the training set. The same 5-fold cross-validation approach used in the evaluation of the prediction models, as described above, was used to split the data set into training and test groups in order to obtain the predictions of this averaging model. This is referred to below as the baseline model.
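By way of illustration only, the averaging baseline may be sketched as follows. The function name and the representation of each training website as a list of values (one per discrete value of the attribute) are assumptions of this sketch.

```python
from typing import List, Sequence


def baseline_prediction(
    training_distributions: Sequence[Sequence[float]],
) -> List[float]:
    """Predict, for every test website, the per-value training average.

    Each inner sequence holds the known distribution of one training
    website, e.g. the fraction of visitors in each age group.
    """
    if not training_distributions:
        raise ValueError("at least one training website is required")
    n_sites = len(training_distributions)
    return [sum(column) / n_sites for column in zip(*training_distributions)]
```

The same vector is returned for every test website in a fold, so any improvement over this baseline reflects information gained from the website's content.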
Results. In this section, the results of the experimental evaluation of embodiments for predicting the gender and age distributions of a website's audience are presented. Performance of Different Features. Table 3 below shows the performance achieved by embodiments described herein for the gender and age prediction tasks for some of the features described above. Specifically, this table shows the average RMSE achieved by the T, TH, and THS features for both the web-page and website level models. Table 3 also shows the average RMSE values obtained by the baseline model described above.
Overall, the actual prediction error (as measured by the average RMSE) is quite low. For the gender prediction problem, the best average RMSE value is 0.089, whereas for the age prediction problem, the best average RMSE value is 0.116. Moreover, these RMSEs are considerably lower than the corresponding values of 0.165 and 0.141 that were obtained by the baseline model. These results suggest that a website's content provides strong information for predicting the demographic attributes of the website and that the overall prediction error between the two tasks is both low and not significantly different. This is in contrast to the results obtained by earlier studies, see Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen, Demographic Prediction Based on User's Browsing Behavior, Proceedings of the 16th international conference on World Wide Web, May 8-12, 2007, Banff, Alberta, Canada, in which it was observed that predicting the age distribution of a web-page's visitors is considerably harder than predicting the gender distribution.
Tables 4 and 5 further analyze the prediction results obtained by the different features for the age prediction task by showing the average AE for each of the five age groups in the dataset. These results were obtained by using the models trained and applied at the page level. These results show that the errors achieved do vary across the age groups, with the “Young Adults” group having the worst AE of 0.138 and the “Kid” group achieving the lowest of 0.027. However, even in the case of the worst performing age group, the actual AE is relatively low.
Performance of Model Granularity. Table 6 below compares the performance of the two different levels of granularity described above (website and web-page) at which the models may be learned or applied in embodiments described herein. Specifically, this table shows the average RMSEs that were obtained by the methods that predict at either the web-page or website levels using models that were trained using either of these two levels.
These results show that for a given prediction granularity level, the best result is achieved by using the model that was trained on the same level of granularity. That is, web-page level predictions perform best for models trained on web-pages, whereas website level predictions perform best for models trained on websites. These results indicate that the two models are intrinsically different, and that the best prediction performance is achieved when the test/target data has the same characteristics as the data used for training.
Comparing the relative performance of the website and web-page level models, it may be seen that for both prediction tasks, the models trained and applied at the website level achieve better results than those achieved by the corresponding web-page models. Moreover, for both prediction tasks, the relative performance advantage of the website level models is quite substantial. These results suggest that representing the web-pages of an entire website as a single training instance better captures the website's overall characteristics, leading to better models and more accurate predictions. Moreover, an additional advantage of this approach over the web-page level models is that it is computationally less expensive for both model learning and prediction.
Determining Anomaly Websites
As described above with reference to
Predicting Characteristics of a User
With reference to
A similar approach is used to compute the distribution of other demographic attributes. As the user continues to browse websites, the prediction may be updated, block 610.
Determining Combination of Websites and Web-Pages to Reach a Target Demographic Mix
Embodiments of the systems and methods described herein may be used to determine a combination of websites (and/or web-pages) and the number of impressions of advertisements that should be used for an advertisement campaign in order to reach a set of users/visitors that have a target demographic mix. A target demographic segment is a set of users/visitors with desired demographic attribute values. For example, a target demographic segment, T1, may be a set of users/visitors that are male, young adult and earning in excess of $150,000 a year. Another target demographic segment, T2, may be a set of users/visitors that are female, adult and with kids. A target demographic mix is the percentage distribution of target demographic segments in an advertisement campaign. For example, a target demographic mix for one advertisement campaign that wants to reach 100,000 users/visitors has 70% of users (70,000) belonging to T1 and 30% of users (30,000) belonging to T2. Embodiments of the systems and methods determine a set of websites and an associated number of impressions of advertisements for each website, such that 70,000 users/visitors belonging to T1 see the advertisement and 30,000 users/visitors belonging to T2 see the advertisement.
With reference now to
Method 700 may receive or otherwise obtain the number of visitors for the ad-carrying target websites, block 706. Likewise, the method 700 may receive a selection or input of secondary objectives, block 708. The method 700 may then determine a combination of websites (or web-pages) that provide the target demographic mix, block 710. The determining 710 may process, e.g., using an optimization method, the predicted demographic attributes of the ad-carrying target websites, the received inputs, and the number of visitors for the ad-carrying target websites to determine the combination of websites (or web-pages) that provide the target demographic mix.
In embodiments, method 700 may determine 710 the combination of websites (or web-pages) that provide the target demographic mix while also meeting or minimizing the secondary objectives. Accordingly, method 700 may utilize or include an optimization method to determine 710 the combination of websites that can be used to achieve the desired mix of demographic attribute values while minimizing or meeting one or more secondary objectives. The secondary objectives may include without limitation the total advertising cost, the total time that is required to reach the audience with the target demographic mix, the number of ads that may be placed, etc. The optimization method may be implemented in a number of ways and can include, but is not limited to, discrete optimization, continuous optimization, exact methods, and heuristic methods such as simulated annealing or genetic algorithms. The determining 710 may produce an optimized list of websites on which an advertiser may place ads for an ad campaign. Method 700 may monitor the results of such an ad campaign, receiving and tracking inputs including the number of visits and any relevant information about the characteristics of the ad campaign audience, and may dynamically re-optimize an initial solution to ensure that the initial constraints are still satisfied while still minimizing or best meeting the secondary set of objectives.
Accordingly, based on the demographic attribute values predicted according to embodiments described herein, specific websites and web-pages can be recommended for an ad campaign to achieve the target demographic mix. Notably, these predictions are made based upon the analysis of the content of the websites, and without the use of data representing specific potential customers, offering a true “user data free” method of targeted ad placement. As such, the prediction for and recommendation of target websites is based purely on content of web-pages, via estimated or gathered audience characteristics (at a group level, but not at a specific user level) of similar, known websites.
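By way of illustration only, one simple heuristic for the determining 710 may be sketched as follows. This greedy procedure is an assumption of this sketch and is not asserted to be the optimization method of any embodiment; the class name, field names, the treatment of the visitor count as an upper bound on impressions, and the use of cost per impression as the sole secondary objective are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

EPS = 1e-9  # tolerance for treating a segment target as met


@dataclass
class Site:
    name: str
    visitors: int                    # assumed cap on impressions
    cost_per_impression: float       # assumed secondary objective
    segment_share: Dict[str, float]  # predicted fraction per target segment


def greedy_site_mix(sites: List[Site], target: Dict[str, float]) -> Dict[str, int]:
    """Greedily assign impressions to sites until every segment target is met.

    `target` maps a segment name (e.g. "T1") to the number of users to reach.
    Returns impressions per site; the plan is partial if capacity runs out.
    """
    remaining = dict(target)
    capacity = {s.name: s.visitors for s in sites}
    plan: Dict[str, int] = {}
    while any(r > EPS for r in remaining.values()):
        best, best_score = None, 0.0
        for s in sites:
            if capacity[s.name] <= 0:
                continue
            # Share of each impression that still counts toward an unmet segment.
            useful = sum(share for seg, share in s.segment_share.items()
                         if remaining.get(seg, 0.0) > EPS)
            score = useful / s.cost_per_impression
            if score > best_score:
                best, best_score = s, score
        if best is None:
            break  # no remaining site reaches any unmet segment
        # Buy just enough impressions to finish the first segment this site can fill.
        needed = min(math.ceil(remaining[seg] / share)
                     for seg, share in best.segment_share.items()
                     if share > 0 and remaining.get(seg, 0.0) > EPS)
        impressions = min(capacity[best.name], needed)
        capacity[best.name] -= impressions
        plan[best.name] = plan.get(best.name, 0) + impressions
        for seg, share in best.segment_share.items():
            if seg in remaining:
                remaining[seg] = max(0.0, remaining[seg] - impressions * share)
    return plan
```

Each pass either fills one segment or exhausts one website, so the loop terminates; an exact or metaheuristic method of the kinds listed above could be substituted where a better cost trade-off is required.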
Keywords to Buy for a Target Demographic Mix
Systems and methods described herein may also be used to determine a set of keywords to bid on in order for a keyword-based online advertising campaign (e.g., similar to the AdSense keyword-based advertising bid and placement provided by Google) to reach a set of website visitors that have a desired target demographic mix, as defined above. With reference now to
Selection of Websites for Market Research Whose Visitors have a Target Demographic Mix
Systems and methods described herein may also be used to select a set of websites whose audience subsets will be targeted for market research purposes. With reference now to
Planning Tool for Ad Networks
Systems and methods described herein may also be used to determine the websites with which ad networks should establish ad placement relations in order to achieve an audience with a desired target demographic mix for a forecasted demand. Embodiments of these methods may take as input the desired target demographic mix of the forecasted demand. Embodiments may then identify multiple sets of website combinations to reach the target demographic mix, using a method similar to that described above (e.g., see
Website Design Tool for Designing Websites that Appeal to an Audience with Desired Demographic Characteristics
Systems and methods described herein may also be used to determine how a website should be designed or re-designed or what new websites should be designed in order to appeal to an audience with a desired set of demographic characteristics. With reference now to
Hardware Implementation
As stated above, the methods described above may each be implemented as one or more computerized systems. The systems and methods may be implemented as computer applications, engines, computer application modules, specific purpose computers, software running on general purpose computers, and various combinations of these and other known manners of implementing computerized methods. Likewise, the methods may be fully or partially computer implemented.
With reference now to
Input device 1108 may include any device for entering information into computer system 1100, such as a keyboard, mouse, cursor-control device, touch-screen, microphone, digital camera, video recorder or camcorder. Display device 1110 may include any type of device for presenting visual information such as, for example, a computer monitor or flat-screen display. Output device 1112 may include any type of device for presenting a hard copy of information, such as a printer, and other types of output devices include speakers or any device for providing information in audio form.
Computer system 1100 may store a database structure in secondary storage 1104, for example, for storing and maintaining information needed or used by the application(s). Also, processor 1106 may execute one or more software applications in order to provide the functions described in this specification, specifically in the methods described above, and the processing may be implemented in software, such as software modules, for execution by computers or other machines. The processing may provide and support web-pages and other GUIs. The GUIs may be used to enter inputs or view outputs of the systems and methods described herein. The GUIs may be formatted, for example, as web-pages in HyperText Markup Language (HTML), Extensible Markup Language (XML) or in any other suitable form for presentation on a display device.
With continuing reference to
Although computer system 1100 is depicted with various components, one skilled in the art will appreciate that the servers can contain additional or different components. In addition, although aspects of an implementation consistent with the above are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on or read from other types of computer program products or computer-readable media, such as secondary storage devices, including hard disks, floppy disks, or CD-ROM; or other forms of RAM or ROM. The computer-readable media may include instructions for controlling a computer system 1100 to perform a particular method, such as the methods described herein.
The terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the spirit and scope of the invention as defined in the following claims, and their equivalents, in which all terms are to be understood in their broadest possible sense unless otherwise indicated.
Claims
1. A method of predicting characteristics of a web user comprising:
- receiving online session browsing history of a user, wherein browsing history identifies websites visited by a user during one or more online sessions, wherein identified websites include known websites and unknown websites;
- retrieving known demographic attributes data of known websites included in identified websites, wherein the known demographic attributes data includes known values of demographic attributes of the known websites;
- determining features of web-pages of unknown websites; and
- applying prediction models to the determined features of the unknown websites to predict values of unknown demographic attributes of the unknown websites.
2. The method of claim 1 further comprising combining the known values of the known demographic attributes and the predicted values of the unknown demographic attributes.
3. The method of claim 2 wherein the combining combines the known values and the predicted values to predict a distribution of demographic attributes of the user.
4. The method of claim 3 further comprising:
- receiving updated browsing history of the user; and
- updating the predicted distribution based on the received updated browsing history of the user.
5. A computer-readable medium comprising instructions stored thereon that may be executed by a computer for performing the method of claim 1.
6. A method of determining a combination of websites to obtain a target demographic mix, comprising:
- determining features of web-pages of target websites;
- applying prediction models to the determined features of the ad-carrying, target websites to predict values of demographic attributes of the target websites;
- receiving one or more inputs including a target demographic mix;
- receiving a number that indicates an amount of visitors of the target websites; and
- determining a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the target websites.
7. The method of claim 6 wherein the determining a combination of websites processes the predicted values of the demographic attributes and number of visitors of the target websites using an optimization method.
8. The method of claim 6 further comprising receiving one or more secondary objectives.
9. The method of claim 8 wherein the one or more secondary objectives include one or more secondary objectives chosen from a list consisting of: total advertising cost, total time that is required to reach audience with the target demographic mix, and a number of ads that may be placed.
10. The method of claim 8 wherein the determining a combination of websites processes the predicted values of the demographic attributes, the number of visitors of the ad-carrying, target websites and the one or more secondary objectives using an optimization method.
11. The method of claim 8 wherein the one or more secondary objectives include: total advertising cost, total time that is required to reach a forecasted target audience, prior partnership information, and competitor information.
12. A computer-readable medium comprising instructions stored thereon that may be executed by a computer for performing the method of claim 6.
13. A method of determining a set of keywords to buy to obtain a target demographic mix comprising:
- receiving one or more inputs including a target demographic mix;
- identifying one or more sets of website combinations to reach the target demographic mix, wherein the identifying identifies the one or more sets of website combinations using extracted features of web-pages to predict demographic attribute values; and
- analyzing a subset of web-pages of the one or more sets of website combinations to determine a set of terms that occur in the web-pages.
14. The method of claim 13 wherein the analyzing utilizes feature selection and optimization methods to determine the set of terms.
15. The method of claim 14 further comprising receiving one or more objectives.
16. The method of claim 15 wherein the feature selection and optimization methods optimize the set of terms to best meet the one or more objectives.
17. The method of claim 15 further comprising weighting scores given to determined terms to give higher importance to one or more objectives.
18. The method of claim 13 wherein the identifying one or more sets of website combinations to reach the target demographic mix comprises:
- determining features of web-pages of target websites;
- applying prediction models to the determined features of the target websites to predict values of demographic attributes of the target websites;
- receiving one or more inputs including a target demographic mix;
- receiving a number that indicates an amount of visitors of the target websites; and
- determining a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the target websites.
19. A computer-readable medium comprising instructions stored thereon that may be executed by a computer for performing the method of claim 13.
20. A method of selecting websites from market research, comprising:
- receiving one or more inputs including a target demographic mix;
- determining features of web-pages of target websites;
- applying prediction models to the determined features of the target websites to predict values of demographic attributes of the target websites;
- receiving a number that indicates an amount of visitors of the target websites; and
- determining a combination of websites that provide target demographic mix based on the predicted values of the demographic attributes and number of visitors of the target websites.
21. The method of claim 20 further comprising:
- monitoring results of a market research campaign; and
- re-optimizing the determined combination of websites based on the monitored results.
22. A computer-readable medium comprising instructions stored thereon that may be executed by a computer for performing the method of claim 20.
23. A method of designing websites to appeal to an audience with desired demographic characteristics, comprising:
- receiving set of desired values for one or more demographic attributes;
- determining a correlation between one or more features of web-pages and set of training websites that have desired values for the one or more demographic attributes, wherein the determining a correlation includes identifying combination of features that would result in a prediction of the desired values.
24. The method of claim 23 further comprising designing a website to include the identified combination of features.
25. The method of claim 23 wherein the determining includes analyzing a prediction model developed using features extracted from a subset of web-pages of the training websites and obtained demographic attributes data of the training websites, wherein the prediction models may be used to predict one or more values for target demographic attributes.
26. A computer-readable medium comprising instructions stored thereon that may be executed by a computer for performing the method of claim 23.
Type: Application
Filed: Dec 21, 2009
Publication Date: Jun 24, 2010
Applicant: nXn Tech, LLC (Bloomington, MN)
Inventors: George Karypis (Minneapolis, MN), Eui-Hong Han (Woodbury, MN)
Application Number: 12/643,904
International Classification: G06Q 10/00 (20060101); G06F 15/173 (20060101); G06Q 30/00 (20060101);