DOCUMENT PROCESSING DEVICE AND DOCUMENT PROCESSING METHOD
An object of the present invention is to provide a document processing device and document processing method that can provide a search result satisfactory to a user with respect to WWW documents in which a number of links among WWW documents is low and a number of accesses by users is low. An access pattern collection unit 101 generates an access user vector uj of one WWW document Dj and an access user vector uje of another document Dje. A user similarity computing unit 105 computes a document similarity sim (uj, uje) which indicates a user similarity between the WWW document Dj and WWW document Dje. A keyword vector smoothing unit 106 acquires a smoothed keyword weight vector w′j by correcting a keyword weight vector wj in one document, using the computed document similarity sim (uj, uje). A rearranging unit 110 calculates an evaluation value B_SCORE for input information for searching, based on the smoothed keyword weight vector w′j.
1. Field of the Invention
The present invention relates to a document processing device and document processing method for searching web data.
2. Related Background Art
Since the mid-1990s, the number of WWW documents published on the Internet has been increasing explosively, and their value in the information industry has been increasing accordingly. A WWW document is located at a logical information storage position on the Internet, called a URL (Uniform Resource Locator), and a structured database is constructed by mutual reference to these URLs. A search service that efficiently searches this structured database and provides [the required information] to a user is critical, and a search engine is regarded as the system that executes this service.
Description about a search engine is made in Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>, specifically that a “search engine is handling information space which is enormous and constantly changing, so it must have the following functions which are different from conventional search technology, and research and development are progressing to implement and advance these functions:
function to efficiently collect information dispersed on the WWW
function to extract keywords from information described freely in an undefined format in HTML, and search this information at high-speed
interface function for each search
function to rank enormous search results efficiently.”
In Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>, the following description is included. This search engine is comprised of such components as a “WWW robot, collected text group, indexer, search index file, search server and browser.” The WWW robot has a function to “(1) collect information” from the world of the Internet web. The collected WWW pages are stored in the collected text group, and “(2) data analysis (pre-processing)” is performed before transferring the data to the indexer. Index files for a full text search or category search are generated in the components of the indexer and search index file, and a basic data base for “(3) search processing” is operated. Information on input and output is exchanged among the search server, client and browser, where many “(4) input/output interfaces” intervene and function.
The user sends a search request to a search server 505 via a web server 506 using a web browser of a terminal 507. The search server 505 performs search processing, referring to the index file 503, and outputs the result to the terminal 507, whereby the terminal can acquire the search result.
By this processing, the user receives an enormous amount of search results, and therefore needs a way to grasp the search results efficiently. Here, prior art on “a function to efficiently rank the enormous amount of search results” will be described. This function is normally implemented by combining conformity and significance. Conformity is a scale that measures a degree of matching the intention of the search, such as whether the word searched by the user is included frequently [in a WWW document], or whether [the WWW document] matches the search history of the user. Significance is a scale that measures the degree to which a WWW document includes beneficial information that is generally read by many individuals.
For example, U.S. Pat. No. 6,112,202 Description and “Technical trends of WWW search engines” by Masanori Harada, Technical Report of IEICE, SSE2000-228, pp. 17-22, 2001 describe HITS, which is one ranking search method that implements both conformity and significance. HITS searches for web pages including a keyword representing a topic, and detects authorities and hubs from the web graph near those searched web pages which have a high conformity. Authority is a scale indicating a web page which is referred to by many hubs in the web graph, and which receives high evaluations. Hub is a scale indicating a web page which serves as a collection of links, referring to many authorities in the web graph. In HITS, the authority score and hub score of each web page in the web graph are calculated by iterated calculation, and web pages are output in the sequence of the authority score. Thereby significant web pages can be searched out of the web page group related to the provided topic.
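The iterated calculation of authority and hub scores described above can be illustrated by the following minimal sketch. The function name `hits`, the dictionary representation of the web graph, the fixed iteration count and the normalization by vector length are assumptions made for illustration, and are not taken from the cited documents.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it links to (assumed format)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages this page links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both score sets so that the values stay bounded (assumed scheme).
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical web graph: h1 and h2 are link pages, a and b are content pages.
graph = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
authority, hub_score = hits(graph)
ranking = sorted(authority, key=authority.get, reverse=True)
```

In this hypothetical graph, page `a` is referred to by both link pages and is therefore output first in the sequence of the authority score.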
The above is calculated during a search, but as a static method for calculating significance of WWW documents, a page ranking method used by Google Inc. in the USA is well known. For example, as U.S. Pat. No. 6,285,999 Description shows, this page ranking method uses a huge link structure of WWW documents.
For example, if WWW document A refers to WWW document B, WWW document A is regarded as supporting the significance of WWW document B. At this time, the support is weighted by the significance of WWW document A. The significance of a WWW document is thus represented by the sum total of the support from the other WWW documents which refer to it, each weighted by the significance of the referring document. In this way, if a large scale calculation is performed recursively, tracking the references of all WWW documents, the significance of each WWW document is determined.
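The recursive calculation of significance from the link structure can be sketched as follows. This is only an illustration of the general idea: the damping factor of 0.85, the even redistribution from documents without outgoing references, and the name `page_rank` are assumptions, and are not taken from U.S. Pat. No. 6,285,999.

```python
def page_rank(links, damping=0.85, iterations=50):
    """links: dict mapping a document to the list of documents it refers to (assumed format)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if targets:
                # A referring document passes its own significance, split evenly,
                # to every document it refers to.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Assumed handling of a document without references: spread evenly.
                for dst in pages:
                    new_rank[dst] += damping * rank[src] / n
        rank = new_rank
    return rank

# Hypothetical references: A and C refer to B, and B refers to A.
web = {"A": ["B"], "C": ["B"], "B": ["A"]}
ranks = page_rank(web)
```

Here document B, which is referred to by two documents, obtains the highest significance, and document A obtains the next highest because it is supported by the significant document B.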
Recently, owing to improved software and browser functions for reading WWW documents, the WWW documents that users access are measured in conjunction with search engines, and this measured popularity is added to the parameters that determine significance.
According to “Beyond Page Rank: Machine Learning for Static Ranking” by Matthew Richardson, Amit Prakash, Eric Brill, Proc. WWW 2006, [online], [searched on Jan. 29, 2008], Internet, <URL: http://www2006.org/programme/files/xhtml/3101/p3101-Richardson.html>, the frequency and time of user accesses (that is, popularity) are added to the page ranking to determine the significance of a WWW document. According to US Patent Application Laid-Open No. 2007/0143345 Description, data on how often [the WWW document] was clicked on, out of the search result during a predetermined period, is used for calculating ranking as a history.
Prior arts on [determining] significance of WWW documents were described above, but a problem is that there are too many choices to present the search result according to conformity. To solve this problem of too many choices, a method of estimating user interest based on browsing history of the user, and rearranging the ranking of the pages listed based on the weight of the characteristics of search history, has been proposed. In “E output interface E-2-(1), output with ranking” reported in Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>, the following is disclosed.
In other words, in order to solve the problem of too many choices, a method of estimating user interest based on browsing history of this user, and rearranging the sequence of pages listed based on the weight of the characteristics of the search history, is proposed. In more concrete terms, it is assumed that a user browsed pages 1, 2, . . . , n following links. Based on the assumption that the interest of the user is higher for the content which was read more recently, weight is increased for the most recently read web page. A weight of a word (weight of index) is determined by adding up the “weight of history” of pages including the target word. This will be described with reference to
After the above browsing, the user inputs a keyword to the search engine, and collects necessary information. An index included in each of the collected pages is detected, and the weights of these indexes are added up, whereby the weight of the page, that is the selection candidate, is calculated. The user can access sequentially from a page having a heavier weight. The same method is also disclosed in Japanese Patent Application Laid-Open Nos. 10-207901 and 2002-32401.
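The calculation of the weight of index and the weight of a candidate page can be sketched as follows. The source does not state how the weight of history grows with recency, so the linear weight i/n for the i-th of n browsed pages is an assumption, as are the function names and the sample data.

```python
from collections import defaultdict

def index_weights(history):
    """history: list of word sets, one per browsed page, oldest page first (assumed format)."""
    n = len(history)
    weights = defaultdict(float)
    for position, words in enumerate(history, start=1):
        history_weight = position / n  # assumed: weight grows linearly with recency
        for word in words:
            # Weight of index: sum of the weights of history of pages containing the word.
            weights[word] += history_weight
    return weights

def page_weight(page_words, weights):
    # Weight of a candidate page: sum of the weights of the indexes it contains.
    return sum(weights.get(word, 0.0) for word in set(page_words))

# Hypothetical browsing history and search candidates.
browsed = [{"baseball", "tokyo"}, {"stocks"}, {"tokyo", "station"}]
weights = index_weights(browsed)
candidates = {"p1": ["tokyo", "station"], "p2": ["baseball", "stocks"]}
ordered = sorted(candidates, key=lambda p: page_weight(candidates[p], weights), reverse=True)
```

With this hypothetical history, page `p1` is presented before page `p2`, because its words appear in the more recently browsed pages.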
In a document search, a search technology using the tf·idf characteristic is under consideration. In this technology, the weight of keyword ti (i=1, . . . , M), which appears in a document set {Dj|j=1, . . . , N} is calculated for each document, and the keyword weight vector wj is expressed by the following Expression (1).
[Expression 1]
w_j = (w_{j1}, w_{j2}, \ldots, w_{jM})^{T} \quad (1)
where T denotes transposition.
Here N denotes a number of search target documents, M denotes a number of keywords in a natural language (e.g. Tokyo, portable phone, baseball, station, economy, stocks, . . . ), and is a very large number.
Each weight can be calculated by the following Expression (2),
[Expression 2]
w_{ji} = tf_{ji} \times idf_{i} \quad (2)
In other words, the weight is given by the product of the term frequency (tf) and the inverse document frequency (idf). Here “term” is a synonym for “keyword”.
A weight wji of a keyword ti, which appears in a document Dj, should be high if [the keyword ti] appears frequently in the document Dj and appears infrequently in other documents. If the keyword ti appears frequently [in document Dj] and also appears frequently in other documents, the weight wji may be low. The tf·idf characteristic is a representation of this heuristic knowledge, and can be defined as shown in the following Expression (3) and Expression (4).
[Expression 3]
tf_{ji} = \mathrm{freq}(i, j) \quad (3)
where freq(i, j) denotes frequency of appearance of the term ti in the document Dj.
where Dfreq(i) denotes a number of documents in which the term ti appears (document frequency), and idfi denotes Dfreq(i) normalized by the total number of documents N. The tf·idf characteristic has many improved versions, but the above mentioned general definition is used here.
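The keyword weight vectors of Expression (1) and Expression (2) can be computed as in the following sketch. Expression (4) is not reproduced in the text above, so the form idf_i = log(N / Dfreq(i)) used here is an assumption (one common general definition); the function name and the sample documents are also hypothetical.

```python
import math

def tf_idf_vectors(documents, keywords):
    """documents: list of word lists; keywords: list of the M keywords t_i (assumed format)."""
    n_docs = len(documents)
    # Dfreq(i): number of documents in which keyword t_i appears.
    doc_freq = [sum(1 for doc in documents if kw in doc) for kw in keywords]
    # Assumed form of Expression (4): idf_i = log(N / Dfreq(i)).
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    # Expression (2) and (3): w_ji = freq(i, j) * idf_i.
    return [[doc.count(kw) * idf[i] for i, kw in enumerate(keywords)] for doc in documents]

docs = [["tokyo", "station", "tokyo"], ["tokyo", "stocks"], ["baseball"]]
terms = ["tokyo", "station", "stocks", "baseball"]
vectors = tf_idf_vectors(docs, terms)
```

In the first document, “tokyo” appears twice but also appears in another document, whereas “station” appears once and only in this document, so the two weights end up comparable rather than proportional to the raw frequencies.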
Now search input is expressed as a search vector q. This is also M-dimensional, and is given by the following Expression (5).
[Expression 5]
q = (q_1, q_2, \ldots, q_M)^{T} \quad (5)
In Expression (5), qi is 1 if the keyword ti is included, and is 0 if not included.
In search processing, the document DX whose similarity to the search vector is the maximum is searched for in the document set. For searching, the cosine distance determined by normalizing the inner product is normally used, as shown in Expression (6) and Expression (7), in order to normalize for the number of words in a document.
Expression (7) itself, however, expresses a degree of similarity, and the cosine distance used as a scale to satisfy the system of axioms of distance is 1−sim(q, wj).
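Expression (6) and Expression (7) are not reproduced in the text above. The following sketch therefore assumes the standard form, in which the similarity is the inner product of the two vectors divided by the product of their lengths; the function names and the sample vectors are hypothetical.

```python
import math

def sim(a, b):
    # Assumed form of Expression (7): inner product normalized by the vector lengths.
    inner = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norm if norm else 0.0

def cosine_distance(a, b):
    # The distance corresponding to the similarity, as stated in the text: 1 - sim.
    return 1.0 - sim(a, b)

def best_match(q, doc_vectors):
    # Index of the document whose similarity to the search vector is the maximum.
    return max(range(len(doc_vectors)), key=lambda j: sim(q, doc_vectors[j]))

q = [1, 0, 1]                      # search vector: keywords t1 and t3 are included
doc_vectors = [[2.0, 0.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 0.0]]
best = best_match(q, doc_vectors)
```

Because of the normalization, the second document is selected even though the first document has a larger raw weight for one of the searched keywords.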
Conventional example 1 is a search system based on the keyword weight vector shown in
This conventional example 1 is for simply outputting a search result, so a conventional example 2, which is a search system using evaluation values given by the following Expression (8) and Expression (9), to evaluate similarity considering user profile, is under consideration as an improvement of conventional example 1. Based on the evaluation values calculated by Expression (8) and Expression (9), display of the searched WWW documents is processed. In other words, the searched WWW documents are displayed in the sequence according to the evaluation values.
[Expression 8]
A\_score(q, w_j; p_k) = \lambda \, \mathrm{sim}(q, w_j) + (1 - \lambda) \, \mathrm{sim}(p_k, w_j), \quad 0 \le \lambda \le 1 \quad (8)
where pk denotes a user profile of a user k.
[Expression 9]
p_k = (p_{k1}, p_{k2}, \ldots, p_{kM})^{T} \quad (9)
As shown above, the user profile of a user k is represented by the keyword weight vector. In this way, the WWW documents, searching word and user profile of a user k can also be represented by similar vectors.
To construct the user profile, the sum of Nw(j) in the WWW documents Dj accessed in the past is determined, as shown in
pk=Σjε
Also as a format to add significance as an evaluation point, a conventional example 3, which is a search system using an evaluation value given by the following Expression (11), is under consideration.
[Expression 11]
B\_score(q, w_j; p_k, s_j) = \lambda \, A\_score(q, w_j; p_k) + (1 - \lambda) \, s_j, \quad 0 \le \lambda \le 1 \quad (11)
where sj (0≦sj≦1) denotes a significance of the WWW document Dj. The value λ may differ from that in Expression (8).
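The evaluation values of Expression (8) and Expression (11) can be written directly as in the following sketch. The similarity function is assumed to be the normalized inner product, and the default value 0.5 for both λ parameters and the sample vectors are hypothetical.

```python
import math

def sim(a, b):
    # Assumed similarity: inner product normalized by the vector lengths.
    inner = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norm if norm else 0.0

def a_score(q, w_j, p_k, lam=0.5):
    # Expression (8): blend of the search similarity and the user profile similarity.
    return lam * sim(q, w_j) + (1.0 - lam) * sim(p_k, w_j)

def b_score(q, w_j, p_k, s_j, lam=0.5, lam_a=0.5):
    # Expression (11): blend of A_score and the significance s_j (0 <= s_j <= 1).
    # lam_a is the lambda of Expression (8), which may differ from lam.
    return lam * a_score(q, w_j, p_k, lam_a) + (1.0 - lam) * s_j

q = [1, 0]          # search vector
w_j = [1.0, 0.0]    # keyword weight vector of document Dj
p_k = [0.0, 1.0]    # user profile of user k
score_a = a_score(q, w_j, p_k)
score_b = b_score(q, w_j, p_k, s_j=1.0)
```

In this example the document matches the search vector but not the user profile, so A_score is 0.5, and the high significance raises B_score to 0.75.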
Furthermore “Shohei Tsujimoto, Noriyuki Matsuda, So Harijima, Junichi Toyota, “Browsing support using context information—mounting on web and experimental evaluation thereof”, Annual Conference of JSAI (11th) Post Proceedings (Jun. 24, 1997), The Japanese Society for Artificial Intelligence, pp. 466-467” is known.
The above conventional search methods are based on the following assumptions. That is, (1) basic concept on page ranking, that a WWW document linked with a good quality WWW document has good quality, and (2) a keyword weight vector w of a WWW document and a personal profile p of a user are generated by sufficient information.
However, the above assumptions are not always applicable to a set of WWW documents viewed by a mobile terminal (hereafter called “mobile content”), and an appropriate search result cannot always be acquired by a prior art.
Since sites are not linked to each other, the assumption that a WWW document linked with a good quality WWW document has a good quality, is not always established. Also WWW documents are short documents and do not contain many keywords, which is a different characteristic from WWW documents viewed on a PC. Another characteristic is that a number of dynamically generated WWW documents, such as news and transfer guides, is high. For example, in the case of site A in
Because of this situation, it is difficult to determine the significance of content which has only several hundred words and no links, even when the personal access history is considered using an evaluation value such as the one shown in Expression (8) or Expression (11). It is also difficult to represent a personal profile with a keyword weight vector. As a consequence, it is difficult to present WWW documents that satisfy a user in a search.
SUMMARY OF THE INVENTION
With the foregoing in view, it is an object of the present invention to provide a document processing device and document processing method with which a search result, satisfactory to the user, can be provided, for WWW documents which are not linked with each other very much and which users do not access very frequently.
In order to solve these problems, a document processing method of the present invention has: a collection step of collecting access history of a user; a document similarity computing step of computing document similarity, which indicates similarity between documents, by one user pattern which indicates a plurality of users who have accessed one document, and another user pattern which indicates a plurality of users who have accessed another document according to the access history collected in the collection step; a keyword weight vector correction step of correcting a keyword weight vector of the one document using the document similarity computed in the document similarity computing step; and an evaluation value calculation step of calculating an evaluation value for input information for searching, based on the keyword weight vector corrected in the keyword weight vector correction step.
According to the present invention, the access history of the user is stored, and the document similarity which indicates similarity between documents is computed by one user pattern which indicates a plurality of users who have accessed one document and another user pattern which indicates a plurality of users who have accessed another document, according to the access history, and the keyword weight vector in the one document is corrected using the computed document similarity. And the evaluation value for the input information for searching can be calculated based on the corrected keyword weight vector.
By this, the keyword weight vector can be interpolated based on the document having a user pattern similar to the user pattern of a user accessing documents, and the keyword weight vector of a document having low access quantity and link quantity, such as a document having mobile content, can be more accurate, and as a result, searching with a higher accuracy is implemented.
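The correction of a keyword weight vector by document similarity can be illustrated by the following sketch. The correction formula is not given in this section, so the similarity-weighted average used here is purely an assumption for illustration, as are the function names and the sample data. Each access user vector has one element per user, which is 1 if that user has accessed the document.

```python
import math

def sim(a, b):
    # Assumed similarity: inner product normalized by the vector lengths.
    inner = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norm if norm else 0.0

def smooth_keyword_vector(j, keyword_vectors, access_vectors):
    """Correct w_j with the vectors of documents accessed by similar users (assumed formula)."""
    u_j = access_vectors[j]
    total = list(keyword_vectors[j])
    weight_sum = 1.0  # the document's own vector has weight 1
    for e, w_e in enumerate(keyword_vectors):
        if e == j:
            continue
        s = sim(u_j, access_vectors[e])  # document similarity from the user patterns
        weight_sum += s
        total = [t + s * x for t, x in zip(total, w_e)]
    return [t / weight_sum for t in total]

# Hypothetical data: documents 0 and 1 are accessed by the same two users.
access = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
keywords = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
smoothed = smooth_keyword_vector(0, keywords, access)
```

Document 0 contains only the first keyword, but because it shares its user pattern with document 1, the corrected vector also receives weight for the second keyword. Document 2 is accessed by a different user and does not contribute.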
In the document processing method of the present invention, it is preferable that the keyword weight vector correction step further comprises a step of correcting a keyword weight vector in the above mentioned other document using the document similarity, and correcting a keyword weight vector in the above mentioned one document using the corrected keyword weight vector.
By this, the keyword weight vector in the other document is corrected and the keyword weight vector in the one document is corrected using this corrected keyword weight vector, and as a result, the keyword weight vector of a document with small document volume can be more accurate, and a more accurate search can be implemented.
The document processing method of the present invention further has: a user similarity computing step of computing a user similarity, which indicates similarity between users, by one document pattern which indicates a plurality of documents accessed by one user and another document pattern which indicates a plurality of documents accessed by another user, according to the access history collected in the collection step; and a user profile correction step of correcting a user profile which indicates characteristics of the above mentioned one user using the user similarity computed in the user similarity computing step, wherein the evaluation calculation step further has a step of calculating the evaluation value for the input information for searching based on the user profile of the one user corrected in the user profile correction step.
According to the present invention, the user similarity which indicates similarity between users is computed using one document pattern which indicates a plurality of documents accessed by one user and another document pattern which indicates a plurality of documents accessed by another user, and a user profile of the one user is corrected using the computed user similarity. Then, based on the corrected user profile of this one user, an evaluation value for input information for searching can be calculated. By this, a user profile of a user who does not access much can be compensated for by a peripheral user, and a search result with high conformity for the user can be provided.
In the document processing method of the present invention, it is preferable that the user profile correction step further has a step of correcting a user profile of another user using the user similarity, and correcting the user profile of the above mentioned one user based on this corrected user profile.
By this, a user profile of a user, who does not access much, can be compensated for by a peripheral user, and a search result with high conformity for the user can be provided.
It is preferable that the document processing method further has an acquisition step of acquiring significance information which indicates significance attached to each document, wherein the evaluation value calculation step further has a step of calculating an evaluation value for the input information for searching, using the significance information acquired in the acquisition step.
According to the present invention, a user can acquire significance information which indicates significance attached to each document, and calculate an evaluation value for input information for searching using the acquired significance information. By this, the significance can be reflected on the evaluation value, and a more appropriate evaluation result can be provided.
In the document processing method of the present invention, it is preferable that the evaluation value calculation step further has a step of calculating an evaluation value using the corrected keyword weight vector when the corrected keyword weight vector in the above mentioned one document exists, and calculating an evaluation value using the keyword weight vector before correction when the corrected keyword weight vector in the one document does not exist.
According to the present invention, whether the corrected keyword weight vector is used or the uncorrected keyword weight vector is used can be switched according to the presence of the corrected keyword weight vector, therefore even a document which is not held or collected in advance can be evaluated appropriately, and [this evaluation value] can be provided to the user.
It is preferable that the document processing method of the present invention further has an acquisition step of acquiring a document from a search server according to an access by a user, wherein accesses accepted in the acquisition step are collected in the collection step as the access history.
According to the present invention, the terminal at the user side need not have the access history collection function, therefore the configuration thereof can be simplified.
A document processing method of the present invention has: a collection step of collecting access history of a user; a document similarity computing step of computing a document similarity, which indicates similarity between documents by one user pattern which indicates a plurality of users who have accessed one document and another user pattern which indicates a plurality of users who have accessed another document, according to the access history collected in the collection step; a keyword weight vector correction step of correcting a keyword weight vector of the above mentioned one document using the document similarity computed in the document similarity computing step; an acquisition step of acquiring significance information which indicates a significance attached to each document; a significance correction step of distinguishing a first user pattern which indicates users who have accessed one document during a first time period, and a second user pattern which indicates users who have accessed one document during a second time period, according to the accesses of users collected in the collection step, and correcting the significance of the above mentioned one document based on the similarity of the first user pattern and the second user pattern and a number of accesses to the one document; and an evaluation value calculation step of calculating an evaluation value for input information for searching, based on the keyword weight vector corrected in the keyword weight vector correction step, and the significance information corrected in the significance correction step.
According to the present invention, a first user pattern which indicates users who have accessed one document during a first time period, and a second user pattern which indicates users who have accessed the one document during a second time period, are separately stored, and significance of the one document is corrected based on the similarity of the stored first user pattern and the second user pattern and the number of accesses to this one document. By this, significance of one document can be more appropriate. In other words, the users who access a document generally change as time passes, so a document whose user patterns in the two time periods are similar, that is, a document which is accessed repeatedly by the same users, is regarded as having high significance. Therefore the significance of this document is corrected so that the evaluation value thereof becomes high.
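The significance correction can be illustrated by the following sketch. The correction formula is not given in this section, so the formula below, which raises the significance toward 1 in proportion to the user pattern similarity and a saturating function of the number of accesses, is purely an assumption, as are the parameters `alpha` and `half_count`.

```python
import math

def sim(a, b):
    # Assumed similarity: inner product normalized by the vector lengths.
    inner = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norm if norm else 0.0

def corrected_significance(s_j, first_pattern, second_pattern, access_count,
                           alpha=0.5, half_count=10):
    """Raise s_j when the same users return to the document (assumed formula)."""
    repeat = sim(first_pattern, second_pattern)          # similarity of the two user patterns
    volume = access_count / (access_count + half_count)  # assumed saturating access term
    return s_j + alpha * repeat * volume * (1.0 - s_j)   # result stays within 0..1

# Hypothetical user patterns: one element per user, 1 if the user accessed the document.
same_users = corrected_significance(0.2, [1, 1, 0], [1, 1, 0], access_count=10)
new_users = corrected_significance(0.2, [1, 1, 0], [0, 0, 1], access_count=10)
```

A document accessed by the same users in both time periods has its significance raised from 0.2 to 0.4 in this example, while a document whose users changed completely keeps the significance 0.2.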
It is preferable that the document processing method of the present invention further has an output step of outputting the search result searched by the user according to the evaluation value calculated in the evaluation value calculation step.
By this invention, a search result based on the calculated evaluation value can be output, and a search result which the user can clearly see, such as outputting documents sequentially from one having a higher evaluation value, can be provided.
A document processing method of the present invention has: a first generation step of generating a user profile based on a keyword weight vector that is to be a reference value; a second generation step of generating a new keyword weight vector based on the user profile generated in the first generation step and the keyword weight vector that is to be a reference value; a third generation step of generating a new user profile based on the new keyword weight vector generated in the second generation step; a user profile similarity generation step of computing similarity between the new user profile generated in the third generation step and the user profile generated immediately before the new user profile; and an evaluation value calculation step of calculating an evaluation value based on the keyword weight vector and user profile when the similarity computed in the user profile similarity generation step becomes a predetermined value or more.
According to the present invention, a user profile is generated first based on a keyword weight vector to be a reference value, then a new keyword weight vector is generated based on the generated user profile and the keyword weight vector to be the reference value, and a new user profile is generated based on the new keyword weight vector. Then similarity of the new user profile and the user profile generated immediately before the new user profile is computed, and it is judged whether this similarity has a predetermined value or more. The user profile and keyword weight vector are repeatedly generated until the similarity has a predetermined value or more, and the evaluation value is calculated based on the keyword weight vector and user profile when the computed similarity becomes a predetermined value or more.
By generating the keyword weight vector and user profile to be interdependent like this, the user profile propagates into the keyword weight vector, whereby the user profile and keyword weight vector are smoothed and interpolated. Therefore the keyword weight vector of a document having small document volume, such as a mobile content, can be more accurate. A user profile of a user who does not access much can be compensated for by a peripheral user, and a search result with high conformity for the user can be provided.
It is preferable that the document processing method according to the present invention further has a judgment step of judging whether the similarity generated in the user profile similarity generation step is a predetermined value or more, wherein the evaluation value calculation step further has a step of calculating the evaluation value based on the keyword weight vector and user profile when the similarity computed in the user profile similarity generation step becomes a predetermined value or more.
According to the present invention, the evaluation value can be calculated based on the keyword weight vector and user profile when the similarity computed in the user profile similarity generation step becomes a predetermined value or more, whereby a search result with high conformity for the user can be provided.
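The repeated generation of the user profile and the keyword weight vector can be illustrated by the following sketch. The generation formulas are not given in this section, so the normalized sum used for the user profile, the mixing ratio `beta` used for the new keyword weight vector, the threshold and the upper limit on the number of rounds are all assumptions for illustration.

```python
import math

def sim(a, b):
    inner = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norm if norm else 0.0

def build_profiles(doc_vectors, accesses):
    """accesses[k]: indices of the documents accessed by user k (assumed format)."""
    profiles = []
    for docs in accesses:
        total = [0.0] * len(doc_vectors[0])
        for j in docs:
            total = [t + x for t, x in zip(total, doc_vectors[j])]
        norm = math.sqrt(sum(t * t for t in total))
        profiles.append([t / norm for t in total] if norm else total)
    return profiles

def update_vectors(reference, profiles, accesses, beta=0.3):
    # Assumed rule: mix the reference vector with the mean profile of the document's users.
    updated = []
    for j, w_ref in enumerate(reference):
        readers = [profiles[k] for k, docs in enumerate(accesses) if j in docs]
        if not readers:
            updated.append(list(w_ref))
            continue
        mean = [sum(p[i] for p in readers) / len(readers) for i in range(len(w_ref))]
        updated.append([(1.0 - beta) * w + beta * m for w, m in zip(w_ref, mean)])
    return updated

def iterate(reference, accesses, threshold=0.9999, max_rounds=100):
    profiles = build_profiles(reference, accesses)               # first generation step
    vectors = reference
    for _ in range(max_rounds):
        vectors = update_vectors(reference, profiles, accesses)  # second generation step
        new_profiles = build_profiles(vectors, accesses)         # third generation step
        # Judgment step: compare each new profile with the one generated immediately before.
        done = all(sim(n, o) >= threshold for n, o in zip(new_profiles, profiles))
        profiles = new_profiles
        if done:
            break
    return vectors, profiles

reference = [[1.0, 0.0], [0.0, 1.0]]
accesses = [[0, 1], [0]]   # user 0 accessed both documents, user 1 only document 0
vectors, profiles = iterate(reference, accesses)
```

In this example document 1 originally has no weight for the first keyword, but the profile of user 0 propagates into its keyword weight vector, so the resulting vector has a positive weight for that keyword.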
The present invention can be described not only as an invention of the document processing method as above, but also as an invention of a document processing device, search system and document processing program, as described herein below. These are substantially the same invention in different categories, and provide similar functions and effects.
A document processing device of the present invention has: access history collection means for collecting access history of a user; document similarity computing means for computing a document similarity, which indicates similarity between documents by a user pattern which indicates a plurality of users who have accessed one document and a user pattern which indicates a plurality of users who have accessed another document, according to the access history collected by the collection means; keyword weight vector correction means for correcting a keyword weight vector of the one document, using the document similarity computed by the document similarity computing means; and evaluation value calculation means for calculating an evaluation value for input information for searching, based on the keyword weight vector corrected by the keyword weight vector correction means.
A search system of the present invention has: a user terminal for storing access history; an information collection device for generating a keyword weight vector of a document accessed by the user terminal; and the above mentioned document processing device for acquiring the access history of the user terminal and the keyword weight vector generated by the information collection device.
A document processing program of the present invention has: a collection module for collecting access history of a user; a document similarity computing module for computing a document similarity, which indicates similarity between documents, by a user pattern which indicates a plurality of users who have accessed one document and a user pattern which indicates a plurality of users who have accessed another document, according to the access history collected by the collection module; a keyword weight vector correction module for correcting a keyword weight vector of the one document, using the document similarity computed by the document similarity computing module; and an evaluation value calculation module for calculating an evaluation value for input information for searching, based on the keyword weight vector corrected by the keyword weight vector correction module.
A document processing device of the present invention has: primary WWW document extraction means for extracting WWW documents according to a searching word; user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means; secondary WWW document extraction means for extracting a WWW document set of WWW documents accessed by the users extracted by the user extraction means; and significance calculation means for calculating significance of the WWW documents extracted by the primary WWW document extraction means based on a degree of accesses by users to the WWW document set extracted by the secondary WWW document extraction means.
According to the present invention, a user set of users who have accessed the WWW document according to a searching word is extracted, and a set of WWW documents accessed by the users is extracted. And significance of the WWW documents can be calculated based on a degree of accesses by users to the extracted WWW document set. By this, significance of a WWW document having a small access quantity and link quantity, such as a mobile content, can be accurately calculated, and an accurate search can be implemented.
In the document processing device in the present invention, it is preferable that the significance calculation means calculates the significance of a WWW document based on a degree of accesses by each user in the user set extracted by the user extraction means.
According to the present invention, the significance of the WWW document can be calculated based on a degree of accesses by each user in the extracted user set, and significance can be accurately calculated, which implements accurate search.
A document processing device of the present invention has: primary WWW document extraction means for extracting WWW documents according to a searching word; user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means; data structure holding means for holding data for which reference relationships among the WWW documents can be managed as a directed graph; secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each WWW document, based on the data stored in the data structure holding means; and significance calculation means for calculating significance of the WWW documents extracted by the primary WWW document extraction means based on a degree of accesses by the users extracted by the user extraction means to the WWW document set extracted by the secondary WWW document extraction means.
According to the present invention, a user set of users who have accessed the WWW documents according to the searching word is extracted, and other WWW documents which each extracted WWW document refers to and other WWW documents which refer to each WWW document are extracted based on the data which can manage the reference relationships among the WWW documents as a directed graph. And significance of the WWW documents can be calculated based on a degree of accesses by the users to the extracted WWW document set. By this, significance of a WWW document can be accurately calculated, and an accurate search can be implemented.
A document processing device of the present invention has: access history holding means for holding an access history to a WWW document by a plurality of users; data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph; primary WWW document extraction means for extracting WWW documents according to a searching word; user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means from the access history holding means; secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each WWW document, based on the data stored in the data structure holding means, and extracting one node set by adding the user set extracted by the user extraction means and the WWW document set of the extracted WWW documents; and significance calculation means for calculating significance of the WWW documents by weighting a degree of being referred to among the WWW documents in the node set extracted by the secondary WWW document extraction means and a degree of accesses by each of the users to each of the WWW documents respectively.
According to the present invention, data that can be managed as a directed graph among WWW documents is held in advance, and a user set of users who have accessed a WWW document extracted according to a searching word is extracted. Also other WWW documents which each WWW document refers to and other WWW documents which refer to each WWW document are extracted based on the data which allows managing the reference relationships among the WWW documents as a directed graph, and one node set is extracted by adding the user set of extracted users and the WWW document set of extracted WWW documents. Then the significance of the WWW documents is calculated by weighting a degree of being referred to among the WWW documents in the extracted node set and a degree of accesses by each user to each WWW document respectively. By this, the significance of the WWW document can be accurately calculated, and an accurate search can be implemented.
A document processing device of the present invention has: data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph; primary WWW document extraction means for extracting WWW documents according to a searching word; user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means from the access history holding means; secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means; hub score calculation means for calculating a hub score which indicates a degree of accesses by each user of the user set extracted by the user extraction means to each WWW document extracted by the secondary WWW document extraction means; and significance calculation means for calculating significance based on a degree of matching of a visit vector of users who have visited a WWW document included in any of the WWW documents and the hub score calculated by the hub score calculation means.
According to the present invention, a user set of users who have accessed the extracted WWW document according to a searching word is extracted, and other WWW documents which each extracted WWW document refers to and other WWW documents which refer to each of the WWW documents are extracted based on the data which allows managing the reference relationships among the WWW documents as a directed graph. A hub score, which indicates a degree of accesses to each extracted WWW document, is calculated for each user of the extracted user set. Then the significance is calculated based on a degree of matching of the hub score and a visit vector of users who have visited a WWW document included in any of the WWW documents. By this, the significance of the WWW document can be accurately calculated, and an accurate search can be implemented.
The present invention can not only be described as a document processing device, as mentioned above, but can also be described as a document processing method, as mentioned herein below. In this case, the functional effects thereof are the same as those of the document processing device.
A document processing method of the present invention has: a primary WWW document extraction step of extracting WWW documents according to a searching word; a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step; secondary WWW document extraction step of extracting a WWW document set of WWW documents accessed by users extracted in the user extraction step; and significance calculation step of calculating significance of the WWW documents extracted in the primary WWW document extraction step based on a degree of accesses by the users to the WWW document set extracted in the secondary WWW document extraction step.
The document processing method of the present invention is a document processing method for a document processing device having data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, having: a primary WWW document extraction step of extracting WWW documents according to a searching word; a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step; a secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means; and a significance calculation step of calculating significance of the WWW documents extracted in the primary WWW document extraction step based on a degree of accesses by the users extracted in the user extraction step to the WWW document set extracted in the secondary WWW document extraction step.
A document processing method of the present invention is a document processing method for a document processing device having access history holding means for holding history of access to a WWW document by a plurality of users, and data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, having: a primary WWW document extraction step of extracting WWW documents according to a searching word; user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step from the access history holding means; secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means, and extracting one node set by adding the user set extracted in the user extraction step and the extracted WWW document set of WWW documents; and significance calculation step of calculating significance of the WWW documents by weighting a degree of being referred to among the WWW documents in the node set extracted in the secondary WWW document extraction step and a degree of accesses by each of the users to each of the WWW documents respectively.
A document processing method of the present invention is a document processing method for a document processing device having access history holding means for holding history of access to a WWW document by a plurality of users, and data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, having: a primary WWW document extraction step of extracting WWW documents according to a searching word; a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step from the access history holding means; a secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means; a hub score calculation step of calculating a hub score which indicates a degree of accesses by each user of the user set extracted in the user extraction step to each WWW document extracted in the secondary WWW document extraction step; and a significance calculation step of calculating significance based on a degree of matching of a visit vector of users who have visited a WWW document included in any of the WWW documents and the hub score calculated in the hub score calculation step.
According to the present invention, the keyword weight vector can be interpolated based on documents having a pattern similar to the user pattern of a user accessing documents, and the keyword weight vector of a document having low access quantity and link quantity, such as mobile content, can be more accurate, and as a result, searching with more accuracy is implemented.
According to the present invention, significance of WWW documents having a small access quantity and link quantity, such as mobile content, can be accurately calculated based on a degree of accesses by the user, and an accurate search can be implemented.
Embodiments of the present invention will now be described with reference to the accompanying drawings. Whenever possible, identical portions are denoted with a same reference symbol, for which redundant description is omitted.
First Embodiment
The proxy device 100 is further comprised of an access pattern collection unit 101, user access history holding unit 102, keyword vector holding unit 103, WWW document similarity computing unit 104, user similarity computing unit 105, keyword vector smoothing unit 106, user profile smoothing unit 107, smoothed user profile holding unit 108, smoothed keyword vector holding unit 109 and rearranging unit 110. The user terminal 200 is further comprised of a WWW browser 201, access history holding unit 202 and access history transfer unit 203. This user terminal 200 represents a user or a plurality of users, and about 1 million units are assumed in this system. A user may use a plurality of user terminals. A number of users is represented by the constant K herein below.
The proxy device 100 here is constructed by the hardware shown in
The access pattern collection unit 101 is a portion to collect access patterns accessed in the user terminal 200 in a predetermined period. The access pattern here means access destination information, such as a URL, where a WWW document, which the user attempted to access, is located. The access destination information acquired here is output to the information collection device 400. The information collection device 400 can acquire the WWW document according to the access destination information, as mentioned later. For example, the information collection device 400 can acquire a WWW document shown in
The access pattern collection unit 101 acquires access pattern information indicating a plurality of WWW documents which the user terminal 200 accessed in a predetermined period from the information collection device 400, and calculates an access user vector uj (a K×1 column vector) (1≦j≦N) shown in Expression (12), and a visit WWW document vector vk (an N×1 column vector) (1≦k≦K) shown in Expression (13). The results are stored in the user access history holding unit 102.
Here the access user vector uj is defined as a vector which is ujk=1 if the WWW document Dj has been accessed by the user k, and is 0 otherwise.
[Expression 12]
$$ u_j = (u_{j1}, u_{j2}, \ldots, u_{jK})^{T} \tag{12} $$
This indicates a reader list (user pattern) of the WWW document Dj. K denotes a number of users.
In the same way, the visit WWW document vector vk is defined as a vector which is vkj=1 if the user k has accessed the WWW document Dj, and is 0 otherwise.
[Expression 13]
$$ v_k = (v_{k1}, v_{k2}, \ldots, v_{kN})^{T} \tag{13} $$
This indicates the WWW document list accessed by the user k. N denotes a number of WWW documents.
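As a minimal sketch of Expressions (12) and (13), the following Python builds both sets of binary vectors from an access log. The log format (pairs of zero-based user and document indices) and the function name `build_access_vectors` are assumptions made for illustration and do not appear in the embodiment.

```python
from typing import List, Set, Tuple


def build_access_vectors(
    accesses: Set[Tuple[int, int]], num_users: int, num_docs: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (u, v) where u[j][k] == v[k][j] == 1 if user k accessed document j."""
    # u[j] is the access user vector of document j (length K).
    u = [[0] * num_users for _ in range(num_docs)]
    # v[k] is the visit WWW document vector of user k (length N).
    v = [[0] * num_docs for _ in range(num_users)]
    for user, doc in accesses:
        u[doc][user] = 1
        v[user][doc] = 1
    return u, v


# Hypothetical log of (user index, document index) pairs: K = 3 users, N = 2 documents.
log = {(0, 0), (1, 0), (1, 1), (2, 1)}
u, v = build_access_vectors(log, num_users=3, num_docs=2)
print(u)  # reader lists per document
print(v)  # document lists per user
```

The two structures hold the same information: stacking the vectors u gives an N×K matrix, and stacking the vectors v gives its transpose.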
The user access history holding unit 102 is a portion to store the access destination information, visit WWW document vector vk, access user vector uj collected by the access pattern collection unit 101, and access pattern information indicating a WWW document (including significance) acquired based on the access destination information.
The information collection device 400, on the other hand, acquires the WWW document according to the access destination information which is output from the access pattern collection unit 101. Then morphological analysis is performed on the acquired WWW document Dj, words included in the WWW document Dj are extracted, and the keyword weight vector wj is generated based on the extracted words. The keyword weight vector wj can be generated according to Expression (1) to Expression (4), as mentioned above.
According to the present embodiment, a word included in the WWW document is not directly converted into a keyword weight vector, but is replaced with a broader term using a keyword from a thesaurus, embracing a range of synonyms. For example, “professional baseball” is converted into “baseball”.
The keyword to which a word is converted is a broader term, and [this processing] is performed for an entire WWW document accessed by the user, and the tf·idf characteristic is calculated to determine the keyword weight vector wj. Then wj is normalized to be a vector of magnitude 1. Normalization of the keyword weight vector and user profile was not mentioned in the description of the tf·idf characteristic, but according to the present embodiment, [vectors] are always handled as normalized vectors having a magnitude of 1.
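The broader-term replacement and the normalization to magnitude 1 can be sketched as follows. This is an illustration under stated assumptions: the `THESAURUS` dictionary and the function `keyword_vector` are hypothetical, and a plain occurrence count stands in for the tf·idf weight of Expressions (1) to (4), which are not reproduced here.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical thesaurus: maps a narrower word to its broader keyword.
THESAURUS: Dict[str, str] = {"professional baseball": "baseball"}


def keyword_vector(words: List[str], vocabulary: List[str]) -> List[float]:
    """Map words to broader terms, count them, and normalize to unit length."""
    broader = [THESAURUS.get(word, word) for word in words]
    counts = Counter(broader)
    # Assumption: a raw count replaces the tf-idf weight for illustration only.
    raw = [float(counts[keyword]) for keyword in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    if norm == 0.0:
        return raw  # no known keyword occurred; leave the zero vector as is
    return [x / norm for x in raw]


w = keyword_vector(["professional baseball", "baseball", "soccer"], ["baseball", "soccer"])
print(w)
```

Both "professional baseball" and "baseball" contribute to the same component, so a short document still accumulates weight on the broader keyword.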
The keyword vector holding unit 103 is a portion to store the keyword weight vector wj given by Expression (1) generated in the information collection device 400 for each user (for each user terminal 200).
The WWW document similarity computing unit 104 is a portion to compute the degree of matching of an access user vector uj of a user accessing one WWW document Dj and an access user vector uje of a user accessing another WWW document Dje, and computes similarity between the WWW documents by computing the degree of matching of the access user vectors u. The degree of matching of the access user vectors is computed by the following Expression (14) (cos measure).
[Expression 14]
$$ \mathrm{sim}(u_j, u_{je}) = \frac{u_j^{T} u_{je}}{|u_j|\,|u_{je}|} \tag{14} $$
This Expression (14) is used as a scale to indicate a degree of matching of user patterns [of the users] who visited the WWW document Dj and WWW document Dje, and is used for judging similarity between WWW documents.
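A minimal Python sketch of the degree of matching follows, assuming the cos measure is the standard cosine of the angle between the two access user vectors. The function name `cosine_similarity` and the zero-vector convention are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a document with no accessing users matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reader lists of two documents over K = 4 users.
u_j = [1, 1, 0, 1]
u_je = [1, 1, 0, 0]
print(cosine_similarity(u_j, u_je))
```

For binary vectors the numerator is the number of users who accessed both documents, so the value grows as the two reader lists overlap.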
The user similarity computing unit 105 is a portion to compute a degree of matching of a visit document vector vk of a WWW document accessed by one user k, and a visit document vector vke of a WWW document accessed by another user ke, so that similarity between the users can be judged by computing a degree of matching of these visit document vectors. The degree of matching of visit document vectors v is computed by the following Expression (15).
[Expression 15]
sim(vk,vke) (15)
This Expression (15) is used as a scale to indicate similarity between the user k and user ke in the document pattern of the accessed WWW documents.
The keyword vector smoothing unit 106 is a portion to smooth a keyword weight vector in one WWW document being held in the keyword vector holding unit 103, and is a portion to correct the keyword weight vector wj using a WWW document of which access pattern is similar to that of this one WWW document. Thereby even if an accurate keyword weight vector cannot be calculated because the number of accesses to the one WWW document is insufficient, the keyword weight vector can be interpolated using another similar WWW document, and as a result, a more accurate keyword vector can be calculated.
In concrete terms, the keyword vector smoothing unit 106 smoothes and interpolates the keyword weight vector wj using the above Expression (14) and the following Expression (16), so as to generate the smoothed keyword weight vector wj′.
Here ε denotes an experimentally defined real number. ε is 1/N in the present embodiment.
The user profile smoothing unit 107 is a portion to generate a user profile pk using a keyword weight vector wj in one WWW document held in the keyword vector holding unit 103 and a visit document vector vk stored in the user access history holding unit 102, and to perform smoothing and interpolation of the generated user profile pk; it is also a portion to correct a user profile using an access pattern of another user which is close to the access pattern of the above mentioned one user. Thereby even if an accurate user profile cannot be calculated because the number of samples is insufficient in the visit WWW document vector of one user, the user profile can be interpolated and corrected using a visit WWW document vector of another user having a similar [access pattern], and as a result, a more accurate user profile can be calculated.
More concretely, the user profile smoothing unit 107 generates a user profile according to Expression (17). The user profile pk is generated by multiplying a matrix W (see Expression (18)) acquired by arranging the keyword weight vector wj, which is a column vector, by the visit WWW document vector for initialization.
[Expression 17]
$$ p_k = W v_k \tag{17} $$
$$ W = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_N \,] \tag{18} $$
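The initialization in Expressions (17) and (18) can be illustrated with a short Python sketch. The function name `user_profile` and the sample vectors are assumptions; the matrix W is represented as a list of its columns wj.

```python
from typing import List


def user_profile(w_columns: List[List[float]], v_k: List[int]) -> List[float]:
    """Compute p_k = W v_k, where w_columns[j] is the keyword weight vector w_j."""
    num_keywords = len(w_columns[0])
    p = [0.0] * num_keywords
    for w_j, visited in zip(w_columns, v_k):
        for i, weight in enumerate(w_j):
            # Each column contributes in proportion to the entry v_kj (0 or 1).
            p[i] += weight * visited
    return p


# Hypothetical data: N = 3 documents, 2 keywords; user k visited documents 0 and 1.
w_columns = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
v_k = [1, 1, 0]
p_k = user_profile(w_columns, v_k)
print(p_k)
```

Because vk is binary, the product reduces to the sum of the keyword weight vectors of the documents the user visited. Scaling the result to magnitude 1, as the embodiment does for all vectors, would be a separate step.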
The user profile smoothing unit 107 performs smoothing and interpolation of the user profile pk, initialized and generated like this, according to Expression (19).
As a transformation of Expression (19), the following Expression (20) can also be used.
In this case, interpolation is performed not using the similarity of the accessed WWW documents, but the similarity of keywords in the accessed WWW documents.
The smoothed user profile holding unit 108 stores the smoothed user profile pk′, which was smoothed and interpolated by the user profile smoothing unit 107.
The smoothed keyword vector holding unit 109 stores the smoothed keyword weight vector wj′, which was smoothed and interpolated by the keyword vector smoothing unit 106.
The rearranging unit 110 is a portion to perform rearranging processing to the top X (for example X=20) WWW documents which were searched by the search server 300 using the search vector q based on the searching word which was input via the user terminal 200, and were output as WWW document search candidates. In concrete terms, [the WWW documents] may be displayed sequentially from the top rank based on the evaluation value calculated according to the above Expression (8), or significance si may be added, as shown in the following Expression (21). The rearranging unit 110 may store the searched WWW documents temporarily in a WWW document storage unit (not illustrated), and have the keyword vector holding unit 103 store the keyword weight vector.
[Expression 21]
$$ \mathrm{B\_score}(q, w'_j; p'_k, s_j) = \lambda\,\mathrm{A\_score}(q, w'_j; p'_k) + (1-\lambda)\,s_j, \quad 0 \le \lambda \le 1 \tag{21} $$
λ is 0.9 in the present embodiment.
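The weighting in Expression (21) and the resulting rearrangement can be sketched in Python as follows. The A_score values are taken as given inputs, since Expression (8) is defined elsewhere; the function names `b_score` and `rearrange` and the candidate data are assumptions for illustration.

```python
from typing import List, Tuple

LAMBDA = 0.9  # value stated for the present embodiment


def b_score(a_score: float, significance: float, lam: float = LAMBDA) -> float:
    """Blend the keyword-based A_score with the document significance s_j."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * a_score + (1.0 - lam) * significance


def rearrange(candidates: List[Tuple[str, float, float]], lam: float = LAMBDA) -> List[str]:
    """Sort (document id, A_score, significance) candidates by descending B_score."""
    ranked = sorted(candidates, key=lambda c: b_score(c[1], c[2], lam), reverse=True)
    return [doc_id for doc_id, _, _ in ranked]


# Hypothetical search candidates returned by the search server.
candidates = [("doc-b", 0.55, 0.1), ("doc-c", 0.20, 1.0), ("doc-a", 0.50, 0.9)]
print(rearrange(candidates))
```

In this sample, doc-a overtakes doc-b even though its A_score is lower, because its higher significance outweighs the small A_score difference at λ = 0.9.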
In the present embodiment, it is assumed that all the WWW documents acquired from the search server 300 are stored in the user access history holding unit 102. It is, of course, possible that the search server 300 could be from another provider, and a WWW document, which does not exist in the proxy device 100, is acquired as a search result, but exceptions can be handled by using wj, without smoothing, for calculation of this WWW document. In other words, the proxy device 100 has a judgment unit to judge whether the keyword weight vector of the collected WWW document is stored in the smoothed keyword vector holding unit 109, or whether the WWW document is already stored. And if a WWW document of which keyword weight vector is stored in the proxy device 100 is acquired as a search result, the proxy device 100 calculates an evaluation value using Expression (21), and if a WWW document of which keyword weight vector is not stored in the proxy device 100 is acquired as a search result, the proxy device 100 may calculate the evaluation value using Expression (11). Only one of the keyword weight vector and user profile may be smoothed.
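The judgment described above amounts to a lookup with a fallback, which might be sketched as follows. The dictionary standing in for the smoothed keyword vector holding unit 109 and the function name `select_keyword_vector` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def select_keyword_vector(
    doc_id: str,
    smoothed_store: Dict[str, List[float]],
    raw_vector: List[float],
) -> Tuple[List[float], bool]:
    """Return (vector, used_smoothed) for one search result."""
    smoothed = smoothed_store.get(doc_id)
    if smoothed is not None:
        # A smoothed vector is held, so it feeds the evaluation value.
        return smoothed, True
    # Unknown document: fall back to the unsmoothed keyword weight vector.
    return raw_vector, False


# Hypothetical store: only doc-a has a smoothed keyword weight vector.
store = {"doc-a": [0.6, 0.8]}
vec_a, used_smoothed_a = select_keyword_vector("doc-a", store, [1.0, 0.0])
vec_b, used_smoothed_b = select_keyword_vector("doc-b", store, [0.0, 1.0])
print(used_smoothed_a, used_smoothed_b)
```

Returning the flag alongside the vector lets the caller choose between the two evaluation expressions without a second lookup.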
Now the user terminal 200 will be described. As
The WWW browser 201 is an application to access WWW documents held on the Internet. The user of the user terminal 200 can access a desired WWW document by operating the WWW browser 201. In the present embodiment, the WWW browser 201 can access the WWW documents for searching, output a search request to a search server via the proxy device 100, receive the search result via the proxy device 100, and display it for the user.
The access history holding unit 202 is a portion to store access destination information (URL) for which the WWW browser 201 performed access processing.
The access history transfer unit 203 is a portion to send the access destination information stored in the access history holding unit 202 to the proxy device 100 in a predetermined cycle or timing.
Operation of the proxy device 100 of the present embodiment will now be described.
Then the user similarity computing unit 105 performs the user similarity computation (S204). The WWW document similarity computation is also performed (S205). Using the computed user similarity and WWW document similarity, the keyword vector smoothing unit 106 and user profile smoothing unit 107 perform smoothing and interpolation, and the smoothed keyword weight vector and smoothed user profile are generated (S206, S207). The generated smoothed keyword weight vector and smoothed user profile are stored in the smoothed keyword vector holding unit 109 and smoothed user profile holding unit 108 respectively (S208, S209).
When a search request comes from the user terminal 200, [the proxy device 100] requests a search to the search server 300 according to the request (S210), and when a search result is received from the search server 300, the rearranging unit 110 performs rearrangement processing on the search result using the smoothed keyword weight vector and smoothed user profile (S211).
The proxy device 100 of the present embodiment has an effect to improve the statistical reliability of the WWW document and user profile. In statistical language processing, a keyword which is supposed to appear may not be included in the WWW document if the amount of observed data is not large enough. An object of the proxy device 100 of the present embodiment is to compensate for keywords in a personal profile, which is difficult to directly observe, and keywords of a WWW document of which number of words is small.
In particular, a WWW document which functions as a parent directory and which does not have sufficient keywords, or which is mostly images and has no keywords, can be interpolated with keywords of a document that can be visited simultaneously.
The case when WWW document-A is a parent directory of WWW document-B and WWW document-C, which have mobile content structures shown in
Expression (16) and Expression (19) represent smoothing for WWW documents of which access patterns are similar, according to the distance of the user. Thereby the user profile of a user, who does not access very much, can be compensated for by peripheral users, and the WWW document vector of which document volume is small can be compensated for by user access.
Now a variant form of a proxy device 100 will be described.
The functional effect of the proxy device 100 of the present embodiment will now be described. The proxy device 100 has the user access history holding unit 102 to store the access history of one user acquired by the access pattern collection unit 101. The access pattern collection unit 101 also generates an access user vector uj which is one user pattern indicating a plurality of users who accessed one WWW document Dj, and an access user vector uje which is another user pattern indicating a plurality of users who accessed another document Dje.
The user similarity computing unit 105 computes a document similarity sim (uj, uje) which indicates a similarity of the WWW document Dj and WWW document Dje. The keyword vector smoothing unit 106 corrects the keyword weight vector wje in other documents using the computed document similarity sim (uj, uje), and corrects the keyword weight vector wj in one document based on the corrected keyword weight vector wje, so as to acquire the smoothed keyword weight vector w′j. The smoothed keyword vector holding unit 109 stores the smoothed keyword weight vector w′j acquired here. The rearranging unit 110 can calculate the evaluation value B_SCORE for the input information for searching based on the smoothed keyword weight vector w′j.
By this, the keyword weight vector can be interpolated based on a document of which user pattern of accessing users is similar, and the accuracy of a keyword weight vector of a document with small document volume, such as mobile content, can be increased, and as a result an accurate search can be implemented.
In the proxy device 100, the access pattern collection unit 101 generates a visit WWW document vector vk which is one document pattern indicating a plurality of documents accessed by one user, and a visit WWW document vector vke which is another document pattern indicating a plurality of documents accessed by another user, and has the user access history holding unit 102 to store these vectors. The WWW document similarity computing unit 104 computes a user similarity sim (vk, vke), which indicates a similarity between users. Using the computed user similarity sim (vk, vke), the user profile smoothing unit 107 corrects a user profile pke, which is a document pattern of another user, and acquires the smoothed user profile pk by correcting the user profile pk of one user, based on the corrected user profile pke. The rearranging unit 110 can calculate the evaluation value for the input information for searching, based on the smoothed user profile pk of one user. By this, the user profile of a user who does not access much can be compensated for by a peripheral user, and a search result having high conformity for the user can be provided.
In the present embodiment, the keyword weight vector wj and user profile pk are smoothed, but it is sufficient that at least the keyword weight vector wj is smoothed. In this case, the user profile pk before smoothing is input for the smoothed user profile pk to be input to the evaluation value B_SCORE.
In the proxy device 100 of the present embodiment, the access pattern collection unit 101 acquires the significance si to indicate significance attached to each WWW document from the information collection device 400, along with the WWW document, and the rearranging unit 110 calculates the evaluation value B_SCORE for the input information for searching, using this significance si. Since the significance can be reflected in the evaluation value, a more appropriate evaluation result can be provided.
In the proxy device 100, when the search result is output according to the search request from the user terminal 200, the rearranging unit 110 can output the search result in the sequence based on the evaluation value B_SCORE calculated as above, and can provide a search result that can be easily seen by the user, such as outputting the result in the sequence of a document having a higher evaluation value.
Also in the proxy device 100, if the smoothed keyword weight vector w′j of the one WWW document exists, the rearranging unit 110 calculates the evaluation value B_SCORE using this smoothed keyword weight vector w′j (Expression 21), and if the smoothed keyword weight vector w′j of the one WWW document does not exist, the evaluation value B_SCORE is calculated using the keyword weight vector wj before smoothing (Expression 11). By this, the evaluation processing can be executed even if the WWW document has not been stored in advance.
Also in the proxy device 100, the access pattern collection unit 101 acquires documents from the search server according to the access from the user, and the access received here is stored in the user access history holding unit 102 as access history. By this, the function to have the user terminal 200 to hold the access history is unnecessary, and the configuration of [the proxy device 100] can be simplified.
Second Embodiment
A device to correct the evaluation value based on the significance according to the time-based change of the similarity of the access pattern of a user will be described.
The configuration of the second embodiment will now be described.
The access pattern collection unit 101a is an expanded version of the access pattern collection unit 101 of the first embodiment, and in accordance with the access pattern acquired from the user terminal 200, generates the access user vector uj used for Expression (12) respectively for each time period in the past, such as “from t to t+δ” and “from t+δ to t+2δ”, as shown in Expression (22), and has the user access history holding unit 102 store this information.
[Expression 22]
$$ u_j(t, t+\delta), \quad u_j(t+\delta, t+2\delta) \tag{22} $$
The significance correction unit 111 can calculate the correction value Δsj of the significance sj of the WWW document Dj, considering the similarity of access patterns (user patterns) and the number of accessed users between the time periods in the past, “t, t+δ” and “t+δ, t+2δ”, using the access user vectors uj (t, t+δ) and uj (t+δ, t+2δ).
[Expression 23]
Δsj=log(|uj(t,t+δ)∥uj(t+δ,t+2δ)|)sim(uj(t,t+δ),uj(t+δ,t+2δ) (23)
The significance correction value holding unit 112 is a portion to store the correction value Δsj calculated by the significance correction unit 111.
By Δsj, the significance of a WWW document whose access pattern did not change over the past time periods is corrected as shown in Expression (24).
The rearranging unit 110a is a portion to perform processing to rearrange the top 20 WWW documents which were searched by the search server 300 using the search vector q based on the searching word which was input in the user terminal 200, and were output as the WWW document search candidates, and controls so that [the WWW documents] are displayed in the sequence of higher evaluation value, which was calculated by Expression (24).
[Expression 24]
B_score(q,w′j;p′k,sj+Δsj)=λA_score(q,w′j;p′k)+(1−λ)(sj+Δsj), 0≦λ≦1 (24)
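As a minimal sketch of Expressions (23) and (24), the following Python code computes the correction value Δsj from two per-period access user vectors and folds it into the evaluation value. It assumes that sim is the cosine similarity, that |uj| is the number of users with a nonzero entry in the vector, and that no correction is applied when either period has no accesses. The function names are hypothetical and are not part of the described device.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def significance_correction(u_first, u_second):
    """Correction value per Expression (23): log(|u1| |u2|) * sim(u1, u2).

    Assumption: |u| counts the users with a nonzero entry in the period.
    """
    n_first = sum(1 for x in u_first if x)
    n_second = sum(1 for x in u_second if x)
    if n_first == 0 or n_second == 0:
        return 0.0  # assumption: no correction without accesses in both periods
    return math.log(n_first * n_second) * cosine_sim(u_first, u_second)


def b_score(a_score, s_j, delta_s_j, lam):
    """Evaluation value per Expression (24), with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * a_score + (1.0 - lam) * (s_j + delta_s_j)
```

Under these assumptions, a document visited by the same three users in both periods receives a correction of log 9, while a document whose visitors changed completely receives no correction.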
Operation of the proxy device 100a of the present embodiment will now be described.
Then the user similarity computation is performed by the user similarity computing unit 105 (S204). Also the WWW document similarity computation is performed (S205). Meanwhile, the correction value Δsj of the significance sj is generated by the significance correction unit 111 (S205a), and is stored in the significance correction value holding unit 112 (S205b).
The computed user similarity and WWW document similarity are smoothed and interpolated by the keyword vector smoothing unit 106 and user profile smoothing unit 107, and the smoothed keyword weight vector and smoothed user profile are generated (S206, S207). The generated smoothed keyword weight vector and smoothed user profile are stored in the smoothed keyword vector holding unit 109 and smoothed user profile holding unit 108 respectively (S208, S209).
When a search request is received from the user terminal 200, the search is requested to the search server 300 according to the request (S210), and when a search result is received from the search server 300, the rearranging unit 110 rearranges the search result according to the smoothed keyword weight vector, smoothed user profile and the significance sj corrected by the correction value Δsj (S211).
By calculating an evaluation value to which the correction value Δsj of significance is added, WWW documents which have been accessed repeatedly from the past to the present are displayed with a high ranking, so a search with higher conformity and output thereof can be implemented.
Now the functional effect of the proxy device 100a of the present embodiment will be described. In the proxy device 100a, the user access history holding unit 102 separately stores an access user vector uj (t, t+δ), which is a first user pattern indicating users who accessed one document in a first time period (e.g. from t to t+δ), and an access user vector uj (t+δ, t+2δ), which is a second user pattern indicating users who accessed the one document in a second time period (e.g. from t+δ to t+2δ). Here the significance of the one WWW document can be corrected based on the similarity between the stored access user vectors uj (t, t+δ) and uj (t+δ, t+2δ), and on the number of accesses of the one WWW document. By this, the significance of the one WWW document can be made more appropriate. In other words, users who access a WWW document change as time passes, but a WWW document whose user pattern remains similar and which was accessed repeatedly by the same users can be said to have high significance. Therefore the significance is corrected so that the evaluation value of this WWW document becomes high.
Third Embodiment
A proxy device 100b of the third embodiment will now be described.
It is assumed that the following expression is established. Expression (25) is an expression to estimate the user profile estimate value p−k of each user from the keyword weight vector estimate value w−j. Expression (25) has the form (M-row, 1-column vector) = (M-row, N-column matrix) × (N-row, 1-column vector), and W− in Expression (26) is an M-row, N-column matrix. w− is the same as w with “̂” (hat) above it, and in this description, “w−” is used for convenience. “−” attached to other characters likewise denotes “̂” (hat) over the character.
[Expression 25]
$\hat{p}_k = \hat{W} v_k$ (25)
$\hat{W} = [\hat{w}_1\ \hat{w}_2\ \cdots\ \hat{w}_N]$ (26)
As Expression (27) shows, it is assumed that the keyword weight vector estimate value w−j is a weighted mean of the user profile estimate value p−k and keyword weight vector wj. Expression (27) has the form (M-row, 1-column vector) = (M-row, K-column matrix) × (K-row, 1-column vector), and P− in Expression (28) is an M-row, K-column matrix.
[Expression 27]
$\hat{w}_j = (1-\alpha)\hat{P} u_j + \alpha w_j, \quad 0 < \alpha \le 1$ (27)
$\hat{P} = [\hat{p}_1\ \hat{p}_2\ \cdots\ \hat{p}_K]$ (28)
Expression (27) indicates a projection from the user profile p−k to the keyword weight vector w−j of the WWW document, whereby the smoothing effect is implemented. The gain of this projection is 1−α, so by repeating the processing of Expression (27), the user profile p−k and the keyword weight vector w−j converge. Convergence is judged, for example, by whether the inner product sim (wj−n, wj−n−1) of the new calculation result wj−n and the previous calculation result wj−n−1 is 0.9 or more. [The processing of Expression (27)] may be repeated until both the user profile p−k and keyword weight vector w−j converge, or until only one of them converges.
The proxy device 100b will be described referring back to
In more concrete terms, the WWW document/user profile matching unit 113 generates a user profile estimate value p−k as an initial value by Expression (25). W− at this time consists of the keyword weight vectors wj as initial values. Then [the WWW document/user profile matching unit 113] generates the keyword weight vector estimate value w−j by applying the initial value p−k to Expression (27). Using this keyword weight vector estimate value w−j again in Expression (25), the user profile estimate value p−k is generated. Here the WWW document/user profile matching unit 113 normalizes each element, and judges whether the similarity of the keyword weight vector estimate value wj−n and the previous keyword weight vector estimate value wj−n−1 is a predetermined value or more. Similarity here is calculated by the inner product sim (wj−n, wj−n−1) (see Expression (7)).
The keyword weight vector estimate value w−j and user profile estimate value p−k converged here are stored in the keyword vector holding unit 109a and user profile holding unit 108a as keyword weight vector wj and user profile pk respectively.
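The alternating estimation of Expressions (25) and (27) can be sketched in Python as follows. This is only an illustration under stated assumptions: the vector vk is taken to be row k of a K×N 0/1 access matrix and uj to be column j of the same matrix, every vector is rescaled to unit length on each pass, the reference vectors wj are normalized once at the start, and the defaults α = 0.5 and a threshold of 0.9 are example values. All function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def inner(a, b):
    """Inner product; equals the cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def estimate_profiles(w_hat, access):
    """Expression (25): p_k = W_hat v_k, with v_k = row k of `access` (assumption)."""
    dim = len(w_hat[0])
    return [
        normalize([sum(v_k[j] * w_hat[j][m] for j in range(len(w_hat)))
                   for m in range(dim)])
        for v_k in access
    ]


def estimate_keywords(p_hat, w_ref, access, alpha):
    """Expression (27): w_j = (1 - alpha) P_hat u_j + alpha w_j (u_j = column j)."""
    dim = len(w_ref[0])
    result = []
    for j, w_j in enumerate(w_ref):
        proj = [sum(access[k][j] * p_hat[k][m] for k in range(len(p_hat)))
                for m in range(dim)]
        result.append(normalize([(1.0 - alpha) * proj[m] + alpha * w_j[m]
                                 for m in range(dim)]))
    return result


def match(w, access, alpha=0.5, threshold=0.9, max_iter=100):
    """Repeat Expressions (25) and (27) until every w_j changes little."""
    w_ref = [normalize(w_j) for w_j in w]
    w_hat = w_ref
    for _ in range(max_iter):
        p_hat = estimate_profiles(w_hat, access)
        new_w = estimate_keywords(p_hat, w_ref, access, alpha)
        converged = all(inner(a, b) >= threshold for a, b in zip(new_w, w_hat))
        w_hat = new_w
        if converged:
            break
    return w_hat, estimate_profiles(w_hat, access)
```

In this sketch a document with an empty keyword weight vector picks up keywords from the profiles of the users who accessed it, which corresponds to the smoothing and interpolation effect described for the present embodiment.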
The rearranging unit 110 calculates the evaluation values using the user profile pk and keyword weight vector wj stored in the user profile holding unit 108a and keyword vector holding unit 109a, by one of Expression (8), Expression (11), Expression (21) and Expression (24).
Now the processing of the proxy device 100b constructed like this will be described.
Then when a search request is received from the user terminal 200, the search is requested to the search server 300 according to the request (S210), and when the search result is received from the search server 300, the rearranging unit 110 rearranges the search result by the matched keyword weight vector and user profile (S211).
If the similarity is a predetermined value or more, 0.9 or more, for example, it is judged that the user profile has converged, and the keyword weight vector wj−n+1 and user profile estimate value pk−n+1 are stored in the keyword vector holding unit 109a and user profile holding unit 108a as the keyword weight vector wj and user profile pk respectively. This data is used for calculating evaluation values for rearranging processing (see Expression (8), Expression (11), Expression (21) or Expression (24)).
Now the functional effect of the proxy device 100b of the present embodiment will be described. First the WWW document/user profile matching unit 113 generates a user profile pk−n based on the keyword weight vector wn=0 to be a reference value, and generates a new keyword weight vector wj−n+1 based on the generated user profile pk−n and a keyword weight vector wj to be a reference value. Then [the WWW document/user profile matching unit 113] generates a new user profile pk−n+1 based on the new keyword weight vector wj−n+1. And a similarity of the new user profile pk−n+1 and a user profile pk−n generated immediately before this new user profile is computed, and it is judged whether the similarity is a predetermined value or more. Here the user profile pk−n+1 and keyword weight vector wj−n+1 are repeatedly generated until the similarity becomes a predetermined value or more, and the evaluation values are calculated based on the keyword weight vector wj−n+1 and user profile pk−n+1 when the computed similarity becomes a predetermined value or more.
By generating the keyword weight vector and user profile to be interdependent, the user profile propagates to the keyword weight vector, thereby smoothing and interpolation of the user profile and keyword weight vector can be performed. Therefore the keyword weight vector of a document having a low document volume, such as mobile content, can be more accurate. Also the user profile of a user who does not access much can be compensated for by a peripheral user, and a search result with high conformity to the user can be provided.
Variant Forms of First to Third Embodiments
Now variant forms of the first embodiment to the third embodiment will be described. In each of these embodiments, the user terminal 200 has the access history holding unit 202, but the proxy device 100, 100a or 100b may have [the access history holding unit 202]. In this case, the access history need not be transferred from the user terminal 200, so the access history transfer unit 203 is unnecessary.
The first embodiment to third embodiment were described in the form of the device and method, but may be implemented in the form of a program. In other words, [the present invention] can be embodied as a document processing program by constructing each configuration using program modules. In concrete terms, a configuration the same as that in each block diagram of the first to third embodiments is modularized, and this program is stored in a storage medium (e.g. CD-ROM) and is read by a personal computer.
Fourth Embodiment
Now a method for calculating significance of a WWW document using HITS will be described. As mentioned in the section on related art, an authority is a page having high significance among the pages related to a keyword. It is desirable that a page to be an authority is displayed with a high ranking in the search result. A hub, on the other hand, is hidden data for discovering an authority. The HITS calculation step will now be described in concrete terms.
WWW documents to be the search target are extracted by keyword matching or the like. The top 200 documents, for example, are extracted in general, and are called “WWW document set R”. A WWW document to be an authority is ideally included in this group, but may not be, so WWW documents which are linked from the WWW documents belonging to the WWW document set R and WWW documents which link to the WWW documents belonging to the WWW document set R are extracted, and these documents become search target S.
As the following Expression (29) shows, an authority score ai and hub score hi are assigned to a WWW document belonging to search target S.
[Expression 29]
a=(a0, a1, . . . , ap, . . . aN−1)T
h=(h0, h1, . . . , hp, . . . hN−1)T (29)
The total number of WWW documents included in the search target S is N. The suffix T denotes the transposition of the matrix vector.
1. Initialization
The authority score a and hub score h are initialized as shown in the following Expression (30).
[Expression 30]
$a^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^N, \quad h^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$ (30)
Here t is a non-negative integer which indicates the number of repetitions of the operation, starting from <t=0>.
2. Updating Authority Score and Hub Score
[Expression 31]
a<t>
h<t> (31)
Expression (31) is updated by the calculation according to the link structure shown in the following Expression (32).
For each page p, the total of authority scores of pages to which the page p is linked is calculated, and the hub score hp of the page p is replaced with this total. And for each page p, the total of hub scores of pages which link to this page p is calculated, and the authority ap of the page p is replaced with this total.
[Expression 32]
For all $p \in S$: $h_p = \sum_{q:\ \text{link } p \to q} a_q$
For all $p \in S$: $a_p = \sum_{q:\ \text{link } q \to p} h_q$ (32)
3. Normalization
Normalization is performed so that the norm of column vector a of the authority score and column vector h of the hub score become 1 (see Expression (33)).
[Expression 33]
a<t>←a<t>/∥a<t>∥
h<t>←h<t>/∥h<t>∥ (33)
The above update processing and normalization processing are repeated until the authority score and hub score converge. Convergence normally takes several tens of iterations, and here [computing is repeated until] t=100 (see Expression (34)).
[Expression 34]
a<t=100>, h<t=100> (34)
Convergence of this computation is guaranteed by the existence of a solution to the eigenvalue problem of a matrix.
First the link structure is represented by the N×N square connection matrix shown in the following Expression (35).
The above repeated calculation is as shown in the following Expression (36).
[Expression 36]
$h^{\langle t+1 \rangle} \leftarrow C^T a^{\langle t \rangle}$
$a^{\langle t+1 \rangle} \leftarrow C h^{\langle t \rangle}$ (36)
Because of the above expressions and normalization processing, the authority score and hub score can be determined as shown in the following Expression (37).
[Expression 37]
$h^{\langle t=\infty \rangle} \leftarrow$ eigenvector corresponding to the maximum eigenvalue of $C^T C$
$a^{\langle t=\infty \rangle} \leftarrow$ eigenvector corresponding to the maximum eigenvalue of $C C^T$ (37)
The authority score does not depend on the initialization, but can be uniquely determined by the link structure. Therefore a document with high significance, that is, with a high authority score in this case, can be extracted from WWW documents having high conformity.
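The HITS calculation steps above (initialization, update, normalization and repetition) can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular search engine: pages are assumed to be numbered 0 to N−1, the link structure is given as a list of (p, q) pairs, both scores are computed from the values of the previous iteration as in Expression (36), and the stopping tolerance is an assumed value. The function names are hypothetical.

```python
import math


def normalize(scores):
    """Scale a score vector to unit Euclidean norm (Expression (33))."""
    norm = math.sqrt(sum(x * x for x in scores))
    return [x / norm for x in scores] if norm > 0.0 else list(scores)


def hits(links, n, max_iter=100, tol=1e-9):
    """Return (authority, hub) score lists for pages 0..n-1.

    links: iterable of (p, q) pairs meaning "page p links to page q".
    """
    if n == 0:
        return [], []
    links = list(links)
    a = [1.0] * n  # Expression (30): all scores start at 1
    h = [1.0] * n
    for _ in range(max_iter):
        new_h = [0.0] * n
        new_a = [0.0] * n
        for p, q in links:
            new_h[p] += a[q]  # hub score: sum of authority scores of linked-to pages
            new_a[q] += h[p]  # authority score: sum of hub scores of linking pages
        new_a, new_h = normalize(new_a), normalize(new_h)
        change = max(abs(x - y) for x, y in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < tol:  # assumed convergence test; the text simply runs to t=100
            break
    return a, h
```

For example, when pages 0, 1 and 3 all link to page 2, page 2 receives the whole authority score and a hub score of zero.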
The present embodiment uses the HITS calculation method. In concrete terms, in the HITS of prior art, the target is the link structure of the WWW documents. The present embodiment is characterized in that the conformity is calculated using the link structure involved in the access state of the user. Details will be described below.
The index holding unit 701 is a portion to store the index file (such information as keyword in a document) in a WWW document. It is preferable that the index file includes not only such information as a keyword and URL, but link information (URL) to other WWW documents included in this WWW document, as information to make searching easier. If the link information is written in the index file, each WWW document can be managed by a directed graph to indicate the relationship of links among each WWW document.
The primary searching unit 702 is a portion to search a WWW document including a searching word, which was input from WWW document browser 800 of the user terminal, from the index file stored in the index holding unit 701.
The primary index set holding unit 703 is a portion to store the WWW documents searched by the primary searching unit 702, as an initial WWW document set.
The access history holding unit 704 is a portion to store the access history of WWW documents in a WWW document browser 900 (including the WWW document browser 800), and stores ID information for specifying a user in association with a URL which indicates the accessed content.
The secondary searching unit 705 is a portion to specify, from the access history holding unit 704, users who accessed each WWW document of the WWW document set stored in the primary index set holding unit 703, and to search which WWW document the specified user accessed, thereby the WWW document set is extracted.
The secondary index set holding unit 706 is a portion to store the WWW document set extracted by the secondary searching unit 705.
The authority score calculation unit 707 is a portion to calculate the authority score. In concrete terms, [the authority score calculation unit 707] is implemented by the following processing.
The authority score calculation unit 707 extracts a user set U who accessed the WWW document set R conforming to the searching word, which was input via the WWW document browser 800, from the access history holding unit 704, and determines the WWW document set V which the users (WWW document browsers), specified by this user set U, accessed.
The reference information from the user set U to the WWW document set V is represented by the following Expression (38) in list format.
[Expression 38]
$E = \{(p, q) \mid \text{reference from } p \in U \text{ to } q \in V\}$ (38)
The authority score a and hub score h are represented by the vectors given by the following Expression (39), where M denotes a number of users of the user set U and N denotes a number of documents of the WWW document set V.
[Expression 39]
a=(a0, a1, . . . ap, . . . aN−1)T
h=(h0, h1, . . . hp, . . . hM−1)T (39)
As these vector representations show, the authority score is defined on the WWW document set V, and the hub score is defined on the user set U. Based on this, the processing shown in
S401: Initialization (See Expression (40))
[Expression 40]
$a^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$,
$h^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^M$ (40)
S402: Update
The following Expression (41) is calculated with reference to the reference information E.
[Expression 41]
For all $p \in U$: $h_p = \sum_{q:\ \text{link } p \to q} a_q$
For all $p \in V$: $a_p = \sum_{q:\ \text{link } q \to p} h_q$ (41)
S403: Normalization (See Expression (42))
[Expression 42]
a<t>←a<t>/∥a<t>∥
h<t>←h<t>/∥h<t>∥ (42)
S404: Convergence Judgment
The processing of S402 and S403 is repeated until the authority score a and hub score h converge. The processing count is also judged in parallel, so that the processing of S402 and S403 is not executed more than 100 times.
S405: t=t+1
If convergence is not reached in S404, 1 is added to t, and the processings in S402 and S403 are executed. As mentioned above, this processing is repeated until t=100. In this way, the authority score a and hub score h are calculated.
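Steps S401 to S405, together with the extraction of the user set U, the WWW document set V and the reference information E, can be sketched in Python as follows. This is an illustration under assumptions: the access history is modeled as (user ID, URL) pairs, the WWW document set R is given as a set of URLs, both scores are updated from the previous iteration's values, and the tolerance is an assumed convergence test. The function names are hypothetical.

```python
import math


def _normalize(scores):
    """Scale a dict of scores to unit Euclidean norm (S403)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm > 0.0 else dict(scores)


def access_hits(history, result_set, max_iter=100, tol=1e-9):
    """Return (authority per URL, hub per user) from an access history.

    history: iterable of (user_id, url) pairs.
    result_set: URLs matching the searching word (WWW document set R).
    """
    history = set(history)
    users = {u for u, url in history if url in result_set}  # user set U
    edges = {(u, url) for u, url in history if u in users}  # reference information E
    docs = {url for _, url in edges}                        # WWW document set V
    a = {d: 1.0 for d in docs}   # S401: initialization
    h = {u: 1.0 for u in users}
    for _ in range(max_iter):    # S404/S405: at most max_iter passes
        new_h = {u: 0.0 for u in users}
        new_a = {d: 0.0 for d in docs}
        for u, d in edges:       # S402: update along each reference
            new_h[u] += a[d]
            new_a[d] += h[u]
        new_a, new_h = _normalize(new_a), _normalize(new_h)  # S403
        change = max(
            [abs(new_a[d] - a[d]) for d in docs]
            + [abs(new_h[u] - h[u]) for u in users],
            default=0.0,
        )
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h
```

Because the hub score is attached to users rather than to linking pages, a document with no inbound links can still obtain a high authority score when users who act as hubs for the searching word accessed it.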
The WWW document collection unit 708 is a portion to collect WWW documents according to the index being held in the primary index set holding unit 703.
The rearranging unit 709 is a portion to rearrange the WWW documents collected by the WWW document collection unit 708 according to the WWW documents (index information) extracted by the secondary index set holding unit 706, and the authority score. By this rearrangement, the WWW documents are displayed in the sequence of the authority score in the WWW document browser 800, whereby WWW documents with more significance can be more easily accessed.
The functional effect of the present embodiment will now be described. In the document processing device 700 of the present embodiment, the primary searching unit 702 searches according to the searching word which was input via the WWW document browser 800, and the secondary searching unit 705 extracts a user set U of users who accessed the searched WWW document set R, extracts the WWW document set V of the WWW documents accessed by these users, and stores [the user set U and WWW document set V] in the secondary index set holding unit 706. The authority score calculation unit 707 can calculate the significance (authority score a) of each WWW document based on a degree of accesses by the users to the extracted WWW document set V (hub score h). By this, the significance of a WWW document whose access quantity and link quantity are low, such as mobile content, can be accurately calculated, and an accurate search can be implemented.
Variant Form of Fourth Embodiment
In the fourth embodiment, the WWW document set V, which is a set of WWW documents referred to by users, is determined based on a user set U, which is a set of users who visited the WWW document set R conforming to the searching word. However, this WWW document set V may become too large, or a WWW document whose conformity is low but whose number of accessing users is high (e.g. a specific popular portal site) may be extracted as an authority. Therefore a possible variant form is to perform the authority calculation based on an expanded WWW document set S, to which a WWW document set referred to by the WWW document set R and a WWW document set which refers to the WWW document set R are added, just like the prior art.
In other words, a user set U who visited a WWW document set R conforming to the searching word is determined as shown in
[Expression 43]
$E = \{(p, q) \mid \text{reference from } p \in U \text{ to } q \in S\}$ (43)
The authority score and hub score are represented by the following Expression (44), where M denotes the number of users [of the user set] U, and N denotes a number of documents of the WWW document set S.
[Expression 44]
a=(a0, a1, . . . ap, . . . aN−1)T
h=(h0, h1, . . . hp, . . . hM−1)T (44)
As these vector representations show, the authority score is defined on the set S, and the hub score is defined on the set U.
The calculation is performed according to the following steps. Since this is the same as the above mentioned HITS calculation method, details thereof are omitted.
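Step 1: Initialization (See Expression (45))
[Expression 45]
$a^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$,
$h^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^M$ (45)
Step 2: Update
The following Expression (46) is calculated with reference to the reference information E.
[Expression 46]
For all $p \in U$: $h_p = \sum_{q:\ \text{link } p \to q} a_q$
For all $p \in S$: $a_p = \sum_{q:\ \text{link } q \to p} h_q$ (46)
Step 3: Normalization (See Expression (47))
[Expression 47]
$a^{\langle t \rangle} \leftarrow a^{\langle t \rangle} / \lVert a^{\langle t \rangle} \rVert$
$h^{\langle t \rangle} \leftarrow h^{\langle t \rangle} / \lVert h^{\langle t \rangle} \rVert$ (47)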
Every time step 2 and step 3 are repeated, t is incremented by 1, and processing ends when t=100.
In order to execute the above processing according to the present variant form, in the document processing device 700 of the fourth embodiment, the secondary searching unit 705 extracts the WWW document set S, including other WWW documents which refer to each WWW document of the WWW document set R and other WWW document which each WWW document of the WWW document set R refers to, using the index file stored in the index holding unit 701. The secondary searching unit 705 extracts each WWW document of the WWW document set S which each user of the user set U refers to, and extracts the reference information E, then stores this information to the secondary index set holding unit 706.
The authority score calculation unit 707 calculates the authority score a using the reference information E, by the HITS method.
The functional effect of the document processing device 700 according to the present variant form will now be described. In the document processing device 700 of the present variant form, the primary searching unit 702 searches the WWW document set R according to the searching word which was input via the WWW document browser 800, and the secondary searching unit 705 extracts a user set U of users who accessed the searched WWW document set R according to the history information stored in the access history holding unit 704. The secondary searching unit 705 also extracts other WWW documents which each extracted WWW document refers to and other WWW documents which refer to each WWW document, as the WWW document set S, based on the data (index file) stored in the index holding unit 701, with which the reference relationships among WWW documents can be managed as a directed graph. The secondary index set holding unit 706 stores reference information E which indicates that each user of the user set U referred to the WWW document set S. The authority score calculation unit 707 can calculate the significance (authority score a) of each WWW document based on a degree of accesses by each user of the user set U to the WWW document set S (hub score h). By this, significance of a WWW document can be calculated accurately, and an accurate search can be implemented.
Fifth Embodiment
A fifth embodiment will now be described. According to the fifth embodiment, unlike the fourth embodiment, the WWW documents and users are not distinguished in the link structure, and are handled as the same nodes. The link structure is not 0, 1, but is handled as a continuous value [0.0, 1.0], for example. The data definition in the present embodiment will be described with reference to
A user set U of users who accessed a WWW document set R conforming to a searching word is determined. On the other hand, a WWW document set S, which the WWW document set R referred to and which referred to the WWW document set R, is determined. Then a node set W is generated by combining the WWW document set S and the user set U. The number of nodes belonging to the node set W is the sum of the number of WWW documents N and the number of users M, and is denoted by L=N+M for simplification.
A connection matrix is defined as shown by the following Expression (48).
Here it is assumed that in general 0<t≦s≦1.0. Here t denotes a weighting factor to a reference when the user refers to a WWW document, and s denotes a weighting factor to a reference when a WWW document in the WWW document set S is referred to. The weight [0, 1.0] is introduced assuming that a reference between documents and a reference between users cannot be handled exactly the same way. For example, s=1.0 is set, and t is determined based on experiment. t=0.001, for example, can be used.
The authority score and hub score are represented by the vectors shown in the following Expression (49).
[Expression 49]
a=(a0, a1, . . . ap, . . . aL−1)T
h=(h0, h1, . . . hp, . . . hL−1)T (49)
As these vector representations show, the authority score is defined on the set W, and the hub score is also defined on the set W.
The calculation is performed according to the following steps. This calculation processing is the same as the HITS method, as mentioned above.
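Step 1: Initialization (See Expression (50))
[Expression 50]
$a^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^L$,
$h^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^L$ (50)
Step 2: Update
The following Expression (51) is calculated with reference to the reference information E.
Step 3: Normalization (See Expression (52))
[Expression 52]
$a^{\langle t \rangle} \leftarrow a^{\langle t \rangle} / \lVert a^{\langle t \rangle} \rVert$
$h^{\langle t \rangle} \leftarrow h^{\langle t \rangle} / \lVert h^{\langle t \rangle} \rVert$ (52)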
Step 2 and Step 3 are repeated until the result converges. If there is no convergence, t is incremented by 1 each time, and processing ends when t=100.
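The calculation on the combined node set W can be sketched in Python as follows. Since the connection matrix is not reproduced in this text, the sketch rests on an assumption: a reference from one WWW document to another carries the weight s, a reference from a user to a WWW document carries the weight t, and all other entries are 0. The update is the weighted form of the HITS update, and the node labels, function names and tolerance are hypothetical.

```python
import math


def _normalize(scores):
    """Scale a dict of scores to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm > 0.0 else dict(scores)


def weighted_hits(doc_links, user_refs, s=1.0, t=0.001, max_iter=100, tol=1e-9):
    """Authority and hub scores on the node set W (documents plus users).

    doc_links: (source_doc, target_doc) pairs, weighted by s (assumption).
    user_refs: (user, doc) pairs, weighted by t (assumption).
    """
    weights = {}
    for src, dst in doc_links:
        weights[(("doc", src), ("doc", dst))] = s
    for user, doc in user_refs:
        weights[(("user", user), ("doc", doc))] = t
    nodes = {node for edge in weights for node in edge}  # node set W, L = N + M
    a = dict.fromkeys(nodes, 1.0)  # Step 1: initialization
    h = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iter):
        new_a = dict.fromkeys(nodes, 0.0)
        new_h = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in weights.items():  # Step 2: weighted update
            new_h[src] += weight * a[dst]
            new_a[dst] += weight * h[src]
        new_a, new_h = _normalize(new_a), _normalize(new_h)  # Step 3
        change = max(
            [abs(new_a[n] - a[n]) for n in nodes]
            + [abs(new_h[n] - h[n]) for n in nodes],
            default=0.0,
        )
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h
```

With a small t, references by users shift the scores only slightly relative to links between documents, which matches the remark that the two kinds of reference cannot be handled in exactly the same way.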
Functions of a document processing device 700a to implement this concrete processing will now be described.
The primary searching unit 702a generates a WWW document set S based on the searching word. The primary index set holding unit 703a stores the generated WWW document set S. Then the secondary searching unit 705a acquires the WWW document set S, WWW document set V and reference information E, and generates a node set W and connection matrix C. An authority score calculation unit 707a can calculate the authority score by executing the processings in Step 1 to Step 4, as mentioned above.
The functional effect of the document processing device 700a according to the present embodiment will now be described. In the document processing device 700a, the index holding unit 701 stores the data with which reference relationships among the WWW documents can be managed as a directed graph, and the primary searching unit 702a searches according to the searching word which was input from the WWW document browser 800. The secondary searching unit 705a extracts the user set U of users who accessed the WWW document set R including the searched WWW documents. The secondary searching unit 705a also extracts other WWW documents which each WWW document refers to, and other WWW documents which refer to each WWW document, as the WWW document set S, based on the data with which the reference relationships can be managed as a directed graph. The secondary searching unit 705a combines the user set U of the extracted users and the WWW document set S of the extracted WWW documents, and generates one node set W. Then a weight is assigned to the degree of reference among the WWW documents of the generated node set W, and to the degree of accesses by each user to each WWW document, respectively (connection matrix C), and the significance (authority score a) of each WWW document is calculated using this connection matrix C. By this, significance of a WWW document can be calculated accurately, and an accurate search can be implemented.
Variant Form of Fifth Embodiment
A possible variant form of the fifth embodiment is determining the authority score directly from the connection matrix C, as shown in
Taher Haveliwala: “Efficient Computation of Page Rank,” Stanford University Technical Report, September 1999, [online], [searched on Dec. 8, 2008], Internet: <http://infolab.stanford.edu/%7Etaherh/papers/efficient-pr.pdf>
Sixth Embodiment
A sixth embodiment will now be described. In the fourth and fifth embodiments, the document processing device, assuming use in the search service, was described, but in the sixth embodiment, a device for calculating significance only based on the access pattern of users, for a more general arbitrary searching word, is described.
According to the present embodiment, a hub vector for one WWW document is calculated, and significance of this WWW document to the searching word can be evaluated by fixing this hub vector, and checking what kind of individuals visited this WWW document. Details will be described below.
As
[Expression 53]
$E = \{(p, q) \mid \text{reference from } p \in U \text{ to } q \in S\}$ (53)
Here, the authority score and hub score are represented as the vectors of the following Expression (54), where M denotes the number of users in the user set U and N denotes the number of documents in the WWW document set S.
[Expression 54]
a=(a0, a1, . . . ap, . . . aN−1)T
h=(h0, h1, . . . hp, . . . hM−1)T (54)
As these vector representations show, the authority score is defined on the WWW document set S, and the hub score is defined on the user set U. The calculation is performed according to the following steps.
Step 1: Initialization (See Expression (55))
[Expression 55]
$a^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$,
$h^{\langle t=0 \rangle} = (1, 1, \ldots, 1)^T \in \mathbb{R}^M$ (55)
Step 2: Update
The following Expression (56) is calculated with reference to the reference information E.
[Expression 56]
For all $p \in U$: $h_p = \sum_{q:\ \text{link } p \to q} a_q$
For all $p \in S$: $a_p = \sum_{q:\ \text{link } q \to p} h_q$ (56)
Step 3: Normalization (See Expression (57))
[Expression 57]
a<t>←a<t>/∥a<t>∥
h<t>←h<t>/∥h<t>∥ (57)
Every time Step 2 and Step 3 are repeated, t is incremented by 1, and processing ends when t=100.
The hub score calculation unit 707b of the present embodiment is roughly the same as the authority score calculation unit 707 of the fourth and fifth embodiments, but differs in that it outputs a hub vector. Using the hub vector calculated by the hub score calculation unit 707b, the significance calculation unit 709a performs the following calculation for the WWW document which was arbitrarily searched and acquired by the primary searching unit 702.
It is assumed that the number of visits of a user who visited this WWW document is recorded in this arbitrarily searched WWW document. In the present embodiment, this visit count is called “visit vector u”, and is represented by a column vector of the following Expression (58). M denotes a number of users of the user set U.
[Expression 58]
u=(u0, u1, . . . uM−1)T (58)
The hub vector calculated by the hub score calculation unit 707b represents, for a specific searching word, the users who generally act as hubs on a certain topic. Therefore the significance is calculated by the following Expression (59) in the same way as Expression (7).
[Expression 59]
Significance=sim(u,h) (59)
If the cosine distance 1−sim (u, h) between the hub vector and visit vector is small, it can be judged that the WWW document is a WWW document which is the appropriate result for a predetermined searching word, that is, close to the searching word, and significance is high.
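Expression (59) can be illustrated with the following Python sketch, which ranks arbitrary WWW documents by the similarity between their visit vector u and a fixed hub vector h. It assumes that sim is the cosine similarity, that visit vectors and the hub vector are indexed by the same M users, and that a zero vector yields a similarity of 0. The function names and the sample data are hypothetical.

```python
import math


def sim(u, h):
    """Cosine similarity between visit vector u and hub vector h."""
    dot = sum(x * y for x, y in zip(u, h))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_h = math.sqrt(sum(y * y for y in h))
    if norm_u == 0.0 or norm_h == 0.0:
        return 0.0  # assumption: no visits or no hubs means no significance
    return dot / (norm_u * norm_h)


def rank_by_significance(visit_vectors, hub):
    """Return document keys sorted by Expression (59), highest significance first."""
    return sorted(
        visit_vectors,
        key=lambda url: sim(visit_vectors[url], hub),
        reverse=True,
    )


# Hypothetical data: three users, of whom the first two are strong hubs.
hub = [0.8, 0.6, 0.0]
visits = {"a": [3, 2, 0], "b": [0, 0, 5], "c": [1, 0, 4]}
ranking = rank_by_significance(visits, hub)
```

In this sample, document "a" is visited mainly by the hub users and ranks first, while document "b" is visited only by a user with a hub score of zero and ranks last.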
As a variant form, other similarities may be used instead of the cosine distance between the visit vector u and hub vector h. For example, a similarity can be calculated by using the inner product of the visit vector u and hub vector h. To express a non-similarity by distance, an absolute distance, Euclidean distance, Mahalanobis (generalized) distance or Minkowski distance, for example, can be used instead of the cosine distance.
The functional effect of the document processing device 700b according to the present embodiment will now be described. In the document processing device 700b, the primary searching unit 702a searches according to the searching word which was input via the WWW document browser 800, and extracts a WWW document set R. The secondary searching unit 705a extracts a user set U of users who accessed the searched WWW document set R, and extracts, as the WWW document set S, other WWW documents which each extracted WWW document refers to and other WWW documents which refer to each WWW document, based on the data (index file) with which the reference relationships among WWW documents stored in the index holding unit 701 can be managed as a directed graph. The hub score calculation unit 707b calculates a hub score h, which indicates a degree of accesses to each WWW document of the extracted WWW document set S by each user of the extracted user set U, and the significance calculation unit 709a calculates the significance of an arbitrary WWW document based on a degree of matching between the hub score h and the visit vector u of the users who visited that WWW document. By this, the significance of a WWW document can be accurately calculated, and an accurate search can be implemented.
In the method shown above, assuming that the WWW documents acquired by a searching word are documents related to a certain topic, the hubs among the users are determined based on the user visit behavior to these documents, and, with these hubs fixed, the significance of an arbitrary WWW document with respect to that topic is obtained. This method allows WWW documents to be classified by category, using the user visit behavior as observation data and the WWW documents of the initial primary searching result as master data.
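The flow described above can be sketched end to end on toy data. This is a simplified stand-in, not the embodiment itself: the hub score of each user is approximated here by a normalized count of that user's accesses to the document set S, rather than by the iterative calculation of Steps 1 to 3, and the data structures (`index`, `links`, `log`) and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

AccessLog = List[Tuple[str, str]]  # (user, document) pairs


def primary_search(index: Dict[str, Set[str]], word: str) -> Set[str]:
    """Document set R: documents indexed under the searching word."""
    return set(index.get(word, set()))


def extract_users(log: AccessLog, docs: Set[str]) -> List[str]:
    """User set U: users who accessed any document in docs (sorted for a stable order)."""
    return sorted({user for user, doc in log if doc in docs})


def secondary_documents(links: Dict[str, Set[str]], r: Set[str]) -> Set[str]:
    """Document set S: documents referred to by, or referring to, documents in R."""
    s: Set[str] = set()
    for src, targets in links.items():
        if src in r:
            s |= targets   # documents that a member of R refers to
        if targets & r:
            s.add(src)     # documents that refer to a member of R
    return s


def hub_vector(log: AccessLog, users: List[str], s: Set[str]) -> List[float]:
    """Simplified hub score (assumption): L2-normalized access counts to S per user."""
    counts = Counter(user for user, doc in log if doc in s)
    h = [float(counts[u]) for u in users]
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h] if norm else h


def visit_vector(log: AccessLog, users: List[str], doc: str) -> List[float]:
    """Visit vector u of one document: visit count by each user in U."""
    counts = Counter(user for user, d in log if d == doc)
    return [float(counts[u]) for u in users]


def significance(u: List[float], h: List[float]) -> float:
    """Expression (59) with cosine similarity; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nh = math.sqrt(sum(x * x for x in h))
    if nu == 0.0 or nh == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, h)) / (nu * nh)


# Toy data (entirely illustrative).
index = {"jazz": {"d1"}}
links = {"d1": {"d2"}, "d3": {"d1"}}
log = [("alice", "d1"), ("bob", "d1"), ("alice", "d2"), ("alice", "d3"),
       ("bob", "d2"), ("alice", "d4"), ("carol", "d5")]

R = primary_search(index, "jazz")
U = extract_users(log, R)
S = secondary_documents(links, R)
h = hub_vector(log, U, S)
sig_d4 = significance(visit_vector(log, U, "d4"), h)  # visited by a hub user
sig_d5 = significance(visit_vector(log, U, "d5"), h)  # visited by no user in U
```

In this toy data, document d4 is visited by a user with a high hub score and therefore receives a higher significance than d5, which is visited only by a user outside U.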
Claims
1. A document processing method, comprising:
- a collection step of collecting access history of a user;
- a document similarity computing step of computing a document similarity, which indicates similarity between documents, by one user pattern which indicates a plurality of users who have accessed one document and another user pattern which indicates a plurality of users who have accessed another document, according to the access history collected in the collection step;
- a keyword weight vector correction step of correcting a keyword weight vector of the one document using the document similarity computed in the document similarity computing step; and
- an evaluation value calculation step of calculating an evaluation value for input information for searching, based on the keyword weight vector corrected in the keyword weight vector correction step.
2. The document processing method according to claim 1, wherein the keyword weight vector correction step further comprises a step of correcting a keyword weight vector in the other document using the document similarity, and correcting a keyword weight vector in the one document using the corrected keyword weight vector.
3. The document processing method according to claim 1, further comprising:
- a user similarity computing step of computing user similarity, which indicates similarity between users, by one document pattern which indicates a plurality of documents accessed by one user and another document pattern which indicates a plurality of documents accessed by another user, according to the access history collected in the collection step; and
- a user profile correction step of correcting a user profile which indicates characteristics of the one user using the user similarity computed in the user similarity computing step, wherein
- the evaluation value calculation step further comprises a step of calculating the evaluation value for the input information for searching based on the user profile of the one user corrected in the user profile correction step.
4. The document processing method according to claim 3, wherein the user profile correction step further comprises a step of correcting a user profile of another user using the user similarity and correcting the user profile of the one user based on the corrected user profile.
5. The document processing method according to claim 1, further comprising an acquisition step of acquiring significance information which indicates a significance attached to each document, wherein
- the evaluation value calculation step further comprises a step of calculating an evaluation value for the input information for searching, using the significance information acquired in the acquisition step.
6. The document processing method according to claim 1, wherein
- the evaluation value calculation step further comprises a step of calculating an evaluation value using the corrected keyword weight vector when the corrected keyword weight vector in the one document exists, and calculating an evaluation value using the keyword weight vector before correction when the corrected keyword weight vector in the one document does not exist.
7. The document processing method according to claim 1 further comprising an acquisition step of acquiring a document from a search server according to an access by a user, wherein accesses accepted in the acquisition step are collected in the collection step as the access history.
8. A document processing method, comprising:
- a collection step of collecting access history of a user;
- a document similarity computing step of computing a document similarity, which indicates similarity between documents, by one user pattern which indicates a plurality of users who have accessed one document and another user pattern which indicates a plurality of users who have accessed another document, according to the access history collected in the collection step;
- a keyword weight vector correction step of correcting a keyword weight vector of the one document using the document similarity computed in the document similarity computing step;
- an acquisition step of acquiring significance information which indicates a significance attached to each document;
- a significance correction step of distinguishing a first user pattern which indicates users who have accessed one document during a first time period, and a second user pattern which indicates users who have accessed one document during a second time period, according to the access history of users collected in the collection step, and correcting the significance of the one document based on the similarity of the first user pattern and the second user pattern and a number of accesses to the one document; and
- an evaluation value calculation step of calculating an evaluation value for input information for searching, based on the keyword weight vector corrected in the keyword weight vector correction step, and the significance information corrected in the significance correction step.
9. The document processing method according to claim 1, further comprising an output step of outputting the search result searched by the user according to the evaluation value calculated in the evaluation value calculation step.
10. A document processing method, comprising:
- a first generation step of generating a user profile based on a keyword weight vector that is to be a reference value;
- a second generation step of generating a new keyword weight vector based on the user profile generated in the first generation step and the keyword weight vector that is to be a reference value;
- a third generation step of generating a new user profile based on the new keyword weight vector generated in the second generation step;
- a user profile similarity generation step of computing similarity between the new user profile generated in the third generation step and the user profile generated immediately before the new user profile; and
- an evaluation value calculation step of calculating an evaluation value based on the similarity computed in the user profile similarity generation step, the keyword weight vector and user profile.
11. The document processing method according to claim 10, further comprising a judgment step of judging whether the similarity generated in the user profile similarity generation step is a predetermined value or more, wherein the evaluation value calculation step further comprises a step of calculating the evaluation value based on the keyword weight vector and user profile when the similarity computed in the user profile similarity generation step becomes a predetermined value or more.
12. A document processing device, comprising:
- access history collection means for collecting access history of a user;
- document similarity computing means for computing a document similarity, which indicates similarity between documents, by a user pattern which indicates a plurality of users who have accessed one document and a user pattern which indicates a plurality of users who have accessed another document, according to the access history collected by the collection means;
- keyword weight vector correction means for correcting a keyword weight vector of the one document, using the document similarity computed by the document similarity computing means; and
- evaluation value calculation means for calculating an evaluation value for input information for searching, based on the keyword weight vector corrected by the keyword weight vector correction means.
13. A search system, comprising:
- a user terminal for storing access history;
- an information collection device for generating a keyword weight vector of a document accessed by the user terminal; and
- the document processing device according to claim 12, for acquiring the access history of the user terminal and the keyword weight vector generated by the information collection device.
14. A document processing program, comprising:
- a collection module for collecting access history of a user;
- a document similarity computing module for computing a document similarity which indicates similarity between documents, by a user pattern which indicates a plurality of users who have accessed one document and a user pattern which indicates a plurality of users who have accessed another document, according to the access history collected by the collection module;
- a keyword weight vector correction module for correcting a keyword weight vector of the one document, using the document similarity computed by the document similarity computing module; and
- an evaluation value calculation module for calculating an evaluation value for input information for searching, based on the keyword weight vector corrected by the keyword weight vector correction module.
15. A document processing device, comprising:
- primary WWW document extraction means for extracting WWW documents according to a searching word;
- user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means;
- secondary WWW document extraction means for extracting a WWW document set of WWW documents accessed by the users extracted by the user extraction means; and
- significance calculation means for calculating significance of the WWW documents extracted by the primary WWW document extraction means based on a degree of accesses by users to the WWW document set extracted by the secondary WWW document extraction means.
16. The document processing device according to claim 15, wherein the significance calculation means calculates the significance of a WWW document based on a degree of accesses by each user of the user set extracted by the user extraction means.
17. A document processing device, comprising:
- primary WWW document extraction means for extracting WWW documents according to a searching method;
- user extraction means for extracting a user set of users who accessed the WWW documents extracted by the primary WWW document extraction means;
- data structure holding means for holding data for which reference relationships among the WWW documents can be managed as a directed graph;
- secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each WWW document, based on the data stored in the data structure holding means; and
- significance calculation means for calculating significance of the WWW documents extracted by the primary WWW document extraction means based on a degree of accesses by the users extracted by the user extraction means to the WWW document set extracted by the secondary WWW document extraction means.
18. A document processing device, comprising:
- access history holding means for holding an access history to a WWW document by a plurality of users;
- data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph;
- primary WWW document extraction means for extracting WWW documents according to a searching word;
- user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the primary WWW document extraction means from the access history holding means;
- secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means, and extracting one node set by adding the user set extracted by the user extraction means and the WWW document set of the extracted WWW documents; and
- significance calculation means for calculating significance of the WWW documents by weighting a degree of being referred to among the WWW documents in the node set extracted by the secondary WWW document extraction means and a degree of accesses by each of the users to each of the WWW documents respectively.
19. A document processing device, comprising:
- data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph;
- primary WWW document extraction means for extracting WWW documents according to a searching word;
- user extraction means for extracting a user set of users who have accessed the WWW documents extracted by the extraction means from the access history holding means;
- secondary WWW document extraction means for extracting other WWW documents which each WWW document extracted by the primary WWW document extraction means refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means;
- hub score calculation means for calculating a hub score indicating a degree of accesses by each user of the user set extracted by the user extraction means to each WWW document extracted by the secondary WWW document extraction means; and
- significance calculation means for calculating significance based on a degree of matching of a visit vector of users who have visited a WWW document, included in any of the WWW documents and the hub score calculated by the hub score calculation means.
20. A document processing method, comprising:
- a primary WWW document extraction step of extracting WWW documents according to a searching word;
- a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step;
- a secondary WWW document extraction step of extracting a WWW document set of WWW documents accessed by the users extracted in the user extraction step; and
- a significance calculation step of calculating significance of the WWW documents extracted in the primary WWW document extraction step based on a degree of accesses by the users to the WWW document set extracted in the secondary WWW document extraction step.
21. A document processing method for a document processing device having data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, the method comprising:
- a primary WWW document extraction step of extracting WWW documents according to a searching word;
- a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step;
- a secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each WWW document, based on the data stored in the data structure holding means; and
- a significance calculation step of calculating significance of the WWW documents extracted in the primary WWW document extraction step based on a degree of accesses by the users extracted in the user extraction step to the WWW document set extracted in the secondary WWW document extraction step.
22. A document processing method for a document processing device having access history holding means for holding history of access to a WWW document by a plurality of users, and data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, the method comprising:
- a primary WWW document extraction step of extracting WWW documents according to a searching word;
- a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step from the access history holding means;
- a secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means, and extracting one node set by adding the user set extracted in the user extraction step and the WWW document set of the extracted WWW documents; and
- a significance calculation step of calculating significance of the WWW documents by weighting a degree of being referred to among the WWW documents in the node set extracted in the secondary WWW document extraction step and a degree of accesses by each of the users to each of the WWW documents respectively.
23. A document processing method for a document processing device having access history holding means for holding history of access to a WWW document by a plurality of users, and data structure holding means for holding data for which reference relationships among WWW documents can be managed as a directed graph, the method comprising:
- a primary WWW document extraction step of extracting WWW documents according to a searching word;
- a user extraction step of extracting a user set of users who have accessed the WWW documents extracted in the primary WWW document extraction step from the access history holding means;
- a secondary WWW document extraction step of extracting other WWW documents which each WWW document extracted in the primary WWW document extraction step refers to, and other WWW documents which refer to each of the WWW documents, based on the data stored in the data structure holding means;
- a hub score calculation step of calculating a hub score which indicates a degree of accesses by each user of the user set extracted in the user extraction step to each WWW document extracted in the secondary WWW document extraction step; and
- a significance calculation step of calculating significance based on a degree of matching of a visit vector of users who have visited a WWW document included in any of the WWW documents and the hub score calculated in the hub score calculation step.
Type: Application
Filed: Apr 21, 2009
Publication Date: Oct 22, 2009
Patent Grant number: 8176033
Applicant: NTT DoCoMo, Inc. (Chiyoda-ku)
Inventors: Minoru Etoh (Yokohama-shi), Takehiro Nakayama (Setagaya-ku), Yoshikazu Akinaga (Fujisawa-shi)
Application Number: 12/427,302
International Classification: G06F 17/30 (20060101); G06F 7/00 (20060101); G06F 7/20 (20060101);