SYSTEM AND METHOD FOR ANNOTATING DOCUMENTS
Methods, apparatus and articles of manufacture therefor, are disclosed for annotating documents. An embodiment for annotating documents may be performed by the method of: retrieving a document selected for display by a user; locating sub-document elements in content of the retrieved document; computing a similarity measure for each of the located sub-document elements; identifying similarity measures of annotated sub-document elements and the located sub-document elements that indicate a correspondence there between; augmenting the located sub-document elements of the retrieved document with annotations of those annotated sub-document elements that have comparable similarity measures; displaying the retrieved document augmented with annotations.
Priority is claimed from U.S. Provisional Application No. 60/890,464, filed Feb. 16, 2007, entitled “System And Method For Annotating Documents And Searching Annotated Document Collections”, which is incorporated herein by reference (Docket No. 20070042-US-PSP). In addition, cross-reference is made to the following U.S. patent application Ser. No. ______, entitled “System And Method For Annotating Documents Using A Viewer” (Docket No. 20070042Q-US-NP) and Ser. No. ______, entitled “System And Method For Searching Annotated Document Collections” (Docket No. 20070485-US-NP) that (a) are concurrently filed herewith, (b) are assigned to the same assignee as the present invention, (c) are incorporated in this patent application by reference, and (d) claim priority to U.S. Patent Provisional Application Ser. No. 60/890,464.
BACKGROUND
The following relates generally to methods, apparatus and articles of manufacture therefor, for annotating documents, and subsequently sharing such annotations and searching annotated document collections.
Web-based services are available on the Internet today that enable social tagging of web pages, such as Yahoo's MyWeb and del.icio.us. Such web-based services allow users to tag web documents (such as web pages) of interest for sharing or later recalling the web documents by allowing users to bookmark a web document and attach a set of freely chosen tags (or keywords) to the web document. Also, users may elect to share their bookmarks or tags with other users, which may subsequently be searched and browsed by the other users.
In addition to allowing users to discover bookmarked pages via tags defined and shared by other users, data from social tagging can also be used to enhance document search. Social tagging systems, however, are limited as they do not account for the nature of the content of tagged web pages (e.g., that the content of web pages may be dynamic, or that the content of one web page may be similar to that of another web page). For example, unless a user reviews and updates tags to web documents previously defined and shared with other users, each user-specified tag associated with a URL (universal resource locator) remains the same, even as a sub-document element of the underlying content of the web page pointed to by the URL changes in a way that less accurately reflects, or no longer reflects, the reason why the tag was applied to the document.
Further, available social tagging systems do not account for the similarity between published web documents. For example, different web sites may publish the same or a very similar news story. Because available social tagging systems do not account for the similarity between published content in different web documents, they are not adapted to propagate tags to similar content. Such propagation of tagged information would advantageously simplify a user's effort to tag similarly published content. Also, available social tagging systems are not integrated within a web browser (or reader). Instead, available social tagging systems require users to access a web page that is independent of the web document that is being read. Such lack of interoperability encumbers the user's ability to refer to the content of a web document at the same time a tag is created or reviewed for the web document.
Accordingly, there continues to be a need for systems and methods for supporting in situ tagging of sub-document elements (such as paragraphs of web documents) and the sharing of such tagged data (or more generally annotated data). A solution for tagging sub-document elements of web documents that is integral to web browsers would advantageously reduce the amount of cognitive and interaction overhead that is required to annotate web pages. Further, by providing an integral solution that facilitates social tagging of web pages, users would advantageously be more likely to collaborate and share tagged data. Also, by propagating tags to web pages with similar content and accounting for the dynamicity of web pages, the integrity of the association between tags and sub-document elements of a web page is maintained.
Further, there continues to be a need for improved systems and methods for searching collections of documents that have been tagged (or more generally annotated) through collaborative tagging (or more generally collaborative annotation). It would therefore be advantageous to provide improved systems and methods for searching tag-based collections of documents to increase the accuracy and/or precision of searching such document collections.
These and other aspects of the disclosure will become apparent from the following description read in conjunction with the accompanying drawings wherein the same reference numerals have been applied to like parts and in which:
A. Definition of Terms
The terms defined below have the indicated meanings throughout this application, including the claims and the figures:
“Document” or “web document” is used herein to mean a collection of electronic data that may define a variable number of pages depending on how the collection of electronic data is formatted when viewed, such as documents that may be viewed using a web browser (e.g., web pages, images, word documents, and documents in a portable document format (pdf)). The electronic data making up a document may consist of static and/or dynamic content.
“Sub-document element” is used herein to mean an element of a document's structure that when taken on its own is less than the whole of the document, which sub-document element may be of a type selected from the set of words, images, phrases, sentences, paragraphs, pages, sections, and chapters.
“Co-user” is used herein to mean any user (e.g., an individual or alias of an individual) or group of users (e.g., a distribution list of individuals within a group or organization) that may be identified to or known to a “user” of a general purpose computer (as may be permitted using access controls and/or blocking), which may be known or made known to the user of the general purpose computer by, for example, using an identifier such as a user or login name.
B. Operating Environment for Dynamic Annotation and Search
C. Dynamic Annotation Elements and Operations
Initially at 302, the web browser 110 of the client-side application module 106 is initialized with an annotation plug-in 108, which includes operations for communicating with server-side application module 112 and operations for augmenting documents for enabling both (a) user specification of annotations of a document displayed using web browser 110, and (b) display of user-annotated and co-user annotated data. Generally, operations performed on the client-side between 304 and 306 and on the server-side between 305 and 307 concern the augmentation of web pages with annotations and in preparation for user annotation, while operations on the client-side between 306 and 308 and on the server side between 307 and 309 concern user annotation of augmented web pages. In another embodiment, the functions of the annotation plug-in 108 can be provided by a proxy server (operating on the network 102) that augments documents with annotation functions before passing the documents to the user.
At 310, the web browser 110 receives a user request to load a web page for display. At 312, the requested web page is accessed using, for example, a URL (universal resource locator) that identifies a location on the network 102 of a server such as a web page server 118 storing the web page. At 314, the annotation plug-in 108, in response to the web page request at 312, communicates the requested web page (e.g., by sending its URL) to annotation servlet 126 through web server 128 to request that a service be performed by the server-side annotation module 114 to identify annotations made by the user and any identified co-users to content similar to that of the requested web page.
At 316, the server-side annotation module 114 retrieves a copy of the subject web page (e.g., using the URL provided at 314) from the web page server 118. Alternatively, a copy of the web-page retrieved at 312 may be provided by the client-side application module 106 along with the service request at 314. At 318, the server-side annotation module 114 identifies in the retrieved web page one or more web page sub-document elements, which may be of a single type, such as a paragraph, or a combination of types, such as paragraphs and sections. At 320, a similarity measurement is computed for those sub-document elements identified at 318 that are associated with the user and selected co-users using similarity measure calculator 124. The similarity measure may be computed based on any one or more factors (that act as a unique identifier or fingerprint), such as for example: (a) the length of the words appearing in the sub-document element, (b) the first characters of the first n words that appear in the sub-document element, (c) the frequency of similar non-stop words appearing in the sub-document element, and (d) MD5 (Message-Digest algorithm 5), a cryptographic hash function. In one embodiment, each measure of similarity (or fingerprint) is a hash value of the associated sub-document element for which it is computed. Once computed, the fingerprint of annotated sub-document elements stored at the annotation server 113 may be compared against the fingerprint of the sub-document elements of the retrieved web page identified at 318.
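The fingerprint computation at 320 can be pictured with a minimal sketch, assuming paragraphs as the sub-document elements. The function names and the whitespace/case normalization step are illustrative assumptions, not details from the description; the sketch covers only factor (b), the first characters of the first n words, and factor (d), an MD5 hash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization step)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def md5_fingerprint(element_text: str) -> str:
    """Factor (d): an MD5 hash of the element's normalized text."""
    return hashlib.md5(normalize(element_text).encode("utf-8")).hexdigest()


def initials_fingerprint(element_text: str, n: int = 8) -> str:
    """Factor (b): the first characters of the first n words."""
    words = normalize(element_text).split()
    return "".join(word[0] for word in words[:n])


def fingerprint_elements(paragraphs: list[str]) -> dict[str, str]:
    """Map each paragraph's MD5 fingerprint to the paragraph text."""
    return {md5_fingerprint(p): p for p in paragraphs}
```

A hash-based fingerprint changes completely when any character of the normalized text changes, so it detects identical content only; a factor such as (b) or (c) is more tolerant of small edits.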
At 322, web page sub-document elements (having associated annotations and similarity measures) previously annotated either by the user or co-users that are recorded in database 122 with similarity measures comparable (i.e., are likely to be the same or similar fingerprints) to the similarity measures computed at 320 are identified. At 324, the stored annotated data of the sub-document elements identified at 322 is provided to the client-side application module 106.
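The lookup at 322 might be sketched as follows. The `StoredAnnotation` record and the function name are hypothetical, and exact fingerprint equality is used here as a stand-in for "comparable" similarity measures; a real comparison could be fuzzier.

```python
from dataclasses import dataclass


@dataclass
class StoredAnnotation:
    fingerprint: str   # similarity measure recorded with the annotation
    user: str          # the user or co-user who made the annotation
    tags: list[str]    # textual annotation data


def find_matching_annotations(page_fingerprints, stored, allowed_users):
    """Group stored annotations by fingerprint, keeping only those whose
    fingerprint occurs on the page and whose author is the user or a
    selected co-user."""
    wanted = set(page_fingerprints)
    matches = {}
    for annotation in stored:
        if annotation.fingerprint in wanted and annotation.user in allowed_users:
            matches.setdefault(annotation.fingerprint, []).append(annotation)
    return matches
```

Because the lookup is keyed on the fingerprint rather than the URL, an annotation follows its content to any page that contains a matching sub-document element.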
At 326, the web page is augmented for display and (further) annotation by: making, at 326(A), each word on the web page separately selectable; inserting, at 326(B), user name labels at the end of a sub-document element and associated stored user's annotation data of the sub-document element identified at 322, including textual annotations (such as tags, keywords, or comments), graphical annotations (such as graphical icons), or audio/video annotations (such as links to audio or video clips); and inserting, at 326(C), annotations (such as highlighting, which may include text, graphics, audio, or video) made to the content of sub-document elements of the web page associated with the stored annotated data identified at 322. The annotations inserted in the web page at 326(B) and 326(C), respectively, may be made to sub-document elements (or to a combination of sub-document elements, such as a chapter of a document defined by a combination of sub-document elements) of the retrieved web page identified at 318 when the sub-document elements are similar to (or have matching fingerprints of) the sub-document elements of annotated data stored at 322 in database 122.
In one embodiment at 326, the annotation plug-in 108 makes each word on a web page selectable by augmenting HTML content of the web page (using, for example, AJAX) by: (a) altering, at 326(A), the Document Object Model (DOM) tree of a web page by enclosing each word of the web page with an HTML (HyperText Markup Language) tag <span> before it is loaded by web browser 110; and (b) attaching, at 326(D), an event listener, such as a mouse event listener, to each sub-document element for detecting word selections performed by the user in the web page.
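The word-wrapping step at 326(A) is illustrated below on a plain-text string. This is a simplified sketch: the plug-in described above operates on the DOM tree in the browser, whereas this example only shows the text transformation, and the `annot-word` class name is a hypothetical choice. Attaching the event listener at 326(D) is not shown.

```python
import html
import re


def wrap_words(text: str, css_class: str = "annot-word") -> str:
    """Enclose each whitespace-delimited word of plain text in a <span>,
    escaping HTML-special characters inside the word."""
    def wrap(match: re.Match) -> str:
        word = html.escape(match.group(0))
        return f'<span class="{css_class}">{word}</span>'

    # \S+ matches each run of non-whitespace; whitespace between words is kept.
    return re.sub(r"\S+", wrap, text)
```

Wrapping every word gives each one its own element, which is what lets a single event listener identify exactly which word the user selected.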
Subsequent to augmenting the web page at 326, the web page is displayed at 330 and made available to the user for (a) viewing annotations previously applied to similar content by the user or identified co-users and (b) further annotation. At 332, an event handler in the annotation plug-in receives and responds to events associated with input received from the user directed at the displayed web page. An event that indicates portions of a sub-document element to be annotated, causes the event handler to transmit, at 334, and store, at 342, the annotations in the database 122 (together with the similarity measure (or fingerprint) of the sub-document element associated with the annotations), respectively, and present them for display on the web page to the user at 340. For example,
An event that indicates a word selection by the user, causes the event handler to open a tagging comment field on the web page and the selected word to be inserted therein as a tag (and in one embodiment cause the name label field to be inserted). An event that indicates the selected tags or keywords to be saved, causes the event handler to transmit the saved tags and/or keywords to the server-side annotation module 114 to be checked for errors (which may include correcting spelling and/or punctuation errors, formatting inconsistencies, and/or eliminating stop-words in a specified annotation) at 336 and subsequently stored (together with the similarity measure (or fingerprint) of the sub-document element associated with the word selection) at 342. Annotations that are corrected at 336 are transmitted back at 338 to the annotation client 111 for subsequent display at 340. For example,
More specifically, in
Advantageously, after annotating a web page the user's annotations are stored by the annotation server 113 in database 122 to enable propagation of annotations to web pages with similar content. In other words, annotations propagate when the annotation server 113 provides the annotation client 111 with stored annotations to add to any document that contains content similar to that of the stored content. In addition, the annotation server 113 does not propagate stored annotations to the annotation client 111 that fail to match the fingerprint of the content of a sub-document element, even if the content may have previously existed on the web page being viewed when the annotations and the fingerprints of sub-document elements of the web page were originally recorded.
D. User Interface for Creating and Sharing Annotations
Further, the area 702 enables a user to specify page-level tagging of the user and view the page-level tags of specified co-users. Page-level tagging, as opposed to tagging at sub-document levels, simply associates tags, keywords, or comments with a web page (or URL). Similar to tagging of sub-document elements, page-level tagging is recorded at 706, which in one embodiment includes using commands “Save Page Tag” and “Update Page Tags” in the area 702 and accessing the database 140 through web server 136 by page-tag servlet 138 (which may in an alternate embodiment be integrated with or operate together with server side annotation module 114).
Below the control area 702 is the document area 704. In the document area 704, web page content is augmented, which includes in this example annotations associated with a user (“lichan_hong”) and selected co-users (“edhchi” and “kooltag”). Thus, by examining the control area 702 and the document area 704, a user is able to see whether the displayed document is annotated at a page level and/or sub-document level.
E. Tag-Based Search
Using a web-based annotations service such as that made available by the client-side annotation module 106, users are given the ability to bookmark web documents and attach a set of tags or keywords to (or more generally annotate, by for example highlighting or attaching comments to) the document bookmark at the page level (and in an alternate embodiment to document elements at the sub-document element level). Subsequently, the user may search and retrieve the document from the user's personal bookmark collection using the set of tags of the user. Additionally, users may elect to share, either publicly to all users or semi-publicly to selected co-users, their bookmarks (and associated tags or keywords), which may then be browsed and searched by other users. This collaborative sharing of user cultivated document bookmark collections enables users to benefit by allowing bookmarked documents to be discovered using the shared user (i.e., collaboratively developed) bookmark collection.
In one embodiment shown in
At 810, the bigraph matrix (or more generally any n-dimensional matrix) is used to compute tag profiles and document profiles over the nodes of the bigraph by computing the profiles using spreading activation iteratively as vectors A as follows:
A[1] = E;
A[2] = αM*A[1] + βE;
. . .
A[n] = αM*A[n−1] + βE;
where:
A[1], A[2], . . . A[n] are iteratively computed profile vectors of URLs and tags;
E is a unit vector representing a tag or document entry node;
M is a matrix representation of the bigraph (or more generally any n-dimensional graph) arranged by column or row according to the selected entry node;
α and β are parameters for adjusting spreading activation.
After iteratively performing spreading activation for “n” steps (which number of steps “n” may vary depending on accuracy and/or performance), spreading activation is stopped on the tag side of the bigraph or document side of the bigraph, thereby providing a tag profile vector or document profile vector for the tag or document entry node E. The resulting pattern of weights in the tag profile vector and document profile vector define “tag profiles” and “document profiles”, respectively.
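The iteration A[k] = αM*A[k−1] + βE can be sketched in a few lines. This is a minimal illustration that assumes M is a square, symmetric adjacency matrix over all URL and tag nodes; the description leaves the exact arrangement of M open, and the function names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def spreading_activation(matrix, entry_index, alpha, beta, steps):
    """Return A[steps], where A[1] = E and
    A[k] = alpha * M * A[k-1] + beta * E for k = 2..steps."""
    n = len(matrix)
    entry = [1.0 if i == entry_index else 0.0 for i in range(n)]  # unit vector E
    activation = entry[:]  # A[1] = E
    for _ in range(steps - 1):
        spread = mat_vec(matrix, activation)
        activation = [alpha * s + beta * e for s, e in zip(spread, entry)]
    return activation


# Nodes: index 0 = url1, index 1 = url2, index 2 = tagA (tagA is applied to both URLs).
M = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
profile = spreading_activation(M, entry_index=2, alpha=0.5, beta=1.0, steps=2)
# profile == [0.5, 0.5, 1.0]: both URLs receive activation from tagA
```

Under this layout, reading off the URL entries of the result gives a document-side profile for the entry node, and reading off the tag entries gives a tag-side profile.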
As shown in
Once the tag profiles and document profiles are computed at 810, they are in one embodiment cached for subsequent retrieval and search (at 812). In an alternate embodiment, such tag and document profiles are computed in real-time on demand. Users may search on a document collection using a search interface 1100 as illustrated in
The search results may then be refined further by the user by specifying which documents in 1104 are relevant, using, for example, methods known in the art as “relevance feedback”. In one embodiment, users indicate their interest by clicking on a selection box. Documents that are selected may be looked up in the cached computation results (at 812 in
More generally once cached (at 812), the tag profiles and document profiles form the basis of different similarity computations and lookups for retrieval, search, and recommendations. For example, in the embodiment shown in
At 820, the identified tag profile and/or document profile are sorted to rank related tags and/or documents by importance. These most similar tags and/or documents arranged by rank are returned for presentation in for example the user interface 1100 illustrated in
In other words at 814, 816, 818, and 820, if a user would like to find information related to a document, the tag-based search server may: (a) lookup the corresponding document profile, assuming the document already exists in its cached spreading activation computation profiles (at 812), and select and return for display selected documents in that profile arranged by relevance, for example, from highest to lowest weight (which arrangement of documents may in addition be filtered to retain only those weighted above a threshold value); (b) use the corresponding document profile to compare it against all other document profiles in the system to find similar document profiles, and thereafter select and return for display selected documents in those profiles arranged by relevance; and/or (c) use information retrieval techniques, such as computing the traditional cosine similarity measure of document word vectors, to find those most similar documents in the URL/TAG bigraph when the user specified document does not already exist in the cached spreading activation computation profiles (at 812), and thereafter use either (a) or (b) to select and return related documents. Alternatively, at 814, 816, 818, and 820, if a user would like to find information related to a tag, similar operations performed for a document are performed instead for the tag and the corresponding tag profiles.
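Options (a) and (b) above can be sketched as follows, assuming each profile is a plain list of weights aligned with a list of labels. The function names and the `top_k` and `threshold` parameters are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_profile(profile, labels, threshold=0.0):
    """Option (a): order one profile's entries from highest to lowest weight,
    keeping only entries weighted above the threshold."""
    ranked = sorted(zip(labels, profile), key=lambda pair: pair[1], reverse=True)
    return [(label, weight) for label, weight in ranked if weight > threshold]


def most_similar_profiles(query, profiles, top_k=3):
    """Option (b): compare a profile against all other profiles and return
    the top_k most similar (name, score) pairs."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in profiles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Option (c) would apply the same cosine measure to document word vectors instead of profile vectors when the requested document has no cached profile.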
In an alternate embodiment, keyword searches may be used to identify documents or tags cached in the spreading activation computation profiles (at 812), their profiles of which are subsequently used to identify related documents and tags. In yet other embodiments, when multiple keywords are provided as search criteria, the profile vectors associated with the tags and documents corresponding to the multiple keywords are summed together, ranked and subsequently used to identify documents and tags of interest. Alternatively, profile vectors corresponding to different keywords are weighted differently before being summed together, ranked and subsequently used to identify documents and tags of interest. Such summation of profile vectors may in addition be used to further refine results by adding additional keywords and/or documents into the set of profile vectors that is ultimately summed together, ranked and subsequently used to identify documents and tags of interest.
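The summation of profile vectors for a multi-keyword query might look like the sketch below. It assumes all profile vectors have the same length and ordering; the function names and the default weight of 1.0 per keyword are assumptions.

```python
def combine_profiles(profiles, weights=None):
    """Weighted element-wise sum of equal-length profile vectors."""
    if weights is None:
        weights = [1.0] * len(profiles)  # unweighted sum by default
    combined = [0.0] * len(profiles[0])
    for vector, weight in zip(profiles, weights):
        for i, value in enumerate(vector):
            combined[i] += weight * value
    return combined


def top_items(combined, labels, k=2):
    """Return the k labels with the highest combined weight."""
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [labels[i] for i in order[:k]]
```

Refining a search then amounts to appending another keyword's or document's profile to the list and recomputing the sum and ranking.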
In yet further embodiments, the exemplary embodiments set forth herein may be extended beyond a two-variable relationship between documents and tags. In such alternative embodiments, graphs may be defined, for example, between two different variables (e.g., between documents and users) or between three or more variables (e.g., between documents, tags, and users). The technique works for three or more variables since the spreading activation technique can be performed and cached for these different variables (e.g., resulting in tag profiles, document profiles, and user profiles).
Further, while shared bookmark collections are discussed in the foregoing embodiments, those skilled in the art will appreciate that bookmark collections may alternatively be substituted for shared collections of documents with metadata recording individual bookmark preferences with each document.
F. Miscellaneous
In view of the above description, an embodiment for annotating web pages may be performed by the method of: retrieving a document selected for display by a user; locating sub-document elements in content of the retrieved document; computing a similarity measure for each of the located sub-document elements; identifying similarity measures of annotated sub-document elements and the located sub-document elements that indicate a correspondence there between; augmenting the located sub-document elements of the retrieved document with annotations of those annotated sub-document elements that have comparable similarity measures; displaying the retrieved document augmented with annotations.
Further in view of the above description, the foregoing embodiment includes the following features wherein: the one or more sub-document elements may be selected from the group consisting of words, images, phrases, sentences, paragraphs, pages, sections, and chapters; the augmenting may associate each annotation of the structural elements with a name label; the name labels may be that of the user or a co-user; the document may be a web page; bounding boxes for each word may be defined in the web page; a user input event indicating a word selection in the web page may be identified, and the web page may be automatically tagged with the identified word selection; a user input event may be automatically identified indicating highlighting of one or more words in the web page; and the similarity measure may be a unique identifier that uniquely identifies sub-document elements.
Using the foregoing specification, the embodiments disclosed herein may be implemented as a machine (or system), process (or method), or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof. It will be appreciated by those skilled in the art that the flow diagrams described in the specification are meant to provide an understanding of different possible embodiments. As such, alternative ordering of the steps, performing one or more steps in parallel, and/or performing additional or fewer steps may be done in alternative embodiments.
Any resulting program(s), having computer-readable program code, may be embodied within one or more computer-usable media such as memory devices or transmitting devices, thereby making a computer program product or article of manufacture according to the disclosed embodiments. As such, the terms “article of manufacture” and “computer program product” as used herein are intended to encompass a computer program existent (permanently, temporarily, or transitorily) on any computer-usable medium such as on any memory device or in any transmitting device.
A machine embodying the disclosed embodiments may involve one or more processing systems including, but not limited to, CPU, memory/storage devices, communication links, communication/transmitting devices, servers, I/O devices, or any subcomponents or individual parts of one or more processing systems, including software, firmware, hardware, or any combination or subcombination thereof, which embody the disclosed embodiments as set forth in the claims. Those skilled in the art will recognize that memory devices include, but are not limited to, fixed (hard) disk drives, floppy disks (or diskettes), optical disks, magnetic tape, semiconductor memories such as RAM, ROM, PROMs, etc. Transmitting devices include, but are not limited to, the Internet, intranets, electronic bulletin board and message/note exchanges, telephone/modem based network communication, hard-wired/cabled communication network, cellular communication, radio wave communication, satellite communication, and other stationary or mobile network systems/communication links.
While particular embodiments have been described, alternatives, modifications, variations, improvements, and substantial equivalents that are or may be presently unforeseen may arise to applicants or others skilled in the art. Accordingly, the appended claims as filed and as they may be amended are intended to embrace all such alternatives, modifications, variations, improvements, and substantial equivalents.
Claims
1. A method for annotating documents, comprising:
- retrieving a document selected for display by a user;
- locating sub-document elements in content of the retrieved document;
- computing a similarity measure for each of the located sub-document elements;
- identifying similarity measures of annotated sub-document elements and the located sub-document elements that indicate a correspondence there between;
- augmenting the located sub-document elements of the retrieved document with annotations of those annotated sub-document elements that have comparable similarity measures;
- displaying the retrieved document augmented with annotations.
2. The method according to claim 1, wherein the sub-document elements are a type selected from the group consisting of words, images, phrases, sentences, paragraphs, pages, sections, and chapters.
3. The method according to claim 1, wherein the sub-document elements are a plurality of types selected from the group consisting of words, images, phrases, sentences, paragraphs, pages, sections, and chapters.
4. The method according to claim 1, wherein said augmenting associates each annotation of the structural elements with a name label.
5. The method according to claim 4, wherein the name labels is one or more of the user and a co-user.
6. The method according to claim 1, wherein the document is a web page.
7. The method according to claim 6, further comprising defining bounding boxes for each word in the web page.
8. The method according to claim 7, further comprising:
- identifying a user input event indicating a word selection in the web page; and
- automatically tagging the web page with the identified word selection.
9. The method according to claim 7, further comprising:
- identifying a user input event indicating highlighting of one or more words in the web page; and
- automatically highlighting the one or more words in the web page.
10. The method according to claim 1, wherein the similarity measure is a unique identifier that uniquely identifies sub-document elements.
11. An apparatus for annotating documents, comprising:
- a memory for storing processing instructions of the apparatus; and
- a processor coupled to the memory for executing the processing instructions of the apparatus; the processor in executing the processing instructions:
- retrieving a document selected for display by a user;
- locating sub-document elements in content of the retrieved document;
- computing a similarity measure for each of the located sub-document elements;
- identifying similarity measures of annotated sub-document elements and the located sub-document elements that indicate a correspondence there between;
- augmenting the located sub-document elements of the retrieved document with annotations of those annotated sub-document elements that have comparable similarity measures;
- displaying the retrieved document augmented with annotations.
12. The apparatus according to claim 11, wherein the processor in executing instructions for augmenting further comprises instructions for associating each annotation of the structural elements with a name label comprising one or more of the user or a co-user.
13. The apparatus according to claim 11, wherein the document is a web page.
14. The apparatus according to claim 13, wherein the processor in executing the processing instructions further comprises defining bounding boxes for each word in the web page.
15. The apparatus according to claim 14, wherein the processor in executing the processing instructions further comprises:
- identifying a user input event indicating a word selection in the web page; and
- automatically tagging the web page with the identified word selection.
16. The apparatus according to claim 14, wherein the processor in executing the processing instructions further comprises:
- identifying a user input event indicating highlighting of one or more words in the web page; and
- automatically highlighting the one or more words in the web page.
17. The apparatus according to claim 11, wherein the similarity measure is a unique identifier that uniquely identifies sub-document elements.
18. An apparatus for annotating documents, comprising:
- means for retrieving a document selected for display by a user;
- means for locating sub-document elements in content of the retrieved document;
- means for computing a similarity measure for each of the located sub-document elements;
- means for identifying similarity measures of annotated sub-document elements and the located sub-document elements that indicate a correspondence there between;
- means for augmenting the located sub-document elements of the retrieved document with annotations of those annotated sub-document elements that have comparable similarity measures;
- means for displaying the retrieved document augmented with annotations.
19. The apparatus according to claim 18, wherein said augmenting means associates each annotation of the structural elements with a name label comprising one or more of the user or a co-user.
20. The apparatus according to claim 18, wherein the document is a web page.
21. The apparatus according to claim 20, further comprising means for defining bounding boxes for each word in the web page.
22. The apparatus according to claim 21, wherein the processor in executing the processing instructions further comprises:
- means for identifying a user input event indicating a word selection in the web page; and
- means for automatically tagging the web page with the identified word selection.
23. The apparatus according to claim 21, wherein the processor in executing the processing instructions further comprises:
- means for identifying a user input event indicating highlighting of one or more words in the web page; and
- means for automatically highlighting the one or more words in the web page.
24. The apparatus according to claim 18, wherein the similarity measure is a unique identifier that uniquely identifies sub-document elements.
Type: Application
Filed: Aug 13, 2007
Publication Date: Aug 21, 2008
Applicant: PALO ALTO RESEARCH CENTER INCORPORATED (Palo Alto, CA)
Inventors: Lichan Hong (Mountain View, CA), Ed H. Chi (Palo Alto, CA)
Application Number: 11/837,837
International Classification: G06F 15/00 (20060101);