Online intelligent information comparison agent of multilingual electronic data sources over inter-connected computer networks
A method and system for real-time online search processing over inter-connected computer networks, in which information is maintained in an offline database for a plurality of vendor sites from the inter-connected computer networks. The information includes URLs, search form URLs, descriptions of domains, and vendor descriptions, wherein the vendor descriptions comprise generalized rules about how product information is organized on each of the vendor sites. Parameters are processed for a price comparison request for a desired product using the information maintained in the offline database, the price comparison request being received from an online user or buyer and/or from the system of the present invention. Real-time price and product information is then extracted from identified ones of the plurality of vendor sites, wherein the extracted price and product information are in a native language of the site, and the extracted price and product information are displayed to the user.
The present application claims priority under 35 U.S.C. §119(e) from provisional applications Nos. 60/236,574, filed Sep. 29, 2000, and 60/299,360, filed Jun. 19, 2001.
COMPUTER PROGRAM LISTING APPENDIX
Reference is made to the computer program listing appendices submitted herewith in a total of two (2) compact discs (including one duplicate compact disc). A single file is included in the submitted discs and is entitled “implementation.txt,” bears a date of creation of Sep. 18, 2001, and has a size of 14K bytes. The material contained in the compact discs is hereby incorporated by reference herein.
FIELD OF THE INVENTION
The present invention relates generally to automating tasks on the World Wide Web (the “Web”) and more particularly to automating tasks for an online buyer or user such as comparison shopping or interacting with the multilingual vendors on the World Wide Web through a single interface to increase communication efficiencies and to provide a personalized buying experience.
DESCRIPTION OF THE BACKGROUND
Since the creation of the World Wide Web in the mid 1990's, the size of the Internet has exploded a thousand-fold. People are now inter-connected, not by means of direct face-to-face interaction, but through virtual communication channels. This new revolution of technology has fundamentally changed the way people live.
A parallel development with the World Wide Web is the “Information Technology Age” that presents a stunning variety of online information resources ranging from product information to academic papers. These elements have enabled the exponential growth of Electronic Commerce that capitalizes on the convenience and low cost which the Internet delivers.
There are several million or more online vendors on the World Wide Web. Although current comparison shopping or price comparison search engines can retrieve from different online competitors, according to an online buyer's or user's query, somewhat relevant search results pertinent to any desired products requested and their desired prices, the buyer or user can be confronted with an endless sea of information. Sometimes, the buyer or user receives a “failure page” of search results because the search engines have missed other Websites of online multilingual vendors existing in the rest of the Internet-connected countries (currently numbering 246) selling exactly what was requested. Furthermore, although information about products and vendors is easily accessible on the Web, buyers or users are still in the loop in all stages of the buying process.
The potential of the Internet for transforming the present mode of e-commerce into a truly global ensemble marketplace is largely unrealized today, and electronic purchases are still non-automated. Buying on the Internet is far from being simple, efficient, or enjoyable. Search engines and centralized directory services are insufficient for locating products the online buyer wants and the merchants willing to sell such products or services. Furthermore, the typical online purchase procedure is mostly manually driven and requires the buyer to enter all terms and keywords for which he or she wants to search. Therefore, a prospective buyer is faced with a daunting task, with responsibility for collecting and interpreting information about merchants and products, making decisions about them, and ultimately entering purchase and payment information. As a result, the user or buyer is easily overloaded with information and lacks sufficient time and expertise to process it.
In order of complexity, there are two imperfect strategies presently adopted and implemented to partially automate an online catalog price comparison process as follows:
(1) Non real-time approach
(2) Real-time hard-coded wrappers approach
The non real-time approach is the simplest way to implement a price comparison agent. Its implementation involves manually collecting all necessary information from the Web, and then writing a separate HTML file for each item of the search results in order to visually display the search results.
The benefits of the above are obvious—easy implementation and short searching time. Notwithstanding those benefits, there are three main undesirable drawbacks. Firstly, as the price comparison is done manually, maintaining a large wrapper repository becomes very costly, particularly in view of the continuing growth of the Internet. Secondly, great effort must be invested to keep the price and other information up-to-date. Lastly, the size of the database required to store and coordinate all of the above information is extremely large.
The real-time hard-coded wrappers approach is an alternative to the non real-time approach. Instead of fetching the items directly as in the non real-time approach, the real-time approach tries to generalize the HTML page into a specific format. To perform this extraction task, a customized wrapper procedure named pcwrapHLRT—programming acronym—is invoked.
When an HTML page is given, pcwrapHLRT sequentially scans the entire page starting from the head line number. The outer loop checks whether there are additional model number and price pairs to extract by searching for the delimiter “<B>” in the non-scanned portion of the page. As long as the beginning of a model number is found, the inner loop is invoked to extract the appropriate page sub-strings.
Few Websites publish their formatting conventions. Thus, the designer of an information-gathering system using pcwrapHLRT would have to manually construct such a wrapper for each resource. Unfortunately, this hard-coding process is tedious and error-prone, as a common HTML page may consist of several thousand lines of code. Moreover, most sites periodically change their formatting conventions, which usually breaks a wrapper.
Another disadvantage of pcwrapHLRT is that the search speed is only moderate, as the agents have to contact the vendor Website upon receiving a request from the user. Because this kind of wrapper is only partially automated, extra administrative work must be performed to manually analyze the format of the HTML page in order to determine the wrapper.
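The scanning behavior described above can be illustrated with a minimal sketch, assuming a head delimiter, a tail delimiter, and one left/right delimiter pair per attribute. The function name `exec_hlrt`, the sample page, and the prices are hypothetical and are not taken from pcwrapHLRT itself.

```python
from typing import List, Tuple


def exec_hlrt(page: str, head: str, tail: str,
              delims: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract rows from `page` with head/tail and (left, right) delimiters."""
    pos = page.find(head)
    if pos < 0:
        return []
    pos += len(head)                      # skip the page head
    end = page.find(tail, pos)
    if end < 0:
        end = len(page)                   # no tail found: scan to end of page
    rows: List[Tuple[str, ...]] = []
    first_left = delims[0][0]
    # Outer loop: is there another row before the tail delimiter?
    while True:
        nxt = page.find(first_left, pos)
        if nxt < 0 or nxt >= end:
            break
        row = []
        # Inner loop: extract one sub-string per attribute.
        for left, right in delims:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            stop = page.find(right, start)
            if stop < 0:
                return rows
            row.append(page[start:stop])
            pos = stop + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical sample page; the prices are invented for illustration.
PAGE = ("<HTML><TITLE>Catalog</TITLE><BODY><P>"
        "<B>HM381 MD</B> <I>$199.95</I><BR>"
        "<B>XR200 CD</B> <I>$89.00</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

The trailing `<B>End</B>` is not extracted because it lies after the tail delimiter, which is the purpose of the head and tail markers in this class of wrapper.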
SUMMARY OF THE INVENTION
In view of the afore-mentioned commonly encountered problems, an alternative to manual and partially automated manipulation, based upon a new Internet strategy, is automatic manipulation: online intelligent price comparison agents that can ease the price comparison process of online catalog buying or shopping (auctioning, etc.), and can meanwhile provide a better navigational environment with an Internet-friendly interactive-agent-character graphical user interface (IACGUI). This will be particularly useful when the so-called 4th Generation Global Ensemble Marketplace Framework (agent-mediated B-to-C, C-to-C, B-to-B e-procurement and auction, G-to-B/C (Government-to-Business/Consumer) tendering e-commerce and m-commerce (mobile commerce)) becomes widely implemented. Thus, the system of the present invention provides a better environment for consumer-to-business transactions.
To put it simply, online intelligent price comparison agents are automated online buying or shopping assistants that scour global online multilingual stores and ferret out deals on every product. They also deliver value-added (customer rated) Business-Web services to the online buyer/user. Such agents are attractive because they can relieve users of the tedium of manually carrying out every operation in the Consumer Buying Behavior model.
Conventionally, a buyer/user communicates with a Web server of an online service through the interface at the front-end, which presents a form completed by the buyer/user for entering the terms to be searched. Once the buyer/user submits the search request, the online service's Web server queries its database for matches, and presents the results to the user's Web browser.
In the present invention, user agents (online intelligent price comparison agents acting on behalf of the human buyer/user) in the online catalog price comparison process carry the terms and keywords to be searched for, and communicate with numerous multilingual Web servers of any of the 246 Internet-connected countries over inter-connected computer networks on the World Wide Web for the buyer's/user's best interests. The user agent then ranks the online vendor sites it finds and presents a summary of search results via the Web browser to the online human user.
The advantages of applying the system of the present invention to multiple e-commerce segments are very significant. Communication efficiency and effectiveness can be increased considerably, and time and cost savings for online vendors as well as online buyers can be maximized. Most importantly, the user/buyer will have access to an unprecedented number of sources of information and product sources on a global scale, as well as a large number of business opportunities. The system and method of the present invention will also help to collapse time and language barriers and demographic boundaries, and truly enable the globalization of e-commerce. In addition, the personalized, continuously running, autonomous nature of the user agents makes them well suited for mediating buyer/consumer behaviors. It is believed that the present invention will help to optimize the whole buying experience and revolutionize current e-commerce.
It is therefore an object of the present invention to provide an improved price comparison of online vendors' products or services.
It is yet another object of the present invention to construct vendor descriptions of the online shop.
It is yet another object of the present invention to collect data, including sample products and URLs, which are used for training.
It is yet another object of the present invention to retrieve training data before performing learning of vendor sites or online shops.
It is yet another object of the present invention to collect training pages from online vendors using information given in the training data.
It is yet another object of the present invention to generate vendor descriptions from the training data and collected training pages.
It is yet another object of the present invention to store the generated vendor descriptions in an offline database.
It is yet another object of the present invention to provide an interface for a system administrator to add, modify and delete vendors supported by the system.
It is yet another object of the present invention to provide an interface for an administrator to view vendor information.
It is yet another object of the present invention to provide a price comparison method whereby a customer can initiate the price comparison.
It is yet another object of the present invention to parse HTML pages into useful data.
It is yet another object of the present invention to provide filtering and sorting of desired products/services.
It is yet another object of the present invention to provide a single interface to compare prices of different online multilingual vendors and different domains on the Internet or World Wide Web.
A first user agent is embodied in the system of the present invention, and is implemented in the form of a Semantics Recognition Learner Agent (SRLA). It conducts a real-time autonomous wrapper induction using an inductive learning method to learn the URL of a vendor's site and its domain description, and generalized rules about the organization of the vendor's site based upon previously compiled or prepared training examples provided by the system administrator. (In one embodiment, the SRLA connects to a Microsoft brand back-end SQL-compliant server or Microsoft Access database to produce a vendor and products description only once per online store.) The wrapper induction is done by constructing in real-time a wrapper of examples that is extracted from vendor and products descriptions stored in the offline database. Then with the examples, the SRLA autonomously zaps through the Internet in real-time to the remote host of the vendor site to access the Web pages exhibiting the specified examples according to the URL provided, then intelligently fills in a relevant search form with the domain or product information, and then virtually “presses enter” to thereby submit a search request to the site. Result pages that are returned in response to the search criteria are either a successful page containing accurate information or a failure page. These result pages, having vendor and products descriptions that are unique to a particular vendor (either a registered or non-registered vendor with the system), are consequently stored in a vendor description list in the offline database (such as in an SQL-compliant server or Microsoft Access database) maintained by the system administrator. Vendor URLs, vendor descriptions and other information are preferably automatically updated once daily on schedule.
A second user agent embodied in the system of the present invention is referred to as a Semantics Recognition Buyer Agent (SRBA). The SRBA uses the vendor descriptions previously “learned” by the Semantics Recognition Learner Agent to search for a match while accessing simultaneously various online multilingual vendor sites on the World Wide Web. The SRBA intelligently fills-in a vendor's search form with the product information provided by an online buyer or user and virtually “presses enter.” The vendor then returns search result pages to the SRBA through the World Wide Web in such a manner that result pages arrive at about the same time as other ones being returned from other vendors. (The Semantics Recognition Buyer Agent stores these returned pages in a separate memory or cache location as hits for later use by other SRBAs.) The SRBA analyzes the returned pages according to the corresponding vendor descriptions, extracts from them relevant information and data, sorts prices and model numbers, and displays them in a formatted summary on the screen of a client-machine via a Web browser to the online buyer/user.
In accordance with the present invention, a method is provided for a computer-implemented Semantics Recognition Learner Agent to perform an inductive learning. The method comprises retrieving training data specific to an online vendor to generate a corresponding vendor description from inter-connected computer networks. The method comprises collecting training pages using the given training data stored in the vendor list. Using the training data as well as the retrieved training pages, the method comprises an inductive learning method to generate a vendor-specific vendor description from information that is extracted from the training data and retrieved training pages.
A method is provided for storing the retrieved and/or extracted vendor descriptions in an offline database that will be later used by a Semantics Recognition Buyer Agent (SRBA).
A method is provided in accordance with the present invention for price comparison of products or services from online vendors. The method comprises an online user initializing a request for a specific product or service, after which a Semantics Recognition Buyer Agent constructs parameters of a search request using pre-defined vendor descriptions. The method comprises posting requests to different online vendors, preferably at the same time, and extracting data from result pages returned from the online vendors using a parser that comprises the vendor descriptions. The method comprises constructing/composing sorted and filtered data by a Semantics Recognition Buyer Agent in an HTML format for presenting the data to the online buyer/user.
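The price comparison flow just described (post to several vendors at about the same time, extract, filter, sort) might be sketched as follows. This is only an illustration under stated assumptions: each vendor is represented by a hypothetical search function that already returns parsed (item, price) pairs, and the vendor names and prices are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Offer = Tuple[str, str, float]                       # (vendor, item, price)
SearchFn = Callable[[str], List[Tuple[str, float]]]  # hypothetical vendor search


def compare_prices(query: str, vendors: Dict[str, SearchFn],
                   max_price: Optional[float] = None) -> List[Offer]:
    """Query all vendors concurrently, then filter and sort by price."""
    offers: List[Offer] = []
    with ThreadPoolExecutor(max_workers=max(1, len(vendors))) as pool:
        # One request per vendor, submitted at about the same time.
        futures = {name: pool.submit(fn, query) for name, fn in vendors.items()}
        for name, future in futures.items():
            try:
                results = future.result()
            except Exception:
                continue          # treat an unreachable vendor as a failure page
            offers.extend((name, item, price) for item, price in results)
    if max_price is not None:
        offers = [o for o in offers if o[2] <= max_price]
    return sorted(offers, key=lambda o: o[2])


# Stub vendors standing in for real sites (assumed data).
def vendor_a(q: str) -> List[Tuple[str, float]]:
    return [(q + " MD", 199.95)]


def vendor_b(q: str) -> List[Tuple[str, float]]:
    return [(q + " MD", 189.50)]


def vendor_down(q: str) -> List[Tuple[str, float]]:
    raise ConnectionError("unreachable")


STUBS = {"A": vendor_a, "B": vendor_b, "C": vendor_down}
all_offers = compare_prices("HM381", STUBS)
cheap_offers = compare_prices("HM381", STUBS, max_price=195.0)
```

A thread pool is used here only because the requests are I/O bound; any mechanism that issues the N requests without waiting for each in turn would serve the same purpose.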
A method is provided and implemented through the Semantics Recognition Buyer Agent for parsing returned pages from online vendors to retrieve useful data. The method comprises retrieving vendor descriptions from an offline database, parsing the returned page from online vendors for any of the (currently 246) Internet-connected countries on the World Wide Web, and collecting useful data using information from the returned vendor descriptions.
In one embodiment of the invention, the above functionality is only available on member Web pages after an online buyer signs up as a registered temporary trial or life member.
In accordance with the present invention, a method is provided for real-time online search processing of selected types of information over inter-connected computer networks. The method comprises a number of steps: assembling site descriptions for a plurality of sites in the inter-connected computer networks including for each of the plurality of sites (a) a URL for the site; a search form URL for the site; (b) generalized rules of how the selected types of information on the site is organized; (c) sample data retrieved from the site corresponding to the selected types of information; and (d) descriptions of domains found on the site; receiving a request for specified types of information from an online user; identifying from the site descriptions, sites which may have the specified types of information; constructing search requests for the specified types of information using the site descriptions for each identified site; submitting the constructed search requests to the identified sites; receiving search results from the identified sites, and upon locating accurate matches in the received search results, extracting information corresponding to the specified types of information in a native language of the site, and displaying the extracted information to the user.
More generally, the present invention involves a method for real-time online search processing over the inter-connected computer networks. The method comprises the steps of: (a) maintaining in an offline database information for a plurality of vendor sites from the inter-connected computer networks; the information includes URLs, search form URLs, description of domains, and vendor descriptions, wherein the vendor descriptions include generalized rules about how product information is organized on each of the vendor sites; (b) processing parameters for a price comparison request for a desired product using the information maintained in the offline database, while the price comparison request is received from an online user and/or the Semantics Recognition Buyer Agent; (c) extracting real-time price and product information from identified ones of the plurality of vendor sites, wherein the extracted price and product information are in a native language of the site; and (d) displaying the extracted price and product information to the user.
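As one possible illustration of step (a), the per-vendor information could be held in a record and persisted in an SQL-compliant store. The sketch below uses SQLite from the Python standard library purely as a stand-in for the offline database; the field names, the table layout, and the example URLs are assumptions, not the schema of the described embodiment.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class VendorRecord:
    name: str                    # used as the primary key
    site_url: str
    search_form_url: str
    domains: List[str]           # descriptions of product domains
    head: str = ""               # generalized rules: head delimiter
    tail: str = ""               # generalized rules: tail delimiter
    delimiters: List[List[str]] = field(default_factory=list)  # [left, right]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS vendors "
               "(name TEXT PRIMARY KEY, record TEXT NOT NULL)")
    return db


def save_vendor(db: sqlite3.Connection, rec: VendorRecord) -> None:
    # The whole record is stored as JSON to keep the sketch short.
    db.execute("INSERT OR REPLACE INTO vendors (name, record) VALUES (?, ?)",
               (rec.name, json.dumps(asdict(rec))))
    db.commit()


def load_vendor(db: sqlite3.Connection, name: str) -> Optional[VendorRecord]:
    row = db.execute("SELECT record FROM vendors WHERE name = ?",
                     (name,)).fetchone()
    return VendorRecord(**json.loads(row[0])) if row else None


# Hypothetical vendor; the URLs are placeholders.
db = open_store()
rec = VendorRecord(name="ExampleShop",
                   site_url="http://shop.example/",
                   search_form_url="http://shop.example/search",
                   domains=["MD players", "DVD players"],
                   head="<P>", tail="<HR>",
                   delimiters=[["<B>", "</B>"], ["<I>", "</I>"]])
save_vendor(db, rec)
loaded = load_vendor(db, "ExampleShop")
```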
Referring to
In the preferred embodiment 10 of the present invention, a Learner Agent 18 (also referred to as a Semantics Recognition Learner Agent) and a Shopper Agent 20 (also referred to as a Semantics Recognition Buyer Agent) are provided. A server 22 is employed to provide access to an offline database 24 that stores global multilingual vendor information. A system administrator 26 prepares/compiles training data about selected vendor sites and stores them in a “vendor list” 27 in offline database 24 through server 22. The system administrator 26 can then employ the training data and the Semantics Recognition Learner Agent 18 to conduct “inductive learning” from training pages retrieved from vendor sites by way of the World Wide Web 16. The “inductive learning” results in vendor descriptions in the form of vendor description list 28 which are stored in the offline database 24.
A user/buyer 12 can use the preferred embodiment of the present invention to retrieve designated information about designated subjects by using Semantics Recognition Buyer Agent (SRBA) 20. The SRBA 20 processes a request from the user/buyer 12 by using information contained in the previously learned vendor descriptions 28. The information in the vendor descriptions 28 permits the Semantics Recognition Buyer Agent 20 to instantly prepare and issue searches on many vendor Websites substantially simultaneously by way of the World Wide Web 16. The vendor descriptions also permit the Semantics Recognition Buyer Agent 20 to instantly process received search results, and to present to the user/buyer 12 the results of the search from all vendor sites searched which have been filtered of extraneous and irrelevant information.
Referring now to
The training data includes a bundle of data pertaining to the online vendors from which information is to be learned. These data may include URLs, domain descriptions, sample products and attributes and other domain-specific information as shown below in the right column:
The “trained” data is preferably stored in an SQL-compliant or a Microsoft Access database. This adds extra extensibility to the selection of the data container from different vendors. Typically, the trained data is independent of the product domain, written characters and presentation style of the online vendor. One exception is the URL path in the trained data, which is required to uniquely identify different vendors.
Returning to
Next, in step 140, the computer program performs an inductive learning on the training pages obtained by the Semantics Recognition Learner Agent 18. The objective of the inductive learning is to obtain a generic description of the site and how it organizes the product data and logically presents the product data to a potential online customer. The product of this learning is called a “vendor description” 28—this phase will be further described and explained in accordance with
Then, in step 150, the Semantics Recognition Learner Agent 18 stores the learned result preferably in an SQL-compliant or Microsoft Access database 24. (The vendor information or “vendor descriptions” 28 stored in offline database 24 will later be used by the online Semantics Recognition Buyer Agent 20.) Following the completion of the storing step 150, the Semantics Recognition Learner Agent 18 returns to the step 120 to see if there are more vendors to be learned. If so, steps 130 through 150 are repeated. Otherwise, the learning process terminates.
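The loop over steps 120 through 150 can be summarized in a short sketch. The step numbers in the comments follow the text; the function names and the dictionary-based store are hypothetical stand-ins for the page retrieval, inductive learning, and database components.

```python
from typing import Callable, Dict, List


def run_learner(vendor_list: List[Dict[str, str]],
                fetch_pages: Callable[[Dict[str, str]], List[str]],
                induce: Callable[[Dict[str, str], List[str]], dict],
                store: Dict[str, dict]) -> List[str]:
    """Learn and store one vendor description per vendor in the list."""
    learned: List[str] = []
    for vendor in vendor_list:                 # step 120: more vendors to learn?
        pages = fetch_pages(vendor)            # step 130: collect training pages
        if not pages:
            continue                           # nothing retrieved, skip vendor
        description = induce(vendor, pages)    # step 140: inductive learning
        store[vendor["name"]] = description    # step 150: store the result
        learned.append(vendor["name"])
    return learned                             # loop exhausted: learning ends


# Stubs standing in for the real components (assumed behavior).
def fake_fetch(vendor: Dict[str, str]) -> List[str]:
    return ["<B>HM381 MD</B>"] if vendor["name"] == "A" else []


def fake_induce(vendor: Dict[str, str], pages: List[str]) -> dict:
    return {"left": "<B>", "right": "</B>", "pages_seen": len(pages)}


store: Dict[str, dict] = {}
learned = run_learner([{"name": "A"}, {"name": "B"}],
                      fake_fetch, fake_induce, store)
```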
Vendor Description Learning Process
Referring now to
Firstly, a wrapper function generates a set of “labels” for the given training page. A label is used to identify the location of information for the training products in the training page.
Consider the first pair, <174, 180>. These integers indicate that the attribute of the first tuple is the sub-string between position 174 and 180, i.e. the string “HM381 MD” is located between position 174 and position 180. As used in this example, position means the number of characters from a designated beginning point, such as the beginning of a page, or the end of the “head” of a page. Spaces between text characters are counted as a character position. Inspection of the
Thus, if “b” stands for beginning and “e” stands for ending, then values identifying the positions of the second tuple comprise string b_,i, which is the value of the beginning of the model number, “M,” and string e_,i, which is the value of the ending of the model number, “0.” Analogously, it is to be understood that the present invention enables the automation of the labeling by invoking a modular heuristic search based upon a standard relational data model comprising an “Item Recognizer” and an “Intelligent Price Recognizer,” in which, again, a tuple is a vector <b2,i, b2,p> of two strings. String b_,i is the value of the item attribute, and string b_,p is the value of the price attribute. Attributes thus represent columns, whereas tuples represent rows. The numeral “2” in “b2,i,” between the “b” and the “,” denotes a position on the second row. The computation of positional values (labeling) is hence performed automatically, in real time and on the fly, during the efficiently learnable wrapper induction of vendor descriptions, so as to label the entire Ppc (page of product catalogue, where a page “P” is the Web page containing the desired information), regardless of whether the Web pages on the vendor site (in this example, www.800.com) are formatted in native character strings of any language, in natural language, or coded in HTML, XML, cXML, Java, etc.
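The position-pair labeling described above can be sketched as a simple search for each training product's item and price strings in the training page. This is an illustrative assumption about how such labels could be computed, not the Item Recognizer or Intelligent Price Recognizer themselves; the function name and the sample price are hypothetical, and end positions are exclusive (Python slice convention), which may differ from the convention used in the figures.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive


def label_page(page: str, samples: List[Tuple[str, str]],
               origin: int = 0) -> List[Tuple[Span, Span]]:
    """Return one (item span, price span) label per training product.

    Offsets are counted in characters from `origin`, e.g. the beginning
    of the page or the end of its head; spaces count as positions.
    """
    labels: List[Tuple[Span, Span]] = []
    cursor = origin
    for item, price in samples:
        b_i = page.find(item, cursor)
        if b_i < 0:
            raise ValueError(f"item not found: {item!r}")
        e_i = b_i + len(item)
        b_p = page.find(price, e_i)        # price is expected after the item
        if b_p < 0:
            raise ValueError(f"price not found: {price!r}")
        e_p = b_p + len(price)
        labels.append(((b_i - origin, e_i - origin),
                       (b_p - origin, e_p - origin)))
        cursor = e_p                       # next row starts after this one
    return labels


# Hypothetical one-row training page; the price is invented.
page = "<B>HM381 MD</B> <I>$199.95</I>"
labels = label_page(page, [("HM381 MD", "$199.95")])
(b_i, e_i), (b_p, e_p) = labels[0]
```

Because the labels are plain character offsets, the same routine applies whether the labeled text is English or any other native character string.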
The labeling of the content of a training page is represented more generally in
After the system administrator has executed the learning system once, it then retrieves the training page from the vendor list in the offline database in compiling a set of possible delimiter candidates in parallel with the compilation with the possible sets of delimiter candidates in
Referring now to
As to navigational regularity, online stores or vendor sites are designed to service consumer and business buyer inquiries. Thus, almost all online vendors provide a searchable index for easy access to the specific information being sought. Using the search form of a vendor site enables the Semantics Recognition Learner Agent 18 to generalize the formatting conventions of multilingual home and Web pages.
With regard to uniformity regularity, although online stores or vendors differ widely from each other in their product description formats, any given online vendor typically lays out all item descriptions in a simple consistent format.
As originally devised, the information infrastructure of the Internet (site architecture, formatting of online vendors' product descriptions, and expression of technologies) was intended for use by humans. This is apparent in the use of query mechanisms and output standards which are particularly suited for direct human manipulation. Online vendors comply with these regularities because they enable online sales to human shoppers or buyers. Although there is no guarantee that what makes an online store easy for humans to navigate will make it user-friendly for an intelligent software agent to master, the system of the present invention (online intelligent information comparison of multilingual electronic data sources) is designed to take advantage of these regularities.
In accordance with the present invention, wrapper construction is implemented through an inductive learning process. The methodology learns a vendor's wrapper by reasoning about a sample of the vendor's Web pages. In the methodology of the present invention, instances correspond to the vendor's pages, a page's label corresponds to its relevant content, and hypotheses correspond to the constructed wrappers.
Besides, in accordance with the present invention, an efficiently learnable wrapper class, such as the HLRT wrapper class, is incorporated.
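To make the correspondence between labels and hypotheses concrete, the following sketch derives left and right delimiters for each attribute from one labeled page. It is a deliberate simplification and an assumption on our part: the left delimiter is taken as the longest common suffix of the text preceding every instance of an attribute, and the right delimiter as the longest common prefix of the text following it. A full HLRT learner would also learn head and tail delimiters and test candidates for consistency. All names and sample data are hypothetical.

```python
import os
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive


def common_prefix(strings: Sequence[str]) -> str:
    return os.path.commonprefix(list(strings))  # character-wise prefix


def common_suffix(strings: Sequence[str]) -> str:
    return common_prefix([s[::-1] for s in strings])[::-1]


def spans(page: str, rows: List[Tuple[str, ...]]) -> List[List[Span]]:
    """Locate each sample string in order and return its offsets."""
    out, cursor = [], 0
    for row in rows:
        labeled = []
        for text in row:
            b = page.index(text, cursor)
            cursor = b + len(text)
            labeled.append((b, cursor))
        out.append(labeled)
    return out


def induce_delimiters(page: str,
                      labels: List[List[Span]]) -> List[Tuple[str, str]]:
    """Return one (left, right) delimiter pair per attribute."""
    n_attrs = len(labels[0])
    result = []
    for k in range(n_attrs):
        before, after = [], []
        for r, row in enumerate(labels):
            b, e = row[k]
            # Text between the previous labeled span and this one.
            if k > 0:
                prev_end = row[k - 1][1]
            elif r > 0:
                prev_end = labels[r - 1][-1][1]
            else:
                prev_end = 0
            before.append(page[prev_end:b])
            # Text between this span and the next labeled span.
            if k + 1 < n_attrs:
                nxt = row[k + 1][0]
            elif r + 1 < len(labels):
                nxt = labels[r + 1][0][0]
            else:
                nxt = len(page)
            after.append(page[e:nxt])
        result.append((common_suffix(before), common_prefix(after)))
    return result


# Hypothetical training page and sample products (prices invented).
PAGE = ("<P><B>HM381 MD</B> <I>$199.95</I><BR>"
        "<B>XR200 CD</B> <I>$89.00</I><BR><HR>")
LABELS = spans(PAGE, [("HM381 MD", "$199.95"), ("XR200 CD", "$89.00")])
DELIMS = induce_delimiters(PAGE, LABELS)
```

The learned strings are the longest contexts shared by all instances, so they may include fragments of neighboring tags; a practical learner would typically shorten them to the smallest strings that still extract the labeled data.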
Furthermore, in order to make sure that the methodology performs well, noise-tolerant techniques are employed when training data exhibit high levels of noise. For instance, given the screenshot example of www.800.com in
In effect, vendors attempt to create a sense of identity by using a uniform look for all types of products. For example, a vendor presents the information for an MD product in the same format as that for a DVD product. By taking advantage of this regularity, every product is assumed to be described in the same format.
The Semantics Recognition Learner Agent 18 in
Continuing with
For further detail, see the right-hand column in
In step 220 the Semantics Recognition Learner Agent 18 performs inductive learning on the retrieved training pages using the associated labels to output a set of possible “vendor description” candidates. Since the candidates are generated from a particular training page with specific training data, there is no way the candidates can be invalid for those pages. However, if the candidates are to be valid throughout the vendor's entire site, a cross page validation should be performed to derive a generalized vendor description which will be valid across the vendor's site.
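Cross page validation of candidates might look like the sketch below, which keeps only those delimiter candidates that extract the expected items from every validation page. The function names, pages, and the model number "DV100" are assumptions made for illustration; the sketch is not the vendor description validator of the described embodiment.

```python
from typing import List, Tuple

Candidate = Tuple[str, str]  # (left delimiter, right delimiter)


def extract(page: str, left: str, right: str) -> List[str]:
    """Return every sub-string found between `left` and `right`."""
    out: List[str] = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start < 0:
            break
        start += len(left)
        stop = page.find(right, start)
        if stop < 0:
            break
        out.append(page[start:stop])
        pos = stop + len(right)
    return out


def validate(candidates: List[Candidate],
             pages: List[Tuple[str, List[str]]]) -> List[Candidate]:
    """Keep candidates that reproduce the expected items on all pages."""
    valid = []
    for left, right in candidates:
        if not left or not right:
            continue  # empty delimiters cannot locate anything reliably
        if all(extract(page, left, right) == expected
               for page, expected in pages):
            valid.append((left, right))
    return valid


# Two candidates that both fit the MD training page they were learned from.
candidates = [("<B>", "</B>"), ("MD: <B>", "</B>")]
# Validation set: the original page plus a page from another product domain.
pages = [("<P>MD: <B>HM381</B></P>", ["HM381"]),
         ("<P>DVD: <B>DV100</B></P>", ["DV100"])]
generalized = validate(candidates, pages)
```

The second candidate is consistent with the page it came from but fails on the DVD page, which is the situation cross page validation is meant to catch.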
In step 240, a vendor description validator (VDV) validates a possible vendor description candidate against another set of training pages (retrieved in step 130,
Training data for use with a particular vendor is preferably compiled by the system administrator 26. As will be described in further detail herein, to add a specific vendor to the system of the present invention, the vendor's name, vendor's URL, submit form's URL, and domain data for the corresponding training examples are provided and stored in the offline database 24, which can be, for example, a Microsoft Access database. The vendor's name will be the primary key of a record. Manual wrapper input can be provided as an option. In order to provide an accurate set of data for the training example, which in turn will greatly enhance the accuracy and efficiency of the vendor information learned by the Semantics Recognition Learner Agent 18 in preparing itself for generating vendor descriptions in real-time, automatically, on-the-fly, it is important that the system administrator, or other individual, who prepares the training example data be knowledgeable about Web page URLs and domain name setup, be somewhat knowledgeable of the native language used in any multilingual vendor's Website being processed, and be able to identify the information types which are targeted for learning. This person need not be knowledgeable about coding.
Once a vendor's information is provided, administrator 26 can run the Semantics Recognition Learner Agent process for each vendor. After the administrator has executed the Semantics Recognition Learner Agent 18 one time for a vendor, he or she then navigates through the learning process with a step-by-step walkthrough of the screens of the Interactive-Agent-Character Learner Interfaces (IACLIs) in running any desired options as shown in
In a nutshell, the Semantics Recognition Learner Agent 18 of the present invention generates a vendor description that is unique to a particular vendor. The vendor description is the set of generalized rules of how a vendor organizes its product information in a specific format. Hence, the input to the wrapper construction system of the present invention is essentially a sample of the behavior of the wrapper to be learned. Under this formulation, wrapper construction becomes a process of reconstructing a wrapper based on a sample of its behaviors.
The methodology of the Semantics Recognition Learner Agent (SRLA) 18 is summarized in
Referring to step 4, a result page is returned according to the searching criteria. The result may be a successful result page with relevant product descriptions, or may be a failure page. It is to be noted that the contents of interest in the returned page are the HTML codes, the item description, the item price, and the locations of such information with respect to the HTML codes.
In steps 5 and 6, the search result page is returned by way of the Internet to the Learner Agent 18 for analysis. In step 7, the analysis conducted is called “Wrapper Induction,” in which the page is generalized to a set of layout and formatting rules that the vendor follows to present its product description in a logical manner. With these rules, during the Semantics Recognition Buyer Agent process of the present invention, the Buyer Agent 20 may extract product information from the same vendor when a user/buyer is searching for some product information from the same domain on the vendor site.
It is to be understood that in accordance with the present invention, the Semantics Recognition Learner Agent process will be performed for each vendor from which a vendor description is desired. Because of the information delimiter approach used by the present invention, vendor descriptions can be obtained from any vendor sites in any language—concisely, although the language presented to a user might be of a particular native character string, the underlying codes which can be identified as delimiters for the desired information, remain the same regardless of the native character strings of the language. In other words, information of the vendor descriptions will be obtained from a vendor site in the native language used by that site. There is no need for any translation of the native language used into a standard language. Moreover, because the delimiter candidates identified for each vendor site are not coded in underlying codes of the programming language used for that site, subsequent searching can be done without the need for different programming languages used among the sites to be searched for. This permits the Semantics Recognition Buyer Agent 20 to conduct searches on multilingual and multiple domains (product categories) bases, and independently of any programming language.
Referring to
If no “hits” are found in step 312, step 320 will invoke the Buyer Agent 20 to use the input parameters to retrieve the vendor descriptions from the offline database 24. These “vendor descriptions” are the ones previously defined by Semantics Recognition Learner Agent 18 during the vendor descriptions learning process. In step 330, Buyer Agent 20 will compose a user's new request to access different online vendors identified in the “vendor description list.” The composed user's new request will be based on the parameters given by the user and the data in the vendor descriptions. Preferably, if there are N online vendors to whom a request (such as a product model request) will be made, there will be N new requests to be composed by the Semantics Recognition Buyer Agent 20.
The Semantics Recognition Buyer Agent 20 uses the vendor descriptions to get the price information from the vendor site in real-time. The Buyer Agent 20 uses the vendor's URL and the vendor's name that are included in the information that makes up the vendor descriptions, to navigate to the vendor's site. Also included in the vendor descriptions is the search form URL for the vendor. In step 340, after navigating to the vendor site, the Semantics Recognition Buyer Agent 20 “virtually” fills-in the vendor search form based on the user's new request, and “virtually” presses enter to submit it. This is done for each of the identified online vendors.
As mentioned above, the vendor descriptions in the offline database 24 include a field that provides information of the vendor's search form URL, such as “http://www.onlineshop.com/search.asp?item=.” The Semantics Recognition Buyer Agent 20 uses the user's input parameter(s) and the search form URL to compose a new HTTP request for each of the identified online vendors. For instance, if the user wants to buy a “hard disk,” the new request composed by the Semantics Recognition Buyer Agent 20 will be as follows:
“http://www.onlineshop.com/search.asp?item=harddisk,”
and the Semantics Recognition Buyer Agent 20 will send this HTTP request to the online vendor as if the user submitted the request himself. If there are N identified vendors, the Semantics Recognition Buyer Agent 20 will initiate N threads to fill-in the search form for each of the N identified vendors. The Semantics Recognition Buyer Agent 20 preferably proceeds in parallel to each online vendor's searchable index, fills it in and submits a search request.
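The request composition described above can be sketched as follows. This is a minimal illustration only, not the implementation of the appendix: the function names and the dictionary keys `name` and `search_form_url` are hypothetical, and dropping the whitespace from the keyword is an assumption made solely to reproduce the “harddisk” example.

```python
from urllib.parse import quote


def compose_request(search_form_url: str, keyword: str) -> str:
    """Append the user's keyword to a vendor's stored search form URL."""
    # The example in the text turns "hard disk" into "harddisk", so whitespace
    # is dropped here; this normalization is an assumption of the sketch.
    compact = "".join(keyword.split())
    # Percent-encode the keyword so native-language characters survive in a URL.
    return search_form_url + quote(compact, safe="")


def compose_requests(vendor_descriptions, keyword):
    """Compose one request per identified vendor (N vendors give N requests)."""
    return {
        vendor["name"]: compose_request(vendor["search_form_url"], keyword)
        for vendor in vendor_descriptions
    }


# Hypothetical vendor description record, holding only the fields used here.
VENDORS = [
    {"name": "onlineshop",
     "search_form_url": "http://www.onlineshop.com/search.asp?item="},
]
REQUESTS = compose_requests(VENDORS, "hard disk")
```

Each composed URL would then be handed to its own thread, one per identified vendor.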
In step 350, the Semantics Recognition Buyer Agent 20 will wait for the response from the online vendor within a specified timeout or a user-defined timeout. If a timeout occurs, the Semantics Recognition Buyer Agent 20 proceeds to step 370; otherwise it will go to steps 358 and 360 to further process the received search result data.
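The timeout behaviour of step 350 can be illustrated with a short sketch. It assumes a thread pool in place of the DCOM-based threading of the preferred embodiment, and `fetch` is a hypothetical callable standing in for the real HTTP submission; `fake_fetch` exists only so the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def collect_responses(requests, fetch, timeout_s):
    """Submit every vendor request in parallel and wait at most timeout_s.

    requests: {vendor_name: url}; fetch: callable(url) -> page text.
    Returns (responses received in time, vendors that timed out).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(requests)))
    futures = {pool.submit(fetch, url): vendor for vendor, url in requests.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on vendors that are still pending
    responses = {}
    for future in done:
        if future.exception() is None:  # failed requests are simply dropped
            responses[futures[future]] = future.result()
    timed_out = sorted(futures[future] for future in not_done)
    return responses, timed_out


def fake_fetch(url):
    """Stand-in for a vendor site: URLs containing 'slow' answer late."""
    if "slow" in url:
        time.sleep(0.5)
    return "<HTML>" + url + "</HTML>"


RESPONSES, TIMED_OUT = collect_responses(
    {"fast": "http://fast.example/?item=x", "slow": "http://slow.example/?item=x"},
    fake_fetch,
    timeout_s=0.1,
)
```

Responses received within the timeout proceed to storage and extraction (steps 358 and 360); the remaining vendors are left out of the result page.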
Within the timeout period, the Semantics Recognition Buyer Agent 20 collects the search request's responses from different online vendors. In step 358, the Semantics Recognition Buyer Agent 20 receives the search result responses from the online vendors and stores them in cache or memory within the server 22. In step 360, the data of interest is extracted from the received responses. The Semantics Recognition Buyer Agent 20 extracts the desired data using the information in the vendor description list stored in the vendor description 28 or offline database 24. To exemplify, the vendor descriptions include fields that identify the codes for left and right wrappers. Firstly, the Semantics Recognition Buyer Agent (SRBA) 20 will use the left wrapper information to locate the start of the valid data in the response page. Afterwards, the data at the exact location (as defined by the information in the vendor description list) of the target data will be extracted and stored in the memory. (Recall that the information in the vendor descriptions are what have been learned by the Semantics Recognition Learner Agent 18 during the learning process of
In the extraction process, for example, the product description and the product price will be extracted. It is to be understood that the information contained in the vendor descriptions defined by the Semantics Recognition Learner Agent 18 are domain independent and multi-language dependent. For example, assuming that the online buyer or user's platform uses a Windows 98 Operating System and is running a language version “B” (or preferably his or her platform is running Windows 2000 Professional in the English version and/or his or her platform has a Personal Web Server version “B” installed), the Microsoft Internet Explorer will prompt him or her to download “B” language display software after he or she logs on to the portal of the present invention. After the online buyer or user “A” enters a product model in the native characters of language “B” as a keyword in the text box provided at the portal of the present invention, the Semantics Recognition Buyer Agent 20 in
Recall that in step 7 of
The vendor descriptions in memory or cache are preferably automatically updated once a day.
In other words, the Semantics Recognition Buyer Agent (SRBA) 20 can use the data in the vendor descriptions to locate target data in different domains and different languages. This is because, for a particular vendor, although the language may change, the underlying coding corresponding to the target information will not. As the three “formatting regularities” predominate most vendor sites, such as B-to-C, C-to-B, C-to-C online stores, etc., different domains on a vendor site will consistently use the same formatting and underlying coding to present the target information, such as item description and price.
Therefore, for each returned search response, the Semantics Recognition Buyer Agent 20 will perform the data extraction using the vendor descriptions. If the time is out, the Semantics Recognition Buyer Agent 20 will go to step 370, in
After the sorting is done, the Semantics Recognition Buyer Agent 20 will go to step 380. In step 380, the Semantics Recognition Buyer Agent 20 composes HTML pages based on the filtered and sorted data from step 370. In step 390, Semantics Recognition Buyer Agent 20 responds to the user request by presenting the composed HTML page as a “result” page to the user using the ActiveX component established previously.
If time is not out in step 350, the Semantics Recognition Buyer Agent 20 will go to step 358. In step 358, the Semantics Recognition Buyer Agent 20 stores data of the search results pages in the memory or cache for use in instantaneously responding to further new requests of the same user/buyer or to the request of a new user/buyer. Following step 358, the Semantics Recognition Buyer Agent 20 will go to step 360, in which it extracts target information from the search results pages. It then sorts the extracted data retrieved from the different online vendors in step 370 based on the user-defined sorting criteria, composes HTML pages in step 380 based on the filtered and sorted data from step 370, and finally in step 390 responds to the user/buyer using the ActiveX component established previously.
The default language of the Semantics Recognition Buyer Agent 20 is English. By default, the Semantics Recognition Buyer Agent 20 will go to all vendors when it receives the request of a user. When the responses return, the Semantics Recognition Buyer Agent 20 will use the vendor descriptions which have already been learned by the Semantics Recognition Learner Agent 18 to filter out invalid results.
In another embodiment of the present invention, the vendors can be classified to the user's locale so that the user 12 can choose an “advanced search” to search for that classified group of vendors.
The employed methodologies of the present invention are multilingual in nature. When the Semantics Recognition Learner Agent 18 learns a vendor site, such learning can be performed in the native language of that site. The results which are retrieved and used to populate the vendor descriptions for that site will be in the native language of that site. Thus, when an online user/buyer 12 submits a request in a particular native language in step 310 of
The computer program's modules built with the database server's development tool (preferably a Microsoft SQL-compliant database) used in the preferred embodiment of the system of the present invention are standard and the present invention can be used with any relational database, such as SQL database servers from Oracle Corporation of Redwood Shores, Calif., Sybase Corporation of Emeryville, Calif., and others, that support ODBC. As mentioned above, multithreading for concurrent searching is important to the preferred embodiment of the present invention. In that regard, use of a Windows NT 4.0 platform (a product of Microsoft Corporation) can provide such a multithreading capability.
Referring now to
The above information is then saved to the vendor list 27 as training data in the offline database 24. It is to be noted that the training examples that have been entered are a list of specific products which will be searched for during the training process in real-time on the designated identified vendor site from which training pages will be obtained. The “vendor descriptions” will then be “learned” from these returned training pages.
The information can then be displayed on the screen as shown in
Referring now to
In
After the learning/training process is completed, the results of the trained or learned examples which have been returned from its site are displayed on the Learner Interface's screen as illustrated in
“D></TD><TD></TD></TR></”
For the Item description information, the Left Delimiter is identified as the string below:
“G SRC=/Lmg/trans+1X1.gifBORDER-0W1D . . . ”
The Right Delimiter of the Item description is identified as the string: “</b>.”
The Left Delimiter of the Price is identified as the string below:
“</b></A></TD> <TD ALIGN=right><FON . . . ”
Finally, the Right Delimiter of the Price is identified as the string: “</T.”
Although the character strings in
The methodology underlying the Semantics Recognition Learner Agent 18 will now be described in more detail at a “proof-of-concept” level.
Basic Concepts
The wrapper induction problem is framed in the form of a simple model of information extraction as shown in
As exhibited in
A standard relational data model is adopted. Associated with each product record are two distinct attributes, item and price, where “item” represents the product name or model number and “price” represents the price of a product.
A “tuple” is a vector <Ai, Ap> of two strings. String Ai is the value of the “item” attribute, and string Ap is the value of the “price” attribute. Whereas attributes represent columns in the relational model, “tuples” represent rows. Thus, as illustrated in
The content of a page is the set of “tuples” that it contains. The literal string notation is adequate in principle, but since pages have unbounded length, a clearer and more concise representation of a page's content is used instead. Rather than listing the attribute values explicitly, a page's “label” is used to represent the content of the page in terms of a set of indices into the page.
For example, the “label,” Lpc, for the simple product catalogue page (Ppc) is illustrated in the right-hand column of
The “label” Lpc indicates that the simple product catalogue page contains four “tuples,” where each “tuple” consists of item and price values. A pair of integers represents each of the values. Consider the first pair, <174, 180>. These integers indicate that the item attribute of the first “tuple” is the substring between positions 174 and 180, i.e. the string “HM381MD”. Inspection of the character strings of the right-hand side of
More generally, the content of page P can be represented by label L.
For a page with only a single “tuple,” the following label results:
L={<<b1,i, e1,i>, <b1,p, e1,p>>}
Label L encodes the content of page P. The page contains |L|>0 “tuples,” each of which has two attributes, item and price. The integers 1≤m≤|L| index “tuples” within the page. Each pair <bm,i, em,i> encodes an item value, and each pair <bm,p, em,p> encodes a price value. The value bm,i is the index in P of the beginning of the item value in the mth “tuple,” and the value em,i is the end index of the item value in the mth “tuple.” Similarly, the value bm,p is the index in P of the beginning of the price value in the mth “tuple,” and the value em,p is the end index of the price value in the mth “tuple.” Thus, the item attribute of the mth “tuple” occurs between <bm,i, em,i>, and the price attribute of the mth “tuple” occurs between <bm,p, em,p>. Thus, the pair <b2,i, e2,i>=<229, 234> in the example of
As shown above, a wrapper W is a function from a page to a label; the notation W(P)=L indicates that the result of invoking wrapper W on a page P is label L. At this level of abstraction, a wrapper is simply an arbitrary procedure.
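The label representation can be made concrete with a small sketch. The page below is a hypothetical two-row catalogue written for this illustration (it is not the page of the figures, so its indices differ from <174, 180>), the italic tag is spelled `<I>`, and the helper names are assumptions. End indices are inclusive, matching the convention under which a seven-character item spans positions 174 through 180.

```python
from typing import List, Tuple

Span = Tuple[int, int]            # (begin, end) indices into the page, inclusive
Label = List[Tuple[Span, Span]]   # one (item span, price span) pair per tuple

# Hypothetical catalogue page used only for this illustration.
PAGE = (
    "<TABLE>\n"
    "<TR><TD><B>HM381MD</B></TD><TD><I>299.95</I></TD></TR>\n"
    "<TR><TD><B>MD2070</B></TD><TD><I>399.95</I></TD></TR>\n"
    "</TABLE>"
)


def span_of(page: str, value: str, start: int = 0) -> Span:
    """Locate value at or after start and return its inclusive span."""
    begin = page.index(value, start)
    return (begin, begin + len(value) - 1)


def make_label(page: str, rows) -> Label:
    """Build a label from known (item, price) strings, scanning left to right."""
    label, pos = [], 0
    for item, price in rows:
        item_span = span_of(page, item, pos)
        price_span = span_of(page, price, item_span[1] + 1)
        label.append((item_span, price_span))
        pos = price_span[1] + 1
    return label


def content(page: str, label: Label):
    """Recover the tuples that a label denotes."""
    return [(page[bi:ei + 1], page[bp:ep + 1]) for (bi, ei), (bp, ep) in label]


ROWS = [("HM381MD", "299.95"), ("MD2070", "399.95")]
LABEL = make_label(PAGE, ROWS)
```

A wrapper is then any procedure that maps `PAGE` to `LABEL`; `content` is merely the inverse lookup that turns the indices back into strings.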
A wrapper class is a set of wrappers. As will be seen later herein, the wrapper class employed by the present invention is called the HLRT wrapper class.
In view of the foregoing explanation of terminology and conventions used to describe the methodology of the present invention, further explanation will now be provided as to how the learner learns a wrapper for a vendor's product catalog pages.
Intuitively, the input to the learning system of the present invention is a sample of product catalogue pages and their associated “labels.” At this point, it is assumed that the “labels” have already been identified and are given. A further elaboration of the method used to generate labels for a sample page will be provided herein later. The output is a wrapper W ∈ W. Ideally, W outputs the appropriate label for all of the sample pages. In general, such a guarantee cannot be made, so (in the spirit of inductive learning) it is required that W generate the correct labels for a given set of training examples.
Stated formally, the wrapper induction problem (with respect to a particular class W) is as follows:
Input: a set ε={. . . , <Pn, Ln>, . . . } of training examples, where each Pn is a page, and each Ln is a label;
Output: a wrapper W ∈ W, such that W(Pn)=Ln for every <Pn, Ln>∈ε.
The HLRT Wrapper Class
As explained herein earlier, the pcwrapHLRT procedure illustrates a “programming acronym”—using head delimiter, left-hand delimiter, right-hand delimiter and tail delimiter to extract relevant product information and its price from a vendor product catalogue. The Head-Left-Right-Tail (HLRT) wrapper class is one way to formalize this acronym. The procedure “execHLRT” set forth in
Note that although the delimiters in this example are entire HTML tags, the methodology of the present invention is not limited to operating with HTML tags. Furthermore, the text might not be HTML at all. Thus, the dollar sign symbol, “$,” might be valid left delimiter for price such as “$399.95.”
The execHLRT routine specifies how HLRT wrappers behave. Earlier it was stated that W(P) is the label that results from invoking wrapper W on page P. The routine execHLRT is a procedure for determining W(P) from W and P, for the case when W is an HLRT wrapper.
The values of li and ri indicate the left-hand and right-hand delimiters for the item attribute, while lp and rp indicate the left-hand and right-hand delimiters for the price attribute, and h and t indicate the head and the tail of the page, respectively. (Note that h is a line number instead of a string. For example, if h=100, then the first 100 lines of the page are the head, and the Semantics Recognition Buyer Agent 20 may skip these lines immediately when it searches for a product.) For example, if execHLRT is invoked with the parameters h=7, li=“<B>,” ri=“</B>,” lp=“<l>,” rp=“</l>” and t=“</TABLE>,” then execHLRT behaves like pcwrapHLRT.
More generally, any HLRT wrapper for a vendor site is equivalent to a vector of (h, li, ri, lp, rp, t), and any such vector can be interpreted as an HLRT wrapper. Given this equivalence, the notation (h, li, ri, lp, rp, t), is used as shorthand for the HLRT wrapper obtained by partially evaluating execHLRT with the given delimiters.
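The behaviour of an HLRT wrapper given as a vector (h, li, ri, lp, rp, t) can be sketched as below. This is a reconstruction from the prose, not the execHLRT listing of the figures: it assumes h is a count of head lines to skip, that spans use inclusive end indices, and that the loop stops once the tail delimiter precedes the next item delimiter. The page, with the italic tag spelled `<I>`, is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]            # inclusive (begin, end) indices
Label = List[Tuple[Span, Span]]   # (item span, price span) per tuple


def exec_hlrt(page: str, h: int, li: str, ri: str,
              lp: str, rp: str, t: str) -> Label:
    """Apply the HLRT wrapper (h, li, ri, lp, rp, t) to a page."""
    pos = 0
    for _ in range(h):                     # skip the h head lines
        newline = page.find("\n", pos)
        if newline == -1:
            return []
        pos = newline + 1
    label: Label = []
    while True:
        next_item = page.find(li, pos)
        next_tail = page.find(t, pos)
        # Stop when no further item exists or the tail comes first.
        if next_item == -1 or (next_tail != -1 and next_tail < next_item):
            return label
        spans = []
        for left, right in ((li, ri), (lp, rp)):
            start = page.find(left, pos)
            if start == -1:
                return label
            begin = start + len(left)      # value starts after the left delimiter
            end = page.find(right, begin)  # value ends before the right delimiter
            if end == -1:
                return label
            spans.append((begin, end - 1))
            pos = end
        label.append((spans[0], spans[1]))


PAGE = "\n".join([
    "<HTML><TITLE>A Simple Product Catalogue</TITLE><BODY>",
    "<B>A Simple Product Catalogue</B>",
    "<TABLE>",
    "<TR><TD><B>HM381MD</B></TD><TD><I>299.95</I></TD></TR>",
    "<TR><TD><B>MD2070</B></TD><TD><I>399.95</I></TD></TR>",
    "</TABLE>",
    "<B>End</B>",
    "</BODY></HTML>",
])
LABEL = exec_hlrt(PAGE, 3, "<B>", "</B>", "<I>", "</I>", "</TABLE>")
VALUES = [(PAGE[a:b + 1], PAGE[c:d + 1]) for (a, b), (c, d) in LABEL]
```

In this page the head skip keeps the bold title out of the result and the tail delimiter keeps the bold “End” out, which is the role the text assigns to h and t.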
Since an HLRT wrapper is simply a vector (h, li, ri, lp, rp, t), the HLRT wrapper induction example of
variables:
- Head delimiter of the page P: h
- Tail delimiter of the page P: t
- Left delimiter of the item attribute: li
- Right delimiter of the item attribute: ri
- Left delimiter of the price attribute: lp
- Right delimiter of the price attribute: rp
domains: each delimiter is an arbitrary string, except the head delimiter;
constraints: W(Pn)=Ln for every <Pn, Ln>∈ε, where HLRT wrapper W=(h, li, ri, lp, rp, t),
The learnHLRT methodology will now be described which addresses the above constraint satisfaction problem.
Delimiter Candidates
To begin, it is to be noted that the domains of the delimiter variables are tightly constrained by the examples ε. At the very least, the delimiters will be substrings of the example pages. Of course, one can do much better. On the basis of just the single example (Ppc, Lpc), it can be seen that rp (the right-hand delimiter for the price attribute) must be a prefix of “</l></TD></TR>.” (In the strings discussed here, “⇓” indicates a new line character.) By “prefix” is meant a run of consecutive characters starting from the left-most character of the string; for example, “<,” “</,” “</l,” etc.
Note that if rp is not a prefix of this string, then every wrapper with this delimiter will, at the very least, fail to extract “399.95” as the price attribute for Ppc's fourth “tuple.” Thus the candidates for rp are all prefixes of “</l></TD></TR>.” These candidates are illustrated in
In detail, the candidates for the delimiters for the simple product catalogue page are generated as follows:
Candidates for the li and lp
Consider lp, the left-hand delimiter for the price attribute. Recall the fragments “HM381MD</B></TD><TD><l>,” “MD2070</B></TD><TD><l>,” etc., that precede the price in
Delimiter li is more complicated because the string prior to the first attribute occurs between the first attribute and the last attribute of the previous “tuple,” as well as between the head of the page and the first “tuple.” In the example, the strings under consideration are “<TR><TD><B>” and “</l></TD></TR>⇓<TR><TD><B>.” Clearly, li is a suffix of each of these strings. Thus, the candidates for li can be generated by enumerating the suffixes of the shortest such fragment.
To generalize this elaboration, it is concluded that the candidates for delimiter li and lp, given the example set and written candsl(i, p, ε), are generated by enumerating the suffixes of the shortest string occurring to the left of each instance of item attribute or price attribute in each example. (As mentioned in the previous paragraph, the case of the item attribute is somewhat special: the suffixes of the shortest string either between adjacent tuples or before the first tuple must be enumerated.) For example, if ε={(Ppc, Lpc)}, then:
Candidates for the ri and rp
The candidates for the right-hand delimiters are generated similarly. But there are two distinctions. Firstly, the strings under consideration occur to the right of the appropriate attribute (rather than to the left). Secondly, ri and rp must be a prefix (not a suffix) of these strings. For example, in the simple product catalogue example, the delimiter ri must be a prefix of the string “</B></TD><TD><l>,” while rp must be a prefix of both “</l></TD></TR>” and “</l></TD></TR>⇓<TR><TD><B>.”
In particular, the candidates for a right delimiter given the example set ε, written candsr(k, ε), are generated by enumerating the prefixes of the shortest string occurring to the right of each instance of attribute k in each example. (As stated above, li is a special case. Similarly, rp is a special case: the prefixes of the shortest string occurring either between adjacent “tuples” or after the last “tuple” are enumerated.) For example:
A similar analysis applies to delimiters for the head and tail. The “head” is the prefix of the page before the first item attribute occurs. Note that here, the “head” is represented as a string. When a wrapper is in actual implementation, in order to increase the performance of the invention, it is preferable to represent the “head” as an integer so that the human shopper or buyer, in using the wrapper to look for product information, can skip a page's head quickly without looking into the content. To convert a head string to an integer, simply find out the number of lines that are spanned by the head string.
Identifying the delimiter for the “tail” is quite similar to identifying the left delimiters li and lp: the tail candidates are the suffixes of the string after the last price attribute of the page.
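The generation of left-hand and right-hand candidates described above can be sketched as follows. The names `cands_l` and `cands_r` echo candsl and candsr, but their signatures, the helper functions, and the sample page (with the italic tag spelled `<I>`) are assumptions of this illustration. Head and tail candidates are not covered.

```python
def prefixes(s):
    """All non-empty prefixes of s, shortest first."""
    return [s[:n] for n in range(1, len(s) + 1)]


def suffixes(s):
    """All non-empty suffixes of s, longest first."""
    return [s[n:] for n in range(len(s))]


def label_for(page, rows):
    """Helper: inclusive (item span, price span) pairs for known values."""
    label, pos = [], 0
    for item, price in rows:
        ib = page.index(item, pos)
        pb = page.index(price, ib + len(item))
        label.append(((ib, ib + len(item) - 1), (pb, pb + len(price) - 1)))
        pos = pb + len(price)
    return label


def left_contexts(page, label, attr):
    """Text to the left of each instance of attr ('item' or 'price')."""
    out, prev_end = [], -1
    for item, price in label:
        if attr == "item":   # from the page start or the previous tuple's price
            out.append(page[prev_end + 1:item[0]])
        else:                # from the end of this tuple's item
            out.append(page[item[1] + 1:price[0]])
        prev_end = price[1]
    return out


def right_contexts(page, label, attr):
    """Text to the right of each instance of attr ('item' or 'price')."""
    out = []
    for n, (item, price) in enumerate(label):
        if attr == "item":
            out.append(page[item[1] + 1:price[0]])
        else:                # up to the next tuple's item or the page end
            nxt = label[n + 1][0][0] if n + 1 < len(label) else len(page)
            out.append(page[price[1] + 1:nxt])
    return out


def cands_l(examples, attr):
    """Suffixes of the shortest left context over all examples."""
    ctxs = [c for page, label in examples for c in left_contexts(page, label, attr)]
    return suffixes(min(ctxs, key=len))


def cands_r(examples, attr):
    """Prefixes of the shortest right context over all examples."""
    ctxs = [c for page, label in examples for c in right_contexts(page, label, attr)]
    return prefixes(min(ctxs, key=len))


PAGE = ("<HTML><B>A Simple Product Catalogue</B>\n<TABLE>\n"
        "<TR><TD><B>HM381MD</B></TD><TD><I>299.95</I></TD></TR>\n"
        "<TR><TD><B>MD2070</B></TD><TD><I>399.95</I></TD></TR>\n"
        "</TABLE></HTML>")
EXAMPLES = [(PAGE, label_for(PAGE, [("HM381MD", "299.95"), ("MD2070", "399.95")]))]
```

Restricting each delimiter to the shortest context keeps the candidate sets small, since a valid delimiter must fit every instance, including the one with the least surrounding text.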
Given these candidates for each delimiter, a module of the simple method in pseudo-code for learning these two delimiters is provided in
Because the module runs in time proportional to the product of the number of candidates for each delimiter, and because each delimiter can have many candidates, execution time can be slow.
A more efficient processing can be accomplished by observing that the delimiters ri, lp, rp are mutually independent. Furthermore, whether a candidate is valid for a particular delimiter in no way depends on any other delimiters. For example, it can be evaluated whether “</B>” is satisfactory for ri without reasoning about any of the other delimiters.
To see that this independence property holds, recall the execHLRT procedure. At each point in its execution, execHLRT is searching its input page P for exactly one of the delimiters ri, lp, rp. If any of these searches fails to identify the correct location in P, then the label output by execHLRT will be incorrect. But whether these searches return the right answer depends only on the delimiter under consideration and the example pages, not on the other delimiters.
Put another way, once a particular candidate is chosen for some delimiter (ri, lp, or rp), there is no way the candidate can be made invalid, no matter what candidates are selected for the other delimiters. The contrapositive of this assertion also makes intuitive sense: if a candidate is invalid, there is no way to repair it, no matter how carefully candidates are selected for other delimiters. Note that this independence property is guaranteed; it is not merely a heuristic that facilitates learning.
The significance of this observation is that the three delimiters, ri, lp, rp, can be learned in isolation. In pseudo-code, they can be learned as follows:
1. Generate the candidate sets
2. For each delimiter, select a valid candidate.
This methodology is much faster than the procedure of
However, it is also observed that not all the delimiters are mutually independent. In contrast, as to delimiters h, t, and li, whether a particular character string is valid for one of these three delimiters depends on the choice for the other two. For example, is “<B>” valid for li? The answer depends on the choice for h and t. If h=“<HTML>,” then “<B>” is not a valid delimiter for li because execHLRT will not skip the irrelevant bold text “<B>A Simple Product Catalogues</B>.” On the other hand, if h=“</TH></TR>,” then li=“<B>” causes no problem. Similarly, li and t interact: li=“<B>” is unacceptable if t=“</HTML>,” but acceptable if t=“</TABLE>.” As a result, candidates for the three delimiters h, t and li must be considered jointly. Thus, all combinations of candidates for h, t and li are enumerated, and the valid ones are selected.
Candidates Validity
The second step of this improved methodology involves precisely characterizing the conditions under which a delimiter candidate is valid.
Consider first the delimiters ri and rp. After the method has identified the beginning of some instance of the attribute, the method attempts to locate the end of that instance of the attribute. Thus a candidate “u” for delimiter ri or rp must satisfy two constraints:
- Constraint C1: “u” must not be a substring of any instance of an attribute in any of the example pages.
- Constraint C2: “u” must be a prefix of the text that occurs immediately following each instance of the attribute in every example page.
If these constraints are violated by a candidate “u” for delimiter ri or rp, then every wrapper will fail for at least one of the examples ε. If constraint C1 is violated, then attribute k will be too short; if C2 is violated, it will be too long.
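Constraints C1 and C2 translate almost directly into a test. The sketch below is an illustration under stated assumptions: `valid_r` takes the attribute as an index (0 for item, 1 for price) rather than the (u, r, ε) signature of the text, labels use inclusive end indices, and the page (with the italic tag spelled `<I>`) and helper are hypothetical.

```python
def label_for(page, rows):
    """Helper: inclusive (item span, price span) pairs for known values."""
    label, pos = [], 0
    for item, price in rows:
        ib = page.index(item, pos)
        pb = page.index(price, ib + len(item))
        label.append(((ib, ib + len(item) - 1), (pb, pb + len(price) - 1)))
        pos = pb + len(price)
    return label


def valid_r(u, examples, attr):
    """True if candidate u satisfies C1 and C2 for a right-hand delimiter.

    attr is 0 for the item attribute (ri) and 1 for the price attribute (rp).
    """
    if not u:
        return False
    for page, label in examples:
        for spans in label:
            begin, end = spans[attr]
            value = page[begin:end + 1]
            following = page[end + 1:]
            if u in value:                    # C1: u must not occur inside the value
                return False
            if not following.startswith(u):   # C2: u must start the text that follows
                return False
    return True


PAGE = ("<TABLE>\n"
        "<TR><TD><B>HM381MD</B></TD><TD><I>299.95</I></TD></TR>\n"
        "<TR><TD><B>MD2070</B></TD><TD><I>399.95</I></TD></TR>\n"
        "</TABLE>")
EXAMPLES = [(PAGE, label_for(PAGE, [("HM381MD", "299.95"), ("MD2070", "399.95")]))]

# Because ri and rp are independent of the other delimiters, each can be
# chosen on its own by filtering a candidate list with valid_r.
RI_CHOICES = [u for u in ("</TD>", "</B>", "<") if valid_r(u, EXAMPLES, 0)]
```

Filtering each delimiter separately costs time proportional to the sum of the candidate set sizes rather than their product.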
In summary, of interest are the conditions that must hold if some candidate “u” is to be valid as a value for delimiter ri or rp, with respect to a given set of examples ε. These conditions will be referred to as validr(u, r, ε). It is seen that validr(u, r, ε) holds if and only if candidate “u” satisfies constraints C1 and C2 for delimiter ri or rp with respect to example set ε. Returning to the example, if the validr test is applied to the candidates generated by candsr, it is found that:
For the right delimiter of the item attribute:
Constraints on lp
The execHLRT procedure searches for delimiter lp. A candidate “u” for the delimiter lp must satisfy the following constraint:
- Constraint C3: “u” must be a proper suffix of the text that occurs immediately before each instance of the attribute k in every example page.
If this constraint is violated, then every wrapper will disagree with the examples ε. At least one of the starting indices bm,p computed by execHLRT will be incorrect: either less or greater than the correct value, or undefined, depending upon how “u” violates the constraint.
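Constraint C3 can be checked as sketched below. The text does not define “proper suffix”; this sketch assumes it means a suffix that occurs nowhere else in the preceding text, so that a left-to-right search lands exactly at the start of the price. The function name, page (with the italic tag spelled `<I>`) and helper are likewise hypothetical.

```python
def label_for(page, rows):
    """Helper: inclusive (item span, price span) pairs for known values."""
    label, pos = [], 0
    for item, price in rows:
        ib = page.index(item, pos)
        pb = page.index(price, ib + len(item))
        label.append(((ib, ib + len(item) - 1), (pb, pb + len(price) - 1)))
        pos = pb + len(price)
    return label


def valid_l(u, examples):
    """True if candidate u satisfies C3 for the price's left delimiter lp."""
    if not u:
        return False
    for page, label in examples:
        for (_, item_end), (price_begin, _) in label:
            before = page[item_end + 1:price_begin]  # text between item and price
            if not before.endswith(u):
                return False      # not a suffix: the begin index would be wrong
            if before.find(u) != len(before) - len(u):
                return False      # occurs earlier too: the search would stop short
    return True


PAGE = ("<TABLE>\n"
        "<TR><TD><B>HM381MD</B></TD><TD><I>299.95</I></TD></TR>\n"
        "<TR><TD><B>MD2070</B></TD><TD><I>399.95</I></TD></TR>\n"
        "</TABLE>")
EXAMPLES = [(PAGE, label_for(PAGE, [("HM381MD", "299.95"), ("MD2070", "399.95")]))]
LP_CHOICES = [u for u in ("<TD><I>", "<I>", "I>", ">") if valid_l(u, EXAMPLES)]
```

The candidate “>” illustrates the “greater or less than the correct value” failure: it is a suffix of the preceding text, but its first occurrence closes the “</B>” tag, so the computed start index would fall too early.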
In summary, of interest are the conditions that must hold if some candidate “u” is to be valid as a value for delimiter lp according to example set ε. These conditions are referred to as validl(u, l, ε). It is seen that validl(u, l, ε) holds if and only if candidate “u” satisfies constraint C3 for delimiter lp with respect to ε. Returning to the simple product catalogue example Ppc, it is seen that:
To determine whether a particular combination of candidates Uh, Ut, and Uli for h, t, and li are satisfactory, the constraints below are applied:
- Constraint C4: Uh must be a proper suffix of the portion of every page's head.
- Constraint C5: Uh must be a proper suffix of the portion of every page's head after the first occurrence of Uh.
- Constraint C6: Ut must not occur between the first occurrence of h in any page and the subsequent occurrence of li.
- Constraint C7: Ut must be a substring of every page's tail.
- Constraint C8: Uli must not occur before t in every page's tail.
- Constraint C9: Uli must be a proper suffix of the text between “tuples” in every page.
- Constraint C10: Ut must not occur before Uli in the text between “tuples” in any page.
HLRT Induction
With this background in place, procedure learnHLRT will now be described. A table of the detailed procedure as well as the related subroutines are provided in
In the earlier description of the present invention it has been assumed that the training data was already in place for use by the Semantics Recognition Learner Agent 18. That is, a set ε={. . . , <Pn, Ln>, . . . } of training examples was already in existence, where each Pn is a page and each Ln is a label. To further understand how the Learner 18 works with the training examples, reference is made to
As stated earlier, at the shopping/buying phase, the Semantics-Recognition Buyer Agent 20 can perform five different functions which are as follows:
- (1) To compose Labels (LabelOracle) using a modular heuristic search methodology as described in FIG. 15C. These are referred to as recognizers, one of which is an item recognizer and the other an intelligent price recognizer.
- (2) Because it is inefficient for the Semantics Recognition Buyer Agent 20 to retrieve the vendor descriptions data each time it receives a request from the online human buyer or user, such descriptions will only be retrieved from the database (preferably a Microsoft Access database) or SQL-compliant database server 22 if it is the first time the Semantics Recognition Buyer Agent 20 requests the search-match-extraction of a desired set of them. Then the vendor descriptions will be stored in the memory or cache for more instantaneous retrieval in later Semantics Recognition Buyer Agent 20 requests.
- (3) The vendor descriptions in memory or cache preferably will be automatically updated once a day.
- (4) The system of the present invention creates multiple threads and simultaneously dispatches several Semantics Recognition Buyer Agents to contact various designated online vendor sites through the World Wide Web. The use of this multi-threading methodology is preferably built on top of DCOM technology offered by Microsoft Corporation. Each Semantics Recognition Buyer Agent 20 intelligently fills in the vendor's search form with the product information provided by the human buyer or user and presses “enter” virtually.
- (5) On the other hand, the Semantics Recognition Buyer Agent 20 of the present invention addresses heavy network traffic on the World Wide Web, now dominating the whole process of a shopper/buyer's online purchase, by speeding up vendor response times and allocating returned search result pages to separate memory locations as enabled by multi-threading.
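Functions (2) and (3) above amount to a read-through cache with a one-day lifetime, which can be sketched as follows. The class name, the `load_from_db` callable, and the injectable `clock` are hypothetical. The sketch also assumes a stale entry is refreshed lazily on its next access, whereas the text only states that the update happens automatically once a day.

```python
import time

ONE_DAY_S = 24 * 60 * 60


class VendorDescriptionCache:
    """Keep vendor descriptions in memory and reload them after a day."""

    def __init__(self, load_from_db, max_age_s=ONE_DAY_S, clock=time.monotonic):
        self._load = load_from_db   # callable(vendor) -> description (hypothetical)
        self._max_age_s = max_age_s
        self._clock = clock         # injectable so the example can fake time
        self._entries = {}          # vendor -> (time loaded, description)

    def get(self, vendor):
        now = self._clock()
        entry = self._entries.get(vendor)
        # First request, or the cached copy is at least a day old: hit the database.
        if entry is None or now - entry[0] >= self._max_age_s:
            entry = (now, self._load(vendor))
            self._entries[vendor] = entry
        return entry[1]


# Usage with a fake database and a fake clock.
DB_READS = []
FAKE_NOW = [0.0]


def fake_load(vendor):
    DB_READS.append(vendor)
    return {"vendor": vendor, "li": "<B>", "ri": "</B>"}


CACHE = VendorDescriptionCache(fake_load, clock=lambda: FAKE_NOW[0])
CACHE.get("onlineshop")         # first request: read from the database
CACHE.get("onlineshop")         # second request: served from memory
FAKE_NOW[0] = ONE_DAY_S + 1.0   # a day later
CACHE.get("onlineshop")         # stale: read from the database again
```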
Obtaining a training page involves making an example query to a vendor Website. For example,
The algorithm in which the recognizer, referred to as Label (LabelOracle), is composed from modular heuristics search methodology, will now be described in greater detail. A recognizer finds instances of a particular attribute on a page. For example, given the example page of
For example, again, given the example of
Recognizing the “items” is a simple pattern-matching problem if the item recognizer knows all items in advance. Nevertheless, this is infeasible because this requires a big list of item names/model numbers. Moreover, it is costly to maintain such a large database of items. Thus, it is not practical to guarantee that such a list of item names/model numbers is complete and up-to-date.
Fortunately, vendors attempt to create a sense of identity by using a uniform look for all products. For example, a vendor presents mini-disc (MD) product information in the same format as for a DVD product. By taking advantage of this regularity, it is assumed that every product is described in the same format.
The present invention learns a wrapper only from a specific domain of examples and attempts to fit this domain to all other domains in foreign languages organized in a consistent format globally on the Internet. In the preferred embodiment, the training examples solely originate from one domain, such as the MD domains of the vendor Websites. This results in the item recognizer that merely needs to recognize a specific domain of product, such as MD. In this manner, it is then feasible to maintain a thoroughly updated nomenclature of a specific domain of item names.
The present invention identifies a “price” by invoking a modular heuristic search. For instance, a price always follows a dollar sign ($), and a price is often a floating-point number. If more than one price is found for an item, keywords such as “your price,” “our price,” “list price,” and “original price” are then extracted accordingly.
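The price heuristics above can be sketched as follows. This is a minimal illustration in Python and not the code of the listing appendix; the function name find_prices, the regular expression, and the rule of attaching the nearest preceding keyword to each price are assumptions made for the example.

```python
import re

# Keywords named in the text; treating them as labels for a price is an assumption.
PRICE_KEYWORDS = ("your price", "our price", "list price", "original price")

# Heuristic: a dollar sign followed by digits, optional thousands commas,
# and an optional fractional part.
PRICE_PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def find_prices(text):
    """Return a list of (label, amount) pairs found in text.

    label is the closest keyword appearing before the price, or None.
    """
    results = []
    lowered = text.lower()
    for match in PRICE_PATTERN.finditer(text):
        amount = float(match.group(1).replace(",", ""))
        preceding = lowered[:match.start()]
        label, best_pos = None, -1
        for keyword in PRICE_KEYWORDS:
            pos = preceding.rfind(keyword)
            if pos > best_pos:
                label, best_pos = keyword, pos
        results.append((label, amount))
    return results
```

For a fragment such as "List Price: $299.00 Our Price: $249.99", the sketch pairs each amount with the keyword that precedes it, so a caller can choose which of several prices to report.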
Detailed Steps of the Shopping Phase
As described briefly earlier, the mechanism by which the Semantics Recognition Buyer Agent 20 works is illustrated in FIGS. 14 and 15A-15C. The flow of control consists of eight (8) steps that are labeled in the graphical diagram.
Step (1)
When a user identifies the need for a particular product or service, instead of browsing different multilingual vendor sites on the World Wide Web one by one in a manual search for product information and price, the present invention provides a portal through which the request for product information is entered once through an interactive-agent-character graphical user interface (IACGUI), commonly known as an interactive-agent-character shopper/buyer interface, to achieve the same purpose with better, faster, and more reliable results.
The resulting product description is stored in the member variable—m_ProdDesc—of the SRBA 20. The search also enables the user to customize how the agent works through the “Advanced Search” function, which provides selectable parameters such as vendors of choice, the time out (limit), price range, any manufacturers, keywords, and so forth.
Step (2)
Assuming, for example, that the online buyer or user's platform uses a Windows 98 Operating System and is running a language version “B” (or preferably his or her platform is running Windows 2000 Professional in the English version and/or his or her platform has a Personal Web Server version “B” installed), the Microsoft Internet Explorer will prompt him or her to download language “B” display software after he or she logs on to the portal of the present invention. After the online buyer or user “A” enters a product model in the native characters of language “B” as a keyword in the text box provided at the portal of the present invention, the Semantics Recognition Buyer Agent 20 in
Recall that in step 7 of
The vendor descriptions in the memory or cache preferably are automatically updated once a day.
Step (3)
Using the retrieved vendor descriptions, the system of the present invention creates multiple threads and simultaneously launches several Semantics Recognition Buyer Agents 20 to contact the various designated online vendor sites through the World Wide Web.
Step (4)
The use of this multi-threading methodology is preferably built on top of DCOM technology offered by Microsoft Corporation. Each Semantics Recognition Buyer Agent intelligently fills in the vendor's search form with the product information provided by the human buyer or user and virtually presses “enter.”
Step (5)
Each vendor then returns a search result page having the information of the requested product or an error message.
Steps (6), (7)
The search results pages are returned to the Semantics Recognition Buyer Agents 20 through the World Wide Web. It is noteworthy that several results pages may arrive back at the Semantics Recognition Buyer Agents 20 at the same time. The Semantics Recognition Buyer Agents 20 of the present invention address heavy network traffic on the World Wide Web which currently dominates the whole process of a shopper/buyer's purchase by speeding up vendor response times and allocating returned search result pages to separate memory locations as enabled by multi-threading.
Step (8)
The Semantics Recognition Buyer Agent 20 analyzes the returned pages according to the corresponding vendor descriptions. Relevant information and data are extracted from the returned pages and are displayed in a formatted output fashion as shown in
Referring to
Preferably, developing the Semantics Recognition Buyer Agent 20 as an ActiveX Component produces several advantages. Firstly, overall performance can be improved. Writing the Semantics Recognition Buyer Agent 20 in Visual C++ permits the agent to be robust and makes available the powerful functionality of the ActiveX Component. There is no need to supply workaround solutions in the HTML and scripting code to meet the needs of the application. With the ActiveX Component, the agent can be run by adding a few lines of the code in the HTML file in the client side, while leaving aside all complex processes to be executed in the server side.
Secondly, ActiveX Components provide reusability to other applications instead of copying similar functions in every application module. An ActiveX component can be created to be accessible to all Active Server Pages modules. In other words, it is not required that all the logic be coded in ASP modules. Thus, this eliminates the redundancy in the application. Albeit the Semantics Recognition Buyer Agent is created within a single application, it does not hinder the ability to integrate with other applications as well. Furthermore, this feature can help reduce the development time significantly.
Thirdly, it is beneficial to connect an ASP Component to DLL (Dynamic Link Library) files as they are compiled and linked independently. No additional recompilation and re-linkage are needed to update the ASP Component. Hence, speed improvements or new functionality in a later upgraded version can benefit the ActiveX Components that use the DLLs. Besides, DLLs can reduce memory and disk space requirements by sharing a single copy of common code and resources among multiple modules.
If there are several components using the same static link libraries, several identical copies of the library are required to be stored and executed. Then there will be several identical copies in memory if they run simultaneously. So, it is obvious that using static link libraries may result in redundancy and space wastage.
Only one copy of the code and resources is needed if a DLL is used in lieu of static link libraries. This keeps the server workload to a minimum when there are many concurrent connections from the Internet.
The Semantics Recognition Buyer Agent 20 is preferably an ActiveX Component that is developed as an in-process DLL. It allows a user to create an object of the SRBA through the World Wide Web. ASP acts as the gateway for communication between the user and the server.
ASP is an open application environment in which HTML pages, scripts, and ActiveX Components are combined to create Web-based applications. In addition, it is built as an Internet Server Application Program Interface (ISAPI) that runs on top of the Internet Information Server (IIS) product of Microsoft Corporation, or on a peer Web server relative of IIS.
To implement ASP, Microsoft ActiveX Scripting is used, such as Visual Basic (VB) script that is used in the process of managing ActiveX Components. It makes the language dynamic by adding the ability to invoke ActiveX Components running on the server as DLLs.
Program Logic—Create an object of the Semantics-Recognition Buyer Agent
An object of the Semantics Recognition Buyer Agent 20 is created when the user starts searching for the price of the desired product.
In the Active Server Page, the module in pseudo-code is coded as follows:
The above module creates an object of the Semantics Recognition Buyer Agent component 20 when the page is loaded, while NextGen is the name of the ActiveX Component. Semantics-Recognition Buyer is the name of the Agent in the NextGen component.
Connection Between the User and the Server
A connection is established between the user and the server after an instance of the Semantics Recognition Buyer Agent 20 has been created, as shown in
The Semantics Recognition Buyer Agent uses a connectable object to maintain a “one-to-one” connection (channel) through which the user communicates with the server, such as a user request to the server to compare prices. The Outgoing Interface is used by the Semantics Recognition Buyer Agent 20 as the connection (channel) through which the server communicates with the user, such as a response in which the server returns the requested search result to the user. The user can access the properties and invoke the methods of the Semantics Recognition Buyer Agent through the IConnectionPoint.
The Semantics Recognition Buyer Agent 20 employs the methods set forth below:
1. OnStartPage (Unknown Agent)
This method initializes the object of the Semantics Recognition Buyer Agent and is called automatically when the ASP is loaded.
2. OnEndPage( )
This method terminates the object of the Semantics Recognition Buyer Agent and is called automatically when the ASP is unloaded.
3. GetSearch (BStr input, BStr *output)
This method searches the Internet for the price of the required product after the user has provided the product description, such as a model number. Input is the product description supplied by the user, whereas Output is the search result page. The syntax to call this method is:
OutputName=AdObjectName.GetSearch(“Product Name”)
In the above code, AdObjectName is the object's instance name, “Product Name” is the name of the product whose price the buyer/user wants to compare, and OutputName is the variable that receives the returned value. Refer to the pseudo-code below as an example:
result=Agent.GetSearch(“Radar detector”)
Register the component
Before the user can initiate an object of the Semantics-Recognition Buyer Agent, its component must be registered in the server by the following command:
Register path\Nextgen.dll
where path is the absolute path of the directory in which Nextgen.dll is saved.
Response to the User Request
As the object of the Semantics Recognition Buyer Agent calls the GetSearch method through the IConnectionPoint, an instance of the Semantics-Recognition Buyer Agent on the server machine executes the Dynamic Link Library (DLL). See
Connecting Database
A Data Source Name (DSN), user identifier (ID), and password must be provided to connect to the SQL Server through ODBC. RETCODE is a variable that stores the value returned from the SQL Server. SQL_SUCCESS indicates a successful retrieval.
Execute an SQL Query
Before the desired data can be retrieved from the SQL Server, a query must be specified.
Retrieving Fields
After the query executes successfully, the vendor descriptions are stored in an array called vendor_description. Each element of this array has two member variables: a wrapper and a vendor URL.
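A minimal sketch of the retrieval steps above, written in Python with the standard sqlite3 module standing in for the ODBC connection to the SQL Server. The table name, column names, keyword filter, and sample rows are hypothetical; only the two members of each vendor description (a wrapper and a vendor URL) come from the text.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class VendorDescription:
    wrapper: str  # extraction rules for the vendor's result pages
    url: str      # vendor search form URL


def make_demo_database():
    """Build an in-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendor_description (domain TEXT, wrapper TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO vendor_description VALUES (?, ?, ?)",
        [
            ("MD players", "{7,<B>,</B>,<I>,</I>,</TABLE>}", "http://vendor-a.example/search"),
            ("DVD players", "{3,<B>,</B>,<I>,</I>,</TABLE>}", "http://vendor-b.example/search"),
        ],
    )
    return conn


def load_vendor_descriptions(conn, keyword):
    """Run the query and copy each row into a VendorDescription."""
    rows = conn.execute(
        "SELECT wrapper, url FROM vendor_description WHERE domain LIKE ?",
        ("%" + keyword + "%",),
    ).fetchall()
    return [VendorDescription(wrapper=w, url=u) for w, u in rows]
```

With the demo rows, load_vendor_descriptions(make_demo_database(), "MD") yields a single description whose url member points at the hypothetical vendor-a site.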
Filling-in the Forms
If there are N vendor descriptions, the buyer agent will initiate N threads, one to fill in the search form of each vendor specified in the vendor descriptions.
To run each thread, the syntax is:
For each thread, the time limit is preferably about 5 seconds. If the vendor does not return a result within 5 seconds, that vendor is discarded for the current search; otherwise, the result is stored in memory for use by the next process.
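The one-thread-per-vendor scheme with a time limit can be sketched as follows. This is an illustration in Python using the standard concurrent.futures module rather than the DCOM-based threading of the preferred embodiment; query_all_vendors and demo_fetch are hypothetical names, and demo_fetch merely simulates a vendor request.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

TIME_LIMIT_SECONDS = 5.0  # the preferred limit stated in the text


def demo_fetch(url):
    """Hypothetical stand-in for a vendor request; 'slow' URLs answer late."""
    time.sleep(0.5 if "slow" in url else 0.0)
    return "<HTML>results from " + url + "</HTML>"


def query_all_vendors(vendor_urls, fetch, time_limit=TIME_LIMIT_SECONDS):
    """Call fetch(url) in one thread per vendor and keep timely results.

    Returns a dict mapping each vendor URL to its result page. Vendors that
    exceed the time limit, or whose fetch raises, are left out.
    """
    if not vendor_urls:
        return {}
    executor = ThreadPoolExecutor(max_workers=len(vendor_urls))
    futures = {executor.submit(fetch, url): url for url in vendor_urls}
    done, _late = wait(list(futures), timeout=time_limit)
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    # Do not wait for late vendors; they are discarded for this search.
    executor.shutdown(wait=False)
    return results
```

Calling query_all_vendors(urls, demo_fetch, time_limit=0.1) with one fast and one slow URL returns only the fast vendor's page, mirroring the rule that a late vendor is dropped for the current search.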
When the user enters a keyword of the purchase request in the box provided on the portal of the present invention, it is determined whether there are any related vendor descriptions, i.e., vendor descriptions which contain the keyword. All related vendor descriptions, containing wrappers and URLs, are then retrieved from the offline database. Afterwards, the Semantics Recognition Buyer Agent 20 goes to each online vendor's searchable index in parallel, fills it in, and submits it to the vendor site. For each vendor site, the Buyer Agent calls the member function httpPost to complete the task. The httpPost member function posts a URL and form data to a vendor according to the vendor description and returns the HTML response as a string variable. The httpPost member function returns a Boolean value, where true indicates a successful retrieval of the HTML document and false indicates that an error has occurred. If the return value is true, the generated item name and price will be extracted from the HTML document. The flow of posting a form is shown in
In step 1002 a CInternetSession object for the session is created. The CInternetSession class connects to a server for an Internet session. Typically this class is used early in a session to establish a connection to a Web server.
In step 1004 a CHttpConnection object is created by calling the CInternetSession object's GetHttpConnection member function. The CHttpConnection class establishes an HTTP connection with a server.
In step 1006 a CHttpFile object is created by calling the OpenRequest member function of the CHttpConnection object. The CHttpFile class lets a file transfer over the Internet be treated as if working with a local disk file. It works with the CHttpConnection object to read or write Internet data.
Step 1008 calls the SendRequest member function of the CHttpFile object to send the POST request and form data to the remote HTTP server.
Steps 1010, 1012 and 1014 repeatedly call the CHttpFile object's Read member function, which returns chunks of response data to the program. When Read returns 0, no data is left to retrieve.
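The posting flow of steps 1002 through 1014 can be approximated with the following sketch. It is written in Python with the standard urllib modules in place of the MFC classes, so it is an analogue rather than the code of the listing appendix; the names build_post_request, read_response, and http_post, the 1024-byte chunk size, and the UTF-8 decoding are assumptions.

```python
import urllib.parse
import urllib.request


def build_post_request(search_form_url, form_fields):
    """Prepare a POST request carrying URL-encoded form data."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    return urllib.request.Request(
        search_form_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def read_response(response, chunk_size=1024):
    """Read chunks until read() returns nothing, as in steps 1010 to 1014."""
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty chunk means no data is left to retrieve
            break
        chunks.append(chunk)
    return b"".join(chunks)


def http_post(search_form_url, form_fields, timeout=5.0):
    """Return (ok, html): ok is False when a network error occurs."""
    request = build_post_request(search_form_url, form_fields)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            html = read_response(response).decode("utf-8", errors="replace")
            return True, html
    except OSError:  # urllib.error.URLError and socket timeouts derive from OSError
        return False, ""
```

The (ok, html) pair plays the role of the Boolean return value and the string response described for httpPost; a caller would proceed to extraction only when ok is true.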
Extracting Price
After getting the result pages, the Semantics Recognition Buyer Agent 20 will match each result page against the generalized failure template. If the page does not match the template, it is assumed to be a successful search. Then the buyer agent 20 will use the wrapper for the corresponding vendor to strip header and trailer information from the successful pages. For example, assume that a user searches for an MD product with the model number MD203, that a given wrapper is {7,<B>,</B>,<I>,</I>,</TABLE>}, and that the result page is shown below.
In the wrapper, the useful information starts from line 7 and ends at </TABLE>, so the Semantics Recognition Buyer Agent 20 will cut the useless information before extracting the model number and price. The HTML file, after header and trailer information are stripped away, is
Then the Semantics Recognition Buyer Agent 20 will use pattern matching to extract the model number and the price of the product. In the wrapper, the pattern for the model number is <B>*</B> and the pattern for the price is <I>#</I>, where * represents the model number and # represents the price. The agent will first extract the model number HM381MD and compare it with the user's requested model number “MD203.” As it does not match, the Semantics Recognition Buyer Agent 20 looks for another model number until it finds the model number MD203. After the model number is found, the Semantics Recognition Buyer Agent 20 uses the price pattern to extract the first price after the model number. When the model number and price have been extracted, the Semantics Recognition Buyer Agent 20 stops extracting information from the page and puts the information into an array called array_item[].
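The stripping and pattern-matching procedure can be sketched as follows, in Python. The sketch assumes that the six wrapper fields are, in order, the first useful line, the opening and closing model delimiters, the opening and closing price delimiters, and the trailer marker, and that the price delimiters are the italic tags <I> and </I>. The names Wrapper, strip_header_and_trailer, and extract_item are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    start_line: int   # 1-based line where useful information starts
    model_open: str
    model_close: str
    price_open: str
    price_close: str
    trailer: str      # marker where useful information ends


def strip_header_and_trailer(page, wrapper):
    """Drop the lines before start_line and everything from the trailer on."""
    body = "\n".join(page.splitlines()[wrapper.start_line - 1:])
    end = body.find(wrapper.trailer)
    return body if end == -1 else body[:end]


def _next_between(text, open_tag, close_tag, start):
    """Return (value, position after close_tag) or None if no match remains."""
    begin = text.find(open_tag, start)
    if begin == -1:
        return None
    begin += len(open_tag)
    end = text.find(close_tag, begin)
    if end == -1:
        return None
    return text[begin:end].strip(), end + len(close_tag)


def extract_item(page, wrapper, wanted_model):
    """Return (model, price) for the first matching model, else None."""
    body = strip_header_and_trailer(page, wrapper)
    pos = 0
    while True:
        found = _next_between(body, wrapper.model_open, wrapper.model_close, pos)
        if found is None:
            return None
        model, pos = found
        if model == wanted_model:
            price = _next_between(body, wrapper.price_open, wrapper.price_close, pos)
            return None if price is None else (model, price[0])
```

Given a page whose table starts on line 7 and lists HM381MD before MD203, extract_item skips the first model and returns the first price that follows MD203. Exact, case-sensitive comparison of model numbers is a simplification.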
Critical Section
The array_item[] is shared data for those N threads, and all the threads can access this member variable. There is a risk that more than one thread accesses the array_item[] at the same time, which causes an access violation. To keep this shared data in a consistent state, a critical section is used to prevent more than one thread from modifying the data at the same time. It is declared as:
CCriticalSection m_csDoor;
Before inserting an element into the array_item, the line
m_csDoor.Lock( );
is added, which is used to start the critical section. All the variables inside the critical section will be locked to prevent other threads from accessing the particular variable. After finishing the insertion, the line
m_csDoor.Unlock( );
is added, which is used to mark the end of the critical section. All the locked variables will be unlocked to allow other threads to access the member variable. By doing so, the member variable array_item can be safely shared by all threads.
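The locking discipline described above can be sketched as follows, in Python. Here threading.Lock stands in for the MFC CCriticalSection and a list stands in for array_item[]; the class ItemStore and the function collect_in_parallel are hypothetical names used only for this illustration.

```python
import threading


class ItemStore:
    """Shared result list guarded by a lock."""

    def __init__(self):
        self._door = threading.Lock()  # plays the role of m_csDoor
        self._items = []               # plays the role of array_item[]

    def insert(self, model, price):
        # Entering the with-block acquires the lock; leaving it releases
        # the lock, even if the append raises.
        with self._door:
            self._items.append((model, price))

    def snapshot(self):
        """Return a copy of the items taken while holding the lock."""
        with self._door:
            return list(self._items)


def collect_in_parallel(results):
    """Insert each (model, price) pair from its own thread."""
    store = ItemStore()
    threads = [threading.Thread(target=store.insert, args=pair) for pair in results]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.snapshot()
```

The with statement pairs every acquisition with a release, which corresponds to bracketing the insertion between the Lock and Unlock calls.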
Sorting the Price
In a specified time interval, the array sort_item which stores the product price will be sorted by a quick sort method.
The quick sort method can be implemented as follows:
A “key” value of the structure is selected to be positioned in every recursion of the code. The function then repeatedly scans through the structure in two directions. Values less than the key are passed to the left side of the structure and those greater are passed to the right side. These “left to right” and “right to left” scans and swaps continue until a flag condition tells them to stop.
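One way to realize the two-direction scan described above is sketched below, in Python. It is a common partition-exchange variant and not necessarily the routine of the listing appendix; choosing the middle element as the key value and the parameter name value_of are assumptions.

```python
def quick_sort(items, value_of=lambda item: item, low=0, high=None):
    """Sort items in place in ascending order of value_of(item)."""
    if high is None:
        high = len(items) - 1
    if low >= high:
        return
    key_value = value_of(items[(low + high) // 2])
    left, right = low, high
    while left <= right:
        # Left-to-right scan: skip values already smaller than the key.
        while value_of(items[left]) < key_value:
            left += 1
        # Right-to-left scan: skip values already larger than the key.
        while value_of(items[right]) > key_value:
            right -= 1
        if left <= right:
            items[left], items[right] = items[right], items[left]
            left += 1
            right -= 1
    # The scans have crossed; sort the two sides recursively.
    quick_sort(items, value_of, low, right)
    quick_sort(items, value_of, left, high)
```

Sorting a list of (vendor, price) pairs with value_of=lambda offer: offer[1] orders the offers from the lowest price to the highest.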
Return Response to the User
An HTML file, which is stored in the member variable m_output, is returned to the user and displays the sorted results of the search conducted by the SRBA 20.
Submitted with this application on the afore-stated compact disc is a computer program listing appendix which provides code sections that implement selected features of the present invention. In particular, in the portion labeled “3.1 The Learning Phase,” source code is provided for “3.1.1 Main COOSA Application Class” (the main class file for the COOSA application); “3.1.2 Add Vendor Class” (adding a vendor class to the database); “3.1.3 COOSADoc Class” (invocation of the display of documents and screens for the Semantics Recognition Learner Interface); “3.1.4 COOSA View Class” (the screens of the Learner Interface and of its functions); “3.1.5 Training Data Class” (invocation of the Semantics Recognition Learner Agent); and “3.1.6 Vendor Class” (declaration of the labeling algorithm to process through all vendor Web pages). In the portion labeled “Shopping Phase,” source code is provided for “3.2.1 Agent Class” (declaration of the Semantics Recognition Buyer Agent) and “3.2.2 Thread Process” (part of the Semantics Recognition Buyer Agent's process).
Referring now to
It is to be further understood that while the present invention has been described in terms of the Internet and the World Wide Web, the present invention is equally suitable for use in recently introduced systems and next generation systems. For example, the wireless application development tool J2ME (Java 2 Platform, Micro Edition) can be used to incorporate the online intelligent multilingual and domain-independent price comparison capabilities into mobile/wireless platforms, including all models of 3G or Web phones, Interactive and Ultimate TVs, pocket PCs, Palm organizers, all-in-one Web-enabled Palm Synchronizers, wireless tablets, etc., for delivering to mobile workers and Netizens numerous products and multilingual value-added Business-Web services on a home page as a one-stop point of access anywhere on a 24/7/365 basis.
Furthermore, the present invention can be used to deliver via wired and mobile/wireless platforms various products and multilingual value-added Business-Web services having such enablement and functions and features as price comparison, e-Wallet integration, Inter-Agent Communication with Negotiation Ability—Agent-to-Agent (A-to-A) contract-negotiation—real world simulation capabilities to multiple e-commerce segments, including Consumer-to-Business, Consumer-to-Consumer, and Business-to-Business auctions, Government-to-Business transactions, etc. These A-to-A commerce or A-commerce's activities will be constructed and activated on a Global Ensemble Marketplace Framework just-in-time in a dynamic fashion in combination with the use of either keyboard, mouse, and pointing device.
The terms and expressions which have been employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding equivalents of the features shown and described, or portions thereof, it being recognized that various modifications are possible within the scope of the invention claimed.
Claims
1. A method for real-time online search processing of selected types of information in inter-connected computer networks, the method comprises the steps of:
- a. assembling site descriptions for a plurality of sites in the inter-connected computer networks including for each of the plurality of sites: i. a URL for the site; ii. a search form URL for the site; iii. generalized rules of how the selected types of information on the site are organized; iv. sample data retrieved from the site corresponding to the selected types of information; and v. descriptions of domains found in the site;
- b. receiving a request for specified types of information from an online user;
- c. identifying from the site descriptions, sites which may have the specified types of information;
- d. constructing search requests for the specified types of information using the site descriptions for each of the identified sites;
- e. submitting the constructed search requests to the identified sites;
- f. receiving search results from the identified sites, and upon locating accurate matches in the received search results, extracting information corresponding to the specified types of information in a native language of the site, and displaying the extracted information to the user.
2. The method of claim 1 wherein the generalized rules include identifying characteristics that can accurately identify the occurrence of the selected types of information within the site.
3. The method of claim 1 wherein the submitting constructed search requests step is multi-threaded.
4. A method for assembling site information for use in real-time online search processing of selected types of information over inter-connected computer networks, the method comprises the steps of:
- g. collecting information from a plurality of sites over inter-connected computer networks including: i. a URL for each of the plurality of sites; ii. a search form URL for each of the plurality of sites; iii. sample pages containing the selected types of information retrieved from each of the plurality of sites; iv. positional information associated with the selected types of information within the sample pages; and
- h. from the collected information, deriving generalized rules about how the selected types of information on each of the plurality of sites are organized.
5. The method of claim 4 wherein the deriving generalized rules step includes the step of assembling identifying characteristics which can uniquely identify the occurrence of the selected types of information within each of the plurality of sites.
6. The method of claim 5 wherein the assembling identifying characteristics step includes the steps of
- i. identifying delimiter characters which bound the selected types of information in the sample pages;
- ii. retrieving further sample pages from the site; and
- iii. corroborating the identified delimiter characters against the retrieved further sample pages.
7. The method of claim 6, wherein the identifying delimiter characters step includes the steps of
- iv. from the sample pages, compiling a list of possible strings of delimiter characters for each of the selected types of information;
- v. comparing the list of strings of delimiter characters compiled for one of the sample pages with the list of strings of delimiter characters compiled for another of the sample pages, and revising the compiled lists of strings to be consistent for all compared sample pages; and
further wherein the corroborating step comprises the step of repeating the comparing step for the further sample pages retrieved contemporaneously from the site.
8-35. (canceled)
Type: Application
Filed: Jun 1, 2009
Publication Date: Oct 15, 2009
Inventor: Victor Hsieh (Alhambra, CA)
Application Number: 12/455,341
International Classification: G06F 17/30 (20060101);