Product Search Engine

Info

Publication number: 20140095463
Type: Application
Filed: Jun 5, 2013
Publication Date: Apr 3, 2014
Inventor: Derek Edwin Pappas (Palo Alto, CA)
Application Number: 13/911,049

Abstract

The present invention facilitates product searches on a personal computer, mobile or other device from remote sites via widget lookup using a computed image signature and optional product information extracted using a template in order to retrieve a list of the same or similar products available at other sites. The search starts with a widget lookup process, followed by the submission of the product image URL, optional product information extracted using a site specific product information template and information from HTML attributes to a server. The image signature is computed, a lookup based on the image signature and product information is executed and a product list with an image, price and link to each retailer where the product can be found is returned. The list is reduced based on the submitted image, optional product template and attribute information. The server sends the product list to the user's browser for display.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of U.S. Provisional Application No. 61/656,502, filed Jun. 6, 2012, by Derek Edwin Pappas and titled “Structured and Social Data Aggregator”, incorporated by reference herein and for which benefit of the priority date is hereby claimed.

FEDERALLY SPONSORED RESEARCH

Not applicable.

SEQUENCE LISTING OR PROGRAM

Not applicable.

FIELD OF INVENTION

The present invention relates to web search, image processing, on-line shopping and social networking. Specifically, it relates to techniques of web search and image processing that aid users who view, compare and buy products on-line, or who share their product findings and preferences via social networks.

BACKGROUND OF THE INVENTION

Currently, users search for products on retailer, manufacturer, shopping engine, social network and blog websites. When users find a product image that they like, they often want to know how much the product costs, where they can buy it and other details and attributes about the product. In addition, users may want the specifications for the product and to know the manufacturer, model number, and product name. Currently users cannot search for products by image at remote websites. Users can search by image for similar images using services such as Google image search, but Google image search does not return the structured data associated with the image search results.

Shopping search engines do not de-duplicate or normalize the shopping data for all products. Typically the same image will appear in many different records in a shopping search engine result.

Socially curated image sites which are used for curating images from other sites typically can capture the image and the page title. However, the meta-data (i.e., the structured data record in the page which was generated from a database) is not extracted automatically. The image could have been copied to a blog by another user.

Users can save a product image found on the 3rd party website, go to Google image search, upload the image and search for it. Google image search will find the images that contain the same features (i.e., are similar in terms of shape and structure). Google image search will also return images that contain keywords which match the keywords associated with the matching images. A list of images is presented in the search results. The list will contain the original image, images that are from the same manufacturer and other images. The user then needs to click on each image and visit each site to see the price and the product information. The user then needs to note the product information on each of the different product pages in order to compare the different offers for the same product. The Google search results do not include the product information or any related products.

The Google Data Highlighter tool performs structured data extraction using a template made by the user (web page owner). The user tags the data field values with data field names on the web page using the tool. The Data Highlighter then finds the other pages on the site that contain the same HTML markup and structure as the tagged page. The extracted data is then presented as a rich snippet in the search results. Currently, the Data Highlighter can extract only events-related data records, which contain a time, date, place and person. The identification of semantic information is the most difficult part of structured data extraction from web pages.

“Intelligent data search engine”, U.S. Pat. No. 8,190,556, automatically identifies pages with similar structure from the same site, finds the intersection between the page structures (i.e., the XPATH and semantic type), automatically generates an extraction template, crawls each page on the site and checks whether the page matches the structure of the template. If there is a match, the structured data on the page is extracted and stored in a data store.

Pinterest is a social information catalog that is curated by users. Users navigate to pages on remote sites which contain images and then press the “Pin it!” button embedded in the web page or use the “Pin it” bookmarklet to upload or add a pin on the Pinterest web site. A set of images from the web page appears and the user clicks on one of the images, adds a description, selects an existing pin board or creates a new pin board and then presses submit. The image, page title and user description are added to the user's Pinterest pin board. Currently Pinterest does not support product information extraction via templates nor do they allow the user to perform a remote product information search via their widget. Pinterest does identify URL's that belong to stores, looks up the price in a database that is created from a retailer data feed (not extracted) and displays the price in the page.

TheFind is a conventional shopping engine. A user searches for a product by brand, store or category, or can use a limited set of specifications to narrow down the search results. The search results are presented to the user. The results often contain duplicate products from different stores and from the same store. The results do not group the stores that contain the same product.

Currently neither Google nor Pinterest extracts product information from a single product page using templates. Normally, TheFind and other shopping engines do not de-duplicate or group the same product together and display a canonical record for the product. Moreover, the shopping engines do not present all of the data from all of the stores on the Internet. Shopping engines use inverted indexes generated by Apache Solr or internal tools that index the fields in the product records. Based on the search results presented to the user, limited attempts are made to group the same product from different stores.

SUMMARY OF THE INVENTION

In accordance with the present invention, there is provided a method and system for implementing a shopping engine that users take with them when they browse the Internet, providing a centralized search service that is connected to content on remote sites and provides a remote lookup system for the Internet's products. The invention facilitates web search, image processing, on-line shopping and social networking. Specifically, the invention's unique methods of web search and image processing, when employed, aid users that view, compare and buy products on-line, or share their product findings and preferences via social networks.

The invention facilitates users' searches for products on retailer, manufacturer, shopping engine, social network, blog, and other types of websites. When users find a product they are interested in, the invention provides the information users want to know, such as product costs, where they can buy it, model numbers, product names, product specifications, the name of the manufacturer, and various other details.

Another aspect of the invention is that the lookup system provides the users viewing product information on product information sites with the following information: other stores that sell the same product; the best store to buy from for non-price reasons (i.e., support, store, warranty, returns, and customer service); the historical pricing for the product, similar products and aggregated information about social brand messages.

The product recognition process consists of the following eleven steps. First, execute a web browser program on a computer device with a screen, a microprocessor, volatile memory and persistent storage such as a hard disk drive or flash memory. Second, log into a first remote site incorporating the invention, which contains the web browser device, and install the web browser device. Third, navigate the Internet via the web browser to find a product page by searching, browsing or directly typing in a known URL.

In the fourth step, the advanced search method looks up the site URL and, if the site template exists, sends it from the server to the client browser. The template is created by the user in the current or a previous session on the same or a different page at the same site by selecting each data field value (DFV) in the HTML rendered web page in the web browser, associating the DFV with a data field name (DFN), and extracting the XPATH to the DFV. The web browser device then uses the XPATH's in the template to extract the DFV's from the current page and associate them with their respective DFN's. The product record DFN's include but are not limited to the manufacturer name (MN), model number (M#), retailer and manufacturer logos, product name (PN), product image, ratings, breadcrumb (product category), price, sales price, the rich attributes (specifications, colors, and features), and product identification codes such as Universal Product Codes (UPC's) and ISBN's. If a template does not exist, then the user is prompted to identify the parts of the page which are associated with each of the DFN's. The product information is extracted from different places in the HTML code of the product page using the XPATH's associated with each DFN/DFV. The places include the “alt” attribute of the “img” tag, the URL and the title. The product information is also extracted from the paragraphs, headings, breadcrumb and menu links and tables containing the product name, description, category, specifications, retailer name, etc. The second method includes clicking on the web browser device icon in the browser address bar at any remote site. The user then selects an image to send a message to the first remote site's image lookup server. The message contains the name of the remote site and the image URL.

Fifth, the first remote site's image lookup server downloads the image from the image URL. The lookup server automatically performs the image signature computation, which converts the image to a vector of numbers and creates the image signature. Sixth, the software converts the image signature into a list of product IDs: the image signature is sent for lookup in the image signature database via the product index, which finds a list of product records for the same or similar products with matching image signatures (a range check is performed to allow for image artifacts such as noise). Seventh, the list of product records is sent back to the user, who is waiting at the remote site. The displayed list of product records shows all stores where the user can buy the same or similar products. Eighth, the image signature and product information lookup results are combined to allow further refining of the combined search results by checking for the same or similar products in the combined list, using such checks as a range check on the price and similar categories. In the event that two similar products have the same signature, the product information is used to verify that the combined results contain the same product. If the user requested that similar products be returned, then a combined result including similar products is returned to the user. Ninth, the resulting product list is sent from the first remote site via a JSON file over the world wide web and displayed in the user's browser, which is executing on the user's client computing device. Tenth, the user selects retailer sites to visit by clicking on the links in the returned search results. And eleventh, the user is optionally allowed to add the product(s) in the search results in the web browser executing on the client computing device to the user's collection on the first remote site.
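
By way of illustration only, the range check of the sixth step and the price check of the eighth step can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names ImageRecord, within_range, lookup_products and refine_by_price, the per-component tolerance and the 25% price ratio are all hypothetical, and a linear scan stands in for the product index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical image record: a product ID and its signature vector."""
    product_id: str
    signature: tuple


def within_range(a, b, tolerance):
    """True if both vectors have equal length and every component differs by at most tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def lookup_products(query_signature, records, tolerance=2):
    """Return product IDs whose stored signature matches the query within the tolerance.

    The tolerance plays the role of the range check that allows for
    image artifacts such as noise. A real system would use an index
    rather than this linear scan.
    """
    return [r.product_id for r in records
            if within_range(query_signature, r.signature, tolerance)]


def refine_by_price(product_ids, prices, reference_price, max_ratio=0.25):
    """Keep candidates whose price lies within max_ratio of the reference price."""
    limit = reference_price * max_ratio
    return [pid for pid in product_ids
            if pid in prices and abs(prices[pid] - reference_price) <= limit]


# Example data (invented for illustration).
RECORDS = [
    ImageRecord("p1", (10, 20, 30)),
    ImageRecord("p2", (11, 19, 31)),   # same picture with slight noise
    ImageRecord("p3", (50, 60, 70)),   # a different product
]
PRICES = {"p1": 100.0, "p2": 180.0, "p3": 40.0}

candidates = lookup_products((10, 20, 30), RECORDS)
refined = refine_by_price(candidates, PRICES, reference_price=110.0)
```

Here the signature lookup admits both p1 and p2, and the price range check then separates them, which mirrors the use of product information to disambiguate products that share a signature.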

BRIEF DESCRIPTION OF THE DRAWINGS

A complete understanding of the present invention may be obtained by reference to the accompanying drawings, when considered in conjunction with the subsequent, detailed description, in which:

FIG. 1 is a block diagram of various functional components of a system.

FIG. 2 is a block diagram of various functional components of a system.

FIG. 3 is a block diagram of various functional components of a system.

FIG. 4 is a flow chart of the image signature computation.

FIG. 5A is a block diagram of a step of the example image signature computation.

FIG. 5B is a block diagram of a step of the example image signature computation.

FIG. 5C is a block diagram of a step of the example image signature computation.

FIG. 5D is a block diagram of a step of the example image signature computation.

FIG. 5E is a block diagram of a step of the example image signature computation.

FIG. 5F is a block diagram of a step of the example image signature computation.

FIG. 5G is a block diagram of a step of the example image signature computation.

FIG. 5H is a block diagram of a step of the example image signature computation.

FIG. 5I is a block diagram of a step of the example image signature computation.

FIG. 5J is a block diagram of a step of the example image signature computation.

FIG. 6A is a flow chart illustrating use of the web browser device.

FIG. 6B is a flow chart illustrating use of the web browser device.

FIG. 6C is a flow chart illustrating use of the web browser device.

FIG. 6D is a flow chart illustrating use of the web browser device.

FIG. 6E is a flow chart illustrating use of the web browser device.

FIG. 6F is a flow chart illustrating use of the web browser device.

FIG. 6G is a flow chart illustrating use of the web browser device.

FIG. 6H is a flow chart illustrating use of the web browser device.

FIG. 6I is a flow chart illustrating use of the web browser device.

FIG. 7 is a flow chart of product recognition based on an image.

FIG. 8 is a block diagram of an example computing system.

DETAILED DESCRIPTION

Before the invention is described in further detail, it is to be understood that the invention is not limited to the particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of the exemplary methods and materials are described herein.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.

All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, if dates of publication are provided, they may be different from the actual publication dates and may need to be confirmed independently.

Embodiments of the present invention include methods and apparatus for product searches on a personal computer, mobile or other device that provides means for a user to gather the information the user needs with minimal effort and in a straightforward way. One advantage is that users are able to bring the shopping engine with them when they browse the internet because the centralized service provides the databases for the Internet's products.

In some embodiments, methods invoke a browser extension in the form of a widget placed on the browser tool bar. A user can navigate on the Internet to the web browser device installation website using a browser. The web browser device installation website contains a web browser device. The web browser device is installed in the web browser tool bar. The user can then navigate to any remote site where desired information is to be searched for. The web browser device has two buttons: remote search (for images) and Advanced Search (for data records). When Advanced Search is clicked, the remote URL is sent to the template server. The template server looks up the root URL. If a template is found, then it is returned to the web browser along with a JavaScript extractor, which then extracts the data (product record) from the (template) page and sends it to the cleaner server, which performs additional extraction and cleaning. The extracted product record is then looked up in a product database, and all stores which carry the product, along with additional information, are returned to the browser, which displays the information in a popup or another browser tab. Additional information can also be sent to the popup, such as similar products or, in the case where the image signature results in a number of product matches from the same or different categories, the list of categories or canonical product records, from which the user then chooses to get a list of products.

If a template is not returned by the template server, then the user is prompted to complete the search form in the web browser device by right clicking on the data field value elements in the page as per the steps described above. The search process described above is performed and the products are looked up. When the remote search button is clicked, the web browser device analyses the page, creates a template describing the page, and sends the information to the template server.

FIG. 1 shows an abstract view of the system. The product information can be extracted from a remote product information site 101 by the automatic product information extraction 102 and the user generated template semi-automatic product information extraction 103. The extracted product information, which is normalized, grouped, de-duped and classified, is stored in the product data store 109. The product images, which are first processed in the image processing service 111, are stored in the image data store 110. The user can then perform a lookup using the web browser device 104. The lookup 108 queries the product data store and the image data store and, through the web services 105, returns results 114 displayed in the web browser 100. Advertisements stored in the ads database 107 can also be looked up 106. The advertisements that were returned from the lookup are displayed 113 in the browser. The social network 112 communicates with the web services 105 and contains records pointing to the remote product information sites 101.

FIG. 2 shows the system operation when the user presses the web browser device (bookmarklet or button or extension) 202 to extract visible or hidden data on the remote site web page. The user will register at the shopping engine or socially curated shopping site or search engine 201. The user installs the web browser device for product lookup. The user can then go to a remote third party site 208 generated by a remote web service 207 which contains products stored in a structured data format generated from a remote product or other structured data website database 205 and a remote web site template 206. The remote web page 209 contains the product record 210 and an image URL and/or image bytes 226.

Then the user clicks on the web browser device 202 containing JavaScript code, which can be an embedded button, widget, extension or toolbar button, in the browser 200. When the widget, extension or button is pressed, the product record 210 and the image URL 226 embedded in the web page 209 are extracted. The image 226 is downloaded 230, processed as described later in this description, and sent to the web service controller 214. When the user presses the web browser device 202, the JavaScript is executed by the browser 200. The web browser device JavaScript code 202 creates an HTML script tag 213 in the page which points to a server side script 236 that will be created on the web service server 203. The HTML script tag 213 passes a URL 204 from the address bar 235 to the server side script 236 as an argument. The web service server 203 will extract the root URL from the sent URL 204 and look up the extraction template(s) 216. The server side script 236 is created on the web service server 203 and contains the merged site extraction template(s) 216 for the root URL associated with the URL 204, the web browser device panel user interface code 218, and the JavaScript extractor 217. The modified HTML page 209 that contains the injected HTML script tag 237 is converted to the DOM representation 212 by the browser 200. The browser then executes the server side script 236, which creates the following elements in 213 in the browser: the web browser device panel 218 which appears in the product page tab, the JavaScript extractor 217, and the merged site extraction template(s) 216. If a template was retrieved, then the XPATH in each tuple is looked up in the DOM and the product record 219 is extracted and inserted into the web browser device panel UI 218 data fields. The extracted data field values will be highlighted in the web page and tagged with the corresponding data field name.
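
For illustration, the root URL lookup and the XPATH-driven extraction described above can be sketched in a few lines. This is a hedged sketch rather than the JavaScript extractor 217 itself: it runs outside the browser, uses the limited XPath subset of Python's standard xml.etree.ElementTree module on a well-formed page, and the names root_url, extract_record, PAGE and TEMPLATE, together with the sample page, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def root_url(url):
    """Reduce a page URL to scheme and host, the key assumed for the template lookup."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def extract_record(page_xml, template):
    """Apply each (data field name, XPATH) pair to the page and collect the values found.

    Fields whose XPATH matches nothing are left out, which corresponds to
    the case where the user must select the missing value by hand.
    """
    root = ET.fromstring(page_xml)
    record = {}
    for field_name, path in template:
        node = root.find(path)
        if node is not None and node.text:
            record[field_name] = node.text.strip()
    return record


# A well-formed sample page (invented); real HTML would need an HTML parser.
PAGE = """<html><body>
<h1 id="title">Acme Kettle K-100</h1>
<div class="offer"><span class="price">$24.99</span></div>
</body></html>"""

# Hypothetical template: data field names paired with XPATH expressions.
TEMPLATE = [
    ("product_name", ".//h1[@id='title']"),
    ("price", ".//div[@class='offer']/span[@class='price']"),
    ("model_number", ".//span[@class='model']"),  # not present on this page
]

record = extract_record(PAGE, TEMPLATE)
site_key = root_url("https://shop.example.com/kettles/k-100?ref=1")
```

The model number is absent from the sample page, so it is simply missing from the extracted record and would be left for the user to supply in the panel.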

If no template was returned by the web service server 203, or the page 209 has changed, or there is missing information in the page, then the user selects that product information in the web page. The information that the user selects in the web page is checked for semantic errors, string-too-long errors and other types of errors, and the data is cleaned by the product record checker/cleaner 225. After the user selects the product information in the web page and populates the panel, the user presses the panel submit button 221. The web browser device then sends the submission container 220, with the submitted product record 224, the new extraction template 222 which contains the list of tuples (data field name, data field value, XPATH, semantic type), and the URL 226, in POST key/value form to the web service controller 214. If the user selected a price alert option for the product in the web browser device panel, then the set price alert message is sent to the price alert and history server, which then stores the price alert in the price history database.

The user can press the find button 223 to search for products in the product database 241 and in the image database 227. The selected product record 224 is processed in the image/data processing pipeline 240. The index 242 is generated from the image and the product database. The lookup 243 will generate the search results 238 sent to the web service controller 214. The browser 200 will display the search results 238 containing the list of stores with prices 211 and product list 238 with the product that can be selected 232. When the user selects a product 232 from the search results 238 the selected product is looked up 231.

The web service will send a new template record which contains the URL of the page 204, the new extraction template 222, if the user created one, from the web browser device and submitted product record 224 from the web page to the product record cleaner. The cleaner will clean the product record and send a cleaned product record.

The web service performs the following operations: (1) the server generates a unique identifier by hashing the product page URL 204 to a 256-bit unique identifier in the web service 214; (2) the web service sends the unique identifier and the user collection identifier to the user database 228; and (3) the server sends the unique identifier and the extraction template in JSON form 222 to the extraction template database 215. Templates from the template database are checked by the template checker 233. The template widget stat server 234 communicates with the template database 215. The XPATH and the semantic type are used to extract data field values from pages on the site and associate them with data field names. Pages on the site are constructed from the same remote template 206. The new extraction template 222 contains the list of tuples (a tuple consists of the following: data field name, data field value, XPATH, semantic type).
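
As an illustration of the identifier and template storage operations above, the sketch below hashes a page URL to a 256-bit value and serializes a list of template tuples to JSON. The text does not name a hash function, so the use of SHA-256 is an assumption, as are the function names url_identifier and template_to_json and the JSON key names.

```python
import hashlib
import json


def url_identifier(url):
    """Hash a page URL to a 256-bit identifier, returned as 64 hex characters.

    SHA-256 is an assumed choice; any 256-bit hash would fit the description.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def template_to_json(tuples):
    """Serialize (data field name, data field value, XPATH, semantic type) tuples to JSON."""
    keys = ("data_field_name", "data_field_value", "xpath", "semantic_type")
    return json.dumps([dict(zip(keys, t)) for t in tuples], sort_keys=True)


# Example data (invented for illustration).
page_url = "https://shop.example.com/kettles/k-100"
template = [
    ("product_name", "Acme Kettle K-100", ".//h1[@id='title']", "text"),
    ("price", "$24.99", ".//span[@class='price']", "currency"),
]

identifier = url_identifier(page_url)
template_json = template_to_json(template)
```

Because the identifier is derived only from the URL, submitting the same page twice yields the same key, so the template database can overwrite or merge the earlier entry rather than store a duplicate.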

If the user submitted a product using a web browser device, the user and others can see the selected data record that was inserted into the collection specified by a collection id on their profile page on the socially curated website. Periodically a job is run to generate a new index 242 from the product database 241 to make it easier to search for the products in user collections.

Search engines index words and phrases. Attempts to extract structured data in web pages have been made by search engines using special markup in the web pages such as RDF, GoodRelations, microformats and rich snippets. The web designer inserts the industry standard structured data formats into the web page to create data records in the web pages. The search engine crawls the site and examines the web pages for the presence of industry standard structured data formats. The industry standard structured data formats identify the data field values using a set of data field names. A method for extraction of structured data from a page containing a visible and invisible data record at a site using an identifiable invisible data and layout format is shown when a web browser device button is pressed on the web page. The data record is located in a set of HTML tag(s) with corresponding data field names. An aspect of the present invention provides that a third-party predefined set of data field names is used to enclose the data field values on the page. Third-party data field names are placed in attributes next to the data field values in the HTML tags.

Turning now to FIG. 3, the product record information in the online store database 324 at the affiliate marketing FTP website 325 is accessed by the FTP downloader 326, which fetches the product record data feed 327. The downloaded product records are then sent to the data processing pipeline. A product information web site 302 is connected to the remote web service 329, which reads the remote template(s) 328 containing the data field name variables and the remote online store database 324 to generate the online store site 302. The page downloader or crawler 306 reads a list of sites or pages from the online store URL list 305 and downloads the product pages 307.

The downloaded pages are then used in conjunction with the selected corresponding site template 336 from template database 303 by the automatic extractor 308 which extracts the product records from all pages matching the site template. A site may have more than one site template. The product pages are processed by the automatic extractor which sends the root URL of each page that it is processing to the extraction template database 303 and retrieves the web browser device extraction template. The web browser device extraction template is converted to an automatic extraction template. The automatic extractor extracts the structured data record from each product information page using the automatic extraction template and creates a product record 309.

The affiliate downloaded product records 327 and automatically extracted product records 309 are each read by the cleaner 310. The cleaner analyses each downloaded product record and produces a cleaned product record 311. The cleaner moves data field values and partial data field values from one data field to another, removes extraneous text, verifies the correctness of the data field values, calculates statistics on the number of good/bad data field values using semantic checking, and stores the statistics in the product record. Cleaned product records are then classified by the product classifier 312. The product classifier matches data records to one or more product classification tuples from the product classification tuple list using words from the data record which are product classification base or synonym words. The classified data records 313 are normalized and grouped by the normalizer 314. The normalizer will de-duplicate the product record stream, group records together which are the same record found at different sources (e.g. stores, shopping engines, socially curated sites, blogs, and manufacturer sites), and refine the classification of a group of the same product records from different sources using methods such as voting. Further normalization steps can also be performed. The automatic extraction, cleaner, product classifier, normalizer and grouper stages communicate with the dictionary database 304. The dictionary looks up token(s) and returns semantic type information. Synonyms are converted to base words. The dictionary information is used by each pipe stage to process the data record. The resulting cleaned, classified and normalized product records 315 are saved 316 in the affiliate product database 319 or in the extracted product database 318, depending on the source of the product record.
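
The de-duplication, grouping and voting performed by the normalizer can be illustrated with a small sketch. It rests on assumptions that the text does not state: records are grouped by a lower-cased (manufacturer, model number) key, a duplicate is a second record with the same key from the same source, and the group's category is chosen by simple majority. The names group_key and normalize and the record layout are hypothetical.

```python
from collections import Counter, defaultdict


def group_key(record):
    """Assumed identity of a product: lower-cased manufacturer and model number."""
    return (record["manufacturer"].strip().lower(),
            record["model_number"].strip().lower())


def normalize(records):
    """De-duplicate records, group them by product, and vote on each group's category."""
    groups = defaultdict(list)
    seen = set()
    for rec in records:
        identity = (group_key(rec), rec["source"])
        if identity in seen:          # same product from the same source: drop it
            continue
        seen.add(identity)
        groups[group_key(rec)].append(rec)

    result = {}
    for key, members in groups.items():
        votes = Counter(m["category"] for m in members)
        category, _count = votes.most_common(1)[0]   # majority vote
        result[key] = {
            "category": category,
            "sources": [m["source"] for m in members],
        }
    return result


# Example records (invented for illustration).
RECORDS = [
    {"manufacturer": "Acme", "model_number": "K-100", "source": "store-a", "category": "kettles"},
    {"manufacturer": "ACME ", "model_number": "k-100", "source": "store-a", "category": "kettles"},
    {"manufacturer": "Acme", "model_number": "K-100", "source": "store-b", "category": "kettles"},
    {"manufacturer": "acme", "model_number": "K-100", "source": "blog-c", "category": "appliances"},
]

grouped = normalize(RECORDS)
```

In this example the second store-a record is dropped as a duplicate, the three remaining sources form one group, and the outlying "appliances" classification from the blog is outvoted.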

The user runs the web browser device 345 in a web browser 300 and creates a new extraction template 333 and a product record 331 from a product information web page 334 which is inserted into the extraction template database 303. The web browser device new extraction template is converted to an automatic structured data extraction process template which is used to do the structured data extraction 308 of all pages matching the page layout at the site that the web browser device extraction template was created from. All pages are downloaded from the site. Each web page from the same site is tested to see if it matches the structured data extraction template(s). If there is a match the data record is extracted from the matching pages. The extracted record is cleaned, classified, normalized, and stored in a database or index.

The image URL from the cleaned product record 311 is used to download the image 337. The downloaded image 338 is then processed in the image signature computation flow 339 and the image record 340 is generated. The image record contains the product id 341, the image URL 342 and the computed image signature 343. The image records are stored in the image database 344. The web browser device extracted product records database 317, the extracted product database 318, the affiliate product database 319 and the image database 344 are merged by the database merger 320, and a merged and normalized product database 321 is created. The merged product database is then indexed by the indexer 322 and an index 323 is created.

The user 348 can optionally search for a product using the web browser device 345. The web browser device's panel 330 sends the product image (image URL and/or the image bytes) 342 and/or the product record 347 from the current web page 334 to the web service 332. The web service queries the index. The product search index 323 is looked up 301 and the search results are returned 349. The product search result is displayed in the browser 300. The user can then select a specific product by clicking the URL, navigating to the remote URL and then viewing the remote product information. The advantage of this aspect of the invention is that the user can search for product information on remote product information web sites without leaving the product information web page, i.e. the user does not have to cut information from the product page and paste it into the search box at Google and/or a shopping search engine.

The user or a previous user identifies the data field values (DFVs) on the web page and associates each DFV with a data field name (DFN); these associations are converted into an extraction template. If the template server contains a template for the web page, the template is downloaded to the browser. The template contains downloaded JavaScript used to extract the data record from the HTML page and send the information to the template server. If the template server does not contain the extraction template for the web page, then the user will be prompted to specify the DFVs. The DFVs will be used in the product record search on the server. In either case, after the information in the page is extracted to the web browser device panel, the user presses search and the product server looks up the product record information and returns the list of stores that carry the item and their prices. Additional information can be returned as well, such as specifications, other rich attributes and similar products.

FIG. 4 describes the image signature computation flow 402. An image 401 is transmitted to the image processing service and prepared for processing. Image preprocessing 403 scales the image to a predefined size (the scaled image) and creates a gray-scale copy of the scaled image. Certain parts of the algorithm use the gray-scale model. The background type (solid color, gradient or transparent) is detected in step 404. The filter selection 405 for the object boundary detection is determined by the background type. If the background is transparent, a transparent pixel is part of the background and any other pixel is part of the object. In the case of a solid background, the edge between the background and the object is detected. If the background is a gradient, the edge between the gradient and the object is detected. Various industry standard edge detection algorithms can be used to detect the boundary (minimum bounding box). Then, the binary search lookup 406 is performed along the rays to define the intersection between the background and the edge of the object. Using the bounding box, the image in the original color space and the gray-scale image are then cropped to the bounding box edges and prepared for further processing 407.
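For the solid background case, the bounding box and the crop can be illustrated with a minimal sketch. It scans every pixel of a gray-scale image held as a list of rows, which is a simplification and not the ray based binary search lookup 406; the tolerance parameter and the function names are assumptions.

```python
def bounding_box(gray, background, tolerance=10):
    """Return (left, top, right, bottom), inclusive, of the non-background pixels.

    `gray` is a non-empty list of rows of gray values. A pixel belongs to the
    object when it differs from the background value by more than `tolerance`.
    Returns None when the whole image is background.
    """
    def is_object(value):
        return abs(value - background) > tolerance

    rows = [y for y, row in enumerate(gray) if any(is_object(p) for p in row)]
    if not rows:
        return None
    width = len(gray[0])
    cols = [x for x in range(width) if any(is_object(row[x]) for row in gray)]
    return (cols[0], rows[0], cols[-1], rows[-1])


def crop(image, box):
    """Crop a list-of-rows image to an inclusive bounding box."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

The same box can be applied to both the gray-scale image and the image in the original color space, so that the two crops stay aligned.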

The external image signature creation 408 projects a line from each corner of the gray-scale cropped image, angled at 45°, toward the minimum bounding box which the line intersects. Rays bisecting the image edges are projected and their intersection with the minimum bounding box is detected using the same method as for the 45° lines. Traversal lines from other characteristic points on the edges run perpendicular to the edge they start on. Then, the first intersection with the object on each traversal line, starting from the line origin, is found. Next, the lengths from the line origin at the edge of the image to the object intersection on each line are found. The x and y coordinates of the intersection point are equal for 45° lines. A single value (x or y) is used in the image signature for each of the 45° and 90° lines in the implementation. The number of lines can be increased for accuracy.
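The ray lengths that make up the external signature can be sketched as follows. The sketch assumes the object has already been separated from the background into a boolean mask (True for object pixels) and uses four corner rays at 45° plus four rays bisecting the edges; the function names and the step-count representation are illustrative assumptions.

```python
def ray_length(mask, start, step):
    """Count steps from `start` along `step` until the first object pixel.

    Returns None if the ray leaves the image without meeting the object.
    """
    height, width = len(mask), len(mask[0])
    x, y = start
    dx, dy = step
    steps = 0
    while 0 <= x < width and 0 <= y < height:
        if mask[y][x]:
            return steps
        x, y, steps = x + dx, y + dy, steps + 1
    return None


def external_signature(mask):
    """Return eight ray lengths: four 45-degree corner rays, four edge bisectors."""
    height, width = len(mask), len(mask[0])
    right, bottom = width - 1, height - 1
    rays = [
        # 45-degree rays from the four corners, pointing inward
        ((0, 0), (1, 1)), ((right, 0), (-1, 1)),
        ((right, bottom), (-1, -1)), ((0, bottom), (1, -1)),
        # rays bisecting the four edges, perpendicular to each edge
        ((width // 2, 0), (0, 1)), ((right, height // 2), (-1, 0)),
        ((width // 2, bottom), (0, -1)), ((0, height // 2), (1, 0)),
    ]
    return [ray_length(mask, start, step) for start, step in rays]
```

For the diagonal rays the step count equals both the x and the y offset of the intersection, which is why a single value per line suffices.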

The first phase in the internal image signature computation 409 takes eight traversal lines starting from the center of the image (cropped, in the original color space) in eight directions. The first line is perpendicular to and directed toward the top edge, and each subsequent line is angled at 45° to the previous one in the clockwise direction. Then the first color change with a large difference in intensity (exceeding a certain threshold) is found on each traversal line, along the line direction. The next step is to calculate the length from the line start to the color change point on each line.
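A minimal sketch of the eight-ray traversal follows. It works on a single intensity channel rather than the full color image, compares each pixel with its predecessor on the ray, and uses an arbitrary default threshold; all three choices are assumptions made for illustration.

```python
# Clockwise from "up"; image y coordinates grow downward.
DIRECTIONS = [(0, -1), (1, -1), (1, 0), (1, 1),
              (0, 1), (-1, 1), (-1, 0), (-1, -1)]


def internal_signature(intensity, threshold=50):
    """Return, per direction, the steps from the center to the first large change.

    `intensity` is a list of rows of intensity values. An entry is None when
    no change above `threshold` is found before the ray leaves the image.
    """
    height, width = len(intensity), len(intensity[0])
    center_x, center_y = width // 2, height // 2
    lengths = []
    for dx, dy in DIRECTIONS:
        x, y, steps = center_x, center_y, 0
        found = None
        while 0 <= x + dx < width and 0 <= y + dy < height:
            steps += 1
            if abs(intensity[y + dy][x + dx] - intensity[y][x]) > threshold:
                found = steps
                break
            x, y = x + dx, y + dy
        lengths.append(found)
    return lengths
```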

Color histogram 410: in the cropped image in the original color space, several pixel samples are taken in characteristic positions relative to the image. Then, color value intervals of equal length are made for each sample and the occurrences of values from each interval are counted. The following is an example of the color histogram. For each pixel, the RGB pixel values are converted to the LUV color space.

R   G   B   L   U   V
FF  10  20  6A  D7  AD

Then the occurrences of each (L, U, V) value in each set of 3 lines are counted,

L   U   V   OCC
0   0   0   152
C8  5B  8A  295
8A  5B  8A  198
60  48  2A  90
3C  5A  70  65
6A  D7  AD  170

and finally the first three colors by occurrence are selected, in order of occurrence.

L   U   V   OCC
C8  5B  8A  295
8A  5B  8A  198
6A  D7  AD  170

Next, a table is made containing the top three colors by occurrence for all four sampling directions. That makes the internal color signature.
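The selection of the top three colors can be reproduced with a short sketch. It assumes the samples have already been converted to (L, U, V) triples, and it rebuilds the sample list from the occurrence counts in the example table above; the function names and the direction labels are hypothetical.

```python
from collections import Counter


def top_colors(samples, count=3):
    """Return the `count` most frequent (L, U, V) triples with their counts."""
    return Counter(samples).most_common(count)


def color_signature(samples_by_direction, count=3):
    """Build the table of top colors for each sampling direction."""
    return {direction: top_colors(samples, count)
            for direction, samples in samples_by_direction.items()}


# Occurrence counts taken from the example table in the description.
EXAMPLE_OCCURRENCES = {
    (0x00, 0x00, 0x00): 152,
    (0xC8, 0x5B, 0x8A): 295,
    (0x8A, 0x5B, 0x8A): 198,
    (0x60, 0x48, 0x2A): 90,
    (0x3C, 0x5A, 0x70): 65,
    (0x6A, 0xD7, 0xAD): 170,
}
EXAMPLE_SAMPLES = [color
                   for color, occurrences in EXAMPLE_OCCURRENCES.items()
                   for _ in range(occurrences)]
```

Calling top_colors(EXAMPLE_SAMPLES) yields the three rows of the reduced table, in order of occurrence.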

The external image signature 408, the internal image signature 409 and the color histogram 410, along with the bounding box dimensions, are passed to the image signature generator 411, which produces the image signature 412. The image signature can also be computed using traditional feature detection algorithms, such as BRISK. Persons skilled in the art of image feature detection are familiar with the BRISK algorithm and its computational efficiency. BRISK was created to match images with a high level of detail and uses a configurable (but large by default) number of keypoints in the comparison. Product images with a lower level of detail can therefore yield a lower number of keypoints needed for comparison, and an almost proportionally lower computation time. Another performance enhancement may be made by using only the image part within the cropped and scaled image 407. The scale-space calculation phase of the BRISK algorithm can then be omitted, as the scale dimension is invariant.

Consequently, product detection in images, besides being performed by the proposed image signature algorithm, can also be done by other familiar algorithms in the field, in conjunction with or as a replacement for the image signature algorithm, under conditions that allow more efficient use of those algorithms than in their regular use case. The image signature algorithm is shown in FIGS. 5A-5J. Shown are the original image 501 and the scaled, gray-scale image 502. After the background type is detected 503, the bounding box is found in the analyzed image 504. The analyzed image is cropped to the bounding box edges 505. 506 shows the scaled bounding box. Two image signatures are computed: the external image signature 507 and the internal image signature 508. 509 shows the ray color sample and 510 shows the color sample.

Turning now to FIGS. 6A-6I, the user navigates to a product information site web page containing the product image, product information and additional images 602, in browser 601. The previously installed web browser device is displayed in the browser address bar as an icon 603. When the user presses the icon, the panel with the images found on the product information web page is displayed 604. The user can then, in browser 605, select an image 606 to look up. The selected image is highlighted 607. Pressing the “Done” button sends the product information and the product image URL (optionally the image bytes are sent as well) to the web service. The web browser device icon is updated 608 to show the current status of the lookup 609. When the number of found results appears 610, the user can click 611 on the icon to see the lookup results. The lookup results list 613 in browser 612 can contain the same and/or similar products found on different store pages.

In the case where the product database is normalized and the same product from different retailers is grouped together, the user can be presented with a list of single products which might match the product and/or the image on the page that is being searched for. The user then selects one of the normalized products and is presented with a list of the stores that carry that single product. If the user is interested in similar products, the user can also indicate that they want to see similar products. This search is facilitated by preprocessing the products, grouping similar products by image characteristics and product classification, and grouping the same products by product record and image signature. Brands make products in certain categories, so it is possible to group different manufacturers' products by category.

This direct image search and product information lookup from remote shopping engine, retailer, manufacturer and other shopping related pages provides an efficient method for shoppers to find competing prices, additional product information, and other locations where the product can be purchased. The user has an option 614 to select a lookup result from the result list 615, and the store page containing the selected product is shown in a popup window 617 in the browser 616.

FIG. 7 represents the product identification by an image. Product record and image URL and/or image bytes 702 are sent from the remote product information web site 701 to the widget extraction flow 703. The image is then processed in the image processing service 704 which produces the image record 705 containing the computed image signature 706. Product and image records are stored in the product data store and image data store 707. Other social bookmarking widgets 708 can be used on the same remote sites to extract the image from the remote product information website 709 and save it on the social bookmarking website 710. Users that come to the social bookmarking website 710 can use the web browser device to perform a remote lookup 711 on the selected image. Image URL and/or image bytes 712 are used to run the lookup 713 which will query the image and the product record data store 707. The data store 707 will return the product information and/or images 714. The search results 715 can be displayed on the remote page 716 or the advertising platform 717 or the services for external social bookmarking 718 can be built.

The product recognition consists of the following steps.

First, logging into a first remote site which contains a web browser device and installing the web browser device.

Second, executing a web browser program on a computer device with a screen, a microprocessor, volatile memory and persistent storage such as a hard disk drive or flash memory, and navigating the Internet via the web browser to a second remote site to find a URL containing a single product page by searching, browsing or directly typing in a known URL. The URL is sent over a network connection to the Internet to a remote computer server, which contains a microprocessor, volatile memory and persistent storage such as a hard disk drive or flash memory, to retrieve the single product page (the web page contents, i.e. the HTML). The server sends the page back over the network connection and the HTML for the second remote site is rendered in the browser. The user can optionally indicate that similar products be included in the search results.

Third, pressing the web browser device button, pressing the “find” button in the web browser device panel, and selecting a product image from the multi-image view to look up.

Fourth, sending the product image signature and optional product information from the client computer to the first remote site's server.

Fifth, performing, on the first remote site's server, the image signature computation which produces an image signature.

Sixth, sending the image signature to the image signature lookup, which finds a list of product records for the same or similar products with matching image signatures (a range check is performed to allow for image artifacts such as noise).

Seventh, performing the product information lookup, which finds a list of product records matching the client side product information.

Eighth, optionally combining the image signature and product lookup results, and further refining the combined search results by checking for the same or similar products in the combined list, using such checks as a range check on the price and a check for similar categories. In the event that two similar products have the same signature, the product information is used to verify that the combined results contain the same product. If the user requested that similar products be returned, then a combined result including similar products is returned to the user.

Ninth, sending the resulting product list from the first remote site via a JSON file over the world wide web and displaying it in the user's browser, which is executing on their client computing device.

Tenth, optionally adding the product(s) in the search results in the web browser executing on the client computing device to the user's collection on the first remote site.

Eleventh, selecting, by the user, retailer sites to visit by clicking on the links in the returned search results.
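The signature lookup with a range check, followed by refinement using the product information, could be sketched as follows. The record layout, the tolerance of two units per signature value and the 25% price window are all illustrative assumptions; the description does not give concrete values.

```python
import json


def signatures_match(a, b, tolerance=2):
    """Range check: every signature value may differ by at most `tolerance`."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def lookup(query_signature, query_info, products, tolerance=2, price_range=0.25):
    """Return products whose signature and product information both fit the query.

    `products` is a list of dicts with hypothetical keys "name", "signature",
    "category", "price" and "url". Checks on `query_info` are skipped for
    fields the client did not supply.
    """
    results = []
    for product in products:
        if not signatures_match(query_signature, product["signature"], tolerance):
            continue
        category = query_info.get("category")
        if category and product["category"] != category:
            continue
        price = query_info.get("price")
        if price is not None and abs(product["price"] - price) > price_range * price:
            continue
        results.append(product)
    return sorted(results, key=lambda p: p["price"])


def to_json(results):
    """Serialize the product list for return to the browser."""
    return json.dumps(results)
```

Sorting by price is a presentation choice made here so that the returned list reads as a price comparison.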

The MN, M#, PN and UPC product information is extracted from different places in the HTML code of the product page. These places include the “alt” attribute of the “img” tag, the URL and the title. The product information is also extracted from the paragraphs, headings, breadcrumb and menu links, and tables containing the product name, description, category, specifications, retailer name, etc.
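Collecting candidate text from the title and the “alt” attributes can be sketched with the Python standard library HTML parser. The class and function names are hypothetical, and the sketch only gathers raw strings; identifying the manufacturer, model number or UPC within them is left to later stages.

```python
from html.parser import HTMLParser


class ProductInfoParser(HTMLParser):
    """Collect the page title and the alt text of every img tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.image_alts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.image_alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_product_info(html):
    """Return the title and image alt texts found in an HTML page."""
    parser = ProductInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "image_alts": parser.image_alts}
```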

The information extracted from product information site web pages is used to create clusters of different images of the same product. The textual information is used to find potentially similar product records. The images in the similar product records are then analyzed by the image processing service to join existing clusters and/or add products to clusters and/or create new clusters. Comparison of image signatures can thus be used in conjunction with limited, semi, and/or complete product record information to identify products in product information sites (i.e., manufacturer, retailer sites, blogs and social catalog).

Matching images on a product information site to a product record facilitates the serving of ads on the social catalog site, brand analytics on the social catalog site, and the conversion of links on the social catalog site to affiliate marketing links for commission based programs. When the user clicks on the link to the page at the original site that contains the image, a cookie is set on the user's computer, and if the user buys something at the site, the store pays a commission to the referring site. Additional advantages include adding meta-information about the product to the visible text on the page to give the viewer additional information about the product. Another advantage of the system is setting keywords in meta-tags and descriptions for search engines to index. Other SEO and SEM advantages of adding keywords to pages are not described here but are well understood in the Internet community.

Furthermore, the merging of structured data and social networking information greatly increases the accuracy of search results where qualitative results are desired. The probability of finding useful information in response to search keywords is significantly greater. Moreover, because the database contains more complete information, such as numeric attribute information which describe the database elements (e.g., the size of an object) and qualitative information (e.g., an expert's opinion of the durability of an object), searches can be conducted using general descriptions of the objects (e.g., search for a digital SLR which is within a certain dimension range and longevity) or searches can be conducted using the category, brand, store, and social rating of the former. Conventional search engines, by contrast, return results that require the user to manually validate, sort, and filter the search results. In the case of conventional search engines that return links based on popularity, the user must search through the list of links to find relevant web pages and manually search social networking services to find corresponding qualitative data.

With reference now to FIG. 8, portions of the technology are composed of computer-readable and computer-executable instructions that reside, for example, in or on computer-usable media of a computer system. That is, FIG. 8 illustrates one example of a type of computer that can be used to implement one embodiment of the present technology.

Although computer system 800 of FIG. 8 is an example of one embodiment, the present technology is well suited for operation on or with a number of different computer systems including general purpose networked computer systems, embedded computer systems, routers, switches, server devices, user devices, various intermediate devices/artifacts, standalone computer systems, mobile phones, personal data assistants, and the like.

In one embodiment, computer system 800 of FIG. 8 includes peripheral computer readable media 801 such as, for example, a floppy disk, a compact disc, and the like coupled thereto.

Computer system 800 of FIG. 8 also includes an address/data bus 810 for communicating information, and a processor 8091 coupled to bus 810 for processing information and instructions. In one embodiment, computer system 800 includes a multi-processor environment in which a plurality of processors 8092, 8093 are present. Conversely, computer system 800 is also well suited to having a single processor such as, for example, processor 8091. Processors 8091, 8092, 8093 may be any of various types of microprocessors. Computer system 800 also includes data storage features such as a computer usable volatile memory 806, e.g. random access memory (RAM), coupled to bus 810 for storing information and instructions for processors 8091, 8092 and 8093.

Computer system 800 also includes computer usable non-volatile memory 808, e.g. read only memory (ROM), coupled to bus 810 for storing static information and instructions for processors 8091, 8092, 8093. Also present in computer system 800 is a data storage unit 807 (e.g., a magnetic or optical disk and disk drive) coupled to bus 810 for storing information and instructions. Computer system 800 also includes an optional alpha-numeric input device 812 including alpha-numeric and function keys coupled to bus 810 for communicating information and command selections to processor 8091, 8092, 8093. Computer system 800 also includes an optional cursor control device 813 coupled to bus 810 for communicating user input information and command selections to processor 8091 or processors 8091, 8092, 8093. In one embodiment, an optional display device 811 is coupled to bus 810 for displaying information.

Referring still to FIG. 8, optional display device 811 of FIG. 8 may be a liquid crystal device, cathode ray tube, plasma display device or other display device suitable for creating graphic images and alpha-numeric characters recognizable to a user. Optional cursor control device 813 allows the computer user to dynamically signal the movement of a visible symbol (cursor) on a display screen of display device 811. Implementations of cursor control device 813 include a trackball, mouse, touch pad, joystick or special keys on alphanumeric input device 812 capable of signaling movement of a given direction or manner of displacement. Alternatively, in one embodiment, the cursor can be directed and/or activated via input from alpha-numeric input device 812 using special keys and key sequence commands or other means such as, for example, voice commands.

Computer system 800 also includes an I/O device 814 for coupling computer system 800 with external entities. In one embodiment, I/O device 814 is a modem for enabling wired or wireless communications between computer system 800 and an external network such as, but not limited to, the Internet. Referring still to FIG. 8, various other components are depicted for computer system 800. Specifically, when present, an operating system 802, applications 803, modules 804, and data 805 are shown as typically residing in one or some combination of computer usable volatile memory 806, e.g. random access memory (RAM), and data storage unit 807. However, in an alternate embodiment, operating system 802 may be stored in another location such as on a network or on a flash drive. Further, operating system 802 may be accessed from a remote location via, for example, a coupling to the internet. In one embodiment, the present technology is stored as an application 803 or module 804 in memory locations within RAM 806 and memory areas within data storage unit 807.

The present technology may be described in the general context of computer-executable instructions stored on computer readable medium that may be executed by a computer. However, one embodiment of the present technology may also utilize a distributed computing environment where tasks are performed remotely by devices linked through a communications network.

It should be further understood that the examples and embodiments pertaining to the systems and methods disclosed herein are not meant to limit the possible implementations of the present technology. Further, although the subject matter has been described in a language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the Claims.

Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, the invention is not considered limited to the example chosen for purposes of disclosure, and covers all changes and modifications which do not constitute departures from the true spirit and scope of this invention.

Claims

1. A method for extracting a data record from a web page, said method comprising:

a. accessing said web page with a web browser;

b. activating a web browser device in said web page;

c. associating an extraction template with a data record type on said web page;

d. extracting the data record associated with said data record type;

e. downloading an image associated with an image url;

f. creating an image signature for said image;

g. associating said image signature with said data record;

h. storing said image signature in a third data store;

i. storing the association between said image signature, said image and said data record in the fourth data store; and

j. storing said data record in a first data store wherein there is an association between a first data field name in said data record in said first data store and a second data field name in said extraction template in said second data store.

2. The method of claim 1 wherein said data record is a hidden data record.

3. The method of claim 1 wherein said data record is a visible data record on said web page, further comprising extracting said data record associated with said data record type by:

i. selecting a data field value on said web page;

ii. associating said first data field name with said data field value;

iii. displaying a visible rectangle around said data field value and displaying said first data field name;

iv. calculating an XPATH value of said data field value on said web page,

wherein said extraction template is created utilizing said first data field name and said XPATH value using said web browser device; and

v. storing said extraction template in said first data store.

4. The method of claim 1 further comprising automatically retrieving said extraction template for said web page.

5. The method of claim 2 further comprising storing said hidden data record in an industry standard format and associating said hidden data record with a hidden data record template and XPATH location of the hidden data record and is associated with a root URL for a web site associated with said web page.

6. The method of claim 1 further comprising automatically displaying said data record in said web browser device panel, accepting a description and a collection from a user, and submitting said data record, said description and said collection to said first data store.

7. The method of claim 2 further comprising checking validity of said extraction template by re-extracting a current data field value and comparing to said data field value and finding any data field names present in the web page which are missing in the extraction template or the hidden data record.

8. A method for displaying errors and missing template elements in a data record from a web page, said method comprising:

a. logging in as an administrator;

b. accessing said web page with a web browser;

c. activating a web browser device in said web page;

d. associating an extraction template with a data record type on the web page;

e. accessing the error report from the template error report server for the data record type in said web page;

f. highlighting the errors and missing elements in said web page;

g. highlighting data field name/data field value pairs in said web browser device panel that contain errors or are missing from the template or should not be in the template;

h. correcting the web page template errors by i. associating a data field name with said data field value or ii. removing said data field values;

i. creating an extraction template comprising said data field name and an XPATH value using said web browser device; and

j. storing said extraction template in a first data store;

k. extracting said data record associated with said data record type; and

l. storing said data record in a second data store wherein there is an association between said data field name in said data record in said first data store and a second data field name in said extraction template in said second data store.

9. The method of claim 1 further comprising computing the image signature using one of the following methods:

a. computing an image signature from a standard manufacturer image used by stores, i. using an external and internal image signature and color histogram, or ii. using an industry standard signature such as BRISK for the entire image;

b. computing an image signature from a random image which displays the product from different angles, using an industry standard signature such as BRISK for the entire image.

10. The method of claim 9 further comprising computing an external image signature by finding a minimum bounding box around a product in a manufacturer or retailer image by projecting rays from the edge of the image and finding the intersection of the ray with the edge of the product in the image.

11. The method of claim 10 further comprising computing an image signature by finding a minimal bounding box around a product in a manufacturer or retailer image using a binary search to find the closest point from the product object to the edge of each of the four sides of an image.

12. The method of claim 11 further comprising creating said image signature from points indicating the intersection between the rays and the minimum bounding box.

13. The method of claim 12 further comprising finding an internal image signature by finding the center of said minimal bounding box and projecting rays from the center to the edges terminating the rays at the boundary between two different colors/features.

14. The method of claim 10 further comprising accepting from a user an indication that said data field value is a constant wherein said constant becomes part of said extraction template or hidden data record template, and said constant is displayed in subsequent extraction processes.

15. The method of claim 1 further comprising storing said data field value with said data field name, said XPATH value and associating a root URL name in an extraction template in said first data store.

16. The method of claim 1 further comprising classifying said data field value using a product classifier and assigning a product classification to said data field value.

17. The method of claim 1 further comprising aggregating a plurality of said data field names and said data field values in said second data store into user defined collections.

18. The method of claim 8 further comprising associating a plurality of said extraction templates with a user for measuring the quality and quantity of extraction templates generated by said user.

19. The method of claim 1 further comprising allowing a second user accessing said web page from which the data record was extracted or said extraction template was created or retrieved to extract a current data field value from said web page.

20. The method of claim 1 further comprising extracting all of the elements of a list associated with said data field value using a repeating structured pattern associated with said data field name and said XPATH value.

21. The method of claim 1 further comprising selecting said data field value using a predefined extraction template retrieved from said first data store.

22. The method of claim 1 further comprising selecting said data field value extracted from the hidden data record.

23. The method of claim 1 further comprising selecting said data field value by searching for a predefined data field name on said web page.

24. The method of claim 1 further comprising converting said extraction template from said first data store into an automatic data extraction template to extract current data field values from all web pages at the root web site which matches said template.

25. The method of claim 1 further comprising converting said hidden record data template from said first data store into an automatic data extraction template to extract current data field values from all web pages at the root web site which matches said template.

26. The method of claim 1 further comprising cleaning said data field value, classifying said data field value, normalizing said data field value, storing said data field value and indexing said data field value.

27. The method of claim 1 further comprising adding date and purchase location information associated with said data field value to said second data store.

28. The method of claim 1 further comprising comparing a plurality of data field values from said second data store by a user in a social network or a shopping engine and storing the comparison for viewing by said user or other social network members.

29. A method for implementing a browser based information transmission method comprising:

a. extracting a data record from a web page;

b. adding said data record to a user profile on a social network; and

c. sharing said data record with a plurality of users wherein each of said users can comment, copy, compare, vote on, or access the web page.

30. The method of claim 29 further comprising combining said data record with a plurality of other extracted data records to form a collection.

31. The method of claim 29 further comprising storing said collection in a searchable index.

32. The method of claim 29 further comprising finding a product search result from a product image on a web page by

a. accessing the web page with a web browser;

b. activating a web browser device on the web page in a web browser;

c. automatically finding, with the image identifier associated with the web browser device, the images greater than a certain size;

d. showing the selected images in a pop up;

e. accepting from the user a selection of a single image in the pop up and a press of “done”;

f. extracting the image url/bytes from the web page;

g. transmitting the extracted image url/bytes to the web service controller;

h. querying an image data store and associating the image with a product search result;

i. returning the product search result from the web service controller to the web browser device;

j. displaying the product search result in the web browser device.

33. The method of claim 29 wherein the data record is a visible data record, further comprising: transmitting a root URL from the web browser device to a web service controller; associating the root URL with an extraction template; returning the extraction template from the web service controller to the web browser device; and extracting a product record from the web page using the extraction template.

34. The method of claim 29 wherein JavaScript is inserted into the web page such that when said user navigates to said web page said JavaScript is activated and identifies the hidden data records, product images or product words, and transmits the information to the web server, which looks up the information and returns product information about the image which includes the list of stores the product can be purchased at, similar brands, other products from the brand, other products from the store, and products from the same category, with affiliate links or pay-per-click links that when activated by a user result in the commission being paid to the end user and to the service provider.