Search engine having navigation path and orphan file features
A search engine (100) has a top down transversal algorithm (112) that distinguishes active objects of a website from orphan files depicted in graphs of HTML files of a graph database of the objects and their HTML relations. A collection building utility (120) assembles a batch collection of solely the active objects for retrieval by a search query, which prevents retrieval of an orphan file that would provide a website visitor with incorrect information.
A computer program listing, submitted at the end of the specification herein, implements a function: f_GetStaticNavPath.
FIELD OF THE INVENTION
The present invention relates to a search engine for assembling a batch collection of objects or nodes corresponding to URL objects of a website, and more particularly, to a search engine that prevents a search query from retrieving inactive objects or nodes.
BACKGROUND
A search engine enables a website visitor to search for an object or node of the website by using a search query, such as a keyword. The visitor inputs the search query at an appropriate location on the website, using the visitor's web browser on a workstation computer. In response, the search engine retrieves the object or node matching the query from a batch collection, and displays the object or node on a display device of the computer. The retrieved object or node is the equivalent of an object of the website having a URL (uniform resource locator), which is an address location for the object on the Internet.
The terminology “object” refers to a valid active object of the website that is retrieved by using a search path provided by the website, for example, by executing a series of computer commands, such as mouse clicks, on a series of hyperlinks that navigate to successive web pages until reaching the object. Further, the terminology “object” refers to an object that is in a batch collection assembled by a search engine. The terminology “object” is interchangeable with the terminology “node.” Further, the terminology “node” connotes a Hypertext Markup Language (HTML) node, i.e., object, in a hierarchical navigation path of HTML relations, as well as an object at a hierarchical end of a navigation path, i.e., a leaf node. The leaf node can be an HTML object or a file in another format, such as *.PDF, *.DOC, *.PPT, . . . (*.*). The terminology “navigation path” refers to all HTML hierarchical relations, or links, connecting the nodes along the navigation path.
A search engine has a web crawler that searches through the web directories of a URL website and organizes the objects or nodes as data files in a database. The search engine assembles the database files into a batch collection. The search engine makes the batch collection searchable by search queries. The advantage is that a visitor to the website can quickly retrieve a desired object by using a search query, which saves the visitor from the task of having to conduct a trial and error search on the website itself to find the object.
Prior to the invention, a search engine assembled a batch collection of objects without including their navigation paths, or links. An object that was retrieved from the batch collection was displayed on the visitor's computer display without a navigation path that the visitor could follow to verify the object as an active object of a website. Thus, a retrieved object that was an inactive object could display obsolete or otherwise incorrect information.
A valid active object is one that is included together with a starting node in a navigation path. A starting node is reachable by beginning with the home page. An inactive object is not reachable by conducting a search from the home page. Prior to the invention, a batch collection would contain an inactive object even when the equivalent inactive object was not retrievable by searching the website from the home page. Thus, a batch collection may have been assembled with one or more inactive objects, which are orphan files.
A search engine must be able to prevent retrieval of an orphan file that would provide a visitor with incorrect information. For example, an orphan file could show obsolete information or erroneous information pertaining to a product, or to a manufacturing drawing or to a manufacturing process, which the visitor would detrimentally rely upon.
Prior to the invention, a batch collection assembled by a search engine did not have the capability of identifying orphan files. Thus, an orphan file was capable of being retrieved from a batch collection assembled by a search engine, which could have provided a visitor with incorrect information. Further, orphan files could not be singled out as candidates for deletion from the database.
U.S. Pat. No. 6,144,962 discloses a graph database having files of URLs or objects, as nodes, and their links or navigation paths. The database files build node tree graphs, comprised of the nodes and their links or navigation paths that connect the nodes in a hierarchy. The graphs are mapped and are subjected to URL filtering features to find common website problems, such as links in need of repair and missing URLs.
The web crawler (102) searches according to the following process.
- 1. Search the HTML files of the web directory for the Hyperlink information and the Referenced Node Title in each character string of the form “<A . . . HREF=“Hyperlink information”>Referenced Node Title</A>”.
- 2. Translate the relative path of Hyperlink information and Referenced Node Title into an absolute one, i.e., a single path name. For example, translate “../../online/index.htm” in “/html/ECx/intro_promo/a.htm” into “/html/online/index.htm”.
- 3. Handle file names that contain space characters, which are changed to %20 in the URL.
- 4. Build graphs of the HTML hierarchical relations between each HTML parent object node and each referenced HTML child object node. The web crawler (102) builds the hierarchical relations as structural data depictions that extend between a parent node and each child object node. Thereby, the web crawler (102) defines the start HTML nodes, which are the obvious starting nodes in the root home page of the website.
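By way of non-limiting illustration, the four crawler steps above can be sketched in Python as follows. The class and function names (AnchorExtractor, extract_edges, build_graph) and the sample page are assumptions made for this sketch and do not appear in the specification. The sketch relies on standard relative-reference resolution, under which “../../online/index.htm” in “/html/ECx/intro_promo/a.htm” resolves to “/html/online/index.htm”, and it assumes that hyperlinks are not already percent-encoded and carry no query or fragment part.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class AnchorExtractor(HTMLParser):
    """Step 1: collect (href, title) pairs from <A HREF="...">title</A>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, title) pairs
        self._href = None    # href of the anchor currently being read
        self._text = []      # text fragments inside the current anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":       # HTMLParser lowercases tag and attribute names
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_edges(parent_path, html):
    """Return (absolute child path, title) pairs for one parent page."""
    parser = AnchorExtractor()
    parser.feed(html)
    edges = []
    for href, title in parser.links:
        absolute = urljoin(parent_path, href)    # step 2: relative -> absolute
        edges.append((quote(absolute), title))   # step 3: space -> %20
    return edges


def build_graph(pages):
    """Step 4: map each parent path to its referenced child nodes."""
    return {path: extract_edges(path, html) for path, html in pages.items()}


# Hypothetical sample page, used only to exercise the sketch.
pages = {
    "/html/ECx/intro_promo/a.htm":
        '<A HREF="../../online/index.htm">Online</A>'
        '<a href="my file.htm">My File</a>',
}
graph = build_graph(pages)
```

In this sketch the graph is a plain dictionary keyed by the parent path. A graph database as described in the specification would store the same parent and child relations persistently.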
The present invention relates to a method of assembling a collection of retrievable objects of a website, by distinguishing active objects of the website from orphan files depicted in graphs of graph database files having the objects and their HTML relations; and by assembling solely the active objects of the website in a batch collection for retrieval by a search query.
According to an embodiment of the invention, the method implements a recursive function on graphs built by a graph database, discovers the object hierarchy in a website, and distinguishes active objects from inactive objects of the website.
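By way of non-limiting illustration, a recursive top down visit that separates active objects from orphan files may be sketched in Python as follows. The function names, the dictionary representation of the graph, and the node names are assumptions made for this sketch. It is not the listing submitted with the specification.

```python
def mark_active(graph, node, active):
    """Recursively visit `node` and every node reachable from it."""
    if node in active:          # already visited; also guards against cycles
        return
    active.add(node)
    for child in graph.get(node, []):
        mark_active(graph, child, active)


def split_active_and_orphans(graph, start_nodes):
    """Return (active, orphans): nodes reachable from the start nodes,
    and every other node that appears anywhere in the graph."""
    active = set()
    for start in start_nodes:
        mark_active(graph, start, active)
    all_nodes = set(graph)
    for children in graph.values():
        all_nodes.update(children)
    return active, all_nodes - active


# Hypothetical graph: parent node -> list of referenced child nodes.
graph = {
    "HOME": ["S11"],
    "S11": ["N22"],
    "N22": ["L31", "HOME"],     # a link back to the home page forms a cycle
    "OLD": ["OLD_LEAF", "L31"], # nothing links to OLD, so it is an orphan
}
active, orphans = split_active_and_orphans(graph, ["HOME"])
```

Here OLD and OLD_LEAF are classified as orphan files because no chain of links from HOME reaches them, even though OLD itself links to the active node L31. For a very deep website, the recursion could be replaced by an explicit stack to stay within Python's recursion limit.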
According to a further embodiment of the invention, the method further builds the shortest navigation path of each of the active objects to a home page of the website, wherein the shortest navigation path excludes intervening nodes between the active objects and the home page.
According to a further embodiment of the invention the method further associates the shortest navigation path, as described above, with easy to understand information for retrieval together with a corresponding object matching a search query. The navigation path is easily understood and followed, which verifies that an object is in a navigation path with the home page.
The present invention further relates to a search engine that builds a graphical database of all HTML files and their HTML hierarchical relations, and that builds a graphical database collection of all nodes and their HTML hierarchical relations from the start node root categories in the web site, and that builds a collection of all HTML hierarchical relations in a graphical database.
Embodiments of the invention will now be described by way of example with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Following the operation of the top down transversal algorithm (112), a bottom up navigation path getting algorithm (116) implements the recursive function f_GetStaticNavPath, according to the computer program listing at the end of the specification herein, by visiting, i.e., scanning and parsing, the graphs created by the web crawler (102), beginning with the child nodes in the order determined by the order of the child nodes in the graphs. The bottom up navigation path getting algorithm (116) constructs a database of the shortest navigation paths from respective child nodes to one pseudo root node (202). No navigation paths will be constructed for orphan files previously distinguished from valid active nodes. The database of the shortest navigation paths will not have orphan files. Thus, the bottom up navigation path getting algorithm (116) constructs the shortest navigation path for each node that corresponds to a valid active object of the website, and the paths are stored in a storage device (118) labeled “shortest navigation path of objects.”
NavPath(to Child from pseudo root node)=NavPath(to one Parent that has a navigation path from the pseudo root node)+Link(from Parent to Child).
For example:
NavPath(S11)=A
NavPath(N22)=NavPath(S11)+B=A+B=AB
NavPath(L31)=NavPath(N22)+C=AB+C=ABC
The bottom up navigation path getting algorithm (116) stores the data for the shortest navigation path of objects to a pseudo node in a storage device (118). For example, the data include:
Node S11 with its shortest navigation path A to a pseudo node,
Node N22 with its shortest navigation path AB to a pseudo node,
Node L31 with its shortest navigation path ABC to a pseudo node.
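By way of non-limiting illustration, the recurrence NavPath(Child)=NavPath(Parent)+Link can be sketched in Python as follows. The sketch is hypothetical and is not the f_GetStaticNavPath listing: it first uses a breadth-first pass from the pseudo root node to pick, for each node, one parent on a fewest-links path, and then composes the path recursively from the child upward. The names shortest_parents, nav_path, ROOT and ORPHAN are assumptions, while S11, N22, L31 and the links A, B, C follow the example above.

```python
from collections import deque


def shortest_parents(graph, root):
    """Breadth-first pass: for each reachable node, record the (parent, link)
    through which it is first reached, i.e. along a fewest-links path."""
    parent_of = {root: None}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for link, child in graph.get(parent, []):
            if child not in parent_of:
                parent_of[child] = (parent, link)
                queue.append(child)
    return parent_of


def nav_path(node, parent_of):
    """NavPath(child) = NavPath(parent) + Link(parent to child)."""
    entry = parent_of.get(node)
    if entry is None:
        # The root has an empty path; an orphan has no path at all.
        return [] if node in parent_of else None
    parent, link = entry
    return nav_path(parent, parent_of) + [link]


# Graph from the example: parent -> list of (link, child) pairs.
graph = {
    "ROOT": [("A", "S11")],
    "S11": [("B", "N22")],
    "N22": [("C", "L31")],
    "ORPHAN": [("X", "L31")],   # hypothetical orphan, unreachable from ROOT
}
parent_of = shortest_parents(graph, "ROOT")
path_l31 = "".join(nav_path("L31", parent_of))
```

With this data, the path for S11 is the single link A, the path for L31 is composed as A, then B, then C, and no path is produced for ORPHAN, which is consistent with the statement that no navigation paths are constructed for orphan files.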
An advantage is that the shortest navigation path of each node to a pseudo root node (202) is defined without including intervening nodes. Further, with respect to those objects that have navigation paths originating from root category starting nodes, i.e., pseudo root nodes (202), the database includes information that indicates the objects are active objects of the website. Easy to understand information is generated to describe each of the shortest navigation paths. The easy to understand information is suggestive of corresponding objects represented by the nodes.
Further, for example, the easy to understand information comprises information labels that are identical to the hyperlink labels displayed by the website. The hyperlink labels identify the hyperlinks for receiving click-on commands to retrieve the objects. Further, the hyperlink labels are easily understood, and are suggestive of corresponding objects to be retrieved. Further, the hyperlink labels are HTML files in one of the web directories (104) and (106). The bottom up navigation path getting algorithm (116) retrieves the HTML hyperlink labels, i.e., the easy to understand information, and cross references them to the HTML nodes. The data is stored by the bottom up navigation path getting algorithm (116) in a storage device (118).
A collection building utility (120) of the search engine (100) retrieves objects and retrieves the shortest navigation paths from the storage device (118). The collection building utility (120) assembles object collections, together with their shortest navigation paths, and stores them in a storage device (122). The object collections exclude orphan files, and thereby exclude each object that has obsolete or otherwise incorrect information.
A search results reporting utility (124) generates a report of search results of one or more objects that match a search query submitted by a visitor to the website. Further, each object is reported together with its shortest navigation path, as determined by the combined operations of the top down transversal algorithm (112) and the bottom up navigation path getting algorithm (116).
Further, the search results reporting utility (124) reports the shortest navigation path as having easy to understand information. Further, the search results reporting utility (124) reports the shortest navigation path as an HTML navigation path without intervening HTML objects in the navigation path. Thus, a navigation path reported on the report is a direct navigation path to the home page pseudo root node (202). By performing a single mouse-click command on the shortest navigation path, the equivalent object of the website will be displayed on the visitor's computer display device. Thereby, the navigation path is easily followed to verify that the object included with the navigation path is a valid active object of the home page.
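By way of non-limiting illustration, the collection building and search results reporting described above can be sketched in Python as follows. The function names, the sample URLs, the hyperlink labels, the substring matching, and the “ > ” separator are all assumptions made for this sketch. The specification does not prescribe a storage layout, a matching rule, or a report format.

```python
def build_collection(objects, nav_paths):
    """Keep only objects that have a navigation path; orphans have none."""
    return {
        url: {"text": text, "nav_path": nav_paths[url]}
        for url, text in objects.items()
        if nav_paths.get(url) is not None
    }


def search(collection, keyword):
    """Return (url, readable navigation path) pairs for matching objects."""
    needle = keyword.lower()
    return [
        (url, " > ".join(entry["nav_path"]))
        for url, entry in collection.items()
        if needle in entry["text"].lower()
    ]


# Hypothetical crawled objects: URL -> searchable text.
objects = {
    "/products/widget.htm": "Widget data sheet",
    "/old/widget_v1.htm": "Widget data sheet (superseded)",
}
# Hypothetical shortest navigation paths, expressed as hyperlink labels.
# The superseded page has no entry because it is an orphan file.
nav_paths = {"/products/widget.htm": ["Products", "Widget"]}

collection = build_collection(objects, nav_paths)
results = search(collection, "widget")
```

Both sample pages match the keyword, but only the active page is reported, together with its label path “Products > Widget”, because the orphan file never enters the collection.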
Although the invention has been described in terms of exemplary embodiments, it is not limited thereto. Rather, the appended claims should be construed broadly, to include other variants and embodiments of the invention, which may be made by those skilled in the art without departing from the scope and range of equivalents of the invention.
Claims
1. A method of assembling a collection of retrievable URL objects of a website, comprising the steps of:
- distinguishing active objects of the website from orphan files depicted in graphs of HTML files of a graph database of the objects and their HTML relations; and
- assembling solely the active objects of the website in a batch collection for retrieval by a search query.
2. The method of claim 1, and further comprising the step of: implementing a recursive function top down on the graphs, and discovering the object hierarchy in a website, which hierarchy distinguishes the active objects from orphan files.
3. The method of claim 1, and further comprising the step of: making a shortest navigation path of each active object of the website to the home page of the website, wherein the shortest navigation path is retrievable together with a corresponding object that matches the search query.
4. The method of claim 1, and further comprising the step of: making a shortest navigation path of each active object of the website to the home page of the website, by implementing a recursive function bottom up on the graphs, wherein the shortest navigation path is retrievable together with a corresponding object that matches the search query.
5. The method of claim 1, and further comprising the steps of:
- making a shortest navigation path of each active object of the website to the home page of the website, and
- associating the shortest navigation path with easy to understand information for retrieval together with a corresponding object that matches the search query.
6. The method of claim 1, and further comprising the steps of:
- storing session values in response to the search query;
- obtaining a run time navigation path of an object that matches the session values, by implementing a recursive function top down and bottom up on the graphs for said object; and
- impressing the run time navigation path with the session values for retrieval in response to another search query for the session values.
7. A search engine, comprising:
- a web crawler that searches a website directory and builds graphs having URL objects of the website as nodes, and hierarchical relations between nodes as structural elements;
- a top down transversal algorithm distinguishing active URL objects on the graphs from orphan files on the graphs, and
- a collection building utility assembling a batch collection of solely the active URL objects for retrieval by a search query.
8. The search engine of claim 7 and further comprising: a bottom up navigation path getting algorithm building a shortest navigation path of each active object to a website home page.
9. The search engine of claim 7 and further comprising:
- a bottom up navigation path getting algorithm building a shortest navigation path of each active object to a website home page; and
- a search results reporting utility.
Type: Application
Filed: Aug 6, 2003
Publication Date: Feb 10, 2005
Inventors: Ching-Chung Chang (Hsinchu City), Frank Sung (Hsinchu), Cheng-hui Chiu (Hsinchu City)
Application Number: 10/636,936