METHOD AND SYSTEM FOR BUILDING ENTITY HIERARCHY FROM BIG DATA

Info

Publication number: 20140046653
Type: Application
Filed: Jan 31, 2013
Publication Date: Feb 13, 2014
Applicant: XURMO TECHNOLOGIES PVT. LTD. (BANGALORE)
Inventors: SRIDHAR GOPALAKRISHNAN (BANGALORE), SUJATHA RAVIPRASAD UPADHYAYA (BANGALORE)
Application Number: 13/755,069

Abstract

The various embodiments herein provide a method and a system for building an entity hierarchy. The method comprises extracting a plurality of entities from a big data, determining a parent entity by understanding a context in which the entity is used, resolving the entities by bringing the synonymous entities together and holding the polysemous entities apart based on a semantic context and a syntactic context, and building a hierarchical structure of entities using knowledge repositories, ontologies and language repositories along with natural language processing techniques. The method of extracting entities from the structured data comprises identifying each data point as an entity and identifying entities based on a relationship defined with other entities. The method of extracting entities from unstructured data includes a self-learning process and a training based learning process to learn new parent entities from domain specific documents using new entity recognition models.

Description

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims priority of Indian provisional application serial number 3286/CHE/2012 filed on Aug. 10, 2012, and that application is incorporated in its entirety at least by reference.

BACKGROUND

1. Technical Field

The embodiments herein generally relate to data mining and particularly relate to extracting and resolving entities from a large collection of data. The embodiments herein more particularly relate to a method and system for extracting entities from big data and building an entity hierarchy using language and domain models.

2. Description of the Related Art

Big data is a large collection of information which derives its data content from a plurality of structured, unstructured and semi-structured data sources. Big data requires a paradigm shift from the way data was looked at in the past. The data can no longer reside in pockets that do not talk to each other. It is imperative that all of it be considered as one and then processed. Recognizing the entities and the relationships they share is a first step toward understanding data. Entity extraction and entity type or parent entity recognition are the building blocks of analyzing big data. Therefore, it is imperative that entity extraction and recognition be done with the least manual intervention, and hence a self-learning procedure is required.

An entity is an atomic unit of data which has an independent self-explanatory meaning, and is also referred to as an object that makes independent sense. Entities can be named entities, unnamed entities or concepts, and include names of living and non-living things, concepts, theories or simply the language units that make independent sense. In a database context, entities and relationships help in structurally storing the contents of big data.

Entity extraction means processing data to identify, tag and properly account for those elements that are the names of persons, numbers, organizations, locations, and expressions such as a telephone number, among other items. An entity can consist of a single word or a bound sequence of words. The challenge of identifying entities is a tough one for several reasons, as many entities exist only in richly varied forms.

Much research has been conducted on finding and identifying entities in data. An existing system discusses the extraction of named entities only. Therefore, the current systems are limited to the relationships that exist between named entities and never consider the relationship between concepts or between a concept and a named entity. The existing literature does not suggest building an entity hierarchy but limits itself to entity extraction and resolution.

The existing data analysis and information extraction techniques are usually designed to target a particular media type and are not applicable to data generated by a different media type. For example, existing entity extraction techniques focus on textual data. Entities of interest, such as protein and gene names, chemical names and formulae, drug names etc., are automatically extracted from the textual part of a document.

The existing extraction tools merely identify and extract information based on pre-specified relations and relation-specific human-tagged examples. The existing literature does not refer to the self-learning capabilities of entity extractors. Further, the existing literature does not bring in domain ontologies and knowledge bases for semantic resolution in the context of entity extraction.

Accordingly, there is a need for an entity extraction method and system which is robust enough to identify new entities from big data. There is also a need for a method and system for categorizing entities in a hierarchical order to efficiently handle pattern query. Further there is also a need for a method and system for extracting entities from various data sources irrespective of the domain.

The above mentioned shortcomings, disadvantages and problems are addressed herein, as will be understood by reading and studying the following specification.

SUMMARY

The primary object of the embodiments herein is to provide a method and system for building an entity hierarchy from a collection of structured, unstructured and semi-structured data.

Another object of the embodiments herein is to provide a method and system for extracting a plurality of entities by analyzing big data.

Another object of the embodiments herein is to provide a method and system for facilitating an accurate and efficient pattern query relating to entities.

Another object of the embodiments herein is to provide a method and system for extracting named and unnamed entities from a collection of structured, semi-structured and unstructured data in a self-learning manner.

Another object of the embodiments herein is to provide a method and system for extracting entities and building entity hierarchy from extracted entities with least manual intervention.

Another object of the embodiments herein is to provide a method and system for extracting entities which is domain independent.

These and other objects and advantages of the present embodiments will become readily apparent from the following detailed description taken in conjunction with the accompanying drawings.

The various embodiments herein provide a method for building an entity hierarchy. The method comprises extracting a plurality of entities from a big data, determining its parent entity or the entity type by understanding a context in which the entity is used, resolving the entities by bringing the synonymous entities together and holding the polysemous entities apart based on a semantic context and a syntactic context and building a hierarchical structure of entities using knowledge repositories, ontologies and language repositories along with natural language processing techniques.

According to an embodiment herein, the big data comprises structured, semi structured and unstructured data.

According to an embodiment herein, each entity is associated with a parent entity.

According to an embodiment herein, the entity is at least one of named entities and unnamed entities. The named entities belong to one of the parent entities and include names of persons, organizations, locations, time expressions, quantities, monetary values and the like. The unnamed entities include nouns, verbs, combinations of nouns and verbs, a concept or a language unit with independent meaning. The unnamed entities belong to the parent entity concept; however, there can be a hierarchy among the various concept entities.

According to an embodiment herein, extracting the plurality of entities from the structured data comprises identifying each data point as an entity and identifying entities based on a relationship defined with other entities. The data point is at least one of a table entity, a value entity, an attribute entity, and a database entity.

According to an embodiment herein, extracting the plurality of entities from unstructured data comprises recognizing the named entities and the unnamed entities from data sources using a natural language processor based entity tagger, passing the named entities and unnamed entities through multiple entity recognition models, determining the parent entity and storing the entities along with the respective parent entity and context specific information in an entity store.

According to an embodiment herein, the entity extraction from unstructured data is a combination of a self-learning process and a training based learning process.

According to an embodiment herein, the self-learning entity extraction process comprises performing entity recognition by tagging entities without explicitly knowing the parent entity using a natural language processing technique, passing the tagged entities through trained Entity Recognition (ER) models to learn the parent entity associated with the tagged entity, detecting the parent entity using a voting procedure, and storing the entities whose parent entities are detected in the entity store.

According to an embodiment herein, the self-learning entity extraction process further comprises feeding the data containing the entities whose parent entities are not detected to a Natural Language Processor (NLP) based entity detector which involves parsing documents containing domain specific knowledge and learning the parent entities from the explicit or implicit facts stated in the documents, building new entity recognition models, passing the entities through multiple entity recognition models until the parent entity is obtained and populating the entity recognition models with new entity recognition models built by learning from samples containing new entities whose parent entities are identified in domain specific documents through the NLP based entity detectors.

According to an embodiment herein, the training based entity extraction process comprises passing the data containing the tagged entities through multiple trained entity recognition models, determining one or more parent entities associated with the entities, and recognizing the appropriate parent entity based on a voting procedure.

According to an embodiment herein, the training based entity extraction process further comprises providing additional training samples and documents that are tagged with new domain specific entities, and populating the training sample with new kinds of parent entities suggested by the NLP based entity detectors to build new entity recognition models.

According to an embodiment herein, the entity recognition models to detect the parent entity use at least one of a maximum entropy model (maxent), conditional random fields (CRF), classification and clustering techniques, and NLP based techniques.

According to an embodiment herein, resolving the plurality of entities comprises at least one of a word sense disambiguation technique, a contextual resolution technique, syntactic similarity, and semantic similarity.

According to an embodiment herein, the entity extraction from semi-structured data is a combination of extracting entities from structured data and unstructured data.

Embodiments herein further provide a system for building an entity hierarchy. The system comprises an entity extractor to extract a plurality of entities from a big data, a Language and Domain model to conceptualize the entities in accordance with a structured context or semi-structured context, an entity resolver to resolve the entities by gathering the synonymous entities together and polysemous entities apart based on syntactic and semantic contexts, and an entity hierarchy builder to build a hierarchical structure of entities using natural language processing techniques in conjunction with the Language and Domain models.

According to an embodiment herein, the entity extractor comprises an entity tagger to tag named entities and unnamed entities in a data source and a parent entity detector to determine assertions of parent entity in data sources. The entities are passed through multiple entity recognition models to determine the parent entity based on a voting procedure.

According to an embodiment herein, the entity recognition models to detect the parent entity use at least one of a maximum entropy model (maxent), conditional random fields (CRF), classification and clustering techniques and NLP based techniques.

According to an embodiment herein, the entity tagger is adapted to tag the named entities and the unnamed entities, and tag the named entities with explicit mention of the parent entity.

According to an embodiment herein, the entity resolver understands the context in which the entities are being used and determines the parent entity.

According to an embodiment herein, the entity resolver performs a contextual resolution using the Language and Domain models which comprise at least one of language repositories, domain ontologies, and knowledge repositories in combination with Natural Language Processing (NLP) techniques.

These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications can be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF THE DRAWINGS

The other objects, features and advantages will occur to those skilled in the art from the following description of the preferred embodiment and the accompanying drawings in which:

FIG. 1 illustrates a block diagram of a system for building entity hierarchy from big data, according to an embodiment of the present disclosure.

FIG. 2 is a block diagram illustrating a self-learning entity extraction process, according to an embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating a training based entity extraction process, according to an embodiment of the present disclosure.

FIG. 4 is a block diagram illustrating a process for resolving entities, according to an embodiment of the present disclosure.

FIG. 5 is a flow chart illustrating a method for building entity hierarchy, according to an embodiment of the present disclosure.

FIG. 6 is a flow chart illustrating a method for extracting entities from structured data sources, according to an embodiment of the present disclosure.

FIG. 7 is a flow chart illustrating a method for extracting entities from unstructured data sources through a self-learning entity extraction process, according to an embodiment of the present disclosure.

FIG. 8 is a flow chart illustrating a method for extracting entities from unstructured data sources through a training based entity extraction process, according to an embodiment of the present disclosure.

Although the specific features of the present embodiments are shown in some drawings and not in others, this is done for convenience only, as each feature can be combined with any or all of the other features in accordance with the present embodiments.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which the specific embodiments that can be practiced are shown by way of illustration. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that logical, mechanical and other changes can be made without departing from the scope of the embodiments. The following detailed description is therefore not to be taken in a limiting sense.

The various embodiments herein provide a method for building an entity hierarchy. The method comprises extracting a plurality of entities from a big data, determining a parent entity by understanding a context in which the entity is used, resolving the entities by bringing the synonymous entities together and holding the polysemous entities apart based on a semantic context and a syntactic context, and building a hierarchical structure of entities using knowledge repositories, ontologies, and language repositories along with natural language processing techniques. The big data comprises structured, semi-structured and unstructured data.

The entity is at least one of named entities and unnamed entities, where each entity is associated with a parent entity. The named entities belong to one of the parent entities and include names of persons, organizations, locations, time expressions, quantities, monetary values and the like. The unnamed entities include nouns, verbs, combinations of nouns and verbs, a concept or a language unit with independent meaning.

The plurality of entities are extracted from the structured data by identifying each data point as an entity and identifying entities based on a relationship defined with other entities. The data point is at least one of a table entity, a value entity, an attribute entity, and a database entity.

The plurality of entities are extracted from the unstructured data by recognizing the named entities and the unnamed entities from data sources using a natural language processor based entity tagger, passing the named entities and unnamed entities through multiple entity recognition models, determining the parent entity and storing the entities along with the respective parent entity and context specific information in an entity store.

The entity extraction process from unstructured data herein is a combination of a self-learning process and training based learning process.

The self-learning entity extraction process comprises performing entity recognition by tagging entities without explicitly knowing the parent entity using a natural language processing technique, passing the tagged entities through trained Entity Recognition (ER) models to learn the parent entity associated with the tagged entity, detecting the parent entity using a voting procedure, and storing the entities whose parent entities are detected in the entity store.

The self-learning entity extraction process further comprises feeding the data containing the entities whose parent entities are not detected to a Natural Language Processor (NLP) based entity detector which involves parsing documents containing domain specific knowledge and learning the parent entities from the explicit or implicit facts stated in the documents, building new entity recognition models, passing the entities through multiple entity recognition models until the parent entity is obtained and populating the entity recognition models with new entity recognition models built by learning from samples containing new entities whose parent entities are identified in domain specific documents through the NLP based entity detectors.

The training based entity extraction process comprises passing the data containing the tagged entities through multiple trained entity recognition models, determining one or more parent entities associated with the entities, and recognizing the appropriate parent entity based on a voting procedure.

The training based entity extraction process further comprises providing additional training samples and documents that are tagged with new domain specific entities, and populating the training sample with new kinds of parent entities suggested by the NLP based entity detectors to build new entity recognition models.

The entity recognition models to detect the parent entity use at least one of a maximum entropy model (maxent), conditional random fields (CRF), classification and clustering techniques and NLP based techniques.

The embodiments herein use at least one of a word sense disambiguation technique, contextual resolution technique, syntactic similarity, and semantic similarity method for resolving the plurality of entities.

The entity extraction from semi-structured data is a combination of extracting entities from structured data and unstructured data.

The system for building an entity hierarchy comprises an entity extractor to extract a plurality of entities from a big data, a Language and Domain model to conceptualize the entities in accordance with a structured context or semi-structured context, an entity resolver to resolve the entities by gathering the synonymous entities together and polysemous entities apart based on syntactic and semantic contexts, and an entity hierarchy builder to build a hierarchical structure of entities using natural language processing techniques in conjunction with the Language and Domain models.

The entity extractor comprises an entity tagger to tag named entities and unnamed entities in a data source, and a parent entity detector to determine assertions of parent entity in data sources. The entities are passed through multiple entity recognition models to determine the parent entity based on a voting procedure.

The entity recognition models to detect the parent entity use at least one of a maximum entropy model (maxent), conditional random fields (CRF), classification and clustering techniques, and NLP based techniques.

The entity tagger is adapted to tag the named entities and the unnamed entities, and tag the named entities with explicit mention of the parent entity.

The entity resolver understands the context in which the entities are being used and determines the parent entity.

The entity resolver performs a contextual resolution using the Language and Domain models which comprise at least one of language repositories, domain ontologies, and knowledge repositories in combination with Natural Language Processing (NLP) techniques.

FIG. 1 illustrates a block diagram of a system for building entity hierarchy, according to an embodiment of the present disclosure. The system comprises an entity hierarchy builder 101, a Language and Domain model 102, an entity resolver 103 and an entity extractor 104. The entity extractor 104 assists in processing and extracting entities from big data. The entity extractor 104 receives the input to the system in all forms of data including structured, semi-structured and unstructured data from heterogeneous sources. The entity extractor 104 continues to learn from the available data through different learning algorithms.

The entity extractor 104 comprises an entity tagger 105 and a parent entity detector 106. The entity tagger 105 tags named entities and unnamed entities in a data source and the parent entity detector 106 determines assertions of parent entity in data sources. The parent entity detector 106 passes the entities through multiple entity recognition models 107 to determine the parent entity based on a voting procedure.
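The voting procedure is not specified in detail above. The following is a minimal sketch, assuming a simple majority vote in which each entity recognition model either proposes a parent entity or abstains. The function `vote_parent_entity`, the `min_votes` threshold and the three toy models are hypothetical names introduced only for illustration; they stand in for trained models such as maxent or CRF models.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumption: an ER model is any callable mapping an entity string to a
# parent entity label, or None when the model abstains.
ERModel = Callable[[str], Optional[str]]


def vote_parent_entity(entity: str, models: Sequence[ERModel],
                       min_votes: int = 2) -> Optional[str]:
    """Return the parent entity that most models agree on, or None."""
    votes = Counter()
    for model in models:
        label = model(entity)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    # Treat the parent entity as undetected unless enough models agree.
    return label if count >= min_votes else None


# Toy stand-ins for trained models (hypothetical).
def medicine_model(entity: str) -> Optional[str]:
    return "Medicine" if entity.lower() in {"penicillin", "aspirin"} else None


def suffix_model(entity: str) -> Optional[str]:
    return "Medicine" if entity.lower().endswith("in") else None


def location_model(entity: str) -> Optional[str]:
    return "Location" if entity.lower() in {"bangalore", "berlin"} else None


MODELS = [medicine_model, suffix_model, location_model]

print(vote_parent_entity("penicillin", MODELS))  # two models agree
print(vote_parent_entity("Berlin", MODELS))      # one vote each, undetected
```

In this sketch an entity such as "Berlin", on which the models disagree, is left undetected so that it can be handed to a further detection step rather than being stored with an unreliable parent entity.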

The entity recognition models 107 herein use at least one of a maximum entropy model (maxent), conditional random fields (CRF), classification and clustering techniques and NLP based techniques to detect the parent entity.

The Language and Domain model 102 is a repository used to understand the context in which the entity is being used and to determine a parent entity/entity type of the entity. The Language and Domain model 102 comprises one or more language repositories 102a, domain ontologies 102b and knowledge repositories 102c. The Language and Domain model 102 is also used to resolve the entities in structured and semi-structured context.

The entity resolver 103 resolves the entities by gathering the synonymous entities together and polysemous entities apart based on syntactic and semantic contexts. The entity resolution strategies are based on resolving the syntactic and semantic context. The entity resolver uses standard domain ontologies, knowledge repositories, language repositories and natural language processing techniques to establish resolution.
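The following is a minimal sketch of how synonymous entities may be gathered together and polysemous entities held apart. It assumes a small synonym table standing in for a language repository and a table of sense cue words standing in for a domain ontology; `SYNONYMS`, `SENSES` and `resolve` are hypothetical names, and word overlap is only one simple way to approximate the semantic context.

```python
from collections import defaultdict

# Assumption: a tiny synonym table standing in for a language repository.
SYNONYMS = {
    "i.b.m.": "ibm",
    "international business machines": "ibm",
}

# Assumption: cue words per sense, standing in for a domain ontology.
SENSES = {
    "apple": {
        "Organization": {"iphone", "shares", "ceo"},
        "Fruit": {"pie", "juice", "orchard"},
    },
}


def resolve(mentions):
    """Group (surface form, context) mentions by (canonical form, sense)."""
    groups = defaultdict(list)
    for surface, context in mentions:
        normalized = surface.lower().strip()
        # Syntactic step: map synonymous surface forms to one canonical form.
        canonical = SYNONYMS.get(normalized, normalized)
        # Semantic step: pick the sense whose cue words overlap the context.
        sense = None
        senses = SENSES.get(canonical)
        if senses:
            words = set(context.lower().split())
            best = max(senses, key=lambda s: len(senses[s] & words))
            if senses[best] & words:
                sense = best
        groups[(canonical, sense)].append(surface)
    return dict(groups)


MENTIONS = [
    ("IBM", "quarterly revenue"),
    ("International Business Machines", "founded in 1911"),
    ("Apple", "apple shares rose"),
    ("apple", "baked an apple pie"),
]
RESOLVED = resolve(MENTIONS)
for key, surfaces in RESOLVED.items():
    print(key, surfaces)
```

Here the two surface forms of IBM fall into one group, while the two mentions of "apple" are kept in separate groups because their contexts point to different senses.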

The entity hierarchy builder 101 arranges and stores the plurality of entities in a hierarchical manner by using a plurality of Natural Language Processing (NLP) techniques with the support of Language and Domain model 102.
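As an illustration only, the hierarchical arrangement can be sketched as a parent-to-children mapping derived from resolved (entity, parent entity) pairs. The mapping `PARENT_OF` and the functions `build_hierarchy` and `ancestors` are hypothetical; an actual builder would draw the higher-level links from the Language and Domain model 102.

```python
from collections import defaultdict

# Assumption: resolved entities with their parent entities, including
# parent-of-parent links that a domain ontology might supply.
PARENT_OF = {
    "penicillin": "Antibiotic",
    "amoxicillin": "Antibiotic",
    "Antibiotic": "Medicine",
    "Medicine": "Concept",
    "Bangalore": "Location",
}


def build_hierarchy(parent_of):
    """Invert child -> parent links into a parent -> sorted children map."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def ancestors(entity, parent_of):
    """Return the chain of parent entities from the entity up to the root."""
    path, seen = [], {entity}
    while entity in parent_of:
        entity = parent_of[entity]
        if entity in seen:  # guard against accidental cycles
            break
        seen.add(entity)
        path.append(entity)
    return path


HIERARCHY = build_hierarchy(PARENT_OF)
print(HIERARCHY["Antibiotic"])
print(ancestors("penicillin", PARENT_OF))
```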

FIG. 2 is a block diagram illustrating a self-learning entity extraction process, according to an embodiment of the present disclosure. The named and unnamed entities are recognized in every source of data by a natural language processor based entity tagger 105. The entity tagger 105 recognizes an entity to be a possible entity without determining the parent entity. The entity tagger 105 tags a named entity and an unnamed entity in a data source. The tagged entities are passed through multiple entity recognition models 107 which use different techniques to determine the parent entity of the tagged entity based on a voting procedure.

Based on the requirement, one or more ER models 107 are used. The one or more ER models use either the same technique or different techniques, but learn different types of names. For instance, a first model learns medicine names, a second model learns location names and the like. The detected entities are then passed through a voting based parent entity detector 201 to check whether the parent entity is detected or not. The entities whose parent entity is detected 202 are stored in an entity storage 203. The entities whose parent entity is still unknown undergo a process of entity resolution. The entity resolution is executed by the Manual/Domain specific NLP based Parent Entity Detectors 204. The entity resolution uses either a manual or an automatic parent entity detector that searches for assertions of parent entities in a domain specific document collection and structured data. The Manual/Domain specific NLP based Parent Entity Detectors 204 find new parent entities and also identify entities with respect to the new parent entities. The entities whose parent entities are still not determined are sent to the collection of NER models through a training sample 205. The collection of models 107 keeps receiving new models built by learning from new entities whose parent entities are resolved through the NLP based parent entity detectors, and from new training samples and documents that are tagged with new/domain specific entities. The entities with an unknown parent entity keep going through the parent entity detection processes until the parent entity is detected (205).
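The loop described above can be sketched as follows, under several simplifying assumptions: the output of the domain specific parent entity detector is represented as a plain dictionary of assertions, a "new model" is a simple lookup built from those assertions, and the loop is bounded by `max_rounds`. All function and parameter names are hypothetical.

```python
from collections import Counter


def make_lookup_model(pairs):
    """Build a trivial ER model from learned (entity, parent) pairs."""
    table = {entity.lower(): parent for entity, parent in pairs}
    return lambda entity: table.get(entity.lower())


def detect(entity, models):
    """Majority vote over the models; None when no model proposes a parent."""
    votes = Counter(label for label in (m(entity) for m in models)
                    if label is not None)
    return votes.most_common(1)[0][0] if votes else None


def self_learning_extract(entities, models, domain_assertions, max_rounds=3):
    """Return (entity store, entities whose parent is still unknown)."""
    store, pending, models = {}, list(entities), list(models)
    for _ in range(max_rounds):
        unresolved = []
        for entity in pending:
            parent = detect(entity, models)
            if parent is not None:
                store[entity] = parent      # parent detected: store it
            else:
                unresolved.append(entity)   # parent still unknown
        pending = unresolved
        if not pending:
            break
        # Assumption: the domain specific detector is a dictionary lookup.
        learned = [(e, domain_assertions[e.lower()])
                   for e in pending if e.lower() in domain_assertions]
        if not learned:
            break
        # Add a new model built from the newly learned parent entities.
        models.append(make_lookup_model(learned))
    return store, pending


def location_model(entity):
    return "Location" if entity.lower() == "bangalore" else None


STORE, PENDING = self_learning_extract(
    ["Bangalore", "penicillin", "xyzzy"],
    [location_model],
    {"penicillin": "Medicine"},
)
print(STORE, PENDING)
```

In this run "penicillin" is unresolved in the first round, is learned from the domain assertions, and is detected in the second round by the newly added model, while "xyzzy" remains pending because no assertion about it is available.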

FIG. 3 is a block diagram illustrating a training based entity extraction process, according to an embodiment of the present disclosure. The training based models use a set of training samples 206 and are already trained to detect some parent entities from a big data 301. The big data 301 represents an input to the system in the form of data including structured, semi-structured and unstructured data from heterogeneous sources. The entities are tagged in the data and the data containing the tagged entities are passed through multiple trained entity recognition models. The one or more parent entities associated with the tagged entities are then determined. The parent entities are then passed through a voting based parent entity detector to identify the appropriate parent entity based on a voting procedure. The entities whose parent entity is detected 303 are stored in the entity storage 204.

To resolve the entities whose parent entity is not detected, additional training samples and documents that are tagged with new domain specific entities are generated, and the training samples 205 are populated with new kinds of parent entities suggested by the NLP based entity detectors to build new entity recognition models 302.

The training based entity recognition is also referred to as automatic learning, because an entity need not be explicitly included in the training set to be recognized, as long as the entity is of the designated type.

FIG. 4 is a block diagram illustrating a process for resolving entities, according to an embodiment of the present disclosure. The entity extractor 104 extracts entities from big data and determines the parent entity associated with each entity. The entity extractor 104 then passes the unresolved entities to the entity resolver 103 for entity disambiguation.

The entity resolver 103 comprises a plurality of resolution modules 401 such as entity resolution module 1, entity resolution module 2 . . . to entity resolution module n for resolving the extracted entities. The entity resolver 103 understands the context in which the entity is being used to determine the parent entity. The entity resolver 103 uses any one or a combination of a word sense disambiguation technique, a contextual resolution technique, a syntactic similarity and a semantic similarity for resolving the entities.

The entity extraction process is a combination of automatic learning and training based learning. An initial set of named entities and concepts is identified based on certain rudimentary NLP based rules, and a parent entity of the identified entities and concepts is discovered. Parent entity learning is also facilitated by using tagged data for training. As more than one method is used for learning, a voting based entity resolution is performed which establishes entity recognition by a maximum score. A voting based entity resolver 402 conducts a voting procedure on the output of the various entity resolution modules 401 and provides resolved entities for further processing.
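How the maximum score is computed is not detailed above. One possible reading, sketched here as an assumption, is that each resolution module returns a candidate together with a confidence score, the scores are summed per candidate, and the candidate with the highest total is selected. The function `vote_by_score` and the sample scores are hypothetical.

```python
from collections import defaultdict


def vote_by_score(module_outputs):
    """Pick the candidate with the highest summed score, or None if empty.

    module_outputs: iterable of (candidate, score) pairs, one per
    resolution module (assumed output format).
    """
    totals = defaultdict(float)
    for candidate, score in module_outputs:
        totals[candidate] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical outputs of three resolution modules for the mention "Apple".
OUTPUTS = [("Organization", 0.6), ("Fruit", 0.7), ("Organization", 0.5)]
print(vote_by_score(OUTPUTS))
```

With summed scores, two moderately confident modules that agree can outweigh a single more confident module, which is one reason to combine several resolution techniques.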

FIG. 5 is a flow chart illustrating a method for building entity hierarchy, according to an embodiment of the present disclosure. The method comprises extraction of entities, resolution of entities and then building a hierarchy of entities. At first, a big data is taken as input which is processed for extracting a plurality of entities. The big data comprises structured, semi-structured and unstructured data from a plurality of heterogeneous data sources. The entities extracted are any one of a named entity and an unnamed entity (501). Then, a parent entity or a super entity of each extracted entity is determined by understanding the context in which the extracted entity is used (502). After the determination of the parent entity, the entities are resolved by bringing the synonymous entities together and holding the polysemous entities apart based on a semantic context and a syntactic context (503). Finally, a hierarchical structure of entities is built by using knowledge repositories, ontologies, and language repositories along with Natural Language Processing (NLP) techniques (504).

FIG. 6 is a flow chart illustrating a method for extracting entities from structured data sources, according to an embodiment of the present disclosure. The method comprises identifying each data point as an entity at 601. Here, each data point is any one of a table entity, a value entity, an attribute entity and a database entity. After identifying the data points as entities, the entities are identified based on a relationship defined with other entities at 602.
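A minimal sketch of steps 601 and 602 for a relational source is given below, assuming an SQLite database accessed through Python's standard `sqlite3` module. The function `extract_structured_entities`, the relation names `has_table`, `has_attribute` and `has_value`, and the sample table are assumptions made for illustration.

```python
import sqlite3


def extract_structured_entities(conn, database_name):
    """Return (entities, relations) found in an SQLite database.

    entities:  list of (kind, name) with kind in
               {"database", "table", "attribute", "value"}
    relations: list of (source, relation, target) triples
    """
    entities = [("database", database_name)]
    relations = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        entities.append(("table", table))
        relations.append((database_name, "has_table", table))
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [desc[0] for desc in cursor.description]
        for column in columns:
            attribute = f"{table}.{column}"
            entities.append(("attribute", attribute))
            relations.append((table, "has_attribute", attribute))
        for row in cursor:
            for column, value in zip(columns, row):
                entities.append(("value", value))
                relations.append((f"{table}.{column}", "has_value", value))
    return entities, relations


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drug (name TEXT, maker TEXT)")
conn.execute("INSERT INTO drug VALUES ('penicillin', 'Acme Pharma')")
ENTITIES, RELATIONS = extract_structured_entities(conn, "pharmacy")
for entity in ENTITIES:
    print(entity)
```

Every data point (database, table, attribute, value) becomes an entity, and the schema itself supplies the relationships between them, so no language processing is needed for this kind of source.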

FIG. 7 is a flow chart illustrating a method for extracting entities from unstructured data through a self-learning entity extraction process, according to an embodiment of the present disclosure. Entity recognition is first performed on the big data by tagging entities, without explicitly knowing the parent entity, using a natural language processing technique at 701. The entity recognition is performed using at least one of a maximum entropy model (maxent), a conditional random field (CRF) and classification and clustering techniques. The tagged entities are passed through trained entity recognition models at 702 to learn the parent entity associated with each tagged entity at 703. It is then checked at 704 whether the parent entity of the tagged entity is detected. If the parent entity is detected, the entities are stored in an entity storage along with the parent entity information and other context specific information at 705. If the parent entity is not detected, the entities are passed through an NLP based entity detector at 706. The NLP based entity detector parses documents that contain domain specific knowledge and learns from the explicit statements that are present. For instance, if the NLP based entity detector comes across a phrase such as "medicines such as penicillin", the term penicillin is learnt as a medicine name, and thereafter penicillin is tagged as a medicine rather than as just a named entity. Further, one or more new entity recognition models, which include information on the new entities and the new parent entities, are built at 707. The entities are then passed through the multiple new entity recognition models until the parent entity is obtained at 708. Finally, the entity recognition models are populated with the new entity recognition models at 709. The new entity recognition models, which are built by learning from samples containing new entities whose parent entities are identified in domain specific documents through the NLP based entity detectors, are added to the existing entity recognition models.
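By way of a non-limiting illustration, the learning of a parent entity from an explicit statement such as "medicines such as penicillin" may be sketched with a single lexical pattern. The disclosure does not specify how the NLP based entity detector is implemented; the regular expression, the naive singularisation and the names SUCH_AS, _singular and learn_parents below are assumptions made for this sketch only.

```python
import re
from typing import Dict

# "<parent> such as <item>[, <item>...][, and|or <item>]" (assumed pattern).
SUCH_AS = re.compile(
    r"\b([A-Za-z]+)\s+such\s+as\s+"
    r"(\w+(?:\s*,\s*(?!(?:and|or)\b)\w+)*(?:\s*,?\s+(?:and|or)\s+\w+)?)",
    re.IGNORECASE,
)
# Separators between listed items: commas and the words "and" / "or".
ITEM_SPLIT = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", re.IGNORECASE)


def _singular(word: str) -> str:
    # Naive singularisation, an assumption for this sketch only.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def learn_parents(text: str) -> Dict[str, str]:
    """Map each listed single-word entity to the parent entity named before 'such as'."""
    learned: Dict[str, str] = {}
    for match in SUCH_AS.finditer(text):
        parent = _singular(match.group(1).lower())
        for item in ITEM_SPLIT.split(match.group(2)):
            if item:
                learned[item.lower()] = parent
    return learned


learned = learn_parents("Doctors prescribe medicines such as penicillin.")
# learned == {"penicillin": "medicine"}
```

The learned pairs could then serve as samples for building the new entity recognition models at 707. A practical detector would likely need a parser rather than a regular expression in order to handle multi-word entities and implicit statements.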

FIG. 8 is a flow chart illustrating a method for extracting entities from unstructured data sources through a training based entity extraction process, according to an embodiment of the present disclosure. The data containing tagged entities are passed through multiple trained entity recognition models at 801. Then, one or more parent entities associated with a tagged entity are determined at 802, and the appropriate parent entity for the tagged entity is recognized based on a voting procedure at 803. Further, additional training samples and documents that are tagged with new domain specific entities are provided at 804. Finally, the training sample is populated with new kinds of parent entities suggested by the NLP based entity detectors in order to build new entity recognition models at 805.
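By way of a non-limiting illustration, steps 801 to 803 may be sketched as a plain majority vote over the parent entities proposed by several models. The disclosure does not specify the voting scheme; the unweighted majority rule, the tie-breaking behaviour and the names Model and vote_parent below are assumptions. Each model is represented as a hypothetical callable that returns a parent entity or None.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A model takes (entity, context) and proposes a parent entity, or None.
Model = Callable[[str, str], Optional[str]]


def vote_parent(entity: str, context: str, models: Iterable[Model]) -> Optional[str]:
    """Return the parent entity proposed by the most models, or None if none is proposed."""
    votes = Counter()
    for model in models:                    # step 801: pass through each model
        parent = model(entity, context)     # step 802: candidate parent entities
        if parent is not None:
            votes[parent] += 1
    if not votes:
        return None                         # no model detected a parent entity
    # Step 803: plain majority; on a tie, the parent proposed first wins.
    return votes.most_common(1)[0][0]


models = [
    lambda entity, context: "fruit",
    lambda entity, context: "organization",
    lambda entity, context: "organization" if "shares" in context else None,
]
winner = vote_parent("Apple", "Apple shares rose today", models)
# winner == "organization"
```

Where no model proposes a parent entity, the sketch returns None, which corresponds to the case in which the entity is handed on to the NLP based entity detector.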

The embodiments herein extract entities based on certain NLP rules. The entity extractor continues to learn from the available data through different learning algorithms. The inclusion of concepts among entities supports a wider scope for querying the data, and the ability to recognize and resolve concepts gives much higher expressiveness for modeling semantics. The entity hierarchy helps in bringing in entities related to those mentioned in the query. Building an entity hierarchy functions as query enrichment (query enrichment with semantic resolution) that allows any query to encompass all the entities of interest and to eliminate the ones that are not pertinent.
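By way of a non-limiting illustration, query enrichment over an entity hierarchy may be sketched as expanding a query term to itself and all of its descendants. The sketch assumes that the hierarchy is held as a mapping from each parent entity to its child entities. The name expand_query and the sample hierarchy are hypothetical.

```python
from typing import Dict, List, Set


def expand_query(term: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the query term together with every entity below it in the hierarchy."""
    expanded: Set[str] = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node in expanded:
            continue                      # guards against cycles in the mapping
        expanded.add(node)
        stack.extend(children.get(node, []))
    return expanded


hierarchy = {
    "medicine": ["antibiotic", "analgesic"],
    "antibiotic": ["penicillin"],
}
terms = expand_query("medicine", hierarchy)
# terms == {"medicine", "antibiotic", "analgesic", "penicillin"}
```

A query for "medicine" is thereby enriched to cover "penicillin" even though the query does not mention it, while entities outside the subtree are left out.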

The present disclosure finds relevant entities and relationships even when the entity names are not mentioned explicitly in the big data. The entity hierarchy is useful, for example and without limitation, when a user has to search or query for entities and their relationships or interactions with other named entities or concepts. The entity hierarchy encompasses all the named and unnamed entities that exist in the big data. The embodiments of the present disclosure provide benefits in fields such as retail, health and pharmaceutical services, banking and insurance.

Implementation of the techniques, blocks, steps and means described above can be done in various ways. For example, these techniques, blocks, steps and means can be implemented in hardware, software, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processing devices, programmable logic devices, field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described above and/or a combination thereof.

Also, it is noted that the embodiments can be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although the flowcharts describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations can be rearranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

Furthermore, embodiments can be implemented by hardware, software, scripting languages, firmware, middleware, microcode, hardware description languages and/or any combination thereof. When implemented in software, firmware, middleware, scripting language and/or microcode, the program code or code segments to perform the necessary tasks can be stored in a machine readable medium, such as a storage medium. A code segment or machine-executable instruction may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a script, a class, or any combination of instructions, data structures and/or program statements. A code segment can be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters and/or memory contents. Information, arguments, parameters, data, etc. can be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.

For a firmware and/or software implementation, the methodologies can be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. Any machine-readable medium tangibly embodying instructions can be used in implementing the methodologies described herein.

The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification.

Claims

1. A method of building an entity hierarchy comprises:

extracting a plurality of entities from a big data;

determining a parent entity by understanding a context in which the entity is used;

resolving the entities by bringing the synonymous entities together and holding the polysemous entities apart based on a semantic context and a syntactic context; and

building a hierarchical structure of entities using knowledge repositories, ontologies, and language repositories along with natural language processing techniques.

2. The method of claim 1, wherein the big data comprises structured, semi-structured and unstructured data.

3. The method of claim 1, wherein each entity is associated with a parent entity.

4. The method of claim 1, wherein the entity is at least one of named entities and unnamed entities:

where the named entities belong to one of the parent entities and include names of persons, organizations, locations, time expressions, quantities, monetary values and the like; and

the unnamed entities include nouns, verbs, combinations of nouns and verbs, a concept or a language unit with independent meaning.

5. The method of claim 1, wherein extracting the plurality of entities from the structured data comprises:

identifying each data point as an entity; and

identifying entities based on a relationship defined with other entities;

wherein the data point comprises at least one of a table entity, a value entity, an attribute entity, and a database entity.

6. The method of claim 5, wherein extracting the plurality of entities from unstructured data comprises:

recognizing the named entities and the unnamed entities from data sources using a natural language processor based entity tagger;

passing the named entities and unnamed entities through multiple entity recognition models;

determining the parent entity; and

storing the entities along with respective parent entity and context specific information in an entity store.

7. The method of claim 6, wherein the entity extraction from unstructured data is a combination of a self-learning process and a training based learning process.

8. The method of claim 7, wherein the self-learning entity extraction process comprises:

performing entity recognition by tagging entities without explicitly knowing the parent entity using a natural language processing technique;

passing the tagged entities through trained Entity Recognition (ER) models to learn the parent entity associated with the tagged entity;

detecting the parent entity using a voting procedure; and

storing the entities whose parent entities are detected in the entity store.

9. The method of claim 7, further comprises:

feeding the data containing the entities whose parent entities are not detected to a Natural Language Processor (NLP) based entity detector, which involves parsing documents containing domain specific knowledge and learning the parent entities from the explicit or implicit facts stated in the documents;

building new entity recognition models;

passing the entities through multiple entity recognition models until the parent entity is obtained; and

populating the entity recognition models with new entity recognition models built by learning from samples containing new entities whose parent entities are identified in domain specific documents through the NLP based entity detectors.

10. The method of claim 7, wherein the training based entity extraction process comprises:

passing the data containing the tagged entities through multiple trained entity recognition models;

determining one or more parent entities associated with the entities; and

recognizing the appropriate parent entity based on a voting procedure.

11. The method of claim 10, further comprises:

providing additional training samples and documents that are tagged with new domain specific entities; and

populating the training sample with new kind of parent entities suggested by the NLP based entity detectors to build new entity recognition models.

12. The method of claim 1, wherein the entity recognition models to detect the parent entity use at least one of:

maximum entropy model (maxent);

conditional random fields (CRF);

classification and clustering techniques; and

NLP based techniques.

13. The method of claim 1, wherein resolving the plurality of entities comprises at least one of a:

word sense disambiguation technique;

contextual resolution technique;

syntactic similarity; and

semantic similarity.

14. The method of claim 1, wherein the entity extraction from semi-structured data is a combination of extracting entities from structured data and unstructured data.

15. A system for building an entity hierarchy comprises:

an entity extractor to extract a plurality of entities from a big data;

a Language and Domain model to conceptualize the entities in accordance with a structured context or semi-structured context;

an entity resolver to resolve the entities by gathering the synonymous entities together and polysemous entities apart based on syntactic and semantic contexts; and

an entity hierarchy builder to build a hierarchical structure of entities using natural language processing techniques in conjunction with the Language and Domain models.

16. The system of claim 15, wherein the entity extractor comprises:

an entity tagger to tag named entities and unnamed entities in a data source; and

a parent entity detector to determine assertions of parent entity in data sources, where the entities are passed through multiple entity recognition models to determine the parent entity based on a voting procedure.

17. The system of claim 16, wherein the entity recognition models to detect the parent entity use at least one of:

maximum entropy model (maxent);

conditional random fields (CRF);

classification and clustering techniques; and

NLP based techniques.

18. The system of claim 15, wherein the entity tagger is adapted to:

tag the named entities and the unnamed entities; and

tag the named entities with explicit mention of the parent entity.

19. The system of claim 15, wherein the entity resolver understands the context in which the entities are being used and determines the parent entity.

20. The system of claim 15, wherein the entity resolver performs a contextual resolution using the Language and Domain models which comprise at least one of a:

language repositories;

domain ontologies; and

knowledge repositories in combination with Natural Language Processing (NLP) techniques.