SMART TERMINOLOGY MARKER SYSTEM FOR A LANGUAGE TRANSLATION SYSTEM
A terminology marker system integrates a terminology analytical component for quantifying the amount of linguistic noise found in the translation output as measured against a dictionary. Correlating the measured noise on a continuous basis enables the analytical component to build terminology predictive models, which are fed back to upstream components of the supply chain to improve future translation of new content. The system also provides a smart terminology assessment component for assessing linguistic assets and improving the quality of those assets to assist in translation. The system also provides a smart terminology evaluation component that analyzes MT output to make smart decisions on reducing the amount of post editing corrections needed for delivering a persistent level of translation quality. The integration and configuration of the system components within a translation supply chain assists in delivering a reliable level of translation quality by reducing the linguistic noise across all components of the supply chain.
This application is a continuation of and claims priority from U.S. patent application Ser. No. 14/991,025, filed on Jan. 8, 2016, entitled “SMART TERMINOLOGY MARKER SYSTEM FOR A LANGUAGE TRANSLATION SYSTEM,” the content of which is incorporated herein by reference in its entirety.
BACKGROUND
The present disclosure relates to language translation systems and, more particularly, to a smart terminology marker system of the language translation system.
Companies typically develop written material such as web pages, user interfaces, marketing materials and others in a native language and subsequently employ a language translation service to translate the company's web pages (as one example) into different languages. Language translation services may utilize a translation supply chain (TSC) that may include an integration of linguistic assets/corpuses, translation automated systems, computer-aided translation editors, professional linguists, and operational management systems.
The TSC may include three stages. The first stage may be a linguistic asset optimization stage that may parse source language content into source segments, and search a repository of historical linguistic assets for the best suggested translations per language and per domain within the language. Linguistic assets may be historical translation memories (i.e., bi-lingual segment databases), dictionaries, and/or language specific metadata used to optimize downstream stages. The second stage of the TSC may be a machine translation stage that customizes a translation model using domain specific linguistic assets of a given language, and provides machine generated suggested translations of original content based upon the customized translation model. The third stage may be a post-editing stage that may use a computer-aided translation (CAT) editor to review the suggested translations (i.e., matches) to produce a final translation. The professional linguist (i.e., human) may accept one of the suggested matching translations, may modify one of the suggested matching translations, or may generate a completely new translation, and then delivers final, human-fluent translated content to the company.
Machine translation systems typically implement phrase-based translations that have limited sensitivity to morphological, syntactical and/or semantic differences between the source and target languages. Customizing (i.e., training) a phrase-based statistical machine translation system is common: bilingual corpuses are used to prioritize the statistical hits of correct translations within the phrase-based statistical machine translation. Rule based machine translation is customized by managing a lexicon of terms aligned to a subject area. Terminology assets refer to the set of dictionaries/databases per language that may have the following properties: highly structured information; morphological, syntactical, and semantic information; and enterprise international business metadata. Improvements in the overall quality of translations on a consistent basis are desirable.
SUMMARY
In accordance with an embodiment, a computer implemented method is provided in which a Smart Term Assessment subsystem (STA-SS) embeds a Smart Term Index marker within a plurality of segments (i.e., previous learning corpuses and/or new content) based on a reference domain dictionary; the Smart Term Index markers may improve the training and optimization of downstream components (e.g., MT), thus producing better translations.
In accordance with an embodiment, a computer implemented method is provided in which a Smart Term Evaluation subsystem (STE-SS) analyzes the embedded Smart Term Index markers contained across a plurality of matches (potential language translations) against the reference domain dictionary and the terminology predictive models to filter and qualify the matches (i.e., the STE-SS may remove matches deemed to be of poor quality).
In accordance with an embodiment, a computer implemented method is provided in which a Smart Term Linguistic Analytical subsystem (STLA-SS) analyzes a plurality of post editing logs (PE logs) to generate a match dictionary that can be correlated with the original reference domain dictionary and the final (post PE) dictionary.
In one embodiment the STLA-SS provides methods for
- a) generating a Best Term Index (BTI) by using the plurality of best matches across the plurality of source and target language segments and the respective final dictionary,
- b) generating a Perfect Term Index (PTI) by using a plurality of final translations across the plurality of source and target language segments and the respective final dictionaries,
- c) generating a Final Term Index (FTI) by using the plurality of final translations across the plurality of source and target language segments and using the respective original reference dictionaries,
- d) generating a Machine Term Index (MTI) by using the plurality of best matches across the plurality of source and target language segments and using respective match dictionaries,
- e) generating a Final Match Term Index (FMTI) by using the plurality of final translations across the plurality of source and target language segments and using the respective match dictionaries, and
- f) generating a plurality of terminology predictive models by analyzing the patterns and correlations between the dictionary terms and the computed terminology indexes (BTI, PTI, FTI, MTI and FMTI).
In accordance with another embodiment, a computer implemented method for translating a language includes parsing source and target language content into segments, searching a repository of linguistic assets, creating a translation model using domain specific linguistic assets of the language, providing machine generated suggested matches of the source and target language segments based upon the customized translation model, using a computer-aided translation editor to review the suggested matches to produce a final translation, and applying smart terminology markers generated by a smart terminology marker system to reduce linguistic noise.
In accordance with a further embodiment, a computer program product for language translation applications may include a translation supply chain and a smart terminology marker system. The translation supply chain includes an asset optimization (i.e., translation memory) component configured to parse source language content into a plurality of source segments and to search a repository of historical linguistic assets. The asset optimization component produces a plurality of matches, each classified as an Exact match, a Fuzzy match or another type of match. A machine translation (MT) component is configured to deliver a plurality of machine matches corresponding to the plurality of source segments optimized against a custom domain MT model. A post editing component is configured to correct and produce the final translation segments against the respective source segments, with human professional linguists editing and correcting in any given embodiment of a computer aided translation editor. The smart terminology marker system is configured to use at least one of business analytics and terminology memory mining to reduce linguistic noise across the translation supply chain.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
In accordance with exemplary embodiments of the disclosure, methods, systems and computer program products for a language translation system 20 are provided. Referring to
In the present disclosure, terminology assets applied via a feedback loop across the TSC 22 by the STMS 24 function to reduce linguistic noise and may improve the overall quality of the translations on a consistent basis. More specifically, the use of smart terminology markers may identify and assist in eliminating hidden linguistic noise (i.e., terminology noise) found in the translation assets (e.g., translation memory/bilingual corpus assets) during customization. By quantifying the terminology noise found in source segments and the plurality of potential target matches, the systems and methods outlined herein may allow an operational team to improve the creation of terminology-enriched training materials. It is understood that the term “linguistic noise” is a measurable unit corresponding to the human labor expended (i.e., mechanical and/or cognitive effort) to correct faults in translation memory and/or machine translation matches such that the final translated content is of human fluency quality levels. By utilizing the STMS 24, operational teams are able to manage and optimize the terms used within the dictionaries, thereby reducing linguistic noise and improving efficiency of the TSC 22.
Translation Supply Chain:
In one embodiment, the TSC 22 facilitates translation workflows that may be used in the delivery of high quality fluent language translations. The TSC 22 may include a translation memory (TM) component 26, a machine translation (MT) component 28, and a post editing (PE) component 30. It is understood that use of the term ‘component’ may imply a stage of a process and/or method that may utilize computer-based processor(s) and associated computer readable memory to accomplish a given task.
The TM component 26 may also be referred to as a linguistic asset optimization component or stage that may parse source language content into source segments, and search a repository of historical linguistic assets for the best suggested translations per language and per a domain within the language. Linguistic assets may be historical translation memories (i.e., bi-lingual segment databases), dictionaries, and/or language specific metadata used to optimize downstream components 28, 30. More specifically, the TM component 26 may manage the delivery of high quality/domain specific linguistic assets optimized for the downstream components 28, 30. The assets may include: a plurality of high quality and certified previously translated translation memory matches that aid the human professional linguist in making corrections more efficiently in the PE component 30; a plurality of ‘learning translation memory’ datasets containing a plurality of previously translated bilingual segments that are used to train and tune the MT component 28 (i.e., services); and, a terminology database (DB) (i.e., Language Dictionary) for a given domain.
The translation memory component 26 may generally be any system and/or method involved in the production of potential translation matches (e.g., Exact matches, Fuzzy matches and/or other matches) corresponding to the plurality of new content source segments used to improve the efficiency of downstream components (e.g., MT component 28). The translation memory component 26 may use the plurality of previously translated segments and/or dictionaries for a given language as an ‘asset optimization’ for downstream components. It is understood that the term ‘segment’ may mean a plurality of words or terms that may, for example, be a sentence or a partial sentence.
The MT component 28 may deliver a plurality of machine matches corresponding to the plurality of new content source segments optimized against a custom domain machine translation service. The MT component 28 may integrate an increasing number of linguistic subcomponents. For instance, an MT component 28 building custom domain MT models may be dependent on the quality of the linguistic asset data service 38 used as input to the customization components for a specific domain (i.e., subject discourse).
The PE component 30 may utilize human professional linguists to review, correct, and perform quality control on the new content source segments and the respective matches (e.g., Exact Match 46, Fuzzy Match 48 and/or Machine Match 50, see
Linguistic assets may be any data set considered to be representative of the space, domain or subject matter existing ‘prior’ to the translation of new language content. Typically, linguistic assets may be bi-lingual pairs of historical translations contained within a data set that may be called a translation memory (i.e., at a segment and/or sentence granularity) and/or a Dictionary (i.e., at a word/term or simple phrase granularity).
When applying linguistic assets, new language content may be broken down into segments with the goal of producing a translation per segment of optimal accuracy and with no post editing. The production of suggested translation candidates may be referred to as matches. Referring to
The value or quality of linguistic assets may generally be measured by the quantity of linguistic noise. The language translation system 20 may include or implement techniques of statistical process analytic and control that analyze metadata supplied from the TM component 26, the MT component 28 and/or the PE component 30. By analyzing the metadata from the PE component 30 logs (i.e., at the end of the TSC 22 flow), the operational analytical systems are able to provide visualization and model the efficiency of the downstream components across the whole TSC 22.
Linguistic Vectors and Linguistic Noise Per Classset:
Referring to
For each shipment, the STLA-SS 32 analyzes the plurality of metadata metrics across the plurality of editing events collected within the shipment's PE Log 92 (see
Referring to
There are many factors (observed and hidden) that may contribute toward linguistic noise, and such elements may include: quality of content, consistency of terminology, complexity of subject area, format of original content, tags and in-line tags, MT 28 settings, language specific algorithms and rules, post editing practices, human errors, computer aided translation skills, cultural and domain knowledge, spending too much time evaluating bad MT matches, and others. Each component 26, 28, 30 may supply input markers metadata for correlating and analyzing against linguistic markers and thereby assess and model its contribution of linguistic noise to the overall TSC 22 linguistic noise.
Smart Terminology Marker System:
Referring to
The linguistic asset store 42 of the linguistic asset component 38 may store a Language Dictionary 43 (i.e., terminology store) as a linguistic asset for use by any component 26, 28, 30 of the TSC 22. The Language Dictionary 43 may generally be a plurality of words associated with a single language. The smart terminology marker system 24 may use business analytics to add translation supply chain analytical metadata to each term (i.e., word). Such metadata may contain, but is not limited to: the frequency of each term within the TM component 26; the classification of whether the term is a non-prescribed word within the language; and the average linguistic noise associated with the plurality of translation segments containing the respective term. The average may be a rolling measurement representative of translations over a previous period of time.
The linguistic asset store component 42 may store a Reference Domain Dictionary 52 accessible by any component (e.g., components 26, 28, 30) of the TSC 22. The Domain Dictionary may generally be a plurality of words for a given language associated with a specific subject area, discourse or discipline. The plurality of terms within the Domain Dictionary is a subset of the plurality within the Language Dictionary. The union of all Domain Dictionaries within the TSC 22 composes the significant set of terms in the Language Dictionary. The STMS 24 may store additional information about each term such as, but not limited to: the frequency of each term across all domain assets within the asset optimization component 26; the classification of whether the term is a non-prescribed word for the specific domain within the language; and, the average linguistic noise associated with the plurality of translation segments containing the respective term. This may be a measurement that is updated as new translations are performed over a period of time. Such information per term may be referred to as the term's metadata.
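The per-term metadata listed above can be pictured as a small record that is updated each time a translated segment containing the term is logged. The following is a minimal sketch only: the class name `TermMetadata`, its field names, and the use of a cumulative running mean (rather than a time-windowed rolling average) are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class TermMetadata:
    """Hypothetical per-term record for a dictionary entry."""
    term: str
    frequency: int = 0            # number of segments seen that contain the term
    non_prescribed: bool = False  # whether the term is a non-prescribed word
    average_noise: float = 0.0    # mean linguistic noise of those segments

    def record_segment(self, segment_noise: float) -> None:
        """Fold the noise of one more segment into the running mean."""
        self.frequency += 1
        # Incremental mean update; avoids storing every observation.
        self.average_noise += (segment_noise - self.average_noise) / self.frequency


meta = TermMetadata(term="server")
meta.record_segment(0.2)
meta.record_segment(0.4)
```

A windowed average over a fixed period would need the individual observations or timestamps to be retained; the running mean is used here only because it keeps the sketch short.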
Referring to
Consistent terminology may be a key driver of quality translation across the whole TSC 22. Given a plurality of language dictionaries and a plurality of domain dictionaries per language, the operational team of a TSC 22 needs the ability to visualize and track linguistic noise when managing the dictionaries of the TSC 22. Thus, the STMS 24 introduces a Smart Term Index value that is used to measure the alignment of a plurality of segments and/or matches with the Reference Domain Dictionary 52.
The STMS 24 defines systems and methods for computing a “Smart Term Index” value at a per segment/match level and embedding it within translation memories as a linguistic marker, such that the marker passes through the TSC 22 and can then be analyzed by the STLA-SS 32 to measure the linguistic noise contributed by misaligned terminology across the TSC 22.
Referring to
where ‘m’ is a given match, ‘RefDictSourceTerms’ is the plurality of terms in the Reference Domain Dictionary, ‘PrescribedTargetTerms’ represents the plurality of prescribed target terms of a given match within the Reference Domain Dictionary 52 associated with respective source terms that are found in the Reference Domain Dictionary 52, and ‘MatchCoefficient’ is a numerical value between zero and one that is used to weight a specific ‘TermIndex_m’ based on external factors. In one embodiment, the ‘MatchCoefficient’ may be the Levenshtein Edit Distance between a match source string and the respective original source segment, which may be called the fuzzy score.
Each match may be assigned a ‘Term_INDEX1’ value in the range from 0.0 to 1.0. A score of 1.0 means that one-hundred percent of the prescribed target translations were found in a match. A score of 0.0 means that none of the prescribed target translations were found in the match.
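A raw Levenshtein edit distance is an integer count of edits, so some normalization is needed before it can serve as a ‘MatchCoefficient’ between zero and one. The sketch below assumes one common normalization, one minus the distance divided by the length of the longer string; the disclosure does not state which normalization is intended, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_coefficient(match_source: str, original_source: str) -> float:
    """Assumed fuzzy score in [0.0, 1.0]; 1.0 means identical strings."""
    longest = max(len(match_source), len(original_source))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(match_source, original_source) / longest
```

Under this assumption an exact match carries a coefficient of 1.0 and leaves the Term Index unweighted, while a fuzzy match whose source differs from the new segment scales the index down in proportion to the difference.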
Referring to
Referring to
Given a TSC 22 managing multiple dimensions (i.e., variables) the STLA-SS 32 enables a translation operational team to use Smart Term Index markers for reducing the smart term linguistic vector for a plurality of segments associated within a given classset, and building a statistical model(s) that enables the STE-SS 36 to predict the minimal FTI given a plurality of OTI for a given classset.
Smart Term Linguistic Analytic Subsystem:
Referring to
- The smart term linguistic vector as a representation of noise caused by terminology changes from a Reference Domain Dictionary 52.
- Measure the amount of linguistic noise attributed from a plurality of Smart Term Index markers passed from downstream linguistic components.
- Assess and weight the importance of terms for a given Reference Domain Dictionary across a plurality of shipments within a TSC 22.
- Create smart term models and identify patterns for Smart Term Index for a plurality of Reference Domain Dictionaries.
- Enable predictive analytics to alert when linguistic assets (memories/termDB) are no longer aligned with a Final Dictionary (terminology changes) relative to a Reference Domain Dictionary, indicating when action is needed to harmonize the two.
The smart term evaluation subsystem 36 may perform the various tasks illustrated in
The STLA-SS 32 uses the Linguistic Analytic Business Data Services (LABA) 39 to retrieve PE log 92 event data. The PE logs 92 support aggregating events across a Majorkey 330 of a multi category classset 300. The Majorkey 330 category may be languages, shipments (per language), documents (per shipment), segments (per document), term domain or any other dimension.
Referring to
As step 102 in
gives for a given Reference Dictionary ‘RefDict’:
where ‘SourceWords’ is the plurality of terms within a source segment, ‘TargetWords’ is the plurality of terms within a target segment, ‘SourceTerms’ is the plurality of terms within the Reference Domain Dictionary 52 (i.e., RefDict), ‘PrescribedTerms’ is the plurality of target translation terms associated with the respective set of SourceTerms, and ‘Coefficient’ is a number from 0.0 to 1.0 reflecting the percentage of source terms within the Reference Domain Dictionary 52.
It is further noted that if the Source Count is zero, then an NA (i.e., not any) value is assigned to the Smart Term Index. Moreover, when the Source Count and the Prescribed Count are close to each other, the index is close to 1.0 before the Coefficient value is applied.
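The equation itself does not survive in this text, but the definitions and the two remarks above suggest a ratio of the Prescribed Count to the Source Count, weighted by the Coefficient. The following minimal sketch rests on that assumption; the function name, the single-word term lookup and the use of `None` to stand for NA are illustrative choices, not the disclosed method.

```python
from typing import Dict, List, Optional


def smart_term_index(source_words: List[str],
                     target_words: List[str],
                     ref_dict: Dict[str, str],
                     coefficient: float = 1.0) -> Optional[float]:
    """Assumed form: coefficient * (Prescribed Count / Source Count)."""
    # Source Count: words of the source segment that are dictionary terms.
    source_terms = [word for word in source_words if word in ref_dict]
    if not source_terms:
        return None  # NA: no dictionary terms occur in the source segment
    target_set = set(target_words)
    # Prescribed Count: dictionary terms whose prescribed translation
    # actually appears in the target segment.
    prescribed_count = sum(1 for term in source_terms
                           if ref_dict[term] in target_set)
    return coefficient * prescribed_count / len(source_terms)


ref_dict = {"server": "servidor", "file": "archivo"}
index = smart_term_index(["open", "the", "file", "on", "the", "server"],
                         ["abrir", "el", "archivo", "en", "el", "server"],
                         ref_dict)
```

In the usage lines, one of the two dictionary terms found in the source has its prescribed translation in the target, so the index is 0.5; multi-word terms and morphological variants would need a real term matcher in place of the plain word lookup.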
Step 104 illustrated in
Creation of the Match Dictionary 70 from the plurality of matches may include:
where ‘m’ equals the number of matches, ‘n’ equals the number of target translation words per match, and ‘MatchBiLingualPair’ is a source and target term where the target term is a prescribed equivalent term within a match found within a domain or language dictionary. The plurality of MatchBiLingualPair source terms is the set of source terms for respective prescribed translations within the domain or language dictionary.
Creation of a Final Dictionary 72 using the plurality of final segments may include:
where ‘m’ is equal to the number of final segments, ‘n’ is equal to the number of target translation words per final segment, ‘FinalBiLingualPair’ is a source and target term where the target term is a prescribed equivalent term within a final translation segment found within a domain or language dictionary, and the plurality of ‘Final BiLingualPair source terms’ is the set of source terms for respective prescribed translations within the domain or language dictionary.
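The Match Dictionary 70 and the Final Dictionary 72 are described in the same way: both collect the bilingual pairs whose prescribed target term is actually observed, and they differ only in whether matches or final segments are scanned. A minimal sketch under that reading follows; the function name and the representation of a segment as a pair of word lists are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# A segment (or match) as (source words, target words); hypothetical layout.
Segment = Tuple[List[str], List[str]]


def build_observed_dictionary(segments: Iterable[Segment],
                              domain_dict: Dict[str, str]) -> Dict[str, str]:
    """Collect the bilingual pairs whose prescribed target term is observed."""
    observed: Dict[str, str] = {}
    for source_words, target_words in segments:
        target_set = set(target_words)
        for word in source_words:
            prescribed = domain_dict.get(word)
            # Keep the pair only when the prescribed translation is present.
            if prescribed is not None and prescribed in target_set:
                observed[word] = prescribed
    return observed


domain_dict = {"server": "servidor", "file": "archivo"}
matches = [(["file", "server"], ["archivo", "server"])]
finals = [(["file", "server"], ["archivo", "servidor"])]
match_dictionary = build_observed_dictionary(matches, domain_dict)
final_dictionary = build_observed_dictionary(finals, domain_dict)
```

Here the Match Dictionary holds only the "file" pair, because the match left "server" untranslated, while the Final Dictionary holds both pairs after post editing.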
The OTI 76 for each child classset (M,S) associated with the original referenced domain dictionary is computed as follows:
where ‘M’ is the set of match types, and ‘S’ is the set of segment scope (size: Small, Medium, Complex).
The BTI 80 is computed using the plurality of best matches 78 and the Final Dictionary 72 as the Reference Domain Dictionary 52. The FMTI 88 is computed using the plurality of final translation segments 74 and the Match Dictionary 70 as the Reference Domain Dictionary 52. The FTI 84 is computed using the plurality of final translation segments 74 and the original Reference Domain Dictionary 52.
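Each of these indexes pairs one set of segments with one dictionary acting as the reference. The sketch below makes the pairing explicit, adding the PTI and MTI pairings given in the Summary. It assumes that an index over a plurality of segments is the plain mean of the per-segment ratios with NA segments skipped; the aggregation rule is not stated in this text, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[List[str], List[str]]  # (source words, target words)


def mean_term_index(pairs: List[Pair],
                    dictionary: Dict[str, str]) -> Optional[float]:
    """Mean per-segment ratio of prescribed hits to dictionary source terms."""
    scores = []
    for source_words, target_words in pairs:
        terms = [w for w in source_words if w in dictionary]
        if not terms:
            continue  # per-segment index is NA; leave it out of the mean
        target_set = set(target_words)
        hits = sum(1 for t in terms if dictionary[t] in target_set)
        scores.append(hits / len(terms))
    return sum(scores) / len(scores) if scores else None


def shipment_indexes(best_matches: List[Pair], finals: List[Pair],
                     ref_dict: Dict[str, str], match_dict: Dict[str, str],
                     final_dict: Dict[str, str]) -> Dict[str, Optional[float]]:
    """Which segments are scored against which dictionary for each index."""
    return {
        "BTI": mean_term_index(best_matches, final_dict),
        "PTI": mean_term_index(finals, final_dict),
        "FTI": mean_term_index(finals, ref_dict),
        "MTI": mean_term_index(best_matches, match_dict),
        "FMTI": mean_term_index(finals, match_dict),
    }
```

Holding the scoring function fixed and varying only the inputs keeps the indexes directly comparable, which is what allows them to be correlated when building the terminology predictive models.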
As step 106 in
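$${}^{T}\mathrm{Vector}_{SM} = \left(1 - {}^{T}\mathrm{TermIndex}_{SM}\right) \times {}^{T}\mathrm{LinguisticVector}_{SM}$$
or
$${}^{T}\mathrm{Vector}_{SM} = \sqrt{\left(1 - {}^{T}\mathrm{TermIndex}_{SM}\right)^{2} + \left({}^{T}\mathrm{LinguisticVector}_{SM}\right)^{2}}$$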
Such that:
where, in one embodiment, the child classset would be defined by ‘T’ equal to the Match Type [EM, FM, MT], ‘S’ equal to the Segment scope [Small, Medium, Complex], and ‘M’ equal to the Majorkey.
In the first embodiment of a Vector, the Term Index is a multiplier of the noise represented by a Linguistic Vector 90 (
In the second embodiment, a Vector is a composite of a Term Index 91 and a Linguistic Vector 90 which is useful for visualizing how Term Index 91 works with other metrics across the TSC 22. If the Term Index 91 is 1.0, the Vector still reflects some noise value but zero is attributed to any terminology misalignment. When aggregating statistical models, the second embodiment helps to bring in a multi-dimensional perspective. Both Vector embodiments are valid as each defines a different space for visualizing linguistic noise attributed to terminology misalignment.
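The two embodiments can be compared numerically. In the sketch below the product form follows the first formula as given. The composite form assumes that the second formula, which is cut off in this text, closes with a squared Linguistic Vector term, so that it is a Euclidean norm. The function names are hypothetical.

```python
import math


def vector_product(term_index: float, linguistic_vector: float) -> float:
    """First embodiment: terminology misalignment scales the measured noise."""
    return (1.0 - term_index) * linguistic_vector


def vector_composite(term_index: float, linguistic_vector: float) -> float:
    """Second embodiment (assumed Euclidean form): misalignment and noise
    are treated as orthogonal components of one vector."""
    return math.hypot(1.0 - term_index, linguistic_vector)


# With a perfect Term Index of 1.0 the product form reports no terminology
# noise, while the composite form still reports the underlying noise.
perfect_product = vector_product(1.0, 0.4)
perfect_composite = vector_composite(1.0, 0.4)
```

This mirrors the distinction drawn above: the product form isolates the noise attributable to terminology, whereas the composite form keeps the Linguistic Vector visible alongside the Term Index.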
As step 108 in
Referring to
Referring to
Referring to
Referring to
In task 124, the asset optimization component 26 sends a Customize Domain request to the STA-SS 34 to customize the learning assets for a given reference domain dictionary. Task 126 is an import of the Reference Domain Dictionary and the Smart Term terminology predictive model. The STA-SS 34 imports the Reference Domain Dictionary containing a plurality of bilingual terms using the linguistic asset component 38. The STA-SS 34 uses the linguistic analytics component 40 to import the Smart Term terminology predictive model. Task 128 computes the Smart Term Index 91 for each segment and uses the Smart Term terminology predictive model to insert a Smart Term marker(s) containing a Smart Term Index and other terminology metadata within each target translation within each segment such that downstream components could use the embedded Smart Term marker to evaluate a plurality of learning segments.
In one embodiment, the STA-SS 34 may create a Reference Domain Monolingual Dictionary for the source language (using the plurality of the source terms within the learning assets) and for a target language (using the plurality of the target terms within the learning assets). More specifically, a task 130 may create a Term Learning Policy for consumers of learning assets. The STA-SS 34 uses the Smart Term terminology predictive model to define the Term Learning Policy that identifies the best segments based on Term Index per segment.
In one embodiment, the Smart Term terminology predictive model may establish a threshold for Term Index per segment for a given Reference Domain Dictionary such that a Term Index which is greater than the RefDict_Threshold would be selected. In a specific embodiment, the STA-SS 34 would utilize the RefDict_Threshold to remove a plurality of segments that fall below the threshold.
In a second embodiment, the STA-SS 34 may use the Smart Term terminology predictive model to establish multi-tier ranges that would divide the plurality of learning segments into Low, Medium and High learning predictive ranges such that MT customization would do a three-tier learning operation. The STA-SS 34 may store the multi-tier ranges as a Term Learning Policy reference for downstream components.
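Both policies reduce to simple comparisons on the per-segment Term Index. The following sketch illustrates them with hypothetical function names and caller-supplied cut points; in the disclosure the threshold and tier boundaries come from the Smart Term terminology predictive model, which is not modeled here. Treating a segment with an NA index as not selected is also an assumption.

```python
from typing import Iterable, List, Optional, Tuple

# (segment identifier, Term Index or None when the index is NA)
IndexedSegment = Tuple[str, Optional[float]]


def select_above_threshold(segments: Iterable[IndexedSegment],
                           refdict_threshold: float) -> List[str]:
    """First embodiment: keep segments whose Term Index exceeds the threshold."""
    return [seg_id for seg_id, index in segments
            if index is not None and index > refdict_threshold]


def learning_tier(index: float, low_cut: float, high_cut: float) -> str:
    """Second embodiment: place a Term Index in a Low, Medium or High range."""
    if low_cut > high_cut:
        raise ValueError("low_cut must not exceed high_cut")
    if index < low_cut:
        return "Low"
    if index < high_cut:
        return "Medium"
    return "High"


segments = [("s1", 0.9), ("s2", 0.4), ("s3", None), ("s4", 0.6)]
selected = select_above_threshold(segments, 0.6)
```

The tiering variant keeps every segment and only labels it, which fits a three-tier learning operation better than discarding low-index segments outright.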
A task 132 may include the export of segments having Term Index metadata. The STA-SS 34 may store the optimized learning assets into the linguistic analytical data store 44 via the Linguistic Analytical data services 39 for downstream consumption using a unique identifier, and return the unique identifier to the asset optimization component 26.
As task 134, the MT component 28 optimizes the learning assets using the Term Index metadata. The MT component 28 imports the learning assets along with the Term Index per segment and any Smart Term metadata to optimize the MT domain model and store.
Referring to
In one embodiment, the STA-SS 34 may create a Reference Domain Monolingual Dictionary for the source language (using the plurality of the source terms within the learning assets) and for a target language (using the plurality of the target terms within the learning assets). As task 150, matches may be filtered based on the Term Index. The STA-SS 34 uses the Smart Term terminology predictive model to filter out matches predicted to be inefficient during downstream translation.
In one embodiment, the Smart Term terminology predictive model could establish a threshold for the Term Index per match for a given reference domain dictionary such that if the match TermIndex is less than the RefDict_threshold, it would be removed from the list of matches.
In a second embodiment, the STA-SS 34 may use the Smart Term terminology predictive model to convert the Term Index per match into a Term Confidence Score that may be embedded within the match. A downstream computer aided translation (CAT) editor may use the Term Confidence Score to assist a human professional linguist in the evaluation of the match. A task 152 may include the export of segments with Term Index metadata. The STA-SS 34 stores the matches into the linguistic analytic component 40 for downstream consumption using a unique domain reference identifier, and returns the unique identifier to the asset optimization component 26. A final task 154 is optimization of the translation by the MT and/or PE components 28, 30 using the Term Index metadata. The MT component 28 imports the learning assets along with the Term Index per segment and any Smart Term metadata to optimize the MT domain model and store.
Smart Term Evaluation Subsystem:
Referring to
Referring to
For each segment to be translated there may be one (1) to ‘N’ matches (e.g. Exact, Fuzzy, Machine or others). Moreover, other linguistic markers may be embedded within the source package that may be recognized by the STLA-SS 32. Examples of other linguistic markers may include any combination of the following:
- a. In one embodiment, the MT matches may contain an MT:Metric score 220 linguistic marker providing a confidence score of the MT match as defined by the MT component 28.
- b. In one embodiment, each MT match may include a Smart Term Index based on the terminology Term Index from the STA-SS 152.
A following task 164 may include the import of the Reference Domain Dictionary and Smart Term terminology predictive model(s). The STLA-SS 32 imports the Reference Domain Dictionary containing a plurality of bilingual terms using the linguistic asset component 38. The STA-SS 34 uses the linguistic (business) analytic component 40 to import the Smart Term terminology predictive model.
The next task 166 is a computation of the Term Index of the match dictionaries. In one embodiment the STLA-SS 32 first creates an MT Match Dictionary 701 using the plurality of MT matches obtained from the source package and creates an EM Match Dictionary 703 using the plurality of exact matches.
Referring to
- The OTI-MT 722 (OTI 76) for all MT matches 710 against the Reference Domain Dictionary 52.
- The OTI-EM 724 for all EM matches 712 against the Reference Domain Dictionary 52.
- The ETI-MT 728 for all MT matches 710 against the Exact Match Dictionary 704.
- The MTI-EM 726 for EM matches 712 using the exact matches against the MT Match Dictionary.
- A Smart Term Index 78 per MT match 710 using the MT matches 710 against the Reference Domain Dictionary 52.
Referring to
In one embodiment, the STLA-SS 32 analyzes the Term Index of each match in relationship to MT:Metric score (see element 220 in
In another embodiment, the STLA-SS 32 computes a Smart Term Area (see
- a. x=OTI-EM, y=ETI-EM, z=MTI-EM, fixed point based on EM match
- b. x=OTI-base, y=ETI-base, z=MTI-base, fixed point on the STLA-SS's baseline (average of all sampled historical matches).
- c. x=OTI-MT, y=ETI-MT, z=1.0, fixed point based on MT match
where OTI-EM 724 and OTI-MT 722 may reflect the TermIndex of the EM match and MT match, respectively, using the Reference Domain Dictionary 52, where ETI-MT 728 is the Smart Term Index of the MT matches against the Exact Match Dictionary 704, and where MTI-EM 726 is the Smart Term Index of the EM matches against the MT Match Dictionary 702.
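The three fixed points (a) to (c) span a triangle in a space whose axes are the OTI, ETI and MTI values. One way to obtain its area, offered here as a minimal sketch, is half the magnitude of the cross product of two edge vectors. The function name and the tuple layout are assumptions, and the sample coordinates are made-up values for illustration only.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # (OTI, ETI, MTI)


def smart_term_area(a: Point, b: Point, c: Point) -> float:
    """Area of the triangle a-b-c: half the norm of (b - a) x (c - a)."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cross = (uy * vz - uz * vy,
             uz * vx - ux * vz,
             ux * vy - uy * vx)
    return 0.5 * math.sqrt(sum(component * component for component in cross))


# Illustrative coordinates only: EM point, baseline point, MT point (z = 1.0).
area = smart_term_area((0.9, 0.8, 0.7), (0.6, 0.6, 0.6), (0.5, 0.4, 1.0))
```

When the three points coincide or lie on one line the area collapses to zero, which under this reading would indicate that the MT match, the exact match and the historical baseline agree on terminology.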
The STLA-SS 32 then invokes analytical streams to build one or more Smart Term MT predictive models by analyzing the plurality of MT matches and assessing which MT matches and respective Term Index will need terminology correction in downstream post editing component. The area of the Smart Term Area triangle (see
Task 170 (see
The STA-SS 34 uses the Smart Term MT predictive models to filter out matches predicted to be inefficient during downstream translation. In one embodiment, the Smart Term MT predictive model establishes a threshold for the Term Index per match for a given reference domain dictionary such that, if the Term Index of the MT match is less than the RefDict_Threshold, the match is removed from the list of matches. In a second embodiment, the STA-SS 34 uses the Smart Term Area value as a linguistic marker to be embedded within the MT match. A downstream CAT editor (i.e., PE component 30) may use the Smart Term Area linguistic marker to assist a human professional linguist in evaluating the match.
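The two embodiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `MTMatch` record, the `markers` field, the marker key `smart_term_area`, and both function names are hypothetical, and the threshold value is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MTMatch:
    """Hypothetical record for one MT match and its terminology metadata."""
    source: str
    target: str
    term_index: float
    markers: Dict[str, float] = field(default_factory=dict)


def filter_matches(matches: List[MTMatch], ref_dict_threshold: float) -> List[MTMatch]:
    """First embodiment: drop matches whose Term Index is below the threshold."""
    return [m for m in matches if m.term_index >= ref_dict_threshold]


def embed_area_marker(match: MTMatch, smart_term_area: float) -> MTMatch:
    """Second embodiment: attach the Smart Term Area as a linguistic marker."""
    match.markers["smart_term_area"] = smart_term_area
    return match


candidates = [
    MTMatch("Check the network", "Vérifiez le filet", term_index=0.4),
    MTMatch("Format the hard disk", "Formatez le disque dur", term_index=0.9),
]
kept = filter_matches(candidates, ref_dict_threshold=0.5)
marked = [embed_area_marker(m, smart_term_area=0.02) for m in kept]
```

Keeping the marker in a per-match dictionary is a design choice made for the sketch: it lets a downstream editor read the value without changing the match text itself.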
A following task 172 entails the export of segments having Term Index metadata. The STA-SS 34 returns the updated MT matches to the STE-SS 36. The STE-SS 36 stores the new MT matches and linguistic markers into the source package for use and consumption by a downstream component, and returns the unique identifier to the asset optimization component 26.
A task 174 performs post editing (PE) on each new content segment assisted by the embedded Smart Term Indexes.
Features and benefits of the present disclosure include the STMS 24 and related methods that provide a Smart Term Index as a foundation for measuring noise from terminology misalignment, including when linguistic assets are not aligned with a Reference Domain Dictionary. As demonstrated, the higher the Smart Term Index, the more aligned a linguistic asset is with a Reference Domain Dictionary, driving higher quality and consistency within a Translation Supply Chain 22. Other features include measuring the Smart Term Index 78 associated with the MT matches and/or EM matches relative to a Reference Domain Dictionary 52, an MT Match Dictionary, and/or an EM Match.
Further features and benefits include an STLA-SS 32 that provides systems and methods for measuring Smart Term Linguistic Vectors to reflect the terminology noise within the multi-dimensional measurement system of Linguistic Noise within a Translation Supply Chain 22; building statistical models that enable evaluation of MT matches containing Term Indexes; providing systems and methods to predict the smallest Final Term Index for a given final translation given a plurality of OTI 76 for a given classset; measuring the amount of Linguistic Noise attributes from a plurality of Smart Term Index markers passed from downstream linguistic components; assessing and weighting the importance of terms for a given Reference Domain Dictionary across a plurality of shipments within a Translation Supply Chain 22; creating Smart Term models and identifying patterns for a Smart Term Index for a plurality of Reference Domain Dictionaries; and enabling predictive analytics to alert when linguistic assets (memories/TermDB) are no longer aligned with a Final Dictionary (terminology changes) relative to a Reference Domain Dictionary, indicating when action is needed to harmonize the two.
Other benefits include: efficiency improvements for human professional linguists through a stable and reliable terminology measurement and evaluation system that is correlated to the labor spent correcting linguistic assets per domain; an STA-SS 34 that produces a plurality of Smart Term Linguistic Markers that enable MT services to maximize the quality of MT output using downstream terminology analytics; and an STE-SS 36 that evaluates and analyzes matches from downstream components (e.g., MT) to predict which matches should be filtered and to assist human professional linguists with managing terms during the post editing session.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Claims
1. A computer program product for language translation applications comprising a non-transitory computer-readable medium storing computer-executable instructions comprising a translation supply chain and a smart terminology marker system, wherein the computer-executable instructions are executable by a processing circuit to cause the processing circuit to perform a method comprising:
- parsing, by a translation memory component of the translation supply chain, source language content into a plurality of source segments;
- searching a repository of historical linguistic assets to identify one or more domain-specific assets;
- generating, by a machine translation component of the translation supply chain, a plurality of machine translation matches corresponding to the plurality of source segments using a custom domain machine translation model optimized with respect to the one or more domain-specific assets;
- correcting and performing quality control, by a post editing component of the translation supply chain, on at least one of the one or more domain-specific assets and the translation model for optimizing translation capability; and
- reducing, by the smart terminology marker system, linguistic noise across the translation supply chain using at least one of business analytics and terminology memory mining, wherein the smart terminology marker system includes a smart term linguistic analytical subsystem configured to generate a plurality of term indexes, a smart term assessment subsystem for generating at least one term index, and a smart term evaluation subsystem configured to predict a minimal final term index given a plurality of original term indexes for a given language domain.
2. (canceled)
3. The computer program product set forth in claim 1, wherein the smart terminology marker system includes a linguistic asset store component for storing a plurality of Dictionaries.
4. The computer program product set forth in claim 3, wherein the plurality of Dictionaries include a Language Dictionary, a Domain Dictionary and a Reference Dictionary.
5. The computer program product set forth in claim 4, wherein the smart terminology marker system is configured to calculate a multi-dimensional linguistic vector associated with an amount of linguistic noise.
Type: Application
Filed: Oct 13, 2016
Publication Date: Jul 13, 2017
Inventors: Christophe D. Chenon (Paris), Marc P. Drapeau (Deux-Montagnes), Francis X. Rojas (Austin, TX)
Application Number: 15/292,734