COMBINED STATISTICAL AND RULE-BASED PART-OF-SPEECH TAGGING FOR TEXT-TO-SPEECH SYNTHESIS

- Apple

In response to a word of a text sequence, a first part-of-speech (POS) tag is generated using a statistical part-of-speech (POS) tagger based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence. A second POS tag is generated using a rule-based POS tagger based on a set of one or more rules associated with a type of an application associated with the text sequence. A final POS tag is assigned to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

Description
FIELD OF THE INVENTION

Embodiments of the invention relate generally to the field of text-to-speech (TTS) synthesis; and more particularly, to part-of-speech (POS) tagging for TTS.

BACKGROUND

In corpus linguistics, part-of-speech (POS) tagging is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e., relationship with adjacent and related words in a phrase, sentence, or paragraph. It is a necessary pre-processing step for many natural language processing (NLP) tasks. As POS tags augment the information contained within words by explicitly indicating some of the structures inherent in language, their accuracy is often critical to down-stream NLP applications. For example, in concatenative text-to-speech (TTS) synthesis, POS tags are heavily relied upon in the context of prosody modeling; they greatly influence how natural synthetic speech sounds. It is therefore crucial that they be correct.

With the growing availability of NLP training resources in recent years, POS tagging has increasingly involved some form of data-driven processing. State-of-the-art models based on conditional random fields (CRFs), for instance, are trained to identify the most likely sequence of tags for the observed set of words in a given sentence. These models rely on feature functions acting as marginal constraints to ensure that important characteristics of the empirical training distribution are reflected in the trained model. With well-chosen functions covering sufficiently rich features of the training data, and given adequate initial conditions, CRF taggers can achieve a very high level of tag accuracy on general NLP corpora.

In some specific applications, however, such taggers may be too generic to fit the problem requirements. Most tasks involve slightly different sets of feature functions, whose extraction may be impossible to perform on standard NLP collections if they have not been annotated to support it. This is the case for TTS synthesis, for which features typically considered in mainstream NLP are not sufficient. Conventional POS tagging for TTS therefore tends to rely on rule-based systems, which can easily be developed from smaller, special-purpose databases. Such rule-based taggers tend to be more brittle than statistical models trained on large collections.

Given a natural language sentence including L words, POS tagging aims at assigning to each observed word w_i some suitable POS p_i, 1 ≤ i ≤ L. Representing the overall sequence of words by W and the corresponding sequence of POS by P, CRF taggers directly maximize the conditional probability Pr(P|W) over all possible POS sequences P. This is done via log-linear modeling of feature functions expressing important aspects of the empirical training distribution, as observed on a large annotated corpus. The size and pertinence of the training corpus are thus critical to the quality of the resulting models.
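For reference, the log-linear form mentioned above is commonly written as follows for a linear-chain CRF. The feature functions f_k, their weights λ_k, and the normalizer Z(W) are standard notation assumed here for illustration; they are not symbols defined in this specification.

```latex
\Pr(P \mid W) = \frac{1}{Z(W)}
  \exp\!\Big( \sum_{i=1}^{L} \sum_{k} \lambda_k \, f_k(p_{i-1}, p_i, W, i) \Big),
\qquad
Z(W) = \sum_{P'} \exp\!\Big( \sum_{i=1}^{L} \sum_{k} \lambda_k \, f_k(p'_{i-1}, p'_i, W, i) \Big)
```

Training estimates the weights λ_k from the annotated corpus, and tagging selects the sequence P that maximizes Pr(P|W).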

There is, however, an inherent trade-off between size and pertinence. Standard NLP corpora tend to be suitably extensive, but fairly generic in terms of supported tag set and associated annotation. Most of them use the default Penn Treebank POS tag set, which is not optimal for a TTS synthesis application. For example, in the sentence:

    • She is coming tomorrow, she is, she really is!

The three instances of the word “is” would normally be assigned the same tag (e.g., VBZ). Yet, they are realized in three different ways. The first instance is unaccented and reduced; the second one is accented; and the third one is unaccented but with full vowel quality. Any synthetic version not respecting these rendition patterns would not sound natural. It thus stands to reason that a TTS system would benefit from a POS assignment system which reflects such distinctions. At the very least, the first instance of “is” should be assigned a POS that typically carries no accent, such as auxiliary, and the second a POS that typically carries an accent, such as (non-modal) verb.

The problem is that special-purpose corpora created with such specific application in mind tend to be too small for the reliable estimation of CRF parameters. This is why POS tagging for speech synthesis typically relies on rule-based taggers. They can easily take into account the kind of distinctions exemplified in a typical statistical model POS tagger, including the case of the third instance of “is”, which is clearly very specific to the application at hand. On the other hand, they suffer from several potential drawbacks, including lack of portability, maintenance difficulties, and the risk of over-generalization from a small number of exemplars.

SUMMARY OF THE DESCRIPTION

According to one aspect, in response to a word of a text sequence, a first part-of-speech (POS) tag is generated using a statistical part-of-speech (POS) tagger based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence. A second POS tag is generated using a rule-based POS tagger based on a set of one or more rules associated with a type of an application associated with the text sequence. A final POS tag is assigned to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

According to another aspect, an apparatus for text-to-speech (TTS) synthesis includes a statistical POS tagger, in response to a word of a text sequence, to generate a first part-of-speech (POS) tag based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence, a rule-based POS tagger to generate a second POS tag based on a set of one or more rules associated with a type of an application associated with the text sequence, and a text analyzer coupled to the statistical POS tagger and the rule-based POS tagger to assign a final POS tag to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

Other features of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.

FIG. 1 is a block diagram illustrating a TTS system according to one embodiment of the invention.

FIG. 2 is a flow diagram illustrating a method for POS tagging in TTS synthesis according to one embodiment of the invention.

FIG. 3 is a flow diagram illustrating a method for POS tagging in TTS synthesis according to another embodiment of the invention.

FIG. 4 is a block diagram of a data processing system, which may be used with one embodiment of the invention.

DETAILED DESCRIPTION

Various embodiments and aspects of the inventions will be described with reference to details discussed below, and the accompanying drawings will illustrate the various embodiments. The following description and drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of various embodiments of the present invention. However, in certain instances, well-known or conventional details are not described in order to provide a concise discussion of embodiments of the present inventions.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification do not necessarily all refer to the same embodiment.

According to some embodiments, a TTS synthesis system combines rule-based POS tagging and statistical POS tagging techniques. Complementing a rule-based system with a statistical tagger solves many of the problems described above. The rules can now be focused on situations that are high-value for the application considered; in principle they can be fewer, simpler, and therefore more manageable. At the same time, generic NLP training data can be leveraged to increase tagging robustness, without sacrificing specific requirements for the task at hand. An embodiment of the TTS system adopts a hybrid system where the two tagging approaches render independent assessments of each input word, one of which is then selected based on the underlying conditions in order to produce the final POS tag for the word.

FIG. 1 is a block diagram illustrating a TTS system according to one embodiment of the invention. Referring to FIG. 1, system 100 is configured to assign POS tags to words to perform natural language processing. For example, the POS tags are assigned to words to perform a concatenative TTS synthesis. System 100 includes, but is not limited to, text analysis unit 102, processing unit 103, speech generation unit 104, statistical POS tagger 106 and rule-based POS tagger 107. Text analysis unit 102 is configured to receive text input 101, for example, one or more sentences, paragraphs, and the like, and to analyze the text to extract words. Text analysis unit 102 is configured to determine characteristics of a word, for example a pitch, duration, accent, and POS characteristic. The POS characteristic typically defines whether a word in a sentence is, for example, a noun, verb, adjective, preposition, and/or the like. The POS characteristics may be very informative, and sometimes are the only way to distinguish a word from the word candidates for speech synthesis. In one embodiment, text analysis unit 102 determines an input word's characteristics, such as a pitch, duration, and/or accent, based on the POS characteristic of the input word. In one embodiment, text analysis unit 102 analyzes text input 101 to determine a POS characteristic of a word of text input 101 using combined statistical and rule-based POS tagging techniques.

In one embodiment, in response to a word of a text sequence such as text input 101, text analysis unit 102 is configured to invoke statistical POS tagger 106 and rule-based POS tagger 107 to generate a first POS tag and a second POS tag, respectively. A final POS tag is then selected from the first and second POS tags based on certain underlying conditions and assigned to the word for the TTS synthesis process.

The statistical POS tagging is implemented using a statistical tagger, which determines parameters by computing statistics on words used in a sample portion of a corpus. Once the statistics are computed, the statistical tagger relies on them when analyzing the large corpus. With the statistical approach, a statistical tagger is initially operated in a training mode in which it receives input strings that have been annotated by a linguist with tags that specify parts of speech, and other characteristics. The statistical tagger records statistics reflecting the application of the tags to portions of the input string. After a significant amount of training using tagged input strings, the statistical tagger enters a tagging mode in which it receives raw untagged input strings. In the tagging mode, the statistical tagger applies the learned statistics assembled during the training mode to build trees for the untagged input string. Statistical approaches usually require a training corpus that has been tagged with part-of-speech information, manually and/or automatically through feedback.
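The two modes described above can be illustrated with a deliberately simple stand-in. The sketch below is a most-frequent-tag baseline, not a CRF and not the tagger of any embodiment; the function names, the toy training data, and the `fallback` tag are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Training mode: count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag(counts, words, fallback="NN"):
    """Tagging mode: give each word its most frequent training tag."""
    result = []
    for word in words:
        seen = counts.get(word.lower())  # None for words never seen in training
        result.append(seen.most_common(1)[0][0] if seen else fallback)
    return result


# Tiny hand-annotated sample (Penn Treebank style tags), for illustration only.
training = [
    [("she", "PRP"), ("is", "VBZ"), ("coming", "VBG")],
    [("the", "DT"), ("run", "NN"), ("is", "VBZ"), ("long", "JJ")],
    [("they", "PRP"), ("run", "VBP")],
    [("a", "DT"), ("run", "NN")],
]
model = train(training)
print(tag(model, ["She", "is", "coming", "tomorrow"]))
```

A real statistical tagger also conditions on neighboring words and tags, which is what lets it resolve an ambiguous word such as "run" differently in different contexts.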

A rule-based tagger stores knowledge about the structure of language in the form of linguistic rules. The rule-based tagger makes use of syntactic and morphological information about individual words found in the dictionary or “lexicon” or derived through morphological processing. Successful tagging requires that the tagger has the necessary rules and that a lexical analyzer provides all the details needed by the tagger to resolve as many ambiguities as it can at that level.

Referring to FIG. 1, statistical POS tagger 106 can be any probabilistic model-based POS tagger, such as, for example, a memory-based tagger, a hidden Markov model (HMM) based tagger, or a maximum entropy Markov model (MEMM) based tagger. In one embodiment, statistical POS tagger 106 is a CRF-based tagger. A CRF is a type of discriminative probabilistic model most often used for labeling or parsing of sequential data, such as natural language text or biological sequences. Specifically, CRFs find applications in shallow parsing, named entity recognition and gene finding, among other tasks, being an alternative to the HMM model. Further detailed information concerning the CRF model can be found in the article entitled “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, which is incorporated by reference herein in its entirety.

In one embodiment, statistical POS tagger 106 includes POS tag generator 108, training corpus 109, confidence score calculator 110, and histogram data 111. Given a word of a text sequence, POS tag generator 108 is configured to generate a POS tag based on the relationships between that word and other words in the text sequence in view of training corpus 109. Training corpus 109 includes a pool of training words and training word sequences. The POS tag represents the part of speech that the word most likely represents in view of the training corpus 109, which can be implemented based on the Penn Treebank corpus or the like. Histogram data 111 is configured to store statistics of application of each training word and/or word sequence in corpus 109 concerning whether that particular word or word sequence has been applied successfully. Success/failure is typically determined based on some held-out data (e.g., a fairly small annotated corpus that would not be sufficient to train a statistical model, but is adequate for this purpose). Confidence score calculator 110 is configured to calculate a confidence score for each of the words and word sequences, where the confidence score represents the success rate of the application in the past. The confidence scores may be statically calculated and stored in a machine-readable storage medium such as a memory or, alternatively, the confidence score may be calculated dynamically (e.g., on the fly) during the tagging mode.

Similarly, according to one embodiment, rule-based POS tagger 107 includes POS tag generator 112, a set of rules 113, confidence score calculator 114, and histogram data 115. Given a word of a text sequence, POS tag generator 112 is configured to generate a POS tag based on the relationships between that word and other words in the text sequence in view of rules 113, which have been constructed previously. Histogram data 115 is configured to store statistics of application of each of the rules 113 concerning whether that particular rule has been applied successfully. Confidence score calculator 114 is configured to calculate a confidence score for each of the rules, where the confidence score represents the success rate of the application of a particular rule in the past. The confidence scores may be statically calculated and stored in a machine-readable storage medium such as a memory or, alternatively, the confidence score may be calculated dynamically.

Once the words have been tagged with one of the tags generated by statistical tagger 106 and rule-based tagger 107, text analysis unit 102 passes the extracted words having assigned POS tags to processing unit 103. Processing unit 103 may concatenate the extracted words together, smooth the transitions between the concatenated words, and pass the concatenated words to speech generation unit 104 to enable the generation of a naturalized audio output 105, for example, an utterance, spoken paragraph, and the like.

According to some embodiments, the hybrid system has the statistical and rule-based tagging approaches render independent assessments of each input word, one of which is then selected based on the underlying conditions in order to produce a final POS tag for the word. At least three situations can arise, depending on the level of consistency between the two models.

The first situation is referred to as a consistent POS situation in which both statistical and rule-based approaches render the same assessment in terms of POS tag (e.g., same tag), possibly after tag conversion if the two underlying tag sets are different. Tag conversion involves a table that translates symbols from a particular tag set (e.g., “NN” in the Penn Treebank tag set) into symbols from another tag set (e.g., “Noun” in another tag set such as one from Apple Inc.). Most cases are fairly straightforward, though some may be more complex (e.g., “IN” in Penn Treebank maps to either “Prep” or “Conj” in another). Since the two tagging techniques agree on a common tag, according to one embodiment, the final POS tag is selected to be that common tag.
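A minimal sketch of such a conversion table follows. Only the "NN" to "Noun" and "IN" to "Prep"/"Conj" entries come from the text; the other entries, the names `PENN_TO_APP` and `tags_agree`, and the choice to treat a one-to-many mapping as agreement when the rule-based tag is any of the candidates are assumptions for illustration.

```python
# Hypothetical conversion table from Penn Treebank tags to an
# application-specific tag set. Each entry lists every application tag
# the Penn tag may correspond to.
PENN_TO_APP = {
    "NN": {"Noun"},
    "NNS": {"Noun"},            # assumed entry
    "VBZ": {"Verb", "Aux"},     # assumed entry
    "IN": {"Prep", "Conj"},     # one-to-many case mentioned in the text
}


def tags_agree(statistical_tag, rule_tag):
    """True if the rule-based tag is a valid conversion of the statistical tag."""
    return rule_tag in PENN_TO_APP.get(statistical_tag, set())


print(tags_agree("NN", "Noun"))   # straightforward case
print(tags_agree("IN", "Conj"))   # one-to-many case
print(tags_agree("NN", "Verb"))   # disagreement
```

Under this reading, the common tag assigned in the consistent situation would be the application-specific one, since that is the tag set the downstream prosody model is assumed to consume.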

The second situation is referred to as a rule default situation in which the rule-based system did not find a suitable rule to apply to the input context. As a result, a default tag is generated by the rule-based system. This typically forces an over-generalization, which is the source of most errors in rule-based methods. In this situation, the default tag generated from the rule-based system should not be relied upon. Rather, according to one embodiment, the tag generated from the statistical system is utilized as the final POS tag.
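One way a rule-based tagger could signal the default case is to return the identifier of the rule that fired alongside the tag, with no identifier when only the default applied. The two rules below, their names, the pronoun list, and the default tag are hypothetical and written only to exercise the example sentence from the background section; they are not rules from any actual system.

```python
PRONOUNS = {"i", "he", "she", "it", "we", "you", "they"}
CLAUSE_END = {",", ".", "!", "?"}


def rule_aux_before_ing(words, i):
    """'is' directly followed by an -ing form is treated as an auxiliary."""
    if (words[i].lower() == "is" and i + 1 < len(words)
            and words[i + 1].lower().endswith("ing")):
        return "Aux"
    return None


def rule_pronoun_is_clause_final(words, i):
    """Pronoun + 'is' at the end of a clause is treated as a full verb."""
    at_end = i + 1 == len(words) or words[i + 1] in CLAUSE_END
    after_pronoun = i > 0 and words[i - 1].lower() in PRONOUNS
    if words[i].lower() == "is" and at_end and after_pronoun:
        return "Verb"
    return None


RULES = [
    ("R_aux_ing", rule_aux_before_ing),
    ("R_final_is", rule_pronoun_is_clause_final),
]
DEFAULT_TAG = "Noun"  # assumed default; an over-generalization by design


def rule_based_tag(words, i):
    """Return (tag, rule_id); rule_id is None when only the default applied."""
    for rule_id, rule in RULES:
        tag = rule(words, i)
        if tag is not None:
            return tag, rule_id
    return DEFAULT_TAG, None


words = ["She", "is", "coming", "tomorrow", ",", "she", "is", ",",
         "she", "really", "is", "!"]
for i in (1, 6, 10):
    print(i, rule_based_tag(words, i))
```

With these two rules, no rule covers the third "is", so the tagger returns its default with no rule identifier, which is exactly the case where the statistical tag would be used instead.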

Another situation is referred to as a tag disagreement situation in which the rule-based system found a suitable rule to apply to the input context and returned a valid assessment, but the statistical system returned a different tag (even after a tag conversion). In this situation, according to one embodiment, a confidence score of the rule associated with the tag generated by the rule-based system is utilized to evaluate whether the rule-based tag can be selected as the final tag applied to the input context.

According to one embodiment, during development, a confidence score is calculated by confidence score calculator 114 for each rule in the rule-based system based on the histogram data 115 collected over time. Specifically, all such disagreements observed are collected on some suitable development data (typically a relatively small application-specific training collection comparable to, but distinct from, the one used to establish the rules). For each rule r, the instances are tabulated where it was right and wrong, and the confidence score may be calculated as follows according to one embodiment:

$$c_r = \frac{n_{r,i}}{n_{r,i} + n_{r,j}},$$

where $n_{r,i}$ and $n_{r,j}$ denote the number of times the rule r was observed to be right and wrong, respectively. Thus, confidence score $c_r$ represents the success rate of applying a particular rule in a particular application. According to one embodiment, the rules may be ranked or sorted based on their respective confidence scores.
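The tabulation and ranking can be sketched in a few lines. The function name `rule_confidence`, the rule identifiers, and the observed outcomes are made up for illustration; only the ratio of right counts to total counts comes from the text.

```python
from collections import Counter


def rule_confidence(outcomes):
    """Return {rule_id: c_r}, where c_r = right / (right + wrong).

    `outcomes` is an iterable of (rule_id, was_right) pairs gathered from
    disagreements observed on development data.
    """
    right, wrong = Counter(), Counter()
    for rule_id, was_right in outcomes:
        (right if was_right else wrong)[rule_id] += 1
    return {r: right[r] / (right[r] + wrong[r])
            for r in right.keys() | wrong.keys()}


# Hypothetical development-set observations.
observed = [
    ("R1", True), ("R1", True), ("R1", False),
    ("R2", False), ("R2", True), ("R2", False), ("R2", False),
]
scores = rule_confidence(observed)
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Here R1 was right in 2 of 3 disagreements and R2 in 1 of 4, so R1 ranks first.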

According to one embodiment, compared with the statistical assessment, any rule with a confidence score below a predetermined threshold, such as, for example, 50%, may be considered unreliable; otherwise, the rule may be considered reliable. In one embodiment, a tag generated by rule-based tagger 107 may be selected as the final POS tag if its corresponding confidence score is greater than the predetermined threshold; otherwise, a tag generated by statistical tagger 106 may be selected as the final POS tag. In a particular embodiment, the predetermined threshold is 0.5.
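Taken together, the three situations and the threshold test amount to a short decision procedure. The sketch below assumes both tags are already expressed in the same tag set, that the rule-based tagger reports `rule_id=None` when it fell back to its default, and that a rule with no recorded score is treated as unreliable; the function and parameter names are illustrative.

```python
from typing import Dict, Optional

THRESHOLD = 0.5  # value given for one particular embodiment


def select_final_tag(stat_tag: str, rule_tag: str,
                     rule_id: Optional[str],
                     rule_conf: Dict[str, float]) -> str:
    """Pick the final POS tag from the two independent assessments."""
    if stat_tag == rule_tag:      # consistent POS situation
        return rule_tag
    if rule_id is None:           # rule default situation
        return stat_tag
    # Tag disagreement situation: trust the rule only if it is reliable.
    if rule_conf.get(rule_id, 0.0) > THRESHOLD:
        return rule_tag
    return stat_tag


conf = {"R1": 0.8, "R2": 0.25}
print(select_final_tag("Verb", "Verb", "R2", conf))  # agreement
print(select_final_tag("Verb", "Noun", None, conf))  # default rule
print(select_final_tag("Verb", "Aux", "R1", conf))   # reliable rule
print(select_final_tag("Verb", "Aux", "R2", conf))   # unreliable rule
```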

Optionally, according to another embodiment, information concerning the selection of final POS tag may be fed back to the scoring mechanism such as score calculator 114 and/or histogram data 115 of rule-based tagger 107 to adjust the corresponding rule confidence score for subsequent reference. The confidence scores for the rules may be adjusted over time and a rule having a low confidence score may be removed from rule database 113. As a result, rule database 113 can be maintained in a relatively small size. Similarly, such information may also be fed back to the statistical tagger 106 to adjust the related parameters (e.g., CRF parameters) for training purposes. Note that these operations may be performed either manually (e.g., via user inputs), automatically (e.g., data driven via machine learning), or a combination thereof.
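The feedback loop can be sketched as a running tally plus a pruning pass. How a selection is judged right or wrong at run time is not specified in the text, so `was_right` is simply taken as an input here; the class name, the pruning threshold `min_conf`, and the minimum observation count `min_obs` are assumptions.

```python
class RuleStats:
    """Per-rule right/wrong counts (a stand-in for the histogram data)."""

    def __init__(self):
        self.right = {}
        self.wrong = {}

    def record(self, rule_id, was_right):
        bucket = self.right if was_right else self.wrong
        bucket[rule_id] = bucket.get(rule_id, 0) + 1

    def observations(self, rule_id):
        return self.right.get(rule_id, 0) + self.wrong.get(rule_id, 0)

    def confidence(self, rule_id):
        total = self.observations(rule_id)
        # None means there is no evidence for this rule yet.
        return self.right.get(rule_id, 0) / total if total else None


def prune(rules, stats, min_conf=0.2, min_obs=5):
    """Drop rules whose confidence is low after enough observations."""
    kept = {}
    for rule_id, rule in rules.items():
        conf = stats.confidence(rule_id)
        if (stats.observations(rule_id) >= min_obs
                and conf is not None and conf < min_conf):
            continue  # rule is consistently wrong; remove it
        kept[rule_id] = rule
    return kept


stats = RuleStats()
feedback = [("R1", True)] * 4 + [("R1", False)] + [("R2", False)] * 5
for rule_id, ok in feedback:
    stats.record(rule_id, ok)

rules = {"R1": "rule one", "R2": "rule two", "R3": "rule three"}
kept = prune(rules, stats)
print(sorted(kept))
```

Requiring a minimum number of observations before pruning is a design choice made here so that a new rule is not discarded on the strength of one or two early errors.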

According to another embodiment, similar to rule-based tagger 107, confidence score calculator 110 of statistical tagger 106 is also configured to calculate a confidence score for each member of training corpus 109 based on histogram data 111. Similar to a rule-based confidence score, a confidence score for a member of training corpus 109 may be determined as follows:

$$c_s = \frac{n_{s,i}}{n_{s,i} + n_{s,j}},$$

where $n_{s,i}$ and $n_{s,j}$ denote the number of times a particular member of the corpus was observed to be right and wrong, respectively. Thus, confidence score $c_s$ also represents the success rate of applying a particular member in POS tagging.

According to one embodiment, confidence scores of tags generated by rule-based tagger 107 and statistical tagger 106 may be compared. Based on the comparison, the tag having the higher confidence score may be selected as the final POS tag. In one embodiment, the comparison may be performed only when the rule-based confidence score is less than a predetermined threshold. That is, when the rule-based confidence score is less than the predetermined threshold, the confidence score of the statistical tag is also evaluated, and whichever tag has the higher confidence score is selected. For example, when the rule-based confidence score is less than 0.5, there could be a situation in which the statistical confidence score is even worse (e.g., 0.3). In this situation, the rule-based tag may be a better candidate for the final POS tag, even though its confidence score is less than 0.5.
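This variant can be sketched as a small change to the threshold test. The function name and signature are illustrative, and two details are assumptions because the text leaves them open: a rule score exactly at the threshold is accepted without comparison, and a tie between the two scores goes to the statistical tag.

```python
def select_by_confidence(stat_tag, stat_conf, rule_tag, rule_conf,
                         threshold=0.5):
    """Compare the two scores only when the rule falls below the threshold."""
    if rule_conf >= threshold:
        return rule_tag  # rule is considered reliable; no comparison needed
    # Rule is below the threshold: keep it only if the statistical tag
    # is even less trustworthy.
    return rule_tag if rule_conf > stat_conf else stat_tag


print(select_by_confidence("Verb", 0.3, "Aux", 0.4))  # both weak, rule wins
print(select_by_confidence("Verb", 0.9, "Aux", 0.4))  # statistical tag wins
print(select_by_confidence("Verb", 0.9, "Aux", 0.7))  # rule above threshold
```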

Note that some or all of the components as shown in FIG. 1 may be implemented in software, hardware, or a combination of both. For example, system 100 may be implemented as part of an operating system stored and/or executed in a machine-readable storage medium (e.g., memory) by a processor of a data processing system. In addition, the confidence score calculator and/or histogram data of any one or both of the statistical tagger 106 and rule-based tagger 107 may be maintained by text analysis unit 102. Alternatively, statistical tagger 106 and/or rule-based tagger 107 may be integrated with text analysis unit 102. Statistical tagger 106 and/or rule-based tagger 107 may be provided by a third party and they may be invoked by text analysis unit 102 via an application programming interface (API) or over a network. Other configurations may exist.

FIG. 2 is a flow diagram illustrating a method for POS tagging in TTS synthesis according to one embodiment of the invention. For example, method 200 may be performed by system 100 of FIG. 1. Referring to FIG. 2, at block 201, an input having a word of a text sequence is received for TTS analysis. At block 202, a first POS tag is generated using a statistical POS tagger based on a corpus of trained text sequences representing a likely POS for a word of a given text sequence. At block 203, a second POS tag is generated using a rule-based POS tagger based on a set of rules specifically designed for a type of an application associated with the text sequence. At block 204, a final POS tag is assigned to the word of the text sequence for TTS analysis based on an underlying condition of the first POS tag and the second POS tag.

FIG. 3 is a flow diagram illustrating a method for POS tagging in TTS synthesis according to another embodiment of the invention. Process 300 may be performed by system 100 of FIG. 1. Referring to FIG. 3, a word of text sequence 301 is input to rule-based POS tagger 304 and statistical tagger 305 independently and/or concurrently. A rule-based POS tag is generated by rule-based POS tagger 304 based on a set of rules that have been generated via application-specific training 302. Similarly, a statistical POS tag is generated by statistical POS tagger 305 based on a corpus that has been generated via NLP generic training 303. At block 306, the rule-based POS tag and the statistical POS tag are compared. If they are identical, at block 307, either one of them is selected as a final POS tag to be assigned to the input word. At block 308, if there is no rule found by rule-based POS tagger 304, the statistical POS tag is selected as the final POS tag; otherwise, the confidence score of the rule-based POS tag is examined at block 310. If the confidence score of the rule-based POS tag is greater than a predetermined threshold such as 0.5, at block 311, the rule-based POS tag is selected as the final POS tag. Otherwise, at block 312, the statistical POS tag is selected as the final POS tag. Alternatively, the confidence scores of the rule-based tag and the statistical tag are compared to determine which one should be selected as the final POS tag. The tag that has the higher confidence score may be selected as the final POS tag.

In addition, at block 313, it is determined whether the result of the current process should be adapted by the system. If so, optionally, at block 314, the associated rule or rules are adjusted and fed back to rule-based POS tagger 304. Similarly, associated parameters of statistical tagger 305 may also be adjusted. For example, based on the current result, the confidence scores of the corresponding rule(s) of rule-based POS tagger 304 and the corresponding member(s) of the training corpus of statistical POS tagger 305 may be adjusted. Further, a rule having a significantly low (based on a predetermined threshold) confidence score may be removed from the rule database of rule-based POS tagger 304.

FIG. 4 is a block diagram of a data processing system, which may be used with one embodiment of the invention. For example, the system 400 shown in FIG. 4 may be used as system 100 of FIG. 1. Note that while FIG. 4 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers, handheld computers, cell phones and other data processing systems which have fewer components or perhaps more components may also be used with the present invention. The computer system of FIG. 4 may, for example, be an Apple Macintosh computer or MacBook, or an IBM compatible PC.

As shown in FIG. 4, the computer system 400, which is a form of a data processing system, includes a bus or interconnect 402 which is coupled to one or more microprocessors 403 and a ROM 407, a volatile RAM 405, and a non-volatile memory 406. The microprocessor 403 is coupled to cache memory 404. The bus 402 interconnects these various components together and also interconnects these components 403, 407, 405, and 406 to a display controller and display device 408, as well as to input/output (I/O) devices 410, which may be mice, keyboards, modems, network interfaces, printers, and other devices which are well-known in the art.

Typically, the input/output devices 410 are coupled to the system through input/output controllers 409. The volatile RAM 405 is typically implemented as dynamic RAM (DRAM) which requires power continuously in order to refresh or maintain the data in the memory. The non-volatile memory 406 is typically a magnetic hard drive, a magnetic optical drive, an optical drive, or a DVD RAM or other type of memory system which maintains data even after power is removed from the system. Typically, the non-volatile memory will also be a random access memory, although this is not required.

While FIG. 4 shows that the non-volatile memory is a local device coupled directly to the rest of the components in the data processing system, the present invention may utilize a non-volatile memory which is remote from the system, such as a network storage device which is coupled to the data processing system through a network interface such as a modem or Ethernet interface. The bus 402 may include one or more buses connected to each other through various bridges, controllers, and/or adapters, as is well-known in the art. In one embodiment, the I/O controller 409 includes a USB (Universal Serial Bus) adapter for controlling USB peripherals. Alternatively, I/O controller 409 may include an IEEE-1394 adapter, also known as FireWire adapter, for controlling FireWire devices.

Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Embodiments of the invention also relate to an apparatus for performing the operations herein. Such a computer program is stored in a non-transitory computer readable medium. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices).

The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g. circuitry, dedicated logic, etc.), software (e.g., embodied on a non-transitory computer readable medium), or a combination of both. Although the processes or methods are described above in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.

Embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the invention as described herein.

In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Claims

1-20. (canceled)

21. A computer-implemented method for text-to-speech (TTS) synthesis, the method comprising:

in response to a word of a text sequence, generating a first part-of-speech (POS) tag using a statistical POS tagger based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence, wherein the first POS tag is selected from a first POS tag set;
generating a second POS tag using a rule-based POS tagger based on a set of one or more rules associated with a type of an application associated with the text sequence, wherein the second POS tag is selected from a second POS tag set that is different from the first POS tag set; and
assigning a final POS tag to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

22. The method of claim 21, wherein assigning a final POS comprises assigning either the first POS tag or the second POS tag as the final POS tag if the first POS tag and the second POS tag are identical.

23. The method of claim 21, wherein assigning a final POS comprises assigning the first POS tag as the final POS tag if the set of one or more rules does not contain a suitable rule corresponding to the text sequence.

24. The method of claim 21, further comprising:

calculating a first confidence score for the second POS tag based on statistical data of applying a rule associated with the second POS tag;
designating the second POS tag as the final POS tag if the first confidence score is greater than or equal to a first predetermined threshold; and
designating the first POS tag as the final POS tag if the first confidence score is less than the first predetermined threshold.
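
By way of illustration only, the arbitration between the two tags described in claims 22 through 24 might be sketched as follows. The function name assign_final_tag, the default threshold of 0.8, and the use of None to indicate that no rule applied are assumptions introduced for this sketch and are not taken from the specification. The sketch further assumes that both tags have already been expressed in a common tag set so that they can be compared.

```python
from typing import Optional


def assign_final_tag(
    first_tag: str,
    second_tag: Optional[str],
    rule_confidence: float,
    threshold: float = 0.8,  # hypothetical value for the first predetermined threshold
) -> str:
    """Pick the final POS tag from a statistical tag and a rule-based tag.

    first_tag: tag produced by the statistical tagger.
    second_tag: tag produced by the rule-based tagger, or None if no rule applied.
    rule_confidence: confidence score of the rule that produced second_tag.
    """
    # No suitable rule: fall back to the statistical tag.
    if second_tag is None:
        return first_tag
    # Both taggers agree: either tag may serve as the final tag.
    if first_tag == second_tag:
        return first_tag
    # Taggers disagree: trust the rule only if its confidence meets the threshold.
    if rule_confidence >= threshold:
        return second_tag
    return first_tag
```

In this sketch a confidence exactly equal to the threshold selects the rule-based tag, matching the "greater than or equal to" condition.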

25. The method of claim 24, wherein the first confidence score is calculated based on a percentage of successful applications of the rule in previous TTS synthesis.

26. The method of claim 25, further comprising:

adjusting the first confidence score for the rule for future TTS synthesis based on whether the first POS tag has been selected as the final POS tag; and
removing the rule from the set of one or more rules if the first confidence score is below a second predetermined threshold.
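
By way of illustration only, the maintenance of a per-rule confidence score described in claims 25 and 26 might be sketched as follows. The names RuleStats and record_outcome, the removal threshold of 0.2, and the minimum of five applications before a rule becomes eligible for removal are assumptions introduced for this sketch. Treating an application as "successful" when the rule's tag was selected as the final tag is likewise an assumption; the specification may define success differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RuleStats:
    """Running counts for one rule (hypothetical bookkeeping structure)."""
    applications: int = 0
    successes: int = 0

    @property
    def confidence(self) -> float:
        # Fraction of past applications counted as successful.
        # Returning 0.0 before any application is an assumption of this sketch.
        if self.applications == 0:
            return 0.0
        return self.successes / self.applications


def record_outcome(
    rules: Dict[str, RuleStats],
    rule_id: str,
    selected_as_final: bool,
    removal_threshold: float = 0.2,  # hypothetical second predetermined threshold
    min_applications: int = 5,       # hypothetical guard so new rules are not removed at once
) -> None:
    """Update a rule's statistics and remove the rule if its confidence is too low."""
    stats = rules[rule_id]
    stats.applications += 1
    if selected_as_final:
        stats.successes += 1
    # Remove the rule once it has enough history and its confidence is below the threshold.
    if stats.applications >= min_applications and stats.confidence < removal_threshold:
        del rules[rule_id]
```

The minimum-application guard is a design choice of the sketch: without it, a rule would be removed after a single unsuccessful application.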

27. The method of claim 24, further comprising:

calculating a second confidence score for the first POS tag based on a successful rate of application of the first POS tag using the statistical POS tagger;
designating the second POS tag as the final POS tag if the first confidence score is greater than or equal to the second confidence score; and
designating the first POS tag as the final POS tag if the first confidence score is less than the second confidence score.

28. The method of claim 27, further comprising adjusting one or more parameters of the statistical POS tagger for future usage based on whether the first POS tag has been selected as the final POS tag.

29. The method of claim 21, further comprising converting the second POS tag to a corresponding tag in the first POS tag set.

30. The method of claim 29, wherein converting the second POS tag includes using a table that translates tags between the first POS tag set and the second POS tag set.
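
By way of illustration only, the table-based conversion described in claims 29 and 30 might be sketched as follows. The table contents are hypothetical: the specification does not name the two tag sets, so this sketch assumes a coarse tag set (NOUN, VERB, ADJ) for the rule-based tagger and Penn Treebank style tags (NN, VB, JJ) for the statistical tagger. The names SECOND_TO_FIRST and convert_tag are likewise introduced only for this sketch.

```python
from typing import Dict, Optional

# Hypothetical translation table from the second POS tag set to the first.
SECOND_TO_FIRST: Dict[str, str] = {
    "NOUN": "NN",
    "VERB": "VB",
    "ADJ": "JJ",
}


def convert_tag(second_tag: str, table: Dict[str, str] = SECOND_TO_FIRST) -> Optional[str]:
    """Return the first-tag-set equivalent of second_tag, or None if the table has no entry."""
    return table.get(second_tag)
```

For example, convert_tag("VERB") returns "VB" with the table above, and a tag with no entry returns None so that the caller can decide how to handle it.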

31. The method of claim 21, further comprising converting the first POS tag to a corresponding tag in the second POS tag set.

32. A non-transitory machine-readable storage medium having instructions stored therein, which when executed by a machine, cause the machine to perform a method for text-to-speech (TTS) synthesis, the method comprising:

in response to a word of a text sequence, generating a first part-of-speech (POS) tag using a statistical POS tagger based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence, wherein the first POS tag is selected from a first POS tag set;
generating a second POS tag using a rule-based POS tagger based on a set of one or more rules associated with a type of an application associated with the text sequence, wherein the second POS tag is selected from a second POS tag set that is different from the first POS tag set; and
assigning a final POS tag to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

33. The machine-readable storage medium of claim 32, wherein assigning a final POS comprises assigning either the first POS tag or the second POS tag as the final POS tag if the first POS tag and the second POS tag are identical.

34. The machine-readable storage medium of claim 32, wherein assigning a final POS comprises assigning the first POS tag as the final POS tag if the set of one or more rules does not contain a suitable rule corresponding to the text sequence.

35. The machine-readable storage medium of claim 32, wherein the method further comprises:

calculating a first confidence score for the second POS tag based on statistical data of applying a rule associated with the second POS tag;
designating the second POS tag as the final POS tag if the first confidence score is greater than or equal to a first predetermined threshold; and
designating the first POS tag as the final POS tag if the first confidence score is less than the first predetermined threshold.

36. The machine-readable storage medium of claim 35, wherein the first confidence score is calculated based on a percentage of successful applications of the rule in previous TTS synthesis.

37. The machine-readable storage medium of claim 36, wherein the method further comprises:

adjusting the first confidence score for the rule for future TTS synthesis based on whether the second POS tag has been selected as the final POS tag; and
removing the rule from the set of one or more rules if the first confidence score is below a second predetermined threshold.

38. The machine-readable storage medium of claim 35, wherein the method further comprises:

calculating a second confidence score for the first POS tag based on a successful rate of application of the first POS tag using the statistical POS tagger;
designating the second POS tag as the final POS tag if the first confidence score is greater than or equal to the second confidence score; and
designating the first POS tag as the final POS tag if the first confidence score is less than the second confidence score.

39. The machine-readable storage medium of claim 38, wherein the method further comprises adjusting one or more parameters of the statistical POS tagger for future usage based on whether the first POS tag has been selected as the final POS tag.

40. An apparatus for text-to-speech (TTS) synthesis, comprising at least one processor and memory storing instructions for execution by the processor, the instructions including:

a statistical POS tagger, in response to a word of a text sequence, to generate a first part-of-speech (POS) tag based on a corpus of trained text sequences, each representing a likely POS of a word for a given text sequence, wherein the first POS tag is selected from a first POS tag set;
a rule-based POS tagger to generate a second POS tag based on a set of one or more rules associated with a type of an application associated with the text sequence, wherein the second POS tag is selected from a second POS tag set that is different from the first POS tag set; and
a text analyzer coupled to the statistical POS tagger and the rule-based POS tagger to assign a final POS tag to the word of the text sequence for TTS synthesis based on the first POS tag and the second POS tag.

41. The apparatus of claim 40, wherein either the first POS tag or the second POS tag is assigned as the final POS tag if the first POS tag and the second POS tag are identical.

42. The apparatus of claim 40, wherein the first POS tag is assigned as the final POS tag if the set of one or more rules does not contain a suitable rule corresponding to the text sequence.

43. The apparatus of claim 40, wherein the rule-based POS tagger comprises a score calculator to calculate a first confidence score for the second POS tag based on statistical data of applying a rule associated with the second POS tag, wherein the second POS tag is designated as the final POS tag if the first confidence score is greater than or equal to a first predetermined threshold; otherwise, the first POS tag is designated as the final POS tag.

Patent History
Publication number: 20140324435
Type: Application
Filed: Apr 30, 2014
Publication Date: Oct 30, 2014
Applicant: APPLE INC. (Cupertino, CA)
Inventor: Jerome R. BELLEGARDA (Saratoga, CA)
Application Number: 14/266,318
Classifications
Current U.S. Class: Image To Speech (704/260)
International Classification: G10L 13/02 (20060101);