Thematic segmentation of speech

Info

Publication number: 20040024598
Type: Application
Filed: Jul 2, 2003
Publication Date: Feb 5, 2004
Inventors: Amit Srivastava (Waltham, MA), Francis Kubala (Boston, MA)
Application Number: 10610679

Abstract

A thematic segmentation tool generates indications of thematically coherent segments within a document. The thematic segmentation tool includes a transcription component, a speaker boundary detection component, a linguistic detection component, and a topic classification component. Each of these components generates input information for a thematic decision component, which generates the thematic segmentation information. Multiple thematic segments may concurrently apply to a portion of a document. Additionally, the thematic segments may be hierarchical thematic segments.

Description


RELATED APPLICATIONS

[0001] This application claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application Nos. 60/394,064 and 60/394,082 filed Jul. 3, 2002 and Provisional Application No. 60/419,214 filed Oct. 17, 2002, the disclosures of which are incorporated herein by reference.

GOVERNMENT CONTRACT

BACKGROUND OF THE INVENTION

[0003] A. Field of the Invention

[0004] The present invention relates generally to speech processing and, more particularly, to the segmentation of speech based on thematic classification.

[0005] B. Description of Related Art

[0006] Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although the act of recording audio is not difficult, automatically transcribing and indexing speech in an intelligent and useful manner can be difficult.

[0007] Speech is typically received into a speech recognition system as a continuous stream of words without breaks. In order to effectively use the speech in information management systems (e.g., information retrieval, natural language processing, real-time alerting), the speech recognition system initially processes the speech to generate a formatted version of the speech. For example, the speech may be transcribed and linguistic information, such as sentence structures, may be associated with the transcription.

[0008] In addition to segmenting speech segments based on linguistic information, it may be desirable to also segment the speech based on thematic structure. For example, when archiving a continuous broadcast of a radio news program, it may be desirable to know the portions of the news program that discussed the weather and the portions that were about foreign affairs. The portion of the broadcast that was directed to foreign affairs may be further classified into European and Middle East news segments. Users can, thus, later browse or listen to an archive copy of the news broadcast based on topics of interest.

[0009] One technique for segmenting a continuous stream of speech based on thematic elements involves making thematic boundary decisions based on a word count within a moving window of text. FIG. 1 is a block diagram illustrating this technique in additional detail. Initial input audio information is transcribed by transcription component 101. The transcription may be performed manually or automatically. Transcription component 101 outputs a continuous stream of text. Windowing component 102 segments the text into chunks of a predetermined length (e.g., 200 words) and generates a vector of the words that occur within each window. Words that occur more frequently within the window are weighted more heavily in the vector. Boundary decision component 103 detects changes in thematic segments based on the word count weighted vectors.
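
The windowing technique of FIG. 1 can be illustrated with a minimal Python sketch. The function names, the use of cosine similarity, and the threshold value are assumptions made for illustration; the description above states only that boundary decisions are based on word-count-weighted vectors.

```python
import math
from collections import Counter


def window_vectors(words, window_size=200):
    """Split a word stream into fixed-length windows and count the words in each."""
    return [
        Counter(words[start:start + window_size])
        for start in range(0, len(words), window_size)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def boundary_word_offsets(words, window_size=200, threshold=0.1):
    """Return word offsets where adjacent windows share little vocabulary.

    The threshold is an illustrative assumption, not a value from the text.
    """
    vectors = window_vectors(words, window_size)
    return [
        (i + 1) * window_size
        for i in range(len(vectors) - 1)
        if cosine_similarity(vectors[i], vectors[i + 1]) < threshold
    ]
```

In this sketch a boundary can fall only at a multiple of the window length, regardless of where sentences or speaker turns actually end.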

[0010] A problem with this technique is that it can produce erroneous or non-optimal thematic segments. Accordingly, there is a need in the art to improve thematic segmentation of speech.

SUMMARY OF THE INVENTION

[0011] Systems and methods consistent with the principles of this invention provide a thematic segmentation tool that acts on text augmented with additional information extracted from the spoken version of the text. The thematic segmentation tool may generate overlapping thematic segments for a single portion of text.

[0012] One aspect of the invention is directed to a thematic segmentation tool that includes a transcription component configured to receive spoken audio information and to convert the spoken audio information into a document of text corresponding to the audio information. A linguistic detection component generates linguistic information corresponding to the text produced by the transcription component. A topic classification component generates topics relevant to the document. A thematic decision component generates indications of thematic segments based on the linguistic information, the document, and the topics.

[0013] Another aspect of the invention is directed to a method for determining thematically coherent segments within a document. The method comprises receiving a document having associated linguistic information that describes linguistic features of the document and generating indications of thematically coherent segments within the document that occur at the linguistic features in the document.

[0014] Yet another aspect of the invention is directed to a computing device comprising a processor and a computer memory coupled to the processor. The computer memory contains program instructions that when executed by the processor associate linguistic information with a document. The linguistic information demarcates linguistic breaks within the document. The program instructions additionally generate, based on the linguistic breaks within the document, indications of thematically coherent segments, and output the thematically coherent segments associated with labels describing thematic content of the thematically coherent segments.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,

[0016] FIG. 1 is a block diagram illustrating thematic segmentation using a conventional technique based on a word count within a moving window of text;

[0017] FIG. 2 is a diagram illustrating an exemplary system in which concepts consistent with the invention may be implemented;

[0018] FIG. 3 is a block diagram illustrating software elements in a thematic segmentation tool consistent with the invention; and

[0019] FIG. 4 is a diagram illustrating exemplary thematic segments for a document.

DETAILED DESCRIPTION

[0020] The following detailed description of the invention refers to the accompanying drawings. The same reference numbers may be used in different drawings to identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents of the claim limitations.

[0021] Thematic segmentation of spoken audio is performed by a thematic segmentation tool on a transcribed version of the audio supplemented with additional information that further describes the audio. In one implementation, the transcription is supplemented with visible linguistic structural information, such as sentence demarcations, and non-visible linguistic structural information, such as phrasal boundaries, topic lists, and speaker boundaries. The result of the thematic segmentation includes hierarchical and potentially overlapping thematic segments.

System Overview

[0022] Thematic segmentation, as described herein, may be performed on one or more processing devices or networks of processing devices. FIG. 2 is a diagram illustrating an exemplary system 200 in which concepts consistent with the invention may be implemented. System 200 includes a computing device 201 that has a computer-readable medium 209, such as random access memory, coupled to a processor 208. Computing device 201 may also include a number of additional external or internal devices, such as, without limitation, a mouse, a CD-ROM, a keyboard, and a display.

[0023] In general, computing device 201 may be any type of computing platform, and may be connected to a network 202. Computing device 201 is exemplary only. Concepts consistent with the present invention can be implemented on any computing device, whether or not connected to a network.

[0024] Processor 208 can be any of a number of well-known computer processors, such as processors from Intel Corporation, of Santa Clara, Calif. Processor 208 executes program instructions stored in memory 209.

[0025] Memory 209 contains an application program 215. In particular, application program 215 may implement the thematic segmentation tool described below. The thematic segmentation tool 215 may receive input data, such as linguistically segmented text, from other application programs executing in computing device 201 or other computing devices, such as those connected to computing device 201 through network 202. Thematic segmentation tool 215 processes the input data to generate indications of thematic segments.

Thematic Segmentation Tool

[0026] FIG. 3 is a block diagram conceptually illustrating software elements of thematic segmentation tool 215. Decisions relating to thematic segmentation are made by thematic decision component 310. Thematic decision component 310 implements a statistical framework that generates thematic segments for a “document.” The term document, as used herein, refers to textual information and associated descriptive information relating to the document (e.g., speaker boundaries, phrasal boundaries, etc.). Although such a document may be generated from data from audio sources, it could be generated in other manners, such as from data from video or textual sources.

[0027] Thematic decision component 310 receives a number of inputs that describe the document. Specifically, as shown in FIG. 3, thematic decision component 310 receives a text transcript of the document from transcription component 320, speaker boundary information from speaker boundary detection component 321, linguistic information from linguistic detection component 322, and topic classifications from topic classification component 323. Although transcription component 320, speaker boundary detection component 321, linguistic detection component 322, and topic classification component 323 are illustrated as part of thematic segmentation tool 215, in other implementations, these components may be considered as providing input information to a thematic segmentation tool implemented by thematic decision component 310.

[0028] Transcription component 320 may be an automated or manual transcription tool that converts the audio input stream it receives into text. Transcription component 320 may use conventional techniques to perform the conversion.

[0029] Speaker boundary detection component 321 locates boundaries between speakers in the audio input stream. Knowledge of speaker changes in an audio stream may be a useful indicator of potential changes in thematic content. Automated speaker boundary detection techniques are known in the art. For example, speaker boundary detection is described in Liu et al., “Fast Speaker Change Detection for Broadcast News Transcription and Indexing,” Eurospeech '99, Budapest, Hungary, September 1999, pp. 1031-1034; and Chen et al., “Speaker, Environment, and Channel Change Detection and Clustering via the Bayesian Information Criterion,” Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Va., 1998. Alternatively, instead of automatically detecting speaker boundaries, the speaker boundaries may be manually inserted into the document.

[0030] Linguistic detection component 322 receives the text generated by transcription component 320 and the audio input stream. Automated transcription techniques generally produce a simple stream of words without linguistic information (e.g., periods, exclamation marks, quotation marks) that would ideally be associated with the text. Linguistic detection component 322 annotates the text from transcription component 320 to include this linguistic information. In addition to visible linguistic information, such as periods, linguistic detection component 322 may associate non-visible linguistic information, such as phrasal boundaries, with the received text.

[0031] Techniques for generating both visible and non-visible linguistic information are described in detail in U.S. patent application Ser. No. ______ (Attorney Docket Number 02-4024), titled “Linguistic Segmentation of Speech,” filed ______, the contents of which are hereby incorporated by reference.

[0032] Topic classification component 323 generates topics selected from a predefined topic vocabulary that are relevant to the document. For example, a document may include any combination of words from a 60,000 word corpus. Topic classification component 323 examines the document and outputs one or more predefined topics, where the number of possible topics is less than the 60,000 word corpus (e.g., a 5,000 word topic vocabulary).

[0033] Topic classification component 323, in one implementation, uses a Bayesian framework to generate topics for a document. More particularly, topic classification component 323 may be implemented as a probabilistic Hidden Markov Model (HMM) whose parameters are estimated from training samples of documents with given topic labels. This model allows each word in a document to contribute different amounts to each of the topics assigned to the document. The output of topic classification component 323 may be a rank-ordered list of all possible topics and corresponding scores that indicate the estimated relevance of each topic. In general, automated topic classification systems are known in the art. See, for example, Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, vol. 88, no. 8, August 2000.
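
The rank-ordered output described above can be illustrated with a deliberately simplified Python sketch. It substitutes an add-one smoothed unigram model per topic for the HMM described in this paragraph, so unlike the HMM it does not let a word contribute different amounts to different topics. The function names and the scoring rule are assumptions made for illustration.

```python
import math
from collections import Counter


def train_topic_models(labeled_docs):
    """Accumulate word counts per topic from (words, topics) training pairs."""
    counts = {}
    for words, topics in labeled_docs:
        for topic in topics:
            counts.setdefault(topic, Counter()).update(words)
    return counts


def rank_topics(words, topic_counts, vocabulary_size):
    """Return (topic, score) pairs sorted from most to least relevant.

    The score is the average per-word log-probability under an add-one
    smoothed unigram model of the topic (a stand-in for the HMM).
    """
    scores = []
    for topic, counts in topic_counts.items():
        total = sum(counts.values())
        log_prob = sum(
            math.log((counts[w] + 1) / (total + vocabulary_size))
            for w in words
        )
        scores.append((topic, log_prob / max(len(words), 1)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Every topic in the trained vocabulary receives a score, so the caller can keep the whole ranked list or only the top few entries.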

[0034] In another possible implementation, instead of estimating parameters based on training samples that have topics manually generated, topic classification component 323 can be constructed to generate topics based on unsupervised topic discovery.

[0035] Thematic decision component 310 uses the outputs of transcription component 320, speaker boundary detection component 321, linguistic detection component 322, and topic classification component 323 to generate indications of thematic segments in the input document.

[0036] Consistent with an aspect of the present invention, the thematic segments generated by thematic decision component 310 may include multiple overlapping thematic segments for a particular portion of a document. Additionally, thematic decision component 310 may label the thematic segments using a hierarchical labeling scheme such that a specific thematic segment (e.g., a thematic segment labeled “hurricane”) is organized as a subset of a more general thematic segment (e.g., the thematic segment labeled “weather”).

[0037] FIG. 4 is a diagram conceptually illustrating exemplary thematic segments for a document. Document 401 is conceptually illustrated as a series of lines that are assumed to correspond to text. Associated with the text in document 401 are linguistic cues such as periods 402 and commas 403. Although not shown, speaker boundaries, topics, and non-visible linguistic information may also be associated with document 401.

[0038] Thematic segments in FIG. 4 are illustrated by the bracketed segments 410-412. As shown, thematic segments 410 and 412 overlap one another. In general, thematic segments do not necessarily have to sequentially follow one another. Thematic segment 412 may be hierarchically related to thematic segment 410 as a subset of thematic segment 410, or thematic segment 412 may be an independent and concurrent thematic segment.
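
One way to represent overlapping and hierarchical segments such as those of FIG. 4 is sketched below in Python. The class name, its fields, and the choice to index segments by sentence number are assumptions made for illustration; the description does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThematicSegment:
    """A labeled span of sentences; start and end are inclusive sentence indices."""
    label: str
    start: int
    end: int
    parent: Optional["ThematicSegment"] = None  # more general enclosing segment

    def overlaps(self, other: "ThematicSegment") -> bool:
        """True when the two segments share at least one sentence."""
        return self.start <= other.end and other.start <= self.end

    def contains(self, other: "ThematicSegment") -> bool:
        """True when the other segment lies entirely inside this one."""
        return self.start <= other.start and other.end <= self.end

    def label_path(self) -> List[str]:
        """Labels from the most general ancestor down to this segment."""
        node, path = self, []
        while node is not None:
            path.append(node.label)
            node = node.parent
        return list(reversed(path))


# Hypothetical segments: "hurricane" is a subset of "weather", while
# "foreign affairs" is an independent segment that overlaps "weather".
weather = ThematicSegment("weather", 0, 9)
hurricane = ThematicSegment("hurricane", 3, 6, parent=weather)
foreign = ThematicSegment("foreign affairs", 8, 15)
```

Keeping the parent link separate from the sentence span lets the same structure express both a hierarchical subset and an independent concurrent segment.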

Detailed Description of Statistical Framework used by the Thematic Segmentation Tool

[0039] In general, when generating thematic segments, such as thematic segments 410-412, thematic decision component 310 honors the linguistic boundary information as basic constituents of the document. Thematic segments are formed as one or more of the sequential constituents (e.g., one or more sentences) determined by linguistic detection component 322.

[0040] A number of different techniques can be used to implement the statistical framework of thematic decision component 310. Some of these techniques, and the speech features on which they are based, will now be described in more detail.

[0041] Acoustic Features

[0042] Speech has a range of properties that make it very different from plain text. Thematic segmentation on speech-transcribed text benefits from access to the original signal from the speaker in addition to the textual content of what was spoken. Nuances in speech from the speaker are frequently very relevant indicators of changes in content as well as intent by the speaker, both of which can be used to effectively model the shift in themes in an episode. Prosodic features, such as pause, pitch, energy, and speaking rate, can be used in statistical models for detecting changes in the speech that correspond to a change in the theme of the content.

[0043] Linguistic Features

[0044] Word repetition can be used alone or in conjunction with other features like word frequency and synonyms. In most cases, synonyms are identified using predefined word tables or thesauri, both of which are resources that are hard to generate and to generalize. Latent Semantic Analysis (LSA) is a known robust technique used to match words that are synonyms and better handle the multiple meanings of a term. An example of the use of LSA is given in T. Brants, “Topic-Based Document Segmentation with Probabilistic Latent Semantic Analysis,” Proceedings of the Conference on Information and Knowledge Management, Nov. 4-9, 2002, McLean, Va. LSA uses singular value decomposition to map the high-dimensional word-document count matrix to a lower dimensional latent ‘semantic’ space wherein terms and documents that are closely associated are placed near one another. LSA has the additional property that it can reduce the dimensions of the linguistic feature space (typically of the order of 60,000 terms for conventional large-vocabulary speech recognition systems) to a much more manageable size and do this intelligently, such that the inherent similarities between the terms in the space are not only preserved but also collated for better modeling. Additional linguistic features, like Minimum Description Length (MDL) phrases and named-entity phrases, can be added to the linguistic sub-space and rely on the LSA technique to connect the terms and the phrases effectively.
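
The dimensionality reduction performed by LSA can be written out as follows. The symbols are assumptions introduced for illustration: X is the word-document count matrix with M terms (on the order of 60,000 in the example above) and N documents, k is the number of retained latent dimensions with k much smaller than M, and d is the M-dimensional count vector of a new text block.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{M \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{N \times k}

\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

The first line is the rank-k truncated singular value decomposition. The second line is the standard fold-in step that projects a new block into the k-dimensional latent space, where similarity between blocks can then be measured.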

[0045] Segmentation Approaches

[0046] Segmentation techniques can compare the distance between two blocks of text and select segmentation points based on the similarity values between pairs of adjacent blocks. For example, M. A. Hearst, “Multi-paragraph Segmentation of Expository Text,” Proceedings of the Association for Computational Linguistics, 1994, uses a sliding window and computes similarities between adjacent blocks based on their term frequency vectors. For thematic segmentation in speech, sliding windows of text can be used with a similarity measure based on the persistence of statistical-model-based hypothesized topics between pairs of adjacent blocks of windowed text. The smallest unit for the segmentation process is an elementary block. Sentences can be used as the elementary blocks for defining the segmentation candidates. The text can be broken into blocks, i.e., sequences of consecutive elementary blocks, where each block includes some number of elementary blocks. In the training documents, these blocks are variable-sized, non-overlapping and generally do not cross segment boundaries. However, in the documents to be segmented, these blocks may be overlapping, as in the use of a sliding window. The set of positions between every pair of adjacent blocks constitutes the segmentation candidates.
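
The block-based approach can be sketched in Python as follows. The sketch assumes that each sentence (elementary block) already carries a vector of topic scores, that a block's vector is the sum of its sentences' vectors, and that a boundary is declared at a local minimum of similarity below a threshold. These choices, the function names, and the parameter values are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def block_vector(sentence_vectors):
    """Sum per-sentence topic-score vectors into one vector for the block."""
    return [sum(column) for column in zip(*sentence_vectors)]


def candidate_similarities(sentence_vectors, block_size=2):
    """Score every segmentation candidate with a sliding window.

    Returns (position, similarity) pairs, where position p is the gap
    between sentence p - 1 and sentence p.
    """
    scores = []
    for p in range(1, len(sentence_vectors)):
        left = sentence_vectors[max(0, p - block_size):p]
        right = sentence_vectors[p:p + block_size]
        scores.append((p, cosine(block_vector(left), block_vector(right))))
    return scores


def select_boundaries(scores, threshold=0.5):
    """Keep candidates that are local minima of similarity below the threshold."""
    boundaries = []
    for i, (p, sim) in enumerate(scores):
        prev_sim = scores[i - 1][1] if i > 0 else float("inf")
        next_sim = scores[i + 1][1] if i + 1 < len(scores) else float("inf")
        if sim < threshold and sim <= prev_sim and sim <= next_sim:
            boundaries.append(p)
    return boundaries
```

Because candidates are gaps between sentences, every boundary selected this way falls on a sentence boundary rather than at an arbitrary word offset.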

[0047] Mathematical Models

[0048] There are a number of mathematical models that can be used to determine the relationship between varied features and shifts in thematic content. The models can operate on the pure acoustic features or the pure linguistic features, as well as the combined acoustic/linguistic features. For example, statistical learn-by-example techniques that can be trained on roughly annotated training data for every domain and language may be used.

[0049] Neural Networks

[0050] Neural Networks can additionally be effective in approximating the complex, non-linear relationships that exist between features of various types, such as continuous, discrete, and in some cases even Boolean, as well as the change in structure in the underlying speech. Neural Networks can be used to model the acoustic features and produce an estimate of the similarity or dissimilarity between adjacent blocks on either side of a segmentation candidate. With the help of LSA, high-dimensional linguistic features can be mapped onto a low dimensional compact sub-space. The mapped features can be used with the prosodic information in a combined neural network to detect changes in themes.

[0051] Probabilistic Latent Semantic Analysis (PLSA)

[0052] Probabilistic Latent Semantic Analysis can be used to model high-dimensional linguistic features. The PLSA model can be used quite effectively with the combined feature space since it is highly adept at finding the subtle cross-correlations between features that expose the inter-relations between the terms and underlying themes. PLSA is a statistical latent class model that may provide better results than LSA for term matching in retrieval applications. In PLSA, the conditional probability between documents d and feature terms f is modeled through a latent variable z, which can be loosely thought of as a class or topic. A PLSA model is parameterized by P(f|z) and P(z|d), and the words may belong to more than one class and a document may discuss more than one “topic”. The latent variable z can be thought of as an unknown variable in the context of Expectation-Maximization algorithms and thus the parameters of the PLSA model can be trained from a corpus of documents using the EM algorithm. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. A wide variety of similarity measures like the cosine distance, the Bhattacharyya distance, as well as Kullback-Leibler divergence can be used with the scores generated from the PLSA model to determine the segmentation boundaries.
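
The PLSA mixture and the similarity measures named above can be sketched in Python as follows. The sketch assumes each text block is summarized by a distribution P(z|d) over latent classes, stored as a list that sums to one. The function names and the small epsilon guard are assumptions made for illustration, and training by EM is not shown.

```python
import math


def term_probability(p_f_given_z, p_z_given_d):
    """PLSA mixture for one term: P(f|d) = sum over z of P(f|z) * P(z|d)."""
    return sum(pf * pz for pf, pz in zip(p_f_given_z, p_z_given_d))


def cosine_similarity(p, q):
    """Cosine of the angle between two latent-class score vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient; zero for identical inputs."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(coefficient) if coefficient > 0 else float("inf")


def kl_divergence(p, q, epsilon=1e-12):
    """Kullback-Leibler divergence KL(p || q) in nats.

    The epsilon floor on q avoids division by zero and is an assumption
    of this sketch.
    """
    return sum(a * math.log(a / max(b, epsilon)) for a, b in zip(p, q) if a > 0)
```

Kullback-Leibler divergence is asymmetric, so a segmenter that compares adjacent blocks may prefer to evaluate it in both directions or use one of the symmetric measures.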

CONCLUSION

[0053] As described above, a thematic segmentation tool demarcates segments for a document that have similar thematic content. The thematic segmentation tool bases the thematic segments on a transcription of audio data augmented with additional information relating to linguistic and speaker descriptive properties of the audio. The thematic segments generated by the tool may be hierarchical and may include multiple different thematic segments for a portion of text.

[0054] The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

[0055] Certain portions of the invention have been described as software that performs one or more functions. The software may more generally be implemented as any type of logic. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, software, or a combination of hardware and software.

[0056] No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used.

[0057] The scope of the invention is defined by the claims and their equivalents.

Claims

1. A thematic segmentation tool comprising:

a transcription component configured to receive spoken audio information and convert the spoken audio information into a document of text corresponding to the spoken audio information;

a linguistic detection component configured to generate linguistic information corresponding to the document produced by the transcription component based on the document and the spoken audio information;

a topic classification component configured to generate topics relevant to the document; and

a thematic decision component configured to generate indications of a plurality of thematic segments based on the linguistic information, the document, and the topics.

2. The thematic segmentation tool of claim 1, wherein the thematic segments include a hierarchical arrangement of multiple thematic segments.

3. The thematic segmentation tool of claim 1, wherein the thematic segments include multiple concurrent thematic segments generated for a portion of the document.

4. The thematic segmentation tool of claim 1, wherein the linguistic information includes visible linguistic information.

5. The thematic segmentation tool of claim 4, wherein the visible linguistic information includes at least one of periods and commas.

6. The thematic segmentation tool of claim 1, wherein the linguistic information includes non-visible linguistic information.

7. The thematic segmentation tool of claim 6, wherein the non-visible linguistic information includes phrasal boundary information.

8. The thematic segmentation tool of claim 1, wherein the topic classification component generates topics after training for topic generation using an unsupervised topic discovery mechanism.

9. The thematic segmentation tool of claim 1, wherein the thematic segments begin and end on linguistic boundaries.

10. The thematic segmentation tool of claim 1, further including:

a speaker boundary detection component configured to generate indications of speaker boundaries for the spoken audio information, wherein the thematic decision component uses the indications of speaker boundaries when generating the indications of the plurality of thematic segments.

11. A method for determining thematically coherent segments within a document, the method comprising:

receiving a document having associated linguistic information describing linguistic features of the document; and

generating indications of thematically coherent segments within the document that occur at the linguistic features in the document.

12. The method of claim 11, further comprising:

generating the document by transcribing speech.

13. The method of claim 12, wherein the document is additionally associated with topic information that summarizes topics relevant to the document.

14. The method of claim 13, wherein the document is additionally associated with speaker boundary information.

15. The method of claim 11, wherein the thematically coherent segments include a hierarchical arrangement of thematic segments.

16. The method of claim 11, wherein the thematically coherent segments include multiple concurrent thematic segments generated for a portion of the document.

17. The method of claim 11, wherein the linguistic information includes visible linguistic information.

18. The method of claim 17, wherein the visible linguistic information includes at least one of periods and commas.

19. The method of claim 11, wherein the linguistic information includes non-visible linguistic information.

20. The method of claim 19, wherein the non-visible linguistic information includes phrasal boundary information.

21. The method of claim 11, wherein the thematically coherent segments begin and end on linguistic boundaries defined by the linguistic information.

22. A computing device comprising:

a processor; and

a computer memory coupled to the processor and containing programming instructions that when executed by the processor cause the processor to:

associate linguistic information with a document, the linguistic information demarcating linguistic breaks within the document,

generate, at a plurality of the linguistic breaks within the document, indications of thematically coherent segments, and

output the thematically coherent segments associated with labels describing thematic content of the thematically coherent segments.

23. The computing device of claim 22, wherein the program instructions, when executed by the processor, additionally cause the processor to:

generate the document by transcribing speech.

24. The computing device of claim 22, wherein the document is additionally associated with topic information that summarizes topics relevant to the document.

25. The computing device of claim 22, wherein the document is additionally associated with speaker boundary information.

26. The computing device of claim 22, wherein the thematically coherent segments include a hierarchical arrangement of thematic segments.

27. The computing device of claim 22, wherein the thematically coherent segments include multiple concurrent thematic segments generated for a portion of the document.

28. The computing device of claim 22, wherein the linguistic information includes visible linguistic information.

29. The computing device of claim 28, wherein the visible linguistic information includes at least one of periods and commas.

30. The computing device of claim 22, wherein the linguistic information includes non-visible linguistic information.

31. The computing device of claim 30, wherein the non-visible linguistic information includes phrasal boundary information.

32. A device comprising:

means for associating linguistic information with a document, the linguistic information demarcating linguistic breaks within the document; and

means for generating, at a plurality of the linguistic breaks within the document, indications of thematically coherent segments.

33. The device of claim 32, wherein the document is additionally associated with speaker boundary information and with topic information that summarizes topics relevant to the document.

34. A computer-readable medium containing program instructions for execution by a processor, the program instructions, when executed by the processor, cause the processor to perform a method comprising:

obtain a document having associated linguistic information describing linguistic features of the document; and

generate indications of thematically coherent segments within the document that occur at the linguistic features in the document.

35. The computer-readable medium of claim 34, wherein the document is additionally associated with speaker boundary information and with topic information that summarizes topics relevant to the document.