METHODS AND SYSTEMS FOR CORRECTING TRANSCRIBED AUDIO FILES
Methods and systems for correcting transcribed text. One method includes receiving audio data from one or more audio data sources and transcribing the audio data based on a voice model to generate text data. The method also includes making the text data available to a plurality of users over at least one computer network and receiving corrected text data over the at least one computer network from the plurality of users. In addition, the method can include modifying the voice model based on the corrected text data.
The present application is a continuation of U.S. application Ser. No. 12/278,332, filed Aug. 5, 2008, which is an application filed under 35 U.S.C. §371 from International Application PCT/US2007/066791, filed Apr. 17, 2007, which claims priority to U.S. Provisional Application No. 60/792,640, filed Apr. 17, 2006, the contents of which are each incorporated by reference herein.
BACKGROUND
Each day individuals and companies receive multiple voice or audio messages. These voice messages can include personal greetings and information or business-related instructions and information. In either case, it may be useful or required that the voice messages be transcribed in order to create written records of the messages. For example, vendors may create paper versions of orders placed via voice messages, lawyers may create paper copies of messages received from clients, and federal agencies may create paper copies of voice messages for public records. In each situation, it is generally important that voice messages be transcribed correctly.
Software currently exists that generates written text based on audio data. For example, Nuance Communications, Inc. provides a number of software programs, trademarked “Dragon,” that take audio files in .WAV format, .MP3 format, or other audio formats and translate such files into text files. The Dragon software also provides mechanisms for comparing audio files to text files in order to “learn” and improve future transcriptions. The “learning” mechanism included in the Dragon software, however, is only intended to learn based on a voice dependent model, which means that the same person trains the software program over time. In addition, learning mechanisms in existing transcription software are often non-continuous and include set training parameters that limit the amount of training that is performed.
SUMMARY OF THE INVENTION
Embodiments of the present invention provide methods and systems for correcting transcribed text. One method includes receiving audio data from one or more audio data sources and transcribing the audio data based on a voice model to generate text data. The method also includes making the text data available to a plurality of users over at least one computer network and receiving corrected text data over the at least one computer network from the plurality of users. In addition, the method includes modifying the voice model based on the corrected text data.
Embodiments of the present invention also provide systems for correcting transcribed text. One system includes a transcription server, at least one translation server, a correction interface, and at least one training server. The transcription server receives audio data from one or more audio data sources and the translation server can transcribe the audio data based on a voice model to generate text data. The correction interface is accessible by a plurality of users over at least one computer network and provides the plurality of users access to the text data. The correction interface also receives corrected text data from the plurality of users. The training server modifies the voice model based on the corrected text data.
Additional embodiments of the invention also provide methods of performing audio data transcription. One method includes obtaining audio data from at least one audio data source, transcribing the audio data based on a voice-independent model to generate text data, and sending the text data to an owner of the audio data as an e-mail message.
In the drawings:
Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways.
In addition, it should be understood that embodiments of the invention include hardware, software, and electronic components or modules that, for purposes of discussion, may be illustrated and described as if the majority of the components were implemented solely in hardware. However, based on a reading of this detailed description, one of ordinary skill in the art would recognize that, in at least one embodiment, the electronic based aspects of the invention may be implemented in software. As such, it should be noted that a plurality of hardware and software based devices, as well as a plurality of different structural components, may be utilized to implement the invention. Furthermore, and as described in subsequent paragraphs, the specific configurations illustrated in the drawings are intended to exemplify embodiments of the invention. Other alternative configurations are possible.
In some embodiments, an audio data source 30 is connected to the transcription server 20 through a VoIP voice mail server 20a. For example, a user operating a telephone 30a dials an individual voice mail box associated with the VoIP voice mail server 20a and leaves a message (i.e., audio data). The VoIP voice mail server 20a converts the received message to a format recognizable and useable to the transcription server 20 (if necessary), and the VoIP voice mail server 20a transmits the message to the transcription server 20. It should be understood that, in some embodiments, the functionality of the VoIP voice mail server 20a is combined with the functionality of the transcription server 20 and is provided in a single server or device.
As shown in
As shown in
In addition, the transcription server 20 can obtain audio data from a client computer 30e. For example, a user of the client computer 30e can upload audio files stored on or accessible by the client computer 30e to the transcription server 20. In some embodiments, a user uses a recording application stored on or accessible by the client computer 30e to create audio files to be uploaded to the transcription server 20. The client computer 30e can upload the audio files to the transcription server 20 using various formats and/or protocols, such as the file transfer protocol (“FTP”).
A user can also e-mail an audio file to the transcription server 20. For example, the transcription server 20 can include or can be connected to an e-mail server that receives e-mail messages from the client computer 30e or other e-mail processing devices, such as personal digital assistants (“PDAs”) and hand-held communication devices (e.g., a cellular phone, a Blackberry device, etc.), and a user can forward or send an e-mail message that contains audio data to an e-mail address associated with the transcription server 20.
It should also be noted that, in some embodiments, the transcription server 20 obtains audio data from a TTY phone 30d or from a client computer 30e via a VoIP server 20a. In addition, the system 10 can allow a user involved in a telephone call to enter a code (e.g., via a keypad of the telephone) that initiates recording of the current telephone call by the transcription server 20 or another device of the system 10. For example, a user can enter a telephone number associated with a transcription server 20 or another device of the system 10 that “conferences in” the device so that the device obtains a substantially real-time stream of the audio of the telephone call. The device records the audio of the telephone call and creates corresponding audio data (e.g., one or more audio files).
The transcription server 20 or another device of the system 10 can also initiate a call to an external voicemail server and record voicemail messages stored by the voicemail server in order to obtain audio data for transcription. For example, the system 10 can provide an interface (e.g., a settings interface or website) that enables a user to provide a telephone number of a voicemail system and/or a telephone number (e.g., a cellular phone number), a voicemail passcode or password, and, optionally, a schedule for calling the voicemail server to record voicemail messages. The interface can also enable a user to manually initiate a call to the voicemail server. In addition, the interface can enable a user to listen to the voicemail messages as or before the transcription server 20 records and/or transcribes them. In some embodiments, the interface also enables a user to select which voicemails the transcription server 20 should transcribe.
As shown in
Once the transcription server 20 or separate polling computer receives one or more messages (received by request or otherwise), the transcription server 20 or separate polling computer places the messages into one or more queue servers or applications 60. The queue servers 60 look for an open or available processor or translation server 70. As shown in
In addition to transcribing messages as just described, some embodiments of a translation server 70 generate an index of keywords based upon the transcribed text. For example, in some embodiments, the translation server 70 removes those words that are less commonly searched and/or less useful for searching (e.g., I, the, a, an, but, and the like) from messages, which leaves a number of keywords that can be stored in memory available to the translation servers 70. The resulting “keyword index” includes the exact positions of each keyword in the transcribed text, and, in some cases, includes the exact location of each keyword in the corresponding audio message. This keyword index enables users to perform searches on the transcribed text of the message. For example, a user accessing the transcribed text of a message (whether for purposes of correcting any errors in the transcribed text or for searching within the transcribed text) can select one or more words from the keyword index of the message generated earlier. In so doing, the exact locations (e.g., page and/or line numbers) of such words can be provided quickly and efficiently—in many cases significantly faster and with less processing power than performing a standard search for the word through the entire text of the message. The system 10 can provide the keyword index to a user in any suitable manner, such as in a pop-up or pull-down menu included in an interface of the system 10 accessed by a user via a client computer 40 during text correction or searching of a transcribed message (described below).
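By way of illustration only, the keyword index described above can be sketched as follows. The stop-word list, the function name build_keyword_index, and the use of word positions (rather than page or line numbers, or offsets into the audio) are assumptions made for this sketch and are not taken from the described embodiments.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the description only names
# "I, the, a, an, but" as examples of words that are dropped.
STOP_WORDS = {"i", "the", "a", "an", "but", "and", "is", "to", "of"}

def build_keyword_index(text):
    """Map each keyword to the 0-based word positions where it occurs."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9']+", text)):
        word = match.group().lower()
        if word not in STOP_WORDS:
            index[word].append(position)
    return dict(index)

index = build_keyword_index(
    "Call the vendor and confirm the order. The order is urgent."
)
# Looking up a keyword is a single dictionary access instead of a
# scan through the whole transcript.
order_positions = index["order"]
```

Because the index is built once at transcription time, a later lookup does not need to re-read the text, which is consistent with the speed advantage described above.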
Also, in some embodiments, a translation server 70 generates two or more possible candidates for a transcription of a spoken word or phrase from an audio message. The most likely candidate is displayed or otherwise used to generate the transcribed message, and the less likely candidate(s) are saved in a memory accessible by the translation server 70 and/or by another server or client computer 40 as needed. This capability can be useful, for example, during correction of the transcribed message (described below). In particular, if a word in the transcribed message is wrong, a user can obtain other candidate(s) identified by the translation server 70 during transcription, which can speed up and/or simplify the correction process.
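One possible way to retain the less likely candidates alongside the displayed transcription is sketched below. The WordSlot class, its method names, and the numeric scores are hypothetical and serve only to illustrate the idea of keeping ranked alternatives for later correction.

```python
from dataclasses import dataclass

@dataclass
class WordSlot:
    """One spoken word with every candidate the recognizer proposed."""
    candidates: list  # (word, score) pairs; higher score = more likely

    def ranked(self):
        # Most likely candidate first.
        return sorted(self.candidates, key=lambda c: c[1], reverse=True)

    def best(self):
        return self.ranked()[0][0]

    def alternatives(self):
        # Everything except the displayed candidate, for the corrector.
        return [word for word, _ in self.ranked()[1:]]

slots = [
    WordSlot([("please", 0.98)]),
    WordSlot([("chip", 0.40), ("ship", 0.55), ("sheep", 0.05)]),
    WordSlot([("ten", 0.70), ("tin", 0.30)]),
]

# The displayed transcript uses only the top candidate of each slot.
transcript = " ".join(slot.best() for slot in slots)
```

If the corrector decides that "ship" is wrong, slots[1].alternatives() offers the remaining candidates without re-running recognition.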
Once a message is transcribed, the system 10 can allow a user to search a message for particular words and/or phrases. This searching capability can be used during correction of a transcribed message as described below or when the file is searched for particular words (whether a search for such words is performed on the file alone or in combination with one or more other files). For example, using the indexed message, a user viewing generated text data can select a word or phrase included in the text data and, in some embodiments, can hear the corresponding portion of the audio data from which the text data was generated. In some embodiments, the system 10 is adapted to enable a user to search some or all transcribed files accessible by the transcription server 20, regardless of whether such files have been corrected. Also, the system 10 can enable a user to search transcribed messages using Boolean and/or other search terms.
Search results can be generated in a number of manners, such as in table form enabling a user to select one or more files in which a word or phrase has been found and/or one or more locations at which a word or phrase has been found in a particular message. The search results can also be sorted in one or more manners according to one or more rules (e.g., date, relevance, number of instances in which the word or phrase has been found in a message, and the like) and can be printed, displayed, or exported as desired. In some embodiments, the search results also provide the text around the found word or phrase. The search results can also include additional information, such as the number of instances in which a word or phrase has been found in a file and/or the number of files in which a word or phrase has been found.
In the embodiment shown in
After the translation servers 70 index and translate audio data, the audio data and/or the generated text data is stored. The audio data and text data can be stored internally by the transcription server 20 or can be stored externally to one or more data storage devices (e.g., databases, servers, and the like). In some embodiments, a user (e.g., a user associated with a particular audio data source 30) decides how long audio data and/or text data is stored by the transcription server 20, after which time the audio data and/or text data can be automatically deleted, over-written, or stored in another storage device (e.g., a relatively low-accessibility mass storage device). An interface of the system 10 (e.g., a settings interface or website) enables a user to specify a time limit for audio data and/or text data stored by the transcription server 20.
As shown in
In some embodiments, the transcription server 20 sends audio data and/or corresponding, generated text data to a user as an e-mail message. The transcription server 20 can send an e-mail message to a user that includes the audio data and the text data as attached files. In other embodiments, the transcription server 20 sends an e-mail message to a user that includes a notification that audio data and/or text data is available for the user. The e-mail message can also include a link to the available audio data and/or text data. A user selects the link in order to listen to the audio data, view the text data, and/or to correct the text data. For example, a user selects the link included in the e-mail message in order to be transferred to a correction interface of the system 10, as described below with respect to
As described above, an e-mail message that includes an attached audio file is a possible source of audio data. If a user forwards or sends an e-mail message to the transcription server 20 that includes audio data, the transcription server 20 can send a return e-mail message to the user after the transcription server 20 transcribes the submitted audio file. The e-mail message can inform the user that the submitted audio data was transcribed and that corresponding text data is available. As previously noted, the e-mail message from the transcription server 20 can include the submitted audio data and/or the generated text data. Alternatively or in addition, the e-mail message from the transcription server 20 includes a link to the audio data, the generated text data, and/or an interface for listening to the audio data, viewing the text data, and/or correcting the text data.
In some embodiments, the system 10 sends audio data and/or corresponding text data to one or more predetermined destinations (e.g., a system, a data storage device, a file, etc.), any or all of which can be specified by a user of the system 10. For example, an interface of the system 10 (e.g., an administration interface or website) can enable a user to specify destination settings for audio data and/or text data. Using the interface, a user can specify a website, a blog, a document management or electronic medical record (“EMR”) system, an e-mail address, a remote printer, etc. where audio data and/or corresponding generated text data should be automatically sent (e.g., after being corrected). The destination settings can be set for individual users or groups of users (e.g., users with certain permissions).
The system 10 can also enable a user to provide destination settings for audio data and/or text data on a per-generated-text-data basis. In some embodiments, before or after audio data is transcribed, a user specifies a particular destination for the text data (e.g., from a drop down selection mechanism, a menu selection mechanism, or an input mechanism of the correction interface). Similarly, certain implementations allow a user to specify destination settings in an e-mail message. For example, if a user sends an e-mail message to the transcription server 20 that includes audio data, the user can specify destination information in the e-mail message. When a caller leaves a voice message (e.g., with the VoIP voice mail server 20a), the system 10 can also allow the caller to enter a code to designate a destination for the audio message and/or the generated text data. For example, a user can enter a number “4” (e.g., via a keypad of a telephone) to designate that the audio message and/or the generated text data should be delivered to a recipient via an e-mail message. The user can also enter an identifier of the recipient (e.g., a phone number, an e-mail address, etc.) who is to receive the audio data and/or the generated text data. For example, one or more speed dials can be established, and a user can enter a speed dial number after entering the destination code in order to identify a particular recipient. The speed dial numbers can be programmed via an interface of the system 10 (e.g., a settings interface or website). After the audio message is transcribed and the generated text data is corrected (if applicable), the transcription server 20 can send an e-mail message to the identified recipient (e.g., via a SMTP server).
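The keypad entry described above (a destination code followed by a speed dial number) could be interpreted as in the following sketch. Only the code "4" for e-mail delivery comes from the description; the single-digit code length, the speed dial table, and the example addresses are assumptions.

```python
# "4" -> e-mail is the example given in the description; any other
# codes would be implementation choices.
DESTINATION_CODES = {"4": "email"}

# Hypothetical speed dial table, as might be programmed through a
# settings interface.
SPEED_DIALS = {
    "1": "orders@example.com",
    "2": "records@example.com",
}

def parse_keypad_entry(digits):
    """Split a keypad entry into (delivery method, recipient)."""
    code, speed_dial = digits[:1], digits[1:]
    if code not in DESTINATION_CODES:
        raise ValueError(f"unknown destination code: {code!r}")
    recipient = SPEED_DIALS.get(speed_dial)
    if recipient is None:
        raise ValueError(f"unknown speed dial number: {speed_dial!r}")
    return DESTINATION_CODES[code], recipient

method, recipient = parse_keypad_entry("42")
```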
In some embodiments, to protect the privacy and security of the audio and text data, the transcription server 20 transmits data (e.g., audio data and/or text data) to the client computer 40 or another destination device using file transfer protocol (“FTP”). The transmitted data can also be protected by a secure socket layer (“SSL”) mechanism (e.g., a bank level certificate).
As noted above, the system 10 can include a correction interface and a streaming translation server 80 that a user can access (e.g., via the client computer 40) to view generated text. As described below with respect to
The correction interface also enables a user to correct generated text data. For example, if a user listens to audio data and determines that a portion of the corresponding generated text data is incorrect, the user can correct the generated text data via the correction interface. In some embodiments, the correction interface automatically identifies potentially incorrect portions of generated text data. For example, the correction interface can display potentially incorrect portions of the generated text data in a particular color or other format (e.g., via a different font, highlighting in bold, italics, underline, or any other manner). Furthermore, the correction interface can display portions of the generated text in various colors or other formats depending on the confidence that the portion of the generated text is correct. The correction interface can also insert a placeholder (e.g., an image, an icon, etc.) into text that marks portions of the generated text where text is missing (i.e., the transcription server 20 could not generate text based on the audio data). A user can select the placeholder in order to hear the audio data corresponding to the missing text and can insert the missing text accordingly.
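The confidence-dependent formatting and the placeholder for missing text can be illustrated with the sketch below. The two thresholds, the plain-text markers standing in for colors or fonts, and the "[?]" placeholder are all assumptions; the description does not specify particular values or markers.

```python
def mark_up(words, low=0.5, high=0.85):
    """Render (text, confidence) pairs with confidence-based markers.

    A text of None stands for audio that produced no transcription.
    """
    rendered = []
    for text, confidence in words:
        if text is None:
            rendered.append("[?]")          # placeholder for missing text
        elif confidence < low:
            rendered.append(f"**{text}**")  # likely wrong
        elif confidence < high:
            rendered.append(f"_{text}_")    # possibly wrong
        else:
            rendered.append(text)           # shown unmarked
    return " ".join(rendered)

line = mark_up([
    ("please", 0.95),
    ("ship", 0.60),
    ("tin", 0.30),
    (None, 0.0),
    ("units", 0.90),
])
```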
In order to assist a user in correcting generated text data, some embodiments of the correction interface automatically generate words similar to incorrectly-generated words. In this regard, a user selects a word (e.g., by highlighting, clicking, or by any other suitable manner) within generated text data that is or appears to be incorrect. Upon such selection, the correction interface suggests similar words, such as in a pop-up menu, pull-down menu, or in any other format. The user selects a word or words from the list of suggested words in order to make a desired correction.
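As a rough illustration of suggesting similar words, the sketch below uses the get_close_matches function of the Python standard library's difflib module, which ranks candidates by textual similarity. The vocabulary is hypothetical, and textual similarity is an assumption of this sketch; an actual embodiment might instead rank by acoustic similarity or reuse the candidates saved during transcription.

```python
import difflib

# Hypothetical vocabulary; in practice this might come from the
# voice model's word list.
VOCABULARY = ["invoice", "involve", "invoke", "voice", "advice", "office"]

def suggest(word, vocabulary=VOCABULARY, limit=3):
    """Return up to `limit` vocabulary words that resemble `word`."""
    return difflib.get_close_matches(
        word.lower(), vocabulary, n=limit, cutoff=0.6
    )

suggestions = suggest("invoise")
```

The closest match is listed first, so the interface can present the list in that order in a pop-up or pull-down menu.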
In some embodiments, the correction interface provides audio data and/or text data in particular formats. For example, court reporters require certain statutory formatting of their documents that identify the speaker. The correction interface (e.g., when placed in “court” mode) enables a user to input speaker names for particular audio data and/or to insert corresponding symbols for each speaker name into the text data. The user then selects a “format” selection mechanism (e.g., a button, a radio button, a drop-down menu item, or the like) included in the correction interface, and the correction interface reformats the displayed text data using the provided speaker names and format guidelines.
In some embodiments, the translation server(s) 70 are configured to automatically determine speakers in an audio file. For example, the translation server 70 can process audio files for drastic changes in voice or audio patterns. The translation server 70 then analyzes the patterns in order to identify the number of individuals or sources speaking in an audio file. In other embodiments, a user or information associated with the audio file (e.g., information included in the e-mail message containing the audio data, or stored in a separate text file associated with the audio data) identifies the number of speakers in an audio file before the audio file is transcribed. For example, a user can use an interface of the system 10 (e.g., the correction interface) to specify the number of speakers in an audio file before or after the audio file is transcribed.
After identifying the number of speakers in an audio file, the translation server(s) 70 can generate a speaker list that marks the number of speakers and/or the times in the audio file where each speaker speaks. The translation server(s) 70 can use the speaker list when creating or formatting the corresponding text data to provide markers or identifiers of the speakers (e.g., Speaker 1, Speaker 2, etc.) within the generated text data. In some embodiments, a user can update the speaker list in order to change the number of speakers included in an audio file, change the identifier of the speakers (e.g., to the names of the speakers), and/or specify that two or more speakers identified by the translation server(s) 70 relate to a single speaker or audio source. Also, in some embodiments, a user can use an interface of the system 10 (e.g., a settings interface or website) to modify the speaker list or to upload a new speaker list. For example, a user can change the identifiers of the speakers by updating a field of the correction interface that identifies a particular speaker. For example, each speaker identifier displayed within generated text data can be placed in a user-editable field. In some embodiments, changing an identifier of a speaker in one field automatically changes the identifier for the speaker throughout the generated text data.
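A speaker list of the kind described above, together with the rename and merge operations, might be represented as follows. The Turn class and the function names are hypothetical; the sketch assumes that speaker changes have already been detected.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    start: float   # seconds into the audio file where the turn begins
    speaker: str   # identifier such as "Speaker 1"
    text: str

def rename_speaker(turns, old, new):
    """Change a speaker identifier everywhere it appears."""
    for turn in turns:
        if turn.speaker == old:
            turn.speaker = new

def merge_speakers(turns, duplicate, keep):
    """Treat two detected speakers as a single speaker."""
    rename_speaker(turns, duplicate, keep)

def render(turns):
    return "\n".join(f"{turn.speaker}: {turn.text}" for turn in turns)

turns = [
    Turn(0.0, "Speaker 1", "Good morning."),
    Turn(2.5, "Speaker 2", "Good morning, counsel."),
    Turn(5.0, "Speaker 3", "Please be seated."),
]
merge_speakers(turns, "Speaker 3", "Speaker 2")  # same voice detected twice
rename_speaker(turns, "Speaker 2", "THE COURT")  # one edit updates all turns
formatted = render(turns)
```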
In some embodiments, the system 10 can also format transcribed text data based on one or more templates, such as templates adapted for particular users or businesses (e.g., medical, legal, engineering, or other fields). For example, after generating text data, the system 10 (e.g., the translation server(s) 70) can compare the text data with one or more templates. If the format or structure of the text data corresponds to the format or structure of a template and/or if the text data includes one or more keywords associated with a template, the system 10 can format the text data based on the template. For example, if the system 10 includes a template specifying the following format:
Date:
Type of Illness:
and text data generated by the system 10 is “the date today is September the 12th the year 2007, the illness is flu,” the system 10 can automatically apply the template to the text data in order to create the following formatted text data:
Date: 09/12/2007
Type of Illness: Flu
In some embodiments, the system 10 is configured to automatically apply a template to text data if text data corresponds to the template. Therefore, as the system 10 “learns” and improves its transcription quality, as described below, the system 10 also “learns” and improves its application of templates. In other embodiments, a user can use an interface of the system 10 (e.g., the correction interface) to manually specify a template to be applied to text data. For example, a user can select a template to apply to text data from a drop down menu or other selection mechanism included in the interface.
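The date and illness example given above can be reproduced with the following sketch. The regular expressions, the function name apply_template, and the decision to return None when the text does not correspond to the template are assumptions; the description does not state how correspondence with a template is determined.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Hypothetical patterns for the two fields of the example template.
DATE_PATTERN = re.compile(
    r"date(?: today)? is (?P<month>[a-z]+) (?:the )?"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,? (?:the year )?(?P<year>\d{4})",
    re.IGNORECASE,
)
ILLNESS_PATTERN = re.compile(r"illness is (?P<illness>[a-z ]+)", re.IGNORECASE)

def apply_template(text):
    """Return the formatted text, or None if the template does not fit."""
    date = DATE_PATTERN.search(text)
    illness = ILLNESS_PATTERN.search(text)
    if not (date and illness):
        return None
    month = MONTHS.get(date["month"].lower())
    if month is None:
        return None
    return (
        f"Date: {month:02d}/{int(date['day']):02d}/{date['year']}\n"
        f"Type of Illness: {illness['illness'].strip().capitalize()}"
    )

formatted = apply_template(
    "the date today is September the 12th the year 2007, the illness is flu"
)
```

Returning None rather than a partially filled template leaves the unformatted text data untouched when the text and the template do not correspond.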
The system 10 can store the formatted text data and can make the formatted text data available for review and correction, as described below. In some embodiments, the system 10 also stores or retains the unformatted text data separately from the formatted text data. By retaining the unformatted text data, the text data can be applied to new or different templates. In addition, the system 10 can use the unformatted text data to train the system 10, as described below.
The system 10 can include one or more predefined templates. In some embodiments, a user can also create a customized template and can upload the template to the system 10. For example, a user can use a word processing application, such as Microsoft® Word®, to create a text file that defines the format and structure of a customized template. The user can upload the text file to the system 10 using an interface of the system 10 (e.g., the correction interface). In some embodiments, the system 10 reformats uploaded templates. For example, the system 10 can store predefined templates and/or customized templates in a mark-up language, such as XML or HTML.
Templates can be associated with a particular user or a group of users. For example, only users with certain permission may be allowed to use or apply particular templates. In other embodiments, a user can upload one or more templates that only he or she can use or apply. Settings and restrictions for predefined and/or customized templates can be configured by a user or an administrator using an interface of the system 10 (e.g., a settings interface or website).
In some embodiments, alternatively or in addition to configuring templates, the system 10 can also enable a user to configure one or more commands that replace transcribed text with different text. For example, a user can configure the system 10 to insert the current date into text data any time audio data and/or corresponding text data includes the word “date” or the phrases “today's date,” “current date,” or “insert today's date.” Similarly, a user can configure the system 10 to start a new paragraph within transcribed text data each time audio data and/or corresponding text data includes the word “paragraph,” the phrase “new paragraph,” or a similar identifier. The commands can be defined on a per user basis and/or on a group of users basis, and settings or restrictions for the commands can be set by a user or an administrator using an interface of the system 10 (e.g., a settings interface or website).
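Text-replacement commands of this kind could be applied to transcribed text as sketched below. The function name, the date format, and the rule ordering are assumptions. The sketch deliberately omits the bare word "date" as a trigger, since a single common word would also fire in ordinary dictation; an embodiment that supports it would need some way to tell the command from the word.

```python
import re
from datetime import date

def apply_commands(text, today=None):
    """Replace spoken command phrases with the text they stand for."""
    today = today or date.today()
    stamp = today.strftime("%m/%d/%Y")
    commands = [
        # The longest phrase comes first so that it is not partially
        # consumed by the shorter "today's date" rule.
        (r"\binsert today's date\b", stamp),
        (r"\b(?:today's|current) date\b", stamp),
        # Swallow surrounding spaces so the break is clean.
        (r"\s*\bnew paragraph\b\s*", "\n\n"),
    ]
    for pattern, replacement in commands:
        # A function replacement avoids escape processing of the string.
        text = re.sub(
            pattern, lambda m, r=replacement: r, text, flags=re.IGNORECASE
        )
    return text

result = apply_commands(
    "Meeting on insert today's date new paragraph Bring the order",
    today=date(2007, 9, 12),
)
```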
Some embodiments of the system 10 also enable a user correcting text data via the correction interface to create commands and/or keyboard shortcuts. For example, the user can use the commands and/or keyboard shortcuts to stream audio data, add common words or phrases to text data, play audio data, pause audio data, or start or select objects or functions provided through the correction interface or other interfaces of the system 10. In some embodiments, a user uses the correction interface (e.g., a settings interface or website) to configure the commands and/or keyboard shortcuts. The commands and/or keyboard shortcuts can be stored on a user level and/or a group level. An administrator can also configure commands and/or keyboard shortcuts that can be made available to one user or multiple users. For example, users with particular permissions may be allowed to use particular commands and/or keyboard shortcuts. In addition, in some embodiments, a user can connect an input device to the client computer 40, such as a foot pedal, a joystick, or a microphone, that the user can use to send commands to the correction interface. For example, a user can select a word or phrase in the text data (e.g., via a keyboard or a mouse connected to the client computer 40) in order to start playing the corresponding audio data and then can use the foot pedal or other input device to move forward or backward within the audio data, pause the audio data, play the audio data, insert common words or phrases into the text data, etc.
If a user uses a microphone as an input device, the correction interface can be configured to react to commands spoken by the user. For example, the system 10 can enable a user to create commands that, when spoken by the user, cause the correction interface to perform certain actions. In some embodiments, the user can say “play,” “pause,” “forward,” “backward,” etc. to control the playing of the audio data by the correction interface. A user can also say commands that cause the correction interface to insert, delete, or edit text in transcribed text data. For example, a user can say “date,” and the correction interface can insert date information into transcribed text data.
In some embodiments, the system 10 also performs translations of transcribed text data. For example, the correction interface or another interface of the system 10 can enable a user to request a translation of transcribed text data into another language. The transcription server 20 can include one or more language translation modules configured to create text data in a particular language based on generated text data in another language. An audio source (e.g., a caller to a voicemail box or an individual submitting an e-mail message with an attached audio file to the transcription server 20) can also request or specify a language translation when an audio file is submitted to the transcription server 20.
With continued reference to the illustrated embodiment of
In some embodiments, the system 10 transcribes audio files of a predetermined size (e.g., over 20 minutes in length) in pieces in order to “pre-train” the translation server(s) 70. For example, the transcription server 20 and/or the translation server(s) 70 can divide an audio file into segments (e.g., 1 to 5 minute segments). The translation server(s) 70 can then transcribe one or more of the segments and the resulting text data can be made available to a user for correction (e.g., via the correction interface). After the transcribed segments are corrected and any corrections are applied to the training server 90 in order to “teach” the system 10, the translation server(s) 70 transcribe the complete audio file. After the complete audio file is transcribed, the transcription of the complete audio file is made available to a user for correction. Using the small segments of the audio file to pre-train the translation server(s) 70 can increase the accuracy of the transcription of the complete audio file, which can save time and can prevent errors. In some embodiments, the complete audio file is transcribed before or in parallel with one or more smaller segments of the same audio file. Once the complete audio file is transcribed, a user can then immediately review and correct the text for the complete audio file or can wait until the individual segments are transcribed and corrected before correcting the text of the complete audio file. In addition, a user can request a re-transcription of the complete audio file after one or more individual segments are transcribed and corrected. In some embodiments, if the complete audio file is transcribed before or in parallel with smaller segments and the transcription of the complete audio file has not been corrected by the time the individual segments are transcribed and corrected, the transcription server 20 and/or the translation server(s) 70 automatically re-transcribes the complete audio file.
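The division of a long audio file into segments for pre-training can be sketched as follows. The 20-minute threshold and the 5-minute segment length follow the examples given above; the function names and the representation of a segment as a (start, end) pair in seconds are assumptions.

```python
def needs_pretraining(duration_s, threshold_s=20 * 60):
    """Files longer than the threshold are transcribed in pieces first."""
    return duration_s > threshold_s

def split_into_segments(duration_s, segment_s=300.0):
    """Return (start, end) pairs, in seconds, covering the whole file."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration and segment length must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        # The final segment is shortened to end with the file.
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

segments = split_into_segments(25 * 60)  # a 25-minute recording
```

Each pair can then be handed to a translation server, with the corrected text for the early segments applied to the training server before the complete file is transcribed.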
The voice independent model developed by the transcription server 20 can be shared and used by multiple transcription servers 20. For example, in some embodiments, the voice independent model developed by a transcription server 20 can be copied to or shared with other transcription servers 20. The model can be copied to other transcription servers 20 based on a predetermined schedule, anytime the model is updated, on a manual basis, etc. In some embodiments, a lead transcription server 20 collects audio and text data from other transcription servers 20 (e.g., audio and text data which has not been applied to a training server) and transfers the data to a lead training server 90. The lead transcription server 20 can collect the audio and text data during periods of low network or processor usage. The individual training servers 90 of one or more transcription servers 20 can also take turns processing batches of audio data and copying updated voice models to other transcription servers 20 (e.g., in a predetermined sequence or schedule), which can ensure that each transcription server 20 is using the most up-to-date voice model.
In some embodiments, individuals may be hired to correct transcribed audio files (“correctors”), and the correctors may be paid on a per-line, per-word, per-file, time, or the like basis. The transcription server 20 can track performance data for the correctors. The performance data can include line counts, usage counts, word counts, etc. for individual correctors and/or groups of correctors. In some embodiments, the transcription server 20 enables a user (e.g., an administrator) to access the performance data via an interface of the system 10 (e.g., a website). The user can use the interface to input personal information associated with the performance data, such as the correctors' names, employee numbers, etc. In some embodiments, the user can also use the interface to initiate and/or specify payments to be made to the correctors. The performance data (and any related information provided by a user, such as an administrator) can be stored in a database and/or can be exported to an external accounting system, such as accounting systems and solutions provided by Paychex, Inc. or QuickBooks® provided by Intuit, Inc. The transcription server 20 can send the performance data to an external accounting system via a direct connection or an indirect connection, such as the Internet. The transcription server 20 can also generate a file that can be stored to a portable data storage medium (e.g., a compact disk, a jump drive, etc.). The file can then be uploaded to an external accounting system from the portable data storage medium. An external accounting system can use the performance data to pay the correctors, generate financial documents, etc.
In some embodiments, a user may not desire or need transcribed text data to be corrected. For example, a user may not want text data that is substantially accurate to be corrected. In these situations, the system 10 can allow a user to designate an accuracy threshold, and the system 10 can apply the threshold to determine whether text data should be corrected. For example, if generated text data has a percentage or other measurement of accurate words (as determined by the transcription server 20) that is equal to or greater than the accuracy threshold specified by the user, the system 10 can allow the text data to skip the correction process (and the associated training or learning process). The system 10 can deliver any generated text data that skips the correction process directly to its destination (e.g., directly sent to a user via an e-mail message, directly stored to a database, etc.). In some embodiments, the accuracy threshold can be set by a user using an interface of the system 10 (e.g., a website). The threshold can be applied to all text data or only to particular text data (e.g., only text data generated based on audio data received from a particular audio source, only text data that is associated with a particular destination, etc.).
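By way of illustration only, the threshold test described above could be sketched as follows in Python. The function names are hypothetical, and the sketch assumes that the accuracy measurement is a simple percentage of words judged accurate. The description above also permits other measurements.

```python
def accuracy_percent(accurate_words: int, total_words: int) -> float:
    """Percentage of words judged accurate (assumed measurement)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * accurate_words / total_words


def should_skip_correction(accurate_words: int,
                           total_words: int,
                           threshold_percent: float) -> bool:
    """Return True when text data may skip correction and go to its destination.

    Text data skips correction when its accuracy is equal to or greater than
    the user-specified threshold.
    """
    return accuracy_percent(accurate_words, total_words) >= threshold_percent
```

For example, with a threshold of 95 percent, text data having 95 accurate words out of 100 would skip the correction process, and text data having 94 accurate words out of 100 would not.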
After the audio data is indexed and transcribed, the audio data and/or generated text data is made available to a user for review and/or correction via a correction interface (step 120). If the text data needs to be corrected (step 130), the user makes the corrections and submits the corrections to the training server 90 of the transcription server 20 (step 140). The corrections are placed in a training queue and are prepared for archiving (step 150). Periodically, the training server 90 obtains all the corrected files from the training queue and begins a training cycle for an independent voice model (step 160). In other embodiments, the training server 90 obtains such corrected files immediately, rather than periodically. The training server 90 can be a server that is separate from the transcription server 20, and can update the transcription server 20 and/or any number of other servers on a continuous or periodic basis. In other embodiments, the training server 90, transcription server 20, and any other servers associated with the system 10 can be defined by the same computer. It should be understood that, as used herein and in the appended claims, the terms “server,” “queue,” “module,” etc. are intended to encompass hardware and/or software adapted to perform a particular function.
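By way of illustration only, a training queue that is emptied at the start of each training cycle could be sketched as follows in Python. The class `TrainingQueue` and its methods are hypothetical. The sketch is a single-process, in-memory stand-in and does not model the training itself.

```python
from collections import deque
from typing import Deque, List


class TrainingQueue:
    """Holds corrected files until a training cycle collects them."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def submit(self, corrected_file: str) -> None:
        """Place a corrected file in the queue (compare step 150)."""
        self._items.append(corrected_file)

    def drain(self) -> List[str]:
        """Return all queued files in submission order and empty the queue
        (compare step 160)."""
        batch = list(self._items)
        self._items.clear()
        return batch
```

A periodic embodiment would call `drain()` on a schedule, whereas an immediate embodiment would call it after each `submit()`.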
Any portion or all of the transcription, correction, and training process performed by the system 10 can be performed by one or more polling managers (e.g., associated with the transcription server 20, the training server 90, or other servers). In some embodiments, the transcription server 20 and/or the training server 90 utilizes one or more “flags” to indicate a stage of a file. By way of example only, these flags can include, without limitation or requirement: (1) waiting for transcription; (2) transcription in progress; (3) waiting for correction; (4) correction completed; (5) waiting for training; (6) training in progress; (7) retention; (8) move to history pending; and (9) history.
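By way of illustration only, the nine example flags could be represented as an enumeration, as in the following Python sketch. The names `FileStage` and `next_stage` are hypothetical. The strictly sequential progression is an assumption made for the sketch. The description does not require it, and text data that skips correction would not pass through every stage.

```python
from enum import IntEnum


class FileStage(IntEnum):
    WAITING_FOR_TRANSCRIPTION = 1
    TRANSCRIPTION_IN_PROGRESS = 2
    WAITING_FOR_CORRECTION = 3
    CORRECTION_COMPLETED = 4
    WAITING_FOR_TRAINING = 5
    TRAINING_IN_PROGRESS = 6
    RETENTION = 7
    MOVE_TO_HISTORY_PENDING = 8
    HISTORY = 9


def next_stage(stage: FileStage) -> FileStage:
    """Advance a file to the following stage (assumed linear order).

    A file already in HISTORY stays in HISTORY.
    """
    if stage is FileStage.HISTORY:
        return stage
    return FileStage(stage + 1)
```

A polling manager in such a sketch would look up files by flag value and call `next_stage` once the work for the current stage is finished.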
In some embodiments, the only action required by a user as a message moves through different stages of the system 10 is to indicate that correction of the message has been completed. In other embodiments, a less automated system can exist, requiring more input from a user during the transcription, correction, and training process.
Another example of a method by which messages are processed in the system 10 is illustrated in
With reference to the exemplary embodiment illustrated in
The archival process allows files to move out of the system 10 immediately or based at least in part upon set retention rules. Archived or historical files allow the system 10 to keep current files available quickly while older files can be encrypted, compressed, and stored. Archived files can also be returned to a user (step 222) in any manner as described above.
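By way of illustration only, a retention rule and the compression of an archived file could be sketched as follows in Python. The function names and the 30-day default are hypothetical. The sketch uses the standard `zlib` module for compression and omits encryption, which would require a library outside the standard library.

```python
import zlib
from datetime import datetime, timedelta


def should_archive(completed_at: datetime,
                   now: datetime,
                   retention: timedelta = timedelta(days=30)) -> bool:
    """Return True once a file has been retained for the configured period.

    A retention of timedelta(0) moves a file out of the system immediately.
    """
    return now - completed_at >= retention


def compress_for_archive(text: str) -> bytes:
    """Compress transcribed text for storage (encryption not shown)."""
    return zlib.compress(text.encode("utf-8"))


def restore_from_archive(blob: bytes) -> str:
    """Recover the text of an archived file, e.g., to return it to a user."""
    return zlib.decompress(blob).decode("utf-8")
```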
In some embodiments, an interface of the system 10 (e.g., the correction interface) shows the stage of one or more files in the transcription, correction, and training process. This process can be automated and database driven so that all files are used to build and train the voice independent model.
It should be noted that a database-driven system 10 allows redundancy within the system. Multiple servers can share the load of the process described above. Also, multiple servers across different geographic regions can provide backup in the event of a natural disaster or other problem at one or more sites.
After the audio data is transcribed, the transcription server 20 sends a correction notification to a user (step 252). In some embodiments, the correction notification includes an e-mail notification, as shown in
The transcription server 20 can send the correction notification to a user who is assigned to the correction of transcribed audio data associated with a particular owner or destination. For example, as the transcription server 20 transcribes voicemail messages for a particular member of an organization, the transcription server 20 can send a notification to a secretary or assistant of the member. An administrator can use an interface of the system 10 (e.g., a website) to configure one or more recipients who are to receive the correction notifications for a particular destination (e.g., a particular voicemail box). An administrator can also specify settings for notifications, such as the type of notification to send (e.g., e-mail, text, etc.), the addresses or identifiers of the notification recipients (e.g., e-mail addresses, telephone numbers, media access control (“MAC”) addresses, etc.), the information to be included in the notifications, etc. For example, an administrator can establish rules for sending correction notifications, such as a rule that transcriptions associated with audio data received by the transcription server 20 from a particular audio data source should be corrected by particular users. In addition, as described above, an administrator can set one or more accuracy thresholds, which can dictate when transcribed audio data skips the correction process.
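By way of illustration only, administrator-configured notification rules could be sketched as follows in Python. The class `NotificationRule`, the function `rules_for`, and the sample addresses are hypothetical. The sketch only selects which recipients to notify and does not send any message.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NotificationRule:
    destination: str      # e.g., an identifier of a particular voicemail box
    recipient: str        # e.g., an e-mail address or telephone number
    kind: str = "e-mail"  # type of notification to send


def rules_for(rules: List[NotificationRule], destination: str) -> List[NotificationRule]:
    """Return the rules that apply to transcriptions for one destination."""
    return [rule for rule in rules if rule.destination == destination]
```

For example, two rules configured for the same voicemail box would cause both an assistant and a backup corrector to be notified when a message for that box is transcribed.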
To read the e-mail correction notification 254, a user can select the notification 254 (e.g., by clicking on, highlighting, etc.) in the inbox 255. After the user selects the notification 254, the e-mail application can display the contents of the notification 254, as shown in
Returning to
As shown in
After the user enters his or her credentials and/or identifying information, the correction interface 260 can verify the entered information, and, if verified, the correction interface 260 can display a main page 272, as shown in
As shown in
A user can select the settings selection mechanism 286 in order to access one or more setting pages (not shown) of the correction interface 260. The setting pages can enable a user to change his or her notification preferences, correction interface preferences (e.g., change a username and/or password, set a time limit for transcriptions displayed in a history page), etc. For example, as described above, a user can use the settings pages to specify destination settings for audio data and/or generated text data, configure commands and keyboard shortcuts, specify accuracy thresholds, configure the transcription server 20 to record voicemails from an external voicemail server, turn on or off particular features of the correction interface 260 and/or the system 10, etc. In some embodiments, the number and degree of settings configurable by a particular user via the settings pages are based on the permissions of the user. An administrator can use the setting pages to specify global settings, group settings (e.g., associated with particular permissions), and individual settings. In addition, an administrator can use a setting page of the correction interface 260 to specify users of the correction interface 260 and can establish usernames and passwords for users. Furthermore, as described above with respect to
As shown in
The view area 274 can also list additional information for each transcription. For example, as shown in
Returning to
In some embodiments, the correction view area 302 also includes a recording control area 304. The recording control area 304 can include one or more selection mechanisms for listening to or playing the audio data associated with the text data 303 displayed in the correction view area 302. For example, as shown in
As shown in
In some embodiments, the recording control area 304 can also include a speed control mechanism (not shown) that allows a user to decrease and increase the playback speed of audio data. For example, the recording control area 304 can include a speed control mechanism that includes one or more selection mechanisms (e.g., buttons, timelines, etc.). A user can select (e.g., click, drag, etc.) the selection mechanisms in order to increase or decrease the playback speed of audio data by a particular amount. In some embodiments, the speed control mechanism can also include a selection mechanism that a user can select in order to play audio data at normal speed.
In some embodiments, a user can hide the recording control area 304. For example, as shown in
The correction view area 302 can also include a save selection mechanism 316. A user can select the save selection mechanism 316 in order to save the current state of the corrected text data 303. A user can select the save selection mechanism 316 at any time during the correction process.
The correction view area 302 can also include a table 318 that lists, among other things, the system's confidence in its transcription quality. For example, as shown in
Returning to
A user can select the save and mark as complete selection mechanism 336 in order to save the corrections made by the user and move the transcription to the user's history. Once the corrections are saved and moved to the history folder, the user can access the corrected transcription (e.g., via the history page of the correction interface 260) but may not be able to edit the corrected transcription.
A user can select the save, mark as complete and send to owner selection mechanism 338 in order to save the corrected transcription, move the corrected transcription to the user's history folder, and send the corrected transcription and/or the associated audio data to the owner or destination of the audio data (e.g., the owner of the voicemail box). As described above, a destination for corrected transcriptions can include files, e-mail inboxes, remote printers, databases, etc. For example, the correction interface 260 can send a message notification to the owner of the transcription that includes the corrected transcription (e.g., as text within the message or as an attached file).
Once a user selects a save option, the user can select an accept selection mechanism 340 in order to accept the selected option or can select a cancel selection mechanism 342 in order to cancel the selected option. In some embodiments, if a user selects the cancel selection mechanism 342, the correction interface 260 returns the user to the correction page 300.
A user can also select a complete selection mechanism 296 included in the main page 272 of the correction interface 260 in order to submit or save transcriptions. In some embodiments, if a user selects a complete selection mechanism 296 included in the main page 272, the correction interface 260 displays the save options page 330 as described above with respect to
The embodiments described above and illustrated in the figures are presented by way of example only and are not intended as a limitation upon the concepts and principles of the present invention. As such, it will be appreciated by one having ordinary skill in the art that various changes in the elements and their configuration and arrangement are possible without departing from the spirit and scope of the present invention. For example, in some embodiments the transcription server 20 utilizes multiple threads to transcribe multiple files concurrently. This process can use a single database or a cluster of databases holding temporary information to assist in multiple thread transcription on the same or different machines. Each system or device included in embodiments of the present invention can also be implemented by one or more machines and/or one or more virtual machines.
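By way of illustration only, the concurrent transcription of multiple files using multiple threads could be sketched as follows in Python with the standard `concurrent.futures` module. The function `transcribe_all` and its `transcribe` argument are hypothetical. The caller supplies the actual transcription routine, and no database is modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def transcribe_all(paths: List[str],
                   transcribe: Callable[[str], str],
                   max_workers: int = 4) -> Dict[str, str]:
    """Transcribe several files concurrently and map each path to its text.

    `transcribe` is a caller-supplied function (an assumption of this sketch)
    that takes one file path and returns the generated text data.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths.
        texts = list(pool.map(transcribe, paths))
    return dict(zip(paths, texts))
```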
Various features and advantages of the invention are set forth in the following claims.
Claims
1. A method of correcting transcribed text, the method comprising:
- receiving audio data from one or more audio data sources;
- transcribing the audio data based on a voice model to generate text data;
- making the text data available to a plurality of users over at least one computer network;
- receiving corrected text data over the at least one computer network from the plurality of users; and
- modifying the voice model based on the corrected text data.
2. The method of claim 1, wherein receiving audio data from one or more audio data sources includes receiving audio data from a VoIP voicemail server.
3. The method of claim 1, wherein receiving audio data from one or more audio data sources includes receiving audio data from a client computer over at least one computer network.
4. The method of claim 1, wherein receiving audio data from one or more audio data sources includes receiving audio data in an e-mail message.
5. The method of claim 1, wherein receiving audio data from one or more audio data sources includes requesting the audio data from the one or more audio data sources.
6. The method of claim 5, further comprising prioritizing the audio data.
7. The method of claim 1, wherein transcribing the audio data based on a voice model to generate text data includes transcribing the audio data based on a voice independent model to generate text data.
8. The method of claim 1, further comprising sending a correction notification to at least one of the plurality of users.
9. The method of claim 8, wherein sending a correction notification to at least one of the plurality of users includes sending an e-mail correction notification to at least one of the plurality of users.
10. The method of claim 1, further comprising indexing the text data.
11. The method of claim 1, further comprising sending a message notification to a user.
12. The method of claim 11, wherein sending a message notification to a user includes sending an e-mail message notification to the user.
13. The method of claim 1, further comprising delivering the corrected text data to at least one destination.
14. The method of claim 13, further comprising receiving the at least one destination from a user.
15. A system for correcting transcribed text, the system comprising:
- a transcription server receiving audio data from one or more audio data sources;
- at least one translation server to transcribe the audio data based on a voice model to generate text data;
- a correction interface accessible by a plurality of users over at least one computer network and providing access to the text data and receiving corrected text data from the plurality of users; and
- at least one training server receiving the corrected text data and modifying the voice model based on the corrected text data.
16. The system of claim 15, wherein the one or more audio data sources includes a VoIP voicemail server.
17. The system of claim 15, wherein the voice model includes a voice independent model.
18. The system of claim 15, wherein the transcription server sends a correction notification to at least one of the plurality of users.
19. The system of claim 18, wherein the correction notification includes an e-mail correction notification.
20. The system of claim 15, wherein the correction interface provides access to the audio data.
Type: Application
Filed: Mar 14, 2013
Publication Date: Aug 1, 2013
Applicant: VOVISION, LLC (Stoughton, WI)
Inventor: VOVISION, LLC (Stoughton, WI)
Application Number: 13/803,733