Systems and Methods for Improved Digital Transcript Creation Using Automated Speech Recognition
This disclosure relates generally to systems, methods, and computer readable media for providing improved insights and annotations to enhance recorded audio, video, and/or written transcriptions of testimony. For example, in some embodiments, a method is disclosed for correlating non-verbal cues recognized from an audio and/or video recording of testimony to the corresponding testimony transcript locations. In other embodiments, a method is disclosed for providing testimony-specific artificial intelligence-based insights and annotations to a testimony transcript, e.g., based on the use of machine learning, natural language processing, and/or other techniques. In still other embodiments, a method is disclosed for providing smart citations to a testimony transcript, e.g., which track the location of semantic constructs within the transcript over the course of various modifications being made to the transcript. In yet other embodiments, a method is disclosed for providing intelligent speaker identification-related insights and annotations to an audio recording of a testimony transcript.
This disclosure relates generally to systems, methods, and computer readable media for providing improved insights and annotations for recorded audio, video, and/or written transcriptions of testimony.
BACKGROUND

For years, court reporters have provided legal support services, including testimony transcription services, to the nation's top law firms, corporations, and governmental agencies. Traditionally, a human court reporter uses a specialized stenography machine, which is today connected to a computer running a specialized transcription application, to transcribe spoken testimony. As the court reporter transcribes the testimony that he or she is hearing in shorthand via the stenography machine, the shorthand version of the text is converted on-the-fly into plain English. This first pass at the transcription is commonly referred to as “the rough.” After the testimony has concluded, e.g., hours later or days later, the court reporter (or another entity) edits “the rough” file for clarity and/or to make any corrections noted at the time of the testimony or during the editing process. Once the transcript is in an acceptable format, the file is typically exported to a more widely-usable format, e.g., an ASCII text format. Later, e.g., in a post-production phase, corresponding audio and/or video files may be produced and synched to the transcript text, and multiple versions of the transcript text may be provided, based on a given client's needs (e.g., 1-page per letter page, 2-pages per letter page, 4-pages per letter page, etc.), e.g., in PDF, ASCII, a word processing document format, or other desired format.
One issue with this traditional model of transcription is the difficulty human beings have in being able to transcribe spoken words accurately at the typical rate of spoken testimony. Moreover, legal proceedings typically involve many proper nouns, such as names and locations, as well as complex medical, legal, financial, or technical terms. Another issue is the availability of human court reporters with sufficient skill to serve as transcribers in live testimony settings. For example, studies have shown that there is nearly a 98% dropout rate from court reporting school, and, in 2017, only 275 people graduated from stenographic court reporting schools in the United States, as many human court reporters become overwhelmed with the amount of practice required and the typical typing speeds needed to work in the field (e.g., between 200-250 words per minute). Also, typically, a 96% accuracy rate in transcription is required, with the remaining parts corrected for and filled in during a “scoping” process, as will be described in more detail later.
Thus, in the future, it is likely that human court reporters will need to be aided (to a significant extent) by intelligent machine labor. The subject matter of the present disclosure is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above. To address these and other issues, described herein are techniques that enable an AI-powered transcription system, e.g., one that leverages data mining, pattern recognition, and natural language processing techniques, and that has the ability to provide improved insights and annotations for recorded audio, video, and/or written transcriptions of testimony. Such systems may also ultimately hold the promise to extract deeper (and more valuable) inferences from testimony transcripts than merely the words that were spoken during the live testimony.
SUMMARY

According to some embodiments described herein, a method is disclosed, comprising: obtaining an audio and/or video recording of a testimony; obtaining a transcript of the testimony; scanning the audio and/or video recording to locate one or more emotional or non-verbal cues; linking the located one or more emotional or non-verbal cues to corresponding portions of the transcript; and updating the corresponding portions of the transcript with indications of the corresponding located one or more emotional or non-verbal cues.
According to other embodiments, a method is disclosed, comprising: obtaining a transcript of a testimony; analyzing the transcript using one or more artificial intelligence (AI)-based techniques; detecting one or more logical inconsistencies in the transcript based, at least in part, on the analysis of the transcript; detecting one or more potential transcription errors in the transcript based, at least in part, on the analysis of the transcript; and updating the corresponding portions of the transcript with indications of the corresponding detected one or more logical inconsistencies and one or more potential transcription errors.
According to still other embodiments, a method is disclosed, comprising: obtaining a transcript of a testimony; identifying one or more semantic constructs in the transcript; tagging each of the one or more semantic constructs with a unique token identifier; associating a current position of each of the one or more semantic constructs with its unique token identifier; receiving one or more updates to the transcript; and updating one or more of the associated positions of one or more of the one or more semantic constructs based on the received one or more updates to the transcript.
According to yet other embodiments, a method is disclosed, comprising: obtaining a first audio recording of a first testimony, wherein the first audio recording comprises one or more first speaking parties; tagging the obtained first audio recording with one or more unique speaker identifiers, wherein each of the one or more unique speaker identifiers corresponds to one of the one or more first speaking parties; determining at least one characteristic of the tagged obtained first audio recording; storing the tagged obtained first audio recording and corresponding at least one determined characteristic in a repository; obtaining a second audio recording of a second testimony, wherein the second audio recording comprises one or more second speaking parties; comparing the obtained second audio recording to one or more audio recordings stored in the repository; and in response to finding at least one matching audio recording in the repository, updating the obtained second audio recording with one or more speaker cues, wherein the one or more speaker cues are based, at least in part, on the at least one matching audio recording in the repository.
Various programmable electronic devices are disclosed herein as well, in accordance with the method embodiments enumerated above. Such electronic devices may include a display; a user interface; one or more processors; and a memory coupled to the one or more processors. Instructions may be stored in the memory, the instructions causing the one or more processors to execute instructions in accordance with the various method embodiments enumerated above. According to other embodiments, instructions may be stored on non-transitory program storage devices for causing one or more processors to execute instructions in accordance with the various method embodiments enumerated above.
Disclosed are systems, methods, and computer readable media for providing improved insights and annotations to enhance recorded audio, video, and/or written transcriptions of testimony, e.g., testimony that has been obtained, at least in part, via the use of automated speech recognition (ASR) technologies. Transcripts, as used herein, may refer to a result of transcribing legal testimony (e.g., in a deposition or courtroom setting), or any other spoken words in any context for which there is a desire to have an audio, video, and/or written transcription. For example, in some embodiments, a method is disclosed for correlating non-verbal cues recognized from an audio and/or video recording of a testimony to the corresponding testimony transcript locations. In other embodiments, a method is disclosed for providing testimony-specific artificial intelligence (AI)-based insights and annotations to a testimony transcript, e.g., based on the use of machine learning, natural language processing, and/or other techniques. In still other embodiments, a method is disclosed for providing smart citations to a testimony transcript, e.g., which track the location of semantic constructs within the transcript over the course of various modifications being made to the transcript. In yet other embodiments, a method is disclosed for providing intelligent speaker identification-related insights and annotations to an audio recording of a testimony transcript.
Referring now to
Computing device 102 may be executing one or more instances of improved digital court reporting software 103. Software 103 may provide ‘standard’ court reporting functionality, as well as the ability to provide one or more improved insights and annotations for audio/video recorded and/or written transcriptions, which will be discussed in greater detail below with regard to
Improved DCR systems may alter (and improve upon) the state of the art of transcription and the court reporting industry by enabling users of such documents to easily look across various transcript artifacts and compare them by sub-document level entities. For example, a user may wish to compare testimony across proceedings by the same speaker, by topic, by attendees, and/or by date/time. The improved DCR systems may not only be able to identify these relationships, but can also offer probability-based insights and recommendations. For example, expert witnesses who have a similar set of expertise, based on the language used during a given testimony, may be suggested or recommended as potential experts for the given lawsuit or proceeding. Additional features extracted from the audio and/or video of recorded proceedings may be leveraged to assign certain non-verbal cues to portions of the testimony, as will be described in greater detail below. Other features of interest in improved DCR systems may include the following:
Timestamps: Having timestamps for each word lets an editor synchronize to particular audio portions of the transcript if there is a desire to re-listen to a portion. Timestamps also allow playback applications to play the audio in sync with the transcription text.
Word Alternatives: Word alternatives are second or third choice words for what the system may have initially determined was spoken during live testimony. An application could provide these word alternatives as drop downs or accept them through some other form of user feedback.
Confidence Scores: Scores that show how confident the system was in each transcribed word result; these scores are another source of potential feedback to the user.
Smart Formatting: Formatting that may include recognizing and formatting dates, times, series of digits and numbers, phone numbers, currency values, and internet addresses, as well as capitalizing many proper nouns and recognizing sentence starts and ends.
Other Features: Other features that may be desirable in DCR systems may include, e.g.: video captioning, text search, keyword spotting, topic identification, vocabulary customization, and real-time transcribing. Within a single transcript, and/or over time with many transcripts, custom dictionaries and word ontologies (i.e., data structures describing the relationship between words) may be built to improve speech recognition accuracy and add context to the transcriptions.
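The per-word features described above (timestamps, word alternatives, and confidence scores) may be understood with a minimal sketch. The data structure and field names below are illustrative only and are not tied to any particular ASR service's API; the 0.80 threshold is a hypothetical editor preference.

```python
from dataclasses import dataclass, field

@dataclass
class WordResult:
    """One recognized word, carrying the per-word features discussed above.

    Field names are hypothetical, not drawn from any specific ASR API.
    """
    text: str
    start_sec: float          # timestamp: where the word begins in the audio
    end_sec: float            # timestamp: where the word ends
    confidence: float         # recognizer confidence, 0.0-1.0
    alternatives: list = field(default_factory=list)  # 2nd/3rd-choice words

def low_confidence_words(words, threshold=0.80):
    """Flag words an editor may want to re-listen to, with audio seek offsets
    (enabled by the per-word timestamps) and drop-down word alternatives."""
    return [(w.text, w.start_sec, w.alternatives)
            for w in words if w.confidence < threshold]

words = [
    WordResult("the", 0.00, 0.12, 0.99),
    WordResult("deponent", 0.12, 0.71, 0.97),
    WordResult("recused", 0.71, 1.20, 0.55, ["refused", "reused"]),
]
flagged = low_confidence_words(words)
```

An editing application could seek the audio player to each flagged word's `start_sec` and offer its `alternatives` as suggested corrections.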
According to some embodiments, computing device 102 may be in communication with processing resources hosted via a remote server, e.g., a cloud service provider 110, in real-time during testimony (or after live testimony has concluded), for the provision of various ASR transcription services. In recent years, the accuracy rates of real-time cloud transcription services have increased to the point where they are typically reliable enough to provide a first rough pass of testimony for a testimony transcript. Moreover, according to some embodiments, an improved DCR system may be designed with sufficient flexibility, such that it could interchangeably access and utilize the APIs of different cloud service providers. For example, in some instances, a highest quality transcription provider may be desired, whereas, in other instances, a lowest cost provider of cloud transcription services may be desired. In still other embodiments, the real-time ASR transcription services may be provided by computing device 102 itself, i.e., even if there is no access to remote ASR transcription resources.
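The flexibility to interchangeably access different cloud providers' APIs, as described above, could be sketched as a thin selection layer. The provider names, accuracy figures, and per-minute costs below are entirely hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    """A cloud ASR backend; all values here are hypothetical examples."""
    name: str
    accuracy: float          # historical word accuracy, 0.0-1.0
    cost_per_minute: float   # USD per audio minute

PROVIDERS = [
    Provider("cloud-asr-a", accuracy=0.96, cost_per_minute=0.024),
    Provider("cloud-asr-b", accuracy=0.93, cost_per_minute=0.009),
]

def pick_provider(providers, prefer="quality"):
    """Choose a transcription backend per engagement: highest quality in some
    instances, lowest cost in others, as described above."""
    if prefer == "quality":
        return max(providers, key=lambda p: p.accuracy)
    return min(providers, key=lambda p: p.cost_per_minute)
```

The same selection hook could also fall back to on-device recognition on computing device 102 when no remote resources are reachable.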
As illustrated in
DCR system 100 may further comprise one or more multi-channel external sound interfaces 106. In some embodiments, the sound interface 106 may connect to computing device 102 via a USB interface or other desired interface for data exchange (e.g., any wired or wireless data communication interface). In still other embodiments, the sound interface 106 may be integrated into computing device 102. One role of sound interface 106 is to receive the recorded audio from microphones 104 and transmit the audio to computing device 102, so that the court reporting software 103 (or later, e.g., via production tools 208) may mix, edit, synchronize and/or encode the recorded audio for production to a client, e.g., in conjunction with a final testimony transcript file. The recorded audio may comprise 1, 2, 4, 8, or N channels of audio.
In other embodiments, the sound interface 106 may further comprise one or more indicator lights 108. For example, indicator lights 108 may comprise one or more uni-colored or multi-colored LED lights. In such embodiments, the indicator lights may be used to indicate various conditions during the testimony, e.g., a green light to indicate ‘on-record’ and a red light to indicate ‘off-record,’ an assignment of different light colors to signify different active speakers (e.g., as determined via which microphone 104 is currently capturing audio and/or by other more intelligent speaker detection techniques, which will be described in further detail below with reference to
It is to be understood that microphone devices 104 may also be part of (e.g., integrated into) a video recording device (not shown in
Ultimately, the court reporting software 103 may produce a testimony transcript file 112. As will be explained in greater detail below, e.g., with reference to
Referring now to
Court Reporter (202): The role of a Court Reporter (202) may be to use a digital court reporting system (e.g., computing system 102 executing software 103) to create an initial rough draft raw transcript of a testimony. The raw transcript may be created in ASCII format (e.g., free from the particular formatting requirements that may be present in a final draft transcript) and be utilized for editing purposes only. The transcript may be created via the reporter manually typing within software as the proceedings are ongoing, and/or be aided by real-time speech-to-text recognition software and/or services (e.g., cloud-based ASR and transcription services), as will be discussed in greater detail herein. As a representative of the Court, the Reporter 202 may also be responsible for performing one or more of the following tasks: making real-time edits to the transcript, marking known errors in the transcript; making manual speaker assignments, managing the swearing in of deponents, managing annotations of the proceedings going on-/off-record, inserting annotations to mark where particular sections of the testimony begin or end (e.g., recess, exhibits, etc.), keeping a list of fixes to make after the real-time testimony concludes, and making manual/automatic adjustments or edits to the transcript as needed. At the conclusion of creating the draft raw transcript of the testimony, Reporter 202 may then send ASCII and/or audio files (e.g., a .wav audio file, as will be described in greater detail below) to Scopist 204, as shown by arrow 1 in
Scopist (204): Upon receipt of the raw draft from Reporter 202, Scopist 204 may use one or more software tools to create a proofable draft transcript of the testimony. Scopist 204 may also be responsible for performing one or more of the following tasks: applying state-specific templates to the transcript (e.g., related to jurisdiction-specific requirements for indentation, margins, header/footers, fonts, line spacing, numbering, etc.); playing back the audio file to make changes as needed; converting any shortened annotations of text to the full version of the word(s); adding one or more of: a cover page, appearance page, exhibits page, certification page, line numbering; performing any global (e.g., find/replace) edits; and performing any spell checking and/or grammar checking. At the conclusion of performing these tasks, Scopist 204 may then send a PDF of the transcript to a Proofer (206) (optionally copying Reporter 202), as shown by arrows 3A and 3B in
Proofer (206): Proofer 206 may use one or more software tools to mark-up the transcript document. Proofer 206 may also be responsible for performing one or more of the following tasks: highlighting errors in the text; and adding comments and/or suggested edits where errors or potential errors are located. At the conclusion of performing these tasks, Proofer 206 may send an annotated PDF file of the transcript back to Scopist 204 for any further edits (as shown by arrow 4A in
Scopist (204): At this stage, Scopist 204 may make final updates to the transcript and add any necessary certifications/notarizations to the transcript file. At the conclusion of performing these tasks, Scopist 204 may send an ASCII file version of the final transcript to producer/production tools 208 (as shown by arrow 5A in
Producer/Production Tools (208): A producer may then use various tools to produce a final PDF version of the transcript, which may involve adding any final formatting, logos, outlines, etc. to the transcript. Producer 208 may also be responsible for performing one or more of the following tasks: generating an .SMI file for video synchronization; generating a PDF and/or CD/DVD for the client 210 (as shown by arrow 6 in
According to some embodiments described herein, the various improved insights and annotations for recorded audio/video and/or written transcriptions may be applied by the production tools 208, before the final package is sent to the client. According to other embodiments, however, it may also be possible to apply the various improved insights and annotations to the transcriptions in real-time or near real-time, e.g., saving the annotations along with the transcription file (e.g., in a JSON object). In still other embodiments, early versions of the ASR output may be provided to a client in “rough” format, i.e., in real-time or near real-time, e.g., via a networked connection to the laptops or other computing devices of the various participants in the room where the testimony is taking place.
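Saving annotations alongside the transcription in a JSON object, as mentioned above, might look like the following minimal sketch. The schema, token identifier, and cue name are hypothetical examples, not a prescribed format.

```python
import json

# Hypothetical annotation record tying a detected non-verbal cue to a
# transcript location; field names are illustrative only.
annotation = {
    "token_id": "tok-000412",     # hypothetical semantic-construct identifier
    "page": 12,
    "lines": [88, 89],
    "cue": "high_response_delay",
    "delay_sec": 6.4,
    "source": "auto",             # auto-detected vs. manually added by a reviewer
}

# Serialize next to the transcript file, then restore for later editing passes.
serialized = json.dumps({"annotations": [annotation]}, indent=2)
restored = json.loads(serialized)
```

Because JSON round-trips losslessly, the same object can be amended by the Scopist or Proofer in later stages without disturbing the transcript text itself.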
Referring now to
Thus, according to some embodiments, the method may begin at Step 302 by obtaining a video recording of a testimony. (It is to be understood that, according to the various embodiments disclosed throughout this document, the audio and/or video recordings of testimony may be received in a streaming ‘real-time’ fashion and/or as a complete file, e.g., after the testimony has already concluded.) Next, at Step 304, the process may obtain a textual transcript of the testimony (e.g., in accordance with the exemplary process flow outlined with reference to
To further enhance the insights gained at Step 306, according to some embodiments, longitudinal patterns in the deponent's testimony may also be detected (Step 316), e.g., by comparing the testimony in the obtained transcript to other testimony given by the same witness at different points in time and/or in different transcripts the system has access to. This longitudinal analysis may reveal certain emotional and/or non-verbal cues used by the testifying party in past testimony, as well as how such cues may have correlated to the truthfulness, honesty, confidence, believability, etc., of the corresponding past testimony. Information obtained from the detection of longitudinal patterns in a deponent's testimony may then be used to further update/annotate any located potentially-relevant non-verbal cues in the presently analyzed testimony (and/or similar portions of other transcripts, e.g., other transcripts of the same deponent).
Next, at Step 318, any located potentially-relevant non-verbal cues may be linked to the corresponding portions of the presently analyzed testimony (and/or similar portions of other transcripts, e.g., other transcripts of the same deponent). For example, in the case of a textual testimony transcript (e.g., in ASCII or PDF formats), the relevant pages/lines where a deponent exhibited a particular non-verbal cue, e.g., an unusual response delay, may each be associated with the corresponding non-verbal cue from the testimony. In the case of a video testimony transcript, the begin/end timestamps of the particular non-verbal cue, for example, may be associated with each corresponding non-verbal cue. Finally, the process may update and/or tag the relevant portions of the relevant transcripts with indications of the corresponding identified non-verbal cues (Step 320). The tagging process may comprise adding the tag into a metadata portion of the transcript document or recording, and/or making a visual annotation directly into the transcript document (e.g., via bolding, underlining, highlighting, annotation, overlaying, or other emphasis techniques). These attributes may be individualized by participant, based on their own personal ranges of response within a single attribute (i.e., percentile-based vs. absolute values).
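The linking step above, along with the percentile-based (rather than absolute) individualization of attributes, can be sketched as follows. The per-line timestamp index, delay samples, and 90th-percentile cutoff are all hypothetical assumptions for illustration.

```python
from bisect import bisect_right

# Hypothetical index: start time (in seconds) of each transcript line 1..5.
line_start_times = [0.0, 4.2, 9.8, 15.1, 21.7]

def timestamp_to_line(t_sec):
    """Map a cue's timestamp to the transcript line being spoken at that
    moment, linking audio/video cues to page/line locations (Step 318)."""
    return bisect_right(line_start_times, t_sec)  # 1-based line number

def is_unusual_delay(delay_sec, deponent_delays, percentile=0.90):
    """Percentile-based test: unusual relative to this deponent's own
    personal range of response delays, not an absolute cutoff."""
    ranked = sorted(deponent_delays)
    cutoff = ranked[int(percentile * (len(ranked) - 1))]
    return delay_sec > cutoff

# A cue detected at t=10.3s falls within line 3 of the transcript.
cue_line = timestamp_to_line(10.3)
# A 6.4s pause is unusual against this deponent's typical ~1s delays.
unusual = is_unusual_delay(
    6.4, [0.8, 1.1, 0.9, 1.5, 2.0, 1.2, 0.7, 1.0, 1.3, 1.8])
```

Once `cue_line` is known, the tag could be written into transcript metadata or rendered as a visual highlight at that line.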
Referring now to
In the example 330 shown, corresponding testimony wherein the deponent has exhibited a high response delay has been highlighted using the square-hatched pattern 332. (It is to be understood that, in a given implementation, any desired highlighting or indication could be used to denote the relevance of a particular portion of the testimony transcript, e.g., colored-highlighting, colored text, font variance, graphical badging, etc.) In this case, the deponent has exhibited an unusually high amount of delay before responding “You're Right” at line 89. This may be an indication to a reviewing attorney that, e.g., the deponent was less believable when giving this particular response than he or she was when giving a normal response during this testimony (or past testimony). Likewise, corresponding testimony wherein the deponent has exhibited a low confidence level (e.g., as determined by one or more emotional detection classifiers, neural networks, etc., that have been used to analyze the obtained video recording of the testimony) has been highlighted using the diagonally-hatched pattern 334. In this case, the deponent has exhibited low confidence in his or her response of “I understand” at line 95. This may be an indication to a reviewing attorney that, e.g., the deponent was less sure or less believable in giving this particular response than he or she was about a normal response during this testimony (or past testimony).
As may be understood, such indications of potentially-relevant non-verbal cues may be initially determined and tagged through a computerized and/or automated process, and then may be modified, increased in duration/line range, deleted, etc., manually by one or more of the participants reviewing the testimony transcript (e.g., from among the various participants whose roles were described above with reference to
Referring now to
As discussed above, according to some embodiments, additional insights and annotations of non-verbal cues recognized in the video file may also be annotated in a visual fashion so that a viewer of the video file may be aware when such non-verbal cues are present in the video. One example of a graphical element for displaying such non-verbal insights during the playback of video testimony is shown in frame overlay 380. Exemplary frame overlay 380 shows the relative sensed amounts of various non-verbal cues in the given video frame currently displayed. For example, bar 382 may represent a present amount of (A)nger observed in the deponent, bar 384 may represent a present amount of (T)ruthfulness/believability observed in the deponent, bar 386 may represent a present amount of (C)onfidence observed in the deponent, bar 388 may represent a present amount of (M)ovement/fidgeting observed in the deponent, and bar 390 may represent a present amount of (D)elay between the ending of the last posed question and the beginning of the deponent's response.
By watching the fluctuations and relative changes in the amounts of the various recognized non-verbal cues, an attorney reviewing the video testimony file may be able to more quickly identify or pay closer attention to portions of the testimony that may be of particular relevance. The information presented in video frame overlay 380 may be encoded on top of each individual frame of video in the final produced video (e.g., by producer 208 in the example of
Referring now to
Referring now to
Next, the method may receive updates to the transcript, e.g., in the form of insertions, deletions, and/or modifications received from one or more parties with access to and authority to make modifications to the transcript (Step 508). In response to the received updates, the method may then update the associated page/line ranges for the various identified semantic constructs within the transcript file. As may be understood, a given modification to a transcript document may affect only a single semantic construct's page/line range, or the given modification may affect up to all of the identified semantic constructs' page/line ranges for a given transcript file. As may now be understood, by tracking the changing location of various semantic constructs within the transcript throughout the various phases of the transcript editing process, important annotations, highlights, query results, etc., may be updated in the appropriate fashion, i.e., so that they are still tied to the correct portion of the transcript document as subsequent changes are made to the transcript file. In some instances, modifications to the transcript file may result in the deletion of a semantic construct or the splitting of a semantic construct into two or more new semantic constructs. The creation of new semantic construct Token_IDs may necessitate an updating operation to historically stored Token_ID references that may no longer exist in the transcript. The design enables flexibility, e.g., allowing a human operator to override any auto-assigned speaker, text, or attribute in the original transcript. Finally, to the extent any of the updated and/or tagged semantic constructs in the transcript document are also reflected, in whole or in part, in a master index file, the corresponding portions of the master index file may also be updated accordingly to reflect the updated page/line range associations with the various affected Token_IDs (Step 512).
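The smart-citation tracking described above can be sketched minimally: each semantic construct is keyed by its Token_ID, and an edit earlier in the transcript shifts the line ranges of every construct that follows it. The Token_ID values, line ranges, and edit representation below are hypothetical.

```python
# Hypothetical smart-citation index: Token_ID -> (start_line, end_line).
citations = {
    "tok-0001": (10, 14),
    "tok-0002": (40, 46),
    "tok-0003": (90, 95),
}

def apply_edit(citations, at_line, delta):
    """Shift stored line ranges for an insertion (delta > 0) or deletion
    (delta < 0) at `at_line`, so each citation keeps pointing at the same
    underlying testimony as the transcript is revised."""
    updated = {}
    for token_id, (start, end) in citations.items():
        if start >= at_line:
            # Construct lies entirely after the edit: shift the whole range.
            start, end = start + delta, end + delta
        elif end >= at_line:
            # Edit falls inside this construct: only its end moves.
            end += delta
        updated[token_id] = (start, end)
    return updated

# Three lines inserted at line 42 grow tok-0002 and push tok-0003 down.
citations = apply_edit(citations, at_line=42, delta=3)
```

A fuller implementation would also handle construct deletion and splitting, which create and retire Token_IDs as noted above.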
Referring now to
Human speech that is to be transcribed into textual form may typically fall into one of three main scenarios:

1.) Human-to-Machine Applied Speech: In these utterance scenarios (e.g., a human user interacting verbally with a phone, car, or other device with a limited set of key intents that need to be recognized), the human user generally knows they are interacting with a machine and is trying to get it to do something specific. The user may even limit their speech patterns to include simple phrases or commands that they know the machine is capable of responding to.

2.) Human-to-Machine General Speech: In these utterance or dictation scenarios, a human user typically knows that they are talking to a machine, may even be shown the results of the machine's speech-to-text recognition on a display in real-time, and may carefully plan their commands, use more formalized language and clearer diction, and employ full sentences that last less than 10 seconds. An example of this type of scenario is dictating a text message to a mobile phone.

3.) Human-to-Human General Speech: In typical conversation scenarios, users are engaged in a human-to-human conversation, wherein multiple users may be speaking, may be overlapping each other in speech, may be speaking at different rates, and/or with different diction clarity levels, etc. This is the hardest scenario in which to perform speech-to-text recognition, and it is also the scenario that most closely approximates a legal deposition or other testimony environment.
As such, when performing or attempting to perform speech-to-text recognition in human-to-human scenarios, it is preferable not to use compressed audio formats (e.g., mp3, mpeg, aac, etc.). Instead, it is preferable to use lossless encoding, e.g., the .wav format (e.g., Linear16 and/or Linear PCM) or the lossless .flac file format. It may further be preferable to capture the audio at a sampling rate of at least 44 kHz and to avoid re-encoding the audio. For example, taking an mp3 recording and re-encoding it into wav/flac may produce worse results than simply using the mp3 source recording. It is also preferable to ensure that every processing step performed on the audio file is also lossless, to the extent possible. For example, sometimes, merely trimming the start and/or end times of an audio file may cause a re-encoding of the audio and thus decrease its quality, without the user realizing it. Finally, it may be preferable to record single channel audio (as opposed to stereo audio), at least because most transcription APIs require stereo audio channels to be combined into a single channel before transcription, which, as mentioned above, may lead to further degradations in quality of the audio if the channels have to be combined in post-processing to be submitted to the transcription API.
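The capture recommendations above (mono, uncompressed Linear16 PCM, adequate sampling rate) can be verified programmatically before any audio is submitted for transcription. The sketch below uses Python's standard-library `wave` module and builds a silent in-memory capture purely for illustration; the function name is a hypothetical convenience.

```python
import io
import wave

def check_capture_settings(wav_bytes, min_rate_hz=44100):
    """Verify a .wav capture is mono, 16-bit linear PCM, and sampled at or
    above the recommended rate, per the guidance above."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return {
            "mono": w.getnchannels() == 1,
            "linear16": w.getsampwidth() == 2,   # 2 bytes = 16-bit samples
            "rate_ok": w.getframerate() >= min_rate_hz,
        }

# Build a one-second silent mono Linear16 capture in memory for illustration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # single channel, as recommended
    w.setsampwidth(2)       # Linear16: 16 bits per sample
    w.setframerate(44100)   # at least 44 kHz
    w.writeframes(b"\x00\x00" * 44100)

report = check_capture_settings(buf.getvalue())
```

Running the same check on a stereo or down-sampled file would flag it before any lossy post-processing (e.g., channel merging) has to occur.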
Returning now to
Next, e.g., at a later time, incoming untagged audio recordings of one or more speakers may be obtained by the digital court reporting system (Step 610). The obtained untagged audio recordings may then be compared with audio stored in the aforementioned audio repository (Step 612). If no match is found in the audio repository, e.g., a match between a voice present in the incoming untagged audio recording and a voice in one or more of the tagged audio recordings in the audio repository (i.e., “NO” at Step 614), then the method 600 may end, as it is unlikely that the audio repository would be able to contribute additional useful information regarding the current incoming untagged audio recording. If, instead, a match is found between a voice present in the incoming untagged audio recording and a voice in one or more of the tagged audio recordings in the audio repository (i.e., “YES” at Step 614), then the method 600 may proceed to Step 616 and tag the obtained incoming untagged audio with one or more speaker cues gleaned from the matching audio recordings in the audio repository. For example, in some embodiments, the tagged speaker cues may comprise: a speaker probability value for a given recognized voice in the incoming audio recording; a Speaker_ID value for given recognized voice in the incoming audio recording; a speaker volume indication (e.g., a relative volume level of a deponent in a current recording as compared to previous recordings of the deponent); an indication that multiple speakers are likely overlapping each other during a given time interval, etc. Further, based on the matching content identified in the repository, the user may be given the option to download or playback the matching content, or be prompted with options to download or playback other relevant material, for example other experts/expert testimony on the same topics and/or previous testimony by the same (or related) deponents, etc.
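The repository comparison at Steps 612-616 might be sketched as matching a voiceprint feature vector extracted from the untagged recording against stored, tagged voiceprints, e.g., by cosine similarity. The feature vectors, Speaker_ID values, and the 0.95 threshold below are hypothetical; a production system would use a proper speaker-embedding model rather than three-element vectors.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two voiceprint feature vectors (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical repository: Speaker_ID -> previously stored tagged voiceprint.
repository = {
    "SPK-017": [0.9, 0.1, 0.3],
    "SPK-042": [0.2, 0.8, 0.5],
}

def match_speaker(voiceprint, repository, threshold=0.95):
    """Return (Speaker_ID, probability-like score) for the best repository
    match, or None when nothing is close enough (the 'NO' branch at Step 614)."""
    best_id = max(repository,
                  key=lambda sid: cosine_similarity(voiceprint, repository[sid]))
    score = cosine_similarity(voiceprint, repository[best_id])
    return (best_id, score) if score >= threshold else None

# A voice in the new untagged recording closely resembles SPK-017.
match = match_speaker([0.88, 0.12, 0.31], repository)
```

On a match, the returned Speaker_ID and score could seed the speaker cues (speaker probability, relative volume, overlap indications) used to update the incoming recording.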
Exemplary Processing Device

Referring now to
System unit 705 may be programmed to perform methods in accordance with this disclosure. System unit 705 comprises one or more processing units 710, an input-output (I/O) bus 725, and memory 715. Access to memory 715 can be accomplished using the I/O bus 725. Processing unit 710 may include any programmable controller device including, for example, a mainframe processor, a mobile phone processor, or, as examples, one or more members of the INTEL® ATOM™, INTEL® XEON™, and INTEL® CORE™ processor families from Intel Corporation and the Cortex and ARM processor families from ARM. (INTEL, INTEL ATOM, XEON, and CORE are trademarks of the Intel Corporation. CORTEX is a registered trademark of the ARM Limited Corporation. ARM is a registered trademark of the ARM Limited Company.) Memory 715 may include one or more memory modules and comprise random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), programmable read-write memory, and solid-state memory. As also shown in
Referring now to
The processing unit core 710 is shown including execution logic 780 having a set of execution units 785-1 through 785-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The execution logic 780 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic 790 retires the instructions of the code 750. In one embodiment, the processing unit core 710 allows out of order execution but requires in order retirement of instructions. Retirement logic 795 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processing unit core 710 is transformed during execution of the code 750, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 762, and any registers (not shown) modified by the execution logic 780.
Although not illustrated in
In the foregoing description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, to one skilled in the art that the disclosed embodiments may be practiced without these specific details. In other instances, structure and devices are shown in block diagram form in order to avoid obscuring the disclosed embodiments. References to numbers without subscripts or suffixes are understood to reference all instances of subscripts and suffixes corresponding to the referenced number. Moreover, the language used in this disclosure has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter. Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one disclosed embodiment, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.
It is also to be understood that the above description is intended to be illustrative, and not restrictive. For example, the above-described embodiments may be used in combination with each other, and illustrative process steps may be performed in an order different than shown. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as plain-English equivalents of the respective terms “comprising” and “wherein.”
Claims
1. A method, comprising:
- obtaining a first audio recording of a first testimony, wherein the first audio recording comprises one or more first speaking parties;
- tagging the obtained first audio recording with one or more unique speaker identifiers, wherein each of the one or more unique speaker identifiers corresponds to one of the one or more first speaking parties;
- determining at least one characteristic of the tagged obtained first audio recording;
- storing the tagged obtained first audio recording and corresponding at least one determined characteristic in a repository;
- obtaining a second audio recording of a second testimony, wherein the second audio recording comprises one or more second speaking parties;
- comparing the obtained second audio recording to one or more audio recordings stored in the repository; and
- in response to finding at least one matching audio recording in the repository, updating the obtained second audio recording with one or more speaker cues, wherein the one or more speaker cues are based, at least in part, on the at least one matching audio recording in the repository.
2. The method of claim 1, wherein the audio recording further comprises an audiovisual recording.
3. The method of claim 1, wherein one of the at least one matching audio recordings comprises the first audio recording.
4. The method of claim 1, wherein one of the one or more speaker cues comprises at least one of the following: a speaker probability value for a voice in the at least one matching audio recording; a unique speaker identifier for a voice in the at least one matching audio recording; a speaker volume indication for a voice in the at least one matching audio recording; and an indication that multiple speakers are likely overlapping each other during a first time interval in the at least one matching audio recording.
5. The method of claim 1, wherein one of the at least one characteristics of the tagged obtained first audio recording comprises at least one of the following: a meaning of the obtained first audio recording, an intent of the obtained first audio recording, a source of the obtained first audio recording, and a content of the obtained first audio recording.
6. The method of claim 1, wherein there is at least one speaking party in common between the one or more first speaking parties and the one or more second speaking parties.
7. The method of claim 1, wherein tagging the obtained first audio recording with one or more unique speaker identifiers comprises tagging the obtained first audio recording based on at least one of the following: input from a human operator; automated voice recognition; automated face recognition; or line level analysis of one or more audio channels of the first audio recording.
8. A non-transitory program storage device comprising instructions stored thereon to cause one or more processors to:
- obtain a first audio recording of a first testimony, wherein the first audio recording comprises one or more first speaking parties;
- tag the obtained first audio recording with one or more unique speaker identifiers, wherein each of the one or more unique speaker identifiers corresponds to one of the one or more first speaking parties;
- determine at least one characteristic of the tagged obtained first audio recording;
- store the tagged obtained first audio recording and corresponding at least one determined characteristic in a repository;
- obtain a second audio recording of a second testimony, wherein the second audio recording comprises one or more second speaking parties;
- compare the obtained second audio recording to one or more audio recordings stored in the repository; and
- in response to finding at least one matching audio recording in the repository, update the obtained second audio recording with one or more speaker cues, wherein the one or more speaker cues are based, at least in part, on the at least one matching audio recording in the repository.
9. The non-transitory program storage device of claim 8, wherein the audio recording further comprises an audiovisual recording.
10. The non-transitory program storage device of claim 8, wherein one of the at least one matching audio recordings comprises the first audio recording.
11. The non-transitory program storage device of claim 8, wherein one of the one or more speaker cues comprises at least one of the following: a speaker probability value for a voice in the at least one matching audio recording; a unique speaker identifier for a voice in the at least one matching audio recording; a speaker volume indication for a voice in the at least one matching audio recording; and an indication that multiple speakers are likely overlapping each other during a first time interval in the at least one matching audio recording.
12. The non-transitory program storage device of claim 8, wherein one of the at least one characteristics of the tagged obtained first audio recording comprises at least one of the following: a meaning of the obtained first audio recording, an intent of the obtained first audio recording, a source of the obtained first audio recording, and a content of the obtained first audio recording.
13. The non-transitory program storage device of claim 8, wherein there is at least one speaking party in common between the one or more first speaking parties and the one or more second speaking parties.
14. The non-transitory program storage device of claim 8, wherein the instructions to tag the obtained first audio recording with one or more unique speaker identifiers comprise instructions to tag the obtained first audio recording based on at least one of the following: input from a human operator; automated voice recognition; automated face recognition; or line level analysis of one or more audio channels of the first audio recording.
15. A device, comprising:
- a memory;
- a display;
- a user interface; and
- one or more processors operatively coupled to the memory, wherein the one or more processors are configured to execute instructions causing the one or more processors to: obtain a first audio recording of a first testimony, wherein the first audio recording comprises one or more first speaking parties; tag the obtained first audio recording with one or more unique speaker identifiers, wherein each of the one or more unique speaker identifiers corresponds to one of the one or more first speaking parties; determine at least one characteristic of the tagged obtained first audio recording; store the tagged obtained first audio recording and corresponding at least one determined characteristic in a repository; obtain a second audio recording of a second testimony, wherein the second audio recording comprises one or more second speaking parties; compare the obtained second audio recording to one or more audio recordings stored in the repository; and in response to finding at least one matching audio recording in the repository, update the obtained second audio recording with one or more speaker cues, wherein the one or more speaker cues are based, at least in part, on the at least one matching audio recording in the repository.
16. The device of claim 15, wherein one of the at least one matching audio recordings comprises the first audio recording.
17. The device of claim 15, wherein one of the one or more speaker cues comprises at least one of the following: a speaker probability value for a voice in the at least one matching audio recording; a unique speaker identifier for a voice in the at least one matching audio recording; a speaker volume indication for a voice in the at least one matching audio recording; and an indication that multiple speakers are likely overlapping each other during a first time interval in the at least one matching audio recording.
18. The device of claim 15, wherein one of the at least one characteristics of the tagged obtained first audio recording comprises at least one of the following: a meaning of the obtained first audio recording, an intent of the obtained first audio recording, a source of the obtained first audio recording, and a content of the obtained first audio recording.
19. The device of claim 15, wherein there is at least one speaking party in common between the one or more first speaking parties and the one or more second speaking parties.
20. The device of claim 15, wherein the instructions to tag the obtained first audio recording with one or more unique speaker identifiers comprise instructions to tag the obtained first audio recording based on at least one of the following: input from a human operator; automated voice recognition; automated face recognition; or line level analysis of one or more audio channels of the first audio recording.
Type: Application
Filed: Sep 13, 2019
Publication Date: Mar 19, 2020
Inventors: Robert Ackerman (Boca Raton, FL), Anthony J. Vaglica (Silver Spring, MD), Holli Goldman (Richboro, PA), Amber Hickman (Swedesboro, NJ), Walter Barrett (Gibbsboro, NJ), Cameron Turner (Palo Alto, CA), Shawn Rutledge (Seattle, WA)
Application Number: 16/570,699