Patents by Inventor Naoyuki Kanda
Naoyuki Kanda has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11935542
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Grant
Filed: January 19, 2023
Date of Patent: March 19, 2024
Assignee: Microsoft Technology Licensing, LLC
Inventors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
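The abstract above can be illustrated with a minimal sketch. The `<wc>` symbol name and the simple overlap-dropping rule are assumptions for illustration only; the patented stitcher consolidates hypotheses with a trained network, not a fixed rule.

```python
WC = "<wc>"  # window-change stitching symbol (name assumed for illustration)

def merge_hypotheses(segment_hypotheses):
    """Join per-segment ASR hypotheses, inserting a window-change
    symbol at each segment boundary."""
    return f" {WC} ".join(segment_hypotheses)

def consolidate(merged, overlap_words=1):
    """Toy consolidation: drop words repeated across a window boundary.
    (The patent instead uses a network-based hypothesis stitcher.)"""
    out = []
    for chunk in merged.split(WC):
        words = chunk.split()
        # If the tail of the previous window repeats at the head of this
        # window, keep only one copy of the overlapping words.
        if out and out[-overlap_words:] == words[:overlap_words]:
            words = words[overlap_words:]
        out.extend(words)
    return " ".join(out)
```

For example, stitching the windows `"hello world how"` and `"how are you"` yields a single consolidated hypothesis with the duplicated boundary word removed.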
-
Publication number: 20230215439
Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
Type: Application
Filed: December 31, 2021
Publication date: July 6, 2023
Inventors: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO
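A minimal sketch of the de-serialization step described above: words are routed into transcript lines, switching lines whenever a channel-change symbol appears. The `<cc>` token name and the round-robin channel rule are assumptions for illustration, not details from the filing.

```python
CC = "<cc>"  # channel-change symbol (name assumed for illustration)

def to_transcript_lines(tokens, num_channels=2):
    """Toy sorting of a word/CC-symbol stream into transcript lines:
    each CC symbol switches output to the next channel, so words of
    simultaneous speakers end up on separate lines."""
    channels = [[] for _ in range(num_channels)]
    ch = 0
    for tok in tokens:
        if tok == CC:
            ch = (ch + 1) % num_channels  # a CC symbol changes the channel
        else:
            channels[ch].append(tok)
    return [" ".join(c) for c in channels if c]
```

A token stream like `["hi", "<cc>", "hello", "<cc>", "there"]` then splits into two transcript lines, one per overlapping speaker.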
-
Publication number: 20230154468
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Application
Filed: January 19, 2023
Publication date: May 18, 2023
Inventors: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
-
Patent number: 11574639
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Grant
Filed: December 18, 2020
Date of Patent: February 7, 2023
Assignee: Microsoft Technology Licensing, LLC
Inventors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
-
Publication number: 20220397963
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Application
Filed: August 22, 2022
Publication date: December 15, 2022
Inventors: Yo NONOMURA, Naoyuki KANDA
-
Patent number: 11527238
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Grant
Filed: January 21, 2021
Date of Patent: December 13, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Sarangarajan Parthasarathy, Xie Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
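The integrated score described above is commonly realized as a log-linear combination: the E2E score plus a weighted external-LM score, minus a weighted estimate of the internal LM score. The sketch below assumes log-domain scores and illustrative weight values; the patent does not prescribe these particular weights.

```python
def integrated_score(log_p_e2e, log_p_ext_lm, log_p_ilm,
                     ext_weight=0.5, ilm_weight=0.25):
    """Combine an E2E model score with an external LM score while
    subtracting the estimated internal LM score, all in the log domain.
    Weights are illustrative tuning parameters, not values from the patent."""
    return log_p_e2e + ext_weight * log_p_ext_lm - ilm_weight * log_p_ilm
```

Subtracting the internal-LM term counteracts the source-domain language prior already baked into the E2E model, so the target-domain external LM can steer decoding more effectively.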
-
Patent number: 11455042
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Grant
Filed: August 24, 2017
Date of Patent: September 27, 2022
Assignee: MAXELL, LTD.
Inventors: Yo Nonomura, Naoyuki Kanda
-
Publication number: 20220254352
Abstract: An audio analysis platform may receive a portion of an audio input, wherein the audio input corresponds to audio associated with a plurality of speakers. The audio analysis platform may process, using a neural network, the portion of the audio input to determine voice activity of the plurality of speakers during the portion of the audio input, wherein the neural network is trained using reference audio data and reference diarization data corresponding to the reference audio data. The audio analysis platform may determine, based on the neural network being used to process the portion of the audio input, a diarization output associated with the portion of the audio input, wherein the diarization output indicates individual voice activity of the plurality of speakers. The audio analysis platform may provide the diarization output to indicate the individual voice activity of the plurality of speakers during the portion of the audio input.
Type: Application
Filed: August 31, 2020
Publication date: August 11, 2022
Applicants: The Johns Hopkins University, Hitachi, Ltd.
Inventors: Yusuke FUJITA, Shinji WATANABE, Naoyuki KANDA, Shota HORIGUCHI
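The final step described above, turning the network's per-frame, per-speaker voice-activity estimates into a diarization output, can be sketched as simple thresholding. The 0.5 threshold and the list-of-lists layout are assumptions for illustration.

```python
def diarize(frame_probs, threshold=0.5):
    """Convert per-frame, per-speaker voice-activity probabilities
    (as an end-to-end diarization network might emit) into binary
    activity decisions. Frames with several probabilities above the
    threshold naturally represent overlapping speech."""
    return [[p >= threshold for p in frame] for frame in frame_probs]
```

Because each speaker is thresholded independently, overlapping speech simply yields more than one active speaker in a frame, which is what distinguishes this style of output from classic single-label diarization.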
-
Publication number: 20220199091
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Application
Filed: December 18, 2020
Publication date: June 23, 2022
Inventors: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
-
Publication number: 20220139380
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Application
Filed: January 21, 2021
Publication date: May 5, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
-
Publication number: 20210357792
Abstract: In order to assist participants in thinking of an idea by acquiring audio data, a workshop assistance system is provided which includes a computer having an arithmetic apparatus configured to execute predetermined processing, a storage device coupled to the arithmetic apparatus, and a communication interface coupled to the arithmetic apparatus, the computer being configured to access solved problem case data including information of solved cases that correspond to problem data. The workshop assistance system comprises: a problem processing module configured to search, by the arithmetic apparatus, solved cases based on problem data that is generated from a discussion among participants; and an idea generation module configured to present, by the arithmetic apparatus, idea data including the generated problem data and information of the solved case found in the search to the participants.
Type: Application
Filed: January 31, 2019
Publication date: November 18, 2021
Applicant: Hitachi, Ltd.
Inventors: Shuhei FURUYA, Yo TAKEUCHI, Kiyoshi KUMAGAI, Toshiyuki ONO, Masao ISHIGURO, Tatsuya TOKUNAGA, Chisa NAGAI, Takashi SUMIYOSHI, Naoyuki KANDA, Kenji NAGAMATSU, Kenji OHYA
-
Publication number: 20210311558
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: June 16, 2021
Publication date: October 7, 2021
Inventor: Naoyuki KANDA
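The server-side update described above can be sketched as follows. The attribute layout, the toggle interpretation of the detected motion, and the recipient rule are all assumptions for illustration; the filing describes the behavior only at the level of the abstract.

```python
def on_motion_detected(attributes, object_id, all_terminals, sender):
    """Toy server-side handler: a detected user motion switches an object
    between shared and private, and the updated object data is sent to
    every terminal if shared, or only to the operating terminal if private.
    (Field names and the toggle rule are illustrative assumptions.)"""
    obj = attributes[object_id]
    obj["shared"] = not obj["shared"]  # motion toggles the operation authority
    recipients = list(all_terminals) if obj["shared"] else [sender]
    return {"object": object_id, "data": obj, "to": recipients}
```

Keeping the attribute information on the server, as in the abstract, means every terminal renders from one authoritative copy rather than reconciling peer-to-peer state.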
-
Patent number: 11107476
Abstract: A speaker estimation method that estimates the speaker from audio and image includes: inputting audio; extracting a feature quantity representing a voice characteristic from the input audio; inputting an image; detecting person regions of respective persons from the input image; estimating feature quantities representing voice characteristics from the respective detected person regions; performing a change such that an image taken from another position and with another angle is input when no person is detected; calculating a similarity between the feature quantity representing the voice characteristic extracted from the audio and the feature quantity representing the voice characteristic estimated from the person region in the image; and estimating a speaker from the calculated similarity.
Type: Grant
Filed: February 26, 2019
Date of Patent: August 31, 2021
Assignee: HITACHI, LTD.
Inventors: Shota Horiguchi, Naoyuki Kanda
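The similarity-and-selection step described above can be sketched with cosine similarity between the audio-derived feature vector and each person region's estimated voice-characteristic vector. Representing the features as plain vectors and using cosine similarity are assumptions for illustration; the patent does not fix a particular similarity measure here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def estimate_speaker(audio_embedding, person_embeddings):
    """Return the index of the detected person whose estimated
    voice-characteristic vector is most similar to the vector
    extracted from the audio."""
    sims = [cosine(audio_embedding, p) for p in person_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```

The re-capture step in the abstract (taking an image from another position and angle when no person is detected) would simply loop back to detection before this selection runs.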
-
Patent number: 11068072
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Grant
Filed: August 11, 2020
Date of Patent: July 20, 2021
Assignee: MAXELL, LTD.
Inventor: Naoyuki Kanda
-
Patent number: 10909976
Abstract: A speech recognition device includes: an acoustic model based on an End-to-End neural network responsive to an observed sequence formed of prescribed acoustic features obtained from a speech signal by a feature extracting unit, for calculating the probability of the observed sequence being a certain symbol sequence; and a decoder responsive to a symbol sequence candidate, for decoding a speech signal by a WFST based on a posterior probability of each of the word sequences corresponding to the symbol sequence candidate, probabilities calculated by the acoustic model for symbol sequences selected based on an observed sequence, and a posterior probability of each of the plurality of symbol sequences.
Type: Grant
Filed: June 2, 2017
Date of Patent: February 2, 2021
Assignee: National Institute of Information and Communications Technology
Inventor: Naoyuki Kanda
-
Publication number: 20200371602
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: August 11, 2020
Publication date: November 26, 2020
Inventor: Naoyuki KANDA
-
Patent number: 10775897
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Grant
Filed: June 6, 2017
Date of Patent: September 15, 2020
Assignee: Maxell, Ltd.
Inventor: Naoyuki Kanda
-
Publication number: 20200241647
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: June 6, 2017
Publication date: July 30, 2020
Inventor: Naoyuki KANDA
-
Publication number: 20200167003
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Application
Filed: August 24, 2017
Publication date: May 28, 2020
Inventors: Yo NONOMURA, Naoyuki KANDA
-
Patent number: 10607602
Abstract: An object is to provide a speech recognition device with improved recognition accuracy using characteristics of a neural network. A speech recognition device includes: an acoustic model 308 implemented by a RNN (recurrent neural network) for calculating, for each state sequence, the posterior probability of a state sequence in response to an observed sequence consisting of prescribed speech features obtained from a speech; a WFST 320 based on S-1HCLG calculating, for each word sequence, the posterior probability of a word sequence in response to a state sequence; and a hypothesis selecting unit 322, performing speech recognition of the speech signal based on a score calculated for each hypothesis of a word sequence corresponding to the speech signal, using the posterior probabilities calculated by the acoustic model 308 and the WFST 320 for the input observed sequence.
Type: Grant
Filed: May 10, 2016
Date of Patent: March 31, 2020
Assignee: NATIONAL INSTITUTE OF INFORMATION AND COMMUNICATIONS TECHNOLOGY
Inventor: Naoyuki Kanda