Patents by Inventor Naoyuki Kanda
Naoyuki Kanda has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11935542
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Grant
Filed: January 19, 2023
Date of Patent: March 19, 2024
Assignee: Microsoft Technology Licensing, LLC
Inventors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
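The abstract above can be illustrated with a minimal sketch. The `<wc>` symbol name and the simple overlap-dropping rule are assumptions for illustration only; the patented stitcher consolidates hypotheses with a trained network, not a fixed rule.

```python
WC = "<wc>"  # window-change stitching symbol (name assumed for illustration)

def merge_hypotheses(segment_hypotheses):
    """Join per-segment ASR hypotheses, inserting a window-change
    symbol at each segment boundary."""
    return f" {WC} ".join(segment_hypotheses)

def consolidate(merged, overlap_words=1):
    """Toy consolidation: drop words repeated across a window boundary.
    (The patent instead uses a network-based hypothesis stitcher.)"""
    out = []
    for chunk in merged.split(WC):
        words = chunk.split()
        # If the tail of the previous window repeats at the head of this
        # window, keep only one copy of the overlapping words.
        if out and out[-overlap_words:] == words[:overlap_words]:
            words = words[overlap_words:]
        out.extend(words)
    return " ".join(out)
```

For example, stitching the windows `"hello world how"` and `"how are you"` yields a single consolidated hypothesis with the duplicated boundary word removed.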
-
Publication number: 20230215439
Abstract: The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
Type: Application
Filed: December 31, 2021
Publication date: July 6, 2023
Inventors: Naoyuki KANDA, Takuya YOSHIOKA, Zhuo CHEN, Jinyu LI, Yashesh GAUR, Zhong MENG, Xiaofei WANG, Xiong XIAO
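A minimal sketch of the de-serialization step described above: words are routed into transcript lines, switching lines whenever a channel-change symbol appears. The `<cc>` token name and the round-robin channel rule are assumptions for illustration, not details from the filing.

```python
CC = "<cc>"  # channel-change symbol (name assumed for illustration)

def to_transcript_lines(tokens, num_channels=2):
    """Toy sorting of a word/CC-symbol stream into transcript lines:
    each CC symbol switches output to the next channel, so words of
    simultaneous speakers end up on separate lines."""
    channels = [[] for _ in range(num_channels)]
    ch = 0
    for tok in tokens:
        if tok == CC:
            ch = (ch + 1) % num_channels  # a CC symbol changes the channel
        else:
            channels[ch].append(tok)
    return [" ".join(c) for c in channels if c]
```

A token stream like `["hi", "<cc>", "hello", "<cc>", "there"]` then splits into two transcript lines, one per overlapping speaker.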
-
Publication number: 20230154468
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Application
Filed: January 19, 2023
Publication date: May 18, 2023
Inventors: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
-
Patent number: 11574639
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Grant
Filed: December 18, 2020
Date of Patent: February 7, 2023
Assignee: Microsoft Technology Licensing, LLC
Inventors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
-
Publication number: 20220397963
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Application
Filed: August 22, 2022
Publication date: December 15, 2022
Inventors: Yo NONOMURA, Naoyuki KANDA
-
Patent number: 11527238
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Grant
Filed: January 21, 2021
Date of Patent: December 13, 2022
Assignee: Microsoft Technology Licensing, LLC
Inventors: Zhong Meng, Sarangarajan Parthasarathy, Xie Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
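The integrated score described above is commonly realized as a log-linear combination: the E2E score plus a weighted external-LM score, minus a weighted estimate of the internal LM score. The sketch below assumes log-domain scores and illustrative weight values; the patent does not prescribe these particular weights.

```python
def integrated_score(log_p_e2e, log_p_ext_lm, log_p_ilm,
                     ext_weight=0.5, ilm_weight=0.25):
    """Combine an E2E model score with an external LM score while
    subtracting the estimated internal LM score, all in the log domain.
    Weights are illustrative tuning parameters, not values from the patent."""
    return log_p_e2e + ext_weight * log_p_ext_lm - ilm_weight * log_p_ilm
```

Subtracting the internal-LM term counteracts the source-domain language prior already baked into the E2E model, so the target-domain external LM can steer decoding more effectively.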
-
Patent number: 11455042
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Grant
Filed: August 24, 2017
Date of Patent: September 27, 2022
Assignee: MAXELL, LTD.
Inventors: Yo Nonomura, Naoyuki Kanda
-
Publication number: 20220254352
Abstract: An audio analysis platform may receive a portion of an audio input, wherein the audio input corresponds to audio associated with a plurality of speakers. The audio analysis platform may process, using a neural network, the portion of the audio input to determine voice activity of the plurality of speakers during the portion of the audio input, wherein the neural network is trained using reference audio data and reference diarization data corresponding to the reference audio data. The audio analysis platform may determine, based on the neural network being used to process the portion of the audio input, a diarization output associated with the portion of the audio input, wherein the diarization output indicates individual voice activity of the plurality of speakers. The audio analysis platform may provide the diarization output to indicate the individual voice activity of the plurality of speakers during the portion of the audio input.
Type: Application
Filed: August 31, 2020
Publication date: August 11, 2022
Applicants: The Johns Hopkins University, Hitachi, Ltd.
Inventors: Yusuke FUJITA, Shinji WATANABE, Naoyuki KANDA, Shota HORIGUCHI
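The final step described above, turning the network's per-frame, per-speaker voice-activity estimates into a diarization output, can be sketched as simple thresholding. The 0.5 threshold and the list-of-lists layout are assumptions for illustration.

```python
def diarize(frame_probs, threshold=0.5):
    """Convert per-frame, per-speaker voice-activity probabilities
    (as an end-to-end diarization network might emit) into binary
    activity decisions. Frames with several probabilities above the
    threshold naturally represent overlapping speech."""
    return [[p >= threshold for p in frame] for frame in frame_probs]
```

Because each speaker is thresholded independently, overlapping speech simply yields more than one active speaker in a frame, which is what distinguishes this style of output from classic single-label diarization.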
-
Publication number: 20220199091
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis.
Type: Application
Filed: December 18, 2020
Publication date: June 23, 2022
Inventors: Naoyuki KANDA, Xuankai CHANG, Yashesh GAUR, Xiaofei WANG, Zhong MENG, Takuya YOSHIOKA
-
Publication number: 20220139380
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
Type: Application
Filed: January 21, 2021
Publication date: May 5, 2022
Applicant: Microsoft Technology Licensing, LLC
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
-
Publication number: 20210357792
Abstract: In order to assist participants in thinking of an idea by acquiring audio data, a workshop assistance system is provided which includes a computer having an arithmetic apparatus configured to execute predetermined processing, a storage device coupled to the arithmetic apparatus, and a communication interface coupled to the arithmetic apparatus, the computer being configured to access solved problem case data including information of solved cases that correspond to problem data. The workshop assistance system comprises: a problem processing module configured to search, by the arithmetic apparatus, solved cases based on problem data that is generated from a discussion among participants; and an idea generation module configured to present, by the arithmetic apparatus, idea data including the generated problem data and information of the solved case found in the search to the participants.
Type: Application
Filed: January 31, 2019
Publication date: November 18, 2021
Applicant: Hitachi, Ltd.
Inventors: Shuhei FURUYA, Yo TAKEUCHI, Kiyoshi KUMAGAI, Toshiyuki ONO, Masao ISHIGURO, Tatsuya TOKUNAGA, Chisa NAGAI, Takashi SUMIYOSHI, Naoyuki KANDA, Kenji NAGAMATSU, Kenji OHYA
-
Publication number: 20210311558
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: June 16, 2021
Publication date: October 7, 2021
Inventor: Naoyuki KANDA
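The server-side update described above can be sketched as follows. The attribute layout, the toggle interpretation of the detected motion, and the recipient rule are all assumptions for illustration; the filing describes the behavior only at the level of the abstract.

```python
def on_motion_detected(attributes, object_id, all_terminals, sender):
    """Toy server-side handler: a detected user motion switches an object
    between shared and private, and the updated object data is sent to
    every terminal if shared, or only to the operating terminal if private.
    (Field names and the toggle rule are illustrative assumptions.)"""
    obj = attributes[object_id]
    obj["shared"] = not obj["shared"]  # motion toggles the operation authority
    recipients = list(all_terminals) if obj["shared"] else [sender]
    return {"object": object_id, "data": obj, "to": recipients}
```

Keeping the attribute information on the server, as in the abstract, means every terminal renders from one authoritative copy rather than reconciling peer-to-peer state.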
-
Patent number: 11107476
Abstract: A speaker estimation method that estimates the speaker from audio and image includes: inputting audio; extracting a feature quantity representing a voice characteristic from the input audio; inputting an image; detecting person regions of respective persons from the input image; estimating feature quantities representing voice characteristics from the respective detected person regions; performing a change such that an image taken from another position and with another angle is input when no person is detected; calculating a similarity between the feature quantity representing the voice characteristic extracted from the audio and the feature quantity representing the voice characteristic estimated from the person region in the image; and estimating a speaker from the calculated similarity.
Type: Grant
Filed: February 26, 2019
Date of Patent: August 31, 2021
Assignee: HITACHI, LTD.
Inventors: Shota Horiguchi, Naoyuki Kanda
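The similarity-and-selection step described above can be sketched with cosine similarity between the audio-derived feature vector and each person region's estimated voice-characteristic vector. Representing the features as plain vectors and using cosine similarity are assumptions for illustration; the patent does not fix a particular similarity measure here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def estimate_speaker(audio_embedding, person_embeddings):
    """Return the index of the detected person whose estimated
    voice-characteristic vector is most similar to the vector
    extracted from the audio."""
    sims = [cosine(audio_embedding, p) for p in person_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```

The re-capture step in the abstract (taking an image from another position and angle when no person is detected) would simply loop back to detection before this selection runs.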
-
Patent number: 11068072
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Grant
Filed: August 11, 2020
Date of Patent: July 20, 2021
Assignee: MAXELL, LTD.
Inventor: Naoyuki Kanda
-
Patent number: 10909976
Abstract: A speech recognition device includes: an acoustic model based on an End-to-End neural network responsive to an observed sequence formed of prescribed acoustic features obtained from a speech signal by a feature extracting unit, for calculating the probability of the observed sequence being a certain symbol sequence; and a decoder responsive to a symbol sequence candidate, for decoding a speech signal by a WFST based on a posterior probability of each of the word sequences corresponding to the symbol sequence candidate, probabilities calculated by the acoustic model for symbol sequences selected based on an observed sequence, and a posterior probability of each of the plurality of symbol sequences.
Type: Grant
Filed: June 2, 2017
Date of Patent: February 2, 2021
Assignee: National Institute of Information and Communications Technology
Inventor: Naoyuki Kanda
-
Publication number: 20200371602
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: August 11, 2020
Publication date: November 26, 2020
Inventor: Naoyuki KANDA
-
Patent number: 10775897
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Grant
Filed: June 6, 2017
Date of Patent: September 15, 2020
Assignee: Maxell, Ltd.
Inventor: Naoyuki Kanda
-
Publication number: 20200241647
Abstract: In a mixed reality display system, a server and a plurality of mixed reality display terminals are connected, and virtual objects are displayed. The virtual objects include a shared virtual object for which a plurality of terminals have an operation authority and a private virtual object for which only a specific terminal has an operation authority. The server has virtual object attribute information for displaying the virtual objects in each terminal, and each terminal has a motion detecting unit that detects a motion of a user for switching between the shared virtual object and the private virtual object. When a detection result by the motion detecting unit is received from the terminal, the server updates the virtual object attribute information depending on whether the virtual object is the shared virtual object or the private virtual object, and transmits data of the virtual object after the update to each terminal.
Type: Application
Filed: June 6, 2017
Publication date: July 30, 2020
Inventor: Naoyuki KANDA
-
Publication number: 20200167003
Abstract: A head-mounted display (HMD) 1, which is operated by a gesture operation performed by a user 3, is provided with a distance image acquisition unit 106 that detects a gesture operation, a position information acquisition unit 103 that acquires position information of the HMD 1, and a communication unit 2 that performs communication with another HMD 1′. A control unit 205 sets and displays an operating space 600 where a gesture operation performed by the user 3 is valid, exchanges position information and operating space information between the host HMD 1 and the other HMD 1 by way of the communication unit 2, and adjusts the operating space of the host HMD so that the operating space 600 and an operating space 600′ of the other HMD 1 do not overlap each other.
Type: Application
Filed: August 24, 2017
Publication date: May 28, 2020
Inventors: Yo NONOMURA, Naoyuki KANDA
-
Patent number: 10607602
Abstract: An object is to provide a speech recognition device with improved recognition accuracy using characteristics of a neural network. A speech recognition device includes: an acoustic model 308 implemented by a RNN (recurrent neural network) for calculating, for each state sequence, the posterior probability of a state sequence in response to an observed sequence consisting of prescribed speech features obtained from a speech; a WFST 320 based on S-1HCLG calculating, for each word sequence, the posterior probability of a word sequence in response to a state sequence; and a hypothesis selecting unit 322, performing speech recognition of the speech signal based on a score calculated for each hypothesis of a word sequence corresponding to the speech signal, using the posterior probabilities calculated by the acoustic model 308 and the WFST 320 for the input observed sequence.
Type: Grant
Filed: May 10, 2016
Date of Patent: March 31, 2020
Assignee: NATIONAL INSTITUTE OF INFORMATION AND COMMUNICATIONS TECHNOLOGY
Inventor: Naoyuki Kanda