Patents by Inventor Huapeng Sima
Huapeng Sima has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12260481
Abstract: Disclosed are a method for generating a dynamic image based on audio, a device, and a storage medium, relating to the field of natural human-computer interactions. The method includes: obtaining a reference image and reference audio input by a user; determining a target head pose feature and a target expression coefficient feature based on the reference image and a trained generation network model, and adjusting the trained generation network model based on the target head pose feature and the target expression coefficient feature, to obtain a target generation network model; and processing a to-be-processed image based on the reference audio, the reference image, and the target generation network model, to obtain a target dynamic image. An image object in the to-be-processed image is the same as that in the reference image. In this way, a corresponding digital person can be obtained from a single picture of a target person.
Type: Grant
Filed: July 19, 2024
Date of Patent: March 25, 2025
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Maolin Zhang, Liyan Mao
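The pivotal step is adapting a trained generation network to two features predicted from a single reference image. A minimal PyTorch sketch of that conditioning step follows; the encoder, feature sizes, and module names are hypothetical illustrations, not the patented implementation.

```python
# Hypothetical sketch: predict head-pose and expression-coefficient
# features from one reference image, then form a conditioning vector.
import torch
import torch.nn as nn

class PoseExpressionEncoder(nn.Module):
    """Toy stand-in for the trained generation network's feature heads."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_head = nn.Linear(16, 6)    # e.g. rotation + translation
        self.expr_head = nn.Linear(16, 64)   # expression coefficients

    def forward(self, image):
        h = self.backbone(image)
        return self.pose_head(h), self.expr_head(h)

encoder = PoseExpressionEncoder()
reference_image = torch.randn(1, 3, 256, 256)        # the user's single picture
target_pose, target_expr = encoder(reference_image)  # target features

# The generator would then be adjusted (conditioned) on these features so a
# dynamic image can be produced from the single reference picture.
conditioning = torch.cat([target_pose, target_expr], dim=-1)
print(conditioning.shape)  # torch.Size([1, 70])
```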
-
Patent number: 12254552
Abstract: This application provides an audio-driven three-dimensional facial animation model generation method and apparatus, and an electronic device.
Type: Grant
Filed: December 23, 2024
Date of Patent: March 18, 2025
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Zheng Liao
-
Publication number: 20250086826
Abstract: This application provides a digital person training method and system, and a digital person driving system. According to the method, human-body pose estimation data in training data is extracted, and the human-body pose estimation data is input into an optimized pose estimation network to obtain human-body pose optimization data. Generation losses of the position optimization data and the acceleration optimization data in the human-body pose optimization data are calculated based on a loss function of the optimized pose estimation network, so as to minimize the errors between the position and acceleration estimates and the real values. In this way, the optimized pose estimation network is driven to update its network parameters, yielding an optimal driving model based on the optimized pose estimation network.
Type: Application
Filed: August 19, 2024
Publication date: March 13, 2025
Inventors: Huapeng Sima, Hao Jiang, Hongwei Fan, Qixun Qu, Jiabin Li, Jintai Luan
-
Patent number: 12236635
Abstract: This application provides a digital person training method and system, and a digital person driving system. According to the method, human-body pose estimation data in training data is extracted, and the human-body pose estimation data is input into an optimized pose estimation network to obtain human-body pose optimization data. Generation losses of the position optimization data and the acceleration optimization data in the human-body pose optimization data are calculated based on a loss function of the optimized pose estimation network, so as to minimize the errors between the position and acceleration estimates and the real values. In this way, the optimized pose estimation network is driven to update its network parameters, yielding an optimal driving model based on the optimized pose estimation network.
Type: Grant
Filed: August 19, 2024
Date of Patent: February 25, 2025
Inventors: Huapeng Sima, Hao Jiang, Hongwei Fan, Qixun Qu, Jiabin Li, Jintai Luan
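The publication and grant above both hinge on a loss that penalizes position error and acceleration error jointly. Below is a minimal sketch of such a combined loss, assuming (batch, time, joints, 3) position tensors; the finite-difference acceleration and the loss weights are assumptions, not values from the filing.

```python
# Hypothetical sketch of a joint position/acceleration generation loss.
import torch
import torch.nn.functional as F

def acceleration(pos):
    # Second-order finite difference along the time axis (dim 1).
    return pos[:, 2:] - 2 * pos[:, 1:-1] + pos[:, :-2]

def pose_loss(pred_pos, true_pos, w_pos=1.0, w_acc=0.5):
    loss_pos = F.mse_loss(pred_pos, true_pos)        # position error
    loss_acc = F.mse_loss(acceleration(pred_pos),
                          acceleration(true_pos))    # acceleration error
    return w_pos * loss_pos + w_acc * loss_acc

pred = torch.randn(2, 10, 17, 3, requires_grad=True)
true = torch.randn(2, 10, 17, 3)
loss = pose_loss(pred, true)
loss.backward()  # drives the pose estimation network to update its parameters
```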
-
Patent number: 12223973
Abstract: Embodiments of the present application provide a speech conversion method and apparatus, a storage medium, and an electronic device. The method includes: acquiring a source speech to be converted and a target speech sample of a target speaker; recognizing a style category of the target speech sample, and extracting a target audio feature from the target speech sample according to the style category; extracting a source audio feature from the source speech; acquiring a first style feature of the target speech sample and determining a second style feature of the target speech sample according to the first style feature; fusing and mapping the source audio feature, the target audio feature, and the second style feature to obtain a joint encoding feature; and decoding the joint encoding feature to obtain a target speech feature, and converting the source speech based on the target speech feature to obtain a target speech.
Type: Grant
Filed: August 9, 2024
Date of Patent: February 11, 2025
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Ao Yao, Yiping Tang
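The fuse-and-map step can be pictured as a concatenation followed by a learned projection. A minimal sketch under that assumption; the feature dimensions and the single linear fusion layer are hypothetical.

```python
# Hypothetical sketch: fuse source, target, and style features into a
# joint encoding feature that a decoder would turn into target speech.
import torch
import torch.nn as nn

src_feat   = torch.randn(1, 100, 256)   # source audio feature (frames x dim)
tgt_feat   = torch.randn(1, 100, 256)   # target audio feature
style_feat = torch.randn(1, 100, 64)    # second style feature

fuse = nn.Linear(256 + 256 + 64, 512)   # fusion + mapping layer
joint = fuse(torch.cat([src_feat, tgt_feat, style_feat], dim=-1))
print(joint.shape)  # torch.Size([1, 100, 512]); decoded into target speech features
```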
-
Patent number: 12094046
Abstract: A digital human driving method and apparatus are provided, which relate to computer and image processing and can solve the problems of shaking, joint rotation malposition, and partial loss of a digital human during a driving process. The solution includes: capturing video data from multiple angles of view in a real three-dimensional space by multiple video capture devices; determining a first coordinate of a key point of the target human; determining a mapping relationship based on the first coordinate; calculating a second coordinate based on the mapping relationship and the first coordinate; processing the second coordinate according to a key point rotation model to obtain a rotation value of the virtual key point in the virtual three-dimensional space; and driving the digital human to move based on the rotation value of the virtual key point in the virtual three-dimensional space.
Type: Grant
Filed: January 23, 2024
Date of Patent: September 17, 2024
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Jintai Luan, Hongwei Fan, Jiabin Li, Hao Jiang, Qixun Qu
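In outline, a captured key point is mapped from the real space into the virtual space, and a rotation is derived from pairs of mapped key points. A minimal NumPy sketch under that reading; the affine mapping, the rest direction, and the example bone are hypothetical.

```python
# Hypothetical sketch: map a real-space key point to virtual space, then
# derive a bone rotation angle from two mapped key points.
import numpy as np

def map_to_virtual(p_real, scale, rotation, translation):
    """Second coordinate = mapping relationship applied to first coordinate."""
    return scale * rotation @ p_real + translation

def bone_rotation(parent, child, rest_dir=np.array([0.0, 1.0, 0.0])):
    """Angle (radians) between the bone's rest direction and its posed direction."""
    d = child - parent
    d = d / np.linalg.norm(d)
    return np.arccos(np.clip(d @ rest_dir, -1.0, 1.0))

R = np.eye(3)                                   # identity rotation for the demo
shoulder = map_to_virtual(np.array([0.1, 1.4, 0.2]), 1.0, R, np.zeros(3))
elbow    = map_to_virtual(np.array([0.1, 1.1, 0.2]), 1.0, R, np.zeros(3))
print(bone_rotation(shoulder, elbow))           # drives the virtual key point
```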
-
Patent number: 12056903
Abstract: Disclosed are a gated network-based generator, a generator training method, and a method for avoiding image coordinate adhesion. The generator processes, by using an image input layer, a to-be-processed image as an image sequence and inputs it to a feature encoding layer. Multiple feature encoding layers encode the image sequence by using a gated convolutional network to obtain an image code. Multiple image decoding layers then decode the image code by using an inverse gated convolution unit to obtain a target image sequence. Finally, an image output layer splices the target image sequence to obtain a target image. Character features in the resulting target image are more distinct, making details of the facial image of the generated digital human more vivid, thereby solving the problem of image coordinate adhesion in digital human images generated by existing generators using a generative adversarial network, and improving user experience.
Type: Grant
Filed: June 29, 2023
Date of Patent: August 6, 2024
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Maolin Zhang, Peiyu Wang
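The gated convolutional building block named in the abstract is commonly implemented as a feature branch modulated by a sigmoid gate; the decoder's inverse gated convolution unit would use transposed convolutions analogously. A minimal sketch of the encoding-side unit, with hypothetical channel sizes.

```python
# Hypothetical sketch of a gated convolution unit: the gate (in [0, 1])
# decides how much of each feature passes through.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 3, 64, 64)
print(GatedConv2d(3, 16)(x).shape)  # torch.Size([1, 16, 64, 64])
```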
-
Patent number: 12051400
Abstract: This application provides a synthetic audio output method and apparatus, a storage medium, and an electronic device. The method includes: inputting input text and a specified target identity identifier into an audio output model; extracting an identity feature sequence of a target identity by an identity recognition model; extracting a phoneme feature sequence corresponding to the input text by an encoding layer of a speech synthesis model; superimposing the identity feature sequence of the target identity onto the phoneme feature sequence and inputting the result into a variable adapter of the speech synthesis model; after duration prediction and alignment, energy prediction, and pitch prediction are performed on the phoneme feature sequence by the variable adapter, outputting a target Mel-frequency spectrum feature corresponding to the input text through a decoding layer of the speech synthesis model; and inputting the target Mel-frequency spectrum feature into a vocoder to output synthetic audio.
Type: Grant
Filed: February 7, 2024
Date of Patent: July 30, 2024
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Haie Wu, Ao Yao, Da Jiang, Yiping Tang
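The superimposition of the identity feature sequence onto the phoneme feature sequence is, in essence, an element-wise addition of embeddings. A minimal sketch of that step; the embedding tables, vocabulary sizes, and dimensions are hypothetical stand-ins for the identity recognition model and encoding layer.

```python
# Hypothetical sketch: add an identity embedding to every phoneme feature
# before the variance-adapter stage of a speech synthesis model.
import torch
import torch.nn as nn

num_phonemes, num_speakers, dim = 70, 8, 256
phoneme_embed = nn.Embedding(num_phonemes, dim)   # encoding-layer stand-in
identity_embed = nn.Embedding(num_speakers, dim)  # identity-model stand-in

phonemes = torch.randint(0, num_phonemes, (1, 12))  # encoded input text
speaker = torch.tensor([3])                          # target identity identifier

# Superimpose (add) the identity feature onto every phoneme feature.
x = phoneme_embed(phonemes) + identity_embed(speaker).unsqueeze(1)

# The variable adapter would then predict duration, energy, and pitch from x,
# and a decoding layer would output a Mel-frequency spectrum for the vocoder.
print(x.shape)  # torch.Size([1, 12, 256])
```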
-
Publication number: 20240169592
Abstract: Disclosed are a gated network-based generator, a generator training method, and a method for avoiding image coordinate adhesion. The generator processes, by using an image input layer, a to-be-processed image as an image sequence and inputs it to a feature encoding layer. Multiple feature encoding layers encode the image sequence by using a gated convolutional network to obtain an image code. Multiple image decoding layers then decode the image code by using an inverse gated convolution unit to obtain a target image sequence. Finally, an image output layer splices the target image sequence to obtain a target image. Character features in the resulting target image are more distinct, making details of the facial image of the generated digital human more vivid, thereby solving the problem of image coordinate adhesion in digital human images generated by existing generators using a generative adversarial network, and improving user experience.
Type: Application
Filed: June 29, 2023
Publication date: May 23, 2024
Applicant: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng SIMA, Maolin ZHANG, Peiyu WANG
-
Patent number: 11928767
Abstract: Embodiments of the present disclosure provide a method for audio-driven character lip sync, a model for audio-driven character lip sync, and a training method therefor. A target dynamic image is obtained by acquiring a character image of a target character and speech for generating the target dynamic image, processing the character image and the speech into trainable image-audio data, and mixing the image-audio data with auxiliary data for training. When a large amount of sample data needs to be obtained for training in different scenarios, a video of another character speaking is used as an auxiliary video and processed to obtain the auxiliary data. The auxiliary data, which replaces non-general sample data, and other data are input into the model in a preset ratio for training. The auxiliary data may improve the training of the model's synthetic lip sync action, so that no parts unrelated to the synthetic lip sync action arise during the training process.
Type: Grant
Filed: June 21, 2023
Date of Patent: March 12, 2024
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Zheng Liao
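Mixing auxiliary data with other data "in a preset ratio" can be sketched as ratio-controlled batch sampling. A minimal example under that assumption; the 20% ratio, batch size, and placeholder pairs are hypothetical, not values from the patent.

```python
# Hypothetical sketch: draw training batches in which a preset fraction of
# image-audio pairs comes from auxiliary videos of other speakers.
import random

def mixed_batch(target_pairs, auxiliary_pairs, batch_size=32, aux_ratio=0.2):
    """Draw a batch in which roughly aux_ratio of samples are auxiliary."""
    n_aux = int(batch_size * aux_ratio)
    batch = random.sample(auxiliary_pairs, n_aux) \
          + random.sample(target_pairs, batch_size - n_aux)
    random.shuffle(batch)
    return batch

target_pairs = [("target_img_%d" % i, "target_audio_%d" % i) for i in range(100)]
aux_pairs = [("aux_img_%d" % i, "aux_audio_%d" % i) for i in range(100)]
print(len(mixed_batch(target_pairs, aux_pairs)))  # 32
```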
-
Publication number: 20240054711
Abstract: Embodiments of the present disclosure provide a method for audio-driven character lip sync, a model for audio-driven character lip sync, and a training method therefor. A target dynamic image is obtained by acquiring a character image of a target character and speech for generating the target dynamic image, processing the character image and the speech into trainable image-audio data, and mixing the image-audio data with auxiliary data for training. When a large amount of sample data needs to be obtained for training in different scenarios, a video of another character speaking is used as an auxiliary video and processed to obtain the auxiliary data. The auxiliary data, which replaces non-general sample data, and other data are input into the model in a preset ratio for training. The auxiliary data may improve the training of the model's synthetic lip sync action, so that no parts unrelated to the synthetic lip sync action arise during the training process.
Type: Application
Filed: June 21, 2023
Publication date: February 15, 2024
Inventors: Huapeng Sima, Zheng Liao
-
Publication number: 20240054811
Abstract: Embodiments of this disclosure provide a mouth shape correction model, and model training and application methods. The model includes a mouth feature extraction module, a key point extraction module, a first video module, a second video module, and a discriminator. The training method includes: based on a first original video and a second original video, extracting corresponding features by using the various modules in the model to train the model; and when the model meets a convergence condition, completing the training to generate a target mouth shape correction model. The application method includes: inputting a video in which a mouth shape of a digital-human actor is to be corrected and corresponding audio into a mouth shape correction model, to obtain a video in which the mouth shape of the digital-human actor is corrected, wherein the mouth shape correction model is a model trained by using the training method.
Type: Application
Filed: June 21, 2023
Publication date: February 15, 2024
Applicant: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng SIMA, Guo YANG
-
Patent number: 11887403
Abstract: Embodiments of this disclosure provide a mouth shape correction model, and model training and application methods. The model includes a mouth feature extraction module, a key point extraction module, a first video module, a second video module, and a discriminator. The training method includes: based on a first original video and a second original video, extracting corresponding features by using the various modules in the model to train the model; and when the model meets a convergence condition, completing the training to generate a target mouth shape correction model. The application method includes: inputting a video in which a mouth shape of a digital-human actor is to be corrected and corresponding audio into a mouth shape correction model, to obtain a video in which the mouth shape of the digital-human actor is corrected, wherein the mouth shape correction model is a model trained by using the training method.
Type: Grant
Filed: June 21, 2023
Date of Patent: January 30, 2024
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Guo Yang
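The model's structure, feature-extraction branches feeding a corrector whose output a discriminator judges, can be sketched as follows; every module here is a toy stand-in for the patented ones, which operate on video frames and audio rather than flat vectors.

```python
# Hypothetical sketch: combine mouth features and key points, generate a
# corrected-frame representation, and score it with a discriminator.
import torch
import torch.nn as nn

mouth_features = torch.randn(1, 128)   # mouth feature extraction module (stand-in)
keypoints = torch.randn(1, 68 * 2)     # key point extraction module (stand-in)

generator = nn.Linear(128 + 68 * 2, 256)        # corrected-frame features
discriminator = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

corrected = generator(torch.cat([mouth_features, keypoints], dim=-1))
realism = discriminator(corrected)     # adversarial training signal
print(realism.item())                  # in (0, 1); train until convergence
```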
-
Patent number: 11875775
Abstract: The present disclosure proposes a speech conversion scheme for non-parallel corpus training, to remove dependence on parallel text and resolve the technical problem that speech conversion is difficult to achieve when resources and equipment are limited. A voice conversion system and a training method therefor are included. Compared with the prior art, according to the embodiments of the present disclosure: a trained speaker-independent automatic speech recognition model can be used for any source speaker, that is, the model is speaker-independent; and bottleneck features of audio are more abstract than phonetic posteriorgram features, reflect the decoupling of spoken content from the timbre of the speaker, and meanwhile are not closely bound to a phoneme class or in a clear one-to-one correspondence with it. In this way, the problem of inaccurate pronunciation caused by recognition errors in ASR is alleviated to some extent.
Type: Grant
Filed: April 20, 2021
Date of Patent: January 16, 2024
Assignee: Nanjing Silicon Intelligence Technology Co., Ltd.
Inventors: Huapeng Sima, Zhiqiang Mao, Xuefei Gong
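The key idea is taking content features from the bottleneck layer of a speaker-independent ASR model rather than from its phoneme posteriors. A minimal sketch, assuming a toy ASR with a linear bottleneck; module names and sizes are hypothetical.

```python
# Hypothetical sketch: a speaker-independent ASR's bottleneck layer yields
# content features decoupled from timbre, which a conversion model re-voices.
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    def __init__(self, n_mels=80, bottleneck=64, n_phonemes=70):
        super().__init__()
        self.encoder = nn.Linear(n_mels, bottleneck)     # bottleneck layer
        self.classifier = nn.Linear(bottleneck, n_phonemes)

    def forward(self, mels):
        bn = torch.relu(self.encoder(mels))  # abstract content features
        return bn, self.classifier(bn)

asr = TinyASR()
mels = torch.randn(1, 200, 80)               # any source speaker's audio
bottleneck_feats, _ = asr(mels)
print(bottleneck_feats.shape)  # torch.Size([1, 200, 64]); conversion-model input
```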
-
Fitness action recognition model, method of training model, and method of recognizing fitness action
Patent number: 11854306
Abstract: A model including: an information extraction layer that obtains image information of a training object in a depth image; a pixel point positioning layer that performs position estimation of the three-dimensional coordinates of human-body key points, defines a body part of the training object as a body component, and calibrates the three-dimensional coordinates of all human-body key points corresponding to the body component; a feature extraction layer that extracts a key-point position feature, a body moving-speed feature, and a key-point moving-speed feature for action recognition; a vector dimensionality reduction layer that combines the key-point position feature, the body moving-speed feature, and the key-point moving-speed feature into a multidimensional feature vector and performs dimensionality reduction on it; and a feature vector classification layer that classifies the dimensionality-reduced multidimensional feature vector to recognize a fitness action.
Type: Grant
Filed: June 28, 2023
Date of Patent: December 26, 2023
Assignee: Nanjing Silicon Intelligence Technology Co., Ltd.
Inventors: Huapeng Sima, Hao Jiang, Hongwei Fan, Qixun Qu, Jintai Luan, Jiabin Li
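The last two layers can be pictured as feature concatenation, dimensionality reduction, and classification. A minimal scikit-learn sketch under that reading; PCA and an SVM are illustrative choices, since the abstract does not name the reduction or classification algorithms.

```python
# Hypothetical sketch: concatenate the three feature families, reduce the
# multidimensional vector, and classify it into a fitness action.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
key_point_pos = rng.normal(size=(n, 17 * 3))    # key-point position feature
body_speed = rng.normal(size=(n, 3))            # body moving-speed feature
key_point_speed = rng.normal(size=(n, 17 * 3))  # key-point moving-speed feature
labels = rng.integers(0, 5, size=n)             # five example fitness actions

features = np.hstack([key_point_pos, body_speed, key_point_speed])
reduced = PCA(n_components=16).fit_transform(features)  # dimensionality reduction
clf = SVC().fit(reduced, labels)                         # classification layer
print(clf.predict(reduced[:3]))
```
-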
Patent number: 11847726
Abstract: A method for outputting a blend shape value includes: performing feature extraction on obtained target audio data to obtain a target audio feature vector; inputting the target audio feature vector and a target identifier into an audio-driven animation model; inputting the target audio feature vector into an audio encoding layer, determining an input feature vector of a next layer at a (2t-n)/2 time point based on input feature vectors of a previous layer between a t time point and a t-n time point, determining a feature vector having a causal relationship with the input feature vector of the previous layer as a valid feature vector, sequentially outputting target-audio encoding features, and inputting the target identifier into a one-hot encoding layer for binary vector encoding to obtain a target-identifier encoding feature; and outputting a blend shape value corresponding to the target audio data.
Type: Grant
Filed: July 22, 2022
Date of Patent: December 19, 2023
Assignee: Nanjing Silicon Intelligence Technology Co., Ltd.
Inventors: Huapeng Sima, Cuicui Tang, Zheng Liao
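One way to read the windowed rule, computing the next layer's vector at the midpoint (2t-n)/2 from the previous layer's vectors between t-n and t, is as a one-dimensional convolution with kernel size n+1, with the identity handled by one-hot encoding. A minimal sketch under that interpretation; the dimensions are hypothetical, and this is an illustration rather than the patented encoder.

```python
# Hypothetical sketch: each output frame of a Conv1d with kernel n+1
# depends on n+1 consecutive input frames, i.e. sits mid-window.
import torch
import torch.nn as nn

n = 4                                          # window length in frames
audio_feats = torch.randn(1, 32, 100)          # (batch, feature dim, time)
layer = nn.Conv1d(32, 32, kernel_size=n + 1)   # each output uses n+1 inputs
out = layer(audio_feats)
print(out.shape)                               # torch.Size([1, 32, 96])

# The target identifier is one-hot (binary vector) encoded separately:
identity = nn.functional.one_hot(torch.tensor([2]), num_classes=8).float()
print(identity)                                # target-identifier encoding feature
```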
-
Patent number: 11817079
Abstract: The present disclosure provides a GAN-based speech synthesis model, a training method, and a speech synthesis method. According to the speech synthesis method, to-be-converted text is obtained and converted into text phonemes, the phonemes are digitized to obtain text data, and the text data is converted into a text vector to be input into the speech synthesis model. In this way, target audio corresponding to the to-be-converted text is obtained. When a target Mel-frequency spectrum is generated by using the trained generator, the accuracy of the generated target Mel-frequency spectrum can reach that of a standard Mel-frequency spectrum. Through constant adversarial training of the generator against a discriminator, the acoustic losses of the target Mel-frequency spectrum are reduced, and the acoustic losses of the target audio generated from the target Mel-frequency spectrum are reduced as well, thereby improving the accuracy of the synthesized speech.
Type: Grant
Filed: June 16, 2023
Date of Patent: November 14, 2023
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Zhiqiang Mao
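The adversarial setup pits a generator producing Mel-frequency spectra against a discriminator trained on standard (real) spectra. A minimal sketch of the two losses; the linear generator and discriminator are toy stand-ins, not the patented architectures.

```python
# Hypothetical sketch of the adversarial losses in GAN-based synthesis:
# D learns to separate real from generated Mel frames; G learns to fool D.
import torch
import torch.nn as nn

G = nn.Linear(128, 80)                 # text vector -> Mel frame (stand-in)
D = nn.Sequential(nn.Linear(80, 1))    # real/fake logit (stand-in)
bce = nn.BCEWithLogitsLoss()

text_vec = torch.randn(16, 128)
real_mel = torch.randn(16, 80)         # standard Mel-frequency spectrum
fake_mel = G(text_vec)                 # target Mel-frequency spectrum

d_loss = bce(D(real_mel), torch.ones(16, 1)) + \
         bce(D(fake_mel.detach()), torch.zeros(16, 1))
g_loss = bce(D(fake_mel), torch.ones(16, 1))  # generator tries to fool D
print(d_loss.item(), g_loss.item())
```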
-
Patent number: 11763801
Abstract: Embodiments of the present application provide a method and system for outputting target audio, a readable storage medium, and an electronic device. The method includes: inputting source audio into a phonetic posteriorgram (PPG) classification network model to obtain a PPG feature vector, where the PPG feature vector indicates a phoneme label corresponding to each frame of the source audio and contains text information and prosodic information of the source audio; inputting the PPG feature vector into a voice conversion network model, and outputting an acoustic feature vector of target audio based on the phoneme label corresponding to the PPG feature vector, where the target audio contains a plurality of pieces of audio with different timbres; and inputting the acoustic feature vector of the target audio into a voice coder, and outputting the target audio through the voice coder.
Type: Grant
Filed: August 29, 2022
Date of Patent: September 19, 2023
Assignee: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng Sima, Zhiqiang Mao, Xuefei Gong
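The pipeline runs source audio through a PPG classification network, then a voice conversion network, and finally a voice coder. A minimal sketch of the first two stages; the linear modules, Mel-feature inputs, and dimensions are hypothetical stand-ins.

```python
# Hypothetical sketch: per-frame phoneme posteriors (PPG) from source audio,
# then acoustic features for the target timbre; a voice coder would render audio.
import torch
import torch.nn as nn

n_phonemes, n_mels = 70, 80
ppg_net = nn.Sequential(nn.Linear(n_mels, n_phonemes), nn.Softmax(dim=-1))
conversion_net = nn.Linear(n_phonemes, n_mels)   # PPG -> acoustic features

source_frames = torch.randn(1, 200, n_mels)      # source audio features
ppg = ppg_net(source_frames)                     # phoneme posteriors per frame
acoustic = conversion_net(ppg)                   # target-timbre acoustic features
print(ppg.shape, acoustic.shape)                 # voice coder would output audio
```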
-
Publication number: 20230215068
Abstract: A method for outputting a blend shape value includes: performing feature extraction on obtained target audio data to obtain a target audio feature vector; inputting the target audio feature vector and a target identifier into an audio-driven animation model; inputting the target audio feature vector into an audio encoding layer, determining an input feature vector of a next layer at a (2t-n)/2 time point based on input feature vectors of a previous layer between a t time point and a t-n time point, determining a feature vector having a causal relationship with the input feature vector of the previous layer as a valid feature vector, sequentially outputting target-audio encoding features, and inputting the target identifier into a one-hot encoding layer for binary vector encoding to obtain a target-identifier encoding feature; and outputting a blend shape value corresponding to the target audio data.
Type: Application
Filed: July 22, 2022
Publication date: July 6, 2023
Applicant: NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD.
Inventors: Huapeng SIMA, Cuicui TANG, Zheng LIAO
-
Publication number: 20230197061
Abstract: Embodiments of the present application provide a method and system for outputting target audio, a readable storage medium, and an electronic device. The method includes: inputting source audio into a phonetic posteriorgram (PPG) classification network model to obtain a PPG feature vector, where the PPG feature vector indicates a phoneme label corresponding to each frame of the source audio and contains text information and prosodic information of the source audio; inputting the PPG feature vector into a voice conversion network model, and outputting an acoustic feature vector of target audio based on the phoneme label corresponding to the PPG feature vector, where the target audio contains a plurality of pieces of audio with different timbres; and inputting the acoustic feature vector of the target audio into a voice coder, and outputting the target audio through the voice coder.
Type: Application
Filed: August 29, 2022
Publication date: June 22, 2023
Inventors: Huapeng Sima, Zhiqiang Mao, Xuefei Gong