ENTERING OF HUMAN FACE INFORMATION INTO DATABASE
A processor chip circuit is provided, which is used for entering human face information into a database and includes a circuit unit configured to perform the steps of: videoing one or more videoed persons and extracting human face information of the one or more videoed persons from one or more video frames during the videoing; recording a voice of at least one of the one or more videoed persons during the videoing; performing semantic analysis on the recorded voice so as to extract respective information therefrom; and associating the extracted information with the human face information of the videoed person who has spoken the extracted information, and entering the associated information into the database.
This application claims priority to and is a continuation of International Patent Application No. PCT/CN2019/104108, filed Sep. 3, 2019, which claims priority from Chinese Patent Application No. CN 201910686122.3, filed Jul. 29, 2019. The entire contents of the PCT/CN2019/104108 application are incorporated by reference herein in their entirety for all purposes.
TECHNICAL FIELD
The present disclosure relates to human face recognition, and in particular to a method for entering human face information into a database, and a processor chip circuit and a non-transitory computer readable storage medium.
DESCRIPTION OF THE RELATED ART
Human face recognition is a biometric recognition technology for recognition based on human face feature information. The human face recognition technology uses a video camera or a camera to capture an image or a video stream containing a human face, and automatically detects the human face in the image, thereby performing human face recognition on the detected human face. Establishing a human face information database is a prerequisite for human face recognition. In the process of entering human face information into the database, the information corresponding to the captured human face information is usually entered by a user of an image and video capturing device.
BRIEF SUMMARY OF THE INVENTION
An objective of the present disclosure is to provide a method for entering human face information into a database, and a processor chip circuit and a non-transitory computer readable storage medium.
According to an aspect of the present disclosure, a method for entering human face information into a database is provided, the method including: videoing one or more videoed persons and extracting human face information of the one or more videoed persons from one or more video frames during the videoing; recording a voice of at least one of the one or more videoed persons during the videoing; performing semantic analysis on the recorded voice so as to extract respective information therefrom; and associating the extracted information with the human face information of the videoed person who has spoken the extracted information, and entering the associated information into the database.
According to another aspect of the present disclosure, a processor chip circuit is provided, which is used for entering human face information into a database and includes a circuit unit configured to perform the steps of the method above.
According to yet another aspect of the present disclosure, a non-transitory computer readable storage medium is provided, the storage medium having stored thereon a program which contains instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the steps of the method above.
The accompanying drawings exemplarily show embodiments and constitute a part of the specification, and, together with the description, serve to explain exemplary implementations of the embodiments. The embodiments shown are merely for illustrative purposes and do not limit the scope of the claims. In all the figures, the same reference signs refer to similar but not necessarily identical elements.
In the present disclosure, unless otherwise stated, the terms “first”, “second”, etc., used to describe various elements are not intended to limit the positional, temporal or importance relationship of these elements, but rather only to distinguish one component from another. In some examples, the first element and the second element may refer to the same instance of the elements, while in some cases, based on contextual descriptions, they may also refer to different instances.
Hereinafter, a scenario with only one videoed person is first described in accordance with the steps of the flow chart.
In step S101, one videoed person is videoed and human face information of the videoed person is extracted from one or more video frames during the videoing.
Videoing can be done with the aid of a video camera, a camera or other video capturing units with an image sensor. When the videoed person is within a videoing range of a video capturing unit, the video capturing unit can automatically search for the human face by means of the human face recognition technology, and then extract the human face information of the videoed person for human face recognition.
The human face information includes human face feature information that can be used to identify the videoed person. The features that can be used by a human face recognition system include visual features, pixel statistical features, human face image transform coefficient features, human face image algebra features, and the like. For example, both the geometric description of the structural relationship among parts of the human face such as the eyes, the nose, the mouth and the chin, and the iris can be used as important features for recognizing the human face.
During human face recognition, the extracted human face information above is searched against and matched with human face information templates stored in the database, and the identity information of the human face is determined according to the degree of similarity. For example, the degree of similarity can be determined by a neural network trained through deep learning.
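By way of illustration only, and not as the claimed implementation: a deep-learning face recognizer typically reduces a face image to a fixed-length embedding vector, and the degree of similarity can then be computed as a cosine similarity against the stored templates. The sketch below assumes such embeddings are already available; the toy vectors and the threshold value are assumptions, not taken from the disclosure.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Degree of similarity between two face embeddings (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_face(embedding: np.ndarray, templates: dict, threshold: float = 0.6):
    """Search the extracted face embedding against stored templates.

    `templates` maps an identity label to its stored embedding vector.
    Returns the best-matching label, or None if no template is similar enough.
    """
    best_label, best_score = None, threshold
    for label, template in templates.items():
        score = cosine_similarity(embedding, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Example with toy 4-dimensional "embeddings" (a real model would output e.g. 128-D vectors).
templates = {"Wang Jun": np.array([0.9, 0.1, 0.0, 0.1])}
query = np.array([0.88, 0.12, 0.01, 0.09])
print(match_face(query, templates))  # -> "Wang Jun"
```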
In step S103, a voice of the videoed person during the videoing is recorded.
The voice can contain the speaker's own identity information; alternatively or additionally, the voice may also include information related to the scenario where the speaker himself/herself is located. For example, in a medical treatment scenario for a visually impaired person, the conversation content of a doctor may include not only the doctor's identity information, such as the name, his/her department and his/her position, but also effective voice information about a treatment mode, a medicine-taking way, etc.
The voice can be captured by an audio capturing unit such as a microphone. The videoed person actively speaks the information, e.g. his/her own identity information, such as “I am Wang Jun”. The identity information includes at least the name; however, depending on the purpose of the database, it may also include other information, such as age, place of birth, and the aforementioned department and position.
In step S105, semantic analysis is performed on the recorded voice, and corresponding information is extracted therefrom.
Extracting information from a voice can be realized with voice recognition technology, and the extracted information can be stored in the form of text. Based on voice databases for Chinese (including different dialects), English and other languages provided by voice recognition technology providers, information spoken in multiple languages can be recognized. As described above, the extracted information may be the speaker's own identity information; alternatively or additionally, the extracted information may also include information related to the scenario where the speaker himself/herself is located. It should be noted that the identity information extracted by means of semantic analysis is different from voiceprint information of the speaker.
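As a hedged sketch of this step only: the disclosure does not name a particular voice recognition engine, so the example below stands in for the engine with a hypothetical transcribe() stub and extracts identity fields from an English self-introduction with a simple pattern. A real system would rely on a provider's multi-language voice databases as described above.

```python
import re

def transcribe(audio_path: str) -> str:
    """Hypothetical stand-in for a speech-recognition engine; returns recognized text."""
    # In a real system this would call a voice recognition service or SDK.
    return "I am Wang Jun, from the cardiology department"

def extract_identity(text: str) -> dict:
    """Pull identity-related fields out of the recognized text.

    Only a toy pattern for English self-introductions is shown; the disclosure
    also contemplates Chinese (including dialects) and other languages.
    """
    info = {}
    name = re.search(r"\bI am ([A-Z][\w]+(?: [A-Z][\w]+)*)", text)
    if name:
        info["name"] = name.group(1)
    dept = re.search(r"from the ([\w ]+?) department", text)
    if dept:
        info["department"] = dept.group(1)
    return info

print(extract_identity(transcribe("doctor_intro.wav")))
# -> {'name': 'Wang Jun', 'department': 'cardiology'}
```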
The degree of cooperation of the videoed person may affect the result of voice recognition. It can be understood that if the videoed person has clearly spoken the information at an appropriate speed, the result of the voice recognition will be more accurate.
In step S107, the extracted information is associated with the human face information of the videoed person who has spoken the extracted information, and the associated information is entered into the database.
In the scenario with only one videoed person, it can be determined that the extracted human face information and the extracted information belong to the same videoed person, and then the extracted human face information is stored in association with the extracted information in the database. The extracted information is stored in the database in the form of text information.
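One minimal way to sketch the association and entry described here, purely as an assumption using Python's standard sqlite3 module (the table layout and field names are illustrative, not prescribed by the disclosure):

```python
import json
import sqlite3

conn = sqlite3.connect("face_db.sqlite")
conn.execute(
    """CREATE TABLE IF NOT EXISTS persons (
           id INTEGER PRIMARY KEY,
           face_embedding TEXT,   -- face feature information, serialized as JSON
           name TEXT,             -- identity information extracted from the voice
           extra_info TEXT        -- scenario-related information, stored as text
       )"""
)

def enter_into_database(face_embedding, extracted_info: dict) -> None:
    """Store the face information in association with the text extracted from the voice."""
    conn.execute(
        "INSERT INTO persons (face_embedding, name, extra_info) VALUES (?, ?, ?)",
        (
            json.dumps(list(face_embedding)),
            extracted_info.get("name"),
            json.dumps({k: v for k, v in extracted_info.items() if k != "name"}),
        ),
    )
    conn.commit()

enter_into_database([0.9, 0.1, 0.0, 0.1], {"name": "Wang Jun", "department": "cardiology"})
```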
The above-mentioned human face information entry method automatically recognizes the information broadcast by the videoed person and associates same with the human face information thereof, thereby reducing the risk of a user of a video capturing unit erroneously entering information (especially identity information) of a videoed person, and improving the efficiency of human face information entry. Moreover, the method according to the present disclosure makes it possible to simultaneously enter other information related to the scenario, and thus can satisfy the user's usage requirements in different scenarios.
The steps in the flow chart are described below in connection with a scenario including a plurality of videoed persons.
It should be understood that the human face recognition and voice recognition described above with respect to a single videoed person may be applied to each individual in the scenario including a plurality of videoed persons, respectively, and thus, related content will not be described again.
In step S101, a plurality of videoed persons are videoed and human face information of the videoed persons is extracted from one or more video frames during the videoing.
As shown in the top view of the scenario, a plurality of videoed persons are located within the videoing range of the human face information entry device.
In step S103, a voice of at least one of the plurality of videoed persons during the videoing is recorded.
The plurality of videoed persons can broadcast their own information in turn, and the recorded voices can be stored in a memory.
In step S105, semantic analysis is performed on each of the recorded voices, and respective information is extracted therefrom. It should be noted that, as described above, in addition to the identity information, the voice may also include information related to the scenario where the speaker is located, and such information may also be extracted by analyzing the voice, and stored in association with the human face information in the database. For the sake of brevity of explanation, the present disclosure will be illustrated below by taking the identity information in the voice as an example.
In step S107, the extracted information is associated with the human face information of the videoed person who has spoken the extracted information, and the associated information is entered into the database.
In the scenario including a plurality of videoed persons, it is possible to further distinguish between cases where only one person is speaking and cases where multiple persons are speaking simultaneously. Where multiple persons speaking at the same time interfere with one another so seriously that their voices cannot be distinguished, the voice recorded in the current scenario can be discarded and the voice recording can be performed again. When only one person is speaking, or when multiple persons are speaking but one sound can be distinguished from the others, the primary (or the only) sound in the recorded voice is analyzed to extract the corresponding information.
The association between the extracted respective information and the human face information can be implemented in the following two ways:
I. Sound Localization
In the scenario shown in the top view, the human face information entry device 200 is provided with an audio capturing unit 205.
The audio capturing unit 205 may be an array including three microphones, which are, for example, non-directional microphone elements that are highly sensitive to sound pressure.
In one arrangement, the three microphones of the audio capturing unit are distributed at different positions on the human face information entry device.
The form of the array of microphones is not limited to the patterns shown in the figures.
When one of the videoed persons 201, 202 and 203 broadcasts his/her own identity information, the sound waves produced by speaking propagate to the three microphones 305-1, 305-2 and 305-3 of the audio capturing unit. Because of their different positions, there are phase differences among the audio signals captured by the three microphones, and the direction of the sound source relative to the human face information entry device can be determined according to the information of the three phase differences.
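The disclosure only requires that the direction be determined from the phase differences among the microphone signals. One common way to realize this, shown here as a simplified sketch under a far-field assumption rather than as the claimed method, is to estimate the time difference of arrival between a pair of microphones by cross-correlation and convert it into a bearing angle:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumption)

def tdoa(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
    """Estimate the time difference of arrival between two microphone signals
    via the peak of their cross-correlation (positive => sound reaches A later)."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / sample_rate

def bearing_from_pair(delay: float, mic_spacing: float) -> float:
    """Far-field approximation: convert a pair-wise delay into an angle (radians)
    relative to the axis joining the two microphones."""
    ratio = np.clip(delay * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
    return float(np.arccos(ratio))

# Toy example: a 1 kHz tone arriving about 0.2 ms (9 samples) later at microphone A than at B.
sample_rate = 48_000
t = np.arange(0, 0.05, 1 / sample_rate)
sig_b = np.sin(2 * np.pi * 1000 * t)
sig_a = np.roll(sig_b, int(0.0002 * sample_rate))  # delayed copy
angle = bearing_from_pair(tdoa(sig_a, sig_b, sample_rate), mic_spacing=0.1)
print(np.degrees(angle))  # ~50 degrees relative to the microphone axis
```

With three microphones, two such pair-wise bearings can be combined to resolve the direction unambiguously in the plane.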
In the case of a different arrangement of microphones, the direction of the sound source can be determined in the same manner from the phase differences among the captured audio signals.
The video capturing units 304 and 404 can be used to map the real scenario where the videoed persons are located to the videoed scenario in terms of location. This mapping can be achieved by pre-setting reference markers 206 and 207 in the real scenario (in which case the distance from the video capturing unit to each reference marker is known), or by using a ranging function of the camera.
The use of camera ranging can be achieved in one of the following manners:
- 1) photographing multi-view images: in the case where the parameters of the camera of the video capturing units 304 and 404 are known, a sensor (such as a gyroscope) inside the device can be used to estimate the change in the angle of view of the camera and the displacement of the video capturing unit, thereby inferring the actual spatial distance corresponding to the displacement of pixels in the image (see the sketch after this list); or
- 2) photographing multiple images with different depths of focus (a depth-from-focus method), and then performing depth estimation by using the multiple images.
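As an illustrative aside for manner 1) above, using a simplified pinhole-camera relation with numbers chosen only for the example, the actual spatial distance can be inferred from the pixel displacement together with the camera displacement reported by the inertial sensor:

```python
def depth_from_displacement(focal_length_px: float,
                            camera_shift_m: float,
                            pixel_disparity: float) -> float:
    """Pinhole-camera relation used in multi-view ranging:
    an object at depth Z shifts by (focal_length * camera_shift / Z) pixels
    when the camera translates sideways by camera_shift metres, so
    Z = focal_length * camera_shift / pixel_disparity.
    """
    if pixel_disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * camera_shift_m / pixel_disparity

# Example: the camera intrinsics give a focal length of 1400 px; the gyroscope/IMU
# reports a 5 cm sideways displacement between two shots, and the videoed
# person's face shifts by 35 px between the two images.
print(depth_from_displacement(1400.0, 0.05, 35.0))  # -> 2.0 metres
```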
Based on the location mapping between the real scenario and the video scenario, the position in the videoed video frame that corresponds to a given location in the real scenario can be determined. For example, once the direction of the sound source has been determined, the videoed person located in that direction can be identified in the video frame, so that the spoken information can be associated with his/her human face information.
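A possible sketch of this mapping, assuming a simple pinhole camera whose optical axis is aligned with the microphone array's reference direction (the field of view, image width and bounding boxes below are illustrative values only):

```python
import math

def bearing_to_pixel_column(bearing_deg: float,
                            image_width_px: int,
                            horizontal_fov_deg: float) -> float:
    """Map a sound-source bearing (0 deg = optical axis, positive = to the right)
    to a horizontal pixel coordinate under a simple pinhole-camera model."""
    focal_px = (image_width_px / 2) / math.tan(math.radians(horizontal_fov_deg / 2))
    return image_width_px / 2 + focal_px * math.tan(math.radians(bearing_deg))

def face_at_bearing(faces, bearing_deg, image_width_px=1920, horizontal_fov_deg=70.0):
    """`faces` is a list of (label, x_min, x_max) face bounding boxes in pixels.
    Returns the label of the face whose box contains the projected column, if any."""
    col = bearing_to_pixel_column(bearing_deg, image_width_px, horizontal_fov_deg)
    for label, x_min, x_max in faces:
        if x_min <= col <= x_max:
            return label
    return None

faces = [("person_201", 200, 500), ("person_202", 800, 1100), ("person_203", 1400, 1700)]
print(face_at_bearing(faces, bearing_deg=-20.0))  # -> "person_201" (to the left of centre)
```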
II. Capturing Lip Movements
The sound localization described above involves the association between audio and video in terms of spatial location, whereas the implementation of capturing lip movements involves the association between audio and video in terms of time.
It is beneficial to simultaneously start the video capturing unit and the audio capturing unit, and separately record video and audio.
In the recorded video, the video frames are stored with timestamps so that they can be aligned on a common time axis with the audio signal recorded by the audio capturing unit.
When the audio capturing unit detects that an audio signal is entered in a time interval from t1 to t2, and the identity information can be extracted therefrom effectively (i.e. the signal is not mere noise), the human face information entry devices 200, 300 and 400 retrieve the recorded video frames, and compare a frame 502 at time t1 with a frame 501 at a previous time (for example, 100 ms earlier). By this comparison, it can be determined that the lips of the videoed person on the left side show an obvious opening action in the frame 502. Similarly, a frame 503 at time t2 is compared with a frame 504 at a later time (for example, 100 ms later), and it can be determined that the videoed person on the left side has closed his/her lips again in the frame 504.
Based on this high degree of temporal consistency, it can be determined that the identity information captured by the audio capturing unit within the time interval from t1 to t2 should be associated with the videoed person on the left side.
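A minimal sketch of this temporal-consistency check, assuming a hypothetical per-frame lip-openness measurement supplied by a facial-landmark detector (the threshold, tolerance and sample values are assumptions):

```python
def speaking_interval(openness_by_time, open_threshold=0.3):
    """Given (timestamp, openness) samples for one person's lips, return the
    (start, end) of the interval in which the mouth is noticeably open, or None."""
    open_times = [t for t, o in openness_by_time if o >= open_threshold]
    return (min(open_times), max(open_times)) if open_times else None

def matches_audio(lip_interval, audio_start, audio_end, tolerance=0.2):
    """True when the lip-movement interval starts and ends within `tolerance`
    seconds of the interval in which the voice was recorded (t1 to t2)."""
    if lip_interval is None:
        return False
    start, end = lip_interval
    return abs(start - audio_start) <= tolerance and abs(end - audio_end) <= tolerance

# Openness values would come from per-frame lip landmarks of each videoed person.
samples_left = [(0.9, 0.05), (1.0, 0.45), (1.5, 0.6), (2.0, 0.4), (2.1, 0.05)]
samples_right = [(0.9, 0.04), (1.5, 0.06), (2.1, 0.05)]
print(matches_audio(speaking_interval(samples_left), audio_start=1.0, audio_end=2.0))   # True
print(matches_audio(speaking_interval(samples_right), audio_start=1.0, audio_end=2.0))  # False
```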
The above method of associating identity information with human face information by capturing lip movements can not only be used to reinforce sound localization, but can also be used on its own as an alternative to sound localization.
By associating the identity information with the human face information, it is possible to enter information of a plurality of videoed persons during the same videoing, which further saves the time required for human face information entry. It can also assist a visually impaired person in quickly grasping the identity information of the persons present at a large conference or social occasion, and in storing the identity information of strangers in the database in association with the corresponding human face information. Once the database has been established, the position of the speaker in a video frame can be determined during a subsequent conversation by the localization technique explained above, and human face recognition can be performed to provide the visually impaired person with the identity information of the current speaker, for example through a loudspeaker, thereby providing great convenience for the visually impaired person to participate in normal social activities.
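As a sketch of how such an assistive lookup might be wired together once the database exists (the similarity measure, threshold and loudspeaker stub are assumptions, not part of the disclosure):

```python
def identify_speaker(face_embedding, database, threshold=0.6):
    """Look the localized speaker's face embedding up in the already-built database.

    `database` maps a name to its stored embedding; similarity is taken here as a
    simple dot product purely for illustration."""
    best_name, best_score = None, threshold
    for name, template in database.items():
        score = sum(a * b for a, b in zip(face_embedding, template))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

def announce(name):
    """Stand-in for text-to-speech output through the device's loudspeaker."""
    print(f"[loudspeaker] The current speaker is {name}." if name
          else "[loudspeaker] The current speaker is not in the database.")

database = {"Wang Jun": [0.9, 0.1, 0.0, 0.42]}
# Embedding of the face found at the localized speaker position in the frame:
announce(identify_speaker([0.88, 0.12, 0.01, 0.45], database))
```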
Moreover, in the scenario where multiple persons are speaking, it is also possible to analyze the corresponding semantics from the videoed lip movements, to split the different sound sources by means of the audio capturing unit, and to compare the semantics analyzed from the lip movements with the information of each single channel of sound source split by the audio capturing unit, so as to perform the association.
Unlike the implementation described above, in a second implementation the extracted human face information is first compared with the database before semantic analysis is performed on the recorded voices.
In step S601, one or more videoed persons are videoed and human face information of the one or more videoed persons is extracted from one or more video frames, and voices of the one or more videoed persons are recorded.
In step S602, the extracted human face information is compared with the human face information templates that are already stored in the database.
If it is determined that the human face information has been stored in the database, the process proceeds to step S605 to exit a human face information entry mode.
If it is determined that the human face information has not been stored in the database, the process proceeds to step S603 to perform semantic analysis on the voices of the one or more videoed persons that were recorded in step S601, and to extract respective information from the voices.
Preferably, when the name to be entered is already stored in the database (but the corresponding human face information is different), the name to be entered may first be distinguished and then entered into the database. For example, when “Wang Jun” is already in the database, “Wang Jun 2” is entered so as to distinguish it from the “Wang Jun” that has already been entered, so that during a subsequent broadcast to a user, a different voice information code name is used to enable the user to distinguish between different human face information.
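A small illustrative helper for generating such a distinguishing code name (the suffix scheme beyond “Wang Jun 2” is an assumption):

```python
def distinguishing_name(spoken_name: str, existing_names) -> str:
    """Return `spoken_name` unchanged if it is free; otherwise append the smallest
    numeric suffix that makes it unique, e.g. "Wang Jun" -> "Wang Jun 2"."""
    existing = set(existing_names)
    if spoken_name not in existing:
        return spoken_name
    suffix = 2
    while f"{spoken_name} {suffix}" in existing:
        suffix += 1
    return f"{spoken_name} {suffix}"

print(distinguishing_name("Wang Jun", ["Wang Jun"]))                # -> "Wang Jun 2"
print(distinguishing_name("Wang Jun", ["Wang Jun", "Wang Jun 2"]))  # -> "Wang Jun 3"
```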
In step S604, the extracted information is associated with the human face information and entered into the database. The manner in which the sound and the human face are associated, as described above, is equally applicable here.
According to the second implementation, the efficiency of entering the extracted information and the human face information can be further improved.
It should be noted that the respective information, including the identity information, extracted according to the present disclosure is text information recognized from voice information in an audio format, and therefore the above information is stored in the database as text information rather than voice information.
In step S701, one or more videoed persons are videoed and human face information of the one or more videoed persons is extracted from one or more video frames during the videoing.
In step S703, semantic analysis is performed on a voice of a videoed person during the videoing, and the voice can contain the speaker's own identity information.
In step S705, it is determined whether the extracted human face information is already in the database.
If it is found, after determination, that the relevant human face information has not been stored in the database, the process proceeds to step S707, and the extracted information is stored in the database in association with the human face information. Here, the manner in which the sound and the human face are associated, as described above, is equally applicable.
If it is found, after determination, that the relevant human face information is already stored in the database, the process proceeds to S710 to further determine whether the extracted information can supplement the information that is already in the database. For example, the name of the videoed person already exists in the database, and the extracted information further includes other information such as age and place of birth, or new information related to the scenario where the speaker is located.
If there is no other information that can be supplemented to the database, the process proceeds to S711 to exit the human face information entry mode.
If there is other information that can be supplemented to the database, the process proceeds to S712 to store the information that can be supplemented in the database.
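A compact sketch of this supplement step (the record layout and field names are assumptions, not taken from the disclosure); only fields that are not yet stored for the matched person are added:

```python
def supplement_record(stored: dict, extracted: dict) -> dict:
    """Merge newly extracted information into an existing database record.

    Fields already present in the stored record are kept as they are; only
    genuinely new fields (e.g. age, place of birth, scenario-related notes)
    are supplemented. Returns the dict of fields that were actually added."""
    added = {k: v for k, v in extracted.items() if k not in stored and v is not None}
    stored.update(added)
    return added

record = {"name": "Wang Jun", "face_embedding": [0.9, 0.1, 0.0, 0.1]}
new_info = {"name": "Wang Jun", "age": 42, "department": "cardiology"}
print(supplement_record(record, new_info))  # -> {'age': 42, 'department': 'cardiology'}
# An empty result would mean there is nothing to supplement and the entry mode can exit.
```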
According to the third implementation, a more comprehensive identity information database can be acquired with higher efficiency.
The computing device 2000 may include elements connected to a bus 2002 or in communication with a bus 2002 (possibly via one or more interfaces). For example, the computing device 2000 can include a bus 2002, one or more processors 2004, one or more input devices 2006, and one or more output devices 2008. The one or more processors 2004 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more dedicated processors (e.g., special processing chips). The input device 2006 can be any type of device capable of inputting information to the computing device 2000, and can include, but is not limited to, a camera. The output device 2008 can be any type of device capable of presenting information, and can include, but is not limited to, a loudspeaker, an audio output terminal, a vibrator, or a display. The computing device 2000 may also include a non-transitory storage device 2010 or be connected to a non-transitory storage device 2010. The non-transitory storage device may be any storage device that is non-transitory and capable of implementing data storage, and may include, but is not limited to, a disk drive, an optical storage device, a solid-state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape, or any other magnetic medium, an optical disk or any other optical medium, a read-only memory (ROM), a random access memory (RAM), a cache memory and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions and/or code. The non-transitory storage device 2010 can be detached from an interface. The non-transitory storage device 2010 may have data/programs (including instructions)/code for implementing the above methods and steps. The computing device 2000 can also include a communication device 2012. The communication device 2012 may be any type of device or system that enables communication with an external device and/or a network, and may include, but is not limited to, a wireless communication device and/or a chipset, e.g., a Bluetooth device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device and/or the like.
The computing device 2000 may also include a working memory 2014, which may be any type of working memory that stores programs (including instructions) and/or data useful to the working of the processor 2004, and may include, but is not limited to, a random access memory and/or a read-only memory.
Software elements (programs) may be located in the working memory 2014, and may include, but are not limited to, an operating system 2016, one or more applications 2018, drivers, and/or other data and code. The instructions for executing the above methods and steps may be included in the one or more applications 2018.
When the computing device 2000 is used to implement the human face information entry method described above, the processor 2004 executes the instructions contained in the one or more applications 2018 so as to perform the steps of the method.
It should also be understood that the components of the computing device 2000 can be distributed over a network. For example, some processing may be executed by one processor while other processing may be executed by another processor away from the one processor.
Other components of the computing device 2000 may also be similarly distributed. As such, the computing device 2000 can be interpreted as a distributed computing system that performs processing at multiple positions.
Although the embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems and devices described above are merely exemplary embodiments or examples, and the scope of the present disclosure is not limited by these embodiments or examples, but is defined only by the claims and their equivalent scopes. Various elements in the embodiments or examples may be omitted or substituted by equivalent elements thereof. Moreover, the steps may be executed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. It should be noted that, as the technology evolves, many elements described herein may be replaced with equivalent elements that appear after the present disclosure.
Claims
1. A processor chip circuit for entering human face information into a database, comprising:
- a circuit unit coupled with an auxiliary wearable device configured for being worn by a visually impaired person, the circuit unit being configured to perform, from the auxiliary wearable device, the steps of:
- videoing one or more videoed persons and extracting human face information of the one or more videoed persons from one or more video frames during the videoing;
- recording a voice of at least one of the one or more videoed persons during the videoing, wherein the voice of the at least one videoed person comprises identity information of a speaker that is spoken by the speaker;
- performing semantic analysis on the recorded voice so as to extract respective information therefrom, wherein the extracted respective information comprises the identity information of the speaker; and
- associating the extracted information with the human face information of the videoed person who has spoken the extracted information, and entering the associated information into the database,
- wherein the circuit unit is further configured to perform, from the auxiliary wearable device, the step of:
- accessing, during a conversation participated in by the visually impaired person and at least one of the one or more videoed persons, the database having the entered associated information to provide the visually impaired person with the identity information of the speaker.
2-3. (canceled)
4. The processor chip circuit according to claim 1, wherein associating the extracted information with the human face information of the videoed person who has spoken the extracted information comprises:
- analyzing the movement of lips of the one or more videoed persons from the one or more video frames during the videoing.
5. The processor chip circuit according to claim 4, wherein a start time of the movement of the lips is compared with a start time at which the voice is recorded.
6. A method for entering human face information into a database, comprising:
- videoing, from an auxiliary wearable device configured for being worn by a visually impaired person, one or more videoed persons and extracting human face information of the one or more videoed persons from one or more video frames during the videoing;
- recording, from the auxiliary wearable device, a voice of at least one of the one or more videoed persons during the videoing, wherein the voice of the at least one videoed person comprises identity information of a speaker that is spoken by the speaker;
- performing, from the auxiliary wearable device, semantic analysis on the recorded voice so as to extract respective information therefrom, wherein the extracted respective information comprises the identity information of the speaker; and
- associating, from the auxiliary wearable device, the extracted information with the human face information of the videoed person who has spoken the extracted information, and entering the associated information into the database,
- wherein the method further comprises:
- accessing, from the auxiliary wearable device, the database having the entered associated information to provide the visually impaired person with the identity information of the speaker, during a conversation participated in by the visually impaired person and at least one of the one or more videoed persons.
7. The method according to claim 6, wherein the human face information comprises face feature information for identifying the one or more videoed persons.
8. The method according to claim 6, wherein the voice of the at least one videoed person comprises identity information of a speaker, and the extracted respective information comprises the identity information of the speaker.
9. The method according to claim 6, wherein the identity information of the speaker comprises a name of the speaker.
10. The method according to claim 6, wherein the voice of the at least one videoed person comprises information related to a scenario where a speaker is located, and the extracted respective information comprises the information related to the scenario where the speaker is located.
11-12. (canceled)
13. The method according to claim 6, wherein associating the extracted information with the human face information of the videoed person who has spoken the extracted information comprises:
- analyzing the movement of lips of the one or more videoed persons from the one or more video frames during the videoing.
14. The method according to claim 13, wherein a start time of the movement of the lips is compared with a start time at which the voice is recorded.
15. The method according to claim 6, wherein it is detected whether the human face information of the at least one videoed person is already stored in the database, and if the human face information of the at least one videoed person is not in the database, the recorded voice is analyzed.
16. The method according to claim 6, wherein it is detected whether the human face information of the at least one videoed person is already stored in the database, and if the human face information of the at least one videoed person has been stored in the database, the extracted information is used to supplement information associated with the human face information of the at least one videoed person that is already stored in the database.
17. The method according to claim 6, wherein the extracted information is stored in the database as text information.
18. A non-transitory computer readable storage medium storing a program which comprises instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the steps of:
- videoing, from an auxiliary wearable device configured for being worn by a visually impaired person, one or more videoed persons and extracting human face information of the one or more videoed persons from one or more video frames during the videoing;
- recording, from the auxiliary wearable device, a voice of at least one of the one or more videoed persons during the videoing, wherein the voice of the at least one videoed person comprises identity information of a speaker that is spoken by the speaker;
- performing, from the auxiliary wearable device, semantic analysis on the recorded voice so as to extract respective information therefrom, wherein the extracted respective information comprises the identity information of the speaker; and
- associating, from the auxiliary wearable device, the extracted information with the human face information of the videoed person who has spoken the extracted information, and entering the associated information into the database,
- wherein the instructions, when executed by the processor, further cause the electronic device to perform the step of:
- accessing, from the auxiliary wearable device, the database having the entered associated information to provide the visually impaired person with the identity information of the speaker, during a conversation participated in by the visually impaired person and at least one of the one or more videoed persons.
19. (canceled)
20. The non-transitory computer readable storage medium according to claim 18, wherein associating the extracted information with the human face information of the videoed person who has spoken the extracted information comprises:
- analyzing the movement of lips of the one or more videoed persons from the one or more video frames during the videoing.
Type: Application
Filed: Nov 8, 2019
Publication Date: Feb 4, 2021
Applicant: NEXTVPU (SHANGHAI) CO., LTD. (Shanghai)
Inventors: Haijiao Cai (Shanghai), Xinpeng Feng (Shanghai), Ji Zhou (Shanghai)
Application Number: 16/678,838