Mobile face capture and image processing system and method

Info

Publication number: 20050083248
Type: Application
Filed: Aug 9, 2004
Publication Date: Apr 21, 2005
Inventors: Frank Biocca (East Lansing, MI), Jannick Rolland (Chuluota, FL), George Stockman (Okemos, MI), Chandan Reddy (Ithaca, NY), Miguel Figueroa (Lansing, MI)
Application Number: 10/914,621

Abstract

Image processing procedures include receiving at least two side view images of a face of a user. In other aspects, side view images are warped and blended into an output image of a face of a user as if viewed from a virtual point of view. In further aspects, a virtual video is produced in real time of output images from a video feed of side view images.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. patent application Ser. No. 09/748,761 filed on Dec. 22, 2000. The disclosure of the above application is incorporated herein by reference in its entirety for any purpose.

FIELD OF THE INVENTION

The present invention generally relates to computer-based teleconferencing in a networked virtual reality environment, and more particularly to mobile face capture and image processing.

BACKGROUND OF THE INVENTION

Networked virtual environments allow users at remote locations to use a telecommunication link to coordinate work and social interaction. Teleconferencing systems and virtual environments that use 3D computer graphics displays and digital video recording systems allow remote users to interact with each other, to view virtual work objects such as text, engineering models, medical models, play environments and other forms of digital data, and to view each other's physical environment.

A number of teleconferencing technologies support collaborative virtual environments which allow interaction between individuals in local and remote sites. For example, video-teleconferencing systems use simple video screens and wide screen displays to allow interaction between individuals in local and remote sites. However, wide screen displays are disadvantageous because virtual 3D objects presented on the screen are not blended into the environment of the room of the users. In such an environment, local users cannot have a virtual object between them. This problem applies to representation of remote users as well. The location of the remote participants cannot be anywhere in the room or the space around the user, but is restricted to the screen.

Networked immersive virtual environments also present various disadvantages. Networked immersive virtual reality systems are sometimes used to allow remote users to connect via a telecommunication link and interact with each other and virtual objects. In many such systems the users must wear a virtual reality display where the user's eyes and a large part of the face are occluded. Because these systems only display 3D virtual environments, the user cannot see both the physical world of the site in which they are located and the virtual world which is displayed. Furthermore, people in the same room cannot see each other's full faces and eyes, so local interaction is diminished. Because the face is occluded, such systems cannot capture and record a full stereoscopic view of remote users' faces.

Another teleconferencing system is termed CAVES. CAVES systems use multiple screens arranged in a room configuration to display virtual information. Such systems have several disadvantages. In CAVES systems, there is only one correct viewpoint; all other local users have a distorted perspective on the virtual scene. Scenes in the CAVES are only projected on a wall, so two local users can view a scene on the wall, but an object cannot be presented in the space between users. These systems also use multiple rear screen projectors, and therefore are very bulky and expensive. Additionally, CAVES systems may also utilize stereoscopic screen displays. Stereoscopic screen display systems do not present 3D stereoscopic views that interpose 3D objects between local users of the system. These systems sometimes use 3D glasses to present a 3D view, but only one viewpoint is shared among many users, often with perspective distortions.

Consequently, there is a need for an augmented reality display that mitigates the above mentioned disadvantages and has the capability to display virtual objects and environments, superimpose virtual objects on the “real world” scenes, provide “face-to-face” recording and display, be used in various ambient lighting environments, and correct for optical distortion, while minimizing computational power and time.

Faces have been captured passively in rooms instrumented with a set of cameras, where stereo computations can be done using selected viewpoints. Other objects can be captured using the same methods. Such hardware configurations are unavailable for mobile use in arbitrary environments, however. Other work has shown that faces can be captured using a single camera and processing that uses knowledge of the human face. Either the face has to move relative to the camera, or assumptions of symmetry are employed. Our approach is to use two cameras affixed to the head, which is necessary to convey non-symmetrical facial expressions, such as the closing of one eye and not the other, or the reflection of a fire on only one side of the face.

There is little overlap in the images taken from outside the user's central field of view, so the frontal view synthesized is a novel view. In previous work, novel views have been synthesized by a panoramic system and/or by interpolating between a set of views. Producing novel views in a dynamic scenario was successfully shown for a highly rigid motion. This work extended interpolation techniques to the temporal domain from the spatial domain. A novel view at a new time instant was generated by interpolating views at nearby time intervals using spatio-temporal view interpolation, where a dynamic 3-D scene is modeled and novel views are generated at intermediate time intervals.

There remains a need for a way to generate in real time a synthetic frontal view of a human face from two real side views.

SUMMARY OF THE INVENTION

In accordance with the present invention, image processing procedures include receiving at least two side view images of a face of a user. In other aspects, side view images are warped and blended into an output image of a face of a user as if viewed from a virtual point of view. In further aspects, a video is produced in real time of output images from a video feed of side view images.

In yet other aspects, a teleportal system is provided. A principal feature of the teleportal system is that single or multiple users at a local site and a remote site use a telecommunication link to engage in face-to-face interaction with other users in a 3D augmented reality environment. Each user utilizes a system that includes a display such as a projection augmented-reality display and sensors such as a stereo facial expression video capture system. The video capture system allows the participants to view a 3D, stereoscopic, video-based image of the face of all remote participants and hear their voices, view unobstructed the local participants, and view a room that blends physical with virtual objects with which users can interact and manipulate.

In one preferred embodiment of the system, multiple local and remote users can interact in a room-sized space draped in a fine grained retro-reflective fabric. An optical tracker preferably having markers attached to each user's body and digital video cameras at the site records the location of each user at a site. A computer uses the information about each user's location to calculate the user's body location in space and create a correct perspective on the location of the 3D virtual objects in the room.

The projection augmented-reality display projects stereo images towards a screen which is covered by a fine grain retro-reflective fabric. The projection augmented-reality display uses an optics system that preferably includes two miniature source displays, and projection-optics, such as a double Gauss form lens combined with a beam splitter, to project an image via light towards the surface covered with the retro-reflective fabric. The retro-reflective fabric retro-reflects the projected light brightly and directly back to the eyes of the user. Because of the properties of the retro-reflective screen and the optics system, each eye receives the image from only one of the source displays. The user perceives a 3D stereoscopic image apparently floating in space. The projection augmented-reality display and video capture system does not occlude vision of the physical environment in which the user is located. The system of the present invention allows users to see both virtual and physical objects, so that the objects appear to occupy the same space. Depending on the embodiment of the system, the system can completely immerse the user in a virtual environment, or the virtual environment can be restricted to a specific region in space, such as a projection window or table top. Furthermore, the restricted regions can be made part of an immersive wrap-around display.

Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will become more fully understood from the detailed description and the accompanying drawings, wherein:

FIG. 1 is a plan view of a first preferred embodiment of a teleportal system of the present invention showing one local user at a first site and two remote users at a second site;

FIG. 2 is a block diagram depicting the teleportal system of the present invention;

FIG. 3 is a perspective view of the illumination system for a projection user-mounted display of the present invention;

FIG. 4 is a perspective view of a first preferred embodiment of a vertical architecture of the illumination system for the projection user-mounted display of the present invention;

FIG. 5 is a perspective view of a second preferred embodiment of a horizontal architecture of the illumination system for the projection user-mounted display of the present invention;

FIG. 6 is a diagram depicting an exemplary optical pathway associated with a projection user-mounted display of the present invention;

FIG. 7 is a side view of a projection lens used in the projection augmented-reality display of the present invention;

FIG. 8 is a side view of the projection augmented-reality display of FIG. 4 mounted into a headwear apparatus;

FIG. 9 is a perspective view of the video system in the teleportal headset of the present invention;

FIG. 10 is a side view of the video system of FIG. 9;

FIG. 11 is a top view of a video system of FIG. 9;

FIG. 12a is an alternate embodiment of the teleportal site of the present invention with a wall screen;

FIG. 12b is another alternate embodiment of the teleportal site of the present invention with a spherical screen;

FIG. 12c is yet another alternate embodiment of the teleportal site of the present invention with a hand-held screen;

FIG. 12d is yet another alternate embodiment of the teleportal site of the present invention with body shaped screens;

FIG. 13 is a first preferred embodiment of the projection augmented-reality display of the present invention;

FIG. 14 is a side view of the projection augmented-reality display of FIG. 13;

FIG. 15 is a view of a face capture concept and images from a prototype head mounted display unit;

FIG. 16 is a view of an experimental prototype of a face capture system;

FIG. 17 is a view demonstrating behavior of a grid pattern;

FIG. 18 is a view of face images captured during a calibration stage;

FIG. 19 is a block diagram of an off-line calibration stage during synthesis of a virtual frontal view;

FIG. 20 is a block diagram of an operational stage during synthesis of a virtual frontal view;

FIG. 21 is a set of views illustrating generation of a frontal view during a calibration stage and reconstruction of the frontal image from a side view using a grid: (a) left image captured during the calibration stage; (b) operational left image warped into virtual image plus calibration stripes; and (c) operational left image without stripes;

FIG. 22 is a set of views illustrating: (a) a frontal view obtained from a camcorder; and (b) a virtual frontal view obtained as a reconstructed frontal view from transformation tables and a side image of FIG. 21(c);

FIG. 23 is a set of views of images considered for objective evaluation with a top row of real video frames compared to a bottom row of virtual video frames;

FIG. 24 is a set of views of a real video image on the left compared to a corresponding virtual video image on the right, wherein facial regions are compared using cross-correlation;

FIG. 25 is a set of views of a real video image on the left compared to a corresponding virtual video image on the right, wherein distances between facial feature points are considered using a Euclidean distance measure;

FIG. 26 is a set of views with a top row showing images captured using a left camera, a second row showing images captured using a right camera; a third row showing images captured using a camcorder placed in front of the face, and a final row showing virtual frontal views generated from images in the first two rows;

FIG. 27 is a set of views illustrating synchronization of eyelids during blinking, with real video displayed in a top row and virtual video illustrated in a bottom row; and

FIG. 28 is a view identifying some feature points in a side image and a set of triangles formed using the feature points as vertices.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following description of the preferred embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.

FIG. 1 depicts a teleportal system 100 using two display sites 101 and 102. Teleportal system 100 includes a first teleportal site or local site 101 and a second teleportal site or remote site 102. It should be appreciated that additional teleportal sites can be included in teleportal system 100. Although first teleportal site 101 is described in detail below, it should further be appreciated that the second teleportal site 102 can be identical to the first teleportal site 101. It should also be noted that the number of users and types of screens can vary at each site.

Teleportal sites 101 and 102 preferably include a screen 103. Screen 103 is made of a retro-reflective material such as beads-based or corner-cube based materials manufactured by 3M® and Reflexite Corporation. The retro-reflective material is preferably gold which produces a bright image with adequate resolution. Alternatively, other material which has metallic fiber adequate to reflect at least a majority of the image or light projected onto its surface may be used. The retro-reflective material preferably provides about 98 percent reflection of the incident light projected onto its surface. The material retro-reflects light projected onto its surface directly back upon its incident path and to the eyes of the user. Screen 103 can be a surface of any shape, including but not limited to a plane, sphere, pyramid, and body-shaped, for example, like a glove for a user's hand or a body suit for the entire body. Screen 103 can also be formed to a substantially cubic shape resembling a room, preferably similar to four walls and a ceiling which generally surround the users. In the preferred embodiment, screen 103 forms four walls which surround users 110. 3D graphics are visible via screen 103. Because the users can see 3D stereographic images, text, and animations, all surfaces that have retro-reflective property in the room or physical environment can carry information. For example, a spherical screen 104 is disposed within the room or physical environment for projecting images. The room or physical environment may include physical objects substantially unrelated to the teleportal system 100. For example, physical objects may include furniture, walls, floors, ceilings and/or other inanimate objects.

With a continued reference to FIG. 1, local site 101 includes a tracking system 106. Tracking system 106 is preferably an optical or optical/hybrid tracking system which may include at least one digital video camera or CCD camera. By way of example, four digital video cameras 114, 116, 118 and 120 are shown. By way of another example, several sets of three CCD arrays stacked up could be used for optical tracking. Visual processing software (not shown) processes teleportal site data acquired from digital video cameras 114, 116, 118 and 120. The software provides the data to the networked computer 107a. Teleportal site data, for example, includes the position of users within the teleportal room.

Optical tracking system 106 further includes markers 96 that are preferably attached to one or more body parts of the user. In the preferred embodiment, markers 96 are coupled to each user's hand, which is monitored for movement and position. Markers 96 communicate marker location data regarding the location of the user's head and hands. It should be appreciated that the location of any other body part of the user or object to which a marker is attached can be acquired.

Users 110 wear a novel teleportal headset 105. Each headset preferably has displays and sensors. Each teleportal headset 105 communicates with a networked computer. For example, teleportal headsets 105 of site 101 communicate with networked computer 107a. Networked computer 107a communicates with a networked computer 107b of site 102 via a networked data system 99. In this manner, teleportal headsets can exchange data via the networked computers. It should be appreciated that teleportal headset 105 can be connected via a wireless connection to the networked computers. It should also be appreciated that headset 105 can alternatively communicate directly to networked data system 99. Networked data system 99 can be, for example, the Internet, a dedicated telecommunication line connecting the two sites, or a wireless network connection.

FIG. 2 is a block diagram showing the components for processing and distribution of information of the present invention teleportal system 100. It should be appreciated that information can be processed and distributed from other sources that provide visual data which can be projected by teleportal system 100. Examples include digital pictures of body parts, images acquired via medical imaging technology, and images of other three dimensional (3D) objects. Teleportal headset 105 includes at least one sensor array 220 which identifies and transmits the user's behavior. In the preferred embodiment, sensor array 220 includes a facial capture system 203 (described in further detail with reference to FIGS. 9, 10, and 11) that senses facial expression, an optical tracking system 106 that senses head motion, and a microphone 204 that senses voice and communication noise. It should be appreciated that other attributes of the user's behavior can be identified and transmitted by adding additional types of sensors.

Each of sensors 203, 106 and 204 is preferably connected to networked computer 107 and sends signals to the networked computer. However, it should be appreciated that sensors 203, 106 and 204 can directly communicate with a networked data system 99. Facial capture system 203 provides image signals based on the image viewed by a digital camera, which are processed by a face-unwarping and image stitching module 207. Images or “first images” sensed by face capture system 203 are morphed for viewing by users at remote sites via a networked computer. The images for viewing are 3D and stereoscopic such that each user experiences a perspectively correct viewpoint on an augmented reality scene. The images of participants can be located anywhere in space around the user.

Morphing distorts the stereo images to produce a viewpoint of preferably a user's moving face that appears different from the viewpoint originally obtained by facial capture system 203. The distorted viewpoint is accomplished via image morphing to approximate a direct face-to-face view of the remote face. Face-warping and image-stitching module 207 morphs images to the user's viewpoint. The pixel correspondence algorithm or face warping and image stitching module 207 calculates the corresponding points between the first images to create second images for remote users. Image data retrieved from the first images allows for a calculation of a 3D structure of the head of the user. The 3D image is preferably a stereoscopic video image or a video texture mapping to a 3D virtual mesh. The 3D model can display the 3D structure or second images to the users in the remote location. Each user in the local and remote sites has a personal and correct perspective viewpoint on the augmented reality scene. Optical tracking system 106 and microphone 204 provide signals to networked computer 107 that are processed by a virtual environment module 208.

A display array 222 is provided to allow the user to experience the 3D virtual environment, for example via a projection augmented-reality display 401 and stereo audio earphones 205 which are connected to user 110. Display array 222 is connected to a networked computer. In the preferred embodiment, a modem 209 connects a networked computer to network 99.

FIGS. 3 through 5 illustrate a projection augmented-reality display 401 which can be used in a wide variety of lighting conditions, including indoor and outdoor environments. With specific reference to FIG. 3, a projection lens 502 is positioned to receive a beam from a beamsplitter 503. A source display 501, which is a reflective LCD panel, is positioned opposite of projection lens 502 from beamsplitter 503. Alternatively, source display 501 may be a DLP flipping mirror manufactured by Texas Instruments®. Beamsplitter 503 is angled at a position less than ninety degrees from the plane in which projection lens 502 is positioned. A collimating lens 302 is positioned to provide a collimating lens beam to beamsplitter 503. A mirror 304 is placed between collimating lens 302 and a surface mounted LCD 306. Surface mounted LCD 306 provides light to mirror 304 which passes through collimating lens 302 and beamsplitter 503.

Source display 501 transmits light to beamsplitter 503. It should be appreciated that FIG. 4 depicts a pair of the projection augmented-reality displays shown in FIG. 3; however, each of projection augmented-reality displays 530 and 532 is mounted in a vertical orientation relative to the head of the user. Furthermore, FIG. 5 depicts a pair of projection augmented-reality displays of the type shown in FIG. 3; however, each of projection augmented-reality displays 534 and 536 is mounted in a horizontal orientation relative to the head of the user.

FIG. 6 illustrates the optics of projection augmented-reality display 500 relative to a user's eye 508. A projection lens 502 receives an image from a source display 501 located beyond the focal plane of projection lens 502. Source display 501 may be a reflective LCD panel. However, it should be appreciated that any miniature display including, but not limited to, miniature CRT displays, DLP flipping mirror systems and backlighting transmissive LCDs may be alternatively utilized. Source display 501 preferably provides an image that is further transmitted through projection lens 502. The image is preferably computer-generated. A translucent mirror or light beamsplitter 503 is placed after projection lens 502 at preferably 45 degrees with respect to the optical axis of projection lens 502; therefore, the light refracted by projection lens 502 produces an intermediary image 505 at its optical conjugate and the reflected light of the beam-splitter produces a projected image 506, symmetrical to intermediary image 505 about the plane in which light beamsplitter 503 is positioned. A retro-reflective screen 504 is placed in a position onto which projected image 506 is directed. Retro-reflective screen 504 may be located in front of or behind projected image 506 so that rays hitting the surface are reflected back in the opposite direction and travel through beamsplitter 503 to user's eye 508. The reflected image is of a sufficient brightness which permits improved resolution. User's eye 508 will perceive projected image 506 from an exit pupil 507 of the optical system.
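The geometry described for FIG. 6 can be illustrated numerically. The following Python sketch is offered for illustration only and is not taken from the disclosure. It assumes an ideal thin lens in place of projection lens 502, a coordinate frame with the lens at the origin and its optical axis along +z, and hypothetical distances in millimeters. It computes where intermediary image 505 would form and then mirrors that point about a plane inclined at 45 degrees to the optical axis to locate projected image 506. The function names are illustrative.

```python
import math


def conjugate_distance(focal_length, object_distance):
    """Thin-lens image distance s' from 1/f = 1/s + 1/s' (an idealization)."""
    if object_distance <= focal_length:
        raise ValueError("object must lie beyond the focal plane for a real image")
    return 1.0 / (1.0 / focal_length - 1.0 / object_distance)


def reflect_about_plane(point, plane_point, normal):
    """Mirror a 3D point about the plane through plane_point with the given normal."""
    length = math.sqrt(sum(c * c for c in normal))
    unit = [c / length for c in normal]
    # Signed distance from the point to the plane along the unit normal.
    dist = sum((p - q) * c for p, q, c in zip(point, plane_point, unit))
    return tuple(p - 2.0 * dist * c for p, c in zip(point, unit))


# Hypothetical values: 20 mm focal length, display 20.5 mm from the lens,
# beamsplitter crossing the optical axis 30 mm after the lens.
image_z = conjugate_distance(20.0, 20.5)      # distance of the intermediary image
intermediary = (0.0, 0.0, image_z)
splitter_point = (0.0, 0.0, 30.0)
splitter_normal = (0.0, 1.0, 1.0)             # plane at 45 degrees to the z axis
projected = reflect_about_plane(intermediary, splitter_point, splitter_normal)
print(image_z, projected)
```

Under these assumptions the projected image lies on the folded axis at the same distance from the beamsplitter as the intermediary image, which is the symmetry the paragraph above describes.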

FIG. 7 depicts a preferred optical form for projection lens 502. Projection lens 502 includes a variety of elements and can be accomplished with glass optics, plastic optics, or diffractive optics. A non-limiting example of projection lens 502 is a double Gauss lens form formed by a first singlet lens 609, a second singlet lens 613, a first doublet lens 610, a second doublet lens 612, and a stop surface 611, which are arranged in series. Projection lens 502 is made of a material which is transparent to visible light. The lens material may include glass and plastic materials.

Additionally, the projection augmented-reality display can be mounted on the head. More specifically, FIG. 8 shows projection augmented-reality display 800 mounted to headwear or helmet 810. Projection augmented-reality display 800 is mounted in a vertical direction. Projection augmented-reality display 800 can be used in various ambient light conditions, including, but not limited to, artificial light and natural sunlight. In the preferred embodiment, light source 812 transmits light to source display 814. Projection augmented-reality display 800 provides optics to produce an image to the user.

FIGS. 9, 10 and 11 illustrate teleportal headset 105 of the present invention. Teleportal headset 105 preferably includes a facial expression capture system 402, ear phones 404, and a microphone 403. Facial expression capture system 402 preferably includes digital video cameras 601a and 601b. In the preferred embodiment, digital video cameras 601a and 601b are disposed on either side of the user's face 606, such that images covering the entire face are captured, which are then used to create one image of the complete face, or a 3D model of the complete face that can then be used to generate single images or stereo images for general viewpoints of the face 606.

Each video camera 601a and 601b is mounted to a housing 406. Housing 406 is formed as a temple section of the headset 105. In the preferred embodiment, each digital video camera 601a and 601b is pointed at a respective convex mirror 602a and 602b. Each convex mirror 602a and 602b is connected to housing 406 and is angled to reflect an image of the adjacent side of the face. Digital cameras 601a and 601b located on each side of the user's face 410 capture a first image or particular image of the face from each convex mirror 602a and 602b associated with the individual digital cameras 601a and 601b, respectively, such that a stereo image of the face is captured. A lens 408 is located at each eye of user face 606. Lens 408 allows images to be displayed to the user as the lens 408 is positioned at 45 degrees relative to the axis in which a light beam is transmitted from a projector. Lens 408 is made of a material that reflects and transmits light. One preferred material is “half silvered mirror.”

FIGS. 12a through 12d show alternate configurations of a teleportal site of the present invention with various shaped screens. FIG. 12a illustrates an alternate embodiment of the teleportal system 702 in which retro-reflective fabric screen 103 is used on a room's wall so that a more traditional teleconferencing system can be provided. FIG. 12b illustrates another alternate embodiment of a teleportal site 704 in which a desktop system 702 is provided. In desktop system 702, two users 110 observe a 3D object on a table top screen 708. In the preferred embodiment, screen 708 is spherically shaped. All users in sight of the screen 708 can view the perspective projections at the same time from their particular positions.

FIG. 12c shows yet another alternate embodiment of teleportal site 704. User 110 has a wearable computer forming a “magic mirror” configuration of teleportal site 704. Teleportal headset 105 is connected to a wearable computer 712. The wearable computer 712 is linked to the remote user (not shown) preferably via a wireless network connection. A wearable screen includes a hand-held surface 714 covered with a retro-reflective fabric for the display of the remote user. A “magic mirror” configuration of teleportal site 704 is preferred in the outdoor setting because it is mobile and easy to transport. In the “magic mirror” configuration, the user holds the surface 714, preferably via a handle, and positions the surface 714 over a space to view the virtual environment projected by the projection display of the teleportal headset 105.

FIG. 12d shows yet another alternate embodiment of the teleportal site 810. A body shaped screen 812 is disposed on a person's body 814. Body shaped screen 812 can be continuous or substantially discontinuous depending upon the desire to cover certain body parts. For example, a body shaped screen 812 can be shaped for a patient's head, upper body, and lower body. A body shaped screen 812 is beneficial for projecting images, such as that produced by MRI (or other digital images), onto the patient's body during surgery. This projecting permits a surgeon or user 816 to better approximate the location of internal organs prior to invasive treatment. Body shaped screen 812 can further be formed as gloves 816, thereby allowing the surgeon to place his hands (and arms) over the body of the patient yet continue to view the internal image in a virtual view without interference of his hands.

FIGS. 13 and 14 show a first preferred embodiment of a projection augmented-reality display 900 which includes a pair of LCD displays 902 coupled to headwear 905. In the preferred embodiment, a pair of LCD displays 902 project images to the eyes of the user. A microphone 910 is also coupled to headwear 905 to sense the user's voice. Furthermore, an earphone 912 is coupled to headwear 905. A lens 906 covers the eyes of the user 914 but still permits the user to view her surroundings. The glass lens 906 transmits and reflects light. In this manner, the user's eyes are not occluded by the lens. One preferred material for the transparent glass lens 906 is a “half silvered mirror.”

Communication of the expressive human face is important to tele-communication and distributed collaborative work. In addition to sophisticated collaborative work environments, there is a strong popular trend for the merger of cell phone and video functionality at consumer prices. At both ends of the technology spectrum, there is a problem producing quality video of a person's face without interfering with that person's ability to perform some task requiring both visual and motor attention. When the person is mobile, the technology of most collaborative environments is unusable. Referring now to FIG. 15, the solution proposed here is to modify a helmet mounted display (HMD) for minimally intrusive face capture. The prototype HMD has small mirrors held above the temples and viewed by small video cameras above the ears, creating a helmet that is balanced and light and with minimal occlusion of the wearer's field of view. The complete HMD design includes display components that display remote faces and scenes to the wearer as well as reality augmentation for the wearer's environment. The system and method of the present invention provide a virtual frontal video of the HMD wearer. This virtual video (VV) is synthesized by warping and blending the two real side view videos.
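One way to picture the warping and blending of two side views is the following Python sketch. It is a minimal illustration under stated assumptions, not the implementation of the present invention: it assumes grayscale images stored as nested lists, a precomputed lookup table per camera that names the source pixel for each virtual-view pixel (in the spirit of the transformation tables mentioned for FIG. 22), and a simple linear ramp around the vertical midline for blending. The function names, the table layout, and the ramp are all hypothetical.

```python
def warp_with_table(side_image, table):
    """Warp a side image into the virtual view.

    table[r][c] is the (row, col) of the side-image pixel that supplies
    virtual pixel (r, c), or None where this camera sees nothing.
    """
    return [
        [None if src is None else side_image[src[0]][src[1]] for src in table_row]
        for table_row in table
    ]


def left_weight(col, width, ramp_half_width):
    """Weight of the left view: 1 on the left, 0 on the right, linear near the midline."""
    mid = (width - 1) / 2.0
    if ramp_half_width <= 0:
        return 1.0 if col <= mid else 0.0
    t = (mid + ramp_half_width - col) / (2.0 * ramp_half_width)
    return min(1.0, max(0.0, t))


def blend_views(left_warped, right_warped, ramp_half_width=1.0):
    """Blend two warped views of equal size into one virtual frontal image."""
    width = len(left_warped[0])
    out = []
    for left_row, right_row in zip(left_warped, right_warped):
        row = []
        for c, (a, b) in enumerate(zip(left_row, right_row)):
            if a is None and b is None:
                row.append(0.0)          # neither camera covers this pixel
            elif b is None:
                row.append(a)            # only the left view has data
            elif a is None:
                row.append(b)            # only the right view has data
            else:
                w = left_weight(c, width, ramp_half_width)
                row.append(w * a + (1.0 - w) * b)
        out.append(row)
    return out


# Tiny hypothetical example: one row, four virtual pixels.
left = warp_with_table([[10, 20, 30]], [[(0, 0), (0, 1), (0, 2), None]])
right = warp_with_table([[40, 50, 60]], [[None, (0, 0), (0, 1), (0, 2)]])
virtual = blend_views(left, right)
print(virtual)
```

Because the tables are fixed once computed, the per-frame work in such a scheme reduces to lookups and a weighted sum, which is one reason a table-driven warp is attractive for real-time video.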

Side view as used herein should be interpreted as any offset view. Thus, the angle with respect to the face does not have to be directly from the side. Also, the side view can be from an angle beneath or above the face. Further, while side views of faces of users are typically captured and used from/in a virtual view, it should be readily understood that other parts of a user may also be captured, such as a user's hand.

A prototype HMD facial capture system has been developed. The development of the video processing reported here was isolated from the HMD device and performed using a fixed lab bench and conventional computer. Porting and integration of the video processing with the mobile HMD hardware can be accomplished in a variety of ways as further described below.

The prototype system was configured with off-the-shelf hardware and software components. FIG. 16 illustrates a lab bench used to develop the mobile face capture and image processing system and method. The bench was built to accommodate human subjects so they could keep their heads fixed relative to two cameras 1000A and 1000B and a structured light projector 1002. The two cameras 1000A and 1000B are placed so that their images are similar to those that can be obtained from the HMD optics. The light projector 1002 is used to orient the head precisely and to obtain calibration data used in image warping. In addition to the equipment shown in FIG. 16, a video camera (not shown) placed on top of the projector records the subject's face during each experiment for comparison purposes. The prototype uses an Intel Pentium III processor running at 746 MHz with 384 MB RAM having two Matrox Meteor II standard cards.

In the experiment demonstrating feasibility of some embodiments of the present invention, several videos were taken for several volunteers so that the synthetic video could be compared to real video. One question posed was whether the synthetic frontal video would be of sufficient quality to support the applications intended for the HMD. The bench was set up for a general user and adjustments were made for individuals only when needed. Video and audio were recorded for each subject for offline processing.

The problem is to generate a virtual frontal view from two side views. The projected light grid provides a basis for mapping pixels from the side images into a virtual image with the projector's viewpoint. The grid is projected onto the face for only a few frames so that mapping tables can be built, and then is switched off for regular operation.

There are three 2D coordinate systems involved in creation of the virtual video. A global 3D coordinate system is denoted; however, it must be emphasized that 3D coordinates are not required for the task according to some embodiments of the present invention.

- 1) World Coordinate System (WCS): for discussion only in some embodiments
- 2) Left Camera Coordinate System (LCS): I_L[s, t] is the left image with s, t coordinates.
- 3) Right Camera Coordinate System (RCS): I_R[u, v] is the right image with u, v coordinates.
- 4) Projector Coordinate System (PCS): V[x, y] is the output virtual video image with coordinates defined by the projected grid.

During the calibration phase, the transformation tables are generated using the grid pattern coordinates. A rectangular grid is projected onto the face and the two side views are captured as shown in FIGS. 16 and 18. The locations of the grid regions in the side images define where real pixel data is to be accessed for placement in the virtual video. Coordinate transformation is done between PCS and LCS and between PCS and RCS. Using transformation tables that store the locations of grid points, an algorithm can map every pixel in the front view to the appropriate side view. By centering the grid on the face, the grid also supports the correspondence between LCS and RCS and the blending of their pixels.

The behavior of a single gridded cell in the original side view and the virtual frontal view is demonstrated in FIG. 17. A grid cell in the frontal image maps to a quadrilateral with curved edges in the side image. Bilinear interpolation is used to reconstruct the original frontal grid pattern by warping a quadrilateral into a square or a rectangle.
s = f_l(x, y) and t = g_l(x, y) (1)
u = f_r(x, y) and v = g_r(x, y) (2)

Equations 1 and 2 define four functions that are determined during the calibration stage and implemented via the transformation tables. These transformation tables are then used in the operational stage immediately after the grid is switched off. During operation, it is known for each pixel V[x, y] in which grid cell of LCS or RCS it lies. Bilinear interpolation is then used on the grid cell corners to access an actual pixel value to be output to the VV.
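As an illustration of the operational lookup described above, the following Python sketch maps one output pixel V[x, y] to side-image coordinates by bilinear interpolation of the four surrounding grid nodes, and then samples the side image bilinearly. The table layout (`table[j][i]` holding the observed (s, t) of grid node (i, j)), the function names, and the convention that s indexes columns and t indexes rows are assumptions made for this sketch and are not taken from the prototype.

```python
def bilerp(c00, c10, c01, c11, a, b):
    """Bilinear blend of four corner values; a and b in [0, 1] are the x and y fractions."""
    top = c00 * (1 - a) + c10 * a
    bottom = c01 * (1 - a) + c11 * a
    return top * (1 - b) + bottom * b


def source_coords(table, x, y, cell_w, cell_h):
    """Map an output pixel (x, y) to side-image coordinates (s, t).

    table[j][i] is the (s, t) position at which grid node (i, j) was seen
    in the side image during calibration (hypothetical layout).
    """
    # Locate the grid cell containing (x, y), clamped to the last full cell.
    i = min(int(x // cell_w), len(table[0]) - 2)
    j = min(int(y // cell_h), len(table) - 2)
    # Fractional position of the pixel inside that cell.
    a = (x - i * cell_w) / cell_w
    b = (y - j * cell_h) / cell_h
    corners = (table[j][i], table[j][i + 1], table[j + 1][i], table[j + 1][i + 1])
    s = bilerp(*(c[0] for c in corners), a, b)
    t = bilerp(*(c[1] for c in corners), a, b)
    return s, t


def sample(image, s, t):
    """Bilinearly sample a grayscale image (list of rows, at least 2x2) at real-valued (s, t)."""
    h, w = len(image), len(image[0])
    s = min(max(s, 0.0), w - 1.0)
    t = min(max(t, 0.0), h - 1.0)
    s0, t0 = min(int(s), w - 2), min(int(t), h - 2)
    return bilerp(image[t0][s0], image[t0][s0 + 1],
                  image[t0 + 1][s0], image[t0 + 1][s0 + 1],
                  s - s0, t - t0)
```

A full frame would be produced by calling `source_coords` and `sample` for every output pixel that lies inside the grid, once against the left table and image and once against the right.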

In the case where convex mirrors, wide angle lenses, or equivalent sensors are employed to capture offset views of user faces, warping can still be accomplished in one step by making the point correspondences. However, in the case of strong nonlinear distortion, it is envisioned that bicubic interpolation may be employed instead of bilinear interpolation. It is also envisioned that subpixel coordinates and multiple pixel sampling can be used in cases where the face texture changes fast or where the face normal is away from the sensor direction.

Some implementation details are as follows. A rectangular grid of dimension 400×400 is projected onto the face. The grid is made by repeating three colored lines. White, green and cyan colors proved useful because of their bright appearance over the skin color. This combination of hues demonstrated good performance over a wide variety of skin pigmentations. However, it is envisioned that other hues may be employed. The first few frames have the grid projected onto the face before the grid is turned off. One of the frames with the grid is taken and the transformation tables are generated. The size of the grid pattern that is projected in the calibration stage plays a significant role in the quality of the video. This size was decided based on the trade-off between the quality of the video and execution time. An appropriate grid size was chosen based on trial and error. The trial and error process started by projecting a sparse grid pattern onto the face and then increasing the density of the grid pattern. At one point, the increase in the density did not significantly improve the quality of the face image but consumed too much time. At that point, the grid was finalized with a grid cell size of row-width 24 pixels and column-width 18 pixels. FIG. 18 shows the frames that are captured during the calibration stage of the experiment. This calibration step is feasible for use in collaborative rooms; however, it is envisioned that the calibration is applicable to mobile users as well.

FIG. 19 shows the off-line calibration stage during the synthesis of the virtual frontal view. Projector 1002 projects grid pattern 1004 onto human face 1006. Grid lines reflect off of human face 1006 to left and right mirrors 1008A and 1008B, and from the mirrors to respective left and right cameras 1000A and 1000B. Quadrilaterals of left and right calibration face images 1010A and 1010B are mapped to corresponding squares or rectangles of grid pattern 1004 to form left and right transformation tables 1012. It is envisioned that more than two side views can be used, and that other polygonal shapes besides quadrilaterals may be employed. Thus, a grid pattern of predetermined polygonal shapes is projected onto the face from a virtual point of view, side view images of the face are captured, and pixels enclosed by the polygons of captured side view images are mapped back to corresponding predetermined polygonal shapes of the grid pattern to form the transformation tables. It is envisioned that the side view imaging arrays may be integrated into a projection screen of the HMD, thus eliminating the mirrors while retaining fixed positions respective of and orientations toward sides of the user's face.

Using the transformation tables 1012 generated in the calibration phase, each virtual frontal frame is generated. The algorithm reconstructs each (x, y) coordinate in the virtual view by accessing the corresponding location in the transformation table and retrieving the pixel in I_L (or I_R) using interpolation. Then a 1D linear smoothing filter is used to smooth the intensity across the vertical midline of the face. Without this smoothing, a human viewer usually perceives a slight intensity edge at the midline of the face.

FIG. 20 shows the complete block diagram of the operational phase. Transformation tables 1012 are used to warp left and right face images 1010A and 1010B into left warped face image 1014A and right warped face image 1014B. These portions of the virtual output image 1016 are then blended by mosaicking the face image. Post processing to linearly smooth the image is performed to result in a final virtual face image 1018. Since the transformation is based on the bilinear interpolation technique, each pixel can be generated only when it is inside four grid coordinate points. Because the grid is not defined well at the periphery of the face, the algorithm is unable to generate the ears and hair portion of the face. The results of the warping during the calibration and the operation stages are shown in FIGS. 21 through 23.
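The blending and smoothing steps can be sketched as follows. This is a minimal Python illustration and not the prototype code. It assumes that the left warped image supplies the left half of the output and the right warped image the right half, and that the "1D linear smoothing filter" is a horizontal moving average applied only to a narrow band of columns around the midline. The `band` and `radius` parameters are hypothetical.

```python
def mosaic(left_warped, right_warped):
    """Join the left half of one warped image with the right half of the other."""
    mid = len(left_warped[0]) // 2
    return [l_row[:mid] + r_row[mid:]
            for l_row, r_row in zip(left_warped, right_warped)]


def smooth_midline(image, band=2, radius=1):
    """Horizontal moving average over columns within `band` of the vertical midline."""
    w = len(image[0])
    mid = w // 2
    out = [row[:] for row in image]  # copy so the filter reads unsmoothed input
    for y, row in enumerate(image):
        for x in range(max(mid - band, 0), min(mid + band, w)):
            lo, hi = max(x - radius, 0), min(x + radius, w - 1)
            window = row[lo:hi + 1]
            out[y][x] = sum(window) / len(window)
    return out
```

With a uniform left half of intensity 100 and a right half of 130, the step at the midline becomes a short ramp (100, 110, 120, 130), which is the kind of edge softening the text describes.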

Some other post-processing can be included. For example, frames with a gridded pattern can be deleted from the final output: these can be identified by a large shift in intensity when the projected grid is switched off. Also, a microphone recording of the voice of the user, stored in a separate .wav file, can be appended to the video file to obtain a final output.
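One way to find the gridded frames is sketched below in Python. It assumes grayscale frames stored as lists of rows and treats the first jump in mean intensity larger than a threshold as the moment the projector grid is switched off. The threshold value and function names are hypothetical, not values from the experiment.

```python
def mean_intensity(frame):
    """Average pixel value of a grayscale frame given as a list of rows."""
    total = sum(sum(row) for row in frame)
    return total / (len(frame) * len(frame[0]))


def first_grid_free_frame(frames, threshold=20.0):
    """Index of the first frame after a large mean-intensity shift, or None if no shift is found."""
    means = [mean_intensity(f) for f in frames]
    for k in range(1, len(means)):
        if abs(means[k] - means[k - 1]) > threshold:
            return k
    return None
```

Frames before the returned index would be dropped from the final output, for example with `frames[k:]`.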

Color balancing of the cameras can also be performed. Even though software based approaches for color balancing can be taken, the color balancing in the present work is done at the hardware level. Before the cameras are used for calibration, they are balanced using the white balancing technique. A single white paper is shown to both cameras and cameras are white balanced instantly.

The virtual video of the face can be adequate to support the communication of identity, mental state, gesture, and gaze direction. Some objective comparisons between the synthesized and real videos are reported below, plus a qualitative assessment.

The real video frames from the camcorder and the virtual video frames were normalized to the same size of 200×200 pixels and compared using cross correlation and interpoint distances between salient face features. Five images that were considered for evaluation are shown in FIG. 23. Important items considered were the smoothness and accuracy of lips and eyes and their movements, the quality of the intensities, and the synchronization of the audio and video. In particular, the flaws looked for were breaks at the centerline of the face due to blending and other distortions that may have been caused by the sensing and warping process.

1) Normalized cross-correlation: The cross correlation between regions of the virtual image and real image was computed for rectangular regions containing the eyes and mouth (FIG. 24). As Table 1 shows, there was high correlation between the real and the virtual images taken at the same instant of time. Frames 2 and 3 shown in FIG. 23 contain facial expressions (eye and lip movements) that were quite different from the expression used during the calibration stage and the generated view gave a slightly lower correlation value when compared with the other frames. Also, the facial expressions in the first and fourth frames were similar to that of the expression in the calibration frame. Hence, these frames have a higher correlation value compared to the rest. The eye and lip regions were considered for evaluating the system because during any facial movement, these regions change significantly and are more important in communication.

TABLE 1: Results of Normalized Cross-Correlation Between the Real and the Virtual Frontal Views Applied in Regions Around the Eyes and Mouth

| Video | Left eye | Right eye | Mouth | Eyes + mouth | Complete |
|---|---|---|---|---|---|
| Frame 1 | 0.988 | 0.987 | 0.993 | 0.989 | 0.989 |
| Frame 2 | 0.969 | 0.972 | 0.985 | 0.978 | 0.985 |
| Frame 3 | 0.969 | 0.967 | 0.992 | 0.978 | 0.986 |
| Frame 4 | 0.991 | 0.989 | 0.993 | 0.990 | 0.990 |
| Frame 5 | 0.985 | 0.986 | 0.992 | 0.988 | 0.989 |
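A correlation score of the kind reported in Table 1 can be computed as sketched below. The text does not state which variant of normalized cross-correlation was used. This Python sketch assumes the zero-mean form (the Pearson correlation of the two pixel sets), applied to two equally sized rectangular regions given as lists of rows.

```python
from math import sqrt


def ncc(region_a, region_b):
    """Zero-mean normalized cross-correlation of two equally sized grayscale regions."""
    a = [p for row in region_a for p in row]
    b = [p for row in region_b for p in row]
    if len(a) != len(b) or not a:
        raise ValueError("regions must be non-empty and the same size")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = sqrt(sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b))
    # A constant region has zero variance; report no correlation in that case.
    return num / den if den else 0.0
```

The score is 1 for regions that differ only by brightness gain and offset, and falls toward 0 as the eye or mouth region of the virtual frame departs from the real one.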

2) Euclidean distance measure: The differences in the normalized Euclidean distances between some of the most prominent feature points were computed. The feature points are chosen in such a way that one of them is relatively static with respect to the other. Among prominent feature points such as the corners of the eyes, the nose tip, and the corners of the mouth, the corners of the eyes are relatively static when compared with the corners of the mouth. FIG. 25 shows the most prominent facial feature points and the distances between those points. Let R_ij represent the Euclidean distance between two feature points i and j in the real frontal image and V_ij represent the Euclidean distance between the same two feature points in the virtual frontal image. The difference in the Euclidean distance is D_ij = |R_ij − V_ij|. The average error ε for comparing the face images is defined by $\varepsilon = \frac{1}{6} [D_{af} + D_{bf} + D_{cf} + D_{cg} + D_{dg} + D_{eg}]$.

TABLE 2: Euclidean Distance Measurement of the Prominent Facial Distances in the Real Image and Virtual Image and the Defined Average Error. All Dimensions are in Pixels.

| Frames | D_af | D_bf | D_cf | D_cg | D_dg | D_eg | Error (ε) |
|---|---|---|---|---|---|---|---|
| Frame 1 | 2.00 | 0.80 | 4.15 | 3.49 | 2.95 | 3.46 | 2.80 |
| Frame 2 | 0.59 | 3.00 | 0.79 | 4.91 | 0.63 | 0.80 | 1.79 |
| Frame 3 | 1.88 | 3.84 | 4.29 | 4.34 | 2.68 | 1.83 | 3.14 |
| Frame 4 | 1.09 | 2.97 | 2.10 | 6.33 | 3.01 | 4.08 | 3.36 |
| Frame 5 | 1.62 | 2.21 | 5.57 | 4.99 | 1.24 | 1.90 | 2.92 |

The results in Table 2 indicate small errors in the Euclidean distance measurements, on the order of 3 pixels in an image of size 200×200. The facial feature points in the five frames were selected manually, and hence some of the error may be due to the instability of manual selection. One can note that the error values of D_cf and D_cg are larger than the others. This is probably because the nose tip is not located as robustly as the eye corners.
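The average error defined above can be computed as in the following Python sketch. The six point pairs follow the formula for ε. The dictionary layout keyed by the point labels a through g and the function name are assumptions for illustration.

```python
from math import dist  # available in Python 3.8 and later

# Point pairs used in the definition of the average error.
PAIRS = [("a", "f"), ("b", "f"), ("c", "f"), ("c", "g"), ("d", "g"), ("e", "g")]


def average_error(real_pts, virtual_pts):
    """Mean of D_ij = |R_ij - V_ij| over the six feature-point pairs.

    real_pts and virtual_pts map labels 'a'..'g' to (x, y) pixel positions.
    """
    diffs = [abs(dist(real_pts[i], real_pts[j]) - dist(virtual_pts[i], virtual_pts[j]))
             for i, j in PAIRS]
    return sum(diffs) / len(diffs)


# Check against the Frame 5 row of Table 2: the six D values average to about 2.92.
frame5 = [1.62, 2.21, 5.57, 4.99, 1.24, 1.90]
frame5_error = sum(frame5) / len(frame5)
```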

A preliminary subjective study was also performed. In general, the quality of the videos was assessed as adequate to support the variety of intended applications. The two halves of all the videos are well synchronized and color balanced. The quality of the audio is good and it has been synchronized well with the lip movements. Some observed problems were distortion in the eyes and teeth and in some cases a cross-eyed appearance. The face appears slightly bulged compared with the real videos, which is probably due to the combined radial distortions of the camera and projector lenses.

Synchronization of the two videos is preferred in this application of the invention. Since two views of a face with lip movements are merged together, any small change in the synchronization has a large impact on the alignment of the lips. This synchronization was evaluated based on sensitive movements such as eyeball movements and blinking eyelids. Similarly, mouth movements were examined in the virtual videos. FIGS. 26 to 27 show some of these effects.

Analysis indicates that a real-time mobile system is feasible. The total computation time consists of (1) transferring the images into buffers, (2) warping by interpolating each of the grid blocks, and (3) linearly smoothing each output image. The average time is about 60 ms per frame using a 746 MHz computer. Less than 30 ms would be considered to be real-time: this can be achieved with a current computer with clock rate of 2.6 GHz. Some implementations can require more power to mosaic training data into the video to account for features occluded from the cameras.
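As a rough check of the real-time estimate, assume that the per-frame time scales inversely with clock rate. This is an idealization that ignores memory bandwidth, image transfer overhead, and architectural differences between processors:

```latex
t_{2.6\,\mathrm{GHz}} \approx 60\ \mathrm{ms} \times \frac{746\ \mathrm{MHz}}{2600\ \mathrm{MHz}} \approx 17.2\ \mathrm{ms} < 30\ \mathrm{ms}
```

Under that assumption the 30 ms target is met with some margin left for the additional mosaicking of training data.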

It can be concluded that the algorithm being used can be made to work in real-time. The working prototype has been tested on a diverse set of seven individuals. From comparisons of the virtual videos with real videos, it is expected that important facial expressions will be represented adequately and not distorted by more than 2%. Thus, the HMD system implementing the image processing software of the present invention can support the intended telecommunication applications.

It is envisioned that calibration using a projected grid can be used with the algorithms described above. 3D texture-mapped face models can also be created by calibrating the cameras and projector in the WCS. 3D models present the opportunity for greater compression of the signal and for arbitrary frontal viewpoints, which are desired for virtual face-to-face collaboration. Although technically feasible, structured light projection is an obtrusive step in the process and may be cumbersome in the field. Thus, a generic mesh model of the face can also be employed.

There is a problem due to occlusion in the blending of the two side images. Some facial surface points that should be displayed in the frontal image are not visible in the side images. For example, the two cameras cannot see the back of the mouth. It is envisioned that training data may be taken from a user and patched into the synthetic video, either for that user or for another, similar user. During training, the user can generate a basis for all possible future output material. The system can contain methods to index to the right material and blend it with the regular warped output. A related problem is that facial deformations that make significant alterations to the face surface may not be rendered well by the static warp. Examples are tongue thrusts and severe facial distortions. The static warp algorithm achieves good results for moderate facial distortion: It does not crash when severe cases are encountered, but the virtual video can show a discontinuity in important facial features.

Other embodiments of the present invention employ a 3D model as described below. The 3D modeling embodiments include one or more of the following: (a) a calibration method that does not depend upon structured light, (b) an output format that is a dynamic 3D model rather than just a 2D video, and (c) a real-time tracking method that identifies salient face points in the two side videos and updates both the 3D structure and the texture of the 3D model accordingly.

The 3D face model can be represented by a closed mesh of n points (x_i, y_i, z_i), i = 1, . . . , n, and a texture map. This model can be rendered rapidly by standard graphics software and displayed by standard graphics cards. The mesh point 3D coordinates are available for a generic face. Scaling and deformation transformations can be used to instantiate this model for an individual wearing the Face Capture Head Mounted Display Units (FCHMDs). The model can be viewed/rendered from a general viewpoint within the coverage of the cameras and not just from the central point in front of the face. Triangles of the mesh can be texture-mapped to the sensed images and to other stored face data that may be needed to fill in for unimaged patches.

The 3D face model can be instantiated to fit a specific individual by one or more of the following: (1) choosing special points by hand on a digital frontal and profile photo; (2) choosing special points from the two side video frames of a neutral expression taken from the FCHMD, and enabling the wearer to make adjustments while viewing the resulting rendered 3D model.

In some embodiments, standard rendering of the face model requires one or more of the following: (1) the set of triangles modeling the 3D geometry; (2) the two side images from the FCHMD; (3) a mapping of all vertices of each 3D triangle into the 2D coordinate space of one of the side images; (4) a viewpoint from which to view the 3D model; and (5) a lighting model that determines how the 3D model is illuminated.

FIG. 28 illustrates the identification of some feature points in a side image and a set of triangles formed using the feature points as vertices. These triangles serve as bounding polygons for regions to be texture mapped to corresponding polygonally bounded regions of a generic mesh model. On a frame by frame basis, the generic mesh model used is selected to maximize similarity between the feature points automatically recognized in the side view image and feature points of the mesh model as if viewed from the side. In some embodiments, scaling and deformation transformations already obtained for causing the generic mesh model to fit a particular user are next used to modify texture mapping of the generic mesh model to the side view images. Then, the resulting 3D model of the user's face can be rendered from a selected virtual point of view to result in an output image. Accordingly, input video streams of side view images can be used in realtime to produce a video stream of output images from a virtual point of view.
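The texture lookup for one mesh triangle can be sketched in Python as follows. The sketch assumes that the three vertices have already been projected into the output image for the chosen viewpoint and that the vertex-to-side-image mapping is known. It uses plain affine (barycentric) interpolation, not perspective-correct interpolation, and all names are hypothetical.

```python
def barycentric(p, a, b, c):
    """Barycentric weights (wa, wb, wc) of 2D point p with respect to triangle abc."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def texture_coords(p, screen_tri, image_tri):
    """Side-image coordinates for output pixel p inside a projected mesh triangle.

    screen_tri: the triangle's three vertices in the output image.
    image_tri:  the same three vertices in the side image's 2D coordinate space.
    """
    wa, wb, wc = barycentric(p, *screen_tri)
    # zip(*image_tri) yields the three s values, then the three t values.
    return tuple(wa * ia + wb * ib + wc * ic for ia, ib, ic in zip(*image_tri))
```

A pixel is inside the triangle when all three weights are non-negative, so the same weights can also serve as the inside test during rendering.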

It is envisioned that users communicating with one another may each wear an FCHMD, and that the FCHMD can operate in a variety of ways. For example, side views of a first user's face can be transmitted to the second user's FCHMD, where they can be warped and blended to produce the 3D model, which is then rendered from a selected perspective to produce the output image. Also, the first user's FCHMD can warp and blend the side views to produce the 3D model, and transmit the 3D model to the second user's FCHMD where it can be rendered from a selected perspective to produce the output image. Further, the first user's FCHMD can warp and blend the side views, render the resulting 3D model from a selected perspective to produce the output image, and transmit the output image to the second user's FCHMD. Yet further, an external image processing module external to the FCHMDs can perform some or all of the steps necessary to produce the output image from the side views. Further still, this external image processing module can be remotely located on a communications network, rather than physically located at a location of one or more of the users. Accordingly, an FCHMD may be adapted to transmit to a remote location and/or receive from a remote location at least one of the following: (1) side view images; (2) user-specific scaling and deformation transformations; (3) position of a user's face in a common coordinate system of a collaborative, virtual environment; (4) a 3D model of a user's face; (5) a selection of a virtual point of view from which to render a user's face; and (6) an output image. Supplemental image data obtained from a particular user or from training users can also be transmitted or received, and can even be integrated into the generic mesh models ahead of time.

It should be readily understood that the FCHMD does not have to transmit or receive one or more of each of the types of data listed above. For example, it is possible that an FCHMD may only transmit and receive output images. It is also possible that an FCHMD may transmit and receive only two data types, including output images together with position of a user's face in a common coordinate system of a collaborative, virtual environment. It is further possible that an FCHMD will transmit and receive only side view images. It is still further possible that an FCHMD will transmit and receive only two data types, including side view images, together with position of a user's face in a common coordinate system of a collaborative, virtual environment. It is yet further possible that an FCHMD will transmit and receive only 3D models of users' faces. It is still yet further possible that an FCHMD will transmit and receive only two data types, including 3D models of users' faces, together with position of a user's face in a common coordinate system of a collaborative, virtual environment. In the cases where 3D models or side view images are transmitted and received, it may be the case that user-specific scaling and deformation transformations are transmitted and received at some point, perhaps during an initialization of collaboration. It is additionally possible that one FCHMD can do most or all of the work for both FCHMDs, and receive side view images and face position data for a first user while transmitting output images or a 3D model for a second user. Accordingly, all of these embodiments and others that will be readily apparent to those skilled in the art are described above.
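The six data types and the fact that any subset may be exchanged can be modeled with optional fields, as in the hypothetical Python sketch below. The class name, field names, and field types are illustrative assumptions and do not describe a wire format from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FaceUpdate:
    """One message between FCHMDs; every field is optional (hypothetical structure)."""
    side_view_images: Optional[Sequence[bytes]] = None      # (1) encoded side views
    user_transformations: Optional[Sequence[float]] = None  # (2) scaling and deformation parameters
    face_position: Optional[Tuple[float, float, float]] = None  # (3) position in the common coordinate system
    face_model: Optional[bytes] = None                      # (4) serialized 3D model
    viewpoint: Optional[Tuple[float, float, float]] = None  # (5) selected virtual point of view
    output_image: Optional[bytes] = None                    # (6) rendered output image

    def present_fields(self) -> List[str]:
        """Names of the data types actually carried by this message, in declaration order."""
        return [name for name, value in vars(self).items() if value is not None]
```

For instance, an FCHMD that exchanges only output images and face position would send `FaceUpdate(output_image=..., face_position=...)` and leave the other fields unset.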

During operation, the FCHMD optics/electronics of some embodiments can sense in real time the real expressive face of the wearer from the two side videos, and the software can create in real time an active 3D face model to be transmitted to remote collaborators.

The morphable model is trained for dynamic use on a population of users. A diverse set of training users may wear the FCHMD and follow a script that induces a variety of facial expressions, while frontal video is also recorded. This training set can support salient point tracking and also the substitution of real data for viewpoints that cannot be observed by the side cameras (inside the mouth, for example). Moreover, the training videos can record sequences of articulator movements that can be used during online FCHMD use.

Let S be a set of shape vectors composed of the face surface points, and let T be the corresponding set of texture vectors.
S_j=(x₁, y₁, z₁, . . . , x_n, y_n, z_n) (1)
T_j=(r₁, g₁, b₁, . . . , r_n, g_n, b_n) (2)
The shape points contain, as a subset, the salient points of the shape mesh. Training the model can be accomplished by hand labeling of the mesh points for a diverse set of faces and multiframe video recording followed by principal components analysis to obtain a minimum spanning dimensionality.

Any face S_p, T_p in the population can be represented as $S_p = \sum_{j=1}^{M} a_j S_j$ and $T_p = \sum_{j=1}^{M} b_j T_j$, with $\sum_{j=1}^{M} a_j = 1$ and $\sum_{j=1}^{M} b_j = 1$. The parameters a_j, b_j represent the face p in terms of the training faces and the new illumination conditions and possibly slight variation in the camera view.
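The linear combination can be illustrated with a short Python sketch. It assumes shape and texture vectors are flat lists of equal length, with shape weights a_j and texture weights b_j each constrained to sum to one. The function name and the tolerance are hypothetical.

```python
def combine(vectors, weights, tol=1e-9):
    """Weighted sum of training vectors, with weights required to sum to 1."""
    if len(vectors) != len(weights):
        raise ValueError("need one weight per training vector")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    n = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(n)]


# Two toy training shapes with two mesh points each (x1, y1, z1, x2, y2, z2).
S1 = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0]
S2 = [0.0, 0.0, 0.0, 4.0, 2.0, 0.0]
S_p = combine([S1, S2], [0.25, 0.75])
```

A texture vector T_p would be formed the same way from the training textures and the b_j weights.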

Tracking of salient feature points can be accomplished to dynamically change the transformation tables and achieve a dynamic model. The parameters of the model a_j, b_j can be dynamically fit by optimizing the similarity between a model rendered using these parameters and the observed images.
$E_{a_j, b_j} = \sum_{x, y} I_{observed}[x, y]$ (3)
Fitting via hill-climbing is one designated optimization procedure in some embodiments so that small dynamic updates can be made to the model parameters for the next observed side video frames.
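A minimal hill-climbing fit might look like the Python sketch below. It is a coordinate-wise search that accepts any single-parameter step that lowers the error and halves the step size when no step helps. The error measure used here, a sum of squared differences between observed and rendered pixel values, is an assumption, as are the toy renderer and all names. The sketch also does not enforce the sum-to-one constraint on the weights.

```python
def ssd(observed, rendered):
    """Sum of squared differences between two equally sized pixel lists (assumed error measure)."""
    return sum((o - r) ** 2 for o, r in zip(observed, rendered))


def hill_climb(error, params, step=0.05, min_step=1e-4, max_iters=1000):
    """Greedy coordinate-wise descent on error(params); returns (params, error)."""
    params = list(params)
    best = error(params)
    iters = 0
    while step >= min_step and iters < max_iters:
        improved = False
        for k in range(len(params)):
            for delta in (step, -step):
                trial = params[:]
                trial[k] += delta
                e = error(trial)
                if e < best:
                    params, best, improved = trial, e, True
        if not improved:
            step /= 2  # refine the search once no neighbor is better
        iters += 1
    return params, best


# Toy usage: a hypothetical two-parameter "renderer" stands in for rendering the 3D model.
def toy_render(p):
    return [p[0] * 100.0, p[1] * 100.0, (p[0] + p[1]) * 50.0]


observed = toy_render([0.3, 0.7])
fitted, err = hill_climb(lambda p: ssd(observed, toy_render(p)), [0.5, 0.5])
```

Starting each frame's search from the previous frame's parameters keeps the updates small, which matches the incremental use described above.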

The FCHMD can be calibrated by finding the optimal fit between a parameterized model and the video data currently observed on the FCHMD. Once this fit is known, locations of the salient mesh points (X_k, Y_k, Z_k) are known and thus a texture map is defined between the 3D mesh and the 2D images for that instant of time (current expression). Since iterative hill-climbing is used for the fitting procedure, it is expected that either some intelligent guess or some hand selection will be needed to initialize the fitting. A fully automatic procedure can be initialized from an average wearer's face determined from the training data. The control software for the FCHMD can have a backup procedure so that the HMD wearer can initialize the fitting by manually choosing some salient face points via the wearer viewing the video images and selecting points.

The description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the invention. Such variations are not to be regarded as a departure from the spirit and scope of the invention.

Claims

1. An image processing method, comprising:

receiving at least two side view images of a face of a user;

warping and blending the side view images into an output image of the face of the user as if viewed from a virtual point of view; and

producing a virtual video in real time of output images from a video feed of side view images.

2. The method of claim 1, further comprising:

accessing a three-dimensional closed mesh model of points corresponding to salient facial feature points; and

warping and blending the side view images by texture mapping the side view images to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of side view images.

3. The method of claim 2, further comprising instantiating a mesh model for an individual user by obtaining scaling and deformation transformations.

4. The method of claim 3, further comprising obtaining scaling and deformation transformations by choosing special points by hand on a digital frontal and profile photo.

5. The method of claim 3, further comprising obtaining scaling and deformation transformations by choosing special points from the two side images of a neutral facial expression of the user captured by imaging components of a head mounted display worn by the user, and enabling the user wearing the head mounted display to make adjustments that apply various scaling and deformation transformations while viewing a resulting output image rendered by the head mounted display.

6. The method of claim 2, further comprising:

receiving a selection of a virtual point of view from which to view a three-dimensional model of the face of the user; and

rendering the three-dimensional model from the virtual point of view based on the selection.

7. The method of claim 6, further comprising applying a lighting model that determines how the three-dimensional model appears to be illuminated.

8. The method of claim 2, further comprising dynamically fitting parameters of the mesh model by optimizing similarity between a three-dimensional model rendered from a virtual point of view corresponding to an actual point of view of a side view image while using the parameters and the side view image.

9. The method of claim 8, further comprising fitting the parameters via hill-climbing so that incremental dynamic updates can be made to the model parameters for sequentially observed side video frames.

10. The method of claim 2, further comprising training a morphable model for dynamic use on a population of users, including capturing side views and a frontal view of a diverse set of training speakers.

11. The method of claim 10, further comprising:

hand labeling mesh points for a diverse set of faces and multiframe video recording; and

performing principal components analysis to obtain a minimum spanning dimensionality.

12. The method of claim 2, further comprising texture mapping triangles of the mesh model to stored face data as needed to fill in un-imaged patches.

13. The method of claim 1, further comprising:

accessing transformation tables for the side view images, wherein the transformation tables define rules for interpolating regions of the side view images into side portions of the output image;

warping the side view images based on the transformation tables, thereby producing the side portions of the output image; and

blending the side portions of the output image, thereby producing the output image.

14. The method of claim 13, further comprising creating the transformation tables by projecting a grid pattern onto a human face at least as if from the virtual point of view and mapping polygons of left and right calibration face images to corresponding polygons of the grid pattern.

15. The method of claim 13, wherein warping the side view images includes reconstructing coordinates in side portions of the output image by accessing corresponding locations in the transformation tables and retrieving pixels in the side view images using interpolation.

16. The method of claim 1, wherein receiving at least two side view images includes receiving side view images captured via at least two imaging components of a head mounted display worn by the user, said imaging components attached to said head mounted display unit and thereby obtaining fixed positions and orientations relative to the face of the user and adapted to receive at least two side views of the face of the user.

17. The method of claim 1, further comprising linearly smoothing the output image in order to smooth intensity across a vertical midline of the face.
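
As a non-limiting illustration of linear smoothing across the vertical midline, the following minimal sketch measures, for each row, the intensity jump between the two pixels adjacent to the midline and distributes a correction that falls off linearly with distance from the midline. The band half-width parameter, the even split of the correction between the two halves, and the function name are assumptions made for illustration and are not recited in the claim.

```python
def smooth_midline(image, half_width):
    """Linearly smooth intensity across the vertical midline of a grayscale image.

    image is a list of rows at least two pixels wide; half_width (>= 1) is the
    number of columns adjusted on each side of the midline (hypothetical).
    """
    height, width = len(image), len(image[0])
    mid = width // 2  # first column of the right half
    out = [list(row) for row in image]
    for r in range(height):
        # Intensity discontinuity where the left and right portions meet.
        jump = image[r][mid] - image[r][mid - 1]
        for k in range(half_width):
            weight = 1.0 - k / half_width  # 1 at the midline, falling linearly
            left_col, right_col = mid - 1 - k, mid + k
            if left_col >= 0:
                out[r][left_col] += 0.5 * jump * weight
            if right_col < width:
                out[r][right_col] -= 0.5 * jump * weight
    return out
```

For a row `[10, 10, 10, 20, 20, 20]` and a half-width of 2, this sketch yields `[10, 12.5, 15.0, 15.0, 17.5, 20]`, replacing the abrupt step at the midline with a gradual ramp.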

18. An apparatus, comprising:

a head mounted display unit worn by a first user, the display unit rendering to the first user an output image of a face of a second user virtually interacting with the first user in a collaborative, virtual environment, wherein the output image has been formed, based on offset view images of the face of the second user, such that the face of the second user appears as if viewed from a virtual point of view; and

an input port receiving at least one of the following: (a) offset view images of the face of the second user; (b) user-specific scaling and deformation transformations specific to the second user; (c) position of the face of the second user in a common coordinate system of the collaborative, virtual environment; (d) a three-dimensional model of the face of the second user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the second user; and (f) an output image of the face of the second user.

19. The apparatus of claim 18, further comprising:

an array of at least two imaging components having fixed positions and orientations relative to a face of the first user and adapted to receive at least two offset views of the face of the first user; and

an output port transmitting at least one of the following: (a) offset view images of the face of the first user; (b) user-specific scaling and deformation transformations specific to the first user; (c) position of the face of the first user in the common coordinate system of the collaborative, virtual environment within which the first user and the second user virtually interact; (d) a three-dimensional model of the face of the first user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the first user; and (f) an output image of the face of the first user.

20. The apparatus of claim 19, further comprising an image processing module accessing a three-dimensional closed mesh model of points corresponding to salient facial feature points of offset view images of the face of the first user, and combining the offset view images of the face of the first user by texture mapping the offset view images of the face of the first user to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images of the face of the first user, thereby forming the three-dimensional model of the face of the first user.

21. The apparatus of claim 20, wherein said image processing module is further adapted to select a virtual point of view from which to view the three-dimensional model of the face of the first user based on positions of faces of the first user and the second user in a common coordinate system of the collaborative environment, and to render the three-dimensional model of the face of the first user from the virtual point of view, thereby forming the output image of the face of the first user.

22. The apparatus of claim 21, wherein said image processing module is adapted to linearly smooth the output image in order to smooth intensity across a vertical midline of the face of the first user.

23. The apparatus of claim 21, wherein said image processing module is adapted to apply a lighting model that determines how the three-dimensional model of the face of the first user appears to be illuminated.

24. The apparatus of claim 18, further comprising an image processing module adapted to select a virtual point of view from which to view the three-dimensional model of the face of the second user based on positions of faces of the first user and the second user in a common coordinate system of the collaborative environment, and to render the three-dimensional model of the face of the second user from the virtual point of view, thereby forming the output image of the face of the second user.

25. The apparatus of claim 24, wherein said image processing module is further adapted to access a three-dimensional closed mesh model of points corresponding to salient facial feature points of offset view images of the face of the second user, and to combine the offset view images of the face of the second user by texture mapping the offset view images of the face of the second user to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images of the face of the second user, thereby forming the three-dimensional model of the face of the second user.

26. The apparatus of claim 24, wherein said image processing module is further adapted to apply a lighting model that determines how the three-dimensional model appears to be illuminated.

27. The apparatus of claim 18, further comprising an image processing module adapted to linearly smooth the output image in order to smooth intensity across a vertical midline of the face.

28. An apparatus, comprising:

an array of at least two imaging components having fixed positions and orientations relative to a face of a first user and adapted to receive at least two offset views of the face of the first user; and

an output port transmitting at least one of the following: (a) offset view images of the face of the first user; (b) user-specific scaling and deformation transformations specific to the first user; (c) position of the face of the first user in a common coordinate system of a collaborative, virtual environment within which the first user and a second user virtually interact; (d) a three-dimensional model of the face of the first user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the first user; and (f) an output image of the face of the first user, wherein the output image of the face of the first user has been formed by combining offset view images of the face of the first user into an output image of the face of the first user as if viewed from a virtual point of view.

29. The apparatus of claim 28, further comprising an image processing module accessing a three-dimensional closed mesh model of points corresponding to salient facial feature points of offset view images of the face of the first user, and combining the offset view images of the face of the first user by texture mapping the offset view images of the face of the first user to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images of the face of the first user, thereby forming the three-dimensional model of the face of the first user.

30. The apparatus of claim 29, wherein said image processing module is further adapted to select a virtual point of view from which to view the three-dimensional model of the face of the first user based on positions of faces of the first user and the second user in a common coordinate system of the collaborative environment, and to render the three-dimensional model of the face of the first user from the virtual point of view, thereby forming the output image of the face of the first user.

31. The apparatus of claim 30, wherein said image processing module is adapted to linearly smooth the output image in order to smooth intensity across a vertical midline of the face of the first user.

32. The apparatus of claim 29, wherein said image processing module is adapted to apply a lighting model that determines how the three-dimensional model of the face of the first user appears to be illuminated.

33. Computer software, comprising:

first instructions receiving at least two offset view images of a contoured structure; and

second instructions forming, from the offset view images, an output image of the contoured structure as if viewed from a virtual point of view.

34. The computer software of claim 33, wherein said second instructions are adapted to recognize feature points of the contoured structure in the offset view images, to access a three-dimensional closed mesh model of feature points similar to the recognized feature points, and to texture map the offset view images to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images, thereby forming a three-dimensional model of the contoured structure.

35. The computer software of claim 33, wherein said second instructions are further adapted to select a virtual point of view from which to view the three-dimensional model, and to render the three-dimensional model from the virtual point of view, thereby forming the output image.

36. Computer software, comprising:

a first set of instructions receiving at least two offset view images of a contoured structure; and

a second set of instructions recognizing feature points of the contoured structure in the offset view images, accessing a three-dimensional closed mesh model of feature points similar to the recognized feature points, and texture mapping the offset view images to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images, thereby forming a three-dimensional model of the contoured structure.