METHOD AND SYSTEM FOR AN AUTOMATIC SENSING, ANALYSIS, COMPOSITION AND DIRECTION OF A 3D SPACE, SCENE, OBJECT, AND EQUIPMENT
Method and system for automatic composition and orchestration of a 3D space or scene using networked devices and computer vision to bring ease of use and autonomy to a range of compositions. A scene, its objects, subjects and background are identified and classified, and relationships and behaviors are deduced through analysis. Compositional theories are applied, and context attributes (for example location, external data, camera metadata, and the relative positions of subjects and objects in the scene) are considered automatically to produce optimal composition and allow for direction of networked equipment and devices. Events inform the capture process; for example, a video recording is initiated when a rock climber waves her hand, and an autonomous camera automatically adjusts to keep her body in frame throughout the sequence of moves. Model analysis allows for direction, including audio tones to indicate proper form for the subject and instructions sent to equipment to ensure optimal scene orchestration.
The instant application is a utility application of the previously filed U.S. Provisional Application 62/053,055, filed on 19 Sep. 2014. The pending U.S. Provisional Application 62/053,055 is hereby incorporated by reference in its entirety for all of its teachings.
FIELD OF INVENTION
A method and system for automatic sensing using photographic equipment that captures a 3D space, scene, subject, object, and equipment for further analysis, composition, and direction that can be used for creating visual design.
BACKGROUND
Computer device hardware and software continue to advance in sophistication. Cameras, microcontrollers, computer processors (e.g., ARM), and smartphones have become more capable, as well as smaller, cheaper, and ubiquitous. In parallel, more sophisticated algorithms including computer vision, machine learning, and 3D models can be computed in real-time or near real-time on a smartphone or distributed over a plurality of devices over a network.
At the same time, multiple cameras including front-facing cameras on smartphones have enabled the popularity of the selfie as a way for anyone to quickly capture a moment and share it with others. But the primary mechanism for composition has not advanced beyond an extended arm or a selfie stick and use of the device's screen as a visual reference for the user to achieve basic scene framing. Recently, GPS-based drone cameras such as Lily have been introduced that improve on the selfie stick, but they are not autonomous and instead require the user to wear a tracking device to continually establish the focal point of the composition and pass directional “commands” to the drone via buttons on the device. This is limiting when trying to include multiple dynamic subjects and/or objects in the frame (a “groupie”), or when the user is preoccupied or distracted (for example at a concert, or while engaged in other activities).
SUMMARY
The present invention is in the areas of sensing, analytics, direction, and composition of 3D spaces. It provides a dynamic real-time approach to sense, recognize, and analyze objects of interest in a scene; applies a composition model that automatically incorporates best practices from prior art as models, for example: photography, choreography, cinematography, art exhibition, and live sports events; and directs subjects and equipment in the scene to achieve the desired outcome.
In one embodiment, a high-quality professional-style recording is being composed using the method and system. Because traditional and ubiquitous image capture equipment can now be enabled with microcontrollers and/or sensor nodes in a network to synthesize numerous compositional inputs and communicate real-time directions to subjects and equipment using a combination of sensory (e.g., visual, audio, vibration) feedback and control messages, it becomes significantly easier to get a high-quality output on one's own. If there are multiple people or subjects who need to be posed precisely, each subject can receive personalized direction to ensure their optimal positioning relative to the scene around them.
In one embodiment, real-world scenes are captured using sensor data and translated into 2D, 2.5D, and 3D models in real-time using a method such that continuous spatial sensing, recognition, composition, and direction are possible without requiring additional human judgment or interaction with the equipment and/or scene.
In one embodiment, image processing, image filtering, video motion analysis, background subtraction, object tracking, pose estimation, stereo correspondence, and 3D reconstruction are run perpetually to provide optimal orchestration of subjects and equipment in the scene without a human operator.
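As a minimal illustrative sketch (not part of the original disclosure), this perpetual processing could be structured as a continuous loop in which each frame is background-subtracted and candidate subject/object regions are handed to later tracking, pose, and reconstruction stages. The sketch assumes the OpenCV library is available; the function name perpetual_sensing and the area threshold are hypothetical.

```python
# Sketch of a perpetual sensing loop: frames are read continuously, the
# background is subtracted, and moving regions are extracted without any
# operator intervention.
import cv2

def perpetual_sensing(camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Separate moving subjects/objects from the static scene background.
        mask = subtractor.apply(frame)
        mask = cv2.medianBlur(mask, 5)
        # Candidate regions become inputs to downstream tracking, pose
        # estimation, and 3D reconstruction stages.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        regions = [cv2.boundingRect(c) for c in contours
                   if cv2.contourArea(c) > 500]
        yield frame, regions
    cap.release()
```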
In one embodiment, subjects can be tagged explicitly by a user, or determined automatically by the system. If desired, subjects can be tracked or kept in frame over time and as they move throughout a scene, without further user interaction with the system. The subject(s) can also be automatically directed through sensory feedback (e.g., audio, visual, vibration) or any other user interface.
In one embodiment as a method, an event begins the process of capturing the scene. The event can be an explicit hardware action such as pressing a shutter button or activating a remote control for the camera, or the event can be determined via software, a real-world event, or a message or notification symbol; for example, recognizing the subject waving their arms, a hand gesture, an object or symbol, or an identified subject or entity entering a predetermined area in the scene.
The system allows for the identification of multiple sensory event types, including physical-world events (object entering/exiting the frame, a sunrise, a change in the lighting of the scene, the sound of someone's voice, etc.) and software-defined events (state changes, timers, sensor-based). In one embodiment, a video recording is initiated when a golfer settles into her stance and aims her head down, and the camera automatically adjusts to keep her moving club in the frame during her backswing before activating burst mode so as to best capture the moment of impact with the ball during her downswing before pausing the recording seconds after the ball leaves the frame. Feedback can be further provided to improve her swing based on rules and constraints provided from an external golf professional, while measuring and scoring how well she complies with leading practice motion ranges.
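A minimal sketch of this kind of event-gated capture follows. The detector callbacks (detect_start_gesture, subject_in_frame) and the recorder object are hypothetical placeholders, not names from the disclosure; the tail duration is an assumed value.

```python
# Sketch of event-gated capture: a software-defined event (a gesture) starts
# recording, and a physical-world event (the subject leaving the frame) stops
# it a few seconds later.
import time

RECORD_TAIL_SECONDS = 3.0   # assumed pause-after-exit delay

def capture_controller(frames, detect_start_gesture, subject_in_frame, recorder):
    recording = False
    last_seen = None
    for frame in frames:
        if not recording and detect_start_gesture(frame):
            recorder.start()          # e.g., begin video recording
            recording = True
        if recording:
            recorder.write(frame)
            if subject_in_frame(frame):
                last_seen = time.monotonic()
            elif last_seen and time.monotonic() - last_seen > RECORD_TAIL_SECONDS:
                recorder.stop()       # pause shortly after the subject exits
                recording = False
```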
In another embodiment, a video or camera scan can be voice or automatically initiated when the subject is inside the camera frame and monitor and direct them through a sequence of easy to understand body movements and steps with a combination of voice, lights and by simple mimic of on-screen poses as in a user interface or visual display. For a few examples, the subject could be practicing and precisely evaluating yoga poses, following a physical therapy program, or taking private body measurements.
The present invention enables real-time sensing, spatial composition, and direction for objects, subjects, scenes, and equipment in 2D, 2.5D or 3D models in a 3D space. In a common embodiment, a smartphone will be used for both its ubiquity and the combination of cameras, sensors, and interface options.
In an embodiment, the self-assembled stand (101) can be fashioned from materials included as a branded or unbranded removable insert (105) in a magazine or other promotion (106) with labeling and tabs sufficient so that the user is able to remove the insert (105) and assemble it into a stand (101) without any tools. This shortens the time to initial use by an end-user by reducing the steps needed to position a device for proper capture of a scene.
As seen in
When positioning the device on a door, wall, or other vertical surface (
Referring now to
Advances in hardware/software coupling on smartphones further extend the applicability of the system and provide opportunities for a better user experience when capturing a scene because ubiquitous smartphones and tablets (
Using the mounts described in
Once recognized in a scene, subjects (220) can then be directed via the system to match desired compositional models, according to various sensed orientations and positions. These include body alignment (225), arm placement (230), and head tilt angle (234). Additionally, the subject can be directed to rotate in place (235) or to change their physical location by either moving forward, backward, or laterally (240).
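One way such direction could be computed is sketched below: each sensed orientation is compared with the corresponding target from a composition model, and a cue is emitted only when the deviation exceeds a tolerance. The Pose fields, cue wording, and five-degree tolerance are illustrative assumptions; the sensed angles would come from a pose estimator and the targets from a stored model.

```python
# Sketch of turning sensed orientations into direction cues.
from dataclasses import dataclass

@dataclass
class Pose:
    body_alignment_deg: float   # cf. (225)
    arm_angle_deg: float        # cf. (230)
    head_tilt_deg: float        # cf. (234)

def direction_cues(sensed: Pose, target: Pose, tolerance_deg: float = 5.0):
    cues = []
    for attr, label in [("body_alignment_deg", "rotate your shoulders"),
                        ("arm_angle_deg", "adjust your arm"),
                        ("head_tilt_deg", "tilt your head")]:
        delta = getattr(target, attr) - getattr(sensed, attr)
        if abs(delta) > tolerance_deg:
            cues.append(f"{label} {'left' if delta < 0 else 'right'} "
                        f"by about {abs(delta):.0f} degrees")
    return cues
```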
Rotation (225) in conjunction with movement along a plane (240) also allows for medical observation, such as orthopedic evaluation of a user's gait or posture. While an established procedure exists today wherein trained professional humans evaluate gait, posture, and other attributes in-person, access to those professionals is limited and the quality and consistency of the evaluations is irregular. The invention addresses both shortcomings through a method and system that makes use of ubiquitous smartphones (110) and the precision and modularity of models. Another instance where networked sensors and cameras can replace a human professional is precise body measurement, previously achieved by visiting a quality tailor. By creating a 3D scene and directing subjects (220) intuitively as they move within it, the system is able to ensure with high accuracy that the subjects go through the correct sequences and the appropriate measurements are collected efficiently and with repeatable precision. Additionally, this method of dynamic and precise capture of a subject while sensing can be used to achieve positioning required for stereographic images with e.g., a single lens or sensor.
The visual on-screen feedback (347) can take the form of a superimposed image of the subject's sensed position relative to the directed position in the scene (350). In one embodiment, the positions are represented as avatars, allowing human subjects to naturally mimic and achieve the desired position by aligning the two avatars (350). Real-time visual feedback is possible because the feedback-providing device (110) is networked (351) to all other sensing devices (352), allowing for synthesis and scoring of varied location and position inputs and providing a precise awareness of the scene's spatial composition (this method and system is discussed further in
Other devices such as Wi-Fi-enabled GoPro®-style action cameras (202) and wearable technologies such as a smart watch with a digital display screen (353) can participate in the network (351) and provide the same types of visual feedback (350). This method of networking devices for capturing and directing allows individuals to receive communications according to their preferences on any network-connected device such as, but not limited to, a desktop computer (354), laptop computer (355), phone (356), tablet (357), or other mobile computer (358).
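The avatar-alignment feedback described above implies a per-frame score comparing the sensed and directed positions. A minimal sketch of such a score is given below, assuming both poses are available as corresponding (x, y) keypoints; the distance-to-score mapping and scale constant are assumptions.

```python
# Sketch of the alignment score behind the superimposed-avatar feedback (350):
# sensed keypoints are compared against directed (target) keypoints and
# reduced to a single score between 0 and 1.
import math

def alignment_score(sensed_keypoints, target_keypoints, scale=100.0):
    """Return 1.0 for a perfect match, approaching 0.0 as the poses diverge."""
    if len(sensed_keypoints) != len(target_keypoints):
        raise ValueError("keypoint sets must correspond one-to-one")
    err = sum(math.dist(s, t) for s, t in zip(sensed_keypoints, target_keypoints))
    mean_err = err / len(sensed_keypoints)
    return 1.0 / (1.0 + mean_err / scale)
```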
The example in (466) shows how even the bottom of a foot (471) can be captured and precise measurements can be taken using a smartphone (110). By using the phone's gyroscope, the phone's camera can be directed to begin the capture when the phone is on its back, level, and the foot is completely in frame. No visual feedback is required and the system communicates direction such as rotation (470) or orientation changes (473, 474) through spoken instructions (446) via the smartphone's speakers (472).
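A sketch of the orientation gate used in this example follows. The gravity-vector reader is a hypothetical stand-in for whatever platform sensor API the device exposes, the screen-up sign convention is assumed, and the tolerance value is illustrative.

```python
# Sketch of gating capture on device orientation: capture begins only once the
# phone reports it is lying on its back and level.
LEVEL_TOLERANCE = 0.15   # maximum allowed sine of the tilt angle (assumed)

def is_flat_on_back(read_gravity_vector):
    """read_gravity_vector: hypothetical callable returning device-frame gravity (m/s^2)."""
    gx, gy, gz = read_gravity_vector()
    magnitude = (gx * gx + gy * gy + gz * gz) ** 0.5
    # Lying screen-up, gravity points almost entirely along +z in the device
    # frame (assumed convention); the x/y components measure residual tilt.
    tilt = ((gx * gx + gy * gy) ** 0.5) / magnitude
    return gz > 0 and tilt < LEVEL_TOLERANCE
```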
Multiple sensory interface options provide ways to make the system more accessible, and allow more people to use it. In an embodiment, a user can indicate they do not want to receive visual feedback (because they are visually impaired, or because the ambient lighting is too bright, or for other reasons) and their preference can be remembered, so that they can receive feedback through audio (446) and vibration (448) instead.
Referring now to
Additionally, sports-specific movements such as those in soccer (506) (goal keeper positioning, shoot on goal, or dribbling and juggling form) and activities like baseball (507) (batting, fielding, catching), martial arts (508), dance (509), or yoga (510) are traditionally difficult to self-capture as they require precise timing and the subject is preoccupied so visual feedback becomes impractical. Looking again at (506), the ball may only contact the athlete's foot for a short amount of time, so the window for capture is correspondingly brief. The existing state of the art to capture such images is to record high definition, high-speed video over the duration of the activity and generate stills afterward, often manually. This is inefficient and creates an additional burden to sift through potentially large amounts of undesired footage.
A method and system for integrating perpetual sensor inputs, real-time analytics capabilities, and layered compositional algorithms (discussed further in
In another embodiment, the system can use the order of the images to infer a motion path and can direct participants in the scene according to a compositional model matched from a database. Alternatively, the images provided can be input to the system as designated “capture points” (516), or moments to be marked if they occur in the scene organically. This type of system for autonomous capture is valuable because it simplifies the post-capture editing/highlighting process by reducing the amount of waste footage captured initially, as defined by the user.
In another embodiment, static scenes such as architectural photography (518) can also be translated from 2D to 3D. The method for recording models for interior (517) and exterior (518) landscapes by directing the human user holding the camera can standardize historically individually composed applications (for example in real estate appraisals, MLS listings, or promotional materials for hotels). Because the system is capable of self-direction and provides a method for repeatable, autonomous capture of high quality visual assets by sensing, analyzing, composing, and directing, the system allows professionals in the above-mentioned verticals to focus their efforts not on orchestrating the perfect shot but on storytelling.
In another embodiment, mounted cameras and sensors can provide information for Building Information Modeling (BIM) systems. Providing real-time monitoring and sensing allows events to be not only tagged but also directed and responded to, using models that provide more granularity than is traditionally available. In one embodiment, successful architectural components from existing structures can evolve into models that can inform new construction, direct building maintenance, identify how people are using the building (e.g., traffic maps), and can optimize HVAC or lighting, or adjust other environment settings.
As their ubiquity drives their cost down, cameras and sensors used for creating 3D building models will proliferate. Once a 3D model of a building has been captured (517), the precise measurements can be shared and made useful to other networked devices. As an example, the state of the art now is for each device to create its own silos of information. Dyson's vacuum cleaner The Eye, for example, captures multiple 360 images each second on its way to mapping a plausible vacuuming route through a building's interior, but those images remain isolated and aren't synthesized into a richer understanding of the physical space. Following 3D space and markers using relative navigation of model parameters and attribute values is much more reliable and less costly, regardless of whether image sensing is involved.
In another embodiment, the system can pre-direct a 3D scene via a series of 2D images such as a traditional storyboard (515). This can be accomplished by sensing the content in the 2D image, transforming sensed 2D content into a 3D model of the scene, objects, and subjects, and ultimately assigning actors roles based on the subjects and objects they are to mimic. This transformation method allows for greater collaboration in film and television industries by enabling the possibility of productions where direction can be given to actors without the need for actors and directors to be in the same place at the same time, or speak a common language.
Once the capture process has been started (602), pre-sensed contexts and components (Object(s), Subject(s), Scene, Scape, Equipment) (601) are fed into the Sensing Module (603). Now both physical and virtual analytics such as computer vision (i.e., CV) can be applied in the Analytics Module (604) to make sense of scene components identified in the Sensing Module (603). And they can be mapped against composition models in the Composition/Architecture Module (605) so that in an embodiment, a subject can be scored for compliance against a known composition or pattern. Pre-existing models can be stored in a Database (600) that can hold application states and reference models, and those models can be applied at every step of this process. Once the analysis has taken place comparing sensed scenes to composed scenes, direction of the components of the scene can occur in the Direction/Control Module (606) up to and including control of robotic or computerized equipment. Other types of direction include touch-UI, voice-UI, display, control message events, sounds, vibrations, and notifications. Equipment can be similarly directed via the Direction/Control Module (606) to automatically and autonomously identify a particular subject (e.g., a baseball player) in conjunction with other pattern recognition (such as a hit, 507), allowing efficient capture of subsets in frame only. This can provide an intuitive way for a user to instruct the capture of a scene (e.g., begin recording when #22 steps up to the plate, and save all photos of his swing, if applicable).
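The data flow among these modules could be organized as in the structural sketch below. The class and method names (Pipeline, sense, analyze, match, plan, record) are assumptions made for illustration and are not defined by the disclosure.

```python
# Structural sketch of the module pipeline: the Sensing Module (603) feeds the
# Analytics Module (604), findings are scored against models in the
# Composition/Architecture Module (605), and the Direction/Control Module (606)
# emits cues and control messages, with the Database (600) shared throughout.
class Pipeline:
    def __init__(self, sensing, analytics, composition, direction, database):
        self.sensing = sensing          # 603
        self.analytics = analytics      # 604
        self.composition = composition  # 605
        self.direction = direction      # 606
        self.database = database        # 600: shared models and application state

    def step(self, raw_inputs):
        scene = self.sensing.sense(raw_inputs)
        findings = self.analytics.analyze(scene, self.database)
        model, score = self.composition.match(findings, self.database)
        commands = self.direction.plan(findings, model, score)
        self.database.record(scene, findings, model, score)   # feedback loop
        return commands
```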
The Sensing Module (603) can connect to the Analytics Module (604) and the Database (600); however, the Composition/Architecture Module (605) and Direction/Control Module (606) can connect to the Analytics Module (604) and the Database (600), as shown in
In another embodiment, the capability gained from pairing the system's Sensing Module (603) and Analytics Module (604) with its Composition/Architecture Module (605) and Direction/Control Module (606) allows for on-demand orchestration of potentially large numbers of people in a building, for example automatically directing occupants to safety during an emergency evacuation such as a fire. The Sensing Module (603) can make sense of inputs from sources including security cameras, proximity sensors such as those found in commercial lighting systems, and models stored in a database (600) (e.g., seating charts, blueprints, maintenance schematics) to create a 3D model of the scene and its subjects and objects. Next, the Analytics Module (604) can use layered CV algorithms such as background cancellation to deduce, for example, where motion is occurring. The Analytics Module (604) can also run facial and body recognition processes to identify human subjects in the scene, and can make use of ID badge reading hardware inputs to link sensed subjects to real-world identities. The Composition/Architecture Module (605) can provide the optimal choreography model for the evacuation, which can be captured organically during a previous fire drill at this location, or can be provided to the system in the form of an existing “best practice” for evacuation. All three modules (Sensing Module (603), Analytics Module (604), and Composition/Architecture Module (605)) can work in a feedback loop to process sensed inputs, make sense of them, and score them against the ideal compositional model for the evacuation. Additionally, the Direction/Control Module (606) can provide feedback to the evacuees using the methods and system described in
To protect subject privacy and provide high levels of trust in the system, traditional images are neither captured nor stored, and only obfuscated point clouds are recorded by the device (704). These obfuscated point clouds are less identifiable than traditional camera-captured images, and can be encrypted (704). In real-time, as this data is captured at any number and type of nodes, either by a set of local devices (e.g., smartphones) or by a cloud-based service, a dynamic set of computer vision modules (i.e., CV) (705) and machine learning algorithms (ML) are included and reordered as they are applied to optimally identify the objects and subjects in a 3D or 2D space. A “context system” (706) external to the invention can concurrently provide additional efficiency or speed in correlating what is being sensed with prior composition and/or direction models. Depending on the results from the CV and on the specific use-case, the system can transform the space, subjects, and objects into a 3D space with 2D, 2.5D, or 3D object and subject models (707).
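A minimal sketch of the obfuscation-and-encryption step (704) follows, assuming a quantize-and-jitter obfuscation and symmetric encryption via the third-party cryptography package; the grid and jitter parameters are illustrative assumptions rather than values specified by the disclosure.

```python
# Sketch of privacy-preserving storage: points are quantized and jittered so
# the cloud is less identifiable than an image, then serialized and encrypted.
import json
import random
from cryptography.fernet import Fernet

def obfuscate_and_encrypt(points, key, grid=0.02, jitter=0.005):
    """points: iterable of (x, y, z) in meters; key: a Fernet key, e.g. Fernet.generate_key()."""
    blurred = [tuple(round(c / grid) * grid + random.uniform(-jitter, jitter)
                     for c in p) for p in points]
    payload = json.dumps(blurred).encode("utf-8")
    return Fernet(key).encrypt(payload)   # readable only with the key
```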
In some use-cases, additional machine learning and heuristic algorithms (708) can be applied across the entire system and throughout processes and methods, for example to correlate the new space being sensed with the most relevant composition and/or direction models, or to provide other applications outside of this application with analytics on this new data. The system utilizes both supervised and unsupervised machine learning in parallel, which can run in the background to provide context (706) around, for example, which CV and ML methods were implemented most successfully. Supervised and unsupervised machine learning can also identify the leading practices associated with successful outcomes, where success can be determined by criteria from the user, expert or social feedback, or publicly available success metrics. For performance, the application can cache in memory the most relevant composition model(s) (710) for faster association with models related to sensing and direction. While the newly stored sensed data is monitored and tracked (711), it can be converted and dynamically updated (712) into a new, unique composition model if the pattern is unique, for example as determined automatically using statistical analysis or ML, or manually through a user/expert review interface.
In embodiments where a user is involved in the process, the application can provide continual audio, speech, vibration or visual direction to a user (715) or periodically send an event or message to an application on the same or other device on the network (716) (e.g., a second camera to begin capturing data). Direction can be sporadic or continuous, can be specific to humans or equipment, and can be given using the approaches and interfaces detailed in
As the application monitors the processing of the data, it maintains a feedback loop (720) against the composition or direction model, adjusting parameters, looping back to (710), or including additional software components, and updating dynamically on a continuous basis (721). New composition models will be stored (722) whether detected by the software or defined by a user or expert through a user interface (723). New and old composition models and corresponding data are managed and version controlled (724).
By analyzing the output from the Sensing Module (603), the system can dynamically and automatically utilize or recommend a relevant stored composition model (725) and direct users or any and all equipment or devices from this model. But in other use cases, the user can manually select a composition model from those previously stored (726).
From the composition model, the direction model (727) provides events, messages, and notifications, or control values to other subjects, applications, robots or hardware devices. Users and/or experts can provide additional feedback as to the effectiveness of a direction model (728), to validate, augment or improve existing direction models. These models and data are version controlled (729).
In many embodiments, throughout the process the system can sporadically or continuously provide direction (730), either via visual UI, audio, voice, or vibration to user(s), or as control values sent by event or message to networked devices (731) (e.g., robotic camera dolly, quadcopter drone, pan and tilt robot, Wi-Fi-enabled GoPro®, etc.).
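Control messages of this kind could be carried by any messaging transport. The sketch below uses plain JSON datagrams over UDP; the message schema, command names, and device address are assumptions made for illustration only.

```python
# Sketch of emitting direction as control messages to networked devices (731).
import json
import socket

def send_control(device_addr, command, params):
    msg = json.dumps({"type": "control", "command": command,
                      "params": params}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(msg, device_addr)

# Example: ask a pan-tilt rig to re-center on the tracked subject
# (address and command name are hypothetical).
# send_control(("192.168.1.42", 9000), "pan_tilt", {"pan_deg": 12, "tilt_deg": -3})
```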
Each process throughout the system can utilize a continuous feedback loop as it monitors, tracks, and reviews sensor data against training set models (732). The process can continuously compute and loop back to (710) in the process flow, and can end (733) on an event or message from an external or internal application, or on input from a user/expert through a UI.
Other sensors can be used in parallel or serially to improve the context and quality of sensing (806). For example, collecting the geolocation positions transmitted from the wearable devices or smartphones of subjects in an imaged space can help provide richer real-time sensing data to other parts of the system, such as the Composition Module (605). Throughout the processes, the entity, object, and scene capture validation (807) continuously evaluates what in the scene is being captured and recognized, and to what level of confidence. This confidence level of recognition and tracking is enhanced as other devices and cameras are added to the network because their inputs and sensory capabilities can be shared and reused and their various screens and interface options can be used to provide rich feedback and direction (
The sensing process might start over or move on to a plurality of dynamically ordered computer vision algorithm components (809) and/or machine learning algorithm components (810). In various embodiments, those components can include, for example, blob detection algorithms, edge detection operators such as Canny, and edge histogram descriptors. The CV components are always in a feedback loop (808) provided by previously stored leading practice models in the Database (600) and machine learning processes (811). In an embodiment, image-sensing lens distortion (e.g., from smartphone camera data) can be corrected for barrel distortion, and the gyroscope and compass can be used to understand the context of subject positions in a 3D space relative to camera angles (812). The system can generate 3D models, or obfuscated and/or encrypted point clouds (813), on the device or via a networked service. These point clouds or models are also maintained in a feedback loop (814) with pre-existing leading practice models in the Database (600).
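Two of the named sensing components, barrel-distortion correction (812) and Canny edge detection (809), are sketched below using OpenCV; the camera intrinsics shown in the trailing comment are placeholder values that would normally come from calibration, and the threshold defaults are assumptions.

```python
# Sketch of lens-distortion correction followed by edge detection as one of the
# ordered CV components.
import cv2
import numpy as np

def undistort_and_edges(frame, camera_matrix, dist_coeffs,
                        canny_low=50, canny_high=150):
    corrected = cv2.undistort(frame, camera_matrix, dist_coeffs)
    gray = cv2.cvtColor(corrected, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_low, canny_high)
    return corrected, edges

# Example intrinsics (placeholder values, normally obtained by calibration):
# K = np.array([[1000.0, 0, 640.0], [0, 1000.0, 360.0], [0, 0, 1.0]])
# d = np.array([-0.25, 0.08, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3
```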
A broader set of analytics and machine learning can be run against all models and data (604). The Sensing Module (603) is described earlier in
In one embodiment, a solo subject can also be directed to pose in the style of professional models (1002), incorporating architectural features such as walls and with special attention given to precise hand, arm, leg placement and positioning even when no specific image is providing sole compositional guidance or reference. To achieve this, the system can synthesize multiple desirable compositions from a database (600) into one composite reference composition model. The system also provides the ability to ingest existing 2D art (1006) which is then transformed into a 3D model used to auto-direct composition and can act as a proxy for the types of scene attributes a user might be able to recognize but not articulate or manually program.
In another embodiment, groups of subjects can be automatically directed to pose and be positioned so that hierarchy and status are conveyed (1010). This can be achieved using the same image synthesis method and system as in (1002), and by directing each subject individually while posing them relative to each other to ensure compliance with the reference model. The system's simultaneous direction of multiple subjects in frame can dramatically shorten the time required to achieve a quality composition. Whereas previously a family (1005) would have used time-delay and extensive back-and-forth positioning or enlisted a professional human photographer, now the system is able to direct them and reliably execute the ideal photograph at the right time and using ubiquitous hardware they already own (e.g., smartphones). The system is able to make use of facial recognition (1007) to deliver specific direction to each participant, in this embodiment achieving optimal positioning of the child's arm (1008, 1009). In another embodiment, the system is able to direct a kiss (1003) using the Sensing Module (603), Analytics Module (604), Composition/Architecture Module (605), and Direction/Control Module (606) and the method described in
In scenarios where distinguishing between subjects is difficult (poor light, similar clothing, camouflage in nature), stickers or other markers can be attached to the real-world subjects and tagged in this manner. Imagine a distinguishing sticker placed on each of the five subjects (901), helping to keep them correctly identified. These stickers or markers can be any sufficiently differentiated pattern (including stripes, dots, solid colors, text) and can be any material, including simple paper and adhesive, allowing them to come packaged in the magazine insert from
Much of the specific location information the system makes use of to inform the composition and direction decisions is embodied in a location model, as described in
Referring now to
Human subjects (1600) can be deconstructed similarly to buildings, as seen in
In one embodiment, such as a body measurement application for Body Mass Index or another health use-case, a fitness application, or a garment fit or virtual fitting application, a simpler representation (1605) might be created and stored on the device for the user interface, or in a social site's datacenters. This obfuscates the subject's body, masking the vivid body model to address privacy or social “body image” concerns. Furthermore, data encryption and hash processing of these images can also be applied automatically in the application on the user's device and throughout the service to protect user privacy and security.
Depending on the output from the Sensing Module (603), the system can either create a new composition model for the Database (600), or select a composition model based on attributes deemed most appropriate for composition: body type, size, shape, height, arm position, and face position. Further, precise composition body models can be created for precise direction applications in photo, film, theater, musical performance, dance, and yoga.
The Database (600) can also hold 2D images of individuals and contextualized body theory models (1707), 3D models of individuals (1705), and 2D and 3D models of clothing (1704), allowing the system to score and correlate between models. In one embodiment, the system can select an appropriate suit for someone it senses is tall and thin (1705) by considering the body theory and fashion models (1707) as well as clothing attributes (1704) such as the intended fit profile or the number of buttons.
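A minimal sketch of such attribute-based selection against the Database (600) is shown below; the attribute keys, weighting scheme, and nearest-match rule are assumptions for illustration rather than the method defined by the disclosure.

```python
# Sketch of selecting the stored composition model whose attributes best match
# the sensed body attributes (e.g., choosing a suit model for a tall, thin subject).
def select_composition_model(sensed_attributes, stored_models, weights=None):
    """sensed_attributes and model['attributes'] are dicts such as
    {'height_cm': 188, 'shoulder_cm': 46, 'fit': 'slim'} (illustrative keys)."""
    weights = weights or {}

    def distance(model):
        d = 0.0
        for key, value in sensed_attributes.items():
            ref = model["attributes"].get(key)
            if ref is None:
                continue
            w = weights.get(key, 1.0)
            if isinstance(value, (int, float)) and isinstance(ref, (int, float)):
                d += w * abs(value - ref)          # numeric attributes: absolute difference
            else:
                d += w * (0.0 if value == ref else 1.0)   # categorical: mismatch penalty
        return d

    return min(stored_models, key=distance)
```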
The system can keep these models and their individual components correlated to social feedback (1703) such as Facebook, YouTube, Instagram, or Twitter using metrics such as views, likes, or changes in followers and subscribers. By connecting the system to a number of social applications, a number of use cases could directly provide context and social proof around identified individuals in a play or movie, from the overall composition and cinematography of a scene in a play, music recital, movie, or sports event to how well received a personal image (501) or group image or video (1101) was. This also continuously provides a method and process for tuning best practice models of all types of compositions, from photography, painting, and movies to skiing, mountain biking, surfing, competitive sports, exercises, yoga poses (510), dance, music, and performances.
All of these composition models can also be analyzed for trends in social popularity, from fashion to popular dance moves to the latest form alterations in yoga or fitness exercises. In one example use case, a camera (202) and a broad spectrum of hardware (1706), such as lights, robotic camera booms or dollies, and autonomous quadcopters, could be evaluated individually or as part of the overall composition, including such items as lights, dolly movements, and the camera with its multitude of settings and attributes.
Referring now to
Claims
1. A method, comprising:
- Capturing a 2D image in a specific format of an object, subject, and scene using a device;
- Sensing an object, subject, and scene automatically and continuously using the device;
- Analyzing the 2D image of the object, subject, and scene captured to determine the most relevant composition and direction model;
- Transforming an object, subject, and scene into a 3D model using existing reference composition/architecture model; and
- Storing the 3D model of the scene in a database for use and maintaining it in a feedback loop.
2. The method of claim 1, further comprising:
- Performing continuous contextual analysis of an image and its resulting 3D model to provide an update to subsequent 3D modeling processes; and
- Dynamically updating and responding to contextual analytics performed.
3. The method of claim 2, further comprising:
- Coordinating accurate tracking of objects and subjects in a scene by orchestrating autonomous equipment movements using a feedback loop.
4. The method of claim 3, further comprising:
- Controlling the direction of a scene and its subjects via devices using a feedback loop.
5. The method of claim 4, further comprising:
- Creating and dynamically modifying in real-time the 2D or 3D model for the subject, object, scene, and equipment in any spatial orientation and providing immediate feedback in a user interface.
6. The method of claim 1, wherein the device is at least one of a camera, wearable device, desktop computer, laptop computer, phone, tablet, and other mobile computer.
7. A system, comprising:
- A processing unit that can exist on a user device, on-premise, or as an off-premise service to house the following modules;
- A sensing module that can understand the subjects and context of a scene over time via models;
- An analytics module that can analyze sensed scenes and subjects to determine the most relevant composition and direction models or create them if necessary;
- A composition/architecture module that can simultaneously store the direction of multiple subjects or objects of a scene according to one or more composition models;
- A direction/control module that can provide direction and control to each subject, object, and equipment individually and relative to a scene model; and
- A database that can store models for use and maintain them in a feedback loop with the above modules.
Type: Application
Filed: Sep 18, 2015
Publication Date: Mar 24, 2016
Inventors: Hamish Forsythe (Palo Alto, CA), Alexander Cecil (Redwood City, CA)
Application Number: 14/858,901