PARTS BASED OBJECT TRACKING METHOD AND APPARATUS
The method and apparatus segment parts from a main target image, represent each part as a vertex in a spanning tree, use a detector to generate a confidence map of the location of each part in a succeeding video frame, and apply scale changes to the detector sliding windows centered about each pixel in the part location image. In the succeeding video frame, the target location is sampled and a tracking probability is generated for each part bounding box, with the maximum tracking probability selected as the location of the target.
The present disclosure relates, in general, to vehicle object detection and tracking methods and apparatus and, more specifically, to methods and apparatus for detecting and tracking objects from a moving vehicle.
Vehicle safety is deemed to be enhanced by computer and sensor based systems which detect objects, such as vehicles and pedestrians, as well as stationary poles and signs, that may be in the path of a moving vehicle and could result in a collision.
Autonomous driverless vehicles are also being proposed. Such autonomous vehicles require a current view of surrounding objects, such as vehicles, poles, pedestrians, etc., which may be moving or stationary relative to the moving vehicle.
In order to accurately implement vehicle based collision warning and avoidance systems, as well as autonomous driverless vehicles, object detection and tracking methods have been proposed.
SUMMARY
A method and apparatus for detecting and tracking objects includes receiving a video sequence including a plurality of sequential video frames from a video camera, selecting a target image in a first video frame, segmenting the target image into at least one identifiable part in a target bounding box, and training a detector for each of the at least one part in the first video frame. In a next video frame, the method samples the position of the target bounding box in a particle filter framework. For each part of the at least one part, a confidence map is generated from the search region for the possible location of the part by applying a plurality of sliding windows of polygonally arranged pixels, each sliding window centered around each pixel in the search region for the at least one part located in the first video frame, and generating a tracking score for each sliding window corresponding to a probability of the part being centered in each sliding window in the next video frame.
The segmenting step includes segmenting the target image into a plurality of separate parts, each in a separate part bounding box, and then representing each of the plurality of parts as a vertex in a spanning tree.
The method applies scale change to each sliding window for each part.
The method varies the size of the sliding windows applied to each pixel and repeats the generating step for each sliding window for each scale change.
The method, in the next video frame, updates the confidence map for each part in each sampled target location by multiplying a detector output confidence map for each part by a Gaussian distribution centered at an ideal location of each part in each sampled whole target bounding box.
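As a rough illustration of this weighting step, the sketch below multiplies a part's detector confidence map by a 2-D Gaussian centered at the part's ideal location inside a sampled target box. The array shapes, the sigma value, and the helper name gaussian_weight are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

def gaussian_weight(shape, center, sigma=5.0):
    """Unnormalized 2-D Gaussian centered at `center` = (row, col)."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical detector output for one part over a 40x40 search region.
rng = np.random.default_rng(0)
confidence = rng.random((40, 40))

# Ideal location of the part inside the sampled target bounding box.
ideal_location = (20, 12)

# Updated confidence map: detector output weighted by the Gaussian prior.
weighted = confidence * gaussian_weight(confidence.shape, ideal_location)
print(np.unravel_index(weighted.argmax(), weighted.shape))
```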
The method generates the tracking score encoding the appearance similarity between the detected part and the reference part and a deformation factor penalizing each individual part's deviation from the ideal location.
The method includes forming the detector using a detection algorithm based on a deep learning network.
The method includes outputting, by the detector, a video frame detection probability of a candidate region pixel by pixel in each of the video frames.
The method defines the search region of at least one part in the next video frame as a rectangular area of a predetermined size centered on the position of the at least one part in the previous video frame.
The method applies the detector with different scale sliding windows on the same video frame image and repeats the generation of a confidence map pixel by pixel for all scales.
The method for each scale applies the detector for the at least one part only to a search region centered on the location of the at least one part from a previous video frame.
The method models the state of the target by motion parameters of position and scale of a rectangular bounding box.
The method samples the position and scale of the target bounding box in each video frame to obtain an image patch for the target.
The method defines an ideal location of at least one part in each video frame within the whole target image patch.
The method defines a change in a position of the at least one part between two successive video frames as a function measuring a degree of conformity when the part is in a first location and when the part is in a second location.
The method defines the tracking score as:
c(x) = λm(x) + Σi mi(ti) + Σ(i,j) dij(ti, tj)
where m(x) is a matching score for the target bounding box and λ is a weighting factor between the target and its parts.
The method generates the detection score by defining the configuration of the parts that maximizes a second term and a third term of the tracking score equation using dynamic programming.
The method generates the tracking result for the sample having the maximum probability given by
p(yi|xi) = (1/Γ)exp{α c(xi)}
where Γ is the normalization factor, α is the tuning parameter, and yi is the observation.
The method removes at least one part from a tracking sequence when a maximum tracking score in the part confidence map for the at least one part is below a threshold.
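For reference, the two scoring relations used throughout this disclosure can be transcribed with explicit summation indices; the edge set E over part pairs is assumed here to be the set of edges of the spanning tree described above.

```latex
c(x) = \lambda\, m(x) + \sum_{i} m_i(t_i) + \sum_{(i,j) \in E} d_{ij}(t_i, t_j)

p(y_i \mid x_i) = \frac{1}{\Gamma} \exp\{\alpha\, c(x_i)\}
```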
The various features, advantages and other uses of the present part based object tracking method and apparatus will become more apparent by referring to the following detailed description and drawing in which:
Referring now to the drawing, and to
The apparatus, which can be mounted on a moving vehicle 80, includes a video camera 82. By way of example, the video camera 82 may be a high definition video camera with a resolution of 1024 pixels per frame.
The camera 82, in addition to possibly being a high definition camera 82, for example, can also be a big-picture/macro-view (normal or wide angle) or a telephoto camera. Two types of cameras, one having normal video output and the other providing the big-picture/macro-view (normal or wide angle) or telephoto output, may be employed on the vehicle 80. This would enable one camera to produce a big-picture/macro view, while the other camera could zoom in on parts of the target.
The apparatus, using the camera 82 and the control system described hereafter and shown in
The method is implemented on and the apparatus includes a computing device shown in a block diagram form in
The computing device 100 can also include secondary, additional, or external storage 114, for example, a memory card, flash drive, or other forms of computer readable medium. The installed applications 112 can be stored in whole or in part in the secondary storage 114 and loaded into the memory 104 as needed for processing.
The computing device 100 receives an input 116 in the form of sequential video frame image data from the camera 82 mounted on the vehicle 80.
The video image data may be stored in the memory 104 and/or the secondary storage 114.
Using a high definition video output 116 from the camera 82, the target will have a reasonable size, as shown in
The method and apparatus employ a tracking methodology which selects a whole target bounding box 130 on the target, which, as shown in
By way of example, the target bounding box 130 includes a number of smaller part bounding boxes 134, 136, 138, 140, 142 and 144. The number and size of each part bounding box 134-144 can be selected based on the size of the target bounding box 130 and the number of individual discriminable parts which can be successfully segmented and distinguished in the target bounding box 130.
Fast image segmentation is applied to the target 84 to select the main bounding box 130. The fast image segmentation also allows selection of the individual target bounding boxes 134-144 within the main bounding box 130.
Each part bounding box 134-144 is based on a discriminability and suitability check for each part shape and size. Each part bounding box 134-144 is represented by a rectangle, for example, so the individual parts need to be suitable such that each part occupies most of the rectangular region of its part bounding box 134-144 and is easily distinguishable from the surrounding portion of the target image in the target box 130.
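One way such a fast segmentation and suitability check might be realized is sketched below, using the Felzenszwalb graph-based segmenter from scikit-image as a stand-in for whatever segmentation algorithm the apparatus actually employs; the synthetic image, the parameter values, and the fill-ratio test are illustrative assumptions.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

# Synthetic stand-in for the image region inside the target bounding box 130.
rng = np.random.default_rng(1)
target_patch = rng.random((120, 160, 3))

# Fast graph-based segmentation; scale/sigma/min_size are illustrative values.
labels = felzenszwalb(target_patch, scale=100, sigma=0.5, min_size=50)

# Derive a candidate rectangular part bounding box for each segment label.
part_boxes = []
for lab in np.unique(labels):
    rows, cols = np.nonzero(labels == lab)
    box = (rows.min(), cols.min(), rows.max(), cols.max())  # (r0, c0, r1, c1)
    # Simple suitability check: the segment should fill most of its rectangle.
    fill = rows.size / float((box[2] - box[0] + 1) * (box[3] - box[1] + 1))
    if fill > 0.6:
        part_boxes.append(box)

print(len(part_boxes), "candidate part bounding boxes")
```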
The various method steps are shown pictorially in
In step 200, the target image, such as the rear end 132 of the vehicle 84 in
It should be noted that the individual part bounding boxes 134-144 can be overlapped for tracking purposes.
Next, in step 204, each part bounding box 134-144 is represented as a vertex 150 centered within the part bounding box 134-144. A complete graph of the vertices 150 is then built. A weight is associated with each edge connecting the part bounding boxes 134-144. The weight represents the similarity of the vertices 150 that are connected by the edge in terms of features such as color, size, distance, texture, etc. The CPU 102 then generates a spanning tree, such as a minimum spanning tree (MST 152,
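A minimal sketch of this step is given below: each part is a vertex, edge weights combine center distance with a feature dissimilarity (mean color is used here), and SciPy's minimum_spanning_tree extracts the MST. The part descriptors and the weighting between the two terms are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical part descriptors: center (row, col) and a mean-color feature.
centers = np.array([[10, 15], [12, 40], [30, 20], [32, 45], [50, 30]], float)
colors = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                   [0.2, 0.2, 0.9], [0.3, 0.3, 0.8], [0.1, 0.9, 0.2]])

n = len(centers)
weights = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # Edge weight: spatial distance plus (scaled) color dissimilarity.
        d_pos = np.linalg.norm(centers[i] - centers[j])
        d_col = np.linalg.norm(colors[i] - colors[j])
        weights[i, j] = d_pos + 50.0 * d_col

# Minimum spanning tree over the complete graph of part vertices.
mst = minimum_spanning_tree(weights).toarray()
edges = list(zip(*np.nonzero(mst)))
print("MST edges between part vertices:", edges)
```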
The CPU 102 then generates a detector for each part using a suitable detection algorithm, for example, an algorithm based on a deep learning network. The deep learning network is composed of multiple layers of hidden units, and the input to the network is the raw normalized pixel intensities of the image patch or features extracted from the image patch, such as color, texture, etc. Although the deep learning network may develop rotational invariance through training on image patches, it may also be useful to use rotationally invariant features as additional inputs to the network.
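The disclosure does not specify a particular architecture, so the sketch below shows only one plausible shape of such a part detector: a small fully connected PyTorch network taking normalized pixel intensities of a fixed-size patch and outputting a matching probability. The patch size, layer sizes, and (omitted) training procedure are assumptions.

```python
import torch
import torch.nn as nn

PATCH = 24  # assumed part-patch size in pixels

class PartDetector(nn.Module):
    """Small multi-layer network scoring whether a patch contains the part."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PATCH * PATCH, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # matching probability in [0, 1]
        )

    def forward(self, patch):
        # patch: (N, PATCH, PATCH) grayscale intensities normalized to [0, 1]
        return self.net(patch.flatten(1))

detector = PartDetector()
dummy_patches = torch.rand(8, PATCH, PATCH)  # untrained, illustrative input
print(detector(dummy_patches).shape)         # torch.Size([8, 1])
```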
The individual part detectors output a detection probability of the candidate region.
In the next video frame, each part detector is applied to the search region of the respective part using a sliding window operation. The search region for each part is defined as a rectangular area of a certain size that is centered on the position of the part from the previous video frame. Each pixel in the confidence map output by the part detector for the search region of each part holds the matching probability of the sliding window centered on that pixel. A sliding window is generated for each pixel, and the confidence map output is equivalent to the probability of the part being centered at that pixel. This is repeated for all of the pixels in the video frame for each part.
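A simplified sliding-window pass is sketched below. Because the learned part detector is outside the scope of this sketch, a normalized cross-correlation score against a reference template stands in for the detector's matching probability; the search-region size and window size are assumed values.

```python
import numpy as np

def window_score(window, template):
    """Stand-in detector: normalized cross-correlation, mapped to [0, 1]."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.linalg.norm(w) * np.linalg.norm(t) + 1e-8
    return 0.5 * (1.0 + float((w * t).sum()) / denom)

rng = np.random.default_rng(2)
search_region = rng.random((60, 60))  # region centered on the part's old position
template = rng.random((15, 15))       # reference appearance of the part
h, w = template.shape

# Confidence map: one matching score per pixel the window can be centered on.
conf = np.zeros((search_region.shape[0] - h + 1, search_region.shape[1] - w + 1))
for r in range(conf.shape[0]):
    for c in range(conf.shape[1]):
        conf[r, c] = window_score(search_region[r:r + h, c:c + w], template)

print("most likely part position (top-left offset):",
      np.unravel_index(conf.argmax(), conf.shape))
```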
To handle a scale change of the target 132, the part detector applies sliding windows of different sizes to the same image in step 208. Scale change is necessary since the detected object, such as the vehicle 84, may be closer to or further away from the vehicle 80 carrying the camera 82 in two successive video frames.
Steps 204 and 207 are then repeated for each scale. For K different scales, K confidence maps 162, 164, 166 are generated, one for each scale.
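The scale handling can be illustrated by rescaling the reference window once per scale and repeating the same sliding-window pass, yielding K confidence maps. As before, normalized cross-correlation stands in for the learned detector, and the scale factors are assumed values.

```python
import numpy as np
from skimage.transform import resize

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum()) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(3)
search_region = rng.random((60, 60))  # region around the part's previous position
template = rng.random((15, 15))       # reference appearance of the part
scales = [0.8, 1.0, 1.25]             # assumed K = 3 scales

confidence_maps = []
for s in scales:
    # Rescale the reference (equivalently, the sliding window) for this scale.
    size = max(2, round(15 * s))
    tpl = resize(template, (size, size), anti_aliasing=True)
    h, w = tpl.shape
    cm = np.zeros((search_region.shape[0] - h + 1, search_region.shape[1] - w + 1))
    for r in range(cm.shape[0]):
        for c in range(cm.shape[1]):
            cm[r, c] = ncc(search_region[r:r + h, c:c + w], tpl)
    confidence_maps.append(cm)

print([cm.shape for cm in confidence_maps])  # one confidence map per scale
```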
The state of the target 132 is modeled by the motion parameters of position ((u,w) in the image) and scale (s) of a rectangular target bounding box. In the next frame, the position and scale (the scale being fixed to one of the K scales) of the target are sampled to obtain an image patch for each sample. Given the position of the whole image patch 168
A function dij(ti, tj) measures the degree of conformity with the model when the part vi is placed at location ti and the part vj is placed at location tj. The tracking score for each sample is:
c(x) = λm(x) + Σi mi(ti) + Σ(i,j) dij(ti, tj)
where m(x) is the tracking score for the whole target bounding box 130 and λ is the weighting factor between the whole target bounding box 130 and the parts. For each sample, the target location is given, so m(x) is fixed. The problem is then reduced to finding the configuration of the parts that maximizes the second and third terms of the equation. The best configuration for the tree structured part model can then be found efficiently using dynamic programming in step 220.
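The maximization over part configurations on a tree can be carried out with a bottom-up dynamic program (message passing from leaves to root). The sketch below uses a tiny hand-built tree, a handful of candidate locations per part, and random unary and pairwise values; the recursion structure is the point, and the numbers and tree shape are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Tiny tree of parts: vertex 0 is the root; children lists give the tree edges.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
L = 4  # candidate locations per part (kept tiny for illustration)

# mi(ti): unary matching score for part i at each candidate location.
unary = {v: rng.random(L) for v in children}
# dij(ti, tj): pairwise term for each tree edge, treated here as a value to be
# maximized together with the unary terms, matching the form of the score above.
pair = {(p, c): rng.random((L, L)) for p in children for c in children[p]}

best = {}  # best[v][t] = best score of v's subtree with part v placed at t

def solve(v):
    """Bottom-up pass: solve all children of v, then combine at v."""
    score = unary[v].copy()
    for c in children[v]:
        solve(c)
        # For every location t of v, choose the child location maximizing the
        # pairwise term plus the child's own subtree score.
        combined = pair[(v, c)] + best[c][None, :]  # shape (L, L)
        score += combined.max(axis=1)
    best[v] = score

solve(0)
root_loc = int(best[0].argmax())
print("best root location:", root_loc, "score:", round(float(best[0][root_loc]), 3))
```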
In the particle filter framework, the likelihood p(yi|xi) for particle sample xi is defined as
p(yi|xi) = (1/Γ)exp{α c(xi)}
where Γ is the normalization factor, α is the tuning parameter, and yi is the observation. The tracking result is the sample 216 with the maximum likelihood.
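A compact numerical illustration of this step: given tracking scores c(xi) for a handful of particle samples, the likelihoods are computed as (1/Γ)exp{α c(xi)} and the sample with the maximum value is taken as the tracking result. The α value and the scores below are arbitrary illustrative numbers.

```python
import numpy as np

alpha = 2.0                                   # tuning parameter (assumed value)
scores = np.array([0.31, 0.58, 0.44, 0.62])   # c(x_i) for four particle samples

unnormalized = np.exp(alpha * scores)
gamma = unnormalized.sum()                    # normalization factor Γ
likelihood = unnormalized / gamma             # p(y_i | x_i) for each sample

best = int(likelihood.argmax())
print("tracking result: sample", best,
      "with likelihood", round(float(likelihood[best]), 3))
```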
In part based tracking, parts may be occluded from video frame to video frame due to the part being blocked by another object, background distraction, etc. The present method effectively handles occlusions by removing a part from consideration in a particular video frame if the maximum tracking score in that part's confidence map is too small, that is, below a threshold, which is indicative of at least a partial occlusion.
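This occlusion test amounts to comparing the peak of each part's confidence map against a threshold and dropping parts whose peak is too low; the threshold, the part names, and the maps below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
OCCLUSION_THRESHOLD = 0.6  # assumed value

# Hypothetical confidence maps for three parts; the last is kept deliberately weak.
part_conf_maps = {name: rng.random((30, 30)) * top
                  for name, top in [("door", 1.0), ("wheel", 0.9), ("plate", 0.5)]}

# Keep only the parts whose peak confidence clears the occlusion threshold.
active_parts = [name for name, cmap in part_conf_maps.items()
                if cmap.max() >= OCCLUSION_THRESHOLD]
print("parts kept for this frame:", active_parts)
```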
The method and apparatus update the parts' locations relative to the target bounding box 130 if no parts are occluded and the matching score c(x) is above a predetermined threshold.
Claims
1. A method for detecting and tracking objects comprising:
- receiving a video sequence including a plurality of sequential video frames from a video camera;
- selecting a target image in a first video frame;
- segmenting the target image into at least one identifiable part in a part bounding box;
- training a detector for each of the at least one part in the first video frame;
- in a next video frame, sampling the position of a part bounding box in a particle filter framework;
- for each at least one part, generating a confidence map from a search region for the possible location of the at least one part by applying a plurality of sliding windows of polygonally arranged pixels, each sliding window centered around each pixel in the search region for the at least one part located in the first video frame; and
- generating a tracking score for each sliding window corresponding to a probability of the part being centered in each sliding window in the next video frame.
2. The method of claim 1 further comprising:
- the step of segmenting the target image includes segmenting the target image into a plurality of separate parts, each separate part contained in a separate part bounding box; and
- representing each of the plurality of parts as a vertex in a spanning tree.
3. The method of claim 2 further comprising:
- applying scale change to each sliding window for each part.
4. The method of claim 3 further comprising:
- varying a size of the sliding windows applied to each pixel and repeating the generating step for each sliding window in each scale change.
5. The method of claim 1 further comprising:
- in a next video frame, updating the confidence map for each part in each sampled target location by multiplying a detector output confidence map for each part by a Gaussian distribution centered at an ideal location of each part in each sampled target location.
6. The method of claim 5 further comprising:
- generating the detection score encoding an appearance similarity between the detected part and a reference part and a deformation factor for each individual part penalized from the ideal location.
7. The method of claim 1 further comprising:
- forming the detector using a detection algorithm based on a deep learning network.
8. The method of claim 1 further comprising:
- outputting, by the detector, a video frame tracking probability of a candidate region pixel by pixel in each of the video frames.
9. The method of claim 1 further comprising:
- defining a search region of at least one part in a next video frame as a rectangular area of a predetermined size centered on the position of the at least one part in a previous video frame.
10. The method of claim 1 further comprising:
- applying the detector with different scale sliding windows on the same video frame image and repeating the generation of the confidence map pixel by pixel for all scales.
11. The method of claim 10 further comprising:
- for each scale, applying the detector for the at least one part only to a search region centered on the location of the at least one part from a previous video frame.
12. The method of claim 11 further comprising:
- modeling a state of the target by motion parameters of position and scale of a rectangular bounding box.
13. The method of claim 12 further comprising:
- sampling the position and scale of the target bounding box in each video frame to obtain an image patch for the target.
14. The method of claim 13 further comprising:
- defining an ideal location of the at least one part in each video frame within a whole target image patch.
15. The method of claim 1 further comprising:
- defining a change in a position of the at least one part between two successive video frames as a function measuring a degree of conformity when the part is in a first location and when the part is in a second location.
16. The method of claim 1 further comprising:
- defining the tracking score as: c(x) = λm(x) + Σi mi(ti) + Σ(i,j) dij(ti, tj)
- where m(x) is the tracking score for the target bounding box and λ is a weighting factor between the target bounding box and the parts.
17. The method of claim 16 further comprising:
- generating the tracking score by defining a configuration of a part that maximizes a second term and a third term of a tracking score equation using dynamic programming.
18. The method of claim 17 further comprising:
- generating the tracking result for a sample having a maximum probability by p(yi|xi) = (1/Γ)exp{α c(xi)}, where Γ is a normalization factor, α is a tuning parameter, and yi is an observation.
19. The method of claim 1 further comprising:
- removing the at least one part from a tracking sequence when a maximum tracking score in a part confidence map for the at least one part is below a threshold.
Type: Application
Filed: Feb 14, 2014
Publication Date: Aug 20, 2015
Patent Grant number: 9607228
Applicant: Toyota Motor Engineering & Manufacturing North America, Inc. (Erlanger, KY)
Inventors: Xue Mei (Ann Arbor, MI), Danil Prokhorov (Canton, MI)
Application Number: 14/180,620