Abstract: A tracking system and method are suited to tracking multiple objects of different categories in a video sequence. A sequence of video frames is received, and a set of windows is extracted from each frame in turn based on a computed probability that the respective window contains an object, without reference to any specific category of object. A feature representation is extracted for each of these windows. A trained detector for a selected category detects the windows that constitute targets in that category, based on their respective feature representations. More than one detector can be used when there is more than one category of object to be tracked. A target-specific appearance model is generated for each target (e.g., learned, or updated if the target was present in a prior frame). The detected targets are tracked over one or more subsequent frames based on their target-specific appearance models.
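
The stages named in the abstract (category-agnostic window extraction, feature extraction, per-category detection, target-specific appearance models, tracking) can be illustrated with a minimal Python sketch. It is not the claimed implementation. Every name in it (`propose_windows`, `extract_features`, `LinearDetector`, `track`) is hypothetical, mean intensity is an assumed stand-in for the computed object probability, the features are a toy mean and standard deviation, and detection is run only on the first frame to keep the example short.

```python
import math
import statistics

def _patch(frame, window):
    """Flatten the pixels of a window given as (x, y, width, height)."""
    x, y, w, h = window
    return [frame[r][c] for r in range(y, y + h) for c in range(x, x + w)]

def propose_windows(frame, size=4, stride=2, top_k=5):
    """Return the top_k sliding windows ranked by a category-agnostic score."""
    height, width = len(frame), len(frame[0])
    scored = []
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            window = (x, y, size, size)
            # Assumption: mean intensity stands in for a real object probability.
            scored.append((statistics.fmean(_patch(frame, window)), window))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [window for _, window in scored[:top_k]]

def extract_features(frame, window):
    """Toy feature representation: mean and standard deviation of the patch."""
    pixels = _patch(frame, window)
    return (statistics.fmean(pixels), statistics.pstdev(pixels))

class LinearDetector:
    """Hypothetical trained detector for one category: score = w . f + b."""
    def __init__(self, weights, bias):
        self.weights = tuple(weights)
        self.bias = bias

    def is_target(self, feature):
        score = sum(w * f for w, f in zip(self.weights, feature)) + self.bias
        return score > 0.0

def track(frames, detectors, rate=0.2):
    """Detect targets in the first frame, then follow each one by appearance."""
    targets = []
    for t, frame in enumerate(frames):
        windows = propose_windows(frame)
        feats = [extract_features(frame, win) for win in windows]
        if t == 0:
            # One detector per category; each detection starts its own model.
            for category, detector in detectors.items():
                for win, feat in zip(windows, feats):
                    if detector.is_target(feat):
                        targets.append(
                            {"category": category, "model": feat, "path": [win]}
                        )
            continue
        for target in targets:
            # Pick the proposal closest to this target's appearance model.
            dists = [math.dist(feat, target["model"]) for feat in feats]
            best = dists.index(min(dists))
            target["path"].append(windows[best])
            # Update the target-specific model with a running average.
            target["model"] = tuple(
                (1 - rate) * m + rate * f
                for m, f in zip(target["model"], feats[best])
            )
    return targets
```

Passing a dictionary with one detector per category, for example `track(frames, {"bright": LinearDetector([1.0, 0.0], -0.9)})`, returns one record per detected target holding its category, its current appearance model, and the window chosen for it in each frame. A fuller version would rerun the detectors on every frame so that new targets can enter and existing models can be updated from fresh detections.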