Abstract: Apparatus and methods for identifying potentially similar content use workflow metadata to identify potential similarities among items of content to be processed, or between content to be processed and known content. The result is a subset of potentially similar content, which can be used in data reduction operations to reduce the data in the content to be processed.
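
The idea can be illustrated with a minimal sketch, which is not the claimed implementation. It assumes each content item carries a dictionary of workflow metadata; the field names `project` and `stage`, the function names, and the use of fixed-size chunk hashing as the data reduction operation are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import hashlib
from collections import defaultdict


def candidate_subsets(items, keys=("project", "stage")):
    """Group item names that share the same workflow metadata values.

    `items` is a list of (name, metadata_dict, data_bytes) tuples.
    The metadata keys are hypothetical examples, not taken from the source.
    """
    groups = defaultdict(list)
    for name, meta, _data in items:
        groups[tuple(meta.get(k) for k in keys)].append(name)
    # Only groups with more than one member are worth comparing.
    return [names for names in groups.values() if len(names) > 1]


def deduplicate(items, subset, chunk_size=4):
    """Store each distinct fixed-size chunk once for the items in `subset`.

    Returns (store, recipes): `store` maps a chunk digest to its bytes,
    and `recipes` maps an item name to the ordered list of its chunk digests.
    """
    store = {}
    recipes = {}
    for name, _meta, data in items:
        if name not in subset:
            continue
        digests = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep the first copy only
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


items = [
    ("a.doc", {"project": "p1", "stage": "draft"}, b"AAAABBBBCCCC"),
    ("b.doc", {"project": "p1", "stage": "draft"}, b"AAAABBBBDDDD"),
    ("c.doc", {"project": "p2", "stage": "final"}, b"ZZZZ"),
]

subsets = candidate_subsets(items)
store, recipes = deduplicate(items, subsets[0])
```

In this sketch the metadata narrows the comparison to `a.doc` and `b.doc`, so the comparatively expensive chunk hashing runs only on items that are likely to overlap; `c.doc` is never examined. Each item can be rebuilt by concatenating the stored chunks listed in its recipe.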