Abstract: Described are computer-implemented methods and computing systems for automatically deduplicating a target dataset relative to a baseline dataset. A first dataset is analyzed in a distributed manner to automatically generate the baseline dataset from the most common blocks of the first dataset. The analysis is conducted in a distributed computing environment comprising a master computer system connected via a computer network to a plurality of computer systems.
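
The flow summarized above can be illustrated with a minimal sketch. It is not the claimed implementation; it rests on several assumptions that do not come from the abstract: blocks are fixed-size (`BLOCK_SIZE`), blocks are identified by SHA-256 digest, the baseline keeps a hypothetical `top_k` most common blocks, and the plurality of computer systems is simulated by a sequential loop over partitions rather than by real networked machines. All function names are illustrative.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4  # assumption: fixed-size blocks; the abstract does not specify a block scheme


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut data into consecutive blocks of `size` bytes (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(block: bytes) -> str:
    """Identify a block by its SHA-256 digest (assumed fingerprint function)."""
    return hashlib.sha256(block).hexdigest()


def count_partition(partition: bytes) -> Counter:
    """Work for one simulated computer system: count block digests in its partition."""
    return Counter(digest(b) for b in split_blocks(partition))


def build_baseline(partitions: list, top_k: int) -> dict:
    """Simulated master step: merge per-partition counts and keep the top_k blocks.

    Partitions are assumed to be aligned to block boundaries.
    """
    totals = Counter()
    seen = {}
    for part in partitions:  # in a real deployment each partition would run remotely
        totals.update(count_partition(part))
        for b in split_blocks(part):
            seen.setdefault(digest(b), b)
    return {h: seen[h] for h, _ in totals.most_common(top_k)}


def deduplicate(target: bytes, baseline: dict) -> list:
    """Replace target blocks found in the baseline with ("ref", digest) entries."""
    encoded = []
    for b in split_blocks(target):
        h = digest(b)
        encoded.append(("ref", h) if h in baseline else ("raw", b))
    return encoded


def restore(encoded: list, baseline: dict) -> bytes:
    """Rebuild the original target from the encoded form and the baseline."""
    return b"".join(baseline[v] if kind == "ref" else v for kind, v in encoded)
```

For example, calling `build_baseline([b"AAAABBBBAAAA", b"AAAACCCCBBBB"], top_k=2)` keeps the blocks `AAAA` and `BBBB`, so a target of `b"AAAADDDDBBBB"` would be encoded as two references and one raw block.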