Abstract: A method is provided for preventing dark data in a data set. At a time t1, a first version of the data set is received. The first version is analyzed and its parameters are gathered into a first statistical profile, which is stored. At a time t2, a second version of the data set is received. The second version is analyzed and its parameters are gathered into a second statistical profile, which is stored. The first and second statistical profiles are compared and a similarity index is created. If the similarity index indicates that the two profiles differ by more than a pre-set threshold, the dissimilarity is flagged and a responsive action is taken.
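
The profile-and-compare flow described in the abstract could be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method: the abstract does not say which parameters are profiled or how the index is computed. The names `StatisticalProfile`, `build_profile`, `similarity_index`, and `check_versions`, the four chosen parameters (count, null fraction, mean, standard deviation), the in-memory store, and the 0.2 default threshold are all hypothetical. The sketch also assumes the index behaves as a divergence score, where a larger value means the two versions differ more, so that exceeding the threshold is what flags dissimilarity.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class StatisticalProfile:
    """Hypothetical profile; the abstract does not list the parameters."""
    count: int
    null_fraction: float
    mean: float
    stdev: float


def build_profile(values: Sequence[Optional[float]]) -> StatisticalProfile:
    """Analyze one version of a (single-column) data set."""
    present = [v for v in values if v is not None]
    return StatisticalProfile(
        count=len(values),
        null_fraction=(1 - len(present) / len(values)) if values else 0.0,
        mean=mean(present) if present else 0.0,
        stdev=pstdev(present) if present else 0.0,
    )


def _relative_change(a: float, b: float) -> float:
    """Absolute difference scaled into [0, 1] by the larger magnitude."""
    scale = max(abs(a), abs(b))
    return abs(a - b) / scale if scale else 0.0


def similarity_index(p1: StatisticalProfile, p2: StatisticalProfile) -> float:
    """Assumed index: the largest per-parameter change (0.0 = identical)."""
    return max(
        _relative_change(p1.count, p2.count),
        abs(p1.null_fraction - p2.null_fraction),
        _relative_change(p1.mean, p2.mean),
        _relative_change(p1.stdev, p2.stdev),
    )


def check_versions(
    version_t1: Sequence[Optional[float]],
    version_t2: Sequence[Optional[float]],
    threshold: float = 0.2,  # assumed pre-set threshold
) -> Tuple[float, bool]:
    """Profile both versions, store the profiles, compare, and flag."""
    store: Dict[str, StatisticalProfile] = {}  # stand-in for persistent storage
    store["t1"] = build_profile(version_t1)
    store["t2"] = build_profile(version_t2)
    index = similarity_index(store["t1"], store["t2"])
    return index, index > threshold


if __name__ == "__main__":
    index, flagged = check_versions([1.0, 2.0, 3.0, 4.0], [1.0, None, None, 4.0])
    if flagged:
        # Placeholder for the responsive action, which the abstract leaves open.
        print(f"dissimilarity flagged (index={index:.2f})")
```

In this sketch, half of the values going missing between t1 and t2 raises the null fraction from 0 to 0.5, which pushes the index over the assumed threshold. A real system would likely profile many columns and persist the profiles between runs.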