Abstract: Methods and systems for identifying sensitive data (SD) stored on data repositories are disclosed. The data is processed to calculate a plurality of float feature (FF) vectors associated with the data. The FF vectors are clustered into a plurality of clusters, each cluster associated with a respective subset of the data. For each cluster, a DNA vector representative of that cluster is generated. The DNA vectors of the respective clusters are compared to one or more FF vectors calculated for one or more respective user-supplied examples of SD. One or more clusters are classified as SD based on the result of the comparison, thereby identifying the respective subsets of data as SD.
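The pipeline in this abstract (cluster the FF vectors, derive one DNA vector per cluster, compare it to the example vectors) can be illustrated with a minimal sketch. The abstract does not name a clustering algorithm, a way of building the DNA vector, or a comparison metric, so everything here is an assumption made for illustration: greedy leader clustering, the cluster mean as the DNA vector, and a Euclidean distance threshold. The function names and the `radius` and `match_distance` parameters are hypothetical.

```python
import math


def cluster_vectors(vectors, radius):
    """Greedy leader clustering (an assumed stand-in for the unspecified
    clustering step): a vector joins the first cluster whose leader lies
    within `radius`, otherwise it starts a new cluster."""
    leaders, clusters = [], []
    for idx, vec in enumerate(vectors):
        for c, leader in enumerate(leaders):
            if math.dist(vec, leader) <= radius:
                clusters[c].append(idx)
                break
        else:
            leaders.append(vec)
            clusters.append([idx])
    return clusters


def dna_vector(vectors, members):
    """Assumed representative of a cluster: the component-wise mean."""
    dim = len(vectors[members[0]])
    return [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]


def classify_sensitive(vectors, examples, radius, match_distance):
    """Return sorted indices of data items whose cluster's DNA vector lies
    within `match_distance` of at least one user-supplied example vector."""
    sensitive = []
    for members in cluster_vectors(vectors, radius):
        dna = dna_vector(vectors, members)
        if any(math.dist(dna, ex) <= match_distance for ex in examples):
            sensitive.extend(members)
    return sorted(sensitive)


if __name__ == "__main__":
    ff = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    print(classify_sensitive(ff, examples=[[5.0, 5.1]], radius=1.0, match_distance=0.5))
```

Comparing the examples against one DNA vector per cluster, rather than against every FF vector, keeps the number of comparisons proportional to the number of clusters.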
Abstract: A method of clustering files comprises, by a processing unit: obtaining a clustering structure comprising a plurality of nodes arranged in hierarchical levels Li, with i from 1 to N; obtaining at least one data (Dsignal) representative of a file (Dfile) to be assigned to a category; (O1) comparing Dsignal to each centroid of each node of the first level; (O2) if said comparison meets an acceptance threshold of one or more nodes, selecting a node among these nodes; (O3) comparing Dsignal to each centroid of each node of a next level which is linked to said selected node; (O4) if said comparison meets an acceptance threshold of one or more nodes, selecting a node among these nodes; and repeating O3 and O4 until a stopping condition is met, thereby indicating that Dsignal or Draw belongs to a category of files represented by said selected node.
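Steps O1 to O4 describe a top-down descent through the node hierarchy, which the following sketch illustrates. Several details are not given in the abstract and are assumed here: Dsignal is a numeric vector, the comparison is a Euclidean distance, a node accepts when that distance does not exceed its threshold, the closest accepting node is the one selected, and the stopping condition is "no node accepts, or the selected node has no linked nodes". The `Node` class and the function names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    centroid: list
    threshold: float                               # acceptance threshold of this node
    children: list = field(default_factory=list)   # linked nodes of the next level


def select_node(candidates, signal):
    """O1/O3 + O2/O4: compare the signal to every candidate centroid and pick
    the closest node whose acceptance threshold is met (None if no node accepts)."""
    accepted = []
    for node in candidates:
        d = math.dist(signal, node.centroid)
        if d <= node.threshold:
            accepted.append((d, node))
    if not accepted:
        return None
    return min(accepted, key=lambda pair: pair[0])[1]


def assign_category(first_level, signal):
    """Descend level by level until no node accepts or a leaf is reached.
    Returns the last selected node, or None if the first level rejects the signal."""
    selected = None
    candidates = first_level
    while candidates:
        best = select_node(candidates, signal)
        if best is None:
            break
        selected = best
        candidates = best.children
    return selected


if __name__ == "__main__":
    docs = Node("documents", [0.0, 0.0], 5.0, children=[
        Node("pdf", [1.0, 1.0], 1.5),
        Node("docx", [-1.0, -1.0], 1.5),
    ])
    images = Node("images", [10.0, 10.0], 5.0)
    print(assign_category([docs, images], [0.8, 1.2]).name)
```

Because only the nodes linked to the selected node are compared at each level, the cost of an assignment depends on the branching along one path rather than on the total number of nodes.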
Abstract: Generating a data object identifier by dividing the data in the data object into a plurality of chunks; processing each chunk using a clustering algorithm to generate, for each chunk, a pair of values characterizing the data in the chunk, thereby giving rise to a plurality of pairs of values (PoV); generating a plurality of nodes in a two dimensional space each corresponding to a respective PoV, wherein, for any given PoV, the values in the given PoV are indicative of location coordinates of the corresponding node in the two dimensional space; generating a plurality of features related to the plurality of nodes, each feature characterizing a spatial relationship between three or more nodes; and generating the data object identifier by arranging the features in a feature vector in accordance with predetermined rules.
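The identifier construction above can be sketched end to end under explicit assumptions, none of which come from the abstract: chunks are fixed-size byte slices, the per-chunk clustering is a one-dimensional 2-means over byte values whose two centroids form the pair of values, each feature is the area of the triangle spanned by three consecutive nodes, and the "predetermined rule" is simply chunk order. All function names and the `chunk_size` parameter are hypothetical.

```python
def two_means(values, iterations=10):
    """Assumed clustering step: 1-D 2-means over the values of a chunk.
    Returns the (low, high) centroids, used as the chunk's pair of values."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_group = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low_group) / len(low_group) if low_group else lo
        new_hi = sum(high_group) / len(high_group) if high_group else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi


def triangle_area(p, q, r):
    """Assumed spatial feature over three nodes: area of the triangle p, q, r."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2


def object_identifier(data, chunk_size):
    """Chunk the data, map each chunk to a 2-D node, and arrange the
    triangle-area features of consecutive node triples in chunk order."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    nodes = [two_means(list(chunk)) for chunk in chunks]
    features = [triangle_area(nodes[i], nodes[i + 1], nodes[i + 2])
                for i in range(len(nodes) - 2)]
    return tuple(features)


if __name__ == "__main__":
    data = bytes([0, 0, 10, 10, 0, 0, 20, 20, 4, 4, 4, 4])
    print(object_identifier(data, chunk_size=4))
```

With this choice of feature, data shorter than three chunks yields an empty identifier, so a practical variant would need a rule for short objects; the abstract does not say how that case is handled.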