Abstract: Approaches for referential sampling of disparate datasets. An execution mode and a sampling mode are determined for each entity in a plurality of disparate datasets. A directed acyclic graph (DAG) is created for each entity in the plurality of disparate datasets. The DAG is topologically sorted, and one or more sampled datasets are retrieved from the plurality of disparate datasets using the topologically sorted DAG. Advantageously, the one or more sampled datasets form a consistent sample that honors all referential constraints in the plurality of disparate datasets.
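The sampling flow described in this abstract can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes each entity is an in-memory table (a list of row dictionaries), that referential constraints are given as foreign-key triples, and that "sampling" a root entity simply means taking its first few rows. The names `referential_sample`, `foreign_keys`, and `root_limit` are hypothetical, and the execution mode and sampling mode named in the abstract are not modeled here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def referential_sample(tables, foreign_keys, root_limit):
    """Sample parent entities first, then keep only child rows that reference them.

    tables:       {entity_name: [row_dict, ...]}
    foreign_keys: {child_entity: [(child_column, parent_entity, parent_column), ...]}
    root_limit:   number of rows kept from entities without foreign keys (assumption)
    """
    # Build the dependency graph: each entity maps to the entities it references.
    deps = {
        name: {parent for _, parent, _ in foreign_keys.get(name, [])}
        for name in tables
    }
    # static_order() yields every entity after all of its predecessors (parents).
    order = list(TopologicalSorter(deps).static_order())

    sampled = {}
    for name in order:
        fks = foreign_keys.get(name, [])
        if not fks:
            # Root entity: take a simple prefix as the sample.
            sampled[name] = tables[name][:root_limit]
            continue
        # Collect the key values that survived sampling in each parent.
        allowed = [
            (col, {row[pcol] for row in sampled[parent]})
            for col, parent, pcol in fks
        ]
        # Keep a child row only if every foreign key points at a sampled parent row.
        sampled[name] = [
            row for row in tables[name]
            if all(row[col] in keys for col, keys in allowed)
        ]
    return order, sampled


tables = {
    "customers": [{"id": 1}, {"id": 2}, {"id": 3}],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 3},
        {"id": 12, "customer_id": 2},
    ],
    "items": [
        {"id": 100, "order_id": 10},
        {"id": 101, "order_id": 11},
        {"id": 102, "order_id": 12},
    ],
}
foreign_keys = {
    "orders": [("customer_id", "customers", "id")],
    "items": [("order_id", "orders", "id")],
}
order, sampled = referential_sample(tables, foreign_keys, root_limit=2)
```

Because parents are always processed before their children, no sampled row can reference a row that was left out, which is the consistency property the abstract describes.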
Abstract: Approaches for reducing a storage footprint for one or more files. A file type associated with a digital file is determined. A deduplication process is performed on the digital file based, at least in part, on the determined file type. The deduplication process may be performed differently on the digital file based on whether the digital file is an image or audio file, a compressed file, or a columnar file, for example. By considering the type of file being deduplicated, enhanced reductions in the storage footprint of the digital file may be realized.
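As a rough illustration of type-dependent deduplication, the sketch below detects a file type from its leading bytes and picks a chunking strategy accordingly. The abstract does not specify how each type is handled, so the strategies here are assumptions made only for illustration: compressed files are decompressed before chunking, image and audio files are hashed as a whole, and columnar and other files fall back to fixed-size chunks. The names `detect_file_type`, `chunks_for`, `deduplicate`, and the in-memory `store` are hypothetical.

```python
import gzip
import hashlib


def detect_file_type(data: bytes) -> str:
    """Classify a file by well-known magic bytes (gzip, PNG, WAV, Parquet)."""
    if data[:2] == b"\x1f\x8b":
        return "compressed"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "media"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "media"
    if data[:4] == b"PAR1":
        return "columnar"
    return "generic"


def chunks_for(data: bytes, file_type: str, chunk_size: int):
    """Choose a chunking strategy per file type (all strategies are assumptions)."""
    if file_type == "compressed":
        # Deduplicate the underlying content, not the compressed byte stream.
        data = gzip.decompress(data)
    if file_type == "media":
        # Treat image/audio files as a single unit.
        return [data]
    # Columnar and generic files: plain fixed-size chunks in this sketch.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def deduplicate(data: bytes, store: dict, chunk_size: int = 4096):
    """Store only unseen chunks; return the type, chunk recipe, and new bytes stored."""
    file_type = detect_file_type(data)
    recipe, new_bytes = [], 0
    for chunk in chunks_for(data, file_type, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
            new_bytes += len(chunk)
        recipe.append(digest)
    return file_type, recipe, new_bytes


store = {}
payload = b"abcd" * 8
plain = deduplicate(payload, store, chunk_size=4)
zipped = deduplicate(gzip.compress(payload), store, chunk_size=4)
image = deduplicate(b"\x89PNG\r\n\x1a\n" + b"x" * 10, store, chunk_size=4)
```

In this example the gzip copy of `payload` adds no new bytes to the store, because it is decompressed before chunking and its chunks match those already stored from the plain copy. A type-agnostic deduplicator hashing the compressed bytes would not find that overlap.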