Abstract: The present disclosure provides a high-performance data lake system and a data storage method. The data storage method includes the following steps: S1: converting a file into a file stream; S2: converting the file stream into an array in which multiple subarrays are nested; and S3: converting the array into a resilient distributed dataset (RDD), and storing the RDD to a storage layer of a data lake. The present disclosure provides a nested field structure, which lays the foundation for parallel processing in reading, and effectively improves read performance. Furthermore, the present disclosure flexibly generates a number of nested subarrays according to hardware cores, such that the data lake achieves better extension performance, and can keep optimal writing efficiency for different users.
Type:
Grant
Filed:
November 17, 2022
Date of Patent:
October 17, 2023
Assignees:
Nanhu Laboratory, Advanced Institute of Big Data, Beijing
Abstract: The present disclosure provides a high-performance data lake system and a data storage method. The data storage method includes the following steps: S1: converting a file into a file stream; S2: converting the file stream into an array in which multiple subarrays are nested; and S3: converting the array into a resilient distributed dataset (RDD), and storing the RDD to a storage layer of a data lake. The present disclosure provides a nested field structure, which lays the foundation for parallel processing in reading, and effectively improves read performance. Furthermore, the present disclosure flexibly generates a number of nested subarrays according to hardware cores, such that the data lake achieves better extension performance, and can keep optimal writing efficiency for different users.
Type:
Application
Filed:
November 17, 2022
Publication date:
May 18, 2023
Applicants:
Nanhu Laboratory, Advanced Institute of Big Data, Beijing