Definition
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of data, and transformed data. A data lake can include structured data from relational databases, semi-structured data (CSV, logs), unstructured data (emails, documents) and binary data (images, audio, video). A data lake can be established “on premises” or “in the cloud”.
– Wikipedia
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics.
– Amazon Web Services (AWS)
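To make the "store as-is, structure later" idea in the definitions above more concrete, here is a minimal sketch. It assumes an in-memory dictionary as a stand-in for an object store; the names `lake`, `put_object`, `read_csv_rows`, and `read_json_lines`, the key layout (`raw/`, `curated/`), and the sample data are all hypothetical and do not correspond to any real service's API.

```python
import csv
import io
import json

# Hypothetical stand-in for an object store: keys map to raw bytes.
lake: dict[str, bytes] = {}


def put_object(key: str, body: bytes) -> None:
    """Store the payload exactly as received, with no parsing or schema."""
    lake[key] = body


def read_csv_rows(key: str) -> list[dict[str, str]]:
    """Interpret a stored object as CSV only at read time."""
    text = lake[key].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def read_json_lines(key: str) -> list[dict]:
    """Interpret a stored object as newline-delimited JSON at read time."""
    text = lake[key].decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Raw copies land under a "raw/" prefix in their original formats.
put_object("raw/sales/2024-01.csv", b"order_id,amount\n1,19.90\n2,5.00\n")
put_object(
    "raw/logs/app.jsonl",
    b'{"level": "info", "msg": "started"}\n{"level": "error", "msg": "failed"}\n',
)
put_object("raw/images/logo.png", b"\x89PNG\r\n\x1a\n")  # binary data stays opaque

# Analytics read the raw objects and impose structure on demand.
total = sum(float(row["amount"]) for row in read_csv_rows("raw/sales/2024-01.csv"))
errors = [e for e in read_json_lines("raw/logs/app.jsonl") if e["level"] == "error"]

# Transformed data can be written back alongside the raw copies.
put_object(
    "curated/sales/total.json",
    json.dumps({"total": round(total, 2)}).encode("utf-8"),
)
```

In a real deployment the dictionary would be replaced by an object store such as the ones listed under Tools; the sketch only illustrates that raw and transformed data can live side by side in one repository.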
Tools
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
- Amazon S3 is an object storage service built to store and retrieve any amount of data from anywhere.
- MinIO is a high-performance, S3-compatible object store.
Articles
- Wilson, Christy: Testing the Waters: How to Get a Hadoop Data Lake Set Up Right. Precisely Blog, 2016-11-18.
- Heinrich, Gordon: Build a Data Lake Foundation with AWS Glue and Amazon S3. AWS Big Data Blog, 2017-10-27.
- Seifert, Victor: How to build a data lake from scratch – Part 1: The setup. Towards Data Science, 2021-11-18.
- Seifert, Victor: How to build a data lake from scratch – Part 2: Connecting the components. Towards Data Science, 2022-01-23.
- BasuMallick, Chiradeep: What Is MapReduce? Meaning, Working, Features, and Uses. Spiceworks, 2022-08-29.