Definíció

A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of data, and transformed data. A data lake can include structured data from relational databases, semi-structured data (CSV, logs), unstructured data (emails, documents) and binary data (images, audio, video). A data lake can be established “on premises” or “in the cloud”.
Wikipedia

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics.
Amazon AWS

Eszközök

  • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
  • Amazon S3 is an object storage built to retrieve any amount of data from anywhere.
  • MinIO is a high-performance, S3 compatible object store.

Cikkek