What are the risks of dumping data into a data lake?

Asked 2 years ago

Good afternoon, everyone. We are thinking about setting up a data pool within our company that is accessible to all employees, and it seems like a good idea. But I know that anything with pros has cons as well. So what are the risks of a data pool?

Dallas Duncan

Sunday, March 27, 2022

Here are the two main risks associated with dumping your data into a data lake:

  1. If the data sources have distinct distributions or biases, you need to decide how to merge them. You'll also have to compare the merged distribution to the one you see in testing and production.
  2. Combining the data sources is error-prone, since it is easy to join them incorrectly. Some features may also require transformations that are specific to each data source.
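To make both points concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for illustration: the two sources, their values, the cents-to-dollars conversion, and the helper names `to_dollars` and `standardized_mean_gap` are all hypothetical. It applies a source-specific transformation and then checks how far apart the two distributions are before they are merged.

```python
from statistics import mean, stdev

# Hypothetical feature values (order amounts) from two sources.
# Assumption: source A stores dollars, source B stores cents.
source_a = [12.0, 15.5, 9.9, 20.0, 14.2]
source_b_cents = [1100, 1650, 980, 2100, 1390]


def to_dollars(values_in_cents):
    """Source-specific transformation: convert cents to dollars."""
    return [v / 100 for v in values_in_cents]


def standardized_mean_gap(a, b):
    """Absolute difference in means, in units of the pooled standard deviation."""
    pooled = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    return abs(mean(a) - mean(b)) / pooled


# Merging without the transformation: the two sources look wildly different.
gap_raw = standardized_mean_gap(source_a, source_b_cents)

# Merging after the transformation: the distributions line up closely.
source_b = to_dollars(source_b_cents)
gap_fixed = standardized_mean_gap(source_a, source_b)

print(f"gap before transformation: {gap_raw:.2f}")
print(f"gap after transformation:  {gap_fixed:.2f}")

# Only combine the sources once the gap is acceptably small.
merged = source_a + source_b
```

A large gap before merging is a hint that a source needs its own transformation, or that it is biased in a way you have to account for. The same kind of check can be repeated against the values you see in testing and production.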

Reid Hardin

Sunday, July 24, 2022

Data pool vs. data lake: A data pool is isolated and independent, so it is less complex than a data lake. A data lake, in contrast, holds many data pools from the same organization.

The biggest risk of dumping data into a data lake is that it turns into a data swamp, where data is hard to find and hard to trust. Additionally, combining the data sources makes the lake prone to errors. You can prevent this by organizing the data and tagging it with appropriate metadata.
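As a rough illustration of the metadata idea, here is a small Python sketch of a catalog that refuses to register a dataset unless it comes with basic metadata. It is not any particular product's API: the `DatasetRecord` and `Catalog` classes, the required fields, and the example path are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata that travels with every dataset in the lake (hypothetical fields)."""
    path: str
    owner: str
    source_system: str
    description: str
    ingested_on: date
    tags: tuple = ()


class Catalog:
    """In-memory catalog that rejects datasets with missing metadata."""

    REQUIRED = ("owner", "source_system", "description")

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Refuse undocumented data instead of letting it pile up unlabelled.
        missing = [name for name in self.REQUIRED
                   if not getattr(record, name).strip()]
        if missing:
            raise ValueError(f"{record.path}: missing metadata {missing}")
        self._records[record.path] = record

    def find_by_tag(self, tag):
        """Return the paths of all registered datasets carrying the given tag."""
        return [r.path for r in self._records.values() if tag in r.tags]


catalog = Catalog()
catalog.register(DatasetRecord(
    path="raw/crm/orders.csv",
    owner="sales-analytics",
    source_system="crm",
    description="Daily export of customer orders",
    ingested_on=date(2022, 7, 24),
    tags=("sales", "orders"),
))
print(catalog.find_by_tag("sales"))
```

The point of the design is that the check happens at ingestion time: a file with no owner, source, or description never makes it into the lake, so everything that is there can be found and traced back to where it came from.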

