How To Get Started With Data Lakes

ODSC - Open Data Science
May 14, 2019

Data has to be stored somewhere. Data warehouses are repositories for your cleaned, processed data, but what about all that unstructured data your organization is starting to notice? Where does it go?

Data Lakes are the newest old thing on the block, so to speak. The concept of storing data has been around forever, but this iteration lets businesses expand what data they can house and how they access it. Let’s examine the concept and see how it could be used to your business’s advantage.

What Are Data Lakes?

Data comes in many different forms. Structured data that’s been cleaned and labeled is fit for easy consumption. No worrying about “garbage in, garbage out.” No need to devote hours of human power trying to make sense of it all. Instead, it’s accessible and ready for training sets or simple insights.

Data warehouses store ready-made data designed to match your operational teams. The data is processed and ready for insights. There’s no need for complicated programming or multi-level deployment. Instead, insights are readily available.

For your data science team, however, this data may not be enough. Data Lakes don’t discriminate. All data flows in whether it’s scrubbed or still unstructured and in raw form. Data scientists can comb Data Lakes for new information not presented in any reports, and all data is kept because it could be useful someday.
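One way to picture "all data flows in" is to keep records exactly as they arrive and impose structure only when someone reads them. The snippet below is a minimal sketch of that idea, not a description of any particular product; the sample events and the `read_with_schema` helper are hypothetical names made up for illustration.

```python
import json

# Hypothetical raw events, kept exactly as they arrived.
# Fields vary from record to record, and one line is not even valid JSON.
RAW_EVENTS = [
    '{"user": "a1", "action": "click", "page": "/home"}',
    '{"user": "b2", "action": "purchase", "amount": 19.99}',
    'not valid json',
]


def read_with_schema(raw_lines, fields):
    """Apply structure at read time: keep only the requested fields."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip records this reader cannot parse; the raw line stays in the lake.
            continue
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of causing a failure.
        rows.append({name: record.get(name) for name in fields})
    return rows


rows = read_with_schema(RAW_EVENTS, ["user", "amount"])
```

Because nothing is discarded on the way in, a different team could later read the same raw lines with a different field list.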

Do You Need A Data Lake?

If your information is mostly structured, a Data Lake isn’t going to be worth it. Data warehouses or even simple databases are all you need to keep your information secure, available to the right people, and organized for future insight. If you’ve already got a well-established data warehouse, you don’t need to scrap your entire ecosystem and start fresh.

Data Lakes are suitable for organizations just beginning to delve into the different types of data available now. A Data Lake can handle unstructured data without a substantial commitment from your team to scrub and label what might be impossible to clean completely. It can also handle more complex inquiries and is more flexible overall.

How To Design A Data Lake

The fact that Data Lakes store all kinds of structured and unstructured data indiscriminately doesn’t mean you can throw all your data in there without thinking it through. To have the best chance of success, you need to consider a few things.

What’s the Reason for your Data Lake?

Ideally, you would load all your data into a perfect system and every answer would reveal itself. In practice, focus on the reason you’re initiating the Data Lake. Building an initial use case for your business gives direction to how you organize the lake.

An actual business problem increases focus. Instead of turning the data lake into one huge experiment (or playground), building on a specific situation or question helps develop the lake faster and more efficiently, giving management answers and teams clarity.

Who Will Manage Your Data Lake?

This is a multilevel question. Data Lakes can be unwieldy, and you can’t just assign them to anyone with database familiarity. You need a logical plan for either hiring the people you need or training the people you have.

No single platform can manage a data lake on its own. Instead, multiple types of technology are required to operate the various aspects and extract maximum value. Be sure you know what you’re getting into and that your team is ready to handle not just the information but the infrastructure.

The Data Lake will also need to integrate with your existing infrastructure, so a manager is critical. A proper supervisor can handle these integrations and make decisions about infrastructure best practices so that your team experiences minimal obstacles getting up and running and minimal frustration retrieving your data.

You must also control who uploads data through a governance policy. Without a curator, a data lake becomes a data swamp, messy and impossible to navigate. This person (or team) controls redundancy, labels, and metadata so that when teams have inquiries, they can find the data they need.
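As a rough illustration of what a curator enforces, the sketch below rejects uploads that arrive without metadata or that duplicate an existing path. It is an in-memory stand-in under stated assumptions, not a real catalog product; `CatalogEntry`, `Catalog`, the required fields, and the sample path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Metadata a curator might require before a dataset enters the lake (assumed fields)."""
    path: str
    owner: str
    description: str
    tags: List[str] = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a metadata catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Reject uploads that lack the metadata needed to find them later.
        if not entry.owner or not entry.description or not entry.tags:
            raise ValueError(f"{entry.path}: owner, description, and tags are required")
        # Reject redundant registrations of a path that is already catalogued.
        if entry.path in self._entries:
            raise ValueError(f"{entry.path}: already registered")
        self._entries[entry.path] = entry

    def search(self, tag: str) -> List[str]:
        # Return the paths of every dataset labeled with the given tag, sorted.
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalog = Catalog()
catalog.register(
    CatalogEntry(
        path="raw/clicks/2019-05-14.json",
        owner="web-team",
        description="Raw clickstream events",
        tags=["clickstream", "raw"],
    )
)
```

Refusing unlabeled uploads at the door is one simple policy; a real governance process would likely add review steps and retention rules on top.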

How Do You Secure Your Data?

Because Data Lakes are inherently more complicated than a simple database, neglecting security could leave you open to attack. Getting a team in place to manage the Data Lake helps ensure permissions are configured correctly and your data isn’t too readable to be secure.

Build a logical system of both user authentication and user authorization, with different permission tiers that let your teams work without leaving you vulnerable. You’ll also need encryption policies for your data both when it’s at rest and while your team is actively working with it.
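Permission tiers can be sketched as an ordered set of levels checked on every request. The example below covers authorization only and assumes the user has already been authenticated elsewhere; the tier names, zone names, and user assignments are hypothetical, and the rule that higher tiers include lower ones is a design assumption rather than a requirement.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical permission tiers, ordered from least to most access."""
    READ_CURATED = 1
    READ_RAW = 2
    WRITE = 3


# Assumed mapping from lake zones to the minimum tier needed to read them.
ZONE_READ_TIER = {"curated": Tier.READ_CURATED, "raw": Tier.READ_RAW}

# Assumed user-to-tier assignments; in practice these would come from an identity system.
USER_TIERS = {
    "analyst": Tier.READ_CURATED,
    "data_scientist": Tier.READ_RAW,
    "ingest_job": Tier.WRITE,
}


def can_read(user: str, zone: str) -> bool:
    """Allow a read only when the user's tier meets the zone's minimum."""
    tier = USER_TIERS.get(user)
    required = ZONE_READ_TIER.get(zone)
    if tier is None or required is None:
        # Unknown users and unknown zones are denied by default.
        return False
    return tier >= required


def can_write(user: str) -> bool:
    """Only the top tier may upload data."""
    return USER_TIERS.get(user) == Tier.WRITE
```

Denying by default for unknown users or zones keeps a misconfiguration from silently exposing data.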

Other Considerations

Your Data Lake needs to evolve with your needs as your organization grows larger and stores more information. Data Lakes should always be scalable, so when the time comes to uplevel operations, you aren’t wading through difficult repositories.

Consider also the full data management lifecycle as you get started. The best approach is to begin with the data you already manage. As you build your pipeline, you gain confidence in your foundations. Afterward, you can begin amassing your unstructured data, knowing that your baseline is stable.

Getting Started With A Data Lake

The entire concept is built with flexibility in mind, but putting in the work ahead of time to develop the right foundation will ensure that your team can retrieve information and work with the Lake in a way that provides business insight. It’s an adaptation designed to handle our complicated relationship with different types of data and help us find insights where we need them.
