Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake

ODSC - Open Data Science
May 8, 2024

Have you ever experienced a surge in your cloud data lake expenses? Does the surge indicate malicious activity or a legitimate operation? Data lakes have become a cornerstone of the digital age, prized for their flexibility and cost-effectiveness. Yet as they expand, they bring challenges in security, access control, cost management, and monitoring. The stakes are high: unauthorized access can lead to data breaches, while even legitimate users can inadvertently drive up costs.

With the growth in usage comes far more complexity. The volume of data and the number of objects are growing rapidly. A growing number of users, both human and application, perform constant operations on the data lake. The sheer number of operations makes access and cost control a hard and ongoing task. Monitoring is also complex, since there are many access paths and all of them need to be monitored.

Attackers can also take advantage of the many access options to the data lake. They can use the advanced functionality of object stores and query engines for reconnaissance and to effectively traverse, locate, and track sensitive data.

Figure: Data lake access

Traditional monitoring methods often fall short. Tracking object store access can be overwhelming, with a single query generating thousands of log records. Monitoring at the query engine level demands a unique solution for each engine, adding complexity.

We suggest a two-tiered approach to deal with these issues.

The first tier is to adopt best practices, such as:

  • Using roles instead of keys
  • Using unique credentials and not sharing them between users and services
  • Using tailored instead of broad access permissions
  • Applying lifecycle management, query size limitations, alerts and other general rules
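As a rough illustration of the "tailored instead of broad permissions" practice, the sketch below scans a policy document for statements that grant wildcard access. It assumes an AWS IAM-style JSON policy layout (`Statement`, `Effect`, `Action`, `Resource`); the article does not name a cloud provider, and the function name `find_wide_statements`, the bucket name, and the definition of "wide" used here are all hypothetical choices made for the example.

```python
def _as_list(value):
    """IAM-style fields may hold a single string or a list of strings."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_wide_statements(policy):
    """Return Allow statements that grant every action of a service
    (e.g. "s3:*" or "*") or that apply to every resource ("*").

    This is an assumed, deliberately simple definition of "wide";
    a real review would also look at prefixes and conditions.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    wide = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        wide_action = any(
            a == "*" or a.endswith(":*") for a in _as_list(stmt.get("Action"))
        )
        wide_resource = any(r == "*" for r in _as_list(stmt.get("Resource")))
        if wide_action or wide_resource:
            wide.append(stmt)
    return wide


# Hypothetical policy: the first statement is wide, the second is tailored.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-lake/marketing/reports.csv",
        },
    ],
}

flagged = find_wide_statements(example_policy)
print(len(flagged))  # 1
```

A check like this could run as part of a periodic permissions review, with flagged statements handed to a person rather than changed automatically.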

The second tier is monitoring your data for anomalies. By logging the queries performed on your data lake you can detect and stop numerous cases of abuse and misuse. Let us explain how.

Queries against the data lake usually come from two major user types:

  • Humans — employees will often query the data to get information or during the process of development.
  • Applications — a deployed application will access the data as part of its normal function.

The major difference between the two is the usage pattern. Human queries are sporadic in nature, but they are normally limited to working hours and to the person’s area of work. People who work in marketing don’t normally wake up at 3 AM to start a new project on production tables.
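The human usage pattern described above can be sketched as a simple rule check on each logged query. This is a minimal illustration, not the authors' detection method: the `PROFILES` table, the user name, the schema names, and the assumption that log timestamps are already in the user's local time are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical per-user profile: working hours and the schemas normally queried.
PROFILES = {
    "dana@marketing": {
        "start": time(8, 0),
        "end": time(19, 0),
        "schemas": {"marketing", "web_analytics"},
    },
}


def human_query_flags(user, schema, timestamp):
    """Return the reasons a human query looks unusual (empty list if none).

    `timestamp` is assumed to be in the user's local time zone.
    """
    profile = PROFILES.get(user)
    if profile is None:
        return ["unknown user"]

    flags = []
    if not (profile["start"] <= timestamp.time() <= profile["end"]):
        flags.append("outside working hours")
    if schema not in profile["schemas"]:
        flags.append("outside usual area of work")
    return flags


# A marketing user querying a production schema at 3 AM trips both rules.
print(human_query_flags("dana@marketing", "production", datetime(2024, 5, 8, 3, 0)))
# ['outside working hours', 'outside usual area of work']
```

In practice the working hours and usual schemas would more likely be learned from each user's query history than configured by hand.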

Applications either run on a periodic schedule, such as ETLs, or work on demand per user request, but they are normally limited to a predefined set of tables and often have a clear usage baseline. We don’t expect applications to change their queries, access new tables, or suddenly switch from a periodic schedule to an irregular one.
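Two of the application signals mentioned above, access to new tables and a break from a periodic schedule, can be sketched as follows. This is an assumed, simplified approach: the function names, the 25% tolerance, and the use of the median gap between runs as the "expected" period are illustrative choices, not something the article specifies.

```python
from datetime import datetime
from statistics import median


def new_tables(baseline_tables, observed_tables):
    """Tables the application touched that are not in its baseline."""
    return set(observed_tables) - set(baseline_tables)


def irregular_runs(run_times, tolerance=0.25):
    """Return run times whose gap from the previous run deviates from the
    median gap by more than `tolerance` (a fraction of the median gap)."""
    ordered = sorted(run_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    return [
        ordered[i + 1]
        for i, gap in enumerate(gaps)
        if abs(gap - typical) > tolerance * typical
    ]


# Hypothetical hourly ETL that suddenly runs again after only ten minutes.
runs = [
    datetime(2024, 5, 8, 0, 0),
    datetime(2024, 5, 8, 1, 0),
    datetime(2024, 5, 8, 2, 0),
    datetime(2024, 5, 8, 3, 0),
    datetime(2024, 5, 8, 3, 10),
]
off_schedule = irregular_runs(runs)
unexpected = new_tables({"orders", "customers"}, ["orders", "customers", "payroll"])
print(off_schedule)
print(unexpected)
```

Here `off_schedule` contains only the 03:10 run and `unexpected` contains only `payroll`. On-demand applications have no fixed period, so for them the table baseline is likely the more useful of the two checks.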

You should manage and protect your data lake carefully:

  • Adopt best practices for permissions management
  • Monitor access and use the monitoring data to create actionable insights

Doing so will help you prevent data leakage and manage your costs by detecting data abuse and misuse.

About the Authors:

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist in the Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. In the Threat Research group, Ori is responsible for the data infrastructure and is involved in analytics, machine learning, and innovation projects.

Johnathan Azaria is a tech lead in data science @ Imperva, specializing in AI-driven security algorithms and digital protection.

Originally posted on OpenDataScience.com
