The Continuing Path of Data Science Maturity
2020 was a real outlier in data terms. None of us could have predicted most of the year’s trends and developments. I spend a lot of time engaging with organizations to help them use data science and AI to improve their business and create value. In 2020, many of them were forced to change their plans for digital transformation. Topics that had not previously been considered important suddenly became crucial. With that in mind, I’m not going to start making predictions for 2021. However, I think there are some key trends that every data scientist should be monitoring this year, mostly because they are part of the data science maturity curve we are all on.
The ethics of data and algorithms for data science maturity
In many disciplines, we are now starting to use models that support or automate decisions that have traditionally been made by people, and these often have ethical implications. For example, in medicine, physicians are required to follow ethical principles such as balancing benefits against risks and avoiding harm. However, what happens when algorithms start to get involved in making diagnoses and deciding treatments? How can we ensure that they are subject to the same standards? Models need ethical scrutiny to ensure that they are fair and also provide accurate results. I believe that everyone involved in data science needs to be thinking about how these ethical standards will be developed and enforced.
The importance of assessing uncertainty and confidence in data analysis
Data scientists, and indeed, anyone who has studied statistics, will be familiar with the concepts of error and confidence in the analysis. These concepts are crucial to accurate decision-making because they tell us how far we can really believe analytical findings. In other words, they quantify the quality of the analysis. However, they are also really difficult to explain to non-specialists. I believe that the whole data science community needs to be thinking about how we can explain these concepts more clearly, particularly as more organizations start to become data-driven.
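One way to make confidence tangible for a non-specialist is to report an estimate together with a plausible range rather than a single number. The sketch below is only an illustration under stated assumptions: it uses a percentile bootstrap (one common technique among several), the function name bootstrap_mean_ci is hypothetical, and the sales figures are invented for the example.

```python
import random
import statistics


def bootstrap_mean_ci(values, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Cut off equal tails on both sides of the sorted resample means.
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical weekly sales figures, invented purely for illustration.
sample = [102, 98, 110, 95, 105, 99, 120, 101, 97, 108]
estimate = statistics.fmean(sample)
low, high = bootstrap_mean_ci(sample)
print(f"Estimated mean: {estimate:.1f} (95% interval: {low:.1f} to {high:.1f})")
```

A sentence such as "we estimate about 103 units per week, and the data are consistent with anything in the printed range" is often easier for a business audience to act on than a standard error.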
How we generate, process, and curate data
Volumes of both helpful information and potentially harmful misinformation are continuously increasing. It is therefore essential to develop an ability to identify high-quality data sources. Well-processed and curated data are critical for making good decisions. Data scientists need to be thinking about how data were generated and why, what is included in any analysis, and how privacy and access restrictions are handled. In the future, I think we will see the use of automated processes to facilitate the transformation from data generation to analysis. It is up to us as data scientists to ensure that this is done appropriately.
The importance of data governance, especially when sharing data and models
Sharing data, software, and models is becoming a key element of any analytics and digital transformation project. There are risks to data sharing, and it needs careful governance. To my mind, however, the biggest risk is that we fall into the trap of thinking that we must either share all data or none. We need to contribute to a debate about how we can share the data that we need while still providing privacy and fairness to data owners and others. This is especially important as we increase the use of artificial intelligence, which needs huge volumes of training data.
Developing new tools and features for data communication and visualization
We often say that a picture is worth a thousand words, and data visualization has the power to transcend language and cultural barriers and speak to anyone, anywhere. Data visualization techniques continue to evolve every year, and the tools available are becoming increasingly powerful. Over the next few years, I think we will see tools that combine the nuance of human reasoning with the precision of computer algorithms. Devices such as wearables and large-scale displays, along with augmented and virtual reality, will promote seamless integration of data visualizations into our everyday lives, making data more accessible to everyone. Data scientists need to be alert to this potential.
The wider use of Model Operationalization techniques to get value from analytics
Data are important, and so are analytical models, but they only start to generate value once they are embedded in decision-making processes. It is a long journey from the first “proc sql; create table; run;” to having someone actually using your solution. One step is key to this process: putting the model into production. However, almost 87% of data science projects never reach this step. This process, known as model operationalization, or ModelOps, will become more important as more organizations become data-driven and the number of models increases.