Garbage In, Garbage Out: Automated Machine Learning Begins with Quality Data

Oct 24, 2019

It’s no secret that machine learning methods are highly dependent on the quality of the data they receive as input. If you think of machine learning as a manufacturing process, the higher the quality of the input data, the more likely it is that the final product is of high quality as well. This relationship presents a big challenge to analytics teams when it comes to figuring out the right data for solving business problems. Those teams need to prepare every dataset so that the machine learning process runs with as few errors as possible. This involves setting up quality standards and fixing data issues such as missing values or columns with low statistical variance, as well as selecting the right data types, removing duplicate data, and more. Automated machine learning can assist with this.

According to a CrowdFlower survey, data preparation and cleaning take roughly 60% of the time of data scientists and analytics professionals. This does not take into account the time needed to first collect and aggregate the required data for the problem at hand. However, data preparation is critical, as the efficacy of machine learning algorithms directly depends on the quality of the inputs as well as their relevance to the use case. It is not surprising, then, that data scientists and other data professionals spend countless hours gathering data and fixing problems in it to make sure algorithms yield the best results.

To help address this need, SparkCognition™ has developed the Darwin™ platform, an automated machine learning product that empowers users to quickly prototype use cases and achieve results faster than traditional data science methods. Darwin accelerates data science at scale, enabling you to assess the quality of your dataset and advising you on how to fix problems to make it suitable for the model-building process. Darwin then automates time-consuming tasks that range from model creation and optimization to model deployment and continuous maintenance. This way, Darwin aims to accelerate the data science cycle with productive automation workflows.

Getting Your Data Ready for Machine Learning

As soon as data is ingested, Darwin offers a guided data preparation workflow to help you proactively discover potential problems within your dataset. These problems could include columns with missing data, columns with low variance, or columns with too many categories. Darwin also offers suggestions on more appropriate data types for the problem at hand. During this process, Darwin provides a series of recommendations on how to address these problems to make sure the data is useful for the automated model building process.
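The kinds of checks described above can be illustrated with a minimal sketch. This is not Darwin's implementation or API; the function names (`profile_column`, `profile_dataset`) and the sample dataset are hypothetical, and the sketch only uses the Python standard library to compute the share of missing values, the number of distinct values, and the variance of each column.

```python
from statistics import pvariance


def profile_column(values):
    """Summarize one column: missing share, distinct count, variance (hypothetical helper)."""
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values) if values else 0.0
    # Treat a column as numeric only if every present value is an int or float.
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return {
        "kind": "numeric" if numeric else "categorical",
        "missing_fraction": missing_fraction,
        "distinct_values": len(set(present)),
        "variance": pvariance(present) if numeric else None,
    }


def profile_dataset(columns):
    """Profile every column of a dataset given as {column name: list of values}."""
    return {name: profile_column(values) for name, values in columns.items()}


# Illustrative data only: one column with a gap, one identifier-like column,
# and one constant column with zero variance.
dataset = {
    "temperature": [20.5, 21.0, None, 19.5],
    "sensor_id": ["a1", "b2", "c3", "d4"],
    "firmware": [1, 1, 1, 1],
}
report = profile_dataset(dataset)
for name, stats in report.items():
    print(name, stats)
```

A report like this is the raw material for the recommendations: a column with gaps is a candidate for imputation, while a constant column or one where every value is unique carries little signal for a model.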

Assessing the Overall Quality of Your Data

When the dataset is ingested, Darwin automatically analyzes the data to provide an assessment of its usefulness for the data science process. This assessment is based on the columns that can be used directly, marked in green; the columns that will require some pre-processing, marked in yellow; and the columns that will be dropped, marked in red.

Columns marked in yellow typically have problems such as missing data, or come with a suggestion for a different data type that could work better for the problem at hand. Darwin will automatically pick a method to fix these problems. In the case of missing data, Darwin will propose an imputation method based on the data type of the column. The user can also change these methods to create different data cleaning profiles and ultimately influence the model building process.
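As a rough illustration of type-based imputation with a user override, consider the sketch below. The article does not say which imputation methods Darwin proposes, so the defaults here (median for numeric columns, most frequent value for categorical ones) are assumptions, and `impute_column` is a hypothetical helper rather than part of any product API.

```python
from statistics import median, mode


def impute_column(values, strategy=None):
    """Fill missing entries (None) in a column.

    If no strategy is given, one is chosen from the data type:
    "median" for numeric columns, "mode" for everything else.
    Passing a strategy explicitly overrides that default.
    """
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to learn a fill value from; return the column unchanged.
        return list(values)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if strategy is None:
        strategy = "median" if numeric else "mode"
    if strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unknown imputation strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Default choice based on data type.
print(impute_column([1.0, None, 3.0]))
print(impute_column(["a", None, "a", "b"]))
# User override: fill a numeric column with its most frequent value instead.
print(impute_column([1, None, 1, 5], strategy="mode"))
```

Keeping the strategy as a per-column parameter is what makes different cleaning profiles possible: the same dataset can be cleaned in more than one way and the resulting models compared.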

Columns marked in red typically have a large share of missing data, a large number of unique categorical values, or low statistical variance. Darwin will automatically remove these columns from the model building effort to make sure they do not interfere with the performance of the machine learning algorithms. This way, Darwin guides users through the initial data preparation tasks to build a data cleaning profile and make sure that the dataset will be useful for the next stages of the data science process.
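The green, yellow, and red grouping can be sketched as a simple set of rules. The thresholds below (`max_missing`, `max_distinct`, `min_variance`) are invented for illustration; the article does not state the cut-offs Darwin uses, and `classify_column` is a hypothetical function.

```python
from collections import Counter


def classify_column(missing_fraction, distinct_fraction, variance, is_numeric,
                    max_missing=0.5, max_distinct=0.9, min_variance=1e-8):
    """Assign a traffic-light color to a column from its summary statistics.

    All thresholds are illustrative assumptions, not documented product values.
    """
    if missing_fraction > max_missing:
        return "red"      # too many gaps to impute sensibly
    if is_numeric and variance is not None and variance < min_variance:
        return "red"      # (nearly) constant column
    if not is_numeric and distinct_fraction > max_distinct:
        return "red"      # almost every value is unique, like an identifier
    if missing_fraction > 0:
        return "yellow"   # usable after imputation
    return "green"        # usable as-is


# Illustrative statistics: (missing_fraction, distinct_fraction, variance, is_numeric).
columns = {
    "temperature": (0.10, 0.80, 2.5, True),
    "pressure": (0.00, 0.95, 4.1, True),
    "firmware": (0.00, 0.01, 0.0, True),
    "sensor_id": (0.00, 1.00, None, False),
    "notes": (0.85, 0.10, None, False),
}
colors = {name: classify_column(*stats) for name, stats in columns.items()}
summary = Counter(colors.values())
print(colors)
print(summary)
```

The order of the checks matters: the red conditions are tested first so that a column with both a few gaps and near-zero variance is dropped rather than imputed.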

Quality Data = Quality Models

With the dataset ready to go, Darwin kicks off the automated model building process with its patented blend of evolutionary algorithms and deep learning methods. This method specializes in discovering novel, elegant network architectures, while also supporting hyper-parameter search for common algorithms such as Random Forest and XGBoost. Darwin first takes the output of the data preparation phase to automate the following major steps:

Execution of the data cleaning profile
Feature generation to enrich the dataset
Construction of a supervised or unsupervised model

For the construction of the model, rather than simply choosing the best performer in a tournament of predefined algorithms or blueprints, Darwin uses an iterative genetic process to build model topologies from scratch, optimizing them with each passing generation. This approach to automated machine learning creates solutions tailored to your data, which translates into higher-quality predictions.
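To make the idea of an iterative genetic process concrete, here is a toy evolutionary search over network topologies, represented as lists of hidden-layer widths. It is a generic textbook-style sketch and not Darwin's patented method. In particular, `fitness` is a stand-in that rewards closeness to an arbitrary target shape; a real system would train each candidate and score it on validation data. All names and parameters are hypothetical.

```python
import random


def fitness(topology):
    """Stand-in score (assumption): higher is better, best at the arbitrary shape [32, 16]."""
    target = [32, 16]
    depth_penalty = abs(len(topology) - len(target)) * 100
    width_penalty = sum(abs(a - b) for a, b in zip(topology, target))
    return -(depth_penalty + width_penalty)


def mutate(topology, rng):
    """Return a modified copy: add a layer, remove a layer, or resize one layer."""
    child = list(topology)
    choice = rng.random()
    if choice < 0.2 and len(child) < 4:
        child.append(rng.randint(1, 64))
    elif choice < 0.4 and len(child) > 1:
        child.pop()
    else:
        i = rng.randrange(len(child))
        child[i] = max(1, child[i] + rng.randint(-8, 8))
    return child


def evolve(generations=30, population_size=20, seed=0):
    """Evolve topologies; returns the best one found and the best score per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(1, 64)] for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Keep the better half unchanged (elitism) and refill with mutated copies.
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    return population[0], history


best, history = evolve()
print("best topology:", best, "score:", fitness(best))
```

Because the best candidates survive each round unchanged, the best score never gets worse from one generation to the next, which is the sense in which topologies are "optimized with each passing generation" in this sketch.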

Quality Models = Faster Operationalization

Starting with high-quality datasets translates into better models, but also faster deployment cycles. Darwin’s automated workflows around data quality and model creation allow a faster turnaround of use cases, enabling organizations to operationalize the output of data science and innovation teams faster. These workflows also serve as a foundation for subsequent tasks in the life cycle of models, including monitoring their health, retraining them with new data, and continuous maintenance. This approach effectively transforms organizations into factories of use cases that efficiently operate on their data to positively impact what matters: the bottom line.




Written by ODSC - Open Data Science
