Trending Data Science Topics & Tools for 2020
The COVID-19 pandemic has forced a paradigm shift across the entire world in 2020, and trends in every industry have changed to keep pace. In data science and AI, many practitioners and researchers have had to shift their focus to meet the demands of their companies, academic institutions, or personal research endeavors. Now that the year is more than halfway over, what has stood out in 2020 so far, and what are leading data scientists seeing in their work?
Model bias remains an issue
Dr. Jon Krohn, Chief Data Scientist | untapt
The big item for me is that the ML community is beginning to wake up to the widespread, unwanted bias that is present in data-driven models. Whether it’s related to machine vision, natural language processing, or other applications, the researchers and developers devising the models underlying these applications are not a representative sample of the broader population demographics. The data sets that they tend to work with likewise are not a representative sample of broader population demographics. The result is that many production ML models today are less effective for some demographic groups and, in a startling number of instances, can reinforce unwanted historical biases against these groups.
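One simple way to start looking for the kind of gap described above is to break a model's accuracy down by demographic group. The sketch below is only an illustration and is not taken from Dr. Krohn's work: the function names, the group labels "a" and "b", and the toy labels are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy}, computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Made-up labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

A large gap does not by itself prove unfair bias, but it is a cheap signal that a model may be less effective for some groups and deserves a closer look.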
The growing prevalence of facial recognition software & increased concern
Facial recognition technology has long sparked debate in the data science community and beyond. For years, companies and individuals have struggled to find ethical uses of the technology and to minimize bias. There are also major concerns, especially in the U.S., about its use in politics and law enforcement, concerns that the recent protests sweeping the country have brought to the fore. Big names in tech like Microsoft, IBM, and Amazon have even gone so far as to halt or pause sales of their facial recognition technology to law enforcement.
More love for AutoML and other automation tools
Daniel Gutierrez, ODSC writer, teacher, and practicing data scientist
Data and algorithms are expanding rapidly, but human capabilities — even those of data scientists and other quantitative professionals — are not. It’s for this reason that a growing number of enterprises are using a new breed of tools to automate many of the activities involved with machine learning in order to meet the increasing demand for analytical capabilities. Automated machine learning or AutoML is the technology solution designed to address the short supply of these capabilities.
MLOps is becoming a must-have for data science teams
MLOps is a set of practices for communication and collaboration between data scientists and the operations or production team. It is designed to eliminate waste, automate as much as possible, and produce richer, more consistent insights with machine learning. ML can be a game-changer for a business, but without some form of systemization, it can devolve into a science experiment.
As Stephanie Kirmer, Data Science Technical Lead at Journera, said, “The development of a subdiscipline for ML Ops is a big topic I have heard a lot about this year. Managing the infra for machine learning is hard and it’s looking like that will be a specialist field soon.”
Complex models require improved workflows — enter Apache Airflow
Tomasz Urbaszek, Software Engineer & Apache Airflow Committer | Polidea | Apache Software Foundation
Apache Airflow is a tool created by the community to programmatically author, schedule, and monitor workflows. Its biggest advantage is that it does not limit the scope of pipelines: Airflow can be used for building machine learning models, transferring data, or managing infrastructure.
The first-ever Airflow Summit ended recently, and the event was attended by 6,000 participants, which shows how many people are interested in this tool. Personally, I think that Apache Superset, yet another Apache project, is also becoming more popular. It offers an open-source alternative to expensive BI tools, and the community behind the project is already super active, which foretells a good future!
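To make the idea of programmatically authored workflows more concrete, here is a minimal sketch of a pipeline defined in code and run in dependency order. It deliberately uses only the Python standard library (`graphlib`, Python 3.9+) and is not Airflow's API; the task names, the `run_pipeline` helper, and the shared `context` dictionary are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its upstream tasks have finished.

    tasks:        {task name: callable that takes the shared context dict}
    dependencies: {task name: set of upstream task names}
    """
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context


# Hypothetical three-step pipeline: extract, then transform, then train.
def extract(ctx):
    ctx["raw"] = [3, 1, 2]


def transform(ctx):
    ctx["clean"] = sorted(ctx["raw"])


def train(ctx):
    # Stand-in for model training: just the mean of the cleaned data.
    ctx["model"] = sum(ctx["clean"]) / len(ctx["clean"])


tasks = {"extract": extract, "transform": transform, "train": train}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
}

order, context = run_pipeline(tasks, dependencies)
print(order, context["model"])
```

A real workflow manager adds scheduling, retries, and monitoring on top of this basic idea of tasks plus a dependency graph.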
Transparency is key for data
Jordan Bean | Liberty Mutual Insurance
The democratization of big data and the accessibility and power of open-source analytics platforms have shifted the emphasis from the novelty of what can be done with data to its explainability and interpretability. Accelerated by COVID modeling and examples of gender-biased or racially biased model predictions, the ability to create a transparent, “plain English” explanation of a model and its predictions is becoming a necessity.
In corporate America, delivering impact with analytics and ML increasingly requires this explanation of findings over pure predictive power. This has made “data storytelling” a more important skill to learn, as the numbers and predictions no longer just speak for themselves; developing this skill will become the next evolution of data science and ML.
More hype for federated learning
Dr. Kirk Borne, Principal Data Scientist | Booz Allen Hamilton
Federated Machine Learning (FML) is another “orphan” concept (formerly called Distributed Data Mining a decade ago) that has found new life in modeling requirements, algorithms, and applications this year. To some extent, the pandemic has contributed to this because FML enforces data privacy by essentially removing data-sharing as a requirement for model-building across multiple datasets, multiple organizations, and multiple applications. ML model training is done locally on the local dataset, with the meta-parameters of the local models then being shared with a central model-inference engine (which does not see any of the private data). The central ML engine then builds a global model, which is communicated back to the local nodes. Multiple iterations in parameter-updating and hyperparameter-tuning can occur between local nodes and the central inference engine, until satisfactory model accuracy is achieved. All through these training stages, data privacy is preserved, while allowing for the generation of globally useful, distributable, and accurate models.
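The training loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Dr. Borne's implementation: it uses a one-variable linear model, shares plain model parameters (a weight and a bias) rather than any richer metadata, averages them weighted by local dataset size, and skips hyperparameter tuning. All function names and the three node datasets are hypothetical.

```python
def local_update(w, b, data, lr=0.05, epochs=20):
    """Train on one node's private data, starting from the global model."""
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error for the model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates, sizes):
    """Combine local parameters into a global model, weighted by data size."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return w, b


def train_federated(node_datasets, rounds=50):
    """The central engine only ever sees parameters, never the raw data."""
    w, b = 0.0, 0.0
    sizes = [len(data) for data in node_datasets]
    for _ in range(rounds):
        updates = [local_update(w, b, data) for data in node_datasets]
        w, b = federated_average(updates, sizes)
    return w, b


# Hypothetical private datasets on three nodes, all following y = 2x + 1.
node_datasets = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]

w, b = train_federated(node_datasets)
print(round(w, 3), round(b, 3))
```

Each round mirrors the description: local training on local data, parameters sent to the central engine, and a global model sent back to the nodes for the next iteration.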
NLP is a must-have
Kimberly Fessel, Senior Data Scientist and Instructor | Metis
Three letters: NLP. Natural language processing (NLP) has experienced continued rapid growth in recent years, in both research and practical usage. In 2020, we have already seen the release of GPT-3 and additional BERT variants. But NLP use is also making its way into the mainstream. Many companies now see NLP as a critical piece of their strategic advantage; just note how many current job postings mention NLP as a required skill!
How to get started with these trending topics?
As the data science professionals above noted, these topics are no longer important only within the field of AI; they are becoming mainstream. We often hear about data bias, facial recognition, and transparency in the news, and knowing these topics is pivotal for anyone involved in data science and AI.
To get started, consider checking out a few of these talks at the ODSC Europe Virtual Conference this September 17–19:
– “Removing Unfair Bias in Machine Learning” Margriet Groenendijk, PhD, Data & AI Developer Advocate | IBM
– “Introduction To Face Processing With Computer Vision” Gabriel Bianconi, Founder | Scalar Research
– “Machine Learning Operations: Latent Conditions and Active Failures” Flavio Clesio, Machine Learning Engineer | MyHammer AG
These are some trending topic talks coming to the ODSC West Virtual Conference this October 27–30:
– “Model Governance: A Checklist for Getting AI Safely to Production” David Talby, CTO | Pacific AI, John Snow Labs
– “MLOps in DL model development” Anna Petrovicheva, Chief Technical Officer | OpenCV.ai
– “Advanced NLP with TensorFlow and PyTorch: LSTMs, Self-attention and Transformers” Daniel Whitenack, PhD, Instructor & Data Scientist | Data Dan