Be or Not to be an Anomaly?

Why even bother to detect outliers?

  • To react. When unusual behavior appears, especially negative behavior, a timely reaction is key. The sooner a fraudulent email is detected, the sooner it can be removed so that it does not endanger the user. Detecting a machine’s fault in time may even save lives.
  • To know ‘normality’. Taking the information about outliers into account while inferring may lead to incorrect conclusions. If a student failed one test while nailing all the rest, the ‘normal’ behavior is still the key for judgment (even if a reaction, as in the first point, may be a good idea).
  • To accurately predict.

How to detect anomalies?

From the modeling point of view, anomalies can be found in a lot of ways!

  • A side effect of a supervised approach. Let’s forget about outliers and just model the variable at hand in the best possible way, preferably with exogenous variables. Then, using the prediction errors, identify the observations with the largest discrepancies. If a model that captures the overall pattern has trouble fitting them, there is a pretty high chance they are not typical observations. Also, some methods, like X13, have outlier detection built in.
  • Unsupervised methods. These carry more uncertainty than the alternatives, but you can apply them right away: data and business knowledge are really all you need to start.
  • Mixed approach. Anomaly detection is like playing detective: you arrive at a suspect, but human feedback can still strengthen the ‘evidence’. That’s why, for example, when you sign in to your email from another device you are asked whether it is really you. An anomaly was detected, but for the model to improve, a confirmation is needed. The mixed approach is my personal favorite.
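
To make the first idea concrete, here is a minimal sketch of the "side effect of a supervised approach": fit a simple model, then flag the observations with the largest prediction errors. Everything in it is an assumption for illustration: the single exogenous variable (daily steps, in thousands), the made-up numbers, the plain least-squares line, and the robust z-score cutoff of 3.5 based on the median absolute deviation. A real model would likely be richer.

```python
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least squares for y ~ a + b * x with one exogenous variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residual_outliers(x, y, threshold=3.5):
    """Return indices whose residual has a robust z-score above the threshold."""
    a, b = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    med = median(residuals)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return []  # residuals are (almost) all identical, nothing to flag
    scores = [0.6745 * (r - med) / mad for r in residuals]
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Made-up data: steps in thousands (exogenous) and floors climbed (target).
steps = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
floors = [8, 10, 12, 14, 16, 60, 20, 22, 24, 26]

suspects = residual_outliers(steps, floors)
print(suspects)
```

The flagged indices are only suspects. In the mixed approach, they would then be shown to a human for confirmation.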

Walking through an example: detection

Let’s focus on univariate outliers and on representatives of two of the approaches above: unsupervised methods and the side effect of supervised methods. For demonstration purposes, I’ll be using my own Fitbit data on floors climbed per day since the beginning of 2018.
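
As a rough illustration of an unsupervised univariate detector, the sketch below applies Tukey's fences (the 1.5 × IQR rule) to daily floor counts. The numbers are made up rather than taken from the Fitbit export, and the rule is one common choice that I am assuming here for the sake of example; the multiplier `k=1.5` is the conventional default, not a tuned value.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return (low, high): indices below Q1 - k*IQR and above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    low = [i for i, v in enumerate(values) if v < lower]
    high = [i for i, v in enumerate(values) if v > upper]
    return low, high


# Made-up daily floor counts: mostly ordinary days, one very active, one idle.
floors = [10, 12, 9, 11, 14, 13, 10, 95, 12, 11, 0, 13, 12, 10, 11]

lazy_days, active_days = tukey_outliers(floors)
print("unusually low:", lazy_days)
print("unusually high:", active_days)
```

Splitting the result into a low and a high group keeps the two kinds of unusual days apart, since they usually call for different interpretations.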

Walking through an example: inference

An immediate output of the anomaly detection techniques presented above was the identification of hiking days and lazy days. But what about inference? What could be the actual business case?

  1. Identifying sequences of outliers (also known as temporary changes).
  2. Manually labeling the temporary changes.
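
The two steps above can be sketched as follows, assuming the detection stage has already produced one boolean flag per day. The function name `outlier_runs`, the `min_length` parameter, and the example label are all hypothetical; the labels in particular would come from a person who knows what happened on those days.

```python
from itertools import groupby


def outlier_runs(flags, min_length=2):
    """Return (start, end) index pairs, inclusive, for runs of flagged days."""
    runs = []
    index = 0
    for is_outlier, group in groupby(flags):
        length = len(list(group))
        if is_outlier and length >= min_length:
            runs.append((index, index + length - 1))
        index += length
    return runs


# Made-up flags: True marks a day that a detector considered an outlier.
flags = [False, True, False, False, True, True, True, False, True, True]

# Step 1: find sequences of outliers (isolated single days are skipped).
temporary_changes = outlier_runs(flags)
print(temporary_changes)

# Step 2: manual labeling, here as a plain dictionary filled in by hand.
labels = {temporary_changes[0]: "hiking trip"}  # hypothetical label
```

Setting `min_length=1` would also keep isolated outliers, which may be preferable when single unusual days matter as much as longer stretches.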

Is that all?

We already feel pretty confident in the world of one-dimensional outliers. Now let’s imagine a dataset in which each day is represented by a time series instead of a single number. Anomaly detection then becomes significantly more complicated, because not only the floor counts differ, but also the dynamics of climbing them during the day. I’ll be talking more about this quite interesting challenge at this year’s ODSC Europe Conference. In my talk, “Multivariate (Flight) Anomalies Detection,” you will learn how to detect anomalies in multidimensional space and, before that, how to distinguish data quality issues from anomalies. This time the context will be the aviation industry, so flight profiles will be ‘taken under the data science microscope’. However, the proposed modeling approach is transferable to any other domain. I hope to see you there!
