Be or Not to be an Anomaly?

Why even bother to detect outliers?

  • To react. When unusual behavior appears, especially negative behavior, reaction is key. The sooner a fraud email is detected, the sooner it can be removed so that it does not endanger the user. Detecting a machine’s fault in time may even save lives.
  • To know ‘normality’. Taking the information about outliers into account while inferring may lead to incorrect conclusions. If a student failed one test while nailing all the rest, the ‘normal’ behavior is still the key for judgment (even if reaction — see point 1 — may be a good idea).
  • To accurately predict.

How to detect anomalies?

  • Intended supervised approach. The most costly, with a high barrier to entry and not immune to pattern changes, yet quite effective in a stable environment. It requires manually labeling your data points as outliers or as typical observations. Once you have the labels, good old classification methods may be applied. To speed up the process, a visual tool with an automatically retraining model behind the scenes is pretty useful. However, tool or no tool, labeling usually has a flavor of tediousness to it.
  • A side-effect of a supervised approach. Let’s forget about outliers and just model the variable at hand in the best possible way, preferably with exogenous variables. Then, using prediction errors, identify the observations with the highest discrepancies. Given that the pattern-capturing model has trouble fitting them, there is a pretty high chance those are not typical observations. Also, some methods, like X13, have outlier detection built in.
  • Unsupervised methods. These carry more uncertainty than the alternatives, but you can leverage them right away: data and business knowledge are really all you need to start.
  • Mixed approach. Anomaly detection is like playing detective: you arrive at the point of having a suspect, but human feedback may still strengthen the ‘evidence’. That’s why, for example, if you sign in to your email from another device you are asked whether that’s indeed you: an anomaly was detected, but for the model to improve, a confirmation is needed. The mixed approach is my personal favorite.

Walkthrough an example: detection

Walkthrough an example: inference

  1. Finding candidates for outliers.
  2. Identifying sequences of outliers (also known as temporary changes).
  3. Manually labeling the temporary changes.
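The first two steps above, combined with the prediction-error idea from the "side-effect of a supervised approach" item, as an added example that is not code from the original article. The daily counts are constructed, the weekly period is an assumption, and the per-weekday median baseline is a simple stand-in for a real forecasting model such as X13.

```python
from statistics import median

PERIOD = 7  # weekly seasonality (assumed for the constructed data)

def make_series():
    """Constructed daily counts: weekly pattern + small deterministic noise."""
    base = [20, 22, 21, 23, 25, 8, 6]
    series = [base[i % PERIOD] + (i * 2) % 5 - 2 for i in range(28)]
    series[9] += 40                 # one isolated spike
    for i in (17, 18, 19):          # a three-day temporary change
        series[i] += 30
    return series

def residuals(series):
    """Prediction error against a per-weekday median baseline."""
    fit = [median(series[d::PERIOD]) for d in range(PERIOD)]
    return [y - fit[i % PERIOD] for i, y in enumerate(series)]

def candidates(res, threshold=3.5):
    """Step 1: indices whose robust z-score (median/MAD) exceeds the threshold."""
    centre = median(res)
    mad = median(abs(r - centre) for r in res)
    if mad == 0:
        return []  # over half the residuals are identical: score undefined
    return [i for i, r in enumerate(res)
            if abs(0.6745 * (r - centre) / mad) > threshold]

def group_runs(indices, max_gap=1):
    """Step 2: merge adjacent candidates into (start, end) sequences."""
    runs = []
    for i in indices:
        if runs and i - runs[-1][1] <= max_gap:
            runs[-1][1] = i
        else:
            runs.append([i, i])
    return [tuple(r) for r in runs]

def review_queue(series):
    """Step 3 input: each run tagged for a human to label."""
    runs = group_runs(candidates(residuals(series)))
    return [(s, e, "temporary change" if e > s else "single outlier")
            for s, e in runs]

if __name__ == "__main__":
    for start, end, kind in review_queue(make_series()):
        print(f"days {start}-{end}: {kind}")
```

The median and MAD are used instead of mean and standard deviation so that the anomalies themselves do not inflate the scale they are measured against.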

Is that all?

ODSC - Open Data Science
