SciTech

Algorithm learns to identify anomalous activity online with high degree of accuracy

At the IEEE International Conference on Big Data Security in New York City this month, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the machine learning start-up PatternEx presented a paper on a new security system that combines machine learning with input from human security experts. The system, called AI2 (a merging of “artificial intelligence” and “analyst intuition”), has an 85 percent success rate in identifying threats and a false positive rate of 4.4 percent over a raw data set of 3.6 billion log lines. According to the paper, the three major challenges facing the security industry are a lack of labelled examples on which to train learning models, the constant evolution of attackers’ methods, and the limited time security analysts have to determine each threat’s risk factor.
In fact, stand-alone analyst-driven approaches are limited in their effectiveness because attackers learn how such systems predict possible threats and then work around that behavior to bypass security.

Furthermore, approaches based on machine learning alone can be inefficient because they call for human investigation every time they come across an anomaly.
The research team’s answer to these hurdles is to build a supervised model on top of three different unsupervised models (learning models that do not need labelled examples), producing a system that ‘continuously refines itself.’

The paper lists four basic components that define the system: a big data behavioral analytics platform, a set of outlier detection methods, a mechanism for obtaining feedback from security analysts, and a supervised learning module.

Basically, the system analyzes large-scale raw data using the three unsupervised models, after which it presents the analyst with a small portion of the data set, chosen on the basis of those models, to obtain labels for the events it wishes to identify.
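As a rough illustration of how the output of three outlier detectors might be merged into one shortlist for an analyst, here is a minimal sketch. The rank-averaging rule, the function names and the scores are all assumptions made for this example; the article does not say how AI2 combines its detectors.

```python
from statistics import mean

def to_ranks(scores):
    """Map raw scores to ranks in [0, 1]; the highest score gets 1.0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    denom = max(len(scores) - 1, 1)
    for position, index in enumerate(order):
        ranks[index] = position / denom
    return ranks

def select_for_analyst(score_lists, budget):
    """Average the per-detector ranks and return the indices of the top `budget` events."""
    rank_lists = [to_ranks(scores) for scores in score_lists]
    combined = [mean(column) for column in zip(*rank_lists)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:budget]

# Hypothetical outlier scores from three detectors for the same five events.
# Each detector uses its own scale, which is why ranks are averaged, not raw scores.
detector_a = [0.1, 0.9, 0.2, 0.8, 0.3]
detector_b = [5.0, 40.0, 7.0, 35.0, 6.0]
detector_c = [0.02, 0.70, 0.01, 0.90, 0.05]

shortlist = select_for_analyst([detector_a, detector_b, detector_c], budget=2)
print(shortlist)  # indices of the two events the analyst would be asked to label
```

Averaging ranks rather than raw scores is one simple way to keep a detector with a large numeric range from dominating the others.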

After collecting the analyst’s feedback on this data, the system uses a supervised learning model (a learning model that makes predictions based on labelled training data) to further refine the process.
The supervised model, used in conjunction with the three unsupervised ones, helps the system quickly narrow down the data presented to the analyst, thereby reducing cost and wasted effort.
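The feedback step can be pictured with a small sketch in which analyst labels train a supervised model that then ranks unlabelled events. A nearest-centroid classifier stands in here for whatever supervised learner the system actually uses, and the two features and all values are hypothetical.

```python
from math import dist

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(labelled):
    """labelled: (features, is_attack) pairs from the analyst; needs both classes present."""
    attacks = [x for x, y in labelled if y]
    benign = [x for x, y in labelled if not y]
    return centroid(attacks), centroid(benign)

def attack_score(model, features):
    """Positive when the event lies closer to the attack centroid than to the benign one."""
    attack_c, benign_c = model
    return dist(features, benign_c) - dist(features, attack_c)

# Hypothetical analyst feedback: (failed logins per hour, distinct locations)
feedback = [
    ([40.0, 5.0], True),
    ([35.0, 4.0], True),
    ([1.0, 1.0], False),
    ([2.0, 1.0], False),
]
model = train(feedback)

# Rank new, unlabelled events so the most attack-like ones reach the analyst first
unlabelled = [[38.0, 6.0], [0.0, 1.0], [3.0, 2.0]]
ranked = sorted(
    range(len(unlabelled)),
    key=lambda i: attack_score(model, unlabelled[i]),
    reverse=True,
)
print(ranked)
```

Each round of labels would be added to `feedback` and the model retrained, which is one way to read the description of a system that keeps refining itself.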
To identify potential attacks, the system uses variables that point to a particular behavior. Together, these variables define the behavioral signature of an attack, which is simply the series of steps required to commit the security breach.
Since this study was conducted with data from an online enterprise system (typically a business organization), the data that the learning system studied included login successes and failures from a given IP address over a time unit, the minimum time from login to checkout, and the different geographical locations from which the user checked in during a given time unit.

All of these factors, combined in a unique manner, can produce the behavioral signature of an attack, which the learning algorithm ‘learns’ in order to identify suspicious activity.
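To make the idea of behavioral variables concrete, the sketch below turns raw log records into one row of features per IP address and time unit. The record fields (`ip`, `hour`, `event`, `location`) and the choice of an hour as the time unit are assumptions for illustration, not the format of the data used in the study.

```python
from collections import defaultdict

def extract_features(records):
    """Aggregate raw log records into one feature row per (ip, hour)."""
    buckets = defaultdict(lambda: {"ok": 0, "fail": 0, "locations": set()})
    for rec in records:
        bucket = buckets[(rec["ip"], rec["hour"])]
        if rec["event"] == "login_ok":
            bucket["ok"] += 1
        elif rec["event"] == "login_fail":
            bucket["fail"] += 1
        bucket["locations"].add(rec["location"])
    return {
        key: {
            "login_successes": b["ok"],
            "login_failures": b["fail"],
            "distinct_locations": len(b["locations"]),
        }
        for key, b in buckets.items()
    }

# Hypothetical log records; the addresses come from ranges reserved for documentation
logs = [
    {"ip": "203.0.113.7", "hour": 14, "event": "login_fail", "location": "US"},
    {"ip": "203.0.113.7", "hour": 14, "event": "login_fail", "location": "BR"},
    {"ip": "203.0.113.7", "hour": 14, "event": "login_ok", "location": "DE"},
    {"ip": "198.51.100.2", "hour": 14, "event": "login_ok", "location": "US"},
]
features = extract_features(logs)
print(features[("203.0.113.7", 14)])
```

Rows like these, rather than the raw log lines, are what an outlier detector or a supervised model would consume.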

Frauds that violate the terms of the service agreement often have a distinct behavior signature.

Examples include the abusive use of promotional codes and the manipulation of web browser cookies to take part in a poll or draw multiple times.

As the system learns to identify previously ‘unseen’ threats, it gets better, showing a 3.41-fold improvement compared with unsupervised anomaly detectors and reducing false positives fivefold.
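For readers unfamiliar with these measures, the sketch below shows how a detection rate and a false positive rate are conventionally computed from counts of correct and incorrect alerts. It assumes the figures quoted in this article follow those standard definitions, and the counts are invented solely to reproduce the reported percentages; they are not taken from the paper.

```python
def detection_rate(true_positives, false_negatives):
    """Share of real attacks that were flagged."""
    return true_positives / (true_positives + false_negatives)

def false_positive_rate(false_positives, true_negatives):
    """Share of benign events that were wrongly flagged."""
    return false_positives / (false_positives + true_negatives)

# Hypothetical counts, chosen only so the results match 85 percent and 4.4 percent
attacks_caught, attacks_missed = 85, 15
benign_flagged, benign_passed = 44, 956

print(f"{detection_rate(attacks_caught, attacks_missed):.1%}")
print(f"{false_positive_rate(benign_flagged, benign_passed):.1%}")
```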

Since threats keep evolving and a dependable security system grows ever more important, any such system will need human analysts working alongside artificial intelligence to combat threats, even as scientists study ways to automate the process completely.