
Research from Blueliv honored at Artificial Intelligence & Machine Learning conference

November 3, 2017

Blueliv recently participated in the 20th International Conference of the Catalan Association for Artificial Intelligence (Congrés Català en Intel·ligència Artificial or CCIA), whose objective is to foster discussion among the local Artificial Intelligence & Machine Learning research community.

Blueliv’s Daniel Gibert presented a poster of his collaborative work on ‘Convolutional Neural Networks for Classification of Malware Assembly Code,’ and we are delighted to announce that he was awarded Best Poster at CCIA’17 for his efforts. Congratulations!

The publication, summarized in this blog post, is part of an original research project called Design of a System for Detection and Classification of Web Pages and Malicious Software (in Catalan: “Disseny d’un sistem de classificació de pàgines web i software maliciós”) carried out by Blueliv in collaboration with the University of Lleida and funded by AGAUR.

The proceedings of the conference and papers will shortly be published in Recent Advances in Artificial Intelligence Research and Development, in the Frontiers in Artificial Intelligence and Applications series (Amsterdam: IOS Press).


Convolutional Neural Networks for Classification of Malware Assembly Code

According to the existing literature, machine learning-based methods for malware detection and classification rely mainly on a set of hand-crafted features previously defined by experts. These approaches can be divided into two components:

On the one hand, a subset of features is extracted to represent malicious software. Most of the approaches use N-Gram based features extracted from byte sequences or instruction opcodes.

An N-Gram is a contiguous sequence of N items from a given sequence of text. Like byte-sequence N-Grams, opcode N-Gram patterns have been used in the literature to detect and classify malware. An opcode (short for operation code) is the portion of a machine language instruction that specifies the operation to be performed. Other features include statistical entropy measures, Windows Application Programming Interface (API) function calls, and register usage.
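As a minimal sketch of the N-Gram features described above, the following Python snippet counts opcode N-Grams in a short sequence. The function name `opcode_ngrams` and the sample opcode sequence are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Count every contiguous run of n opcodes in the sequence."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


# Hypothetical opcode sequence, for illustration only.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]

bigrams = opcode_ngrams(sample, 2)
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

In a traditional pipeline, counts like these would be computed for every sample and then reduced to a manageable set of features before training a classifier.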

On the other hand, the extracted features are used as input for a machine learning classifier (Neural Networks, Support Vector Machines, Decision Trees, etc.).

The main drawback of these approaches is that the feature extraction and feature reduction steps are time-consuming and rely on expert analysis to design the discriminative features. Only then are the features passed to the machine learning system that makes the final classification decision.

In the paper, we present a novel end-to-end deep learning framework to group malicious software into families by extracting N-Gram like signatures from malware’s assembly language instructions. Our method takes inspiration from existing N-Gram based methods, but instead of exhaustively enumerating a large number of N-Grams during training, our deep learning system learns to detect N-Gram like signatures, that is, subsequences of opcodes that are indicative of malware.

Moreover, our method allows very long N-Gram type signatures to be discovered, which is impractical when the counts of all N-Grams are required (e.g. for N > 4). Our approach eliminates the need to count millions of N-Grams during training, as well as the dimensionality reduction step.
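To give a rough sense of why exhaustive counting becomes impractical as N grows, the sketch below computes the upper bound on the number of distinct N-Grams, which is the vocabulary size raised to the power N. The vocabulary size of 100 opcodes is a hypothetical figure chosen for illustration; the number of N-Grams actually observed in a dataset is also bounded by the total length of its programs.

```python
def max_distinct_ngrams(vocab_size, n):
    """Upper bound on the number of distinct N-Grams over a vocabulary."""
    return vocab_size ** n


# Assumed vocabulary size (hypothetical, not a figure from the paper).
VOCAB_SIZE = 100

for n in range(1, 8):
    print(f"N={n}: up to {max_distinct_ngrams(VOCAB_SIZE, n):,} distinct N-Grams")
```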

In our work, we apply convolutional neural networks (CNNs) to the problem of malware classification. The CNN learns to detect patterns in the disassembled byte-code that are indicative of one malware family or another.


CNN Layers Description

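The convolutional neural network is composed of at least the following layers, described in order below: an input representation, an embedding layer, a convolutional layer, a max pooling layer, and a fully-connected softmax layer.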



An assembly program is represented as a concatenation of mnemonics:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $n$ is the length of the program and $x_i \in \mathbb{R}^k$ corresponds to the $i$-th mnemonic in the program.



Every mnemonic is represented as a low-dimensional vector of real values (word embedding).
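The input and embedding steps can be sketched as a simple lookup table. This is a minimal illustration under stated assumptions: the helper names `build_embeddings` and `embed_program`, the embedding size of 4, and the random initial values are all hypothetical. In a trained model the embedding values would be learned rather than random.

```python
import random


def build_embeddings(vocabulary, k, seed=0):
    """Assign each mnemonic a k-dimensional vector of small random values."""
    rng = random.Random(seed)
    return {m: [rng.uniform(-0.1, 0.1) for _ in range(k)] for m in vocabulary}


def embed_program(mnemonics, embeddings):
    """Map a program (list of n mnemonics) to a list of n embedding vectors."""
    return [embeddings[m] for m in mnemonics]


# Hypothetical program and embedding size, for illustration only.
program = ["push", "mov", "push", "call"]
embeddings = build_embeddings(sorted(set(program)), k=4)

x = embed_program(program, embeddings)
print(len(x), "mnemonics, each a vector of size", len(x[0]))
```

Repeated mnemonics share the same vector, so the program becomes a sequence of n vectors of size k that the convolutional filters can slide over.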



A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, where $h$ is the number of mnemonics to which the filter is applied and $k$ is the size of the word embedding. In particular, filters are applied to sequences containing from 2 to 7 mnemonics.

A feature $c_i$ is generated from a window of mnemonics $x_{i:i+h-1}$ (all mnemonics from position $i$ to position $i + h - 1$) and is defined as

$$c_i = f(w \cdot x_{i:i+h-1} + b),$$

where $f$ is the rectified linear unit (ReLU) function and $b$ is the bias term.
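The feature computation above can be sketched in plain Python for a single filter. The function names, the toy embeddings, and the filter weights are hypothetical values chosen so the arithmetic is easy to follow; note that the code uses 0-based indices, whereas the text counts positions from 1.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def conv_feature_map(x, w, b):
    """Slide one filter over a program and return its feature map.

    x: list of n embedding vectors, each of size k
    w: flat filter of size h * k
    b: bias term
    """
    k = len(x[0])
    h = len(w) // k
    feature_map = []
    for i in range(len(x) - h + 1):
        # Concatenate the h embeddings in the window starting at position i.
        window = [value for vector in x[i:i + h] for value in vector]
        dot = sum(wj * xj for wj, xj in zip(w, window))
        feature_map.append(relu(dot + b))
    return feature_map


# Toy input: n = 4 mnemonics with embedding size k = 2 (hypothetical values).
x = [[1, 0], [0, 1], [1, 1], [0, 0]]
w = [1, -1, 0.5, 0.5]  # one filter spanning h = 2 mnemonics
b = -0.5

c = conv_feature_map(x, w, b)
print(c)  # [1.0, 0.0, 0.0]
```

A filter spanning h mnemonics yields n − h + 1 features, one per window position.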



The max pooling operator is applied over the feature map $c = [c_1, c_2, \ldots, c_{n-h+1}]$, and the maximum value $\hat{c} = \max\{c\}$ is taken as the feature corresponding to the filter.
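A minimal sketch of this pooling step follows, assuming one feature map per filter. The filter names and activation values are hypothetical.

```python
def max_over_time(feature_map):
    """Return the largest activation in a filter's feature map."""
    return max(feature_map)


# Hypothetical feature maps, one per filter (values are illustrative only).
feature_maps = {
    "filter_h2": [0.1, 0.9, 0.0, 0.3],
    "filter_h3": [0.4, 0.2, 0.0],
}

# One pooled feature per filter, regardless of the feature map's length.
pooled = [max_over_time(c) for c in feature_maps.values()]
print(pooled)  # [0.9, 0.4]

# The same activations in a different order pool to the same value.
shifted = [0.3, 0.0, 0.1, 0.9]
print(max_over_time(shifted) == max_over_time(feature_maps["filter_h2"]))  # True
```

Because pooling keeps only the strongest activation and discards its position, a pattern contributes the same feature wherever it appears in the program.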



The extracted features are passed to a fully-connected softmax layer whose output is the probability distribution over families.
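The final layer can be sketched as a linear map followed by a softmax. The weights, biases, pooled features, and the three-family setup below are hypothetical values for illustration, not parameters from the paper.

```python
import math


def fully_connected(features, weights, biases):
    """Compute one score per family: a weighted sum of features plus a bias."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn scores into probabilities that sum to 1 (numerically stable)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled features (one per filter) and layer parameters.
features = [0.9, 0.4]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # one row per family
biases = [0.0, 0.0, 0.0]

probs = softmax(fully_connected(features, weights, biases))
predicted = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], "-> family index", predicted)
```

The predicted family is the one with the highest probability in the output distribution.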

To demonstrate the suitability of our approach, we evaluated our model on the data provided by Microsoft for the Big Data Innovators Gathering (BIG 2015) Anti-Malware Prediction Challenge. Experiments show that our approach achieved competitive results in comparison with the state of the art and, more importantly, outperformed N-Gram based methods in the literature in terms of both predictive power and computational time. In addition, the nature of the convolutional neural network provides resilience to the function and subroutine reordering techniques commonly employed by malware authors for obfuscation purposes.


Fig.1 Data Transformation


Fig.2 A complete overview of the architecture


Fig.3 t-SNE Visualization. The N-Gram like features learned by the convolutional layers are highly discriminative and can be used to cluster malware into groups.



We used convolutional neural networks (CNNs) because they act as feature extractors. Here, given a sequence of opcodes representing the assembly language instructions of the malicious software, the CNN automatically learns which subsequences are most discriminative of one family or another.

Previous approaches relied on counting a huge number of N-Grams, then performing the feature reduction step, and finally training a machine learning classifier. In our case, this process of feature extraction, feature reduction, and classification is carried out entirely by the CNN.

Additionally, the computational time of our approach is much lower than that of previous solutions, and our results are also better in terms of accuracy and predictive power.

Applying this methodology at Blueliv means we can determine which family a malware sample belongs to (or whether a software program is benign) with even greater accuracy and speed than before, providing even more value for those using our services.
