What two general approaches are there to malware detection?
signature based
data driven
What are features of signature-based malware detection?
reliable
expensive to maintain -> keep signatures up to date
no zero-day detection
can be broken by small changes
What are features of data-driven malware detection?
questionable reliability
can detect zero-days and modified malware
less expensive to maintain
What are the requirements for ML driven malware detection?
large, representative dataset
interpretable model
very low false-positives
able to quickly react to defense by malware
What should a good feature be?
descriptive
dense
transferable between downstream tasks
cheap to compute
How can one conduct feature extraction?
no semantics / count-based
naive semantics
stronger semantics
alternative categorizations:
static vs dynamic
structured vs unstructured
What is the difference between static and dynamic features?
static
-> same feature always has the same meaning (regardless of where it occurs…)
e.g. header fields…
dynamic
-> meaning changes depending on context
e.g. system call sequence…
What is the difference between structured and unstructured feature extraction?
structured:
e.g. header fields
unstructured:
no information on the relation or order of the parts
-> e.g. entropy analysis
What is the difference between count based, weak and strong semantic features?
count-based:
no semantics, simply count occurrences…
weak/naive semantics:
e.g. BoW, tf-idf, n-grams
capture e.g. some co-occurrences, but no real semantic understanding
stronger semantics:
e.g. start from count-based or weak features
create an embedding to introduce more semantic meaning
e.g. embed the whole control flow graph…
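As a toy illustration of count-based vs. weak/n-gram features, a minimal sketch (the opcode sequence and function name are made up for illustration):

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Count n-grams over an opcode sequence: n=1 gives plain
    count-based features, n>=2 captures co-occurrence (weak semantics)."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

# hypothetical opcode sequence from a disassembled sample
seq = ["mov", "add", "mov", "add", "jmp"]
unigrams = opcode_ngrams(seq, n=1)  # count-based: no semantics at all
bigrams = opcode_ngrams(seq, n=2)   # weak semantics: which opcode follows which
```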
How can we embed control flow graphs?
divide the CFG into basic blocks, i.e. code segments with no branches
a basic block is always executed straight through…
for each basic block create a feature vector x_i
use e.g. count-based features, opcode n-grams or tf-idf (we used opcode n-grams)
embedding step
How does the embedding step work in CFG embedding?
the embedding of the whole graph is the sum over the embeddings of its blocks
the embedding of each block is computed iteratively using a NN
the next-iteration embedding of a block is
based on its initial feature vector and the sum of the neighboring embeddings
the more we iterate, the more (further away) neighbors we take into consideration
-> the iteration count is the distance up to which we take other blocks into consideration
=> T-hop neighborhood
What network was proposed for CFG embedding?
W is some matrix with dimensionality [embedding size x feature size]
F is some non-linear NN
that takes as input the sum of the neighboring embeddings
and has output dimensionality p (the embedding size)
=> new block embedding = weighted base feature vector plus the output of the network on the sum of the neighboring embeddings
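The update rule above can be sketched in numpy as follows — a hedged sketch, not the exact proposed architecture: `F` is stubbed here with a ReLU, and the shapes and names are assumptions.

```python
import numpy as np

def embed_cfg(X, adj, W, F, T=3):
    """Iteratively embed a CFG.
    X[v]: initial feature vector of basic block v (length d)
    adj[v]: indices of v's neighboring blocks
    W: [p x d] matrix projecting features into embedding space
    F: non-linear network R^p -> R^p applied to the neighbor sum
    After T iterations each block has seen its T-hop neighborhood."""
    n, p = len(X), W.shape[0]
    mu = np.zeros((n, p))  # initial block embeddings
    for _ in range(T):
        new_mu = np.zeros_like(mu)
        for v in range(n):
            neigh_sum = sum((mu[u] for u in adj[v]), np.zeros(p))
            # weighted base vector plus network output on the neighbor sum
            new_mu[v] = np.tanh(W @ X[v] + F(neigh_sum))
        mu = new_mu
    return mu.sum(axis=0)  # graph embedding = sum over block embeddings
```

Here `np.tanh` and the ReLU stub stand in for whatever non-linearities the actual model uses.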
How is the network trained?
we have a labeled dataset X, Y
for a pair (g_i, g_j): y_ij == 1 if they were compiled from the same source, even for different platforms and different compiler optimization levels
-> the network should thus indicate them to be the same…
use a siamese network and cosine similarity
How is cosine similarity calculated?
cossim of two graphs
=
cos(embedding of first, embedding of second)
=
dot product of the two embeddings
/
product of the lengths (norms) of both embeddings: ||μ_i|| · ||μ_j||
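A minimal numpy sketch of this computation (the embeddings are arbitrary example vectors):

```python
import numpy as np

def cos_sim(mu_i, mu_j):
    """Cosine similarity: dot product over the product of the norms."""
    return float(mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
# identical directions give similarity 1, orthogonal ones give 0
```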
What is the training objective in CosSim?
minimize the deviation of the predicted cosine similarity from the label y_ij, i.e. basically minimize wrong classifications…
How does the siamese network work?
have two identical networks that share weights
-> one embeds the one graph, the other embeds the other graph
-> for training, compute the cossim of both outputs
for inference: we get two embeddings and can then use them to calculate the similarity…
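A sketch of one siamese loss evaluation, assuming a squared-error objective on the cosine similarity (`embed`, `g_i`, `g_j`, `y_ij` are hypothetical placeholders; weight sharing simply means applying the same `embed` function to both inputs):

```python
import numpy as np

def siamese_loss(embed, g_i, g_j, y_ij):
    """Apply the SAME embedding function to both graphs (shared weights),
    then penalize deviation of their cosine similarity from the label."""
    mu_i, mu_j = embed(g_i), embed(g_j)
    sim = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
    return float((sim - y_ij) ** 2)
```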