What two general approaches are there to malware detection?
signature based
data driven
What are features of signature-based malware detection?
reliable
expensive to maintain -> keep signatures up to date
no zero-day detection
can be broken by small changes
What are features of data-driven malware detection?
questionable reliability
can detect zero-days and modified malware
less expensive to maintain
What are the requirements for ML driven malware detection?
large, representative dataset
interpretable model
very low false-positives
able to quickly react to defense by malware
What should a good feature be?
descriptive
dense
transferable between downstream tasks
cheap to compute
How can one conduct feature extraction?
no semantics / count-based
naive semantics
stronger semantics
alternative categorizations:
static vs dynamic
structured vs unstructured
What is the difference between static and dynamic features?
static
-> same feature always has the same meaning (regardless of where it occurs…)
e.g. header fields…
dynamic
-> meaning changes depending on context
e.g. system call sequence…
What is the difference between structured and unstructured feature extraction?
structured:
e.g. header fields
unstructured:
no information on the relation or order of the parts
-> e.g. entropy analysis
What is the difference between count based, weak and strong semantic features?
count-based:
no semantics, simply count occurrences…
weak/naive semantics:
e.g. BoW, tf-idf, n-grams
capture e.g. some co-occurrences, but no real semantic understanding
stronger semantics:
e.g. start from count-based or weak features
create an embedding to introduce more semantic meaning
e.g. embed the whole control flow graph…
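As a toy illustration of count-based vs. weak/n-gram features, a minimal sketch (the opcode sequence and function name are made up for illustration):

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Count n-grams over an opcode sequence: n=1 gives plain
    count-based features, n>=2 captures co-occurrence (weak semantics)."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

# hypothetical opcode sequence from a disassembled sample
seq = ["mov", "add", "mov", "add", "jmp"]
unigrams = opcode_ngrams(seq, n=1)  # count-based: no semantics at all
bigrams = opcode_ngrams(seq, n=2)   # weak semantics: which opcode follows which
```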
How can we embed control flow graphs?
divide the CFG into basic blocks, i.e. code segments with no branches
a basic block is always executed straight through…
for each basic block create a feature vector x_i
use e.g. count-based features, opcode n-grams or tf-idf (we used opcode n-grams)
embedding step
How does the embedding step work in CFG embedding?
the embedding of the whole graph is the sum over the embeddings of its blocks
the embedding of each block is computed iteratively using a NN
the next-iteration embedding of a block is
based on its initial feature vector and the sum of the neighboring embeddings
the more we iterate, the more (further away) neighbors we take into consideration
-> the iteration count is the distance up to which we take other blocks into consideration
=> T-hop neighborhood
What network was proposed for CFG embedding?
W is some matrix with dimensionality [embedding size x feature size]
F is some non-linear NN
that takes as input the sum of the neighboring embeddings
and has output dimensionality p (the embedding size)
=> new block embedding = weighted base feature vector plus the output of the network on the sum of the neighboring embeddings
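The update rule above can be sketched in numpy as follows — a hedged sketch, not the exact proposed architecture: `F` is stubbed here with a ReLU, and the shapes and names are assumptions.

```python
import numpy as np

def embed_cfg(X, adj, W, F, T=3):
    """Iteratively embed a CFG.
    X[v]: initial feature vector of basic block v (length d)
    adj[v]: indices of v's neighboring blocks
    W: [p x d] matrix projecting features into embedding space
    F: non-linear network R^p -> R^p applied to the neighbor sum
    After T iterations each block has seen its T-hop neighborhood."""
    n, p = len(X), W.shape[0]
    mu = np.zeros((n, p))  # initial block embeddings
    for _ in range(T):
        new_mu = np.zeros_like(mu)
        for v in range(n):
            neigh_sum = sum((mu[u] for u in adj[v]), np.zeros(p))
            # weighted base vector plus network output on the neighbor sum
            new_mu[v] = np.tanh(W @ X[v] + F(neigh_sum))
        mu = new_mu
    return mu.sum(axis=0)  # graph embedding = sum over block embeddings
```

Here `np.tanh` and the ReLU stub stand in for whatever non-linearities the actual model uses.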
How is the network trained?
we have a labeled dataset X, Y
for a pair (g_i, g_j): y_ij == 1 if they were compiled from the same source, even for different platforms and different compiler optimization levels
-> the network should thus indicate them to be the same…
use a siamese network and cosine similarity
How is cosine similarity calculated?
cossim of two graphs
=
cos(embedding of first, embedding of second)
=
dot product of the two embeddings
/
product of the lengths (norms) of both embeddings: ||μ_i|| · ||μ_j||
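A minimal numpy sketch of this computation (the embeddings are arbitrary example vectors):

```python
import numpy as np

def cos_sim(mu_i, mu_j):
    """Cosine similarity: dot product over the product of the norms."""
    return float(mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
# identical directions give similarity 1, orthogonal ones give 0
```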
What is the training objective in CosSim?
minimize the deviation of the predicted cosine similarity from the label y_ij, i.e. basically minimize wrong classifications…
How does the siamese network work?
have two identical networks that share weights
-> one embeds the one graph, the other embeds the other graph
-> for training, compute the cossim of both outputs
for inference: we get two embeddings and can then use them to calculate the similarity…
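A sketch of one siamese loss evaluation, assuming a squared-error objective on the cosine similarity (`embed`, `g_i`, `g_j`, `y_ij` are hypothetical placeholders; weight sharing simply means applying the same `embed` function to both inputs):

```python
import numpy as np

def siamese_loss(embed, g_i, g_j, y_ij):
    """Apply the SAME embedding function to both graphs (shared weights),
    then penalize deviation of their cosine similarity from the label."""
    mu_i, mu_j = embed(g_i), embed(g_j)
    sim = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
    return float((sim - y_ij) ** 2)
```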