What are the two basic malware detection approaches?
signature based
data driven
What are features of signature-based malware detection?
reliable
expensive to maintain (signatures must be updated…)
no zero-day detection
can be broken by small changes
What are features of data-driven malware detection?
reliability (?)
can detect zero days and modified malware
less expensive to maintain
What are the requirements on a data-driven malware detection mechanism?
large, representative dataset
-> interpretable model
very low false positives!
able to quickly react to defenses/evasion by malware
What are the pre-processing steps we do for malware detection?
embed the binary in a feature vector
use them for downstream tasks (e.g. calculate similarity…)
What are the properties of a good feature?
descriptive
dense
transferable between downstream tasks
cheap to compute
What potential taxonomies for NLP-style feature extraction exist?
no semantics / count based
naive semantics
stronger semantics
static vs dynamic
(structured vs unstructured)
Give an example for count based features
file header (e.g. Portable Executable (PE) file header)
=> fixed structure and static
entropy analysis
-> compute byte n-grams
calculate Shannon entropy with a sliding window…
dynamic:
analysis of system calls -> observe sequences…
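The entropy step above can be sketched in Python; the `window` and `step` sizes are illustrative choices, not values fixed by the source:

```python
import math

def shannon_entropy(window: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, up to 8.0."""
    counts = [0] * 256
    for b in window:
        counts[b] += 1
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def entropy_profile(data: bytes, window: int = 256, step: int = 128):
    """Slide a window over the binary and record entropy per position.
    High-entropy regions often indicate packed or encrypted payloads."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, max(len(data) - window, 0) + 1, step)]
```

The resulting profile (one entropy value per window position) can be used directly as a count-based feature vector.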
What is an example for weak semantics?
understand what certain values might mean
-> e.g. string representation of the disassembly… (where we can for example find the word "password"…)
get sequences of tokens with e.g. n-grams
=> but no deep understanding of what is happening…
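A minimal sketch of such weak-semantics features: extracting printable strings (like the Unix `strings` tool) plus counting byte n-grams; the `min_len` default is an illustrative choice:

```python
import re
from collections import Counter

def printable_strings(data: bytes, min_len: int = 4):
    """Extract runs of printable ASCII, like the Unix `strings` tool."""
    pattern = rb"[ -~]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

def byte_ngrams(data: bytes, n: int = 2):
    """Count overlapping byte n-grams (a simple count-based feature)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```

Note that these features only surface suspicious tokens (e.g. "password"); they carry no understanding of how the values are actually used.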
How can we create strong semantics?
embed things and learn the structure of them (e.g. similar to word2vec)
How can we embed programs with strong semantics?
embed each individual code block in the control flow graph (e.g. using count-based features, n-grams, …)
use the individual code-block embeddings to learn a representation of the control flow…
How do we actually calculate control flow embedding?
embed individual code blocks
recursively refine these embeddings by learning about neighbor relations
calculate sum of individual embeddings
How can we recursively learn about neighbor relations in control flow embedding?
recursive embedding of block i at step t+1
= output of a model on (embedding of block i, sum over recursive embeddings of i's neighbors at step t)
where the recursive embedding of all blocks at step 0 is the null vector…
What neural network was proposed to use for CFG embedding?
tanh(Wx_i + F(sum over neighbor rec embeddings))
where F is some non-linear NN
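The recursive update tanh(W x_i + F(sum of neighbor embeddings)) can be sketched in numpy; the dimensions, the single-ReLU-layer choice for F, and the random weight initialization are illustrative assumptions, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 8                            # illustrative: raw feature dim, embedding dim
W = rng.normal(size=(p, d)) * 0.1      # maps block features into embedding space
P1 = rng.normal(size=(p, p)) * 0.1     # parameters of F (assumed: one ReLU layer)

def F(z):
    """Assumption: F is a single non-linear (ReLU) layer."""
    return np.maximum(P1 @ z, 0.0)

def embed_cfg(X, adj, T=3):
    """X: (num_blocks, d) block feature vectors; adj: {block: [neighbor blocks]}.
    Runs T rounds of neighbor aggregation, then sums the block embeddings."""
    n = X.shape[0]
    mu = np.zeros((n, p))              # mu^(0) = null vector for every block
    for _ in range(T):
        new = np.empty_like(mu)
        for i in range(n):
            nb = sum((mu[j] for j in adj.get(i, [])), np.zeros(p))
            new[i] = np.tanh(W @ X[i] + F(nb))   # mu_i^(t+1)
        mu = new
    return mu.sum(axis=0)              # graph-level embedding
```

The final sum over block embeddings is the graph embedding f(g) used for similarity comparisons below.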
What data is used to train CFG embedding?
X = [(g1, g2), (g1, g3),…]
Y = [1,0,1,…]
=> train on pairs of binary functions
=> where y=1 indicates that they were compiled from the same source code,
but for different platforms using different optimization levels…
What loss do we use in CFG embedding?
cosine similarity, trained via a siamese network
CosSim(g, g’) = cos(f(g), f(g’)) = <f(g), f(g’)> / (||f(g)|| * ||f(g’)||)
where f(g) = embedding of graph g…
=> use this to minimize the sum over all training pairs of (CosSim(g, g’) - y)^2
where the arguments are W and the parameters of the NN (parameters for the basic blocks and parameters of the NN for the recursively embedded neighbors)…
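The siamese objective above can be sketched as follows (hypothetical helper names; `emb_g` stands for a precomputed graph embedding f(g)):

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity: <u, v> / (||u|| * ||v||), bounded in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_loss(emb_g, emb_g2, y):
    """Squared error between the pair's cosine similarity and its label y."""
    return (cos_sim(emb_g, emb_g2) - y) ** 2
```

Training minimizes the sum of `pair_loss` over all labeled pairs with respect to the embedding parameters.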
Why use cosSim ?
bounded: maximum similarity gives CosSim = 1, so the loss (CosSim - y)^2 -> 0 for a matching pair
and it does not get arbitrarily large for dissimilar pairs…