What are the two basic malware detection approaches?
signature based
data driven
What are features of signature-based malware detection?
reliable
expensive to maintain (signatures must be updated…)
no zero-day detection
can be broken by small changes
What are features of data-driven malware detection?
reliability (?)
can detect zero days and modified malware
less expensive to maintain
What are the requirements on a data-driven malware detection mechanism?
large, representative dataset
-> interpretable model
very low false positives!
able to quickly react to defenses/evasion by malware
What are the pre-processing steps we do for malware detection?
embed the binary in a feature vector
use them for downstream tasks (e.g. calculate similarity…)
What are the properties of a good feature?
descriptive
dense
transferable between downstream tasks
cheap to compute
What potential taxonomies for NLP-style feature extraction exist?
no semantics / count based
naive semantics
stronger semantics
static vs dynamic
(structured vs unstructured)
Give an example for count based features
file header (e.g. Portable Executable (PE) file header)
=> fixed structure and static
entropy analysis
-> compute byte n-grams
calculate Shannon entropy with a sliding window…
dynamic:
analysis of system calls -> observe sequences…
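The entropy step above can be sketched in Python; the `window` and `step` sizes are illustrative choices, not values fixed by the source:

```python
import math

def shannon_entropy(window: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, up to 8.0."""
    counts = [0] * 256
    for b in window:
        counts[b] += 1
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def entropy_profile(data: bytes, window: int = 256, step: int = 128):
    """Slide a window over the binary and record entropy per position.
    High-entropy regions often indicate packed or encrypted payloads."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, max(len(data) - window, 0) + 1, step)]
```

The resulting profile (one entropy value per window position) can be used directly as a count-based feature vector.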
What is an example for weak semantics?
understand what certain values might mean
-> e.g. string representation of the disassembly… (where we can for example find the word "password"…)
get sequences of tokens with e.g. n-grams
=> but no deep understanding of what is happening…
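A minimal sketch of such weak-semantics features: extracting printable strings (like the Unix `strings` tool) plus counting byte n-grams; the `min_len` default is an illustrative choice:

```python
import re
from collections import Counter

def printable_strings(data: bytes, min_len: int = 4):
    """Extract runs of printable ASCII, like the Unix `strings` tool."""
    pattern = rb"[ -~]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

def byte_ngrams(data: bytes, n: int = 2):
    """Count overlapping byte n-grams (a simple count-based feature)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```

Note that these features only surface suspicious tokens (e.g. "password"); they carry no understanding of how the values are actually used.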
How can we create strong semantics?
embed things and learn the structure of them (e.g. similar to word2vec)
How can we embed programs with strong semantics?
embed each individual code block in the control flow graph (e.g. using count-based features, n-grams, …)
use the individual code-block embeddings to learn a representation of the control flow…
How do we actually calculate control flow embedding?
embed individual code blocks
recursively refine these embeddings by learning about neighbor relations
calculate sum of individual embeddings
How can we recursively learn about neighbor relations in control flow embedding?
recursive embedding of block i at step t+1
= output of a model on (embedding of block i, sum over recursive embeddings of i's neighbors at step t)
where the recursive embedding of all blocks at step 0 is the null vector…
What neural network was proposed to use for CFG embedding?
tanh(Wx_i + F(sum over neighbor rec embeddings))
where F is some non-linear NN
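The recursive update tanh(W x_i + F(sum of neighbor embeddings)) can be sketched in numpy; the dimensions, the single-ReLU-layer choice for F, and the random weight initialization are illustrative assumptions, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 8                            # illustrative: raw feature dim, embedding dim
W = rng.normal(size=(p, d)) * 0.1      # maps block features into embedding space
P1 = rng.normal(size=(p, p)) * 0.1     # parameters of F (assumed: one ReLU layer)

def F(z):
    """Assumption: F is a single non-linear (ReLU) layer."""
    return np.maximum(P1 @ z, 0.0)

def embed_cfg(X, adj, T=3):
    """X: (num_blocks, d) block feature vectors; adj: {block: [neighbor blocks]}.
    Runs T rounds of neighbor aggregation, then sums the block embeddings."""
    n = X.shape[0]
    mu = np.zeros((n, p))              # mu^(0) = null vector for every block
    for _ in range(T):
        new = np.empty_like(mu)
        for i in range(n):
            nb = sum((mu[j] for j in adj.get(i, [])), np.zeros(p))
            new[i] = np.tanh(W @ X[i] + F(nb))   # mu_i^(t+1)
        mu = new
    return mu.sum(axis=0)              # graph-level embedding
```

The final sum over block embeddings is the graph embedding f(g) used for similarity comparisons below.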
What data is used to train CFG embedding?
X = [(g1, g2), (g1, g3),…]
Y = [1,0,1,…]
=> train on pairs of binary functions
=> where y=1 indicates that they were compiled from the same source code,
but for different platforms using different optimization levels…
What loss do we use in CFG embedding?
cosine similarity, trained via a siamese network
CosSim(g, g’) = cos(f(g), f(g’)) = <f(g), f(g’)> / (||f(g)|| * ||f(g’)||)
where f(g) = embedding of graph g…
=> use this to minimize the sum over all training pairs of (CosSim(g, g’) - y)^2
where the arguments are W and the parameters of the NN (parameters for the basic blocks and parameters of the NN for the recursively embedded neighbors)…
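The siamese objective above can be sketched as follows (hypothetical helper names; `emb_g` stands for a precomputed graph embedding f(g)):

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity: <u, v> / (||u|| * ||v||), bounded in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_loss(emb_g, emb_g2, y):
    """Squared error between the pair's cosine similarity and its label y."""
    return (cos_sim(emb_g, emb_g2) - y) ** 2
```

Training minimizes the sum of `pair_loss` over all labeled pairs with respect to the embedding parameters.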
Why use cosSim ?
bounded: maximum similarity gives CosSim = 1, so the loss (CosSim - y)^2 -> 0 for a matching pair
and it does not get arbitrarily large for dissimilar pairs…