What are some examples of malware (private and state)?
private:
botnet rentals
ransom ware
sales of stolen identities, credit card numbers, …
phishing and advertisement
state actors:
stuxnet
What are the two basic detection approaches for malware?
signature based
vs
data driven
What is signature based malware detection ? What are the attributes of it?
Learn signatures from known malware (e.g. regular antivirus…)
-> reliable
-> expensive to maintain (find and include newly detected malware…)
-> no zero-day detection
can be broken by small changes (in the malware)
What is data driven malware detection? What are the attributes of it?
e.g. anomaly detectoin based on exchanged data…
-> reliability (?)
can detect zero-days and modified malware
less expensive to maintain (?)
What are the requirements of ML-driven malware detection?
large, representative datasets
interpretable model
very low false-positives
able to quickly react to defense by malware
When is a feature a good feature?
descriptive
dense
transferable between downstream tasks
cheap to compute
What is a taxonomy for feature extractino?
no semantics / count based
naive semantics
stronger semantics
alternative:
static vs dynamic
structured vs unstructured
Give an example for count-based features
PE (portable executable) - header
structured, static
static:
we have no sequence -> e.g. only finite values in the header…
structured:
we have knowledge of structuer (e.g. different fields)
-> can count several things like number of…
What is entropy analysis?
count based feature
static, structureless
-> again static, as we have no sequence of values
-> structureless, as we have no underlying assumption of the strucutre fo the data…
-> byte n-grams (look at n successive bytes)
-> sliding window over data to compute ‘local’ shannon entropy (dark = low entropy, light = high entropy)
-> e.g. use ‘strings’ (5+consecutive bytes with ascii values…)
Give an exmaple of a dynamic count-based feature
observe sys call sequence
When can analysing sys calls be problematic?
anti-debug
anti-hooking
metods are present
Example of features with weak semantics
disassembly
-> have knowledge of e.g. sequence of different opcodes (after disassembling them)
-> but no “clear / strong” meangin that can be easily identified…
What are for example stronger semantics?
correlate e.g. different blocks of opcodes…
with branches…
and structure
How can one create features with strong semantics from binary?
disassemble
have opcodes…
-> for each basic block (code segment with no branches / straigth execution)
-> create feature vector containing e.g.
count based featuers
opcode n-grams
tf-idf…..
=> embed these basic blocks to have stronger semantics…
=> the strong semantic is the control flow graph of the binary…
-> use for binary similarity detection…
What can binary similarity detection be used fro?
vulnerability detection
e.g. cross platform vulnarabilities in firmware…
malware detection
plagiraism detecoitn
What is the code similarity embedding problem?
geven two binariy functions f1, f2, we require oracle
oracle pi := pi(f1, f2) =
1 if f1, f2 are similar
-1 if f1, f2 are dissimilar
oracle specific to each task and typically unknown
=> objective of problem:
find mapping that maps the code function graph of a function f to a vector representation my
Whys use embedding in code similartiy?
allows for efficient computation
-> inexpensive similarity funciton between two cevtors can be applied…
What is a way to learn the mapping funciton? (mapping f to a vector representaiton my)
What are advantages
use a neural network
-> easily adaptable (can be re-trained)
-> efficiency as does not rely on graph matching algorithm
How is the graph embedding done?
structure2vec
-> have basic blocks and vertices
basic blocks are in feature vector representation (e.g. featuers we extracted like count based ones, opcode n-grams, tf-idf,…)
-> embed control flow graph by aggregating individual embeddings of basic blocks
=> we use simple sum of the vectors…
How do we embed basic blocks?
my i (basic block embedding)
-> recursive function
-> my_i(t+1) = NN (x_i, sum over neighbors of my_i(t))
where my_0 = 0-vector
What is the effect of embedding neighbors recursively=
propagates the featuers x_i of each basic block throughout the network…
-> after T iterations -> the embedding of my_i (embedding of basic block i) contains informations about its T-hop neighbors…
What neural network do we use?
tanh(W x_i + F(Sum of neighbor embeddings my(t))
where F is some non-linear NN
What dataset do we use to train our model? For what purpose?
e.g. find wether same binary for other platforms
-> X=[(g1,g2), (g1,g3), …]
-> Y = [1,0,…
=> if gi,gj binary functions compiled from same source code, than y = 1; else y = 0
=> meaning: same binary from different assemblers…
What loss do we use?
Cosine similarity
-> cosine on model output of both functions
-> CisSim (g, g’) = cos(ML(g), ML(g’))
training objective:
arg min _ {W, Sigma} Sum_{i=1 to K}
(Sim(g_i, g_i’) - y_i)^2
actual calculation:
calculate vector product of both
divide by product of vector distance (sqrt of sum of squares)
minus y value
square all
Last changed2 years ago