What are the two approaches to creating spam classifiers?
create features by hand -> then use them in, e.g., Naive Bayes
learn features from the data, e.g. TF-IDF
How does the probability chain rule work?
Probability that A1, …, An all occur at the same time
-> product of the probabilities of each Ai conditioned on the other As having already occurred
-> simply expand the left side step by step
P(A,B,C) = P(A|B,C) * P(B|C) * P(C)
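Written out for n events, the same expansion gives the general form of the chain rule:
\[ P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_{i+1}, \dots, A_n) = P(A_1 \mid A_2, \dots, A_n) \cdot P(A_2 \mid A_3, \dots, A_n) \cdots P(A_n) \]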
What does Bayes' theorem state?
P(A,B) = P(B,A)
-> expanding both sides with the chain rule:
P(A|B) * P(B) = P(B|A) * P(A)
-> solving for P(A|B) gives P(A|B) = P(B|A) * P(A) / P(B)
What are the different elements of Bayes' theorem called?
P(F) -> Evidence
P(F|S) -> Likelihood
P(S) -> Prior
P(S|F) -> Posterior
(here S = spam, F = the observed features)
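Putting the named pieces together for the spam setting:
\[ \underbrace{P(S \mid F)}_{\text{Posterior}} = \frac{\overbrace{P(F \mid S)}^{\text{Likelihood}} \cdot \overbrace{P(S)}^{\text{Prior}}}{\underbrace{P(F)}_{\text{Evidence}}} \]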
How much space would we need to store P(F1, F2, …, Fn | S)?
table form with all possible combinations…
-> O(2^(n+1))
-> 2^n for all combinations of the n binary features
-> times 2 for the two cases S and not S…
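As a rough illustration of the growth (assuming binary features): with n = 30 features the full table already needs 2^31 ≈ 2.1 billion entries, and every additional feature doubles that.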
What is the problem with storing all combinations? How is it solved?
infeasible for large n…
=> use the naive assumption that all Fi are conditionally independent of each other given S
How does the naive assumption work?
When, in P(F1, F2, …, Fn | S), all Fi are conditionally independent of each other given S
-> the term simply resolves to a product
Prod over all i of P(Fi | S)…
=> requires storing only 2n values (one per feature Fi, for each of S and not S; see the sketch below)
-> as P(not Fi | S) can simply be calculated as 1 - P(Fi | S)…
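A minimal sketch of how the 2n stored likelihoods are combined with the prior; the feature probabilities and variable names below are made-up illustrations, not values from the card:

```python
# Minimal Naive Bayes scoring sketch (binary features, illustrative numbers).
# Stored parameters: the prior P(S) and, per feature i, P(Fi=1 | S) and P(Fi=1 | not S).
prior_spam = 0.3
p_feat_given_spam = [0.8, 0.1, 0.6]   # P(Fi=1 | S)
p_feat_given_ham = [0.2, 0.3, 0.4]    # P(Fi=1 | not S)

def joint_score(features, prior, p_feat_given_class):
    """Unnormalized posterior: P(class) * prod_i P(Fi | class)."""
    score = prior
    for f, p in zip(features, p_feat_given_class):
        # P(Fi=0 | class) = 1 - P(Fi=1 | class), which is why only 2n values are stored
        score *= p if f else (1 - p)
    return score

x = [1, 0, 1]  # observed feature vector of an incoming mail
spam_score = joint_score(x, prior_spam, p_feat_given_spam)
ham_score = joint_score(x, 1 - prior_spam, p_feat_given_ham)
print("spam" if spam_score > ham_score else "not spam")
```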
Why don't we simply make the same assumption for the denominator of the Bayes term?
-> would need to introduce another independence assumption
-> would need to compute and store another n values
-> instead, simply use it as a normalization constant…
How do we handle the denominator as a normalization constant?
simply compare P(S|…) and P(not S|…)
-> as we only want to know which one is larger
-> thus, the denominator cancels out… (worked out below)
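Written out, the shared evidence term divides out when comparing the two posteriors:
\[ \frac{P(S \mid F_1, \dots, F_n)}{P(\neg S \mid F_1, \dots, F_n)} = \frac{P(F_1, \dots, F_n \mid S) \, P(S) \, / \, P(F_1, \dots, F_n)}{P(F_1, \dots, F_n \mid \neg S) \, P(\neg S) \, / \, P(F_1, \dots, F_n)} = \frac{P(F_1, \dots, F_n \mid S) \, P(S)}{P(F_1, \dots, F_n \mid \neg S) \, P(\neg S)} \]
If this ratio is greater than 1, the mail is classified as spam.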
Why is Naive Bayes useful compared to simply throwing a NN at the problem?
new training data can simply be incorporated
adding new features is easy
=> Naive Bayes has unique properties w.r.t. online learning… (update sketch below)
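A sketch of why new training data is cheap to incorporate when the likelihoods come from counts; the count-based estimator shown here is an assumption for illustration, the card does not specify how the probabilities are estimated:

```python
from collections import defaultdict

# Keep raw counts; probabilities are derived on demand, so one newly labelled
# mail only increments a handful of counters (online learning).
class_counts = defaultdict(int)                          # N(spam), N(ham)
feature_counts = defaultdict(lambda: defaultdict(int))   # N(Fi=1 and class)

def update(features, label):
    """Incorporate one newly labelled mail (features: list of 0/1, label: 'spam' or 'ham')."""
    class_counts[label] += 1
    for i, f in enumerate(features):
        if f:
            feature_counts[label][i] += 1

def p_feat_given_class(i, label):
    """Current estimate of P(Fi=1 | label) from the counts."""
    return feature_counts[label][i] / class_counts[label]

update([1, 0, 1], "spam")
update([0, 0, 1], "ham")
update([1, 1, 0], "spam")
print(p_feat_given_class(0, "spam"))  # 2 of 2 spam mails had F1=1 -> 1.0
```

Adding a new feature later just means starting new counters for it, which is why that is easy as well.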
Why don't we calculate P(S|F1,…,Fn) directly?
there is no rule to factorize it directly (the naive independence assumption only applies to P(F1,…,Fn | S)), so Bayes' theorem is used to rewrite it in terms we can factorize…