Based on what are spam classifiers built?
a dataset of (labeled) emails
What approaches exist for building a spam classifier?
hand-crafted features -> naive bayes
learned features -> tf-idf (term frequency - inverse document frequency), word embeddings
What is the intuition of naive bayes?
determine a set of features (e.g. email contains word A, contains a link, …)
-> calculate on the dataset the probability P(features | is spam)
-> use this later on to calculate the inverse probability P(spam | features)
-> do this for all feature combinations and use this information later to classify spam…
What is P(A,B)?
probability that A and B occur simultaneously
How can one calculate P(A,B)?
chain rule:
P(An, …, A1) = Product from k=1 to n of P(Ak | Ak-1, …, A1)
e.g. P(1,2,3) = P(1|2,3) * P(2|3) * P(3)
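As an illustration in the spam setting from above (a hypothetical ordering of two features F1, F2 and the class S), the chain rule gives: P(F1, F2, S) = P(F1 | F2, S) * P(F2 | S) * P(S).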
Base Bayes Rule
P(A,B) = P(A|B) * P(B) = P(B|A) * P(A)
=> this lets us invert a conditional probability if we only know one direction…
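Toy numeric check (made-up numbers for illustration): if P(contains link | spam) = 0.6, P(spam) = 0.2 and P(contains link) = 0.3, then P(spam | contains link) = 0.6 * 0.2 / 0.3 = 0.4.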
How can bayes rule be used?
we want to know p(is spam | contains link)
and know
p(contains link | is spam)
=> obtained by counting on the dataset…
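A minimal counting sketch (the dataset, feature test and function name are made up for illustration, not the course's code):

# estimate P(contains_link | spam) by counting over a labeled dataset
def estimate_link_given_spam(emails):
    # emails: list of (text, is_spam) pairs
    spam_texts = [text for text, is_spam in emails if is_spam]
    with_link = [text for text in spam_texts if "http" in text]  # crude "contains a link" feature
    return len(with_link) / len(spam_texts)

emails = [
    ("win money now http://example.com", True),
    ("meeting at 10am", False),
    ("cheap pills http://spam.example", True),
    ("lunch tomorrow?", False),
]
print(estimate_link_given_spam(emails))  # 1.0 on this toy data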
Terminology for Feature F and Class S
p(F1, …, Fn) = evidence (overall probability of the features occurring…)
p(F1, …, Fn | S) = likelihood
P(S) = prior
P(S | F1, …, Fn) = posterior
What is a drawback of naive bayes?
if n is large (num of features)
-> we require a table of size 2^(n+1) to store all possible feature/class occurrence combinations…
=> infeasible for large n
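For a sense of scale: with n = 30 binary features that would already be 2^31 ≈ 2.1 billion table entries.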
When are two variables conditionally independent?
p(F1, F2 | S) = p(F1 | S) * p(F2 | S)
(for contrast, plain independence would be p(F1, F2) = p(F1) * p(F2); neither implies the other)
What to assume to make bayes computationally feasible for large n?
assume that the Fi are conditionally independent given S, even if that may not be the case…
=> an oversimplification, formally unjustified
=> but it helps to make everything computable
Why is naive bayes called naive?
because we make the naive assumption that the Fi are conditionally independent given S…
How does bayes theorem change for conditional independency?
p(S|F1,…,Fn) = (p(F1,…,Fn|S) * p(S)) / p(F1,…,Fn)
to
p(S|F1,…,Fn) = (Product_i p(Fi|S)) * p(S) / p(F1,…,Fn)
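Written out for n = 2 (two features): p(S | F1, F2) = p(F1|S) * p(F2|S) * p(S) / p(F1, F2).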
How does complexity change due to naivety?
2n values -> as each of the n features can be 1 or 0
but 4n values -> as we have to store them for both S and not S…
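For a rough comparison: with n = 20 features this means storing 4 * 20 = 80 conditional probabilities instead of a joint table with 2^21 ≈ 2 million entries.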
What is problematic about the denominator p(F1,…,Fn)?
also computationally infeasible, as it again requires 2^n values
-> assume independence again?
=> disadvantageous, as we would need to introduce another independence assumption
=> and because we would need to compute and store another n values…
How can one overcome the difficulty of the denominator?
compare P(S|F1,…,Fn) and p(not S|F1,…,Fn)
=> the denominator is the same in both cases and thus acts only as a normalisation constant
=> as we are only interested in whether the one is bigger than the other…
=> it can more or less be disregarded…
so basically only compute the numerators and compare…
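A minimal classification sketch along these lines (binary features; the probability values and feature names are assumed for illustration):

# classify by comparing the unnormalized numerators for S and not-S;
# the denominator p(F1,...,Fn) is identical for both and is dropped
def classify(features, likelihoods, prior_spam):
    # features: 0/1 vector; likelihoods[i] = (p(Fi=1 | spam), p(Fi=1 | not spam))
    score_spam, score_ham = prior_spam, 1.0 - prior_spam
    for f, (p_spam, p_ham) in zip(features, likelihoods):
        score_spam *= p_spam if f else (1.0 - p_spam)
        score_ham *= p_ham if f else (1.0 - p_ham)
    return score_spam > score_ham  # True = classified as spam

# toy values: F1 = "contains link", F2 = "contains word 'win'"
print(classify([1, 1], [(0.7, 0.1), (0.6, 0.05)], prior_spam=0.3))  # True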
Why is naive bayes useful, why not just use another supervised learning algorithm?
unique properties w.r.t. learning
=> adding new training data
=> adding new features
How do we store training data?
columns = [features, sum]
rows = [counts where class = true (e.g. spam), counts over all instances]
=> 2*(n+1) entries
=> allows us to calculate all required values (e.g. count(not F1 and S) = count(S) - count(F1 and S)…)
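A minimal sketch of such a count table (layout and numbers are assumed for illustration):

# one row of counts for the positive class (spam) and one over all instances,
# one column per feature plus a "sum" column -> 2*(n+1) entries for n = 2
counts = {
    "spam":  {"F1": 40, "F2": 25, "sum": 60},   # toy counts
    "total": {"F1": 55, "F2": 30, "sum": 200},
}

# everything else is derivable, e.g. count(not F1 and spam):
not_f1_and_spam = counts["spam"]["sum"] - counts["spam"]["F1"]   # 60 - 40 = 20

# ... as are the probabilities naive bayes needs:
p_f1_given_spam = counts["spam"]["F1"] / counts["spam"]["sum"]   # 40 / 60
p_spam = counts["spam"]["sum"] / counts["total"]["sum"]          # 60 / 200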
How can we add training data?
simply add the new training instances to the table by incrementing the count of each situation…
(e.g. F1 yes, F2 no, spam yes
-> increment the corresponding counters…)
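A minimal update sketch (same assumed table layout as in the sketch above):

counts = {
    "spam":  {"F1": 40, "F2": 25, "sum": 60},
    "total": {"F1": 55, "F2": 30, "sum": 200},
}

# adding a new labeled instance only increments the affected counters,
# no retraining from scratch is needed
def add_instance(counts, features, is_spam):
    # features: dict like {"F1": 1, "F2": 0}
    counts["total"]["sum"] += 1
    if is_spam:
        counts["spam"]["sum"] += 1
    for name, value in features.items():
        if value:
            counts["total"][name] += 1
            if is_spam:
                counts["spam"][name] += 1

add_instance(counts, {"F1": 1, "F2": 0}, is_spam=True)
# counts["spam"] is now {"F1": 41, "F2": 25, "sum": 61}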
Why not compute p(S|F1,…,Fn) directly?
THERE IS NO FACTORIZATION RULE
where p(S|F1,…,Fn) = product p(S|Fi)…