Data Mining & Machine Learning Final Project
Group 2
R95922027
R95922034
R95922081
R95942129
Outline
Experiment Setting
Enron
Removing HTML tags
Factoring important headers
Six folders, enron1 to enron6, containing 13,496 spam mails and 15,045 ham mails in total
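The HTML-removal step could look roughly like the sketch below. It is only an illustration using Python's standard `html.parser`; the helper name `strip_html` is hypothetical and the slides do not say which tool was actually used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text of `raw` with tags dropped and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical mail body, for illustration only.
body = "<html><body><p>Cheap   meds</p><a href='http://x'>click here</a></body></html>"
clean = strip_html(body)
```

Text nodes are joined with a space so that words from adjacent tags do not run together.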
Feature Extraction
Transmitted Time of the Mail & Number of Receivers
Spam: Non-uniform Distribution
Spam: Only a Single Receiver
Probability of being Spam for Transmitted Time & Receiver Size
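One way to estimate such probabilities is by relative frequency within each bucket (hour of day, or single versus multiple receivers). The sketch below assumes each mail has already been reduced to a small record; the field names `hour` and `receivers` and the toy records are hypothetical, not taken from the Enron data.

```python
from collections import defaultdict


def spam_probability_by(records, key):
    """Estimate P(spam | key(features)) by relative frequency.

    `records` is an iterable of (features, is_spam) pairs.
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for features, is_spam in records:
        bucket = key(features)
        totals[bucket] += 1
        if is_spam:
            spam[bucket] += 1
    return {bucket: spam[bucket] / totals[bucket] for bucket in totals}


# Hypothetical toy records: (hour sent, number of receivers), spam flag.
toy = [
    ({"hour": 3, "receivers": 1}, True),
    ({"hour": 3, "receivers": 1}, True),
    ({"hour": 3, "receivers": 4}, False),
    ({"hour": 10, "receivers": 1}, False),
    ({"hour": 10, "receivers": 5}, False),
    ({"hour": 10, "receivers": 1}, True),
]

by_hour = spam_probability_by(toy, lambda f: f["hour"])
by_single = spam_probability_by(toy, lambda f: f["receivers"] == 1)
```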
Attachments, Images, and URLs
Symbols in Mail Titles
Title absence: spam senders add titles now.
Arabic numerals: almost equal probability in spam and ham (dates, IDs).
Non-alphanumeric characters & punctuation marks:
Some appear more often in spam
Others appear more often in ham
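The title features above could be extracted along these lines. This is a minimal sketch; the feature names are hypothetical, and "punctuation" here means ASCII punctuation as defined by Python's `string.punctuation`.

```python
import string


def title_features(title):
    """Return simple symbol-based features for a mail title."""
    stripped = title.strip()
    return {
        # True when the mail has no title at all
        "missing": stripped == "",
        # Arabic numerals (dates, IDs, ...)
        "digits": sum(ch.isdigit() for ch in stripped),
        # ASCII punctuation marks
        "punctuation": sum(ch in string.punctuation for ch in stripped),
        # Any other non-alphanumeric, non-space character
        "other_symbols": sum(
            not ch.isalnum()
            and not ch.isspace()
            and ch not in string.punctuation
            for ch in stripped
        ),
    }


ham_like = title_features("Re: meeting 2007/01/15")
spam_like = title_features("FREE $$$ !!!")
```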
Mail-body
Naïve Bayes
Given a bag of words (x1, x2, x3, ..., xn), Naïve Bayes is powerful for document classification.
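As an illustration, a bag-of-words Naïve Bayes classifier picks the class c that maximizes P(c) times the product of P(xi | c). The sketch below is a multinomial variant with add-one smoothing, which is an assumption; the slides do not state which variant or smoothing was used, and the toy mails are made up.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over bags of words with add-one smoothing."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for words, label in zip(documents, labels):
            self.word_counts[label].update(words)
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    def log_score(self, words, label):
        """Return log P(label) + sum_i log P(x_i | label)."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocabulary)
        for w in words:
            if w in self.vocabulary:  # ignore words never seen in training
                score += math.log((counts[w] + 1) / denom)
        return score

    def predict(self, words):
        return max(self.classes, key=lambda c: self.log_score(words, c))


# Hypothetical toy mails, already tokenized.
train_docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "schedule", "monday"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Working in log space avoids numerical underflow when a mail contains many words.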
Vector Space Model
Create a word-document (mail) matrix with SRILM.
For every pair of mails (columns), a similarity value can be calculated.
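The slides do not name the similarity measure. Cosine similarity is a common choice in the vector space model, so the sketch below assumes it; the small matrix is a made-up example with words as rows and mails as columns.

```python
import math


def column(matrix, j):
    """Return column j (one mail) of a word-document matrix."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word-document matrix: rows = words, columns = mails.
matrix = [
    [2, 1, 0],  # "cheap"
    [1, 1, 0],  # "buy"
    [0, 0, 3],  # "meeting"
]

sim_01 = cosine_similarity(column(matrix, 0), column(matrix, 1))
sim_02 = cosine_similarity(column(matrix, 0), column(matrix, 2))
```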
KNN (Vector Space Model)
With K = 1, the KNN classification model shows the best accuracy.
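A minimal sketch of KNN classification over mail vectors follows, assuming cosine similarity as the measure (the slides do not specify it). With the default `k=1` the query simply takes the label of its most similar training mail. The vectors and labels are toy values.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(train_vectors, train_labels, query, k=1):
    """Label `query` by majority vote among its k most similar training mails."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(train_vectors[i], query),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training mails as word-count vectors.
train_vectors = [[2, 1, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2]]
train_labels = ["spam", "spam", "ham", "ham"]

nearest_spam = knn_predict(train_vectors, train_labels, [1, 0, 0])
nearest_ham = knn_predict(train_vectors, train_labels, [0, 0, 1], k=3)
```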
Maximum Entropy
The elements of the word-document matrix are converted to binary values {0, 1}.
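For two classes, a maximum entropy model over binary features has the same form as logistic regression. The sketch below trains such a model by stochastic gradient ascent on the log-likelihood. The training procedure, learning rate, and toy data are all assumptions for illustration; the slides do not describe the toolkit or optimizer that was used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(features, labels, epochs=200, rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    `features` holds one binary vector per mail; `labels` holds 1 (spam) or 0 (ham).
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p  # gradient of the log-likelihood w.r.t. the score
            bias += rate * error
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def spam_probability(weights, bias, x):
    """Return the model's P(spam | x)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical binary word-presence vectors, one per mail.
features = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]
weights, bias = train_maxent(features, labels)
```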
SVM
Binary:
Use a binary value {0, 1} to indicate whether a word appears.
Normalized:
Count the occurrences of each word and divide each count by the maximum occurrence count.
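The two feature representations can be sketched as follows. The slide does not say whether the maximum is taken per word or per mail; this sketch assumes per word, that is, each row of the word-document matrix is divided by that row's largest count. The matrix is a toy example.

```python
def binary_features(matrix):
    """Replace every count with 1 if the word appears in the mail, else 0."""
    return [[1 if count > 0 else 0 for count in row] for row in matrix]


def normalized_features(matrix):
    """Divide each word's counts (one row) by that word's maximum count."""
    result = []
    for row in matrix:
        peak = max(row, default=0)
        # A word that never occurs keeps an all-zero row.
        result.append([count / peak if peak else 0.0 for count in row])
    return result


# Hypothetical word-document matrix: rows = words, columns = mails.
counts = [
    [2, 1, 0],
    [0, 0, 4],
    [0, 0, 0],
]

binary = binary_features(counts)
normalized = normalized_features(counts)
```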
Single-Layer Perceptron Hybrid Model
The accuracy of the NN-based hybrid model is always the highest.
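One plausible reading of the NN-based hybrid is a single-layer perceptron whose inputs are the 0/1 outputs of the base classifiers and whose target is the true label. The sketch below follows that reading, which is an assumption; the inputs, learning rate, and epoch count are illustrative only.

```python
def train_perceptron(inputs, targets, epochs=50, rate=0.1):
    """Train a single-layer perceptron with the classic perceptron rule."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            output = 1 if activation > 0 else 0
            delta = target - output  # zero when the prediction is correct
            bias += rate * delta
            weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
    return weights, bias


def combine(weights, bias, x):
    """Return 1 (spam) or 0 (ham) from the base classifiers' outputs `x`."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical outputs of three base classifiers per mail (1 = spam),
# where the first classifier happens to be the reliable one.
base_outputs = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
true_labels = [1, 1, 0, 0, 1, 0]
weights, bias = train_perceptron(base_outputs, true_labels)
```

Unlike plain voting, the learned weights let the combiner trust a reliable classifier even when the other two disagree with it.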
Committee-based Hybrid Model
The voting model averages the classification results, which slightly improves the filter's performance. However, voting can sometimes reduce accuracy when the majority of the classifiers misjudge a mail.
KNN + Naïve Bayes + Maximum Entropy
Naïve Bayes + Maximum Entropy + SVM
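A committee vote over three classifiers can be sketched as an average of 0/1 votes compared against a threshold. The 0.5 threshold is an assumption; with three voters it is equivalent to a simple majority. The last example shows how a wrong majority overrides a correct minority.

```python
def committee_vote(votes, threshold=0.5):
    """Average 0/1 votes (1 = spam) and return 1 when the mean exceeds `threshold`."""
    return 1 if sum(votes) / len(votes) > threshold else 0


# Hypothetical votes from three classifiers for three mails.
agree_spam = committee_vote([1, 1, 0])  # two of three say spam
agree_ham = committee_vote([0, 0, 0])   # unanimous ham

# Suppose this mail is truly spam but only one classifier catches it:
# the two mistaken voters win and the committee answers ham.
misjudged = committee_vote([0, 0, 1])
```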
Conclusion
Reference
