Data Mining & Machine Learning Final Project

file time: 2008-02-16

filetype:pptx

Group 2 

R95922027

R95922034

R95922081

R95942129
 

 

Outline 

Experiment setting
Feature extraction
Model training
Hybrid-Model
Conclusion
Reference

Experiment setting

Selected online corpus: Enron

Preprocessing: removing HTML tags and factoring out the important headers.

The corpus comes in six folders, enron1 to enron6, containing 13,496 spam mails and 15,045 ham mails in total.
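The preprocessing step could look roughly like the sketch below. The header set in `KEPT_HEADERS`, the regex-based tag stripping, and the function names are assumptions made for illustration; the slides do not say which headers were kept or which tool removed the tags.

```python
import re
from email import message_from_string
from html import unescape

# Assumed set of "important" headers; the slides do not list them.
KEPT_HEADERS = ("Subject", "Date", "To", "Cc")

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Drop anything that looks like an HTML tag, then decode entities."""
    return unescape(TAG_RE.sub(" ", text))

def preprocess(raw_mail):
    """Return (headers, plain_body) for one raw RFC 822 message string."""
    msg = message_from_string(raw_mail)
    headers = {name: msg.get(name, "") for name in KEPT_HEADERS}
    payload = msg.get_payload()
    # Multipart mails are ignored in this sketch; only a flat text body is handled.
    body = payload if isinstance(payload, str) else ""
    return headers, " ".join(strip_html(body).split())
```

A regex is a crude way to strip tags; a real HTML parser would be more robust on malformed mail.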

Feature Extraction

Transmitted Time of the Mail
Number of the Receiver
Existence of Attachment
Existence of images in mail
Existence of Cited URLs in mail
Symbols in Mail Title
Mail-body

Transmitted Time of the Mail & Number of the Receiver

[Figure: Probability of being Spam for Transmitted Time & Receiver Size]

Transmitted time, spam: non-uniform distribution.

Number of receivers, spam: only a single receiver.
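One way to obtain a "probability of being spam" per transmitted hour or per receiver count is a simple frequency estimate, sketched below. The function name and the toy data are made up for illustration and are not taken from the project.

```python
from collections import defaultdict

def spam_rate_by(values, labels):
    """Estimate P(spam | value) as the fraction of spam among mails sharing a value."""
    spam = defaultdict(int)
    total = defaultdict(int)
    for v, is_spam in zip(values, labels):
        total[v] += 1
        spam[v] += int(is_spam)
    return {v: spam[v] / total[v] for v in total}

# Toy data (made up): sending hour and spam label for six mails.
hours = [3, 3, 3, 14, 14, 14]
labels = [True, True, False, False, False, True]
rates = spam_rate_by(hours, labels)
```

The same function works for receiver counts by passing the number of recipients of each mail as `values`.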

 

Attachment, Images, and URL

        Attachment   Image     URL
Spam    0.0307%      0.6816%   30.779%
Ham     7.3712%      0%        7.0521%
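The three existence features can be read off a parsed message. The sketch below is one possible extraction, assuming raw RFC 822 input; the detection rules (a `Content-Disposition: attachment` part, an `image/*` part or an `<img` tag, an `http(s)://` link) and the function name are assumptions, not the project's documented rules.

```python
import re
from email import message_from_string

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def structural_flags(raw_mail):
    """Return boolean attachment / image / URL features for one raw message."""
    msg = message_from_string(raw_mail)
    has_attachment = has_image = has_url = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_disposition() == "attachment":
            has_attachment = True
        if part.get_content_maintype() == "image":
            has_image = True
        if part.get_content_maintype() == "text":
            # Payload is used as-is; transfer-encoded text is not decoded in this sketch.
            payload = part.get_payload()
            if "<img" in payload.lower():
                has_image = True
            if URL_RE.search(payload):
                has_url = True
    return {"attachment": has_attachment, "image": has_image, "url": has_url}
```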

Symbols in Mail Titles

Marks                 Probability of being Spam   Mail Feature Showing Rate
~ ^ | * % [] ! ? =    0.911                       28% in spam
\ / ; &               0.182                       16% in ham

Title absence: spam senders add titles now.
Arabic numerals: almost equal probability (date, ID).
Non-alphanumeric characters & punctuation marks: ~ ^ | * % [] ! ? = appear more often in spam; \ / ; & appear more often in ham.
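The two columns for the spam-leaning marks can be estimated by counting over labelled titles, as in the sketch below. The mark sets are copied from the slide; the function name and the exact definition of "showing rate" (share of spam titles containing a mark) are assumptions.

```python
SPAM_MARKS = set("~^|*%[]!?=")  # marks the slide lists as spam-leaning
HAM_MARKS = set("\\/;&")        # marks the slide lists as ham-leaning

def mark_stats(titles, labels, marks):
    """Return (P(spam | title has a mark), share of spam titles with a mark)."""
    hit = [any(ch in marks for ch in t) for t in titles]
    n_hit = sum(hit)
    n_hit_spam = sum(h and s for h, s in zip(hit, labels))
    n_spam = sum(labels)
    p_spam = n_hit_spam / n_hit if n_hit else 0.0
    rate_in_spam = n_hit_spam / n_spam if n_spam else 0.0
    return p_spam, rate_in_spam
```

Here `labels` holds `True` for spam and `False` for ham, in the same order as `titles`.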

 

Mail-body

Build the internal structure of words.
Use TreeTagger, a good NLP tool, for word stemming.
Given the stemmed words that appear in each mail, build a sparse-format vector to represent the "semantics" of the mail.
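A sparse vector of stemmed words can be sketched as follows. TreeTagger is an external tool, so `toy_stem` below is only a crude placeholder for it, and the `{term_index: count}` layout is an assumed sparse format.

```python
import re
from collections import Counter

def toy_stem(word):
    """Crude suffix stripper; a placeholder for TreeTagger, not its algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_sparse(text, vocab):
    """Map a mail body to {term_index: count}, growing vocab as new stems appear."""
    stems = (toy_stem(w) for w in re.findall(r"[a-z]+", text.lower()))
    counts = Counter(vocab.setdefault(s, len(vocab)) for s in stems)
    return dict(sorted(counts.items()))
```

Only non-zero entries are stored, which keeps the representation small when the vocabulary is large and each mail uses few of its words.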

Naïve Bayes

Given a bag of words (x1, x2, x3, ..., xn), Naïve Bayes is powerful for document classification.
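A minimal sketch of a bag-of-words Naïve Bayes classifier is given below. It uses the multinomial variant with add-one (Laplace) smoothing, which is an assumption; the slides do not say which variant or smoothing was used.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, words):
        def score(c):
            # log P(c) + sum of log P(word | c); unseen words are skipped.
            denom = self.total[c] + len(self.vocab)
            return self.prior[c] + sum(
                math.log((self.counts[c][w] + 1) / denom)
                for w in words if w in self.vocab
            )
        return max(self.classes, key=score)
```

Working in log space avoids underflow when a mail contains many words.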

 

Vector Space Model 

Create a word-document (mail) matrix with SRILM.

For every pair of mails (columns), a similarity value can be calculated.
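The slides do not name the similarity measure. Cosine similarity is the usual choice in a vector space model, so the sketch below assumes it; each mail is a sparse `{term: weight}` dictionary, i.e. one column of the matrix.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Mails with no words in common score 0, and mails with proportional word counts score 1 up to rounding.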

 

KNN (Vector Space Model) 

With K = 1, the KNN classification model shows the best accuracy.
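A KNN classifier over the vector space model can be sketched as below, with `k=1` as the default to match the setting reported above. Cosine similarity and the function names are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, train_vecs, train_labels, k=1):
    """Label the query by majority vote among its k most similar training mails."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```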

 

Maximum Entropy 

Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution.
The elements of the word-document matrix are converted to binary values {0, 1}.
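For two classes, a conditional maximum-entropy model has the same form as logistic regression, so a small sketch over binary features can be trained by gradient ascent on the log-likelihood. The training procedure, learning rate, and function names below are assumptions; the slides do not describe the toolkit or optimizer used.

```python
import math

def train_maxent(X, y, n_features, epochs=200, lr=0.5):
    """X: list of sets of active feature indices (binary features); y: 1 = spam, 0 = ham."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for feats, target in zip(X, y):
            z = b + sum(w[j] for j in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            g = target - p  # gradient of the log-likelihood with respect to z
            b += lr * g
            for j in feats:
                w[j] += lr * g
    return w, b

def p_spam(feats, w, b):
    """Model probability that a mail with the given active features is spam."""
    return 1.0 / (1.0 + math.exp(-(b + sum(w[j] for j in feats))))
```

Because the features are binary, each mail is fully described by the set of word indices it contains.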

SVM 

Binary:

     Use a binary value {0, 1} to represent whether a word appears or not.

Normalized:

     Count the occurrences of each word and divide each count by the maximum occurrence count.
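The two input encodings for the SVM can be sketched as follows. The slide does not say over what the maximum is taken; this sketch assumes each word's count is divided by that word's largest count across the corpus, which is only one possible reading.

```python
def binary_features(corpus_counts):
    """1 if the word appears in the mail, otherwise the word is omitted (sparse 0)."""
    return [{t: 1 for t, c in counts.items() if c > 0} for counts in corpus_counts]

def normalized_features(corpus_counts):
    """Divide each count by the word's maximum count over all mails (assumed reading)."""
    peak = {}
    for counts in corpus_counts:
        for t, c in counts.items():
            peak[t] = max(peak.get(t, 0), c)
    return [
        {t: c / peak[t] for t, c in counts.items() if c > 0}
        for counts in corpus_counts
    ]
```

Both encodings keep every feature in the range [0, 1], which generally suits SVM training.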

 

Single-layered-perceptron Hybrid Model  

The accuracy of the NN-based hybrid model is always the highest.
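A single-layer perceptron that combines base classifiers can be sketched as below. It assumes each base model outputs +1 (spam) or -1 (ham) and that the weights are learned with the classic perceptron rule; the project's actual inputs and training procedure are not given in the slides.

```python
def combine(x, w, b):
    """Weighted vote of the base-classifier outputs x, thresholded at zero."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(votes, labels, epochs=20, lr=0.1):
    """votes: per-mail lists of base outputs in {-1, +1}; labels in {-1, +1}."""
    w = [0.0] * len(votes[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(votes, labels):
            if combine(x, w, b) != y:
                # Perceptron rule: move the weights toward the correct label.
                b += lr * y
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, b
```

Unlike plain voting, the learned weights let the combiner trust a reliable base model more than an unreliable one.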

 

Committee-based  Hybrid-model  
 

The voting model averages the classification results, slightly improving the filter. However, voting can sometimes reduce accuracy when the majority misjudges.

KNN + Naïve Bayes + Maximum Entropy
Naïve Bayes + Maximum Entropy + SVM
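Committee voting over three classifiers can be sketched as a simple majority vote. The function name and the string labels are assumptions for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most committee members."""
    return Counter(predictions).most_common(1)[0][0]

# Two of three members say "spam", so the committee says "spam",
# even if the single dissenting member happened to be right.
committee_label = majority_vote(["spam", "ham", "spam"])
```

With an odd number of members and two classes there are no ties, which is one reason committees of three are convenient.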

Conclusion 

7 features are shown to discriminate mail type:
    Transmitted Time & Receiver Size
    Attachment, Image, and URL
    Non-alphanumeric Character & Punctuation Marks
5 popular machine learning methods are shown to be suitable for spam filtering:
    Naïve Bayes, KNN, SVM
2 model combination methods are tested:
    Committee-based & Single Neural Network

Reference 

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.
[2] A plan for spam:
