Machine Learning in Fraud Analysis

title_pic

The main purpose of this project is using machine learning algorithm to detect fraudsters in Enron scandal. Those persons are main criminals in Enron scandal. We would like to make a learning model to predict those people by using multiple variables which are reported in financial sheets and emailling list. Because we have already known person of interests (POI), we could use supervised machine learning to do this task.

#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")
import tester

import tester
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from time import time
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from tabulate import tabulate
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier  #GBM algorithm
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
import xgboost as xgb

%matplotlib inline 

1. Dataset Exploration

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
target_list = ['poi']
report_list = ['salary','deferral_payments', 'total_payments',
               'loan_advances', 'bonus', 'restricted_stock_deferred', 
               'deferred_income', 'total_stock_value', 'expenses',
               'exercised_stock_options', 'other', 'long_term_incentive', 
               'restricted_stock', 'director_fees','to_messages',
               'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 
               'shared_receipt_with_poi']
features_list =  target_list + report_list
### Store to my_dataset for easy export below.
my_dataset = data_dict
counter = 0
for entry in my_dataset:
    if my_dataset[entry]['poi']:
        counter += 1
print 'Number of data points in this dataset: %f' %len(my_dataset)
print 'Number of POI in this dataset: %f' %counter
print 'Number of feature in this dataset: %f' %len(features_list)
Number of data points in this dataset: 146.000000
Number of POI in this dataset: 18.000000
Number of feature in this dataset: 20.000000

There are total 18 POIs in total 146 data points. Next step will be the data auditing to determine missing values or outliers.

Data audit

### Determine missing values in Enron report
value_report_list = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
dict_report_list = dict(zip(report_list,value_report_list))
for entry in my_dataset:
    for name in report_list:
        val = my_dataset[entry][name]
        if val == 'NaN':
            dict_report_list[name] += 1
NaN_table = pd.DataFrame(dict_report_list.items(), columns=['Feature', 'No of Missing Values'])
print NaN_table.sort(columns = 'No of Missing Values',ascending=False)
                      Feature  No of Missing Values
11              loan_advances                   142
6               director_fees                   129
7   restricted_stock_deferred                   128
2           deferral_payments                   107
15            deferred_income                    97
18        long_term_incentive                    80
5                       bonus                    64
16    shared_receipt_with_poi                    60
14    from_this_person_to_poi                    60
1                 to_messages                    60
10    from_poi_to_this_person                    60
12              from_messages                    60
13                      other                    53
0                      salary                    51
9                    expenses                    51
4     exercised_stock_options                    44
17           restricted_stock                    36
3              total_payments                    21
8           total_stock_value                    20


C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:10: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)

Through above table, every features except ‘poi’ has missing values at least 20 times. Top 3 features that have highest missing values are ‘loan_advance’, ‘director_fees’ and ‘restricted_stock_deferred’.

According to Enron printed sheet, I found two outlier that is “TOTAL” and “THE TRAVEL AGENCY IN THE PARK”.

  • “TOTAL” is the summary of every column in Enron sheet and it is a spreadsheet quirk.
  • “THE TRAVEL AGENCY IN THE PARK” is the account of business-related travel to The Travel Agency in the Park. It is not a person of interest.

Another possibility is the observation with no information (all “NaN” value) which will determined by the script beblow.

outlier_list = []
#### Next script is used to remove observation which have all value 'NaN' in Enron report. 
for entry in my_dataset:
    for name in report_list:
        val = my_dataset[entry][name]
        if val == 'NaN':
            if entry not in outlier_list:
                outlier_list.append(entry)
        elif val != 'NaN':
            try:
                outlier_list.remove(entry)
                break
            except ValueError:
                break

outlier_list.append('TOTAL') #insert 'TOTAL' observation
outlier_list.append('THE TRAVEL AGENCY IN THE PARK')
print 'The outlier list is: '
print outlier_list
The outlier list is: 
['LOCKHART EUGENE E', 'TOTAL', 'THE TRAVEL AGENCY IN THE PARK']

Remove outlier

#### Remove outlier
for name in outlier_list:
    my_dataset.pop(name,0)
print 'Number of data points after outlier removing in this dataset: %f' %len(my_dataset)
Number of data points after outlier removing in this dataset: 143.000000

2. Feature Selection

Create new feature

Through the Enron report, we have the email list of Enron POI and non-POI. It will be smart to create a new feature based on the ratio between to/from POI and to/from a person. There is a chance that they can be used in POI prediction later.

#### Making new feature based on other features.
def computeFraction( poi_messages, all_messages ):
    """ given a number messages to/from POI (numerator) 
        and number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
   """
   
    ### you fill in this code, so that it returns either
    ###     the fraction of all messages to this person that come from POIs
    ###     or
    ###     the fraction of all messages from this person that are sent to POIs
    ### the same code can be used to compute either quantity

    ### beware of "NaN" when there is no known email address (and so
    ### no filled email features), and integer division!
    ### in case of poi_messages or all_messages having "NaN" value, return 0.
    if poi_messages == "NaN":
        fraction = 0.
    elif all_messages == "NaN":
        fraction = 0.
    else:
        fraction = poi_messages*1.0/all_messages

    return fraction

for name in my_dataset:
    data_point = my_dataset[name]
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = computeFraction( from_poi_to_this_person, to_messages )
    my_dataset[name]["fraction_from_poi"] = fraction_from_poi

    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = computeFraction( from_this_person_to_poi, from_messages )
    my_dataset[name]["fraction_to_poi"] = fraction_to_poi

### Feature list with addtional features
features_list = features_list + ['fraction_from_poi','fraction_to_poi']
print features_list
['poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees', 'to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'fraction_from_poi', 'fraction_to_poi']

Feature selection

In order to determine the best features for prediction, I have used ‘SelectKBest’ function to evaluate the importance of each features. The function was applied in training set created by ‘train_test_split’ function.

data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
### Using SelectKBest to determine number of selection feature.
SKB = SelectKBest()
SKB.fit(features,labels)
scores = SKB.scores_
unsorted_pairs = zip(features_list[1:], scores)
sorted_pairs = list(reversed(sorted(unsorted_pairs, key=lambda x: x[1])))
KBest_table = pd.DataFrame(sorted_pairs,columns=['Feature','Score'])
print KBest_table
                      Feature      Score
0     exercised_stock_options  24.815080
1           total_stock_value  24.182899
2                       bonus  20.792252
3                      salary  18.289684
4             fraction_to_poi  16.409713
5             deferred_income  11.458477
6         long_term_incentive   9.922186
7            restricted_stock   9.212811
8              total_payments   8.772778
9     shared_receipt_with_poi   8.589421
10              loan_advances   7.184056
11                   expenses   6.094173
12    from_poi_to_this_person   5.243450
13                      other   4.187478
14          fraction_from_poi   3.128092
15    from_this_person_to_poi   2.382612
16              director_fees   2.126328
17                to_messages   1.646341
18          deferral_payments   0.224611
19              from_messages   0.169701
20  restricted_stock_deferred   0.065500
sns.barplot(KBest_table['Feature'],KBest_table['Score'])
_, labels = plt.xticks()
plt.setp(labels, rotation=90)
plt.ylabel('SKBest score')
plt.title('Classifier')
plt.show()

_config.yml

According to the result of SelectKBest, I selected 6 features of interest since they have highest impact for the model fitting (score > 10). Clearly, we see that our new feature “fraction_to_poi” has high impact on the data model when it is at top 5th. Would this enginereed variable affect the validation of POIs in classifer ? To answer this question, I made an classifer testing with two feature lists: original list (without ‘fraction_to_poi’) and engineered list (with ‘fraction_to_poi’).

### Validate the original feature list in Naive Bayes classifier
old_features_list = target_list + list(KBest_table['Feature'].iloc[:6])
old_features_list.remove('fraction_to_poi')
print old_features_list
# Re-create the data for model fitting
data = featureFormat(my_dataset, old_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# Example starting point. Create train and test set from original dataset. 30% of data goes into test set.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
    
clf_NB = GaussianNB()
clf_NB.fit(features_train,labels_train)
labels_pred = clf_NB.predict(features_test)
print classification_report(labels_test, labels_pred)
['poi', 'exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'deferred_income']
             precision    recall  f1-score   support

        0.0       0.95      0.95      0.95        39
        1.0       0.33      0.33      0.33         3

avg / total       0.90      0.90      0.90        42
### Validate the engineered feature list in Naive Bayes classifier
new_features_list = target_list + list(KBest_table['Feature'].iloc[:6])
print new_features_list
# Re-create the data for model fitting
data = featureFormat(my_dataset, new_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# Example starting point. Create train and test set from original dataset. 30% of data goes into test set.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
    
clf_NB = GaussianNB()
clf_NB.fit(features_train,labels_train)
labels_pred = clf_NB.predict(features_test)
print classification_report(labels_test, labels_pred)
['poi', 'exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'fraction_to_poi', 'deferred_income']
             precision    recall  f1-score   support

        0.0       0.92      0.94      0.93        36
        1.0       0.60      0.50      0.55         6

avg / total       0.87      0.88      0.88        42

The precision and recall scores of my engineered list is higher than original list in POI detection (class 1). So I suggested to include ‘fraction_to_poi’ in feature set used for further investigation.

print tabulate([['Original List', 0.33, 0.33], ['Engineered list', 0.60, 0.50]], headers=['Name', 'Precision','Recall'])
Name               Precision    Recall
---------------  -----------  --------
Original List           0.33      0.33
Engineered list         0.6       0.5
data = featureFormat(my_dataset, new_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# Example starting point. Create train and test set from original dataset. 30% of data goes into test set.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
#### Reshape the training and test sets.
features_train = np.array(features_train)
labels_train = np.array(labels_train)
features_test = np.array(features_test)
labels_test = np.array(labels_test)

3. Evaluation of classifiers

There are many classifiers for classification. For the purpose of this porject and time saving, I only picked 6 common models for data fitting. I created pipeline that includes Scaling, Feature Selection, Principle Component Analysis (PCA) and Classifiers. I had to use feature scaling because some model need that processing.

Through running defaut model with 6 features of interest, their ‘f1’ scores were compared after running ‘tester.py’ . The reason for chosing ‘f1’ because it is the weighted average of Precision and Recall.

### Try a varity of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

#### Checking the final result with all ML that are used for classification:
#### Using GridSearchCV to optimize number of features + SelectKBest
#### Create cross-validation  
sss = StratifiedShuffleSplit(labels_train, 50, test_size = 0.3, random_state = 42)
mm_scaler = MinMaxScaler()

name_list = ['Naive Bayes',
             'Support Vector Machine',
             'Decision Tree',
             'KNN',
             'RandomForest','AdaBoost'
            ]
clf_list = [GaussianNB(),
            SVC(),
            DecisionTreeClassifier(),
            KNeighborsClassifier(),
            RandomForestClassifier(),
            AdaBoostClassifier()
           ]

for name, clf in zip(name_list,clf_list):
    t0 = time() 
    pipe = Pipeline(steps =[('scaler',mm_scaler),('SKB',SelectKBest(k = 6)),('CLF',clf)])
    pipe.fit(features_train,labels_train)
    labels_pred = pipe.predict(features_test)
    clf_final = pipe
    #clf_final = pipe
    print '\t'
    print name
    print 'Fit to training dataset'
    print classification_report(labels_test, labels_pred)    
    print '\t'
    print 'Test on tester.py'
    f1 = tester.test_classifier(clf_final, my_dataset, new_features_list)
    print f1
    print "done in %0.3fs" % (time() - t0)
Naive Bayes
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.92      0.92      0.92        36
        1.0       0.50      0.50      0.50         6

avg / total       0.86      0.86      0.86        42

	
Test on tester.py
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', GaussianNB())])
	Accuracy: 0.85236	Precision: 0.47723	Recall: 0.35100	F1: 0.40449	F2: 0.37061
	Total predictions: 14000	True positives:  702	False positives:  769	False negatives: 1298	True negatives: 11231

None
done in 2.235s
	
Support Vector Machine
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.86      1.00      0.92        36
        1.0       0.00      0.00      0.00         6

avg / total       0.73      0.86      0.79        42

	
Test on tester.py


C:\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)


Got a divide by zero when trying out: Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
Precision or recall may be undefined due to a lack of true positive predicitons.
None
done in 2.101s
	
Decision Tree
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.91      0.89      0.90        36
        1.0       0.43      0.50      0.46         6

avg / total       0.84      0.83      0.84        42

	
Test on tester.py
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.79071	Precision: 0.26912	Recall: 0.27100	F1: 0.27005	F2: 0.27062
	Total predictions: 14000	True positives:  542	False positives: 1472	False negatives: 1458	True negatives: 10528

None
done in 2.007s
	
KNN
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.85      0.94      0.89        36
        1.0       0.00      0.00      0.00         6

avg / total       0.73      0.81      0.77        42

	
Test on tester.py
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])
	Accuracy: 0.83514	Precision: 0.23448	Recall: 0.06800	F1: 0.10543	F2: 0.07925
	Total predictions: 14000	True positives:  136	False positives:  444	False negatives: 1864	True negatives: 11556

None
done in 2.417s
	
RandomForest
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.89      0.92      0.90        36
        1.0       0.40      0.33      0.36         6

avg / total       0.82      0.83      0.83        42

	
Test on tester.py
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
	Accuracy: 0.85629	Precision: 0.49239	Recall: 0.19400	F1: 0.27834	F2: 0.22076
	Total predictions: 14000	True positives:  388	False positives:  400	False negatives: 1612	True negatives: 11600

None
done in 27.533s
	
AdaBoost
Fit to training dataset
             precision    recall  f1-score   support

        0.0       0.92      0.92      0.92        36
        1.0       0.50      0.50      0.50         6

avg / total       0.86      0.86      0.86        42

	
Test on tester.py
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('CLF', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None))])
	Accuracy: 0.82414	Precision: 0.35154	Recall: 0.27350	F1: 0.30765	F2: 0.28621
	Total predictions: 14000	True positives:  547	False positives: 1009	False negatives: 1453	True negatives: 10991

None
done in 118.370s

Visualization of F1 scores

### Create array for f1
f1_score = [0.40449,0.0,0.27005,0.10543,0.27834,0.30765]

name_list = ('Naive Bayes','Support Vector Machine',
             'Decision Tree','K-NN',
             'RandomForest','AdaBoost')

plt.bar(np.arange(len(name_list)),f1_score,align='center', color = 'rgbkymc',alpha = 0.5)
plt.xticks(np.arange(len(name_list)), name_list, rotation = 90)
plt.ylabel('F1 score')
plt.title('Classifier')
plt.show()

_config.yml

I found that Naive Bayes model had the highest F1-score. I decided to select Naive Bayes and Decision Tree for my model optimization. The final model is the one which highest F1 after parameter tuning.

4. Model optimization

The objective of this task is trying to optimize model fitting in order to obtain at least 0.3 in both precision and recall using testing script.

The strategy for tuning is using GridSearchCV to optimize parameters from each models and apply cross-validation to determine the best F1 score.

4.1 Naive Bayes

My strategy is to optimize:

  • The number of features: SelectKBest(k = 5,6)
  • The number of components: PCA(n_components = 2,3,4)
### Create cross-validation
sss = StratifiedShuffleSplit(labels, 50, random_state = 42)
t0 = time()
# PCA
pca = PCA()
#Pipeline
pipeline = Pipeline([('scale',mm_scaler),('SKB',SelectKBest()),('PCA',PCA()),('NB', GaussianNB())])
# clf's parameters
parameters = {'SKB__k':[5,6],
              'PCA__n_components':[2,3,4]
             }
#GridSearchCV
gs = GridSearchCV(pipeline, parameters, cv=sss, scoring='f1')
gs.fit(features, labels)

clf_NB = gs.best_estimator_
tester.test_classifier(clf_NB, my_dataset, new_features_list)
print "done in %0.3fs" % (time() - t0)
C:\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)


Pipeline(steps=[('scale', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=6, score_func=<function f_classif at 0x0000000007B63F98>)), ('PCA', PCA(copy=True, n_components=4, whiten=False)), ('NB', GaussianNB())])
	Accuracy: 0.85564	Precision: 0.49256	Recall: 0.34750	F1: 0.40751	F2: 0.36925
	Total predictions: 14000	True positives:  695	False positives:  716	False negatives: 1305	True negatives: 11284

done in 5.306s

4.2 Decision Tree

In this classifier, there are many parameters that I could tune. Because of heavy computation, I selected ony 5 parameters: criterion, splitter, min_samples_split, max_depth, max_leaf_nodes. Those parameters will be put in the cross-validation with feature selection and component analysis.

t0 = time()
# PCA
pca = PCA()
#Pipeline
pipeline = Pipeline([('scale',mm_scaler),('SKB',SelectKBest()),('PCA',PCA()),('DT', DecisionTreeClassifier())])
# clf's parameters
parameters = {'DT__criterion': ['gini','entropy'],
              'DT__splitter':['best','random'],
              'DT__min_samples_split':[2,10,20],
              'DT__max_depth':[10,15,20,25,30],
              'DT__max_leaf_nodes':[5,10,30],
              'SKB__k':[5,6],
              'PCA__n_components':[2,3,4]}
#GridSearchCV
gs = GridSearchCV(pipeline, parameters, cv=sss, scoring='f1')
gs.fit(features, labels)

clf_DT = gs.best_estimator_
tester.test_classifier(clf_DT, my_dataset, new_features_list)
print "done in %0.3fs" % (time() - t0)
Pipeline(steps=[('scale', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SKB', SelectKBest(k=5, score_func=<function f_classif at 0x0000000007B63F98>)), ('PCA', PCA(copy=True, n_components=2, whiten=False)), ('DT', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=25,
            max_features=None, max_leaf_nodes=30, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.82007	Precision: 0.35639	Recall: 0.32200	F1: 0.33832	F2: 0.32834
	Total predictions: 14000	True positives:  644	False positives: 1163	False negatives: 1356	True negatives: 10837

done in 669.987s

4.3 Discussion

In this section, I tried to to make an machine system that can built the best model from data. In machine system, chosen learning algorithm goes with multiple parameters. All those parameters will effect the quality of data models. However, it is not easy to find the correct values of those parameters. Thus, tuning is one of the most important in machine learning. This step will help us to obtain optimal values to complete learning task in the best way as possible. The “best” of those values is dependent on our criteria. For instance, I mainly tuned parameters of Naive Bayes and Decision Tree models to obtain at least 0.3 in both precision and recall in this project.

In two final results, Naive Bayes provide better score of precision and recall. It is logical since the Enron dataset is pretty small so simpler model will be easily fit. In conclusion, Naive Bayes model is my final model for this Enron fraud classification.

clf = clf_NB
dump_classifier_and_data(clf, my_dataset, new_features_list)

5. Validation and Evaluation

The main purpose of this project is to identify the best learning model for Enron dataset. From previous analysis, there are more than one approach to build that model. In learning approach, one of process is parameter tuning. The “best” choice of parameters requires validation. Validation estimates how well my model has been trained. Validation includes splitting Enron data into multiple train/test sets. This step is essential because it limits problem such as data overfitting or underfitting. In this project, I have used “train_test_split” and “StratifiedShuffleSplit” for re-sample Enron data. “TrainTestSplit” created small dataset to choose the appropriate classifiers. “StratifiedShuffleSplit” was used for cross-validation of optimal parameters.

Instead of using simpler cross-validation methods, I picked “StratifiedShuffleSplit” for my final learning algorithm. Enron dataset is quite small so “StratifiedShuffleSplit” allows to create randomly multiple training and test sets then averages the results over all the tests. Moreover, Enrone dataset has unbalanced classes where number of POIs is greatly smaller than number of non-POIs. “StratifiedShuffleSplit” makes sure the ratio of POI/non-POIs is the same in the training and test sets as it was in the larger dataset.

I used precision and recall to estimate my model properties. The precision measures a classifiers exactness and the recall measures a classifiers completeness. In small and unbalanced like Enron dataset, both of them are more important than accuracy. If validation is not good, my model will fall in high level of error type I and II. We always want to have high chance to catch criminal but ignore innocent person.

The good estimators for my Enron classification model is the one which has the score of precision and recall above 0.3.

6. Conclusion

This Enron dataset is an unbalaned data where POIs is greatly samller than non-POIs. There are 3 outliers and many of missing values in each features of data point. I have removed all of outliers and replaced missing values by ‘0’ during data auditing.

On the other hands, I optimized Feature Selection by creating two new features: “fraction_to_poi” and “fraction_from_poi”. According to the impact scores, I have selected 6 features to build my learning model. One of them was my new engineered feature. Without it, I have lower score in term of precision and recall.

For learning model, I validated a series of common classifers. Based on my F1-score, I picked two classifiers: Naive Bayes and Decisition Tree. My final model is Naive Bayes model in which I receive the better score of precision (0.49) and recall (0.34) after parameter tuning.

Written on February 14, 2017