Intro

The dataset I worked on comes from the UCI Machine Learning Repository.
Source: https://archive.ics.uci.edu/ml/datasets/census+income.
The data come from the 1994 United States census. The goal here is to predict whether an individual's income exceeds $50K (via the "income" variable).
Since there are only two possible outcomes, this is a binary classification task; candidate methods include logistic regression, SVM, decision trees and random forests, etc.
We can test several of these models to determine the most effective one.

In [2]:
# Import libraries

import numpy as np
import pandas as pd

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

Dataset analysis and cleaning

The first step is to analyze and clean the data.

In [3]:
adult_data = pd.read_csv('dataset/adult.data', header=None)
adult_data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                      'marital-status', 'occupation', 'relationship', 'race', 'sex',
                      'capital-gain', 'capital-loss', 'hours-per-week',
                      'native-country', 'income']
In [4]:
adult_data.head()
Out[4]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
In [5]:
adult_data.shape
Out[5]:
(32561, 15)
In [6]:
adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
In [7]:
# Exploratory check: list the unique occupations within each income class
#adult_data[adult_data['income']==' <=50K'].occupation.unique()
#adult_data[adult_data['income']==' >50K'].occupation.unique()

The dataset contains 32,561 rows and 15 columns, which break down as follows:

  • 6 variables are numeric, the other 9 are categorical;
  • 14 columns are the features;
  • 1 column is the target.

FEATURES

  • 'age',
  • 'workclass': Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • 'fnlwgt': continuous.
  • 'education': Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • 'education-num': continuous.
  • 'marital-status': Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • 'occupation': Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • 'relationship': Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • 'race': White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • 'sex': Female, Male.
  • 'capital-gain': continuous.
  • 'capital-loss': continuous.
  • 'hours-per-week': continuous.
  • 'native-country': United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

TARGET

  • 'income' : >50K, <=50K

After some research: the "fnlwgt" variable is the final weight, which corresponds to "the number of units in the target population represented by the responding unit".

The "education-num" variable is the total number of years of education; it is the numeric representation of the "education" variable. A quick cross-check can confirm this, as sketched below.

The "relationship" variable describes the respondent's role within the household. "capital-gain" and "capital-loss" are income other than wages.

For the rest of the work, we can drop the "education" variable since we have its numeric representation; that spares us the get_dummies step for this variable.
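A minimal sanity-check sketch (not run above; groupby/nunique is just one way to do it), verifying that each "education" label maps to exactly one "education-num" value:

# If 'education-num' is a pure numeric recoding of 'education', every
# education label should have exactly one distinct 'education-num' value.
mapping = adult_data.groupby('education')['education-num'].nunique()
print(mapping)
print('one-to-one:', (mapping == 1).all())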

Correlation of the numeric variables

The goal is to spot, and possibly drop, numeric columns that are strongly correlated with one another and could bias the model.

In [8]:
adult_data_num = adult_data.select_dtypes(include=['int64'])
adult_data_num_corr = adult_data_num.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(9, 7))

# Choose a colormap
cmap = "YlGnBu"

# Draw the annotated heatmap with a square aspect ratio
ax = sns.heatmap(adult_data_num_corr, cmap=cmap, annot=True, annot_kws={"size": 10},
                 square=True, linewidths=.5)

The most strongly correlated pairs would show up as very dark cells in this heatmap. Here, the numeric variables are not correlated with one another.

La variable "native-country"

In [9]:
print('Number of countries:', adult_data['native-country'].value_counts().count())
adult_data['native-country'].value_counts()
Number of countries: 42
Out[9]:
 United-States                 29170
  Mexico                          643
  ?                               583
  Philippines                     198
  Germany                         137
  Canada                          121
  Puerto-Rico                     114
  El-Salvador                     106
  India                           100
  Cuba                             95
  England                          90
  Jamaica                          81
  South                            80
  China                            75
  Italy                            73
  Dominican-Republic               70
  Vietnam                          67
  Guatemala                        64
  Japan                            62
  Poland                           60
  Columbia                         59
  Taiwan                           51
  Haiti                            44
  Iran                             43
  Portugal                         37
  Nicaragua                        34
  Peru                             31
  Greece                           29
  France                           29
  Ecuador                          28
  Ireland                          24
  Hong                             20
  Trinadad&Tobago                  19
  Cambodia                         19
  Laos                             18
  Thailand                         18
  Yugoslavia                       16
  Outlying-US(Guam-USVI-etc)       14
  Honduras                         13
  Hungary                          13
  Scotland                         12
  Holand-Netherlands                1
Name: native-country, dtype: int64
In [10]:
adult_data['native-country'].value_counts()/len(adult_data)*100
Out[10]:
 United-States                 89.585701
 Mexico                         1.974755
 ?                              1.790486
 Philippines                    0.608089
 Germany                        0.420749
 Canada                         0.371610
 Puerto-Rico                    0.350112
 El-Salvador                    0.325543
 India                          0.307116
 Cuba                           0.291760
 England                        0.276404
 Jamaica                        0.248764
 South                          0.245693
 China                          0.230337
 Italy                          0.224195
 Dominican-Republic             0.214981
 Vietnam                        0.205768
 Guatemala                      0.196554
 Japan                          0.190412
 Poland                         0.184270
 Columbia                       0.181198
 Taiwan                         0.156629
 Haiti                          0.135131
 Iran                           0.132060
 Portugal                       0.113633
 Nicaragua                      0.104419
 Peru                           0.095206
 Greece                         0.089064
 France                         0.089064
 Ecuador                        0.085992
 Ireland                        0.073708
 Hong                           0.061423
 Trinadad&Tobago                0.058352
 Cambodia                       0.058352
 Laos                           0.055281
 Thailand                       0.055281
 Yugoslavia                     0.049139
 Outlying-US(Guam-USVI-etc)     0.042996
 Honduras                       0.039925
 Hungary                        0.039925
 Scotland                       0.036854
 Holand-Netherlands             0.003071
Name: native-country, dtype: float64

The "native-country" variable has 42 distinct entries: 40 countries, one 'South' value, and one "not communicated" value ("?").
For this column we can, for example, create a new variable grouping the countries by continent whenever the country is not the USA, which alone accounts for 89.5% of respondents.
I will drop the "?" rows (583 of them) and the ambiguous "South" rows (80), i.e. 663 rows out of 32,561, leaving 31,898 rows to work with.

In [11]:
adult_data['native-country'].unique()
Out[11]:
array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
       ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
       ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
       ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
       ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
       ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
       ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
       ' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
       ' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)
In [12]:
America=[' Cuba', ' Jamaica', ' Mexico',' Puerto-Rico', ' Honduras',' Canada',' Columbia',' Ecuador',' Haiti', ' Dominican-Republic',' Guatemala', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Trinadad&Tobago', ' Nicaragua',' El-Salvador']
Asia = [' India',' Iran', ' Philippines',' Cambodia', ' Thailand',' Laos',' Taiwan', ' China', ' Japan',' Vietnam', ' Hong']
Europe =[' England', ' Germany',' Italy', ' Poland',' Portugal',' France', ' Yugoslavia',' Scotland', ' Greece', ' Ireland', ' Hungary', ' Holand-Netherlands']
def to_region(country):
    if country in America:
        return 'America'
    if country in Asia:
        return 'Asia'
    if country in Europe:
        return 'Europe'
    return country

adult_data['group_country'] = adult_data['native-country'].map(to_region)
adult_data['group_country'].value_counts()
Out[12]:
 United-States    29170
America            1536
Asia                671
 ?                  583
Europe              521
 South               80
Name: group_country, dtype: int64

I can then drop the 'native-country' column as well as the rows containing "?" and "South", the latter because we do not know what it refers to.

In [13]:
adult_data.drop(adult_data[adult_data['group_country'] ==' ?'].index, inplace = True )
adult_data.drop(adult_data[adult_data['group_country'] ==' South'].index, inplace = True )
In [14]:
adult_data['group_country'].value_counts()
Out[14]:
 United-States    29170
America            1536
Asia                671
Europe              521
Name: group_country, dtype: int64
In [15]:
adult_data.drop(columns = 'native-country',inplace = True)
In [16]:
adult_data.head()
Out[16]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week income group_country
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 <=50K United-States
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 <=50K United-States
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 <=50K United-States
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 <=50K United-States
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 <=50K America
In [17]:
adult_data.shape
Out[17]:
(31898, 15)

The target variable: income

I look at the distribution of the target.

In [18]:
print(adult_data.groupby('income').size())
income
 <=50K    24219
 >50K      7679
dtype: int64

I convert this column to a 0/1 indicator to make the computations easier.

In [19]:
adult_data['income_ordinal'] = adult_data['income'].map(lambda x: 1 if x==' >50K' else 0)
adult_data['income_ordinal'].value_counts()
Out[19]:
0    24219
1     7679
Name: income_ordinal, dtype: int64
In [20]:
sns.countplot(x='income',data=adult_data)

plt.title('Distribution Income')
plt.xlabel('Income')
plt.ylabel('Nb of persons')
Out[20]:
Text(0, 0.5, 'Nb of persons')
In [21]:
adult_data.head()
Out[21]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week income group_country income_ordinal
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 <=50K United-States 0
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 <=50K United-States 0
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 <=50K United-States 0
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 <=50K United-States 0
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 <=50K America 0

Missing values

For the other variables I did not notice anything special, apart from the need to check for missing values.

In [22]:
adult_data.isna().sum().sum()
Out[22]:
0

isna() reports no missing values. Note, however, that in this dataset missing entries are encoded as the string ' ?' rather than as NaN, as already seen for native-country and still present in workclass and occupation. I can move on to the ML part.

Machine Learning

I split the variables into features and target.

In [23]:
# y: the target
y = adult_data["income_ordinal"]

# X: the features
X = adult_data.drop(columns=['income_ordinal','income'])

I can now convert the categorical features into indicator (dummy) variables.

In [24]:
X = pd.get_dummies(X)
X.head()
Out[24]:
age fnlwgt education-num capital-gain capital-loss hours-per-week workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... race_ Asian-Pac-Islander race_ Black race_ Other race_ White sex_ Female sex_ Male group_country_ United-States group_country_America group_country_Asia group_country_Europe
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 1 0 1 1 0 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 1 0 1 1 0 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 1 0 1 1 0 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 1 0 0 0 1 1 0 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 1 0 0 1 0 0 1 0 0

5 rows × 70 columns
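A side note, as a hedged alternative rather than what was run above: get_dummies accepts drop_first=True, which removes one redundant level per categorical variable and avoids perfectly collinear dummy columns; this matters mostly for linear models such as logistic regression.

# Alternative encoding: drop one level per category to avoid the "dummy trap"
# (perfectly collinear indicator columns).
X_alt = pd.get_dummies(adult_data.drop(columns=['income_ordinal', 'income']),
                       drop_first=True)
print(X_alt.shape)  # fewer than the 70 columns produced above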

1. Splitting the data into training (80%) and testing (20%) sets

I perform a classic split: 80% of the data for training and 20% for testing.

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y ,train_size=0.80, test_size=0.20)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[25]:
((25518, 70), (6380, 70), (25518,), (6380,))

2. Model selection via cross-validation

I use cross-validation to select the prediction model.

Cross-validation is a method for estimating the generalization performance of several models using only the training set. This estimate can then be compared with the one obtained on the test set that was held out at the start.

Concretely, cross-validation repeats the fit/evaluate cycle over several different train/validation splits of the data, to check that the predictions are stable and to guard against overfitting.
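For intuition, here is a minimal sketch of what cross_val_score does internally for classifiers (scikit-learn stratifies the folds by default):

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Manual 10-fold cross-validation: fit on 9 folds, score on the held-out fold.
skf = StratifiedKFold(n_splits=10)
scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(solver='liblinear')
    model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    scores.append(model.score(X_train.iloc[val_idx], y_train.iloc[val_idx]))
print('mean accuracy:', np.mean(scores))  # close to the cross_val_score results below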

In [26]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics

print('Available scorers in sklearn:\n\n', metrics.SCORERS.keys())
Available scorers in sklearn:

 dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])

The comparison score I choose is accuracy: the number of correct predictions (true positives plus true negatives) over the total number of predictions.
I will test the following models: LogisticRegression, DecisionTreeClassifier, KNeighborsClassifier and RandomForestClassifier.
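In confusion-matrix terms (TP, TN, FP, FN):

accuracy = (TP + TN) / (TP + TN + FP + FN)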

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

print("Rf:")
accuracy = cross_val_score(RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0), X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of Random Forest is: " , accuracy.mean()*100)

print("\n\nLog:")
accuracy = cross_val_score(LogisticRegression(solver='liblinear'), X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of LogisticRegression is: " , accuracy.mean()*100)

print("\n\nK Neighbors:")
accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of K Neighbors is: " , accuracy.mean()*100)

print("\n\nDecision Tree:")
accuracy = cross_val_score(DecisionTreeClassifier(),X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of Decision Tree is: " , accuracy.mean()*100)

Rf:
[0.76811594 0.76929103 0.76920063 0.77625392 0.77468652 0.77037618
 0.76871815 0.76832615 0.76675813 0.77224618]
Accuracy of Random Forest is:  77.03972821957758


Log:
[0.80101841 0.78926753 0.78605016 0.80525078 0.79153605 0.79663009
 0.80086241 0.79459036 0.79223834 0.80517444]
Accuracy of LogisticRegression is:  79.62618565675866


K Neighbors:
[0.76184881 0.75597336 0.75352665 0.75352665 0.75744514 0.74177116
 0.76048608 0.76754214 0.74794198 0.75891807]
Accuracy of K Neighbors is:  75.58980041578805


Decision Tree:
[0.81629456 0.80846063 0.80525078 0.82131661 0.81347962 0.81543887
 0.81066249 0.81575853 0.81144649 0.80399843]
Accuracy of Decision Tree is:  81.22107018316989

Given these scores, the most convincing model is the decision tree; logistic regression is also worth keeping. I now tune both with a grid search. The general procedure:

1) Establish the parameters to grid-search; these depend on the type of model being built, so check the model-specific documentation for what each parameter means.
2) Set up the grid search; note how the parameters are wrapped in a dictionary.
3) Run the grid search by fitting the grid to the data; this step may take a long time depending on the data and the number of parameters.
4) Print the best score found during the search, according to the chosen scoring metric.
5) Print the parameters that produced that best score.

References:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
https://medium.com/@elutins/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596
In [31]:
from sklearn.model_selection import GridSearchCV

model = DecisionTreeClassifier()

criterion=['gini','entropy']
max_depth=[5,10,50,None]
splitter=['best','random']

grid = GridSearchCV(estimator=model, cv=5, param_grid=dict(criterion=criterion,max_depth=max_depth,splitter=splitter))

criterion : string, optional (default=”gini”)

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

splitter : string, optional (default=”best”)

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_depth : int or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
In [32]:
grid.fit(X_train,y_train)
Out[32]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 50, None],
                         'splitter': ['best', 'random']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [33]:
print(grid.best_score_)
print(grid.best_params_)
0.8553961909240536
{'criterion': 'entropy', 'max_depth': 10, 'splitter': 'best'}
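Since refit=True by default, the grid object already holds the best model refitted on the whole training set; a small sketch of using it directly, as an alternative to re-instantiating the classifier with the best parameters as done further below:

# GridSearchCV refits the best estimator on all of X_train (refit=True by default),
# so it can be used for prediction right away.
best_tree = grid.best_estimator_
print(best_tree.score(X_test, y_test))  # held-out test accuracy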

Grid search for logistic regression

In [34]:
from sklearn.model_selection import GridSearchCV

grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=5)

logreg_cv.fit(X_train,y_train)

print(logreg_cv.best_score_)
print(logreg_cv.best_params_)
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[... the same FutureWarning repeated for every fit; truncated ...]
0.8520652088721686
{'C': 1.0, 'penalty': 'l1'}

Cross-validation #2

In [35]:
print("\n\nLog:")
accuracy = cross_val_score(LogisticRegression(C=1.0, penalty ='l1', solver='liblinear'),
                           X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of LogisticRegression is: " , accuracy.mean()*100)

print("\n\nDecision Tree:")
accuracy = cross_val_score(DecisionTreeClassifier(criterion='entropy', max_depth=10, splitter='best'),
                           X_train, y_train, scoring='accuracy', cv = 10)
print(accuracy)
print("Accuracy of Decision Tree is: " , accuracy.mean()*100)

Log:
[0.85507246 0.84567176 0.85736677 0.85344828 0.84757053 0.84992163
 0.86005488 0.85613485 0.84398275 0.85182281]
Accuracy of LogisticRegression is:  85.21046728477069


Decision Tree:
[0.86486486 0.8511555  0.85579937 0.85070533 0.86442006 0.85775862
 0.85613485 0.85064681 0.84907879 0.8494708 ]
Accuracy of Decision Tree is:  85.5003499642416

3. Working with the decision tree model

I now work with the model retained by cross-validation, the decision tree, using the parameters selected by the grid search.

Note: given the number of rows, it is worth doing a value_counts on the target in the original data to see the class balance (positive vs. negative); this informs the choice of CV parameters so the model learns from both classes in every fold.
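A quick sketch of that check (normalize=True returns proportions rather than counts):

# Class balance of the target: roughly 76% '<=50K' (0) vs 24% '>50K' (1).
print(adult_data['income_ordinal'].value_counts(normalize=True))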

In [36]:
dtc_class = DecisionTreeClassifier(criterion='entropy', max_depth=10, splitter='best')

# fitting the model
dtc_class.fit(X_train, y_train)

# predict the response
y_test_predict_dtc = dtc_class.predict(X_test)
In [37]:
from sklearn.metrics import confusion_matrix

# Confusion_matrix
confusion_matrix_dtc = confusion_matrix(y_test, y_test_predict_dtc)
print('\nconfusion_matrix:\n', confusion_matrix_dtc)
confusion_matrix:
 [[4586  277]
 [ 637  880]]
In [39]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Oranges):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    # Label both axes with the two classes
    tick_marks = [0, 1]
    plt.xticks(tick_marks, ['<=50K', '>50K'])
    plt.yticks(tick_marks, ['<=50K', '>50K'])

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix_dtc
np.set_printoptions(precision=1)
print('Confusion matrix')
print(cm)


fig, ax = plt.subplots()
plot_confusion_matrix(cm)

plt.show()
Confusion matrix
[[4586  277]
 [ 637  880]]
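From these counts we can derive the usual test-set metrics (a small sketch using sklearn.metrics; for instance accuracy = (4586 + 880) / 6380 ≈ 0.857):

from sklearn.metrics import accuracy_score, precision_score, recall_score

print('accuracy :', accuracy_score(y_test, y_test_predict_dtc))
print('precision:', precision_score(y_test, y_test_predict_dtc))
print('recall   :', recall_score(y_test, y_test_predict_dtc))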
In [40]:
# The model is already fitted above; no need to refit inside the print call.
print('\nFeature_importances:\n', dict(zip(X.columns, dtc_class.feature_importances_)))
Feature_importances:
 {'age': 0.07362651107275976, 'fnlwgt': 0.018991213384840677, 'education-num': 0.18744260110822109, 'capital-gain': 0.1867414003016885, 'capital-loss': 0.06375567594603535, 'hours-per-week': 0.04026391335236473, 'workclass_ ?': 0.0, 'workclass_ Federal-gov': 0.0017301016030388313, 'workclass_ Local-gov': 0.002833234287434381, 'workclass_ Never-worked': 0.0, 'workclass_ Private': 0.0015527901994196525, 'workclass_ Self-emp-inc': 0.0006094857848936029, 'workclass_ Self-emp-not-inc': 0.0012022032822337461, 'workclass_ State-gov': 0.00024197161921363843, 'workclass_ Without-pay': 0.0, 'education_ 10th': 0.0, 'education_ 11th': 0.0, 'education_ 12th': 0.0004418955420129914, 'education_ 1st-4th': 0.0, 'education_ 5th-6th': 0.0005131282960588023, 'education_ 7th-8th': 0.0006587147150296562, 'education_ 9th': 0.00037777513561856906, 'education_ Assoc-acdm': 0.0, 'education_ Assoc-voc': 0.0002752124279075858, 'education_ Bachelors': 0.0005046269974757068, 'education_ Doctorate': 0.0, 'education_ HS-grad': 0.0022786769643466775, 'education_ Masters': 0.0, 'education_ Preschool': 0.0, 'education_ Prof-school': 0.0, 'education_ Some-college': 0.0010485281910174325, 'marital-status_ Divorced': 0.0005919293404822086, 'marital-status_ Married-AF-spouse': 0.0009029236923970884, 'marital-status_ Married-civ-spouse': 0.37914164023069363, 'marital-status_ Married-spouse-absent': 0.0, 'marital-status_ Never-married': 0.0010190216356231138, 'marital-status_ Separated': 0.0011380688004011313, 'marital-status_ Widowed': 0.00035140029423904093, 'occupation_ ?': 0.0, 'occupation_ Adm-clerical': 0.0, 'occupation_ Armed-Forces': 0.0, 'occupation_ Craft-repair': 0.0003150840474388622, 'occupation_ Exec-managerial': 0.008291131064258292, 'occupation_ Farming-fishing': 0.0018360844033099694, 'occupation_ Handlers-cleaners': 0.0, 'occupation_ Machine-op-inspct': 0.0, 'occupation_ Other-service': 0.004026625075382765, 'occupation_ Priv-house-serv': 0.0, 'occupation_ Prof-specialty': 0.003661901212033862, 'occupation_ Protective-serv': 0.0, 'occupation_ Sales': 0.000331008663758739, 'occupation_ Tech-support': 0.0005291771027751811, 'occupation_ Transport-moving': 0.0015476777084400227, 'relationship_ Husband': 0.0, 'relationship_ Not-in-family': 0.0015404199882885202, 'relationship_ Other-relative': 0.00047909530182792633, 'relationship_ Own-child': 0.0, 'relationship_ Unmarried': 0.00019416740204423376, 'relationship_ Wife': 0.002166586166772914, 'race_ Amer-Indian-Eskimo': 0.0, 'race_ Asian-Pac-Islander': 0.0, 'race_ Black': 0.0006070026354230765, 'race_ Other': 0.0, 'race_ White': 0.0, 'sex_ Female': 0.0013002796451605353, 'sex_ Male': 0.004274504663065744, 'group_country_ United-States': 0.0, 'group_country_America': 0.00029977928649757167, 'group_country_Asia': 0.0, 'group_country_Europe': 0.0003648314280743071}
In [41]:
pd.Series(dtc_class.feature_importances_, index=X_train.columns).nlargest(4).plot(kind='bar',color='#86bf91')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1252f3128>