Classification on the Titanic data set

In [805]:
%pylab inline
import pandas as pd
matplotlib.rcParams['figure.figsize'] = (10,6)
Populating the interactive namespace from numpy and matplotlib

The data

In [817]:
training= pd.read_csv("data/titanic_training.csv", index_col=0)
test = pd.read_csv("data/titanic_test.csv", index_col=0)

Data description

In [818]:
training.head(3)
Out[818]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
584 2 1 Webber, Miss. Susan female 32.5 0 0 27267 13.00 E101 S 12 NaN England / Hartford, CT
328 2 0 Angle, Mr. William A male 34.0 1 0 226875 26.00 NaN S NaN NaN Warwick, England
780 3 1 Drapkin, Miss. Jennie female 23.0 0 0 SOTON/OQ 392083 8.05 NaN S NaN NaN London New York, NY

survived: 1 means the passenger survived (remember that True is 1)

Data visualisation

In [819]:
training.survived.value_counts().plot(kind="bar")
title("Survival (1=survived)")
grid()
In [820]:
training[["survived", "age"]].boxplot(by="survived")
Out[820]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8d7ff2bbe0>
In [821]:
training.pclass.value_counts().plot.barh()
title("class distribution")
grid()

Cleanup

  • Remove useless features or features with many missing values
  • Drop the remaining rows that contain at least one missing value
In [822]:
# many missing values
print(len(training), len(test))
ignore = ["cabin", "home.dest", "body", "embarked", "boat"]
training.drop(ignore, axis=1, inplace=True)
test.drop(ignore, axis=1, inplace=True)
training.dropna(inplace=True)
test.dropna(inplace=True)
print(len(training), len(test))
916 393
726 319
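
As an alternative to dropping incomplete rows, missing values can also be imputed. A small illustrative sketch (not used in the rest of this notebook) that fills the missing ages with the median age, on a fresh copy of the data:

In [ ]:
# Illustrative only: impute missing ages with the median instead of dropping rows
imputed = pd.read_csv("data/titanic_training.csv", index_col=0)
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["age"].isnull().sum()   # no missing ages left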

Let us replace "male" with 0 and "female" with 1

In [823]:
training = training.replace("male", 0).replace("female", 1)
test = test.replace("male", 0).replace("female", 1)

Find a relevant feature of interest

In [824]:
male_survived = len(training.query("sex==0 and survived==1"))
N_male = len(training.query("sex==0"))
male_survival_rate = male_survived / N_male * 100

female_survived = len(training.query("sex==1 and survived==1"))
N_female = len(training.query("sex==1"))
female_survival_rate = female_survived / N_female * 100

bar([0,1], [male_survival_rate, female_survival_rate ])
xticks([0,1], ['male', 'female'])
xlabel("survival rate (%)")
Out[824]:
<matplotlib.text.Text at 0x7f8d7ff57be0>

The training and test data

In [825]:
features = ["sex"]

Y = training.survived.values
X = training.loc[:, features].values

Ytest = test.survived.values
Xtest = test.loc[:,features]

Your own naive classifier

Let us predict the survival as follows:

  • if it is a woman: survives
  • if it is a man: does not survive

This is as simple as:

In [826]:
Y_pred = (Xtest.sex == 1)

Accuracy = fraction of all instances that are correctly classified

In [827]:
sum(Y_pred == Ytest)/len(Y_pred)
Out[827]:
0.76802507836990597
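
For reference, sklearn computes the same quantity with accuracy_score (a quick sanity check on the variables defined above):

In [ ]:
from sklearn.metrics import accuracy_score
# cast the boolean predictions to 0/1 before comparing with Ytest
accuracy_score(Ytest, Y_pred.astype(int))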

What about a proper classifier from sklearn?

In [828]:
from sklearn.linear_model import SGDClassifier
In [941]:
clf = SGDClassifier()
clf.fit(X, Y)
# Let us save the prediction for later
Y_pred_1 = clf.predict(Xtest)

clf.score(Xtest, Ytest)
Out[941]:
0.76802507836990597

The score is not deterministic! See below.

Stochastic Gradient Descent classifier

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression.

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing.

The advantages of Stochastic Gradient Descent are:

  • Efficiency.
  • Ease of implementation (lots of opportunities for code tuning).

The disadvantages of Stochastic Gradient Descent include:

  • SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
  • SGD is sensitive to feature scaling (see the sketch below).
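
Since SGD is sensitive to feature scaling, a common remedy is to standardise the inputs before fitting. A minimal sketch using a Pipeline (with the single binary feature used above the scaling changes little, but it matters for continuous features such as age or fare; the variable name is illustrative):

In [ ]:
# standardise the features, then fit the SGD classifier, in one pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier())
scaled_sgd.fit(X, Y)
scaled_sgd.score(Xtest, Ytest)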

Binary classifiers

Some definitions

TPR = recall = sensitivity = $\frac{TP}{P} = \frac{TP}{TP+FN}$

FPR = fall-out = $\frac{FP}{N} = \frac{FP}{FP+TN}$

See pages 17-20 of https://f1000research.com/articles/4-1030/v1 for all the definitions
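
As an illustration, TPR and FPR can be computed by hand from the confusion matrix, re-using Ytest and the SGD prediction Y_pred_1 from above:

In [ ]:
from sklearn.metrics import confusion_matrix
# binary confusion matrix layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(Ytest, Y_pred_1).ravel()
TPR = tp / (tp + fn)   # recall / sensitivity
FPR = fp / (fp + tn)   # fall-out
TPR, FPR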

ROC curve

In [841]:
from sklearn import metrics
fpr1, tpr1, thresholds = metrics.roc_curve(Ytest, Y_pred_1, pos_label=1)
In [842]:
metrics.roc_auc_score(Ytest, Y_pred_1)
Out[842]:
0.75437962070073927
In [843]:
plot(fpr1, tpr1, "o-")
plot([0,1], [0,1], "-k")
Out[843]:
[<matplotlib.lines.Line2D at 0x7f8d7ff55630>]
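
The curve above is built from hard 0/1 predictions, so it contains only a few points. roc_curve can sweep over thresholds when given a continuous score instead, e.g. the classifier's decision_function. A sketch (with a single binary feature the scores still take only two values, but with richer features the curve becomes smoother):

In [ ]:
# use the continuous decision function rather than hard 0/1 predictions
scores_sgd = clf.decision_function(Xtest)
fpr_c, tpr_c, thresholds_c = metrics.roc_curve(Ytest, scores_sgd, pos_label=1)
plot(fpr_c, tpr_c, "o-")
plot([0, 1], [0, 1], "-k")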
In [844]:
scores = []
for this in range(20):
    clf = SGDClassifier()
    clf.fit(X, Y)
    scores.append(clf.score(Xtest, Ytest))
    prediction = clf.predict(Xtest)
    fpr, tpr, thresholds = metrics.roc_curve(Ytest, prediction, pos_label=1)
    plot(fpr, tpr, "-o")
mean(scores), std(scores)
Out[844]:
(0.6608150470219436, 0.13932608845914329)

Add more features

In [845]:
features = ["sex", "pclass"]

Y = training.survived.values
X = training.loc[:, features].values

Ytest = test.survived.values
Xtest = test.loc[:,features].values
In [846]:
subplot(2,2,1)
training.query("sex==0 and pclass==3").survived.value_counts().sort_index().plot(
    kind="bar", label='male class 3')
legend(); xticks([0,1], ["died", "survived"])
subplot(2,2,2)
training.query("sex==0 and pclass!=3").survived.value_counts().sort_index().plot(
    kind="bar", label='male class 1/2')
legend(); xticks([0,1], ["died", "survived"])
subplot(2,2,3)
training.query("sex==1 and pclass==3").survived.value_counts().sort_index().plot(
    kind="bar", label='female class 3')
legend(); xticks([0,1], ["died", "survived"])
subplot(2,2,4)
training.query("sex==1 and pclass!=3").survived.value_counts().sort_index().plot(
    kind="bar", label='female class 1/2')
legend();  _ = xticks([0,1], ["died", "survived"])
In [847]:
clf = SGDClassifier()
clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[847]:
0.72413793103448276

Note that the classifier is stochastic, so two features sometimes perform worse than one, but on average they do better

In [875]:
scores = []
for this in range(20):
    clf = SGDClassifier().fit(X, Y)
    scores.append(clf.score(Xtest, Ytest))
    prediction = clf.predict(Xtest)
    fpr, tpr, thresholds = metrics.roc_curve(Ytest, prediction, pos_label=1)
    plot(fpr, tpr, "-o")
mean(scores), std(scores)
Out[875]:
(0.6967084639498432, 0.10295729883404001)

What about the KNeighborsClassifier?

In [868]:
features = ["sex"]
Y = training.survived.values
X = training.loc[:, features].values
Ytest = test.survived.values
Xtest = test.loc[:,features].values
In [869]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[869]:
0.76802507836990597
In [917]:
features = ["sex", "pclass",]
Y = training.survived.values
X = training.loc[:, features].values
Ytest = test.survived.values
Xtest = test.loc[:,features].values
In [918]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[918]:
0.72413793103448276

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Some advantages:

  • simple to understand and interpret (the tree can be visualised, see below)
  • low prediction cost (logarithmic in the number of training samples)
  • handles both categorical and numerical data (little data preparation needed)

Some drawbacks:

  • can be unstable: small variations in the data may lead to a completely different tree
  • prone to overfitting (overly complex trees); see the sketch below for one mitigation
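A common way to mitigate overfitting is to bound the complexity of the tree, for instance its depth. A small sketch (max_depth=3 and the variable name are illustrative, not tuned):

In [ ]:
# a shallower tree is less likely to overfit the training data
from sklearn import tree

clf_shallow = tree.DecisionTreeClassifier(max_depth=3)
clf_shallow.fit(X, Y)
clf_shallow.score(Xtest, Ytest)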
In [951]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[951]:
0.78056426332288398
In [948]:
import pydotplus 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = pydotplus.graph_from_dot_data(dot_data) 
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=features,  
                         class_names=["survived", "died"],  
                         filled=True, rounded=True,  special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
In [949]:
from IPython.display import Image  
Image(graph.create_png())  
Out[949]:

RandomForest

Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.

-- wikipedia

From sklearn website:

In random forests each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.

In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features.

Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
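
A minimal sketch making these two sources of randomness explicit through the constructor parameters (the values and variable name are illustrative, not tuned):

In [ ]:
# bootstrap=True: each tree sees a training sample drawn with replacement
# max_features: size of the random subset of features tried at each split
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", bootstrap=True)
clf_rf.fit(X, Y)
clf_rf.score(Xtest, Ytest)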

In [978]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=5)
clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[978]:
0.78056426332288398

Neural network

In [1032]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(10, 1), random_state=1)
_ = clf.fit(X, Y)
clf.score(Xtest, Ytest)
Out[1032]:
0.78056426332288398

Cross validation for binary classifiers

In [1041]:
training.head(3)
Out[1041]:
pclass survived name sex age sibsp parch ticket fare
584 2 1 Webber, Miss. Susan 1 32.5 0 0 27267 13.00
328 2 0 Angle, Mr. William A 0 34.0 1 0 226875 26.00
780 3 1 Drapkin, Miss. Jennie 1 23.0 0 0 SOTON/OQ 392083 8.05
In [1080]:
features = ["sex", "class"]
Y = training.survived.values
X = training.loc[:, features].values
Ytest = test.survived.values
Xtest = test.loc[:,features].values
In [1079]:
from sklearn.model_selection import cross_val_score
clf1 = tree.DecisionTreeClassifier()
clf2 = SGDClassifier()
clf3 = KNeighborsClassifier()
clf4 = RandomForestClassifier(n_estimators=10)

scores1 = cross_val_score(clf1, X, Y, cv=5)
scores2 = cross_val_score(clf2, X, Y, cv=5)
scores3 = cross_val_score(clf3, X, Y, cv=5)
scores4 = cross_val_score(clf4, X, Y, cv=5)


yerr = [std(this) for this in [scores1, scores2, scores3, scores4]]
mus = [mean(this) for this in [scores1, scores2, scores3, scores4]]
errorbar([0,1,2,3], mus, yerr=yerr, xerr=0.1, fmt="o")
ylim([0.5,1])
_ = xticks([0,1,2,3], ["tree", "sgd", "kn", "RF"], rotation=0, fontsize=40)
print(mus)
[0.78376003778932457, 0.58789796882380718, 0.78376003778932457, 0.78376003778932457]

Summary

  • Select the relevant features
  • Remove or impute missing data

  • Choose a classifier. We have seen:

    - KNeighborsClassifier
    - DecisionTreeClassifier
    - SGDClassifier
    - RandomForestClassifier
    - MLPClassifier

    but there are many more; see the sklearn website

This section was used to create the training and test sets from the full data set

In [794]:
from sklearn.utils import shuffle
df = pd.read_csv("data/titanic3.csv")
df = shuffle(df)
training = df.iloc[:916]
test = df.iloc[916:]
training.to_csv("data/titanic_training.csv")
test.to_csv("data/titanic_test.csv")
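For reference, sklearn's train_test_split shuffles and splits in one call. A sketch reproducing the same split sizes (the random_state and variable names are arbitrary):

In [ ]:
from sklearn.model_selection import train_test_split
# test_size given as an absolute number of rows (393, as above)
training_alt, test_alt = train_test_split(df, test_size=393, random_state=0)
print(len(training_alt), len(test_alt))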
In [795]:
pd.read_csv("data/titanic_training.csv", index_col=0)
Out[795]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
584 2 1 Webber, Miss. Susan female 32.5 0 0 27267 13.0000 E101 S 12 NaN England / Hartford, CT
328 2 0 Angle, Mr. William A male 34.0 1 0 226875 26.0000 NaN S NaN NaN Warwick, England
780 3 1 Drapkin, Miss. Jennie female 23.0 0 0 SOTON/OQ 392083 8.0500 NaN S NaN NaN London New York, NY
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

916 rows × 14 columns