{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification methodology - Example\n", "### Introduction to Data Science - UdelaR\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In this notebook we work through a (very simple!) applied example of the classification methodology covered in the course. We use the (very popular) [Titanic](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt) dataset. The learning task is to predict, given a Titanic passenger, whether that passenger survived. " ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "import sklearn.preprocessing\n", "import sklearn.feature_selection\n", "import sklearn.model_selection\n", "import graphviz\n", "import scipy.stats\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Example: the Titanic dataset, a list of the Titanic's passengers indicating whether or not each one survived. More details [here](). " ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " row.names pclass survived \\\n", "0 1 1st 1 \n", "1 2 1st 0 \n", "2 3 1st 0 \n", "3 4 1st 0 \n", "4 5 1st 1 \n", "... ... ... ... \n", "1308 1309 3rd 0 \n", "1309 1310 3rd 0 \n", "1310 1311 3rd 0 \n", "1311 1312 3rd 0 \n", "1312 1313 3rd 0 \n", "\n", " name age embarked \\\n", "0 Allen, Miss Elisabeth Walton 29.0000 Southampton \n", "1 Allison, Miss Helen Loraine 2.0000 Southampton \n", "2 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton \n", "3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton \n", "4 Allison, Master Hudson Trevor 0.9167 Southampton \n", "... ... ... ... \n", "1308 Zakarian, Mr Artun NaN NaN \n", "1309 Zakarian, Mr Maprieder NaN NaN \n", "1310 Zenn, Mr Philip NaN NaN \n", "1311 Zievens, Rene NaN NaN \n", "1312 Zimmerman, Leo NaN NaN \n", "\n", " home.dest room ticket boat sex \n", "0 St Louis, MO B-5 24160 L221 2 female \n", "1 Montreal, PQ / Chesterville, ON C26 NaN NaN female \n", "2 Montreal, PQ / Chesterville, ON C26 NaN (135) male \n", "3 Montreal, PQ / Chesterville, ON C26 NaN NaN female \n", "4 Montreal, PQ / Chesterville, ON C22 NaN 11 male \n", "... ... ... ... ... ... 
\n", "1308 NaN NaN NaN NaN male \n", "1309 NaN NaN NaN NaN male \n", "1310 NaN NaN NaN NaN male \n", "1311 NaN NaN NaN NaN female \n", "1312 NaN NaN NaN NaN male \n", "\n", "[1313 rows x 11 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic=pd.read_csv('https://raw.githubusercontent.com/pln-fing-udelar/curso_aa/master/data/titanic.csv')\n", "titanic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Phase 1: Preprocessing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Missing values\n", "\n", "We replace the missing values of the `age` attribute with the mean of its observed values." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of instances with a missing age: 680\n" ] }, { "data": { "text/plain": [ "31.19418104265403" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Count how many ages are NaN\n", "print(\"Number of instances with a missing age: {0}\".format(titanic['age'].isna().sum()))\n", "\n", "# Compute the mean age over all passengers whose age is known\n", "# (taking the mean of the single column avoids averaging the non-numeric columns)\n", "mean_age=titanic['age'].mean()\n", "display(mean_age)\n", "\n", "# Fill every missing age with that overall mean\n", "titanic.loc[titanic['age'].isna(),'age']=mean_age\n" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " row.names pclass survived \\\n", "0 1 1st 1 \n", "1 2 1st 0 \n", "2 3 1st 0 \n", "3 4 1st 0 \n", "4 5 1st 1 \n", "... ... ... ... \n", "1308 1309 3rd 0 \n", "1309 1310 3rd 0 \n", "1310 1311 3rd 0 \n", "1311 1312 3rd 0 \n", "1312 1313 3rd 0 \n", "\n", " name age embarked \\\n", "0 Allen, Miss Elisabeth Walton 29.000000 Southampton \n", "1 Allison, Miss Helen Loraine 2.000000 Southampton \n", "2 Allison, Mr Hudson Joshua Creighton 30.000000 Southampton \n", "3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.000000 Southampton \n", "4 Allison, Master Hudson Trevor 0.916700 Southampton \n", "... ... ... ... \n", "1308 Zakarian, Mr Artun 31.194181 NaN \n", "1309 Zakarian, Mr Maprieder 31.194181 NaN \n", "1310 Zenn, Mr Philip 31.194181 NaN \n", "1311 Zievens, Rene 31.194181 NaN \n", "1312 Zimmerman, Leo 31.194181 NaN \n", "\n", " home.dest room ticket boat sex \n", "0 St Louis, MO B-5 24160 L221 2 female \n", "1 Montreal, PQ / Chesterville, ON C26 NaN NaN female \n", "2 Montreal, PQ / Chesterville, ON C26 NaN (135) male \n", "3 Montreal, PQ / Chesterville, ON C26 NaN NaN female \n", "4 Montreal, PQ / Chesterville, ON C22 NaN 11 male \n", "... ... ... ... ... ... 
\n", "1308 NaN NaN NaN NaN male \n", "1309 NaN NaN NaN NaN male \n", "1310 NaN NaN NaN NaN male \n", "1311 NaN NaN NaN NaN female \n", "1312 NaN NaN NaN NaN male \n", "\n", "[1313 rows x 11 columns]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Titanic example (cont.): we convert the `sex` attribute to a binary one (taking values in {0,1}). " ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Create a LabelEncoder using scikit-learn\n", "le=sklearn.preprocessing.LabelEncoder()\n", "# Learn the classes from the values present in the dataset\n", "le.fit(titanic['sex'])\n", "# le.classes_ now holds the learned classes: 'female' -> 0, 'male' -> 1\n", "# Replace the sex column with its encoded values\n", "titanic['sex'] = le.transform(titanic['sex'])\n" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " row.names pclass survived \\\n", "0 1 1st 1 \n", "1 2 1st 0 \n", "2 3 1st 0 \n", "3 4 1st 0 \n", "4 5 1st 1 \n", "... ... ... ... \n", "1308 1309 3rd 0 \n", "1309 1310 3rd 0 \n", "1310 1311 3rd 0 \n", "1311 1312 3rd 0 \n", "1312 1313 3rd 0 \n", "\n", " name age embarked \\\n", "0 Allen, Miss Elisabeth Walton 29.000000 Southampton \n", "1 Allison, Miss Helen Loraine 2.000000 Southampton \n", "2 Allison, Mr Hudson Joshua Creighton 30.000000 Southampton \n", "3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.000000 Southampton \n", "4 Allison, Master Hudson Trevor 0.916700 Southampton \n", "... ... ... ... \n", "1308 Zakarian, Mr Artun 31.194181 NaN \n", "1309 Zakarian, Mr Maprieder 31.194181 NaN \n", "1310 Zenn, Mr Philip 31.194181 NaN \n", "1311 Zievens, Rene 31.194181 NaN \n", "1312 Zimmerman, Leo 31.194181 NaN \n", "\n", " home.dest room ticket boat sex \n", "0 St Louis, MO B-5 24160 L221 2 0 \n", "1 Montreal, PQ / Chesterville, ON C26 NaN NaN 0 \n", "2 Montreal, PQ / Chesterville, ON C26 NaN (135) 1 \n", "3 Montreal, PQ / Chesterville, ON C26 NaN NaN 0 \n", "4 Montreal, PQ / Chesterville, ON C22 NaN 11 1 \n", "... ... ... ... ... ... 
\n", "1308 NaN NaN NaN NaN 1 \n", "1309 NaN NaN NaN NaN 1 \n", "1310 NaN NaN NaN NaN 1 \n", "1311 NaN NaN NaN NaN 0 \n", "1312 NaN NaN NaN NaN 1 \n", "\n", "[1313 rows x 11 columns]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Titanic example (cont.): we transform the `pclass` field using one-hot encoding:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[array(['1st', '2nd', '3rd'], dtype=object)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Use scikit-learn to create a one-hot encoder that returns a dense array\n", "# (scikit-learn 1.2 renamed the `sparse` parameter to `sparse_output`)\n", "ohe=sklearn.preprocessing.OneHotEncoder(sparse=False)\n", "\n", "# Learn the categories from the pclass column\n", "ohe.fit(titanic['pclass'].to_numpy().reshape(-1,1))\n", "display(ohe.categories_)\n", "\n", "# Compute the encoded values from the original ones\n", "new=ohe.transform(titanic['pclass'].to_numpy().reshape(-1,1))\n", "\n", "# Create one new attribute per category\n", "titanic['class_1st']=new[:,0]\n", "titanic['class_2nd']=new[:,1]\n", "titanic['class_3rd']=new[:,2]\n", "\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " row.names pclass survived \\\n", "0 1 1st 1 \n", "1 2 1st 0 \n", "2 3 1st 0 \n", "3 4 1st 0 \n", "4 5 1st 1 \n", "... ... ... ... \n", "1308 1309 3rd 0 \n", "1309 1310 3rd 0 \n", "1310 1311 3rd 0 \n", "1311 1312 3rd 0 \n", "1312 1313 3rd 0 \n", "\n", " name age embarked \\\n", "0 Allen, Miss Elisabeth Walton 29.000000 Southampton \n", "1 Allison, Miss Helen Loraine 2.000000 Southampton \n", "2 Allison, Mr Hudson Joshua Creighton 30.000000 Southampton \n", "3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.000000 Southampton \n", "4 Allison, Master Hudson Trevor 0.916700 Southampton \n", "... ... ... ... \n", "1308 Zakarian, Mr Artun 31.194181 NaN \n", "1309 Zakarian, Mr Maprieder 31.194181 NaN \n", "1310 Zenn, Mr Philip 31.194181 NaN \n", "1311 Zievens, Rene 31.194181 NaN \n", "1312 Zimmerman, Leo 31.194181 NaN \n", "\n", " home.dest room ticket boat sex class_1st \\\n", "0 St Louis, MO B-5 24160 L221 2 0 1.0 \n", "1 Montreal, PQ / Chesterville, ON C26 NaN NaN 0 1.0 \n", "2 Montreal, PQ / Chesterville, ON C26 NaN (135) 1 1.0 \n", "3 Montreal, PQ / Chesterville, ON C26 NaN NaN 0 1.0 \n", "4 Montreal, PQ / Chesterville, ON C22 NaN 11 1 1.0 \n", "... ... ... ... ... ... ... \n", "1308 NaN NaN NaN NaN 1 0.0 \n", "1309 NaN NaN NaN NaN 1 0.0 \n", "1310 NaN NaN NaN NaN 1 0.0 \n", "1311 NaN NaN NaN NaN 0 0.0 \n", "1312 NaN NaN NaN NaN 1 0.0 \n", "\n", " class_2nd class_3rd \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "... ... ... \n", "1308 0.0 1.0 \n", "1309 0.0 1.0 \n", "1310 0.0 1.0 \n", "1311 0.0 1.0 \n", "1312 0.0 1.0 \n", "\n", "[1313 rows x 14 columns]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " survived age sex class_1st class_2nd class_3rd\n", "0 1 29.000000 0 1.0 0.0 0.0\n", "1 0 2.000000 0 1.0 0.0 0.0\n", "2 0 30.000000 1 1.0 0.0 0.0\n", "3 0 25.000000 0 1.0 0.0 0.0\n", "4 1 0.916700 1 1.0 0.0 0.0\n", "... ... ... ... ... ... ...\n", "1308 0 31.194181 1 0.0 0.0 1.0\n", "1309 0 31.194181 1 0.0 0.0 1.0\n", "1310 0 31.194181 1 0.0 0.0 1.0\n", "1311 0 31.194181 0 0.0 0.0 1.0\n", "1312 0 31.194181 1 0.0 0.0 1.0\n", "\n", "[1313 rows x 6 columns]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop some attributes that do not seem relevant.\n", "# This choice could be refined using feature selection\n", "titanic.drop(['row.names','pclass', 'name', 'embarked', 'home.dest', 'room', 'ticket', 'boat'], axis=1, inplace=True)\n", "titanic" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Phase 2: Splitting the dataset\n", "\n", "### Training, test, [and validation] sets\n" ] },
{ "cell_type": "code", "execution_count": 75, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# First separate the features (X) from the target (y)\n", "\n", "titanic_X = titanic[['age','sex', 'class_1st', 'class_2nd', 'class_3rd']]\n", "titanic_y = titanic[['survived']]\n" ] },
{ "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(984, 5)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(329, 5)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Build the training and test sets (75% / 25%)\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(titanic_X, titanic_y, test_size=0.25, random_state=33)\n", "\n", "display(X_train.shape)\n", "display(X_test.shape)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Phase 3: Training\n" ] },
{ "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Train a decision tree on the training data\n", "from sklearn import tree\n", "clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3 , min_samples_leaf=5)\n", "clf = clf.fit(X_train,y_train)\n" ] },
{ "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "Decision tree (node text recovered from the rendered graph)\n", "0: sex <= 0.5, entropy = 0.912, samples = 984, value = [662, 322]\n", "  0->1 (True): 3rd_class <= 0.5, entropy = 0.918, samples = 333, value = [111, 222]\n", "    1->2: 1st_class <= 0.5, entropy = 0.374, samples = 180, value = [13, 167]\n", "      2->3: entropy = 0.477, samples = 78, value = [8, 70]\n", "      2->4: entropy = 0.282, samples = 102, value = [5, 97]\n", "    1->5: age <= 13.0, entropy = 0.942, samples = 153, value = [98, 55]\n", "      5->6: entropy = 0.592, samples = 7, value = [6, 1]\n", "      5->7: entropy = 0.951, samples = 146, value = [92, 54]\n", "  0->8 (False): age <= 13.5, entropy = 0.619, samples = 651, value = [551, 100]\n", "    8->9: 3rd_class <= 0.5, entropy = 0.918, samples = 21, value = [7, 14]\n", "      9->10: entropy = 0.0, samples = 9, value = [0, 9]\n", "      9->11: entropy = 0.98, samples = 12, value = [7, 5]\n", "    8->12: 1st_class <= 0.5, entropy = 0.575, samples = 630, value = [544, 86]\n", "      12->13: entropy = 0.452, samples = 496, value = [449, 47]\n", "      12->14: entropy = 0.87, samples = 134, value = [95, 39]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Render the trained tree; the feature names follow the column order of X_train\n", "src = graphviz.Source(tree.export_graphviz(clf, feature_names=['age','sex','1st_class','2nd_class','3rd_class']))\n", "src\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Cross-validation and model selection\n" ] },
{ "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Depth 1, mean accuracy: 0.786\n", "Depth 2, mean accuracy: 0.833\n", "Depth 3, mean accuracy: 0.827\n", "Depth 4, mean accuracy: 0.825\n", "Depth 5, mean accuracy: 0.823\n", "Depth 6, mean accuracy: 0.825\n", "Depth 7, mean accuracy: 0.823\n", "Depth 8, mean accuracy: 0.824\n", "Depth 9, mean accuracy: 0.824\n", "Depth 10, mean accuracy: 0.826\n" ] } ], "source": [ "# Use 5-fold cross-validation on the training set to find the best tree depth\n", "from sklearn import metrics\n", "for md in range(10):\n", "    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=md+1 , min_samples_leaf=5)\n", "    kf=sklearn.model_selection.KFold(n_splits=5)\n", "    scores=np.zeros(5)\n", "    score_index=0\n", "    for train_index, test_index in kf.split(X_train):\n", "        X_train_cv, X_test_cv= X_train.iloc[train_index], X_train.iloc[test_index]\n", "        y_train_cv, y_test_cv= y_train.iloc[train_index], y_train.iloc[test_index]\n", "        clf = clf.fit(X_train_cv,y_train_cv)\n", "        y_pred=clf.predict(X_test_cv)\n", "        scores[score_index]=metrics.accuracy_score(y_test_cv.astype(int), y_pred.astype(int))\n", "        score_index += 1\n", "    # {1} is the mean accuracy over the folds, {2} its standard error\n", "    print (\"Depth {0:d}, mean accuracy: {1:.3f} (+/-{2:.3f})\".format(md+1, np.mean(scores), scipy.stats.sem(scores)))" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Phase 4: Evaluation\n", "\n", "### Precision and recall\n" ] },
{ "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn import metrics\n", "def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):\n", "    # Predict on X and print the requested evaluation measures against the true labels y\n", "    y_pred=clf.predict(X)\n", "    if show_accuracy:\n", "        print (\"Accuracy:{0:.3f}\".format(metrics.accuracy_score(y,y_pred)),\"\\n\")\n", "\n", "    if show_classification_report:\n", "        print(\"Classification report\")\n", "        print(metrics.classification_report(y,y_pred),\"\\n\")\n", "\n", "    if show_confusion_matrix:\n", "        print (\"Confusion matrix\")\n", "        print (metrics.confusion_matrix(y,y_pred),\"\\n\")\n" ] },
{ "cell_type": "code", "execution_count": 81, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy:0.787 \n", "\n", "Classification report\n", " precision recall f1-score support\n", "\n", " 0 0.77 0.94 0.84 202\n", " 1 0.85 0.54 0.66 127\n", "\n", " accuracy 0.79 329\n", " macro avg 0.81 0.74 0.75 329\n", "weighted avg 0.80 0.79 0.77 329\n", " \n", "\n", "Confusion matrix\n", "[[190 12]\n", " [ 58 69]] \n", "\n" ] } ], "source": [ "# Build a classifier with the best depth found (max_depth=2), train it on the whole training set,\n", "# and evaluate it on the held-out test set\n", "\n", "clf_dt=tree.DecisionTreeClassifier(criterion='entropy', max_depth=2 ,min_samples_leaf=5)\n", "clf_dt.fit(X_train,y_train)\n", "measure_performance(X_test,y_test,clf_dt)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The confusion matrix reads as follows (rows are the true class, columns the predicted class):\n", "\n", "- 190 passengers did not survive and were classified correctly. Another 12 did not survive but were classified as survivors.\n", "- 69 passengers survived and were classified correctly. Another 58 survived but were classified as non-survivors.\n", "\n", "- Precision for class 0: TP/(TP+FP) = 190/(190+58) = 0.77 (note that here \"positive\" means the passenger did not survive)\n", "- Recall for class 0: TP/(TP+FN) = 190/(190+12) = 0.94\n", "- Precision for class 1: TP/(TP+FP) = 69/(69+12) = 0.85\n", "- Recall for class 1: TP/(TP+FN) = 69/(69+58) = 0.54\n", "\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Prediction" ] },
{ "cell_type": "code", "execution_count": 86, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([1], dtype=int64)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "array([0], dtype=int64)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "titanic_X = titanic[['age','sex', 'class_1st', 'class_2nd', 'class_3rd']]\n", "\n", "# Feature order: age, sex (0 = female, 1 = male), class_1st, class_2nd, class_3rd\n", "# Predict with clf_dt, the model selected and evaluated above\n", "Rose = np.array([17,0,1,0,0]).reshape(1, -1)\n", "\n", "y_pred=clf_dt.predict(Rose)\n", "display(y_pred)\n", "\n", "Jack = np.array([23,1,0,0,1]).reshape(1, -1)\n", "y_pred=clf_dt.predict(Jack)\n", "display(y_pred)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }