{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ejemplo de clasificación con árbol de decisión" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Al comienzo se importan las bibliotecas y módulos a utilizar" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import tree\n", "from sklearn.metrics import confusion_matrix\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Luego aparecen la funciones auxiliares" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def funcion_auxiliar_que_suma(a,b):\n", " \"\"\"\n", " Esta funcion suma dos elementos y devuelve el resultado\n", " \"\"\"\n", " \n", " suma = a+b\n", " \n", " return suma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(cm, classes,\n", " normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " import itertools \n", " \n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", "\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " print(\"Normalized confusion matrix\")\n", " else:\n", " print('Confusion matrix, without normalization')\n", "\n", " print(cm)\n", " \n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, cm[i, j],\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", " plt.tight_layout()\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1- Se cargan los datos " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso se utilizará la base de datos iris que viene incluída en el paquete datasets. Además de incluir algunas bases conocidas, el [paquete](http://scikit-learn.org/stable/datasets/index.html#datasets) permite generar datos sintéticos." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db = load_iris() # from sklearn.datasets import load_iris\n", "print(db.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ejemplo: como levantar datos desde la web" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load CSV from URL using NumPy\n", "from urllib.request import urlopen\n", "url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'\n", "raw_data = urlopen(url)\n", "diabetesDataset = np.loadtxt(raw_data, delimiter=\",\")\n", "print(diabetesDataset.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Volvemos al ejemplo con la base de datos iris" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = [1, 2] # En este ejemplo usaremos solo dos características\n", "data = db.data[:,features]\n", "feat_names=[db.feature_names[features[0]]] + [db.feature_names[features[1]]]\n", "print(feat_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2- Separar en conjunto de entrenamiento y test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "El modulo [model_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) permite separar el conjunto de datos en entrenamiento y test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from sklearn.model_selection import train_test_split.\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data , db.target, test_size=0.4, random_state=213)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3- Definir el clasificador" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Los árboles de decisión son uno de los disponibles en scikit-learn para realizar [aprendizaje supervizado](http://scikit-learn.org/stable/supervised_learning.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from sklearn import tree\n", "clf = tree.DecisionTreeClassifier()\n", "#clf = tree.DecisionTreeClassifier(min_samples_leaf=3)\n", "#clf = tree.DecisionTreeClassifier(max_depth=2)\n", "print(clf)\n", "#help(clf)\n", "# min_samples_leaf=1\n", "# max_depth=None" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#help(clf)\n", "#print(clf.feature_importances_) # los atributos que terminan con _ se llenan luego de entrenar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4- Entrenar" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import graphviz \n", "dot_data = tree.export_graphviz(clf, out_file=None,\n", " feature_names=feat_names,\n", " class_names=db.target_names,\n", " filled=True, rounded=True,\n", " special_characters=True)\n", "graph = graphviz.Source(dot_data) \n", "graph.render(\"tree\")\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este caso, como se utilizan sólo dos características, se puede visualizar la superficie de decisión" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the decision boundaries\n", "\n", "# Parameters\n", "n_classes = 3\n", "plot_colors = \"bry\"\n", "plot_step = 0.01\n", "\n", "plt.figure(figsize=(10,10))\n", "\n", "# Se realiza una grilla que cubre todo el conjunto de entrenamiento\n", "x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1\n", "y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),\n", " np.arange(y_min, y_max, plot_step))\n", "\n", "# Se predice la clase de cada uno de los puntos de la grilla\n", "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "Z = Z.reshape(xx.shape)\n", "\n", "#Se asigna un color a cada clase y se plotea la superficie\n", "cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)\n", "plt.axis(\"equal\")\n", "\n", "# Se plotean también los puntos del conjunto de entrenamiento y test\n", "for i, color in zip(range(n_classes), plot_colors):\n", " idx_train = np.where(y_train == i)\n", " idx_test = np.where(y_test == i)\n", " plt.scatter(X_train[idx_train, 0], X_train[idx_train, 1], c=color, marker='o', label=db.target_names[i] +' train',\n", " cmap=plt.cm.Paired)\n", " plt.scatter(X_test[idx_test, 0], X_test[idx_test, 1], c=color, marker='v', label=db.target_names[i] +' test',\n", " cmap=plt.cm.Paired)\n", "\n", "plt.xlim(x_min, x_max)\n", "plt.ylim(y_min, y_max)\n", "plt.legend(loc='upper right')\n", "plt.xlabel(db.feature_names[features[0]])\n", "plt.ylabel(db.feature_names[features[1]])\n", "plt.title('Decision Boundary')\n", "plt.savefig('decisionBoundary.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Se puede predecir a qué clase pertenece un punto y con qué probabilidad (según el modelo aprendido)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(clf.predict([[2,3.5]]))\n", "print(clf.predict_proba([[2,3.5]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5- Evaluar el modelo aprendido" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Se predice utilizando el conjunto de test y el de entrenamiento" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = clf.predict(X_test)\n", "y_pred_train = clf.predict(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Se calculan las matrices de confusión (test y entrenamiento)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# from sklearn.metrics import confusion_matrix\n", "\n", "cnf_matrix = confusion_matrix(y_test, y_pred)\n", "cnf_matrix_train = confusion_matrix(y_train, y_pred_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Se muestran las matrices de confusión utilizando las función auxiliar definida más arriba" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot non-normalized confusion matrix\n", "plt.figure(figsize = (15,5))\n", "plt.subplot(1,2,1)\n", "plot_confusion_matrix(cnf_matrix_train, classes=db.target_names,\n", " title='Confusion matrix, train')\n", "\n", "plt.subplot(1,2,2)\n", "plot_confusion_matrix(cnf_matrix, classes=db.target_names,\n", " title='Confusion matrix, test')\n", "\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Se calcula el accuracy para el conjunto de entrenamiento y test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('accuracy train = %f' % np.mean(y_train == y_pred_train))\n", "print('accuracy test = %f' % np.mean(y_test == y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otra forma de hacerlo, usando el módulo metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "print('accuracy train = %f ' % accuracy_score(y_train, y_pred_train))\n", "print('accuracy test = %f ' % accuracy_score(y_test, y_pred))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }