{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AA-UTE 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Práctica 2 - Clasificadores no paramétricos e Hiperparámetros\n", "\n", "### Objetivos:\n", "En esta práctica se entra en detalle en los clasificadores de `kNN` (k-vecinos más cercanos) y `árboles de decisión`. Dichos clasificadores tienen la particularidad de tener un interpretabilidad simple y son fáciles de visualizar.\n", "\n", "También se tendrá un manejo de los `hiperparámetros` para cada modelo. A su vez, se presentan maneras de evaluar la elección de los mismos.\n", "\n", "Para varias gráficas de la práctica se utilizan funciones auxiliares cargados desde un archivo python _utils.py_, ubicado en el mismo directorio de éste notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_moons, make_classification\n", "from sklearn.model_selection import train_test_split \n", "from sklearn.metrics import accuracy_score " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# kNN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Definición de gráfica para un clasificador kNN\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from utils import plot2D_knn_results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cantidad de muestras a usar\n", "n_samples = 300\n", "\n", "# Generación de datos para clasificación\n", "X, y = make_classification(n_samples=n_samples, n_features=2, n_informative=2, \n", " n_redundant=0, n_classes=4, n_clusters_per_class=1, \n", " random_state=42, weights=[0.4, 0.2, 0.15, 0.5])\n", "\n", "# Dividir en conjuntos train/test\n", "X_train, 
X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize the dataset\n", "plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')\n", "plt.title('Feature space')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the decision regions for $k=1$ and $k=5$. Answer:\n", "1. How do the regions differ?\n", "1. Which one seems to be the right one?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the value of k\n", "k = 1\n", "\n", "# Initialize the classifier with the value of k\n", "clf_knn = KNeighborsClassifier(n_neighbors=k) # COMPLETE\n", "clf_knn.fit(X_train, y_train)\n", "\n", "plot2D_knn_results(clf_knn, X_train, y_train, ['Feature 1','Feature 2'], ['','','',''], test_point=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For $k=\\{1,5\\}$, see which neighbors decide the classification of a point from the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the value of k\n", "k = 1\n", "\n", "# Set the index of the test point\n", "index = 0 # uncomment to use a fixed index (e.g. 0, 22, 43, 51)\n", "# index = np.random.randint(0,len(y_test)) # uncomment to use a random index\n", "\n", "clf_knn = KNeighborsClassifier(n_neighbors=k) # COMPLETE\n", "clf_knn.fit(X_train, y_train)\n", "\n", "dato_test = X_test[index]\n", "\n", "print('Predicted class of the test point:', clf_knn.predict([X_test[index]]))\n", "\n", "plot2D_knn_results(clf_knn, X_train, y_train, ['Feature 1','Feature 2'], ['0','1','2','3'], test_point=dato_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a **$k$ _vs_ $accuracy$ curve** for the training and test sets, with $k=\\{1,2,...,29,30\\}$. Which value of the _**hyperparameter**_ $k$ generalizes best?\n", "\n", "Use the following synthetic dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate synthetic data\n", "X, y = make_moons(n_samples=300, noise=0.4, random_state=42)\n", "\n", "# Visualize the dataset\n", "plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')\n", "plt.title('Feature space')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.show()\n", "\n", "# Split into train and test\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Complete the following using the values $k=\\{1,2,...,29,30\\}$, storing the train and test _accuracy_ for each value of $k$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set k = 1,2,...,30 for training\n", "ks = np.arange(1,31) # COMPLETE\n", "\n", "# Initialize arrays to store the accuracy values\n", "ks_train_acc = np.zeros(len(ks))\n", "ks_test_acc = np.zeros(len(ks))\n", "\n", "# Index for subplots\n", "indx = 0\n", "# Initialize figure\n", "plt.figure(figsize=(15,10))\n", "\n", "# Iterate over each value of k\n", "for indice, k in enumerate(ks):\n", "\n", "    
# Train the classifier with k neighbors\n", "    knn = KNeighborsClassifier(n_neighbors=k) # COMPLETE\n", "    knn.fit(X_train, y_train)\n", "\n", "    # Evaluate accuracy on train and store it\n", "    train_accuracy = accuracy_score(y_train, knn.predict(X_train)) # COMPLETE\n", "    ks_train_acc[indice] = train_accuracy # COMPLETE\n", "\n", "    # Evaluate accuracy on test and store it\n", "    test_accuracy = accuracy_score(y_test, knn.predict(X_test)) # COMPLETE\n", "    ks_test_acc[indice] = test_accuracy # COMPLETE\n", "\n", "    # Plot decision regions\n", "    if k in [1,2,5,10,20,30]:\n", "        indx += 1\n", "        plt.subplot(2,3,indx)\n", "        plt.title(f\"KNN (k={k}) - Train Acc: {train_accuracy:.2f}, Test Acc: {test_accuracy:.2f}\")\n", "        xx, yy = np.meshgrid(np.linspace(X[:,0].min(), X[:,0].max(), 150), np.linspace(X[:,1].min(), X[:,1].max(), 150))\n", "        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n", "        plt.contourf(xx, yy, Z, alpha=0.3)\n", "        plt.scatter(X_train[:, 0], X_train[:, 1], s=50, marker='*', c=y_train, edgecolors='k', label='Train points')\n", "        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', label='Test points')\n", "        plt.legend()\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "# Plot the k vs accuracy curve\n", "plt.figure(figsize=(8,3))\n", "plt.plot(ks, ks_train_acc, label='Train accuracy')\n", "plt.plot(ks, ks_test_acc, label='Test accuracy')\n", "plt.ylabel('Accuracy')\n", "plt.xlabel('k (number of neighbors)')\n", "plt.legend()\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXPERIMENT:**\n", "\n", "Look into the classifier's $weights$ parameter. Change its default value to _\"distance\"_ and observe whether the accuracy curve changes. 
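\n", "\n", "A minimal, self-contained sketch of this experiment (illustrative only: it re-creates the moons data and split used above and compares the two weighting schemes at one fixed $k$; the variable names are made up here):\n", "\n", "```python\n", "from sklearn.datasets import make_moons\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Same data and split as in the k vs accuracy cells\n", "X, y = make_moons(n_samples=300, noise=0.4, random_state=42)\n", "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=30)\n", "\n", "# Compare uniform and distance weighting at a fixed k\n", "for w in ['uniform', 'distance']:\n", "    knn = KNeighborsClassifier(n_neighbors=15, weights=w).fit(X_tr, y_tr)\n", "    print(f\"weights={w}: train={knn.score(X_tr, y_tr):.3f}, test={knn.score(X_te, y_te):.3f}\")\n", "```\n", "\n", "One thing to expect: with `weights='distance'` the train accuracy becomes trivially 1.0, because each training point is its own neighbor at distance zero, so only the test curve remains informative for choosing $k$.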
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Árboles de decisión" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from utils import display_tree\n", "import os\n", "IMAGES_PATH = './P2_outputs'\n", "# Crear carpeta con imágenes a guardar\n", "if not os.path.exists(IMAGES_PATH):\n", " os.mkdir(IMAGES_PATH)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generar datos para clasificar" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_samples = 1000\n", "# Generación de datos para clasificación\n", "X, y = make_classification(n_samples=n_samples, n_features=2, n_informative=2, \n", " n_redundant=0, n_classes=4, n_clusters_per_class=1, \n", " random_state=42, weights=[0.4, 0.2, 0.15, 0.5])\n", "\n", "# Visualizar conjunto de datos\n", "plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')\n", "plt.title('Espacio de características')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.show()\n", "\n", "# Separar train y test\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entrenar un árbol de decisión sin restricciones" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree = DecisionTreeClassifier(random_state=42)\n", "tree.fit(X_train,y_train)\n", "\n", "score_train = tree.score(X_train, y_train)\n", "print('Aciertos train:',score_train)\n", "score_test = tree.score(X_test, y_test)\n", "print('Aciertos test: ',score_test)\n", "\n", "# Desplegar forma del árbol\n", "display_tree(tree, IMAGES_PATH, 'arbol.dot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**PREGUNTAS:**\n", "\n", "1. ¿Se da una situación de subajuste o sobreajuste? ¿Por qué?\n", "1. ¿Qué profundidad tiene el árbol? 
(check with the `get_depth()` method)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train a decision tree with the depth restricted to 5 levels\n", "\n", "_Note: look into the `max_depth` hyperparameter in [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree_max_depth = DecisionTreeClassifier(max_depth=5) # COMPLETE\n", "tree_max_depth.fit(X_train, y_train)\n", "\n", "score_train = tree_max_depth.score(X_train, y_train)\n", "print('Train accuracy:', score_train)\n", "score_test = tree_max_depth.score(X_test, y_test)\n", "print('Test accuracy: ', score_test)\n", "\n", "display_tree(tree_max_depth, IMAGES_PATH, 'arbol_max_depth.dot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train a decision tree restricting the number of samples per leaf, so that each leaf holds at least 7 samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree_min_samp = DecisionTreeClassifier(min_samples_leaf=7) # COMPLETE\n", "tree_min_samp.fit(X_train, y_train)\n", "\n", "score_train = tree_min_samp.score(X_train, y_train)\n", "print('Train accuracy:', score_train)\n", "score_test = tree_min_samp.score(X_test, y_test)\n", "print('Test accuracy: ', score_test)\n", "\n", "display_tree(tree_min_samp, IMAGES_PATH, 'arbol_min_samples_leaf.dot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ANSWER:**\n", "\n", "Which model performs best? Does that mean it is the better model? 
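\n", "\n", "As a self-contained side-by-side sketch (illustrative only: for brevity it uses the two-moons data from the kNN section rather than the 4-class dataset above, so the numbers will not match the cells above):\n", "\n", "```python\n", "from sklearn.datasets import make_moons\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "X, y = make_moons(n_samples=300, noise=0.4, random_state=42)\n", "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=30)\n", "\n", "# One tree per restriction strategy\n", "configs = {'unrestricted': {}, 'max_depth=5': {'max_depth': 5}, 'min_samples_leaf=7': {'min_samples_leaf': 7}}\n", "for name, params in configs.items():\n", "    clf = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)\n", "    print(f\"{name}: train={clf.score(X_tr, y_tr):.2f}, test={clf.score(X_te, y_te):.2f}, depth={clf.get_depth()}\")\n", "```\n", "\n", "The unrestricted tree reaches a train accuracy of 1.0 while scoring lower on test, which is why train accuracy alone cannot decide which model is better.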
\n", "\n", "Discutir si con estas evaluaciones es suficiente para decidir cuál es mejor.\n", "\n", "_Respuesta:_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Completar la siguiente celda que utiliza _**valdación cruzada**_ sobre cada una de las opciones anteriores" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "from utils import plot_classifier_regions\n", "\n", "# Completar función para desplegar scores de validación cruzada\n", "def display_cv_scores(clf, X, y, cv_num=10, scoring=\"accuracy\", extra_string=None):\n", " \"\"\" \n", " Función que entrena un clasificador por validación cruzada e imprime\n", " la media y varianza en los conjuntos de validación.\n", "\n", " Entradas:\n", " - clf: clasificador de sklearn\n", " - X: conjunto de datos\n", " - y: etiquetas/valores ground-truth\n", " - cv_num (int) : cantidad de veces para hacer validación cruzada\n", " - scoring (str): string que indica qué score utilizar\n", " - extra_string (str): (opcional) string para desplegar que sirva para identificar el experimento hecho\n", " \"\"\"\n", " \n", " scores = cross_val_score(clf, X, y, scoring=scoring, cv=cv_num) # COMPLETAR scores obtenidos con cross_val_score\n", " \n", " cv_mean = scores.mean() # COMPLERTAR media de scores\n", " cv_std = scores.std() # COMPLERTAR desviación estándar de scores\n", "\n", " print(f\"CV Mean {extra_string}: {cv_mean:.3f}\")\n", " print(f\"CV Standard deviation {extra_string}: {cv_std:.3f}\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_num = 10 # número de divisiones para validación cruzada\n", "\n", "# Arbol sin restricciones\n", "display_cv_scores(tree, X_train, y_train, cv_num=cv_num, extra_string='sin restricciones')\n", "plot_classifier_regions(tree, \n", " 
X_train, y_train,\n", "                        X_test, y_test\n", "                        )\n", "print('----------------------------------------')\n", "\n", "# Tree with restricted maximum depth\n", "display_cv_scores(tree_max_depth, X_train, y_train, cv_num=cv_num, extra_string='restricting depth')\n", "plot_classifier_regions(tree_max_depth, \n", "                        X_train, y_train,\n", "                        X_test, y_test\n", "                        )\n", "print('----------------------------------------')\n", "\n", "# Tree with a minimum number of samples per leaf\n", "display_cv_scores(tree_min_samp, X_train, y_train, cv_num=cv_num, extra_string='restricting samples per leaf')\n", "plot_classifier_regions(tree_min_samp, \n", "                        X_train, y_train,\n", "                        X_test, y_test\n", "                        )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXPERIMENT:**\n", "\n", "Run the cross-validation analysis for another decision-tree hyperparameter (see the _parameters_ section of [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)).\n", "\n", "Discuss the results in terms of bias and variance. Display the shape of the tree and plot the decision regions for this last model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "aaute", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }