{ "cells": [ { "cell_type": "markdown", "id": "2e5de355-6204-4229-b741-731d6e43938f", "metadata": { "id": "2e5de355-6204-4229-b741-731d6e43938f" }, "source": [ "# Practical 3 - Bias & Variance" ] }, { "cell_type": "markdown", "id": "3993bd49-18ac-411a-a57b-c9af7e0bc77d", "metadata": { "id": "3993bd49-18ac-411a-a57b-c9af7e0bc77d" }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "6632abaa-18c4-417b-a987-386d7fa0e2a2", "metadata": { "id": "6632abaa-18c4-417b-a987-386d7fa0e2a2" }, "outputs": [], "source": [ "import numpy as np\n", "np.set_printoptions(formatter={'float': lambda x: \"{0:0.5f}\".format(x)})\n", "\n", "# Visualizations\n", "import matplotlib.pyplot as plt\n", "\n", "# Machine learning\n", "import sklearn\n", "\n", "# Regression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Preprocessing\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Metrics\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Data analysis and manipulation\n", "import pandas as pd\n", "pd.set_option('display.precision', 5) # 5 decimal places" ] },
{ "cell_type": "markdown", "id": "e929773c-0cd2-438c-9f1a-67b0ae09e329", "metadata": { "id": "e929773c-0cd2-438c-9f1a-67b0ae09e329" }, "source": [ "## True MSE vs. test MSE" ] }, { "cell_type": "markdown", "id": "38a0e65c", "metadata": { "id": "38a0e65c" }, "source": [ "### True regression function" ] }, { "cell_type": "markdown", "id": "e7e99f7a-b39e-4803-8c81-6b9dab04afbc", "metadata": { "id": "e7e99f7a-b39e-4803-8c81-6b9dab04afbc" }, "source": [ "We will work with the example of the mean yield of a potato crop as a function of rainfall.\n", "\n", "We use rainfall data normalized to $[0,1]$.\n", "\n", "Suppose that the true regression function (mean yield) is $f:[0,1]\\to\\mathbb{R}$, given by:\n", "\n", "$$f(x) = -40.425 x^2 + 58.450 x +9.175$$\n", "\n", "The individual yield is therefore given by the model\n", "\n", "$$y = f(x) + \\epsilon$$\n", "\n", "where $\\epsilon$ denotes noise, which we assume to be normally distributed with mean 0 and variance 2.\n", "\n", "We assume that $x$ is uniformly distributed on $[0,1]$.\n", "\n", "In real practice these assumptions are unknown, but here they will let us evaluate model selection techniques." 
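, "\n", "\n", "As a quick consequence of these assumptions (a short derivation, not an additional assumption): conditionally on $x$, the yield is normal with mean $f(x)$ and variance $2$,\n", "\n", "$$y \\mid x \\sim \\mathcal{N}\\big(f(x),\\, 2\\big), \\qquad \\mathbb{E}[y \\mid x] = f(x), \\qquad \\text{Var}(y \\mid x) = 2,$$\n", "\n", "so no predictor can reach a true MSE below $2$, that is, a root MSE below $\\sqrt{2}\\approx 1.414$. At the endpoints of the interval, $f(0) = 9.175$ and $f(1) = -40.425 + 58.450 + 9.175 = 27.200$."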
] }, { "cell_type": "code", "execution_count": null, "id": "1674e3ba-0502-4d06-8137-b942aaebe2ee", "metadata": { "id": "1674e3ba-0502-4d06-8137-b942aaebe2ee" }, "outputs": [], "source": [ "# True function\n", "A = -40.425\n", "B = 58.45\n", "C = 9.175\n", "\n", "def f(x):\n", "    return A*(x**2) + B*x + C" ] }, { "cell_type": "code", "execution_count": null, "id": "fdcc0c10-5448-4dc8-a249-73c97a3c8523", "metadata": { "id": "fdcc0c10-5448-4dc8-a249-73c97a3c8523" }, "outputs": [], "source": [ "# Standard deviation of the noise (the irreducible error is sigma_epsilon**2 = 2)\n", "sigma_epsilon = np.sqrt(2)" ] }, { "cell_type": "markdown", "id": "29c1a4c1-62ce-4fd1-b835-049d452f846d", "metadata": { "id": "29c1a4c1-62ce-4fd1-b835-049d452f846d" }, "source": [ "### Original data" ] }, { "cell_type": "markdown", "id": "d49407b4-34a5-4ebf-bc4e-f15c5b84a489", "metadata": { "id": "d49407b4-34a5-4ebf-bc4e-f15c5b84a489" }, "source": [ "We also assume that we have the following data:" ] }, { "cell_type": "code", "source": [ "# If you are using Google Colab, uncomment the following lines and select the files Papas.csv and Test.csv\n", "# from google.colab import files\n", "# uploaded = files.upload()" ], "metadata": { "id": "_nYHOnidT-JP" }, "id": "_nYHOnidT-JP", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "b848f5e3-cab2-4187-9216-b5026b4d1db1", "metadata": { "id": "b848f5e3-cab2-4187-9216-b5026b4d1db1" }, "outputs": [], "source": [ "# Our base dataset S (relative path, so it works both locally and in Colab's working directory)\n", "papas = pd.read_csv('Papas.csv')\n", "papas" ] }, { "cell_type": "code", "execution_count": null, "id": "5e3dfbd3-0a5d-4c5e-9edf-04c3b392eb26", "metadata": { "id": "5e3dfbd3-0a5d-4c5e-9edf-04c3b392eb26" }, "outputs": [], "source": [ "# Rescale Lluvia (rainfall) to reduce numerical errors\n", "papas['Lluvia'] = (papas['Lluvia']-50)/350" ] }, { "cell_type": "code", "execution_count": null, "id": "c0d62345-46e6-416e-be5c-7363b36976f9", "metadata": { "id": "c0d62345-46e6-416e-be5c-7363b36976f9" }, "outputs": 
[], "source": [ "plt.figure(figsize=(8, 4))\n", "x_range = np.linspace(0, 1, 1000)\n", "plt.plot(x_range, f(x_range), 'r', linewidth=3.0)\n", "\n", "plt.scatter(x = 'Lluvia',\n", "            y = 'Rendimiento',\n", "            data=papas\n", "            )\n", "plt.xlabel('Rainfall')\n", "plt.ylabel('Yield')\n", "plt.title(r'Dataset $S$ and true function $f(x)$')\n", "plt.legend([r'$f(x)$', r'$S$'])\n", "plt.grid(True)\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "fabeff1d-4bd9-4615-9650-71f3e38e2483", "metadata": { "id": "fabeff1d-4bd9-4615-9650-71f3e38e2483" }, "outputs": [], "source": [ "# Convert to NumPy arrays to work with sklearn\n", "X = np.array(papas['Lluvia']).reshape(-1, 1)\n", "y = np.array(papas['Rendimiento']).reshape(-1, 1)" ] }, { "cell_type": "markdown", "id": "bfbe79a8-9664-4b78-bd40-3ce25202ecf4", "metadata": { "id": "bfbe79a8-9664-4b78-bd40-3ce25202ecf4" }, "source": [ "### Test data" ] }, { "cell_type": "markdown", "id": "4128bd13-6ea2-46f3-bf85-2697f18d1a7a", "metadata": { "id": "4128bd13-6ea2-46f3-bf85-2697f18d1a7a" }, "source": [ "We also assume that we have the following test data:" ] }, { "cell_type": "code", "execution_count": null, "id": "042b85c4-89bc-4a0a-b0d6-eb09ba40e753", "metadata": { "id": "042b85c4-89bc-4a0a-b0d6-eb09ba40e753" }, "outputs": [], "source": [ "# Our test dataset Stest\n", "test = pd.read_csv('Test.csv')\n", "test" ] }, { "cell_type": "code", "execution_count": null, "id": "4c75a800-80b8-4f67-8091-f31a6b1d6dab", "metadata": { "id": "4c75a800-80b8-4f67-8091-f31a6b1d6dab" }, "outputs": [], "source": [ "# Rescale Lluvia (rainfall) with the same transformation used for S\n", "test['Lluvia'] = (test['Lluvia']-50)/350\n", "test" ] },
{ "cell_type": "code", "execution_count": null, "id": "b1e6e083-6ced-4d65-b525-0451b18eeced", "metadata": { "id": "b1e6e083-6ced-4d65-b525-0451b18eeced" }, "outputs": [], "source": [ "plt.figure(figsize=(8, 4))\n", "x_range = np.linspace(0, 1, 1000)\n", "plt.plot(x_range, f(x_range), 'r', linewidth=3.0)\n", "\n", "plt.scatter(x = 'Lluvia',\n", "            y = 'Rendimiento',\n", "            data=papas\n", "            )\n", "plt.scatter(x = 'Lluvia',\n", "            y = 'Rendimiento',\n", "            data=test\n", "            )\n", "plt.xlabel('Rainfall')\n", "plt.ylabel('Yield')\n", "plt.title(r'Datasets $S$, $S_{test}$ and true function $f(x)$')\n", "plt.legend([r'$f(x)$', r'$S$',r'$S_{test}$'])\n", "plt.grid(True)\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "f20c330c-9e39-4895-82e7-1f5be2da875c", "metadata": { "id": "f20c330c-9e39-4895-82e7-1f5be2da875c" }, "outputs": [], "source": [ "# Convert to NumPy arrays to work with sklearn\n", "X_test = np.array(test['Lluvia']).reshape(-1, 1)\n", "y_test = np.array(test['Rendimiento']).reshape(-1, 1)" ] }, { "cell_type": "markdown", "id": "fc32581c-ac76-499e-abc4-446bc29f86c4", "metadata": { "id": "fc32581c-ac76-499e-abc4-446bc29f86c4" }, "source": [ "### Simple linear regression" ] }, { "cell_type": "markdown", "id": "4a7742b8-3234-4b31-89f7-d480409b39c7", "metadata": { "id": "4a7742b8-3234-4b31-89f7-d480409b39c7" }, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": null, "id": "6a1306da-956d-4f81-bccb-8ae86d38eaa8", "metadata": { "id": "6a1306da-956d-4f81-bccb-8ae86d38eaa8" }, "outputs": [], "source": [ "lin_reg = LinearRegression()\n", "lin_reg.fit(X, y)" ] }, { "cell_type": "code", "execution_count": null, "id": "28cdc861-e0fe-41c4-b29e-829fbac4ae4d", "metadata": { "id": "28cdc861-e0fe-41c4-b29e-829fbac4ae4d" }, "outputs": [], "source": [ "y_hat_S = lin_reg.predict(X)" ] }, { "cell_type": "code", "execution_count": null, "id": "297bd225-98ab-42cd-9695-c857d1f49bee", "metadata": { "id": "297bd225-98ab-42cd-9695-c857d1f49bee" }, "outputs": [], "source": [ "# RMSE on the training set S\n", "RootMSE_S = np.sqrt(mean_squared_error(y,y_hat_S))\n", "RootMSE_S" ] },
{ "cell_type": "markdown", "id": "bc23e316-ee17-4d8f-a72d-662d53875ef6", "metadata": { "id": "bc23e316-ee17-4d8f-a72d-662d53875ef6" }, "source": [ "#### Plot of the fitted model" ] }, { "cell_type": "code", "execution_count": null, "id": "12b3ed56-a5cc-4f53-ae4a-5b517f8536dc", "metadata": { "id": "12b3ed56-a5cc-4f53-ae4a-5b517f8536dc" }, "outputs": [], "source": [ "plt.figure(figsize=(8, 4))\n", "\n", "x_range = np.linspace(0, 1, 1000)\n", "X_range = x_range.reshape(-1, 1)\n", "# Predictions on the plotting grid (kept separate from the training predictions y_hat_S)\n", "y_hat_range = lin_reg.predict(X_range)\n", "\n", "plt.plot(x_range, f(x_range), 'r', linewidth=3.0)\n", "plt.plot(X_range, y_hat_range, 'tab:blue' , linewidth=3.0)\n", "\n", "plt.scatter(x = 'Lluvia',\n", "            y = 'Rendimiento',\n", "            data=papas\n", "            )\n", "plt.scatter(x = 'Lluvia',\n", "            y = 'Rendimiento',\n", "            data=test\n", "            )\n", "plt.xlabel('Rainfall')\n", "plt.ylabel('Yield')\n", "plt.title(r'Datasets $S$, $S_{test}$, true $f(x)$ and model $\\widehat{y}_S=f_{S}(x)$')\n", "plt.legend([r'$f(x)$', r'$f_S(x)$', r'$S$',r'$S_{test}$'])\n", "plt.grid(True)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "053b310b-37cf-4fe7-bc80-5314cbcd8adb", "metadata": { "id": "053b310b-37cf-4fe7-bc80-5314cbcd8adb" }, "source": [ "#### True MSE" ] }, { "cell_type": "markdown", "id": "30b97533-41d7-4bdd-b11d-bdc87df9cb50", "metadata": { "id": "30b97533-41d7-4bdd-b11d-bdc87df9cb50" }, "source": [ "Since we know the distribution $D$ of the data, we can compute the true error\n", "\n", "$$\\text{MSE}(f_S)=\\mathbb{E}_{(x,y)\\sim D}\\Big[(y-f_S(x))^2\\Big]$$\n", "\n", "This is impossible in real-world practice." 
] }, { "cell_type": "code", "execution_count": null, "id": "20b36665-858a-41a3-8a9e-198bf61b64ca", "metadata": { "id": "20b36665-858a-41a3-8a9e-198bf61b64ca" }, "outputs": [], "source": [ "# Coefficients of f_S\n", "w0 = lin_reg.intercept_[0]\n", "w1 = lin_reg.coef_[0][0]\n", "print(w0,w1)" ] }, { "cell_type": "markdown", "id": "02611408-1ed9-474c-9ad8-c9c9981e58d8", "metadata": { "id": "02611408-1ed9-474c-9ad8-c9c9981e58d8" }, "source": [ "Let $f(x)=Ax^2+Bx + C$ denote the true regression function.\n", "\n", "The true error of the model $f_S(x)=w_0+w_1 x$ is\n", "\n", "$$\n", "\\begin{aligned}\n", "\\text{MSE}(f_S)\n", "& = \\mathbb{E}_{(x,y)\\sim D}\\Big[(y-f_S(x))^2\\Big]\\\\\n", "& = \\mathbb{E}_{(x,\\epsilon)}\\Big[(f(x)+\\epsilon-f_S(x))^2\\Big]\\\\\n", "& = \\mathbb{E}_{(x,\\epsilon)}\\Big[(Ax^2 + Bx + C +\\epsilon-w_0-w_1 x)^2\\Big]\n", "\\end{aligned}\n", "$$\n", "\n", "Collecting the coefficients of each power of $x$, expanding the square, and using that $\\epsilon$ is independent of $x$ with mean zero and variance $\\sigma^2 = 2$ (so the cross terms with $\\epsilon$ vanish), we arrive at the expression\n", "\n", "$$\n", "\\begin{aligned}\n", "\\text{MSE}(f_S)\n", "& =\n", "\\sigma^2+A^2\\mathbb{E}[x^4]\\\\\n", "& + 2 A (B-w_1) \\mathbb{E}[x^3]\\\\\n", "& + ((B-w_1)^2+2 A (C-w_0))\\mathbb{E}[x^2]\\\\\n", "& + 2 (B-w_1)(C-w_0)\\mathbb{E}[x]\\\\\n", "& + (C-w_0)^2\n", "\\end{aligned}\n", "$$\n", "\n", "in which the moments of the uniform distribution on $[0,1]$ appear." ] }, { "cell_type": "markdown", "id": "cca76d4d", "metadata": { "id": "cca76d4d" }, "source": [ "The moments of the uniform distribution on $[a,b]$ are given by:\n", "\n", "$$\\mathbb{E}[x^n]=\\frac{b^{n+1}-a^{n+1}}{(n+1)(b-a)}$$\n", "\n", "so for the uniform distribution on $[0,1]$ we have $\\mathbb{E}[x^n]=1/(n+1)$." 
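, "\n", "\n", "As a worked form of the same expression, substituting $\\mathbb{E}[x]=1/2$, $\\mathbb{E}[x^2]=1/3$, $\\mathbb{E}[x^3]=1/4$ and $\\mathbb{E}[x^4]=1/5$ into the expansion of $\\text{MSE}(f_S)$ gives (here $\\sigma^2$ is the noise variance, equal to $2$ under the assumptions stated at the start):\n", "\n", "$$\n", "\\begin{aligned}\n", "\\text{MSE}(f_S)\n", "& = \\sigma^2 + \\frac{A^2}{5} + \\frac{A (B-w_1)}{2}\\\\\n", "& + \\frac{(B-w_1)^2+2 A (C-w_0)}{3}\\\\\n", "& + (B-w_1)(C-w_0) + (C-w_0)^2\n", "\\end{aligned}\n", "$$"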
] }, { "cell_type": "code", "execution_count": null, "id": "0e55db84", "metadata": { "id": "0e55db84" }, "outputs": [], "source": [ "# Moments of the uniform distribution on [0, 1]\n", "exp_x = 1/2\n", "exp_x2 = 1/3\n", "exp_x3 = 1/4\n", "exp_x4 = 1/5" ] }, { "cell_type": "code", "execution_count": null, "id": "343f89b4-5534-4787-acce-7dd4a4e43819", "metadata": { "id": "343f89b4-5534-4787-acce-7dd4a4e43819" }, "outputs": [], "source": [ "# One term per power of x in the expansion of the true MSE\n", "e0 = (C-w0)**2\n", "e1 = 2*(B-w1)*(C-w0)*exp_x\n", "e2 = ((B-w1)**2+2*A*(C-w0))*exp_x2\n", "e3 = 2*A*(B-w1)*exp_x3\n", "e4 = (A**2)*exp_x4\n", "\n", "MSE = sigma_epsilon**2 + e0 + e1 + e2 + e3 + e4" ] }, { "cell_type": "code", "execution_count": null, "id": "5d05f760-3182-4e1c-bda9-70647d30295e", "metadata": { "id": "5d05f760-3182-4e1c-bda9-70647d30295e" }, "outputs": [], "source": [ "RootMSE = np.sqrt(MSE)\n", "RootMSE" ] }, { "cell_type": "markdown", "id": "2d210ba1-6f93-4ee7-83bd-3af7b9efc5b1", "metadata": { "id": "2d210ba1-6f93-4ee7-83bd-3af7b9efc5b1" }, "source": [ "#### Test error" ] }, { "cell_type": "markdown", "id": "c8332624-dc97-4646-8d00-2a2dad62bd78", "metadata": { "id": "c8332624-dc97-4646-8d00-2a2dad62bd78" }, "source": [ "Since in practice we do not know the distribution $D$, we estimate the true error with the test error\n", "\n", "$$\\text{MSE}_{S_{test}}(f_S)=\\frac{1}{|S_{test}|}\\sum_{(x,y)\\in S_{test}}(y-f_S(x))^2$$\n", "\n", "Here $S_{test}$ is a sample from $D$ that is independent of $S$." 
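, "\n", "\n", "A short justification of why this is a sensible estimate, sketched under the assumption that the points of $S_{test}$ are i.i.d. draws from $D$ (the symbol $m=|S_{test}|$ is notation introduced only for this step): since $f_S$ does not depend on $S_{test}$, linearity of expectation gives\n", "\n", "$$\\mathbb{E}_{S_{test}}\\Big[\\text{MSE}_{S_{test}}(f_S)\\Big]=\\frac{1}{m}\\sum_{i=1}^{m}\\mathbb{E}_{(x_i,y_i)\\sim D}\\Big[(y_i-f_S(x_i))^2\\Big]=\\text{MSE}(f_S),$$\n", "\n", "so the test error is an unbiased estimate of the true error. The same argument does not apply to the training error, because $f_S$ was fitted to $S$."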
] }, { "cell_type": "code", "execution_count": null, "id": "afab162a-3355-4571-82ec-2e955d1720ad", "metadata": { "id": "afab162a-3355-4571-82ec-2e955d1720ad" }, "outputs": [], "source": [ "# Predictions on the test set\n", "y_hat_Stest = lin_reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "6456338c-be99-40e3-bf46-22b733b588de", "metadata": { "id": "6456338c-be99-40e3-bf46-22b733b588de" }, "outputs": [], "source": [ "# Test error (RMSE)\n", "RootMSE_Stest = np.sqrt(mean_squared_error(y_test,y_hat_Stest))\n", "RootMSE_Stest" ] }, { "cell_type": "markdown", "id": "8fd20aed-9a61-4771-85f4-384b6e8b7a8a", "metadata": { "id": "8fd20aed-9a61-4771-85f4-384b6e8b7a8a" }, "source": [ "### Linear regression with polynomial features" ] }, { "cell_type": "markdown", "id": "ec34163d-7f82-4476-8b71-12fc050ca4bc", "metadata": { "id": "ec34163d-7f82-4476-8b71-12fc050ca4bc" }, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": null, "id": "a4cd6e4d-f017-4fa2-97f1-1b8a4c834682", "metadata": { "id": "a4cd6e4d-f017-4fa2-97f1-1b8a4c834682" }, "outputs": [], "source": [ "# Train polynomial regressions for several degrees\n", "# Store what is needed to predict on the test set later\n", "modelos = []\n", "scalers = []\n", "polys = []\n", "M = []\n", "STD = []\n", "norm_intercepts = []\n", "norm_coefs = []\n", "\n", "# Maximum polynomial degree\n", "grado_max = 6\n", "\n", "for grado in range(1,grado_max+1):\n", "    poly = PolynomialFeatures(degree=grado, include_bias=False)\n", "    poly.fit(X)\n", "    polys.append(poly)\n", "    X_poly = poly.transform(X)\n", "    m = np.mean(X_poly,axis=0)\n", "    std= np.std(X_poly, axis=0)\n", "\n", "    M.append(m)\n", "    STD.append(std)\n", "\n", "    scaler = StandardScaler()\n", "    X_norm = scaler.fit_transform(X_poly)\n", "    scalers.append(scaler)\n", "\n", "    poly_reg = LinearRegression()\n", "    poly_reg.fit(X_norm, y)\n", "    modelos.append(poly_reg)\n", "\n", "    norm_intercepts.append(poly_reg.intercept_[0])\n", "    norm_coefs.append(poly_reg.coef_[0])" ] },
{ "cell_type": "code", "execution_count": null, "id": "78dda6b3-29ab-452a-a37c-c839f91984b7", "metadata": { "id": "78dda6b3-29ab-452a-a37c-c839f91984b7" }, "outputs": [], "source": [ "# Undo the normalization to recover coefficients in the original powers of x\n", "intercepts = []\n", "coefs = []\n", "\n", "for k in range(len(norm_coefs)):\n", "    beta0 = norm_intercepts[k]\n", "    beta = norm_coefs[k]\n", "    m = M[k]\n", "    std = STD[k]\n", "    intercepts.append(beta0 - np.sum(m*beta/std))\n", "    coefs.append(beta/std)" ] }, { "cell_type": "markdown", "id": "049640f0-4c73-4e37-a5ee-d20a0f3898c8", "metadata": { "id": "049640f0-4c73-4e37-a5ee-d20a0f3898c8" }, "source": [ "#### Training, test and true error" ] }, { "cell_type": "code", "execution_count": null, "id": "6469c2c5-3de4-41d6-a56f-57077b6e1c50", "metadata": { "id": "6469c2c5-3de4-41d6-a56f-57077b6e1c50" }, "outputs": [], "source": [ "# Compute the training, test and true MSE in order to plot them\n", "MSEs = []\n", "MSEs_S = []\n", "MSEs_Stest = []\n", "momentos = [1/(n+1) for n in range(2*max(grado_max,2)+1)]\n", "\n", "for k in range(len(modelos)):\n", "    # Coefficients of f_S minus those of f, in increasing powers of x\n", "    w0 = np.array([intercepts[k]])\n", "    w1 = coefs[k]\n", "    w = np.concatenate((w0,w1))\n", "    if len(w) == 2:\n", "        polinomio = np.concatenate((w,np.zeros(1)))-np.array([C,B,A])\n", "    elif len(w) == 3:\n", "        polinomio = w-np.array([C,B,A])\n", "    else:\n", "        polinomio = w-np.concatenate((np.array([C,B,A]),np.zeros(len(w)-3)))\n", "\n", "    # True MSE: sigma^2 + E[(f_S(x) - f(x))^2], using E[x^(i+j)] = 1/(i+j+1)\n", "    terminos = []\n", "    for i in range(len(polinomio)):\n", "        for j in range(len(polinomio)):\n", "            terminos.append(polinomio[i]*polinomio[j]*momentos[i+j])\n", "    MSEs.append(sigma_epsilon**2+sum(terminos))\n", "\n", "    # Model and transformations\n", "    f_S = modelos[k]\n", "    poly = polys[k]\n", "    scaler = scalers[k]\n", "\n", "    # Training MSE\n", "    X_poly = poly.transform(X)\n", "    X_norm = scaler.transform(X_poly)\n", "    y_hat_S = f_S.predict(X_norm)\n", "    MSEs_S.append(mean_squared_error(y,y_hat_S))\n", "\n", "    # Test MSE\n", "    X_test_poly = poly.transform(X_test)\n", "    X_test_norm = scaler.transform(X_test_poly)\n", "    y_hat_Stest = f_S.predict(X_test_norm)\n", "    MSEs_Stest.append(mean_squared_error(y_test,y_hat_Stest))" ] },
{ "cell_type": "code", "execution_count": null, "id": "693b4d1b-4b99-4665-b2c7-76f52c6e31a2", "metadata": { "id": "693b4d1b-4b99-4665-b2c7-76f52c6e31a2", "outputId": "f81eb896-1db3-4eb5-85b5-9d99f1eddf47", "colab": { "base_uri": "https://localhost:8080/", "height": 413 } }, "outputs": [], "source": [ "plt.figure(figsize=(8, 4))\n", "plt.plot(range(1,grado_max+1),np.sqrt(MSEs), '-o')\n", "plt.plot(range(1,grado_max+1),np.sqrt(MSEs_S), '-o')\n", "plt.plot(range(1,grado_max+1),np.sqrt(MSEs_Stest), '-o')\n", "plt.xlabel('Degree')\n", "plt.ylabel('Error')\n", "plt.title(r'Errors of $\\widehat{y}_S=f_{S}(x)$ as a function of the degree')\n", "plt.legend([r'True: Root $MSE(f_S)$',r'Train: Root $MSE_S(f_S)$',r'Test: Root $MSE_{S_{test}}(f_S)$'])\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "b19ed99d", "metadata": { "id": "b19ed99d" }, "source": [ "Note that $MSE$ is well approximated by $MSE_{S_{test}}$." ] }, { "cell_type": "markdown", "id": "8a826054-7639-4050-b7aa-95ed9b326350", "metadata": { "id": "8a826054-7639-4050-b7aa-95ed9b326350" }, "source": [ "## Visualizing bias and variance" ] }, { "cell_type": "code", "execution_count": null, "id": "1a227d91-3619-478d-aa09-ab8108119f64", "metadata": { "id": "1a227d91-3619-478d-aa09-ab8108119f64" }, "outputs": [], "source": [ "# Define our polynomial model with NumPy because it is faster\n", "# w holds the coefficients from highest to lowest degree, as returned by np.polyfit\n", "def h(x, w):\n", "    d = len(w) - 1\n", "    return np.sum(w * np.power(x, np.expand_dims(np.arange(d, -1, -1), 1)).T, 1)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a342bef-52f5-4aff-9f0e-3411a429f83c", "metadata": { "id": "3a342bef-52f5-4aff-9f0e-3411a429f83c" }, "outputs": [], "source": [ "# We will fit many polynomial hypotheses, each on a small simulated dataset of n points\n", "n = 15\n", "\n", "# Range of x values\n", "x_range = np.linspace(0, 1, 1000)\n", "\n", "# Polynomial degrees\n", "d_arr = range(1,6)\n", "\n", "# Number R of datasets\n", "R = 10000" ] },
{ "cell_type": "code", "execution_count": null, "id": "2b1f6b4c-3d17-41da-b0aa-fbe3acde08d9", "metadata": { "id": "2b1f6b4c-3d17-41da-b0aa-fbe3acde08d9" }, "outputs": [], "source": [ "# Plot degrees d_arr[0] and d_arr[1]\n", "fig, axs = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(8,8))\n", "\n", "for k in range(2):\n", "    # Iterate over the R datasets\n", "    models = np.zeros((R,len(x_range)))\n", "    for r in range(R):\n", "        # Vector of x values\n", "        x_train = np.random.rand(n)\n", "\n", "        # Noise\n", "        epsilon = sigma_epsilon * np.random.randn(n)\n", "\n", "        # True function + noise\n", "        y_train = f(x_train) + epsilon\n", "\n", "        d = d_arr[k]\n", "        w = np.polyfit(x_train, y_train, d)\n", "        models[r,:] = h(x_range,w)\n", "\n", "    # Plot the true regression function, the mean model and a band of +/- 1 standard deviation\n", "    axs[k].plot(x_range, f(x_range), 'r', linewidth=3.0)\n", "    axs[k].plot(x_range, np.mean(models,axis=0), 'g', linewidth=3.0)\n", "    axs[k].fill_between(\n", "        x_range,\n", "        np.mean(models,axis=0) - np.std(models,axis=0),\n", "        np.mean(models,axis=0) + np.std(models,axis=0),\n", "        alpha=0.2,\n", "        color=\"tab:green\",\n", "        lw=2,\n", "    )\n", "    axs[k].legend([r'$f(x)$',r'$\\mathbb{E}_S(f_S(x))$',r'$\\pm\\sqrt{Var_{S}(f_S(x))}$'])\n", "\n", "    axs[k].grid(True)\n", "    axs[k].title.set_text('d = {}'.format(d_arr[k]))\n", "\n", "plt.xlabel('Rainfall')\n", "plt.ylabel('Yield')\n", "plt.suptitle(r'Bias and Variance')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "c6ef67b9", "metadata": { "id": "c6ef67b9" }, "outputs": [], "source": [ "# Plot degrees d_arr[2] and d_arr[3]\n", "fig, axs = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(8,8))\n", "\n", "for k in range(2):\n", "    # Iterate over the R datasets\n", "    models = np.zeros((R,len(x_range)))\n", "    for r in range(R):\n", "        # Vector of x values\n", "        x_train = np.random.rand(n)\n", "\n", "        # Noise\n", "        epsilon = sigma_epsilon * np.random.randn(n)\n", "\n", "        # True function + noise\n", "        y_train = f(x_train) + epsilon\n", "\n", "        d = d_arr[k+2]\n", "        w = np.polyfit(x_train, y_train, d)\n", "        models[r,:] = h(x_range,w)\n", "\n", "    # Plot the true regression function, the mean model and a band of +/- 1 standard deviation\n", "    axs[k].plot(x_range, f(x_range), 'r', 
linewidth=3.0)\n", "    axs[k].plot(x_range, np.mean(models,axis=0), 'g', linewidth=3.0)\n", "    axs[k].fill_between(\n", "        x_range,\n", "        np.mean(models,axis=0) - np.std(models,axis=0),\n", "        np.mean(models,axis=0) + np.std(models,axis=0),\n", "        alpha=0.2,\n", "        color=\"tab:green\",\n", "        lw=2,\n", "    )\n", "    axs[k].legend([r'$f(x)$',r'$\\mathbb{E}_S(f_S(x))$',r'$\\pm\\sqrt{Var_{S}(f_S(x))}$'])\n", "\n", "    axs[k].grid(True)\n", "    axs[k].title.set_text('d = {}'.format(d_arr[k+2]))\n", "\n", "plt.xlabel('Rainfall')\n", "plt.ylabel('Yield')\n", "plt.suptitle(r'Bias and Variance')\n", "plt.tight_layout()\n", "plt.show()" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 5 }