{"cells":[{"cell_type":"markdown","metadata":{"id":"OOs_c2w5EeiJ"},"source":["# Practical 2 - Linear regression and regularization"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"arB61Ev1EeiV"},"outputs":[],"source":["import numpy as np\n","import matplotlib.pyplot as plt"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DpmLs7bfEeia"},"outputs":[],"source":["# Training data\n","X_train = np.array([206, 188, 219, 372, 345, 231, 203, 170, 55, 91, 292, 141, 129, 170, 324]).reshape(-1, 1)\n","y_train = np.array([29, 25, 31, 25, 29, 30, 26, 23, 12, 15, 28, 24, 23, 22, 30])\n","\n","# Validation data\n","X_val = np.array([213, 80, 391, 250, 57, 303, 263, 157, 72, 157, 188, 216, 362, 283, 308]).reshape(-1, 1)\n","y_val = np.array([30, 16, 25, 26, 9, 28, 28, 25, 13, 23, 26, 25, 28, 33, 30])"]},{"cell_type":"markdown","metadata":{"id":"0UiWzfCpEeib"},"source":["## 1. Linear regression\n"]},{"cell_type":"markdown","source":["Define and train a linear regression model."],"metadata":{"id":"EWcfBnr56gCU"}},{"cell_type":"code","source":["from sklearn.linear_model import LinearRegression"],"metadata":{"id":"UIcQwmAz1Ohw"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dcy54j8_Eeid"},"outputs":[],"source":["# TO COMPLETE"]},{"cell_type":"markdown","source":["Predict on the training and validation sets."],"metadata":{"id":"tYxkiknH6u1D"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"2iufAOhXEeig"},"outputs":[],"source":["y_pred_train = # TO COMPLETE\n","y_pred_val = # TO COMPLETE"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Woj42XUiEeih"},"outputs":[],"source":["# Plot of the training data, the validation data, and the regression line\n","plt.scatter(X_train, y_train, color='blue', label='Training')\n","plt.scatter(X_val, y_val, color='red', label='Validation')\n","plt.plot(X_train, y_pred_train, color='green', label='Regresión 
lineal')\n","plt.legend()\n","plt.xlabel('Rainfall (mm)')\n","plt.ylabel('Yield (ton/ha)')\n","plt.title('Linear Regression and Data')\n","plt.grid(True)\n","plt.show()"]},{"cell_type":"markdown","metadata":{"id":"nLbMhd6UEeij"},"source":["\n","## 2. MSE on training, validation, and CV"]},{"cell_type":"markdown","source":["Compute the MSE on the training and validation sets, and the mean MSE with 5-fold cross-validation."],"metadata":{"id":"dESDmOL470f_"}},{"cell_type":"code","source":["from sklearn.metrics import mean_squared_error\n","from sklearn.model_selection import cross_val_score"],"metadata":{"id":"MTfN4AJf1JFb"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"8F50r_utEeik"},"outputs":[],"source":["mse_train = # TO COMPLETE\n","mse_val = # TO COMPLETE\n","mse_cv = # TO COMPLETE\n","\n","print(f\"Training MSE: {mse_train}\")\n","print(f\"Validation MSE: {mse_val}\")\n","print(f\"CV MSE: {mse_cv}\")"]},{"cell_type":"markdown","metadata":{"id":"e4nYgxEVEeim"},"source":["\n","## 3. Determine the optimal degree\n"]},{"cell_type":"markdown","source":["For each polynomial degree (from 1 to 5), train a linear regression model and compute the training, validation, and CV MSEs. 
Use these results to determine the optimal degree."],"metadata":{"id":"B-by2SrD7iO3"}},{"cell_type":"code","source":["from sklearn.preprocessing import PolynomialFeatures"],"metadata":{"id":"vSHk8bo07cQs"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"X-tc1qLnEein"},"outputs":[],"source":["degrees = 5\n","mse_train_vals = []\n","mse_val_vals = []\n","mse_cv_vals = []\n","\n","for degree in range(1, degrees + 1):\n"," # TO COMPLETE\n","\n"," mse_train = # TO COMPLETE\n"," mse_val = # TO COMPLETE\n"," mse_cv = # TO COMPLETE\n","\n"," mse_train_vals.append(mse_train)\n"," mse_val_vals.append(mse_val)\n"," mse_cv_vals.append(mse_cv)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"E5ckZBFTEeio"},"outputs":[],"source":["# Plot of the MSE as a function of the polynomial degree\n","plt.plot(range(1, degrees + 1), mse_train_vals, label='Training', color='blue', marker ='o', linestyle='-')\n","plt.plot(range(1, degrees + 1), mse_val_vals, label='Validation', color='red', marker ='o', linestyle='-')\n","plt.plot(range(1, degrees + 1), mse_cv_vals, label='CV', color='green', marker ='o', linestyle='-')\n","plt.xlabel('Polynomial Degree')\n","plt.ylabel('MSE')\n","plt.grid(True)\n","plt.xticks(range(1, degrees + 1))\n","plt.title('MSE vs Polynomial Degree')\n","plt.legend()\n","plt.show()"]},{"cell_type":"markdown","metadata":{"id":"sEnYtiHDEeip"},"source":["## Version of part 3 using a pipeline"]},{"cell_type":"markdown","source":["Repeat the previous part using a Pipeline. 
Standardize the data."],"metadata":{"id":"9dJupZob8MmG"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"l0ysvT-hEeip"},"outputs":[],"source":["from sklearn.pipeline import Pipeline\n","from sklearn.preprocessing import StandardScaler"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jZjyySecEeiq"},"outputs":[],"source":["degrees = 5\n","mse_train_vals = []\n","mse_val_vals = []\n","mse_cv_vals = []\n","\n","for degree in range(1, degrees + 1):\n"," # Create the pipeline\n","\n"," # Fit the model\n","\n"," # Predict and compute the errors\n","\n"," mse_train = mean_squared_error(y_train, y_pred_train)\n"," mse_val = mean_squared_error(y_val, y_pred_val)\n"," mse_cv = -np.mean(cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))\n","\n"," mse_train_vals.append(mse_train)\n"," mse_val_vals.append(mse_val)\n"," mse_cv_vals.append(mse_cv)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"02yEuVJtEeir"},"outputs":[],"source":["# Plot of the MSE as a function of the polynomial degree\n","plt.plot(range(1, degrees + 1), mse_train_vals, label='Training', color='blue', marker ='o', linestyle='-')\n","plt.plot(range(1, degrees + 1), mse_val_vals, label='Validation', color='red', marker ='o', linestyle='-')\n","plt.plot(range(1, degrees + 1), mse_cv_vals, label='CV', color='green', marker ='o', linestyle='-')\n","plt.xlabel('Polynomial Degree')\n","plt.ylabel('MSE')\n","plt.title('MSE vs Polynomial Degree')\n","plt.legend()\n","plt.grid(True)\n","plt.show()\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xhwBjVgIEeir"},"outputs":[],"source":["# Plot of the models for each polynomial degree together with the original data\n","plt.scatter(X_train, y_train, color='blue', s=10, label='Training data')\n","plt.scatter(X_val, y_val, color='red', s=10, label='Validation data')\n","\n","# Range of X values used to draw each model's prediction curve\n","X_range = 
np.linspace(X_train.min()-100, X_train.max()+100, 400).reshape(-1, 1)\n","\n","for degree in range(1, degrees + 1):\n"," # Create the pipeline\n","\n"," # Fit the model\n","\n"," # Predict and plot\n"," y_range_pred = pipeline.predict(X_range)\n"," plt.plot(X_range, y_range_pred, label=f'Degree {degree}', linewidth=0.7)\n","\n","plt.xlabel('Rainfall (mm)')\n","plt.ylabel('Yield (ton/ha)')\n","plt.title('Polynomial Models and Data')\n","plt.legend(loc='upper right', fontsize='small')\n","plt.grid(True)\n","plt.tight_layout()\n","plt.show()\n"]},{"cell_type":"markdown","metadata":{"id":"xBxzikFcEeis"},"source":["\n","## 4. Optimal λ for degree-5 polynomial regression with regularization"]},{"cell_type":"markdown","source":["Find the optimal regularization hyperparameter λ for a degree-5 polynomial regression model."],"metadata":{"id":"hzjt5d8n9HUt"}},{"cell_type":"code","source":["from sklearn.linear_model import RidgeCV"],"metadata":{"id":"DwlRnyv31dHP"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"D5u9uR2IEeit"},"outputs":[],"source":["degree = 5\n","alphas = np.logspace(-6, 6, 13)\n","\n","# TO COMPLETE\n","#\n","#\n","#\n","#\n","#\n","#\n","X_train_poly= # TO COMPLETE\n","mse_alphas = # TO COMPLETE\n","optimal_lambda = # TO COMPLETE"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mt0yWufGEeit"},"outputs":[],"source":["# Plot of the MSE as a function of λ\n","plt.semilogx(alphas, mse_alphas, label='Validation', color='red')\n","plt.xlabel('λ')\n","plt.ylabel('MSE')\n","plt.title('MSE vs λ')\n","plt.grid(True)\n","plt.legend()\n","plt.show()"]},{"cell_type":"code","source":["from sklearn.linear_model import Ridge"],"metadata":{"id":"DVmI_CjH9awT"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jgH0Le6wEeiu"},"outputs":[],"source":["# Plot of the best model together with the training and validation data\n","best_ridge = 
Ridge(alpha=optimal_lambda).fit(X_train_poly, y_train)\n","X_range = np.linspace(X_train.min(), X_train.max(), 400).reshape(-1, 1)\n","degree = 5\n","poly = PolynomialFeatures(degree)\n","# PolynomialFeatures must be fitted before it can transform, so use fit_transform\n","X_range_poly = poly.fit_transform(X_range)\n","y_range_pred = best_ridge.predict(X_range_poly)\n","\n","plt.scatter(X_train, y_train, color='blue', label='Training')\n","plt.scatter(X_val, y_val, color='red', label='Validation')\n","plt.plot(X_range, y_range_pred, color='green', label='Regularized Polynomial Regression')\n","plt.legend()\n","plt.xlabel('Rainfall (mm)')\n","plt.ylabel('Yield (ton/ha)')\n","plt.title('Regularized Polynomial Regression and Data')\n","plt.grid(True)\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XlRbHU3kEeiv"},"outputs":[],"source":["# Create and fit the best model with the optimal value of lambda\n","best_pipeline = Pipeline([\n"," ('poly', PolynomialFeatures(degree=5)),\n"," ('ridge', Ridge(alpha=optimal_lambda))\n","])\n","\n","best_pipeline.fit(X_train, y_train)\n","\n","# Predict and compute the errors on training and validation\n","y_pred_train_best = best_pipeline.predict(X_train)\n","y_pred_val_best = best_pipeline.predict(X_val)\n","\n","mse_train_best = mean_squared_error(y_train, y_pred_train_best)\n","mse_val_best = mean_squared_error(y_val, y_pred_val_best)\n","\n","print(f\"Mean squared error (MSE) on training for the best model: {mse_train_best:.3f}\")\n","print(f\"Mean squared error (MSE) on validation for the best model: {mse_val_best:.3f}\")"]},{"cell_type":"markdown","metadata":{"id":"j5zjgWVyEeiy"},"source":["### **Exercise: Regression on the \"Boston Housing\" Dataset**\n","\n","The \"Boston Housing\" dataset is a classic in machine learning and statistics. It contains information collected by the U.S. Census Service about housing in the Boston area. 
The dataset contains 506 observations and 13 attributes that may influence the median value of homes in Boston.\n","\n","**Goal:** Predict the median home value (MEDV) using attributes such as the crime rate (CRIM) and the average number of rooms per dwelling (RM), among others.\n","\n","**Instructions:**\n","\n","1. **Loading and Initial Analysis:** Load the \"Boston Housing\" dataset and perform an initial exploratory analysis.\n","2. **Preprocessing:** Split the dataset into training and validation sets.\n","3. **Linear Model:** Train a linear regression using all features to predict MEDV. Evaluate its performance on the training and validation sets using the mean squared error (MSE).\n","4. **Polynomial Model:** Consider polynomial features up to degree 3. Train the model and evaluate its performance.\n","5. **Regularization:** Use Ridge regression with the polynomial features. Find the optimal value of $\\lambda$ using cross-validation.\n","6. **Plots:**\n"," * Plot the MSE as a function of the polynomial degree.\n"," * Plot the MSE as a function of $\\lambda$ for Ridge.\n"," * Visualize the best polynomial model with and without regularization against the actual data.\n","7. **Conclusions:** Which model would you consider the best? 
How do polynomial features and regularization affect the model's performance?"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dESonOYfEeiz"},"outputs":[],"source":["# If statsmodels is not installed, uncomment the following line\n","#!pip install statsmodels\n","import statsmodels.api as sm\n","import pandas as pd"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"94Rfoa1QEei0"},"outputs":[],"source":["# Load the Boston Housing dataset (R package MASS) via statsmodels\n","data = sm.datasets.get_rdataset('Boston', package='MASS').data\n","\n","# Rename the target column 'medv' to 'MEDV', the name used in the exercise text\n","data = data.rename(columns={'medv': 'MEDV'})\n","\n","# Show the first rows of the DataFrame\n","data.head()\n"]}],"metadata":{"kernelspec":{"display_name":"base","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.15"},"orig_nbformat":4,"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}