{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Parte A [Opcional]\n", "\n", "En esta parte van a implementar un transformer de scikit-learn desde cero.\n", "\n", "El objetivo es entender cómo funcionan y ver en la práctica cómo son los estándares de scikit learn.\n", "\n", "Un transformador es un objeto cuyo principal método es el `transform` que permite aplicar transformaciones sobre las features de entrada." ], "metadata": { "id": "g0NDHbHleKBv" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tYJeFWHVK8a8" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "# Notar que hereda de BaseEstimator y TransformerMixin;\n", "# esto permite tener métodos extra \"gratis\", y garantiza\n", "# que nuestro transformer sea compatible con todo scikit\n", "# con solo implementar fit/transform y seguir los estándares\n", "class CustomStandardScaler(BaseEstimator, TransformerMixin):\n", "\n", " \"\"\"Standardize features by removing the mean and scaling to unit variance.\n", "\n", " This is a custom implementation of sklearn.preprocessing.StandardScaler for\n", " learning purposes only.\n", "\n", " The standard score of a sample `x` is calculated as:\n", "\n", " z = (x - u) / s\n", "\n", " where `u` is the mean of the training samples or zero if `with_mean=False`,\n", " and `s` is the standard deviation of the training samples or one if\n", " `with_std=False`.\n", "\n", " Centering and scaling happen independently on each feature by computing\n", " the relevant statistics on the samples in the training set. Mean and\n", " standard deviation are then stored to be used on later data using\n", " :meth:`transform`.\n", "\n", " Parameters\n", " ----------\n", " with_mean : bool, default=True\n", " If True, center the data before scaling.\n", "\n", " with_std : bool, default=True\n", " If True, scale the data to unit variance (or equivalently,\n", " unit standard deviation).\n", "\n", " Attributes\n", " ----------\n", " mean_ : ndarray of shape (n_features,) or None\n", " The mean value for each feature in the training set.\n", " Equal to ``None`` when ``with_mean=False``.\n", "\n", " std_ : ndarray of shape (n_features,) or None\n", " The variance for each feature in the training set.\n", " Equal to ``None`` when ``with_std=False``.\n", "\n", "\n", " See Also\n", " --------\n", " sklearn.preprocessing.StandardScaler : Original transformer from scikit-learn.\n", "\n", " Examples\n", " --------\n", " >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]\n", " >>> scaler = CustomStandardScaler()\n", " >>> print(scaler.fit(data))\n", " CustomStandardScaler()\n", " >>> print(scaler.mean_)\n", " [0.5 0.5]\n", " >>> print(scaler.transform(data))\n", " [[-1. -1.]\n", " [-1. -1.]\n", " [ 1. 1.]\n", " [ 1. 1.]]\n", " >>> print(scaler.transform([[2, 2]]))\n", " [[3. 
3.]]\n", " \"\"\"\n", " # todos los parametros de entrada deben tener un valor por defecto\n", " # y, todos los parámetros necesarios se pasan en el __init__\n", " def __init__(self, with_mean=True, with_std=True):\n", " super().__init__()\n", " # se debe almacenar todos los paramétros de entrada en un atributo\n", " # de igual nombre al de entrada, y con el mismo valor que dio el usuario\n", " # ejemplo: si tengo un parametro pepe de entrada, lo guardo como self.pepe\n", "\n", " # == su codigo empieza aqui ====\n", " self.with_mean =\n", " self.with_std =\n", " # == su codigo termina aqui ====\n", "\n", " # los atributos calculados los nombre con _ al final\n", " self.mean_ = None\n", " self.std_ = None\n", "\n", " def fit(self, X, y=None):\n", "\n", " # implementar el metodo fit que calcula la media y desviacion\n", " # de los datos en X, y los guarda en self.mean_ y self.std_ respectivamente\n", " # dependiendo de los parametros dados por el usuario\n", "\n", " # == su codigo empieza aqui ====\n", " self.mean_ =\n", " self.std_ =\n", " # == su codigo termina aqui ====\n", "\n", " # IMPORTANTE: el .fit siempre retorna el self\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # implementar el metodo transform que resta la media de self.mean_\n", " # y divide entre la desviacion estandar self.std_\n", " # según los parametros dados por el usuario\n", " # == su codigo empieza aqui ====\n", " # == su codigo termina aqui ====\n", " return X" ] }, { "cell_type": "markdown", "source": [ "En la siguiente celda comparamos el `CustomStandardScaler` implementado anteriormente, con el provisto por scikit, para verificar que nos da lo mismo y nueestra implementación es correcta" ], "metadata": { "id": "KyV8ytO6gOxN" } }, { "cell_type": "code", "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "np.random.seed(0)\n", "X = np.random.normal(5, 2, (200, 8))\n", "\n", "for with_mean in [True, False]:\n", " for with_std in [True, False]:\n", " X_sklearn = StandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)\n", " X_nuestro = CustomStandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)\n", "\n", " assert np.max(np.abs(X_sklearn - X_nuestro))==0, (with_mean, with_std)" ], "metadata": { "id": "-8qPmPyjShId" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Lo siguiente que vamos a hacer es tomar un dataset, iris, y entrenar un clasificador. El objetivo es usar nuestro `CustomStandardScaler` en una grid search más adelante.\n", "\n", "##Preguntas:\n", "En el llamado a `train_test_split`:\n", "- Qué hace el parametro `stratify=y`? Por qué es importante?\n", "- Qué hace el parámetro `shuffle=True`? Por qué es importante?" 
], "metadata": { "id": "h2t7zcJGge6M" } }, { "cell_type": "code", "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "\n", "y = iris.target\n", "X = iris.data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", " test_size=0.1,\n", " random_state=0,\n", " stratify=y,\n", " shuffle=True)\n", "\n", "\n", "X_train.shape, y_train.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kqkdIHsiOE95", "outputId": "8ef04f01-eff0-49b9-d4d6-b796641d8c22" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((135, 4), (135,))" ] }, "metadata": {}, "execution_count": 77 } ] }, { "cell_type": "markdown", "source": [ "Implemente un pipe de clasificación que utilice el `CustomStandardScaler` desarrollado. Ejecute una grid search sobre el pipe que pruebe todas las combinaciones de los parámetros `CustomStandardScaler.with_mean` y `CustomStandardScaler.with_mean` al menos." ], "metadata": { "id": "-6oDbgQyhG4i" } }, { "cell_type": "code", "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "pipe = Pipeline([\n", " (\"scaler\", CustomStandardScaler()),\n", "\n", "])\n", "\n", "gs = GridSearchCV(\n", " pipe,\n", " {\n", " \"scaler__with_mean\": [True, False],\n", " \"scaler__with_std\": [True, False],\n", "\n", " },\n", " cv=6,\n", " n_jobs=-1,\n", " scoring = (\"accuracy\", \"f1_macro\"), # defino todas las que quiero trackear\n", " refit=\"accuracy\" # indico cual es la mas importante para reentrenar el ganador\n", ")\n", "\n", "gs.fit(X_train, y_train)\n", "print(gs.best_params_)\n", "print(gs.best_score_)" ], "metadata": { "id": "v68dx5hOORlm" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Para terminar vamos a levantar todas las ejecuciones en un unico DataFrame de pandas, como una forma rápida de visualización de estos datos" ], "metadata": { "id": "s3th-szjhk0T" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "# habilitamos todas las columnas\n", "pd.set_option('display.max_columns', None)\n", "# levantamos los resultados de la grid search en un dataframe\n", "pd.DataFrame(gs.cv_results_)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "TKQ48IgHPybR", "outputId": "84ded1e6-b3f5-4dfb-fa22-bc33aa76d137" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "0 0.002353 0.000816 0.009110 0.003944 \n", "1 0.001512 0.000131 0.007573 0.004027 \n", "2 0.001884 0.000986 0.005648 0.002699 \n", "3 0.001446 0.000099 0.008071 0.004958 \n", "4 0.002101 0.001313 0.005209 0.000959 \n", "5 0.001772 0.000372 0.006268 0.001582 \n", "6 0.004500 0.003316 0.009153 0.003559 \n", "7 0.001475 0.000125 0.005685 0.001782 \n", "8 0.001441 0.000052 0.004896 0.000112 \n", "9 0.001500 0.000069 0.005179 0.000726 \n", "10 0.001516 0.000066 0.004809 0.000103 \n", "11 0.001328 0.000155 0.004345 0.000738 \n", "\n", " param_clf__n_neighbors param_scaler__with_mean param_scaler__with_std \\\n", "0 5 True True \n", "1 5 True False \n", "2 5 False True \n", "3 5 False False \n", "4 10 True True \n", "5 10 True False \n", "6 10 False True \n", "7 10 False False \n", "8 15 True True \n", "9 15 True False \n", "10 15 False True \n", 
"11 15 False False \n", "\n", " params split0_test_accuracy \\\n", "0 {'clf__n_neighbors': 5, 'scaler__with_mean': T... 1.000000 \n", "1 {'clf__n_neighbors': 5, 'scaler__with_mean': T... 0.956522 \n", "2 {'clf__n_neighbors': 5, 'scaler__with_mean': F... 1.000000 \n", "3 {'clf__n_neighbors': 5, 'scaler__with_mean': F... 0.956522 \n", "4 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 0.956522 \n", "5 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 1.000000 \n", "6 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 0.956522 \n", "7 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 1.000000 \n", "8 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "9 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "10 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "11 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "\n", " split1_test_accuracy split2_test_accuracy split3_test_accuracy \\\n", "0 0.913043 1.000000 0.909091 \n", "1 0.913043 1.000000 0.954545 \n", "2 0.913043 1.000000 0.909091 \n", "3 0.913043 1.000000 0.954545 \n", "4 0.913043 0.956522 0.954545 \n", "5 0.956522 0.956522 0.909091 \n", "6 0.913043 0.956522 0.954545 \n", "7 0.956522 0.956522 0.909091 \n", "8 0.913043 1.000000 0.909091 \n", "9 0.956522 1.000000 0.909091 \n", "10 0.913043 1.000000 0.909091 \n", "11 0.956522 1.000000 0.909091 \n", "\n", " split4_test_accuracy split5_test_accuracy mean_test_accuracy \\\n", "0 1.0 1.0 0.970356 \n", "1 1.0 1.0 0.970685 \n", "2 1.0 1.0 0.970356 \n", "3 1.0 1.0 0.970685 \n", "4 1.0 1.0 0.963439 \n", "5 1.0 1.0 0.970356 \n", "6 1.0 1.0 0.963439 \n", "7 1.0 1.0 0.970356 \n", "8 1.0 1.0 0.970356 \n", "9 1.0 1.0 0.977602 \n", "10 1.0 1.0 0.970356 \n", "11 1.0 1.0 0.977602 \n", "\n", " std_test_accuracy rank_test_accuracy split0_test_f1_macro \\\n", "0 0.041939 5 1.000000 \n", "1 0.032562 3 0.955556 \n", "2 0.041939 5 1.000000 \n", "3 0.032562 3 0.955556 \n", "4 0.029966 11 0.955556 \n", "5 0.033597 5 1.000000 \n", "6 0.029966 11 0.955556 \n", "7 0.033597 5 1.000000 \n", "8 0.041939 5 1.000000 \n", "9 0.034508 1 1.000000 \n", "10 0.041939 5 1.000000 \n", "11 0.034508 1 1.000000 \n", "\n", " split1_test_f1_macro split2_test_f1_macro split3_test_f1_macro \\\n", "0 0.907407 1.000000 0.910714 \n", "1 0.907407 1.000000 0.954751 \n", "2 0.907407 1.000000 0.910714 \n", "3 0.907407 1.000000 0.954751 \n", "4 0.907407 0.954751 0.955556 \n", "5 0.954751 0.954751 0.910714 \n", "6 0.907407 0.954751 0.955556 \n", "7 0.954751 0.954751 0.910714 \n", "8 0.907407 1.000000 0.910714 \n", "9 0.954751 1.000000 0.907407 \n", "10 0.907407 1.000000 0.910714 \n", "11 0.954751 1.000000 0.907407 \n", "\n", " split4_test_f1_macro split5_test_f1_macro mean_test_f1_macro \\\n", "0 1.0 1.0 0.969687 \n", "1 1.0 1.0 0.969619 \n", "2 1.0 1.0 0.969687 \n", "3 1.0 1.0 0.969619 \n", "4 1.0 1.0 0.962212 \n", "5 1.0 1.0 0.970036 \n", "6 1.0 1.0 0.962212 \n", "7 1.0 1.0 0.970036 \n", "8 1.0 1.0 0.969687 \n", "9 1.0 1.0 0.977026 \n", "10 1.0 1.0 0.969687 \n", "11 1.0 1.0 0.977026 \n", "\n", " std_test_f1_macro rank_test_f1_macro \n", "0 0.042880 5 \n", "1 0.034298 9 \n", "2 0.042880 5 \n", "3 0.034298 9 \n", "4 0.031632 11 \n", "5 0.033366 3 \n", "6 0.031632 11 \n", "7 0.033366 3 \n", "8 0.042880 5 \n", "9 0.035247 1 \n", "10 0.042880 5 \n", "11 0.035247 1 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_fit_timestd_fit_timemean_score_timestd_score_timeparam_clf__n_neighborsparam_scaler__with_meanparam_scaler__with_stdparamssplit0_test_accuracysplit1_test_accuracysplit2_test_accuracysplit3_test_accuracysplit4_test_accuracysplit5_test_accuracymean_test_accuracystd_test_accuracyrank_test_accuracysplit0_test_f1_macrosplit1_test_f1_macrosplit2_test_f1_macrosplit3_test_f1_macrosplit4_test_f1_macrosplit5_test_f1_macromean_test_f1_macrostd_test_f1_macrorank_test_f1_macro
00.0023530.0008160.0091100.0039445TrueTrue{'clf__n_neighbors': 5, 'scaler__with_mean': T...1.0000000.9130431.0000000.9090911.01.00.9703560.04193951.0000000.9074071.0000000.9107141.01.00.9696870.0428805
10.0015120.0001310.0075730.0040275TrueFalse{'clf__n_neighbors': 5, 'scaler__with_mean': T...0.9565220.9130431.0000000.9545451.01.00.9706850.03256230.9555560.9074071.0000000.9547511.01.00.9696190.0342989
20.0018840.0009860.0056480.0026995FalseTrue{'clf__n_neighbors': 5, 'scaler__with_mean': F...1.0000000.9130431.0000000.9090911.01.00.9703560.04193951.0000000.9074071.0000000.9107141.01.00.9696870.0428805
30.0014460.0000990.0080710.0049585FalseFalse{'clf__n_neighbors': 5, 'scaler__with_mean': F...0.9565220.9130431.0000000.9545451.01.00.9706850.03256230.9555560.9074071.0000000.9547511.01.00.9696190.0342989
40.0021010.0013130.0052090.00095910TrueTrue{'clf__n_neighbors': 10, 'scaler__with_mean': ...0.9565220.9130430.9565220.9545451.01.00.9634390.029966110.9555560.9074070.9547510.9555561.01.00.9622120.03163211
50.0017720.0003720.0062680.00158210TrueFalse{'clf__n_neighbors': 10, 'scaler__with_mean': ...1.0000000.9565220.9565220.9090911.01.00.9703560.03359751.0000000.9547510.9547510.9107141.01.00.9700360.0333663
60.0045000.0033160.0091530.00355910FalseTrue{'clf__n_neighbors': 10, 'scaler__with_mean': ...0.9565220.9130430.9565220.9545451.01.00.9634390.029966110.9555560.9074070.9547510.9555561.01.00.9622120.03163211
70.0014750.0001250.0056850.00178210FalseFalse{'clf__n_neighbors': 10, 'scaler__with_mean': ...1.0000000.9565220.9565220.9090911.01.00.9703560.03359751.0000000.9547510.9547510.9107141.01.00.9700360.0333663
80.0014410.0000520.0048960.00011215TrueTrue{'clf__n_neighbors': 15, 'scaler__with_mean': ...1.0000000.9130431.0000000.9090911.01.00.9703560.04193951.0000000.9074071.0000000.9107141.01.00.9696870.0428805
90.0015000.0000690.0051790.00072615TrueFalse{'clf__n_neighbors': 15, 'scaler__with_mean': ...1.0000000.9565221.0000000.9090911.01.00.9776020.03450811.0000000.9547511.0000000.9074071.01.00.9770260.0352471
100.0015160.0000660.0048090.00010315FalseTrue{'clf__n_neighbors': 15, 'scaler__with_mean': ...1.0000000.9130431.0000000.9090911.01.00.9703560.04193951.0000000.9074071.0000000.9107141.01.00.9696870.0428805
110.0013280.0001550.0043450.00073815FalseFalse{'clf__n_neighbors': 15, 'scaler__with_mean': ...1.0000000.9565221.0000000.9090911.01.00.9776020.03450811.0000000.9547511.0000000.9074071.01.00.9770260.0352471
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 79 } ] }, { "cell_type": "code", "source": [ "from sklearn.metrics import classification_report\n", "y_pred = gs.predict(X_test)\n", "\n", "print(classification_report(y_test, y_pred))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BWgXdyuuUScw", "outputId": "3bc46983-1d53-463e-ef00-f26ed1545235" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 5\n", " 1 1.00 1.00 1.00 5\n", " 2 1.00 1.00 1.00 5\n", "\n", " accuracy 1.00 15\n", " macro avg 1.00 1.00 1.00 15\n", "weighted avg 1.00 1.00 1.00 15\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "# Parte B\n", "\n", "En esta parte vamos a poner en práctica los conceptos vistos en clase. Vamos a usar el daraset de `california housing`. Este dataset es para estimar precios promedios de casas en California, pero, lo convertiremos en un problema de clasificacion binaria: casas baratas vs casas baratas, en función de si su precio es mayor o no que el promedio de precios del dataset." ], "metadata": { "id": "aICK6TUQiLgm" } }, { "cell_type": "code", "source": [ "print(fetch_california_housing().DESCR)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qXqa_DSFcQrg", "outputId": "3c49794c-ad32-4408-c080-63afc6152276" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ ".. _california_housing_dataset:\n", "\n", "California Housing dataset\n", "--------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", " :Number of Instances: 20640\n", "\n", " :Number of Attributes: 8 numeric, predictive attributes and the target\n", "\n", " :Attribute Information:\n", " - MedInc median income in block group\n", " - HouseAge median house age in block group\n", " - AveRooms average number of rooms per household\n", " - AveBedrms average number of bedrooms per household\n", " - Population block group population\n", " - AveOccup average number of household members\n", " - Latitude block group latitude\n", " - Longitude block group longitude\n", "\n", " :Missing Attribute Values: None\n", "\n", "This dataset was obtained from the StatLib repository.\n", "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n", "\n", "The target variable is the median house value for California districts,\n", "expressed in hundreds of thousands of dollars ($100,000).\n", "\n", "This dataset was derived from the 1990 U.S. census, using one row per census\n", "block group. A block group is the smallest geographical unit for which the U.S.\n", "Census Bureau publishes sample data (a block group typically has a population\n", "of 600 to 3,000 people).\n", "\n", "A household is a group of people residing within a home. Since the average\n", "number of rooms and bedrooms in this dataset are provided per household, these\n", "columns may take surprisingly large values for block groups with few households\n", "and many empty houses, such as vacation resorts.\n", "\n", "It can be downloaded/loaded using the\n", ":func:`sklearn.datasets.fetch_california_housing` function.\n", "\n", ".. topic:: References\n", "\n", " - Pace, R. 
Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n", " Statistics and Probability Letters, 33 (1997) 291-297\n", "\n" ] } ] },
{ "cell_type": "code", "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = fetch_california_housing(return_X_y=True, as_frame=False)\n", "\n", "y_mean = np.mean(y)\n", "\n", "# binarize the target: 0 or 1\n", "y[y<=y_mean] = 0\n", "y[y>y_mean] = 1\n", "y = y.astype(int)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", "                                                    test_size=0.1,\n", "                                                    random_state=0,\n", "                                                    stratify=y,\n", "                                                    shuffle=True)\n", "X_train.shape, y_train.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eyRcP85XYole", "outputId": "3cba3d9b-22fe-41bd-a94a-82d58c28c975" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((18576, 8), (18576,))" ] }, "metadata": {}, "execution_count": 3 } ] },
{ "cell_type": "markdown", "source": [ "We are going to penalize the two kinds of error differently:\n", "- An expensive house classified as cheap will have a cost of 1\n", "- A cheap house classified as expensive will have a cost of 2\n", "\n", "Define the cost matrix as a NumPy array.\n", "\n", "With this matrix, implement `expected_cost_loss`: the expected cost seen in class, i.e. the average cost per sample, $\\\\frac{1}{N}\\\\sum_{i=1}^{N} C[y_i, \\\\hat{y}_i]$." ], "metadata": { "id": "6DxS9LbdjBpZ" } },
{ "cell_type": "code", "source": [ "from sklearn.metrics import confusion_matrix, make_scorer\n", "# (a possible sketch appears a couple of cells below)\n", "\n", "COST_MATRIX = np.array([\n", "    [xx, yy],\n", "    [ww, zz]\n", "])\n", "\n", "assert COST_MATRIX.shape == (2, 2)\n", "\n", "def expected_cost_loss(y_true, y_pred):\n", "    # == your code starts here ====\n", "\n", "    cost =\n", "    # == your code ends here ====\n", "    return cost\n", "\n", "expected_cost_scorer = make_scorer(expected_cost_loss, greater_is_better=False)" ], "metadata": { "id": "qny8-Gccbl18" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "Implement a pipeline to find a probabilistic classifier (i.e., make sure it provides a `predict_proba` method) and its parameters, so as to minimize the expected cost defined above. A possible sketch is shown after the next cell." ], "metadata": { "id": "JkHEd-SOjk8F" } },
{ "cell_type": "code", "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# == your code starts here ====\n", "pipe = Pipeline([\n", "\n", "])\n", "\n", "gs = GridSearchCV(\n", "\n", ")\n", "# == your code ends here ====\n", "gs.fit(X_train, y_train)\n", "print(gs.best_params_)\n", "print(gs.best_score_)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gevd8mera_Gk", "outputId": "e255d744-33dd-4360-8515-8972b52fc6ec" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'scaler__with_mean': True, 'scaler__with_std': True}\n", "-0.2635120585701981\n" ] } ] },
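{ "cell_type": "markdown", "source": [ "The next cell sketches one possible way to fill in the two exercises above. It is only a sketch under explicit assumptions: the row/column convention of the cost matrix (rows = true class, columns = predicted class, with class 0 = cheap and class 1 = expensive) and the choice of `LogisticRegression` as the probabilistic classifier are ours, and the `example_`-prefixed names are hypothetical." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import make_scorer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# rows = true class, columns = predicted class (an assumed convention)\n", "EXAMPLE_COST_MATRIX = np.array([\n", "    [0, 2],  # true cheap: correct = 0, predicted expensive = 2\n", "    [1, 0],  # true expensive: predicted cheap = 1, correct = 0\n", "])\n", "\n", "def example_expected_cost_loss(y_true, y_pred):\n", "    # average cost per sample, indexing the cost matrix by (true, predicted)\n", "    return np.mean(EXAMPLE_COST_MATRIX[np.asarray(y_true), np.asarray(y_pred)])\n", "\n", "example_pipe = Pipeline([\n", "    (\"scaler\", CustomStandardScaler()),\n", "    (\"clf\", LogisticRegression(max_iter=1000)),  # exposes predict_proba\n", "])\n", "\n", "example_gs = GridSearchCV(\n", "    example_pipe,\n", "    {\"scaler__with_mean\": [True, False], \"scaler__with_std\": [True, False]},\n", "    scoring=make_scorer(example_expected_cost_loss, greater_is_better=False),\n", "    cv=5,\n", "    n_jobs=-1,\n", ")" ] },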
], "metadata": { "id": "AZeNhXb5j6i4" } }, { "cell_type": "code", "source": [ "from sklearn.metrics import classification_report\n", "\n", "# == su codigo empieza aqui ====\n", "# == su codigo termina aqui ====" ], "metadata": { "id": "Ge_7ABDJkYf2" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Reportar auc_ROC y auc_PR [opcional: graficarlos]" ], "metadata": { "id": "M_vp1lPd1Zsx" } }, { "cell_type": "code", "source": [ "# == su codigo empieza aqui ====\n", "# == su codigo termina aqui ====" ], "metadata": { "id": "w1tUtE1v1cUX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "\n", "Para el clasificador entrenado, encontrar un threshold que minimice la funcion de costo esperado" ], "metadata": { "id": "WVQyMo7S1MVw" } }, { "cell_type": "code", "source": [ "# == su codigo empieza aqui ====\n", "# == su codigo termina aqui ====" ], "metadata": { "id": "tkiK2GmZknQy" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Su clasificador, esta bien calibrado? Calibrarlo. Mostrar el brier_score_loss antes y después ed calibrarlo. [opcional: mostrar diagramas de calibración antes y después de calibrarlo]" ], "metadata": { "id": "TP_4F7xG1oNR" } }, { "cell_type": "code", "source": [ "# == su codigo empieza aqui ====\n", "# == su codigo termina aqui ====" ], "metadata": { "id": "HuYJHH-C1ytM" }, "execution_count": null, "outputs": [] } ] }