{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Part A [Optional]\n", "\n", "In this part you will implement a scikit-learn transformer from scratch.\n", "\n", "The goal is to understand how transformers work and to see scikit-learn's conventions in practice.\n", "\n", "A transformer is an object whose main method is `transform`, which applies transformations to the input features." ], "metadata": { "id": "g0NDHbHleKBv" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tYJeFWHVK8a8" }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "# Note that the class inherits from BaseEstimator and TransformerMixin;\n", "# this gives us extra methods \"for free\" and makes our transformer\n", "# compatible with the rest of scikit-learn, as long as we\n", "# implement fit/transform and follow its conventions\n", "class CustomStandardScaler(BaseEstimator, TransformerMixin):\n", "\n", " \"\"\"Standardize features by removing the mean and scaling to unit variance.\n", "\n", " This is a custom implementation of sklearn.preprocessing.StandardScaler for\n", " learning purposes only.\n", "\n", " The standard score of a sample `x` is calculated as:\n", "\n", " z = (x - u) / s\n", "\n", " where `u` is the mean of the training samples or zero if `with_mean=False`,\n", " and `s` is the standard deviation of the training samples or one if\n", " `with_std=False`.\n", "\n", " Centering and scaling happen independently on each feature by computing\n", " the relevant statistics on the samples in the training set. 
Mean and\n", " standard deviation are then stored to be used on later data using\n", " :meth:`transform`.\n", "\n", " Parameters\n", " ----------\n", " with_mean : bool, default=True\n", " If True, center the data before scaling.\n", "\n", " with_std : bool, default=True\n", " If True, scale the data to unit variance (or equivalently,\n", " unit standard deviation).\n", "\n", " Attributes\n", " ----------\n", " mean_ : ndarray of shape (n_features,) or None\n", " The mean value for each feature in the training set.\n", " Equal to ``None`` when ``with_mean=False``.\n", "\n", " std_ : ndarray of shape (n_features,) or None\n", " The standard deviation for each feature in the training set.\n", " Equal to ``None`` when ``with_std=False``.\n", "\n", "\n", " See Also\n", " --------\n", " sklearn.preprocessing.StandardScaler : Original transformer from scikit-learn.\n", "\n", " Examples\n", " --------\n", " >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]\n", " >>> scaler = CustomStandardScaler()\n", " >>> print(scaler.fit(data))\n", " CustomStandardScaler()\n", " >>> print(scaler.mean_)\n", " [0.5 0.5]\n", " >>> print(scaler.transform(data))\n", " [[-1. -1.]\n", " [-1. -1.]\n", " [ 1. 1.]\n", " [ 1. 1.]]\n", " >>> print(scaler.transform([[2, 2]]))\n", " [[3. 
3.]]\n", " \"\"\"\n", " # every input parameter must have a default value,\n", " # and all required parameters are passed through __init__\n", " def __init__(self, with_mean=True, with_std=True):\n", " super().__init__()\n", " # every input parameter must be stored in an attribute with the same\n", " # name as the parameter, holding exactly the value the user passed\n", " # example: an input parameter named pepe is stored as self.pepe\n", "\n", " # == your code starts here ====\n", " self.with_mean =\n", " self.with_std =\n", " # == your code ends here ====\n", "\n", " # computed attributes are named with a trailing underscore\n", " self.mean_ = None\n", " self.std_ = None\n", "\n", " def fit(self, X, y=None):\n", "\n", " # implement the fit method: compute the mean and standard deviation\n", " # of the data in X and store them in self.mean_ and self.std_ respectively,\n", " # depending on the parameters given by the user\n", "\n", " # == your code starts here ====\n", " self.mean_ =\n", " self.std_ =\n", " # == your code ends here ====\n", "\n", " # IMPORTANT: fit always returns self\n", " return self\n", "\n", " def transform(self, X, y=None):\n", " # implement the transform method: subtract the mean stored in self.mean_\n", " # and divide by the standard deviation stored in self.std_,\n", " # depending on the parameters given by the user\n", " # == your code starts here ====\n", " # == your code ends here ====\n", " return X" ] }, { "cell_type": "markdown", "source": [ "In the next cell we compare the `CustomStandardScaler` implemented above with the one provided by scikit-learn, to verify that both give the same result and that our implementation is correct." ], "metadata": { "id": "KyV8ytO6gOxN" } }, { "cell_type": "code", "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "np.random.seed(0)\n", "X = np.random.normal(5, 2, (200, 8))\n", "\n", "for with_mean in [True, False]:\n", " for with_std in [True, False]:\n", " 
X_sklearn = StandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)\n", " X_nuestro = CustomStandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)\n", "\n", " # compare with a floating-point tolerance rather than exact equality\n", " assert np.allclose(X_sklearn, X_nuestro), (with_mean, with_std)" ], "metadata": { "id": "-8qPmPyjShId" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Next we take a dataset, iris, and train a classifier. The goal is to use our `CustomStandardScaler` in a grid search later on.\n", "\n", "## Questions:\n", "In the call to `train_test_split`:\n", "- What does the `stratify=y` parameter do? Why is it important?\n", "- What does the `shuffle=True` parameter do? Why is it important?" ], "metadata": { "id": "h2t7zcJGge6M" } }, { "cell_type": "code", "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.model_selection import train_test_split\n", "\n", "iris = load_iris()\n", "\n", "y = iris.target\n", "X = iris.data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", " test_size=0.1,\n", " random_state=0,\n", " stratify=y,\n", " shuffle=True)\n", "\n", "\n", "X_train.shape, y_train.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kqkdIHsiOE95", "outputId": "8ef04f01-eff0-49b9-d4d6-b796641d8c22" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((135, 4), (135,))" ] }, "metadata": {}, "execution_count": 77 } ] }, { "cell_type": "markdown", "source": [ "Implement a classification pipeline that uses the `CustomStandardScaler` developed above. Run a grid search over the pipeline that tries, at least, all combinations of the parameters `CustomStandardScaler.with_mean` and `CustomStandardScaler.with_std`." 
], "metadata": { "id": "-6oDbgQyhG4i" } }, { "cell_type": "code", "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "pipe = Pipeline([\n", " (\"scaler\", CustomStandardScaler()),\n", "\n", "])\n", "\n", "gs = GridSearchCV(\n", " pipe,\n", " {\n", " \"scaler__with_mean\": [True, False],\n", " \"scaler__with_std\": [True, False],\n", "\n", " },\n", " cv=6,\n", " n_jobs=-1,\n", " scoring=(\"accuracy\", \"f1_macro\"), # all the metrics we want to track\n", " refit=\"accuracy\" # the metric used to pick and refit the best model\n", ")\n", "\n", "gs.fit(X_train, y_train)\n", "print(gs.best_params_)\n", "print(gs.best_score_)" ], "metadata": { "id": "v68dx5hOORlm" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "To finish, we load all the runs into a single pandas DataFrame as a quick way to inspect the results." ], "metadata": { "id": "s3th-szjhk0T" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "# show all columns\n", "pd.set_option('display.max_columns', None)\n", "# load the grid search results into a DataFrame\n", "pd.DataFrame(gs.cv_results_)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "TKQ48IgHPybR", "outputId": "84ded1e6-b3f5-4dfb-fa22-bc33aa76d137" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "0 0.002353 0.000816 0.009110 0.003944 \n", "1 0.001512 0.000131 0.007573 0.004027 \n", "2 0.001884 0.000986 0.005648 0.002699 \n", "3 0.001446 0.000099 0.008071 0.004958 \n", "4 0.002101 0.001313 0.005209 0.000959 \n", "5 0.001772 0.000372 0.006268 0.001582 \n", "6 0.004500 0.003316 0.009153 0.003559 \n", "7 0.001475 0.000125 0.005685 0.001782 \n", "8 
0.001441 0.000052 0.004896 0.000112 \n", "9 0.001500 0.000069 0.005179 0.000726 \n", "10 0.001516 0.000066 0.004809 0.000103 \n", "11 0.001328 0.000155 0.004345 0.000738 \n", "\n", " param_clf__n_neighbors param_scaler__with_mean param_scaler__with_std \\\n", "0 5 True True \n", "1 5 True False \n", "2 5 False True \n", "3 5 False False \n", "4 10 True True \n", "5 10 True False \n", "6 10 False True \n", "7 10 False False \n", "8 15 True True \n", "9 15 True False \n", "10 15 False True \n", "11 15 False False \n", "\n", " params split0_test_accuracy \\\n", "0 {'clf__n_neighbors': 5, 'scaler__with_mean': T... 1.000000 \n", "1 {'clf__n_neighbors': 5, 'scaler__with_mean': T... 0.956522 \n", "2 {'clf__n_neighbors': 5, 'scaler__with_mean': F... 1.000000 \n", "3 {'clf__n_neighbors': 5, 'scaler__with_mean': F... 0.956522 \n", "4 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 0.956522 \n", "5 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 1.000000 \n", "6 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 0.956522 \n", "7 {'clf__n_neighbors': 10, 'scaler__with_mean': ... 1.000000 \n", "8 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "9 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "10 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 1.000000 \n", "11 {'clf__n_neighbors': 15, 'scaler__with_mean': ... 
1.000000 \n", "\n", " split1_test_accuracy split2_test_accuracy split3_test_accuracy \\\n", "0 0.913043 1.000000 0.909091 \n", "1 0.913043 1.000000 0.954545 \n", "2 0.913043 1.000000 0.909091 \n", "3 0.913043 1.000000 0.954545 \n", "4 0.913043 0.956522 0.954545 \n", "5 0.956522 0.956522 0.909091 \n", "6 0.913043 0.956522 0.954545 \n", "7 0.956522 0.956522 0.909091 \n", "8 0.913043 1.000000 0.909091 \n", "9 0.956522 1.000000 0.909091 \n", "10 0.913043 1.000000 0.909091 \n", "11 0.956522 1.000000 0.909091 \n", "\n", " split4_test_accuracy split5_test_accuracy mean_test_accuracy \\\n", "0 1.0 1.0 0.970356 \n", "1 1.0 1.0 0.970685 \n", "2 1.0 1.0 0.970356 \n", "3 1.0 1.0 0.970685 \n", "4 1.0 1.0 0.963439 \n", "5 1.0 1.0 0.970356 \n", "6 1.0 1.0 0.963439 \n", "7 1.0 1.0 0.970356 \n", "8 1.0 1.0 0.970356 \n", "9 1.0 1.0 0.977602 \n", "10 1.0 1.0 0.970356 \n", "11 1.0 1.0 0.977602 \n", "\n", " std_test_accuracy rank_test_accuracy split0_test_f1_macro \\\n", "0 0.041939 5 1.000000 \n", "1 0.032562 3 0.955556 \n", "2 0.041939 5 1.000000 \n", "3 0.032562 3 0.955556 \n", "4 0.029966 11 0.955556 \n", "5 0.033597 5 1.000000 \n", "6 0.029966 11 0.955556 \n", "7 0.033597 5 1.000000 \n", "8 0.041939 5 1.000000 \n", "9 0.034508 1 1.000000 \n", "10 0.041939 5 1.000000 \n", "11 0.034508 1 1.000000 \n", "\n", " split1_test_f1_macro split2_test_f1_macro split3_test_f1_macro \\\n", "0 0.907407 1.000000 0.910714 \n", "1 0.907407 1.000000 0.954751 \n", "2 0.907407 1.000000 0.910714 \n", "3 0.907407 1.000000 0.954751 \n", "4 0.907407 0.954751 0.955556 \n", "5 0.954751 0.954751 0.910714 \n", "6 0.907407 0.954751 0.955556 \n", "7 0.954751 0.954751 0.910714 \n", "8 0.907407 1.000000 0.910714 \n", "9 0.954751 1.000000 0.907407 \n", "10 0.907407 1.000000 0.910714 \n", "11 0.954751 1.000000 0.907407 \n", "\n", " split4_test_f1_macro split5_test_f1_macro mean_test_f1_macro \\\n", "0 1.0 1.0 0.969687 \n", "1 1.0 1.0 0.969619 \n", "2 1.0 1.0 0.969687 \n", "3 1.0 1.0 0.969619 \n", "4 1.0 1.0 
0.962212 \n", "5 1.0 1.0 0.970036 \n", "6 1.0 1.0 0.962212 \n", "7 1.0 1.0 0.970036 \n", "8 1.0 1.0 0.969687 \n", "9 1.0 1.0 0.977026 \n", "10 1.0 1.0 0.969687 \n", "11 1.0 1.0 0.977026 \n", "\n", " std_test_f1_macro rank_test_f1_macro \n", "0 0.042880 5 \n", "1 0.034298 9 \n", "2 0.042880 5 \n", "3 0.034298 9 \n", "4 0.031632 11 \n", "5 0.033366 3 \n", "6 0.031632 11 \n", "7 0.033366 3 \n", "8 0.042880 5 \n", "9 0.035247 1 \n", "10 0.042880 5 \n", "11 0.035247 1 " ], "text/html": [ "\n", "
\n", "| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__n_neighbors | param_scaler__with_mean | param_scaler__with_std | params | split0_test_accuracy | split1_test_accuracy | split2_test_accuracy | split3_test_accuracy | split4_test_accuracy | split5_test_accuracy | mean_test_accuracy | std_test_accuracy | rank_test_accuracy | split0_test_f1_macro | split1_test_f1_macro | split2_test_f1_macro | split3_test_f1_macro | split4_test_f1_macro | split5_test_f1_macro | mean_test_f1_macro | std_test_f1_macro | rank_test_f1_macro |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 0 | 0.002353 | 0.000816 | 0.009110 | 0.003944 | 5 | True | True | {'clf__n_neighbors': 5, 'scaler__with_mean': T... | 1.000000 | 0.913043 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.041939 | 5 | 1.000000 | 0.907407 | 1.000000 | 0.910714 | 1.0 | 1.0 | 0.969687 | 0.042880 | 5 |\n", "
| 1 | 0.001512 | 0.000131 | 0.007573 | 0.004027 | 5 | True | False | {'clf__n_neighbors': 5, 'scaler__with_mean': T... | 0.956522 | 0.913043 | 1.000000 | 0.954545 | 1.0 | 1.0 | 0.970685 | 0.032562 | 3 | 0.955556 | 0.907407 | 1.000000 | 0.954751 | 1.0 | 1.0 | 0.969619 | 0.034298 | 9 |\n", "
| 2 | 0.001884 | 0.000986 | 0.005648 | 0.002699 | 5 | False | True | {'clf__n_neighbors': 5, 'scaler__with_mean': F... | 1.000000 | 0.913043 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.041939 | 5 | 1.000000 | 0.907407 | 1.000000 | 0.910714 | 1.0 | 1.0 | 0.969687 | 0.042880 | 5 |\n", "
| 3 | 0.001446 | 0.000099 | 0.008071 | 0.004958 | 5 | False | False | {'clf__n_neighbors': 5, 'scaler__with_mean': F... | 0.956522 | 0.913043 | 1.000000 | 0.954545 | 1.0 | 1.0 | 0.970685 | 0.032562 | 3 | 0.955556 | 0.907407 | 1.000000 | 0.954751 | 1.0 | 1.0 | 0.969619 | 0.034298 | 9 |\n", "
| 4 | 0.002101 | 0.001313 | 0.005209 | 0.000959 | 10 | True | True | {'clf__n_neighbors': 10, 'scaler__with_mean': ... | 0.956522 | 0.913043 | 0.956522 | 0.954545 | 1.0 | 1.0 | 0.963439 | 0.029966 | 11 | 0.955556 | 0.907407 | 0.954751 | 0.955556 | 1.0 | 1.0 | 0.962212 | 0.031632 | 11 |\n", "
| 5 | 0.001772 | 0.000372 | 0.006268 | 0.001582 | 10 | True | False | {'clf__n_neighbors': 10, 'scaler__with_mean': ... | 1.000000 | 0.956522 | 0.956522 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.033597 | 5 | 1.000000 | 0.954751 | 0.954751 | 0.910714 | 1.0 | 1.0 | 0.970036 | 0.033366 | 3 |\n", "
| 6 | 0.004500 | 0.003316 | 0.009153 | 0.003559 | 10 | False | True | {'clf__n_neighbors': 10, 'scaler__with_mean': ... | 0.956522 | 0.913043 | 0.956522 | 0.954545 | 1.0 | 1.0 | 0.963439 | 0.029966 | 11 | 0.955556 | 0.907407 | 0.954751 | 0.955556 | 1.0 | 1.0 | 0.962212 | 0.031632 | 11 |\n", "
| 7 | 0.001475 | 0.000125 | 0.005685 | 0.001782 | 10 | False | False | {'clf__n_neighbors': 10, 'scaler__with_mean': ... | 1.000000 | 0.956522 | 0.956522 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.033597 | 5 | 1.000000 | 0.954751 | 0.954751 | 0.910714 | 1.0 | 1.0 | 0.970036 | 0.033366 | 3 |\n", "
| 8 | 0.001441 | 0.000052 | 0.004896 | 0.000112 | 15 | True | True | {'clf__n_neighbors': 15, 'scaler__with_mean': ... | 1.000000 | 0.913043 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.041939 | 5 | 1.000000 | 0.907407 | 1.000000 | 0.910714 | 1.0 | 1.0 | 0.969687 | 0.042880 | 5 |\n", "
| 9 | 0.001500 | 0.000069 | 0.005179 | 0.000726 | 15 | True | False | {'clf__n_neighbors': 15, 'scaler__with_mean': ... | 1.000000 | 0.956522 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.977602 | 0.034508 | 1 | 1.000000 | 0.954751 | 1.000000 | 0.907407 | 1.0 | 1.0 | 0.977026 | 0.035247 | 1 |\n", "
| 10 | 0.001516 | 0.000066 | 0.004809 | 0.000103 | 15 | False | True | {'clf__n_neighbors': 15, 'scaler__with_mean': ... | 1.000000 | 0.913043 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.970356 | 0.041939 | 5 | 1.000000 | 0.907407 | 1.000000 | 0.910714 | 1.0 | 1.0 | 0.969687 | 0.042880 | 5 |\n", "
| 11 | 0.001328 | 0.000155 | 0.004345 | 0.000738 | 15 | False | False | {'clf__n_neighbors': 15, 'scaler__with_mean': ... | 1.000000 | 0.956522 | 1.000000 | 0.909091 | 1.0 | 1.0 | 0.977602 | 0.034508 | 1 | 1.000000 | 0.954751 | 1.000000 | 0.907407 | 1.0 | 1.0 | 0.977026 | 0.035247 | 1 |\n", "