# Part A [Optional]

In this part you will implement a scikit-learn transformer from scratch.

The goal is to understand how transformers work and to see scikit-learn's conventions in practice.

A transformer is an object whose main method is `transform`, which applies transformations to the input features.
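
To make the fit/transform contract concrete, here is a minimal sketch of a toy transformer. `AddConstant` and its `constant` parameter are hypothetical names invented for illustration; they are not part of scikit-learn or of this assignment.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: adds a constant to every feature."""

    def __init__(self, constant=1.0):
        # Store the parameter under the same name, unmodified.
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn here; fit still returns self by convention.
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) + self.constant


t = AddConstant(constant=2.0)
X_out = t.fit_transform([[0, 1], [2, 3]])  # fit_transform comes from TransformerMixin
params = t.get_params()                    # get_params comes from BaseEstimator
print(X_out)
print(params)
```

Only `__init__`, `fit` and `transform` are written by hand; `fit_transform` and `get_params` are inherited from the two base classes.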

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Note that the class inherits from BaseEstimator and TransformerMixin.
# This gives us extra methods "for free" (e.g. get_params and fit_transform)
# and keeps our transformer compatible with the rest of scikit-learn,
# as long as we implement fit/transform and follow the conventions
class CustomStandardScaler(BaseEstimator, TransformerMixin):

    """Standardize features by removing the mean and scaling to unit variance.

    This is a custom implementation of sklearn.preprocessing.StandardScaler for
    learning purposes only.

    The standard score of a sample `x` is calculated as:

        z = (x - u) / s

    where `u` is the mean of the training samples or zero if `with_mean=False`,
    and `s` is the standard deviation of the training samples or one if
    `with_std=False`.

    Centering and scaling happen independently on each feature by computing
    the relevant statistics on the samples in the training set. Mean and
    standard deviation are then stored to be used on later data using
    :meth:`transform`.

    Parameters
    ----------
    with_mean : bool, default=True
        If True, center the data before scaling.

    with_std : bool, default=True
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).

    Attributes
    ----------
    mean_ : ndarray of shape (n_features,) or None
        The mean value for each feature in the training set.
        Equal to ``None`` when ``with_mean=False``.

    std_ : ndarray of shape (n_features,) or None
        The standard deviation for each feature in the training set.
        Equal to ``None`` when ``with_std=False``.


    See Also
    --------
    sklearn.preprocessing.StandardScaler : Original transformer from scikit-learn.

    Examples
    --------
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = CustomStandardScaler()
    >>> print(scaler.fit(data))
    CustomStandardScaler()
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
    >>> print(scaler.transform([[2, 2]]))
    [[3. 3.]]
    """
    # every input parameter must have a default value,
    # and all the required parameters are passed in __init__
    def __init__(self, with_mean=True, with_std=True):
        super().__init__()
        # every input parameter must be stored in an attribute with the same
        # name as the parameter, and with the same value the user passed
        # example: if there is an input parameter pepe, store it as self.pepe

        # == your code starts here ====
        self.with_mean =
        self.with_std =
        # == your code ends here ====

        # computed attributes are named with a trailing underscore
        self.mean_ = None
        self.std_ = None

    def fit(self, X, y=None):

        # implement the fit method, which computes the mean and standard deviation
        # of the data in X and stores them in self.mean_ and self.std_ respectively,
        # depending on the parameters given by the user

        # == your code starts here ====
            self.mean_ =
            self.std_ =
        # == your code ends here ====

        # IMPORTANT: .fit always returns self
        return self

    def transform(self, X, y=None):
        # implement the transform method, which subtracts the mean self.mean_
        # and divides by the standard deviation self.std_,
        # depending on the parameters given by the user
        # == your code starts here ====
        # == your code ends here ====
        return X
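
The formula `z = (x - u) / s` from the docstring can be checked by hand with plain NumPy. This is only a sketch of the arithmetic on the docstring's example data, not the class implementation. It assumes the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` uses, and it ignores the zero-variance case.

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

# Statistics are computed per feature (column), hence axis=0.
u = X_train.mean(axis=0)
s = X_train.std(axis=0)  # ddof=0 by default: population standard deviation

Z = (X_train - u) / s

# New data is scaled with the statistics learned from the training set.
z_new = (np.array([[2.0, 2.0]]) - u) / s

print(u)      # [0.5 0.5]
print(z_new)  # [[3. 3.]]
```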

In the following cell we compare the `CustomStandardScaler` implemented above with the one provided by scikit-learn, to verify that both give the same result and that our implementation is correct.

In [None]:
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = np.random.normal(5, 2, (200, 8))

for with_mean in [True, False]:
  for with_std in [True, False]:
    X_sklearn = StandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)
    X_nuestro = CustomStandardScaler(with_mean=with_mean, with_std=with_std).fit(X).transform(X)

    # compare with a floating-point tolerance instead of exact equality
    assert np.allclose(X_sklearn, X_nuestro), (with_mean, with_std)

Next, we will take a dataset, iris, and train a classifier. The goal is to use our `CustomStandardScaler` in a grid search later on.

## Questions:
In the call to `train_test_split`:
- What does the parameter `stratify=y` do? Why is it important?
- What does the parameter `shuffle=True` do? Why is it important?

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

y = iris.target
X = iris.data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=0,
                                                    stratify=y,
                                                    shuffle=True)


X_train.shape, y_train.shape

((135, 4), (135,))
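
One way to explore the questions above is to count the classes in the test split under different settings. The sketch below reloads iris so that it is self-contained; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Stratified split: class proportions are preserved in both subsets.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_iris, y_iris, test_size=0.1, random_state=0, stratify=y_iris, shuffle=True)

# Shuffled but not stratified: proportions depend on the random draw.
_, _, _, y_te_plain = train_test_split(
    X_iris, y_iris, test_size=0.1, random_state=0, shuffle=True)

# No shuffling: iris is sorted by class, so the test set is the last 15 rows.
_, _, _, y_te_noshuffle = train_test_split(
    X_iris, y_iris, test_size=0.1, shuffle=False)

print("stratified :", np.bincount(y_te_strat, minlength=3))
print("shuffled   :", np.bincount(y_te_plain, minlength=3))
print("no shuffle :", np.bincount(y_te_noshuffle, minlength=3))
```

Because the iris targets are stored in class order, the unshuffled test split contains a single class, which makes any evaluation on it meaningless.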

Implement a classification pipeline that uses the `CustomStandardScaler` you developed. Run a grid search over the pipeline that tries, at least, every combination of the parameters `CustomStandardScaler.with_mean` and `CustomStandardScaler.with_std`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", CustomStandardScaler()),

])

gs = GridSearchCV(
    pipe,
    {
        "scaler__with_mean": [True, False],
        "scaler__with_std": [True, False],

    },
    cv=6,
    n_jobs=-1,
    scoring = ("accuracy", "f1_macro"),  # all the metrics we want to track
    refit="accuracy"  # the metric used to pick and refit the winning model
)

gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
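
For reference, here is a sketch of what a complete pipeline and grid could look like. It is written under assumptions: scikit-learn's `StandardScaler` stands in for `CustomStandardScaler` so the block runs on its own, and the step name `clf` with `n_neighbors` in `[5, 10, 15]` is inferred from the column names of the results table shown further below, not stated in the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # stand-in for CustomStandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.1, random_state=0, stratify=y_iris)

sketch_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", KNeighborsClassifier()),  # assumed classifier step
])

# Parameters are addressed as <step name>__<parameter name>.
sketch_grid = {
    "scaler__with_mean": [True, False],
    "scaler__with_std": [True, False],
    "clf__n_neighbors": [5, 10, 15],  # assumed values
}

sketch_gs = GridSearchCV(
    sketch_pipe, sketch_grid, cv=6,
    scoring=("accuracy", "f1_macro"), refit="accuracy")
sketch_gs.fit(X_tr, y_tr)

print(sketch_gs.best_params_)
print(len(sketch_gs.cv_results_["params"]))  # 2 * 2 * 3 = 12 combinations
```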

To finish, we will load all the runs into a single pandas DataFrame, as a quick way to visualize this data.

In [None]:
import pandas as pd
# show all the columns
pd.set_option('display.max_columns', None)
# load the grid search results into a dataframe
pd.DataFrame(gs.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__n_neighbors,param_scaler__with_mean,param_scaler__with_std,params,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,split5_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy,split0_test_f1_macro,split1_test_f1_macro,split2_test_f1_macro,split3_test_f1_macro,split4_test_f1_macro,split5_test_f1_macro,mean_test_f1_macro,std_test_f1_macro,rank_test_f1_macro
0,0.002353,0.000816,0.00911,0.003944,5,True,True,"{'clf__n_neighbors': 5, 'scaler__with_mean': T...",1.0,0.913043,1.0,0.909091,1.0,1.0,0.970356,0.041939,5,1.0,0.907407,1.0,0.910714,1.0,1.0,0.969687,0.04288,5
1,0.001512,0.000131,0.007573,0.004027,5,True,False,"{'clf__n_neighbors': 5, 'scaler__with_mean': T...",0.956522,0.913043,1.0,0.954545,1.0,1.0,0.970685,0.032562,3,0.955556,0.907407,1.0,0.954751,1.0,1.0,0.969619,0.034298,9
2,0.001884,0.000986,0.005648,0.002699,5,False,True,"{'clf__n_neighbors': 5, 'scaler__with_mean': F...",1.0,0.913043,1.0,0.909091,1.0,1.0,0.970356,0.041939,5,1.0,0.907407,1.0,0.910714,1.0,1.0,0.969687,0.04288,5
3,0.001446,9.9e-05,0.008071,0.004958,5,False,False,"{'clf__n_neighbors': 5, 'scaler__with_mean': F...",0.956522,0.913043,1.0,0.954545,1.0,1.0,0.970685,0.032562,3,0.955556,0.907407,1.0,0.954751,1.0,1.0,0.969619,0.034298,9
4,0.002101,0.001313,0.005209,0.000959,10,True,True,"{'clf__n_neighbors': 10, 'scaler__with_mean': ...",0.956522,0.913043,0.956522,0.954545,1.0,1.0,0.963439,0.029966,11,0.955556,0.907407,0.954751,0.955556,1.0,1.0,0.962212,0.031632,11
5,0.001772,0.000372,0.006268,0.001582,10,True,False,"{'clf__n_neighbors': 10, 'scaler__with_mean': ...",1.0,0.956522,0.956522,0.909091,1.0,1.0,0.970356,0.033597,5,1.0,0.954751,0.954751,0.910714,1.0,1.0,0.970036,0.033366,3
6,0.0045,0.003316,0.009153,0.003559,10,False,True,"{'clf__n_neighbors': 10, 'scaler__with_mean': ...",0.956522,0.913043,0.956522,0.954545,1.0,1.0,0.963439,0.029966,11,0.955556,0.907407,0.954751,0.955556,1.0,1.0,0.962212,0.031632,11
7,0.001475,0.000125,0.005685,0.001782,10,False,False,"{'clf__n_neighbors': 10, 'scaler__with_mean': ...",1.0,0.956522,0.956522,0.909091,1.0,1.0,0.970356,0.033597,5,1.0,0.954751,0.954751,0.910714,1.0,1.0,0.970036,0.033366,3
8,0.001441,5.2e-05,0.004896,0.000112,15,True,True,"{'clf__n_neighbors': 15, 'scaler__with_mean': ...",1.0,0.913043,1.0,0.909091,1.0,1.0,0.970356,0.041939,5,1.0,0.907407,1.0,0.910714,1.0,1.0,0.969687,0.04288,5
9,0.0015,6.9e-05,0.005179,0.000726,15,True,False,"{'clf__n_neighbors': 15, 'scaler__with_mean': ...",1.0,0.956522,1.0,0.909091,1.0,1.0,0.977602,0.034508,1,1.0,0.954751,1.0,0.907407,1.0,1.0,0.977026,0.035247,1


In [None]:
from sklearn.metrics import classification_report
y_pred = gs.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00         5
           2       1.00      1.00      1.00         5

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15



# Parte B

In this part we will put the concepts seen in class into practice. We will use the `california housing` dataset. This dataset is meant for estimating average house prices in California, but we will turn it into a binary classification problem: cheap houses vs. expensive houses, depending on whether or not their price is above the mean price of the dataset.

In [None]:
from sklearn.datasets import fetch_california_housing  # needed before the first use below
print(fetch_california_housing().DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=False)

y_mean = np.mean(y)

# convert the target to binary: 0 (cheap) or 1 (expensive)
y[y<=y_mean] = 0
y[y>y_mean] = 1
y = y.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=0,
                                                    stratify=y,
                                                    shuffle=True)
X_train.shape, y_train.shape

((18576, 8), (18576,))

We will penalize the two kinds of error differently:
- An expensive house classified as cheap has a cost of 1
- A cheap house classified as expensive has a cost of 2

Define the cost matrix as a NumPy array.

With this matrix, implement `expected_cost_loss`, the expected cost seen in class:

In [None]:
from sklearn.metrics import confusion_matrix, make_scorer

COST_MATRIX = np.array([
    [xx, yy],
    [ww, zz]
])

assert COST_MATRIX.shape == (2, 2)

def expected_cost_loss(y_true, y_pred):
    # == your code starts here ====

    cost =
    # == your code ends here ====
    return cost

expected_cost_scorer = make_scorer(expected_cost_loss, greater_is_better=False)
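
As a hedged illustration, one common definition of the expected cost is the cost-weighted confusion matrix divided by the number of samples. The definition seen in class may differ in details. The sketch assumes rows are true classes and columns are predicted classes (the `confusion_matrix` convention), with 0 = cheap and 1 = expensive; all names prefixed with `example_` or `EXAMPLE_` are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

# Rows: true class, columns: predicted class (assumed layout).
EXAMPLE_COST_MATRIX = np.array([
    [0, 2],  # true cheap:     predicted cheap -> 0, predicted expensive -> 2
    [1, 0],  # true expensive: predicted cheap -> 1, predicted expensive -> 0
])


def example_expected_cost(y_true, y_pred, cost_matrix=EXAMPLE_COST_MATRIX):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    # Average cost per sample: each cell count weighted by its cost.
    return float((cm * cost_matrix).sum() / cm.sum())


example_scorer = make_scorer(example_expected_cost, greater_is_better=False)

# One error of each kind among four samples: (2 + 1) / 4
example_cost = example_expected_cost([0, 0, 1, 1], [0, 1, 0, 1])
print(example_cost)  # 0.75
```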

Implement a pipeline to find a probabilistic classifier (that is, make sure it exposes `predict_proba`) and its parameters so as to minimize the expected cost defined above.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# == your code starts here ====
pipe = Pipeline([

])

gs = GridSearchCV(

)
# == your code ends here ====
gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)

{'scaler__with_mean': True, 'scaler__with_std': True}
-0.2635120585701981
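
The negative `best_score_` above is expected: a scorer built with `greater_is_better=False` negates the loss so that `GridSearchCV` can still maximize it. The sketch below shows this on synthetic data. The classifier (`LogisticRegression`), the grid and every `demo_`/`DEMO_` name are assumptions chosen for illustration, not the intended solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

DEMO_COSTS = np.array([[0, 2], [1, 0]])  # assumed layout: rows true, columns predicted


def demo_expected_cost(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float((cm * DEMO_COSTS).sum() / cm.sum())


demo_scorer = make_scorer(demo_expected_cost, greater_is_better=False)

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),  # exposes predict_proba
])
demo_gs = GridSearchCV(demo_pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring=demo_scorer)
demo_gs.fit(X_demo, y_demo)

# best_score_ holds the negated loss; flip the sign to read it as a cost.
demo_cost = -demo_gs.best_score_
print(demo_gs.best_params_, demo_cost)
```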


Evaluate the trained classifier. Report the expected cost and the normalized expected cost for the best classifier found.

In [None]:
from sklearn.metrics import classification_report

# == your code starts here ====
# == your code ends here ====
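
A usual way to normalize the expected cost is to divide it by the cost of the best naive classifier, that is, the cheaper of "always predict 0" and "always predict 1". The sketch below assumes that definition and the same cost matrix layout as before (rows true, columns predicted); check it against the definition given in class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

NEC_COSTS = np.array([[0, 2], [1, 0]])  # assumed layout: rows true, columns predicted


def nec_expected_cost(y_true, y_pred, cost_matrix=NEC_COSTS):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float((cm * cost_matrix).sum() / cm.sum())


def normalized_expected_cost(y_true, y_pred, cost_matrix=NEC_COSTS):
    y_true = np.asarray(y_true)
    cost = nec_expected_cost(y_true, y_pred, cost_matrix)
    # Cost of the best constant prediction, used as the reference value.
    naive = min(
        nec_expected_cost(y_true, np.full_like(y_true, k), cost_matrix)
        for k in (0, 1)
    )
    return cost / naive


# Always predicting 0 costs 0.5 here and always predicting 1 costs 1.0,
# so the reference is 0.5 and the result is 0.75 / 0.5.
nec = normalized_expected_cost([0, 0, 1, 1], [0, 1, 0, 1])
print(nec)  # 1.5
```

Under this definition, a value below 1 means the classifier beats the best constant prediction, and a value above 1 means it does worse.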

Report auc_ROC and auc_PR [optional: plot them]

In [None]:
# == your code starts here ====
# == your code ends here ====
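
Both areas are computed from scores (probabilities of the positive class), not from hard predictions. The following sketch uses synthetic data and a `LogisticRegression` as stand-ins for the assignment's dataset and fitted grid search. `average_precision_score` is used here as the summary of the precision-recall curve, which is one of several ways to estimate that area.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_auc, y_auc = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_auc, y_auc, test_size=0.2, random_state=0, stratify=y_auc)

clf_auc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class.
scores = clf_auc.predict_proba(X_te)[:, 1]

auc_roc = roc_auc_score(y_te, scores)
auc_pr = average_precision_score(y_te, scores)
print(f"auc_ROC={auc_roc:.3f}  auc_PR={auc_pr:.3f}")
```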


For the trained classifier, find a threshold that minimizes the expected cost function

In [None]:
# == your code starts here ====
# == your code ends here ====
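
A simple approach is a brute-force search: sweep candidate thresholds over the predicted probabilities and keep the one with the lowest expected cost. The sketch below does this on synthetic data with an assumed cost matrix layout (rows true, columns predicted). In practice the threshold would be chosen on validation data rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

THR_COSTS = np.array([[0, 2], [1, 0]])  # assumed layout: rows true, columns predicted


def thr_expected_cost(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float((cm * THR_COSTS).sum() / cm.sum())


X_thr, y_thr = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_thr, y_thr, test_size=0.3, random_state=0, stratify=y_thr)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

# Evaluate the cost of predicting class 1 whenever proba >= t.
thresholds = np.linspace(0.0, 1.0, 101)
costs = np.array([thr_expected_cost(y_val, (proba >= t).astype(int)) for t in thresholds])

best_threshold = thresholds[np.argmin(costs)]
print(best_threshold, costs.min())
```

With these costs and perfectly calibrated probabilities, predicting class 1 is cheaper only when `2 * (1 - p) < 1 * p`, that is, when `p > 2/3`; the empirical search may land elsewhere when the model is not well calibrated.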

Is your classifier well calibrated? Calibrate it. Show the `brier_score_loss` before and after calibrating it. [optional: show calibration diagrams before and after calibration]

In [None]:
# == your code starts here ====
# == your code ends here ====
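
One possible way to approach this is `CalibratedClassifierCV`, comparing the Brier score of the raw and the calibrated model on held-out data. The sketch below uses synthetic data and `GaussianNB` as an illustrative base model; the choice of isotonic calibration and all variable names are assumptions, and whether the score improves depends on the data and the model.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_cal, y_cal = make_classification(
    n_samples=2000, n_features=8, n_informative=2, n_redundant=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cal, y_cal, test_size=0.3, random_state=0, stratify=y_cal)

# Uncalibrated model.
raw_model = GaussianNB().fit(X_tr, y_tr)

# Calibrated model: internal cross-validation fits the calibrator on held-out folds.
calibrated_model = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated_model.fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_model.predict_proba(X_te)[:, 1])
brier_calibrated = brier_score_loss(y_te, calibrated_model.predict_proba(X_te)[:, 1])
print(f"Brier before: {brier_raw:.4f}  after: {brier_calibrated:.4f}")
```

For the optional diagrams, `sklearn.calibration.calibration_curve` returns the points of a reliability curve that can be plotted for each model.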