{ "cells": [ { "cell_type": "markdown", "id": "3a0016a8-c324-45cf-9f51-de62895f36cd", "metadata": { "id": "3a0016a8-c324-45cf-9f51-de62895f36cd" }, "source": [ "# Basic dataset manipulation" ] }, { "cell_type": "markdown", "id": "0b94cb49-8d74-46b0-a8db-76cf2381f813", "metadata": { "id": "0b94cb49-8d74-46b0-a8db-76cf2381f813" }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "7ac21f68-8498-4078-bdb7-ae562e791aa1", "metadata": { "id": "7ac21f68-8498-4078-bdb7-ae562e791aa1" }, "outputs": [], "source": [ "# Scientific computing\n", "import numpy as np\n", "np.set_printoptions(formatter={'float': lambda x: \"{0:0.3f}\".format(x)})\n", "\n", "# Statistics\n", "import scipy.stats as stats\n", "\n", "# Visualizations\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(style=\"ticks\", color_codes=True)\n", "\n", "# Machine learning\n", "import sklearn\n", "\n", "# Data analysis and manipulation\n", "import pandas as pd\n", "pd.set_option('display.precision', 2) # 2 decimal places\n", "pd.set_option('display.max_rows', 20)\n", "pd.set_option('display.max_columns', 30)\n", "pd.set_option('display.width', 100) # wide windows" ] }, { "cell_type": "markdown", "id": "9550692b-65d6-4bd0-8fac-50012cc6af6a", "metadata": { "id": "9550692b-65d6-4bd0-8fac-50012cc6af6a" }, "source": [ "## Example datasets from sklearn" ] }, { "cell_type": "markdown", "id": "b7587c2c-c530-407e-aac6-828fded13721", "metadata": { "id": "b7587c2c-c530-407e-aac6-828fded13721" }, "source": [ "scikit-learn ships with a few small datasets that are useful for quickly illustrating how various algorithms behave. One of them is the famous Iris dataset, first used by Sir R.A. Fisher.\n", "\n", "The dataset contains 3 classes of 50 instances each, where each class corresponds to a species of iris plant."
] }, { "cell_type": "code", "execution_count": null, "id": "ee614368-714d-4082-bba2-87b996d55900", "metadata": { "id": "ee614368-714d-4082-bba2-87b996d55900" }, "outputs": [], "source": [ "# Load the Iris dataset\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()" ] }, { "cell_type": "code", "execution_count": null, "id": "baf879b4-3e5e-41b3-94ec-a340af818bb1", "metadata": { "id": "baf879b4-3e5e-41b3-94ec-a340af818bb1" }, "outputs": [], "source": [ "# iris is a dictionary-like object\n", "iris.keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "a46ed8d5-58c7-4710-84fa-1a1dd5053a0d", "metadata": { "id": "a46ed8d5-58c7-4710-84fa-1a1dd5053a0d" }, "outputs": [], "source": [ "# Types of the items\n", "for key in iris.keys():\n", "    print(f\"Type of iris.{key}:\", type(iris[key]))" ] }, { "cell_type": "code", "execution_count": null, "id": "39c2c2c8-8bef-47aa-9681-f6cb990f285d", "metadata": { "id": "39c2c2c8-8bef-47aa-9681-f6cb990f285d" }, "outputs": [], "source": [ "# Description item\n", "print(iris.DESCR)" ] }, { "cell_type": "code", "execution_count": null, "id": "f32df1aa-28ea-42aa-ba0f-57109d83fbe3", "metadata": { "id": "f32df1aa-28ea-42aa-ba0f-57109d83fbe3" }, "outputs": [], "source": [ "# Classes (categories) of the target\n", "print(iris.target_names)" ] }, { "cell_type": "code", "execution_count": null, "id": "1be93d9b-f31c-49fb-8b40-d621349df71c", "metadata": { "id": "1be93d9b-f31c-49fb-8b40-d621349df71c" }, "outputs": [], "source": [ "# Names of the features (predictor variables)\n", "print(iris.feature_names)" ] }, { "cell_type": "markdown", "id": "57d99906-f271-4b76-9837-1556aa2067e8", "metadata": { "id": "57d99906-f271-4b76-9837-1556aa2067e8" }, "source": [ "This dataset can serve as an example of a classification problem. 
We use the features (predictor variables), which form a design matrix $X$, to predict the target variable, denoted $y$.\n", "\n", "In this particular example:\n", "\n", "+ The design matrix $X$ has $D=4$ columns (the variables sepal length, sepal width, petal length, and petal width) and $N=150$ rows (the observations). Hence, for a given observation we write the feature vector as $\\boldsymbol{x}\\in \\mathbb{R}^D$, and the matrix as $X\\in \\mathbb{R}^{N\\times D}$.\n", "+ The target is the species, denoted $y$. Here it can take $C=3$ values (called classes or categories): setosa, versicolor, and virginica.\n", "\n", "In classification problems the target is always a discrete variable. In this particular example, all the features are continuous variables." ] }, { "cell_type": "code", "execution_count": null, "id": "edf0cc38-b9da-47cb-85d3-dc67b3c5e0e0", "metadata": { "id": "edf0cc38-b9da-47cb-85d3-dc67b3c5e0e0" }, "outputs": [], "source": [ "# Extract the NumPy arrays\n", "X = iris.data\n", "y = iris.target" ] }, { "cell_type": "code", "execution_count": null, "id": "b490c886-db17-4a93-9b30-f45af9a18a5f", "metadata": { "id": "b490c886-db17-4a93-9b30-f45af9a18a5f" }, "outputs": [], "source": [ "print(\"Attributes of X\")\n", "print(\n", "'''\\\n", "type: {}\n", "dtype: {}\n", "ndim: {}\n", "shape: {}\n", "size: {}\n", "itemsize: {}\n", "nbytes: {}\\\n", "'''.format(type(X),X.dtype,X.ndim,X.shape,X.size,X.itemsize,X.nbytes)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "27aa2f77-d4a4-4607-bcf4-93f64f87d8f8", "metadata": { "id": "27aa2f77-d4a4-4607-bcf4-93f64f87d8f8" }, "outputs": [], "source": [ "# First 10 rows\n", "X[0:10,:]" ] }, { "cell_type": "code", "execution_count": null, "id": "09e8fe31-c375-450a-800e-21d2b0085b30", "metadata": { "id": "09e8fe31-c375-450a-800e-21d2b0085b30" }, "outputs": [], "source": [ "print(\"Attributes of y\")\n", "print(\n", 
"'''\\\n", "type: {}\n", "dtype: {}\n", "ndim: {}\n", "shape: {}\n", "size: {}\n", "itemsize: {}\n", "nbytes: {}\\\n", "'''.format(type(y),y.dtype,y.ndim,y.shape,y.size,y.itemsize,y.nbytes)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "202f20ed-5794-4c9c-86fb-87a28f6e48fb", "metadata": { "id": "202f20ed-5794-4c9c-86fb-87a28f6e48fb" }, "outputs": [], "source": [ "# First 10 elements\n", "y[0:10]" ] }, { "cell_type": "markdown", "id": "3b3d33b8-0731-4882-9472-188b28cd47f6", "metadata": { "id": "3b3d33b8-0731-4882-9472-188b28cd47f6" }, "source": [ "## Manipulation with pandas" ] }, { "cell_type": "code", "execution_count": null, "id": "3422a277-8aaf-4858-ab6c-f013dac1083b", "metadata": { "id": "3422a277-8aaf-4858-ab6c-f013dac1083b" }, "outputs": [], "source": [ "# Convert to a pandas DataFrame\n", "df = pd.DataFrame(data=X, columns=iris.feature_names)\n", "df['target'] = pd.Series(iris.target_names[y], dtype='category')" ] }, { "cell_type": "code", "execution_count": null, "id": "904cc617-fae1-45f2-a29b-477e404e1359", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "904cc617-fae1-45f2-a29b-477e404e1359", "outputId": "d0570877-f623-4d3b-fb4b-abc9c1dece21" }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Type of the object created\n", "type(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "d6b90637-552c-4a3b-929b-1f1cf14279ff", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d6b90637-552c-4a3b-929b-1f1cf14279ff", "outputId": "faf40140-7e31-4ef1-c322-25f1fee8d488" }, "outputs": [ { "data": { "text/plain": [ "150" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of rows in the DataFrame\n", "len(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "83e487bd-6bab-48ab-97a8-6e5ff402cef6", "metadata": { 
"colab": { "base_uri": "https://localhost:8080/" }, "id": "83e487bd-6bab-48ab-97a8-6e5ff402cef6", "outputId": "eb8e707c-770b-4807-cab3-3f092db487dd" }, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of rows and columns in the DataFrame\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "261281be-f1c6-405f-b1c9-258b7491c510", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 572 }, "id": "261281be-f1c6-405f-b1c9-258b7491c510", "outputId": "723b475a-68cb-49be-bd8c-5f768f800fa6" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa\n", ".. ... ... ... ... ...\n", "145 6.7 3.0 5.2 2.3 virginica\n", "146 6.3 2.5 5.0 1.9 virginica\n", "147 6.5 3.0 5.2 2.0 virginica\n", "148 6.2 3.4 5.4 2.3 virginica\n", "149 5.9 3.0 5.1 1.8 virginica\n", "\n", "[150 rows x 5 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the DataFrame\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "3ef27832-7b45-4f62-9781-af3500b64cdd", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 208 }, "id": "3ef27832-7b45-4f62-9781-af3500b64cdd", "outputId": "1ae2db47-7477-495f-fef3-0814cfd97b19" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First rows of the DataFrame\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": null, "id": "5f20c073-e84c-4e25-81fe-9adfc9cb8772", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 239 }, "id": "5f20c073-e84c-4e25-81fe-9adfc9cb8772", "outputId": "69d68527-64fe-49bc-c04b-87cc74ca0525" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n", "147 6.5 3.0 5.2 2.0 virginica\n", "148 6.2 3.4 5.4 2.3 virginica\n", "149 5.9 3.0 5.1 1.8 virginica" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Last rows of the DataFrame\n", "df.tail(3)" ] }, { "cell_type": "markdown", "id": "5328c218-ef42-469f-bfba-97c4c19e808f", "metadata": { "id": "5328c218-ef42-469f-bfba-97c4c19e808f" }, "source": [ "### Extracting rows and columns" ] }, { "cell_type": "markdown", "id": "f613274d-33bf-4692-a94e-d77c34122b7c", "metadata": { "id": "f613274d-33bf-4692-a94e-d77c34122b7c" }, "source": [ "To select rows and columns from a pandas DataFrame, loc and iloc are two commonly used indexers.\n", "\n", "There is a subtle difference between the two:\n", "\n", "+ **loc** selects rows and columns by their labels\n", "+ **iloc** selects rows and columns by their integer positions\n", "\n", "The following examples show how to use each one in practice."
] }, { "cell_type": "code", "execution_count": null, "id": "3f09a9b2-4e56-4b2e-a643-61d4268b655a", "metadata": { "id": "3f09a9b2-4e56-4b2e-a643-61d4268b655a", "tags": [] }, "outputs": [], "source": [ "# Create a DataFrame\n", "df_ejemplo = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],\n", "    'points': [5, 7, 7, 9, 12, 9, 9, 4],\n", "    'assists': [11, 8, 10, 6, 6, 5, 9, 12]},\n", "    index=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])\n", "\n", "# Display it\n", "df_ejemplo" ] }, { "cell_type": "code", "execution_count": null, "id": "d48f471c-4e60-48cd-92d1-d84fbb2b951c", "metadata": { "id": "d48f471c-4e60-48cd-92d1-d84fbb2b951c" }, "outputs": [], "source": [ "df_ejemplo.index" ] }, { "cell_type": "markdown", "id": "b39ec565-1a1d-4c31-a0ed-0806e59c8a58", "metadata": { "id": "b39ec565-1a1d-4c31-a0ed-0806e59c8a58" }, "source": [ "We can use loc to select specific rows of the DataFrame by their index labels:" ] }, { "cell_type": "code", "execution_count": null, "id": "49147d11-367d-4ab0-8edd-8d6dd1a163d4", "metadata": { "id": "49147d11-367d-4ab0-8edd-8d6dd1a163d4" }, "outputs": [], "source": [ "# Select the rows with index labels 'E' and 'F'\n", "df_ejemplo.loc[['E', 'F']]" ] }, { "cell_type": "markdown", "id": "7725c4e8-7537-4cd6-a9f2-473ae26ccfa6", "metadata": { "id": "7725c4e8-7537-4cd6-a9f2-473ae26ccfa6" }, "source": [ "We can use loc to select specific rows and columns of the DataFrame by their labels:" ] }, { "cell_type": "code", "execution_count": null, "id": "7c8e9090-c4cb-44a4-affa-efda9c0fb608", "metadata": { "id": "7c8e9090-c4cb-44a4-affa-efda9c0fb608" }, "outputs": [], "source": [ "# Select rows 'E' and 'F' and columns 'team' and 'assists'\n", "df_ejemplo.loc[['E', 'F'], ['team', 'assists']]" ] }, { "cell_type": "markdown", "id": "7005abc6-0192-488b-8290-63e5ea77c6b2", "metadata": { "id": "7005abc6-0192-488b-8290-63e5ea77c6b2" }, "source": [ "We can use loc with the : argument to 
select ranges of rows and columns by their labels:" ] }, { "cell_type": "code", "execution_count": null, "id": "bf1e906d-ca14-47ef-9e18-29f1fcd902bc", "metadata": { "id": "bf1e906d-ca14-47ef-9e18-29f1fcd902bc" }, "outputs": [], "source": [ "# Select rows 'E' through 'H' and columns 'team' through 'points' (label slices include both endpoints)\n", "df_ejemplo.loc['E': , :'points']" ] }, { "cell_type": "markdown", "id": "0046eb47-a02e-4c82-b722-e635014977b1", "metadata": { "id": "0046eb47-a02e-4c82-b722-e635014977b1" }, "source": [ "We can use iloc to select specific rows of the DataFrame by their position:" ] }, { "cell_type": "code", "execution_count": null, "id": "90531f0d-4ec8-4c0f-98c3-d7c5c0b2b9dd", "metadata": { "id": "90531f0d-4ec8-4c0f-98c3-d7c5c0b2b9dd" }, "outputs": [], "source": [ "# Select the rows at positions 4 and 5 (the stop position 6 is excluded)\n", "df_ejemplo.iloc[4:6]" ] }, { "cell_type": "markdown", "id": "298e8606-72d0-4876-afbe-f6e362f40db8", "metadata": { "id": "298e8606-72d0-4876-afbe-f6e362f40db8" }, "source": [ "We can use iloc to select specific rows and specific columns of the DataFrame by their positions:" ] }, { "cell_type": "code", "execution_count": null, "id": "c726c8e2-b509-42ec-820b-a2bea61b9327", "metadata": { "id": "c726c8e2-b509-42ec-820b-a2bea61b9327" }, "outputs": [], "source": [ "# Select the rows at positions 4 and 5 and the columns at positions 0 and 1\n", "df_ejemplo.iloc[4:6, 0:2]" ] }, { "cell_type": "markdown", "id": "24ee9447-a718-436e-98d6-6e2809aa4905", "metadata": { "id": "24ee9447-a718-436e-98d6-6e2809aa4905" }, "source": [ "Other examples" ] }, { "cell_type": "code", "execution_count": null, "id": "dfd9f0c9-04a5-492f-b9e1-b207feca2705", "metadata": { "id": "dfd9f0c9-04a5-492f-b9e1-b207feca2705" }, "outputs": [], "source": [ "# Select the rows at positions 2 and 3 (the stop position 4 is excluded)\n", "df.iloc[2:4]" ] }, { "cell_type": "code", "execution_count": null, "id": "35a37a11-a3e1-40dd-9d12-0442e4f330d0", "metadata": { "id": "35a37a11-a3e1-40dd-9d12-0442e4f330d0" }, 
"outputs": [], "source": [ "# Select the rows labeled 42, 43, 44, and 45 (loc slices include both endpoints)\n", "df.loc[42:45]" ] }, { "cell_type": "code", "execution_count": null, "id": "b5eaca2f-e45e-4500-81b6-3ebadb8310dd", "metadata": { "id": "b5eaca2f-e45e-4500-81b6-3ebadb8310dd" }, "outputs": [], "source": [ "# Select the rows with an even index, using a lambda expression\n", "df.loc[lambda x: x.index % 2 == 0]" ] }, { "cell_type": "code", "execution_count": null, "id": "78115e7f-2813-4874-8a75-5a2893f1f735", "metadata": { "id": "78115e7f-2813-4874-8a75-5a2893f1f735" }, "outputs": [], "source": [ "# Select columns by name\n", "df[[iris.feature_names[1],iris.feature_names[3]]]" ] }, { "cell_type": "code", "execution_count": null, "id": "05a55d90-1ea4-4fd2-81e8-cd2b82ea5463", "metadata": { "id": "05a55d90-1ea4-4fd2-81e8-cd2b82ea5463" }, "outputs": [], "source": [ "# Select two columns of the row labeled 2\n", "df.loc[2, [iris.feature_names[1], iris.feature_names[3]]]" ] }, { "cell_type": "markdown", "id": "b20b8999-1623-462f-94cf-8233c6585ffc", "metadata": { "id": "b20b8999-1623-462f-94cf-8233c6585ffc" }, "source": [ "## Exploratory analysis" ] }, { "cell_type": "markdown", "id": "a9e2313a-cb62-4a39-a557-4f7fcc6b5b35", "metadata": { "id": "a9e2313a-cb62-4a39-a557-4f7fcc6b5b35" }, "source": [ "### Numerical summary of the distributions" ] }, { "cell_type": "markdown", "id": "18e0abf9-df40-44e5-9f3b-28bcdfda804b", "metadata": { "id": "18e0abf9-df40-44e5-9f3b-28bcdfda804b" }, "source": [ "We start the analysis with a few numerical summaries that give a quick idea of what the variables in our dataset look like."
] }, { "cell_type": "code", "execution_count": null, "id": "fdccb297-cd5f-4f33-aecb-f8f277f048bb", "metadata": { "id": "fdccb297-cd5f-4f33-aecb-f8f277f048bb" }, "outputs": [], "source": [ "# List the variables of the DataFrame and their characteristics\n", "df.info()" ] }, { "cell_type": "markdown", "id": "e58956d6-64f7-4091-8552-e53e40ed2b0d", "metadata": { "id": "e58956d6-64f7-4091-8552-e53e40ed2b0d" }, "source": [ "Check for missing data:" ] }, { "cell_type": "code", "execution_count": null, "id": "20f89c2e-55cc-444c-92fd-d0129f670baa", "metadata": { "id": "20f89c2e-55cc-444c-92fd-d0129f670baa" }, "outputs": [], "source": [ "# Fraction of null values per column (multiply by 100 for a percentage)\n", "df.isnull().mean()" ] }, { "cell_type": "markdown", "id": "69fabd28-aa15-41b5-9b4c-5cce01f26b80", "metadata": { "id": "69fabd28-aa15-41b5-9b4c-5cce01f26b80" }, "source": [ "Numerical summary of the marginal distribution of the columns of $X$:" ] }, { "cell_type": "code", "execution_count": null, "id": "7e9e2136-939e-4889-86ec-00b373191ee6", "metadata": { "id": "7e9e2136-939e-4889-86ec-00b373191ee6" }, "outputs": [], "source": [ "# Marginal distribution of the features\n", "# Descriptive statistics\n", "# Mean, standard deviation, quartiles\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "07b4a75f-3fab-401a-9d47-6a97a77aacf1", "metadata": { "id": "07b4a75f-3fab-401a-9d47-6a97a77aacf1" }, "outputs": [], "source": [ "# We can customize the numerical summary if desired\n", "df.agg({'sepal length (cm)':[\"min\", \"max\", \"median\", \"skew\"]})" ] }, { "cell_type": "markdown", "id": "6142f384-2865-42c7-bc34-1c18ff42da3b", "metadata": { "id": "6142f384-2865-42c7-bc34-1c18ff42da3b" }, "source": [ "Numerical summary of the marginal distribution of the target $y$:" ] }, { "cell_type": "code", "execution_count": null, "id": "585923f7-b374-4919-9224-53f0d83af5f8", "metadata": { "id": "585923f7-b374-4919-9224-53f0d83af5f8" }, "outputs": [], "source": [ "# Unique values of the 
categorical variable\n", "df[\"target\"].unique()" ] }, { "cell_type": "code", "execution_count": null, "id": "ae6962ff-6eea-4a78-bec2-f20476714c25", "metadata": { "id": "ae6962ff-6eea-4a78-bec2-f20476714c25" }, "outputs": [], "source": [ "# Marginal distribution of the target as counts\n", "df[\"target\"].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "9376bb9b-eb53-4e23-9bd7-7d73d1dfdcb6", "metadata": { "id": "9376bb9b-eb53-4e23-9bd7-7d73d1dfdcb6" }, "outputs": [], "source": [ "# Marginal distribution of the target as proportions\n", "df[\"target\"].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "id": "50623b5a-06db-4668-8ef2-6ea3f510d0ea", "metadata": { "id": "50623b5a-06db-4668-8ef2-6ea3f510d0ea" }, "source": [ "The next step is to see how the variables in $X$ relate to $y$. The overall relationship among all the variables is given by the joint density $p(\\boldsymbol{x},y)$." ] }, { "cell_type": "markdown", "id": "3fd28fd0-b4b0-4238-8751-769151cb3121", "metadata": { "id": "3fd28fd0-b4b0-4238-8751-769151cb3121" }, "source": [ "This joint density is hard to summarize numerically because it is neither purely continuous nor purely discrete: $\\boldsymbol{x}$ is continuous while $y$ is discrete. We can still get a sense of it by looking at summaries of the conditional distributions."
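, "

One reason conditional summaries are informative is the factorization $p(x,y)=p(x|y)p(y)$: the conditional distributions together with the class frequencies determine the joint. The sketch below illustrates this on a small hypothetical DataFrame (the name `toy` and its values are made up for illustration) by checking the law of total expectation, that is, the overall mean equals the class-frequency-weighted average of the per-class means.

```python
import pandas as pd

# Hypothetical toy data: one continuous feature x and a discrete label y.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 10.0, 12.0],
    'y': ['a', 'a', 'a', 'b', 'b'],
})

# Summaries of the conditional distributions p(x|y), one row per class.
conditional = toy.groupby('y')['x'].agg(['mean', 'median'])

# Estimated class probabilities p(y).
p_y = toy['y'].value_counts(normalize=True)

# Law of total expectation: E[x] = sum over classes of p(y) * E[x|y].
# The multiplication aligns the two Series on the class labels.
recombined_mean = (p_y * conditional['mean']).sum()

print(conditional)
print(recombined_mean, toy['x'].mean())
```

The two printed means should agree up to floating-point rounding, which is a quick sanity check that the per-class summaries and the class frequencies are consistent with the marginal."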
] }, { "cell_type": "markdown", "id": "2cc8eb00-011f-4791-85fe-09339fc94256", "metadata": { "id": "2cc8eb00-011f-4791-85fe-09339fc94256" }, "source": [ "Let us start with the simplest one, $p(\\boldsymbol{x}|y)$:" ] }, { "cell_type": "code", "execution_count": null, "id": "143f5834-9578-4596-bb13-e6b977b03f7b", "metadata": { "id": "143f5834-9578-4596-bb13-e6b977b03f7b" }, "outputs": [], "source": [ "df.groupby(\"target\").aggregate(func=[\"min\", \"median\", \"max\"])" ] }, { "cell_type": "markdown", "id": "8d3f7016-931f-402e-919a-a1bc1efcbf49", "metadata": { "id": "8d3f7016-931f-402e-919a-a1bc1efcbf49" }, "source": [ "For example, we see a strong dependence between $\\boldsymbol{x}$ and $y$, since the medians vary considerably when conditioning on different values of $y$ (the different species)." ] }, { "cell_type": "markdown", "id": "335b9cb1-366c-41c8-afb5-e257b9b568e6", "metadata": { "id": "335b9cb1-366c-41c8-afb5-e257b9b568e6" }, "source": [ "The dependence among the components of $\\boldsymbol{x}$ may also be of interest. A simple numerical summary that gives a good idea of their relationship is the correlation matrix." ] }, { "cell_type": "markdown", "id": "9b67e267-e001-4de4-b245-dcb25642bb81", "metadata": { "id": "9b67e267-e001-4de4-b245-dcb25642bb81" }, "source": [ "#### Correlation" ] }, { "cell_type": "markdown", "id": "d889fcb5-e43b-4da9-9073-7104788aed43", "metadata": { "id": "d889fcb5-e43b-4da9-9073-7104788aed43" }, "source": [ "Correlation is a measure of the relationship between two variables. 
In statistics, it is used to assess the linear relationship between two continuous variables.\n", "\n", "The correlation ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation between the variables." ] }, { "cell_type": "markdown", "id": "e16c83b6-4588-49da-9c97-44e2b7bb930c", "metadata": { "id": "e16c83b6-4588-49da-9c97-44e2b7bb930c" }, "source": [ "The .corr() method computes the pairwise correlation of the columns of the DataFrame. Null values are excluded. In pandas 2.0 and later, non-numeric columns are excluded only when numeric_only=True is passed, which is why we select the four numeric columns explicitly." ] }, { "cell_type": "code", "execution_count": null, "id": "1ec36fa8-171e-4cb6-84de-e7256c21242d", "metadata": { "id": "1ec36fa8-171e-4cb6-84de-e7256c21242d" }, "outputs": [], "source": [ "df.iloc[:,0:4].corr()" ] }, { "cell_type": "markdown", "id": "081d07b6-68ae-4478-820d-54232e103cde", "metadata": { "id": "081d07b6-68ae-4478-820d-54232e103cde" }, "source": [ "It can be easier to read with colors:" ] }, { "cell_type": "code", "execution_count": null, "id": "356ef6f8-b6cc-46d8-bf7c-760ed878e425", "metadata": { "id": "356ef6f8-b6cc-46d8-bf7c-760ed878e425" }, "outputs": [], "source": [ "# Correlation heatmap\n", "cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)\n", "sns.heatmap(df.iloc[:,0:4].corr(), annot=True,cmap=cmap, center=0).set_title(\"Correlation Heatmap\", fontsize=16)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "50bfba47-01d3-44ee-85f9-2590ea6ae2c9", "metadata": { "id": "50bfba47-01d3-44ee-85f9-2590ea6ae2c9" }, "source": [ "This information is only partial, however. For example, $x_1=$ sepal length and $x_2=$ sepal width appear to be only weakly related (a low correlation of -0.12). 
However, this can change drastically when conditioning on the species:" ] }, { "cell_type": "code", "execution_count": null, "id": "53889107-99b8-40a6-b504-0a08ab73ef0f", "metadata": { "id": "53889107-99b8-40a6-b504-0a08ab73ef0f" }, "outputs": [], "source": [ "df.groupby(\"target\").corr()" ] }, { "cell_type": "code", "execution_count": null, "id": "4aae0292-1849-46e9-848a-57e5fb2d2e86", "metadata": { "id": "4aae0292-1849-46e9-848a-57e5fb2d2e86" }, "outputs": [], "source": [ "# Correlation heatmap by species\n", "cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)\n", "sns.heatmap(df.groupby(\"target\").corr(), annot=True,cmap=cmap, center=0).set_title(\"Correlation Heatmap\", fontsize=16)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "2aec627e-941e-47ec-874d-377b3b59c8e8", "metadata": { "id": "2aec627e-941e-47ec-874d-377b3b59c8e8" }, "source": [ "### Univariate analysis" ] }, { "cell_type": "markdown", "id": "9c89732f-2bc8-48d8-9c06-64a2d20a6758", "metadata": { "id": "9c89732f-2bc8-48d8-9c06-64a2d20a6758" }, "source": [ "We now take a closer look at the statistical information we can obtain from each variable separately (its marginal distribution). This time we will also use visualizations to make the task easier." ] }, { "cell_type": "markdown", "id": "30f22c45-6904-4b44-be70-a6eee381da70", "metadata": { "id": "30f22c45-6904-4b44-be70-a6eee381da70" }, "source": [ "#### Density" ] }, { "cell_type": "markdown", "id": "84279fbf-5b47-457c-a38c-9eacdd48c5cf", "metadata": { "id": "84279fbf-5b47-457c-a38c-9eacdd48c5cf" }, "source": [ "The probability density function (PDF) describes the distribution of a continuous random variable. In probability theory and statistics, the PDF of a continuous random variable is used to describe the probability that a value falls within a given interval."
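, "

As a worked illustration, the probability of an interval is the area under the density over that interval. The sketch below assumes a standard normal density purely for illustration (it is not a model fitted to the iris data), and the helpers `normal_pdf` and `interval_probability` are hypothetical names introduced here. It approximates the area with a midpoint Riemann sum and compares it with the closed form based on the error function.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of a normal distribution with mean mu and standard deviation sigma.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def interval_probability(pdf, a, b, steps=10_000):
    # Midpoint Riemann sum of the density over the interval [a, b].
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# Probability that a standard normal variable falls between -1 and 1.
prob = interval_probability(normal_pdf, -1.0, 1.0)

# Closed form for the same probability, using the error function.
exact = math.erf(1 / math.sqrt(2))

print(prob, exact)
```

Both values should be close to 0.68, the familiar one-standard-deviation probability of the normal distribution."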
] }, { "cell_type": "markdown", "id": "425bdbaa-ccef-482e-bbc9-e6bba3d6693e", "metadata": { "id": "425bdbaa-ccef-482e-bbc9-e6bba3d6693e" }, "source": [ "Recall the definition: $p(x)dx$ is the probability that the variable lies in an infinitesimal interval of length $dx$ centered at $x$. In particular, the density $p(x)$ does not measure probability; it measures probability per unit of $x$. In this particular example we could say %/cm." ] }, { "cell_type": "markdown", "id": "e3692d3f-0002-43a5-b677-c8213849f567", "metadata": { "id": "e3692d3f-0002-43a5-b677-c8213849f567" }, "source": [ "Given a dataset, one way to estimate the density is with a histogram. A slightly more sophisticated way is kernel density estimation (KDE). Both are usually drawn together, histogram and KDE, in the same figure:" ] }, { "cell_type": "code", "execution_count": null, "id": "9388a65d-591c-4b0e-8c33-a110b604c072", "metadata": { "id": "9388a65d-591c-4b0e-8c33-a110b604c072" }, "outputs": [], "source": [ "# Plot the estimated probability density function (PDF) of a continuous variable\n", "# displot creates its own figure, so its size is set with the height argument\n", "sns.displot(data=df, x=iris.feature_names[0], stat=\"density\", kde=True, height=3)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "5cdf01ae-53c5-4d1a-97b8-e4afa535b81d", "metadata": { "id": "5cdf01ae-53c5-4d1a-97b8-e4afa535b81d" }, "source": [ "In mathematical notation, the plot above estimates the (marginal) probability density of the variable $x_1=$ sepal length, denoted $p(x_1)$."
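, "

To show what the two estimators compute, here is a minimal sketch in NumPy. The sample is synthetic (drawn from a normal distribution as a stand-in for one feature, an assumption made only so the block is self-contained), and the bandwidth follows Scott's rule of thumb, which may differ from the bandwidth seaborn picks. The function `kde` is a hand-written illustration, not seaborn's implementation.

```python
import numpy as np

# Synthetic sample standing in for one continuous feature (values in cm).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Histogram estimate: with density=True the bar areas add up to 1.
density, edges = np.histogram(sample, bins=10, density=True)
total_area = np.sum(density * np.diff(edges))

# Gaussian kernel density estimate written out by hand.
# Bandwidth from Scott's rule of thumb: std * n^(-1/5).
n = sample.size
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)

def kde(x):
    # Average of Gaussian bumps centered at each observation.
    z = (x - sample[:, None]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=0) / (n * bandwidth * np.sqrt(2 * np.pi))

# Integrate the KDE numerically on a fine grid that covers the tails.
grid = np.linspace(sample.min() - 3, sample.max() + 3, 2001)
kde_area = np.sum(kde(grid)) * (grid[1] - grid[0])

print(total_area, kde_area)
```

Both areas should be very close to 1, as expected of a density. Individual values of `density` can exceed 1, since they are probabilities per cm rather than probabilities."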
] }, { "cell_type": "markdown", "id": "a03e11df-9ea3-4f4d-8670-965890557030", "metadata": { "id": "a03e11df-9ea3-4f4d-8670-965890557030" }, "source": [ "We can draw several subplots with the histograms of interest:" ] }, { "cell_type": "code", "execution_count": null, "id": "7aaa9285-b3c0-4a21-ae0d-ef301f65dd33", "metadata": { "id": "7aaa9285-b3c0-4a21-ae0d-ef301f65dd33" }, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))\n", "\n", "# One density histogram with KDE per feature\n", "for name, ax in zip(iris.feature_names, axes.flat):\n", "    sns.histplot(data=df, x=name, stat=\"density\", kde=True, ax=ax)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "35930771-e935-42c7-bff5-0371fdc8b7f6", "metadata": { "id": "35930771-e935-42c7-bff5-0371fdc8b7f6" }, "source": [ "Or even plot them all together:" ] }, { "cell_type": "code", "execution_count": null, "id": "b4fcdb1c-ce3f-4859-84ac-d24d0228f5b0", "metadata": { "id": "b4fcdb1c-ce3f-4859-84ac-d24d0228f5b0" }, "outputs": [], "source": [ "sns.displot(df, element=\"step\", stat=\"density\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "7fe6e5ae-5fd0-4a01-b2ea-c2d2e81c0c8f", "metadata": { "id": "7fe6e5ae-5fd0-4a01-b2ea-c2d2e81c0c8f" }, "source": [ "But in that case it is better to plot only the density:" ] }, { "cell_type": "code", "execution_count": null, "id": "07c9a2d0-14a1-4998-bd56-8fc100977836", "metadata": { "id": "07c9a2d0-14a1-4998-bd56-8fc100977836" }, "outputs": [], "source": [ 
"sns.displot(df, kind=\"kde\", fill=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "2ea5ddcf-31d9-4112-9a09-8b284535169f", "metadata": { "id": "2ea5ddcf-31d9-4112-9a09-8b284535169f" }, "source": [ "Sometimes a boxplot is enough:" ] }, { "cell_type": "code", "execution_count": null, "id": "51c10007-1824-4756-af4d-8f3fccd707e3", "metadata": { "id": "51c10007-1824-4756-af4d-8f3fccd707e3" }, "outputs": [], "source": [ "# Boxplot of sepal length\n", "plt.figure(figsize=(5,2))\n", "sns.boxplot(x = iris.feature_names[0], data = df)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "80884676-ac3f-4e50-9832-8f49dea4e54b", "metadata": { "id": "80884676-ac3f-4e50-9832-8f49dea4e54b" }, "source": [ "Or both:" ] }, { "cell_type": "code", "execution_count": null, "id": "029f403e-53a9-4fcb-a5e8-ccd5b1c48794", "metadata": { "id": "029f403e-53a9-4fcb-a5e8-ccd5b1c48794" }, "outputs": [], "source": [ "fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={\"height_ratios\": (.15, .85)})\n", "\n", "sns.boxplot(data=df, x=iris.feature_names[0], ax=ax_box)\n", "sns.histplot(data=df, x=iris.feature_names[0], ax=ax_hist, stat=\"density\", kde=True)\n", "\n", "# Remove the x-axis label from the boxplot\n", "ax_box.set(xlabel='')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ee8df5da-6ef5-4d2c-9dc3-afb032e59d8c", "metadata": { "id": "ee8df5da-6ef5-4d2c-9dc3-afb032e59d8c" }, "source": [ "The boxplot is very useful for quick comparisons between distributions:" ] }, { "cell_type": "code", "execution_count": null, "id": "3997b211-e4ea-46d2-a707-58ad71b69ef0", "metadata": { "id": "3997b211-e4ea-46d2-a707-58ad71b69ef0" }, "outputs": [], "source": [ "sns.boxplot(data=df)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "28940252-5690-4991-ac08-4fe355c34745", "metadata": { "id": "28940252-5690-4991-ac08-4fe355c34745" }, "source": [ "For discrete variables, such as the target $y$ in this example, a bar chart is a better choice. 
This plot approximates the probability mass function (PMF), given by $p(y)=P(Y=y)$." ] }, { "cell_type": "code", "execution_count": null, "id": "5b45d317-0c89-4568-9f7d-d3817d69ef73", "metadata": { "id": "5b45d317-0c89-4568-9f7d-d3817d69ef73" }, "outputs": [], "source": [ "# Frequency plot\n", "sns.countplot(data=df, x='target')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "03113aad-913f-46ff-a7b8-911176a9ba8b", "metadata": { "id": "03113aad-913f-46ff-a7b8-911176a9ba8b" }, "source": [ "#### Mean" ] }, { "cell_type": "code", "execution_count": null, "id": "15be6e6c-814c-4e85-b5cd-cb09baa4186b", "metadata": { "id": "15be6e6c-814c-4e85-b5cd-cb09baa4186b" }, "outputs": [], "source": [ "# Sample mean of a variable (an estimate of its expected value)\n", "mean_sl = df[iris.feature_names[0]].mean()\n", "print(f\"The mean sepal length is {mean_sl:.2f} cm\")" ] }, { "cell_type": "markdown", "id": "3ddc0b07-c8d3-46c1-8f69-9960345e8192", "metadata": { "id": "3ddc0b07-c8d3-46c1-8f69-9960345e8192" }, "source": [ "#### Median" ] }, { "cell_type": "markdown", "id": "e8cd1696-9995-4cae-a8a1-de3b5cfa2f34", "metadata": { "id": "e8cd1696-9995-4cae-a8a1-de3b5cfa2f34" }, "source": [ "It is the value that splits the data set into two equal parts: half of the observations lie above the median and the other half lie below it.\n", "\n", "For example, given five numbers sorted in ascending order (1, 3, 5, 7, 9), the median is the value in the middle position, 5. Two of the numbers are smaller than 5 and two are larger." 
] }, { "cell_type": "code", "execution_count": null, "id": "75552f1b-06c3-4def-844a-6ba4f8b23ee5", "metadata": { "id": "75552f1b-06c3-4def-844a-6ba4f8b23ee5" }, "outputs": [], "source": [ "# Mediana\n", "median_sl = df[iris.feature_names[0]].median()\n", "print(f\"La mediana de sepal length es {median_sl:.2f} cm\")" ] }, { "cell_type": "markdown", "id": "70059edd-41d1-4bdf-a168-8b0c39905d27", "metadata": { "id": "70059edd-41d1-4bdf-a168-8b0c39905d27" }, "source": [ "#### Varianza" ] }, { "cell_type": "markdown", "id": "58b1d9ea-f363-4f1b-9c5b-9f3e6fb87fd8", "metadata": { "id": "58b1d9ea-f363-4f1b-9c5b-9f3e6fb87fd8" }, "source": [ "Es una medida de cuánto se dispersan los valores de una variable aleatoria alrededor de su media. Es una medida de la variabilidad de una distribución de probabilidad." ] }, { "cell_type": "code", "execution_count": null, "id": "9dca43f6-ad30-4b40-9d80-9194a3f451a0", "metadata": { "id": "9dca43f6-ad30-4b40-9d80-9194a3f451a0" }, "outputs": [], "source": [ "var_sl = df[iris.feature_names[0]].var()\n", "print(f\"La varianza de sepal length es {var_sl:.2f} cm^2\")" ] }, { "cell_type": "markdown", "id": "d3d67c38-e9ed-4855-a4f7-0688cc6ef690", "metadata": { "id": "d3d67c38-e9ed-4855-a4f7-0688cc6ef690" }, "source": [ "El desvío estandar es la raíz de la varianza:" ] }, { "cell_type": "code", "execution_count": null, "id": "4f1fda9a-f7f9-4e49-89e2-a2b4c21de68b", "metadata": { "id": "4f1fda9a-f7f9-4e49-89e2-a2b4c21de68b" }, "outputs": [], "source": [ "# Desviación estandar\n", "df[iris.feature_names[0]].std()" ] }, { "cell_type": "markdown", "id": "5966a068-7d31-45f0-bc44-6f7df6d65057", "metadata": { "id": "5966a068-7d31-45f0-bc44-6f7df6d65057" }, "source": [ "### Análisis multivariado" ] }, { "cell_type": "markdown", "id": "6d43f326-c0da-4ddd-a298-96ecea782c1a", "metadata": { "id": "6d43f326-c0da-4ddd-a298-96ecea782c1a" }, "source": [ "El objetivo del análisis multivariado es entender las relaciones entre las diferentes variables." 
] }, { "cell_type": "markdown", "id": "4fb40c32-a6ad-432f-80d1-1d8d9a9795b1", "metadata": { "id": "4fb40c32-a6ad-432f-80d1-1d8d9a9795b1" }, "source": [ "Por ejemplo, al igual que hicimos con los resúmenes numéricos, podemos empezar analizando la distribución condicional $p(\\boldsymbol{x}|y)$:" ] }, { "cell_type": "code", "execution_count": null, "id": "9299bb30-2a80-439b-b47c-9d8442ad95c8", "metadata": { "id": "9299bb30-2a80-439b-b47c-9d8442ad95c8" }, "outputs": [], "source": [ "sns.displot(df, x=iris.feature_names[0], col=\"target\", stat=\"density\", kde=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "6b952bb2-9170-4d84-84ee-d0b628ed5386", "metadata": { "id": "6b952bb2-9170-4d84-84ee-d0b628ed5386" }, "source": [ "Si no nos gustan los histogramas podemos hacer un stripplot:" ] }, { "cell_type": "code", "execution_count": null, "id": "941c34f9-d932-414e-b36a-e293746543cc", "metadata": { "id": "941c34f9-d932-414e-b36a-e293746543cc" }, "outputs": [], "source": [ "col = iris.feature_names[0]\n", "sns.stripplot(y=\"target\", x=col, data=df, jitter=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "64fdebc2-055a-435b-b583-941dd1505352", "metadata": { "id": "64fdebc2-055a-435b-b583-941dd1505352" }, "source": [ "O mejor aún un boxplot:" ] }, { "cell_type": "code", "execution_count": null, "id": "45f66a96-0753-4008-ad50-923d6c9ad4f5", "metadata": { "id": "45f66a96-0753-4008-ad50-923d6c9ad4f5" }, "outputs": [], "source": [ "# Boxplot\n", "col = iris.feature_names[0]\n", "sns.boxplot(x = col, y = 'target', data = df)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "8daf0ed1-468d-4785-882d-fa70de7250d9", "metadata": { "id": "8daf0ed1-468d-4785-882d-fa70de7250d9" }, "source": [ "Para hacerlo con todas las variables juntas debemos modificar el formato del dataset (long format):" ] }, { "cell_type": "code", "execution_count": null, "id": "798f51fd-801f-4362-b2e8-d007f652d8b6", "metadata": { "id": "798f51fd-801f-4362-b2e8-d007f652d8b6" }, 
"outputs": [], "source": [ "df_long = pd.melt(df, \"target\", var_name=\"a\", value_name=\"c\")\n", "display(df_long.head())" ] }, { "cell_type": "code", "execution_count": null, "id": "a6082141-50a4-4d5e-957d-d367bba8430a", "metadata": { "id": "a6082141-50a4-4d5e-957d-d367bba8430a" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "c8208abf-47fb-4842-b86b-a363cd91dc65", "metadata": { "id": "c8208abf-47fb-4842-b86b-a363cd91dc65" }, "outputs": [], "source": [ "plt.figure(figsize=(8,4))\n", "sns.boxplot(x=\"a\", hue=\"target\", y=\"c\", data=df_long)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f61bbf7b-54d6-4ba7-801f-a2e481fb3fec", "metadata": { "id": "f61bbf7b-54d6-4ba7-801f-a2e481fb3fec" }, "source": [ "Con pandas directamente es un poco más sencillo:" ] }, { "cell_type": "code", "execution_count": null, "id": "f612c811-3e0f-493b-9a75-e5518eb322ac", "metadata": { "id": "f612c811-3e0f-493b-9a75-e5518eb322ac" }, "outputs": [], "source": [ "# Boxplot\n", "df.plot.box(by=\"target\",rot=90,figsize=(12, 6))\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "9a69e53b-1cb9-4efa-9a3e-82717419c4a9", "metadata": { "id": "9a69e53b-1cb9-4efa-9a3e-82717419c4a9" }, "source": [ "La relación entre dos variables predictoras podemos visualizarla con un scatter plot decorado con las curvas de nivel de la densidad conjunta:" ] }, { "cell_type": "code", "execution_count": null, "id": "3f94d0d9-b795-4827-ba93-727271114701", "metadata": { "id": "3f94d0d9-b795-4827-ba93-727271114701" }, "outputs": [], "source": [ "g = sns.jointplot(\n", " data=df,\n", " x=iris.feature_names[0],\n", " y=iris.feature_names[1],\n", " kind=\"kde\",\n", " fill=True,\n", " alpha=0.4\n", ")\n", "g.plot_joint(plt.scatter, c=\"w\", s=30, linewidth=1, marker=\"+\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "244d20ae-6a9b-4848-b9ce-37809f6f717f", "metadata": { "id": "244d20ae-6a9b-4848-b9ce-37809f6f717f" }, "source": [ "Podemos 
generalize this by plotting the conditional distribution of two predictor variables given the target:" ] }, { "cell_type": "code", "execution_count": null, "id": "a31cf9cd-af54-42d0-867b-a6d5c26dd5be", "metadata": { "id": "a31cf9cd-af54-42d0-867b-a6d5c26dd5be" }, "outputs": [], "source": [ "sns.jointplot(\n", " data=df,\n", " x=iris.feature_names[0],\n", " y=iris.feature_names[1],\n", " hue=\"target\",\n", " kind=\"kde\",\n", " fill=True,\n", " alpha=0.4\n", ")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "9fc4aa56-4db2-4c62-bf9e-6696eaf611e0", "metadata": { "id": "9fc4aa56-4db2-4c62-bf9e-6696eaf611e0" }, "source": [ "We can also use a scatter plot with marginal boxplots:" ] }, { "cell_type": "code", "execution_count": null, "id": "107d5e77-737e-4c26-8e9d-d1c4d19ef9be", "metadata": { "id": "107d5e77-737e-4c26-8e9d-d1c4d19ef9be" }, "outputs": [], "source": [ "g = sns.JointGrid(data=df, x=iris.feature_names[0], y=iris.feature_names[1], hue=\"target\")\n", "g.plot_joint(sns.scatterplot)\n", "g.plot_marginals(sns.boxplot)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3941e59e-a88c-44cc-8041-084be185f936", "metadata": { "id": "3941e59e-a88c-44cc-8041-084be185f936" }, "source": [ "If there are not too many predictor variables, we can draw all the pairwise scatter plots at once:" ] }, { "cell_type": "code", "execution_count": null, "id": "9aee928c-2da5-4060-b97d-340b2b7493ff", "metadata": { "id": "9aee928c-2da5-4060-b97d-340b2b7493ff" }, "outputs": [], "source": [ "sns.pairplot(df, hue=\"target\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "1f23477d-076e-4dbc-a5ba-a287e81ca18b", "metadata": { "id": "1f23477d-076e-4dbc-a5ba-a287e81ca18b" }, "outputs": [], "source": [ "sns.pairplot(df, hue=\"target\", kind=\"kde\")\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" }, "colab": { "provenance": [] } }, "nbformat": 4, "nbformat_minor": 5 }