{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "pOVfeqL80GN5" }, "source": [ "## Laboratorio LSTM\n", "\n", "En este laboratorio utilizaremos Redes LSTM para aprender a clasificar Tweets según su sentimiento (positivo/negativo/neutro).\n", "\n", "Trabajaremos sobre el dataset del challenge [TASS 2020](http://tass.sepln.org/2020/). Este dataset contiene tweets en español anotados con su respectiva polaridad de sentimiento (P: positivo, N: negativo, NEU: neutro).\n", "\n", "En particular utilizaremos los tweets correspondientes a Uruguay que fueron anotados por el Grupo de PLN de la Facultad de Ingeniería." ] }, { "cell_type": "markdown", "metadata": { "id": "vTYw7xBW1Gm5" }, "source": [ "## Preparación de los datos\n", "\n", "Para empezar importaremos el dataset a utilizar y visualezaremos la estructura de los datos.\n" ] }, { "cell_type": "code", "metadata": { "id": "53NVxACD1plD" }, "source": [ "import os\n", "import pandas as pd\n", "\n", "# Descomprimimos el archivo y verificamos\n", "! wget https://eva.fing.edu.uy/pluginfile.php/357781/mod_folder/content/0/tass2020.tar.gz\n", "! 
tar -zxf tass2020.tar.gz\n", "assert os.path.isfile('tass2020/train/uy.tsv'), 'File not found; make sure it was downloaded'\n", "assert os.path.isfile('tass2020/dev/uy.tsv'), 'File not found; make sure it was downloaded'\n", "assert os.path.isfile('tass2020/test/uy.tsv'), 'File not found; make sure it was downloaded'\n", "\n", "# Name the columns\n", "data_train = pd.read_csv('tass2020/train/uy.tsv', sep='\\t', names=['id', 'text', 'polarity'])\n", "data_dev = pd.read_csv('tass2020/dev/uy.tsv', sep='\\t', names=['id', 'text', 'polarity'])\n", "data_test = pd.read_csv('tass2020/test/uy.tsv', sep='\\t', names=['id', 'text', 'polarity'])\n", "\n", "# Drop the 'id' column\n", "data_train.drop(['id'], axis=1, inplace=True)\n", "data_dev.drop(['id'], axis=1, inplace=True)\n", "data_test.drop(['id'], axis=1, inplace=True)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8cCc_uIgH6Wv" }, "source": [ "# Separate the features from the target class\n", "X_train_text = data_train['text']\n", "y_train_text = data_train['polarity'].values\n", "X_dev_text = data_dev['text']\n", "y_dev_text = data_dev['polarity'].values\n", "X_test_text = data_test['text']\n", "y_test_text = data_test['polarity'].values" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "U9eWVkPoOP5-" }, "source": [ "from tensorflow.keras.preprocessing.text import Tokenizer\n", "from tensorflow.keras.preprocessing import sequence\n", "\n", "max_length = 50\n", "\n", "# Tokenize the texts (the vocabulary is built from the training set only)\n", "t = Tokenizer()\n", "t.fit_on_texts(X_train_text)\n", "X_train = t.texts_to_sequences(X_train_text)\n", "X_dev = t.texts_to_sequences(X_dev_text)\n", "X_test = t.texts_to_sequences(X_test_text)\n", "\n", "# Pad/truncate every sequence to max_length tokens\n", "X_train = sequence.pad_sequences(X_train, maxlen=max_length)\n", "X_dev = sequence.pad_sequences(X_dev, maxlen=max_length)\n", "X_test = sequence.pad_sequences(X_test, 
maxlen=max_length)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "rZr97MYN7-f6" }, "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Encode the labels as integers\n", "le = LabelEncoder()\n", "le.fit([\"P\", \"NEU\", \"N\"])\n", "y_train = le.transform(y_train_text)\n", "y_dev = le.transform(y_dev_text)\n", "y_test = le.transform(y_test_text)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zDIFe-sU3Tm2" }, "source": [ "### Exercise 1 - Describing the data\n", "\n", "1. Visualize and describe the structure of the different attributes and the target class.\n", "1. How many examples does each partition (training, development and test) contain in total?\n", "1. And for each value of the target class?" ] }, { "cell_type": "code", "metadata": { "id": "cYg0NxNg2WPV" }, "source": [ "# Code to answer the questions\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Qn8lP0xqBg-N" }, "source": [ "Answers:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "M3SCysKy6H84" }, "source": [ "## Implementing the neural network in Keras\n", "\n", "Next, we will implement the LSTM network using Keras." ] }, { "cell_type": "markdown", "metadata": { "id": "N-s7hDk_6ogr" }, "source": [ "### Exercise 2 - Describing the network\n", "\n", "1. What type of RNN would you use for this task?\n", "1. And what type of activation function? Why?\n" ] }, { "cell_type": "markdown", "metadata": { "id": "DtuIBUNT64VN" }, "source": [ "Answers:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3bGKWhpu6-zS" }, "source": [ "### Exercise 3 - Implementing the network\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wcBG8tH7OzGr" }, "source": [ "Implement a neural network with the following layers to solve the problem:\n", "\n", "1. Input\n", "1. 
LSTM layer with 32 units\n", "1. Activation\n" ] }, { "cell_type": "code", "metadata": { "id": "arTpfOWc7Fw0" }, "source": [ "from tensorflow import keras\n", "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Input, LSTM, Dense\n", "import numpy as np" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1nTG2b4fE2em" }, "source": [ "# Exercise 3 implementation" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TymgKQRTDFZ0" }, "source": [ "### Exercise 4 - Compiling the model" ] }, { "cell_type": "markdown", "metadata": { "id": "8natwz2lDPP_" }, "source": [ "Compile the model using:\n", "\n", "* Optimizer: Adam with a learning rate of 0.01\n", "* Optimization metric: accuracy\n", "* An appropriate loss function" ] }, { "cell_type": "code", "metadata": { "id": "37l7ZX55FfP6" }, "source": [ "# Exercise 4 implementation\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "hRpB_dShE7vr" }, "source": [ "# Model visualization\n", "model.summary()\n", "keras.utils.plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=False)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "s6l0xOpU7c92" }, "source": [ "## Training the model\n" ] }, { "cell_type": "markdown", "metadata": { "id": "kt8aVoMp7knc" }, "source": [ "### Exercise 5 - Training the network\n" ] }, { "cell_type": "markdown", "metadata": { "id": "hqVkYxguEU42" }, "source": [ "Train the model above for 20 epochs, making appropriate use of the training and validation data."
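, "\n", "As a rough sketch (not the only valid solution; it assumes the compiled model from Exercise 4 is named `model`), the training call could look like this:\n", "\n", "```python\n", "# Sketch only: 'model' is assumed to be the compiled model from Exercise 4\n", "history = model.fit(X_train, y_train, validation_data=(X_dev, y_dev), epochs=20)\n", "```\n"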
] }, { "cell_type": "code", "metadata": { "id": "zF438E9k7tEc" }, "source": [ "# Implementación del ejercicio 5" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ffxSAuid70VC" }, "source": [ "### Visualización\n" ] }, { "cell_type": "code", "metadata": { "id": "t9Nj5CXu78_D" }, "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_history(history):\n", " # Plot training & validation accuracy values\n", " plt.plot(history.history['accuracy'])\n", " plt.plot(history.history['val_accuracy'])\n", " plt.title('Model accuracy')\n", " plt.ylabel('Accuracy')\n", " plt.xlabel('Epoch')\n", " plt.legend(['Train', 'Validation'], loc='upper left')\n", " plt.show()\n", "\n", " # Plot training & validation loss values\n", " plt.plot(history.history['loss'])\n", " plt.plot(history.history['val_loss'])\n", " plt.title('Model loss')\n", " plt.ylabel('Loss')\n", " plt.xlabel('Epoch')\n", " plt.legend(['Train', 'Validation'], loc='upper left')\n", " plt.show()\n", " plt.clf()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "neZtw2k4_xUL" }, "source": [ "plot_history(model.history)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "3Z1MLxIY8BhM" }, "source": [ "## Evaluación del modelo" ] }, { "cell_type": "markdown", "metadata": { "id": "5iUqFhX6Gs2Q" }, "source": [ "### Ejercicio 6 - Evaluación" ] }, { "cell_type": "markdown", "metadata": { "id": "96BqtUHwG2cB" }, "source": [ "Evalúe el modelo anterior utilizando los datos de test. ¿Qué accuracy tuvo el modelo?" 
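, "\n", "As a hedged sketch (again assuming the trained model is named `model`), `model.evaluate` returns the loss followed by the metrics that were passed at compile time:\n", "\n", "```python\n", "# Sketch only: with the Exercise 4 compilation this returns [loss, accuracy]\n", "test_loss, test_acc = model.evaluate(X_test, y_test)\n", "print(f'Test accuracy: {test_acc:.3f}')\n", "```\n"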
] }, { "cell_type": "code", "metadata": { "id": "xcm1pVn28IrQ" }, "source": [ "# Exercise 6 implementation\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TE2eQX7d8UpZ" }, "source": [ "## Embeddings\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ro8AY3Km_cyX" }, "source": [ "### Pre-trained embeddings" ] }, { "cell_type": "markdown", "metadata": { "id": "bxu6uiDD_hb9" }, "source": [ "For this part we will use the pre-trained [Fasttext](https://fasttext.cc/) embeddings for Spanish." ] }, { "cell_type": "code", "metadata": { "id": "6U9TmRh5_yuY" }, "source": [ "# Download and decompress the embeddings\n", "! wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz\n", "! gzip -d cc.es.300.vec.gz" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "P0p0aTRZAFHH" }, "source": [ "# Load the embeddings into a dict: word -> 300-dimensional vector\n", "embeddings_index = {}\n", "with open(\"cc.es.300.vec\") as f:\n", "    next(f)  # skip the header line ('<word_count> <dim>') of the .vec format\n", "    for line in f:\n", "        word, coefs = line.split(maxsplit=1)\n", "        coefs = np.fromstring(coefs, \"f\", sep=\" \")\n", "        embeddings_index[word] = coefs" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "YKGpA2g4AMn2" }, "source": [ "# Build the embedding matrix\n", "num_tokens = len(t.word_index) + 2\n", "embedding_dim = 300\n", "\n", "embedding_matrix = np.zeros((num_tokens, embedding_dim))\n", "for word, i in t.word_index.items():\n", "    embedding_vector = embeddings_index.get(word)\n", "    # Words not found in the embedding index stay all-zeros;\n", "    # this includes the rows representing \"padding\" and \"OOV\"\n", "    if embedding_vector is not None:\n", "        embedding_matrix[i] = embedding_vector" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zJQu7xKCDP4J" }, "source": [ "### Exercise 7 - Implementing the network" ] }, { "cell_type": "markdown", "metadata": { "id": 
"1_7hxI9UOb1M" }, "source": [ "Ahora implemente, compile y visualice el siguiente modelo:\n", "\n", "1. Entrada\n", "1. **Capa de embeddings utilizando la matriz anterior**\n", "1. Capa LSTM, de 32 unidades con\n", "1. Activación Softmax\n" ] }, { "cell_type": "code", "metadata": { "id": "QDtwICB18brS" }, "source": [ "# Implementación del ejercicio 7" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "dL0-Ap-NDYhM" }, "source": [ "### Ejercicio 8 - Entrenamiento y visualización" ] }, { "cell_type": "markdown", "metadata": { "id": "CoQTqbQGGcdJ" }, "source": [ "Entrene el nuevo modelo" ] }, { "cell_type": "code", "metadata": { "id": "0VBD_RZJDbdE" }, "source": [ "# Implementación del ejercicio 8" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "DSMMrG22G7qL" }, "source": [ "# Visualizamos los resultados del entrenamiento\n", "plot_history(model_emb.history)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "rkg_EDaoqbPC" }, "source": [ "### Ejercicio 9 - Mejorando el modelo" ] }, { "cell_type": "markdown", "metadata": { "id": "S4DgRwQNI-bu" }, "source": [ "A partir de los resultados obtenidos en entrenamiento. ¿Cómo podemos mejorar el modelo?" ] }, { "cell_type": "markdown", "metadata": { "id": "ho50j4h_JK2g" }, "source": [ "Respuesta:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oBMCJl_8JNuh" }, "source": [ "Agregue al modelo anterior las capas que sean necesarias." 
] }, { "cell_type": "code", "metadata": { "id": "fb4fVu5VqhNz" }, "source": [ "# Exercise 9 implementation" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "eJvq6idPrkU6" }, "source": [ "plot_history(model_emb.history)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "JvG-FzTUDcaz" }, "source": [ "### Exercise 10 - Evaluation" ] }, { "cell_type": "markdown", "metadata": { "id": "RF8LWNP5KSt2" }, "source": [ "Evaluate the new model on the test data. What accuracy did it achieve?" ] }, { "cell_type": "code", "metadata": { "id": "Fa6XpAs5ph0t" }, "source": [ "# Exercise 10 implementation" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "_ZQR9XTZ8gPD" }, "source": [ "## Conclusions\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ISCZ3nAR8xd4" }, "source": [ "### Exercise 11 - Conclusions on the improvements" ] }, { "cell_type": "markdown", "metadata": { "id": "zywyTaZ59Cp8" }, "source": [ "1. Briefly comment on the results obtained by adding embeddings.\n", "1. In what other ways could we improve the results?\n" ] } ] }