{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "t9ZNGOygrFsT" }, "source": [ "**Redes Neuronales para Lenguaje Natural, 2023**\n", "\n", "---\n", "# **Análisis de sentimento con BERT (en español)**\n", "\n", "En este *notebook* construiremos un sistema de análisis de sentimiento usando BERT. Los datos que utilizaremos son comentarios de películas de IMDB.\n", "\n", "Este notebook está basado en el capítulo 16 del libro \"Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python\" de Raschka, S., Liu, Y. H., Mirjalili, V., & Dzhulgakov, D. (2022).\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "source": [ "Empezaremos por instalar la librería transformers." ], "metadata": { "id": "RmoRzsBZV41u" } }, { "cell_type": "code", "source": [ "!pip install transformers" ], "metadata": { "id": "atA3Pz0lV5J_" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Algunos imports que utilizaremos a lo largo de este notebook:" ], "metadata": { "id": "r288PtJmV2EU" } }, { "cell_type": "code", "source": [ "import time\n", "\n", "import pandas as pd\n", "import torch\n", "\n", "import transformers\n", "from transformers import DistilBertTokenizerFast\n", "from transformers import DistilBertForSequenceClassification\n" ], "metadata": { "id": "CyC64OZJlE-1" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Algunos ajustes que utilizaremos:" ], "metadata": { "id": "qdeSapk3WUPh" } }, { "cell_type": "code", "source": [ "torch.backends.cudnn.deterministic = True\n", "RANDOM_SEED = 248\n", "torch.manual_seed(RANDOM_SEED)\n", "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" ], "metadata": { "id": "8dxOTGCulFDi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "DEVICE" ], "metadata": { "id": "6M4I0WYKwShN" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## **Dataset**\n", "\n", "Vamos a utilizar el *IMDb movie review dataset* (en español, versión reducida), ejecute el siguiente bloque para descargarlo:" ], "metadata": { "id": "tB-KtT-qWiYN" } }, { "cell_type": "code", "source": [ "! wget https://eva.fing.edu.uy/mod/resource/view.php?id=194796 -O senti-corpus.zip\n", "! 
{ "cell_type": "code", "source": [ "import numpy as np\n", "import pandas as pd\n", "from collections import Counter\n", "from sklearn.metrics import precision_recall_fscore_support, accuracy_score\n", "\n", "train_df = pd.read_csv('./senti-train.tsv', header=None, sep='\\t')\n", "dev_df = pd.read_csv('./senti-dev.tsv', header=None, sep='\\t')\n", "test_df = pd.read_csv('./senti-test.tsv', header=None, sep='\\t')\n" ], "metadata": { "id": "9UIj0_u-UIP1" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "The files are now in memory as pandas DataFrames. Let's preview the first rows of the training set:" ], "metadata": { "id": "ebqyJisqaY8s" } },
{ "cell_type": "code", "source": [ "train_df.head()" ], "metadata": { "id": "gidnXvr5lFGG" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "We display the number of rows and columns of each partition:" ], "metadata": { "id": "3QQZvR-6A5tg" } },
{ "cell_type": "code", "source": [ "train_df.shape, dev_df.shape, test_df.shape" ], "metadata": { "id": "sdiLZO9JlFIt" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## **Train/Val/Test partitions**\n", "\n", "We use the train, val, and test partitions of the dataset for our experiments." ], "metadata": { "id": "JngOPLsBbNeL" } },
{ "cell_type": "code", "source": [ "train_texts = train_df.iloc[:,0].values\n", "train_labels = train_df.iloc[:,1].values\n", "\n", "valid_texts = dev_df.iloc[:,0].values\n", "valid_labels = dev_df.iloc[:,1].values\n", "\n", "test_texts = test_df.iloc[:,0].values\n", "test_labels = test_df.iloc[:,1].values" ], "metadata": { "id": "y-fIofwPUXXk" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [ "print(train_texts[0])\n", "print(train_labels[0])" ], "metadata": { "id": "cJCsAILHVPvZ" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "## **Tokenization**\n", "\n", "To use a pre-trained BERT model, the input must be tokenized in the same way the model was tokenized during pre-training. We will start with **DistilBERT** (https://huggingface.co/docs/transformers/model_doc/distilbert), specifically the Spanish model \"dccuchile/distilbert-base-spanish-uncased\", so we load the corresponding tokenizer:" ], "metadata": { "id": "Ny1Ob9Udbe8Z" } },
{ "cell_type": "code", "source": [ "tokenizer = DistilBertTokenizerFast.from_pretrained('dccuchile/distilbert-base-spanish-uncased')" ], "metadata": { "id": "8H2udnofbSo5" }, "execution_count": null, "outputs": [] },
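{ "cell_type": "markdown", "source": [ "Before encoding the full splits, here is a minimal sketch of what the tokenizer returns for a single sentence (the sample sentence is made up for illustration):" ], "metadata": { "id": "tokSketchMd01" } },
{ "cell_type": "code", "source": [ "# Tokenize one made-up sentence and inspect the encoding fields\n", "sample = tokenizer('¡Qué buena película!')\n", "print(sample.input_ids)       # vocabulary ids, with special tokens added\n", "print(sample.tokens())        # the corresponding subword tokens\n", "print(sample.attention_mask)  # all ones: a single sentence has no padding" ], "metadata": { "id": "tokSketchCode01" }, "execution_count": null, "outputs": [] },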
], "metadata": { "id": "xlHBS2IaC_Ax" } }, { "cell_type": "code", "source": [ "train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)\n", "valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)\n", "test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)" ], "metadata": { "id": "PkUfioYMbSr0" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Resultado de la tokenización de un conjunto:" ], "metadata": { "id": "IrSbQVOwDJQA" } }, { "cell_type": "code", "source": [ "type(train_encodings)" ], "metadata": { "id": "76bpNkwLDJlm" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Un elemento tokenizado:" ], "metadata": { "id": "-A63A4Ztdjw8" } }, { "cell_type": "code", "source": [ "train_encodings[0]" ], "metadata": { "id": "ZTkF--KobSuR" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "print(train_encodings[0].tokens)\n", "print(train_encodings[0].tokens.index('[PAD]'))\n", "print(train_encodings[0].attention_mask.index(0))" ], "metadata": { "id": "JZHoCotHPOnn" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Observe algunos ejemplos y analice el contenido de cada campo (por dudas consulte https://huggingface.co/docs/tokenizers/api/encoding)" ], "metadata": { "id": "4_Y3qLjDdqPJ" } }, { "cell_type": "code", "source": [ "print(train_encodings[0].attention_mask)\n", "print(train_encodings[0].special_tokens_mask)\n" ], "metadata": { "id": "fs0-h3iGbSw2" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Pregunta:** ¿Qué objetivo cumple el campo *attention_mask*?\n", "\n", "**Respuesta:** [Escriba su respuesta]" ], "metadata": { "id": "AvRawCH8FFk9" } }, { "cell_type": "markdown", "source": [ "**Pregunta:** ¿Qué objetivo cumple el campo *special_tokens_mask*?\n", "\n", "**Respuesta:** [Escriba su respuesta]" ], "metadata": { "id": "H-GNuTMfFXZI" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "## **Dataset pytorch class y DataLoaders**\n", "\n", "Instanciamos los datos tokenizados como un Dataset de pytorch." 
], "metadata": { "id": "NXO6TGraevwC" } }, { "cell_type": "code", "source": [ "class IMDbDataset(torch.utils.data.Dataset):\n", " def __init__(self, encodings, labels):\n", " self.encodings = encodings\n", " self.labels = labels\n", "\n", " def __getitem__(self, idx):\n", " item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}\n", " item['labels'] = torch.tensor(self.labels[idx])\n", " return item\n", "\n", " def __len__(self):\n", " return len(self.labels)\n", "\n", "\n", "train_dataset = IMDbDataset(train_encodings, train_labels)\n", "valid_dataset = IMDbDataset(valid_encodings, valid_labels)\n", "test_dataset = IMDbDataset(test_encodings, test_labels)\n" ], "metadata": { "id": "LTLWLRq6bS29" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Instanciamos el DataLoader de cada partición:" ], "metadata": { "id": "h4h3Zs-0knsj" } }, { "cell_type": "code", "source": [ "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)\n", "valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)\n", "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)" ], "metadata": { "id": "vIk8AzZZlFN5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## **Cargar y *fine-tuning* de un modelo BERT pre-entrenado**\n", "\n", "Cargamos \"distilbert-base-uncased\":" ], "metadata": { "id": "dJczS9mQlrTj" } }, { "cell_type": "code", "source": [ "model = DistilBertForSequenceClassification.from_pretrained('dccuchile/distilbert-base-spanish-uncased')\n", "model.to(DEVICE)\n" ], "metadata": { "id": "EbWij3kylV88" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Evaluaremos los modelos resultantes con la siguente función para calcular *accuracy*:" ], "metadata": { "id": "L9oKwDCWI_hU" } }, { "cell_type": "code", "source": [ "def compute_accuracy(model, data_loader, device):\n", " with torch.no_grad():\n", " correct_pred, num_examples = 0, 0\n", "\n", " for batch_idx, batch in enumerate(data_loader):\n", "\n", " ### Prepare data\n", " input_ids = batch['input_ids'].to(device)\n", " attention_mask = batch['attention_mask'].to(device)\n", " labels = batch['labels'].to(device)\n", " outputs = model(input_ids, attention_mask=attention_mask)\n", " logits = outputs['logits']\n", " predicted_labels = torch.argmax(logits, 1)\n", " num_examples += labels.size(0)\n", " correct_pred += (predicted_labels == labels).sum()\n", "\n", " return correct_pred.float()/num_examples * 100" ], "metadata": { "id": "tM2zotIhlWCd" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "compute_accuracy(model,test_loader,DEVICE)" ], "metadata": { "id": "fSepKJUHP_Tc" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Para realizar el fine-tuning del modelo pre-entrenado utilizamos el siguiente ciclo de entrenamiento:" ], "metadata": { "id": "NCwz3JPRJ3M_" } }, { "cell_type": "code", "source": [ "NUM_EPOCHS = 5\n", "optim = torch.optim.Adam(model.parameters(), lr=5e-6)\n", "\n", "\n", "start_time = time.time()\n", "\n", "for epoch in range(NUM_EPOCHS):\n", "\n", " model.train()\n", "\n", " for batch_idx, batch in enumerate(train_loader):\n", "\n", " ### Prepare data\n", " input_ids = batch['input_ids'].to(DEVICE)\n", " attention_mask = batch['attention_mask'].to(DEVICE)\n", " labels = batch['labels'].to(DEVICE)\n", "\n", " ### Forward\n", " outputs = model(input_ids, 
{ "cell_type": "markdown", "source": [ "To fine-tune the pre-trained model we use the following training loop:" ], "metadata": { "id": "NCwz3JPRJ3M_" } },
{ "cell_type": "code", "source": [ "NUM_EPOCHS = 5\n", "optim = torch.optim.Adam(model.parameters(), lr=5e-6)\n", "\n", "\n", "start_time = time.time()\n", "\n", "for epoch in range(NUM_EPOCHS):\n", "\n", "    model.train()\n", "\n", "    for batch_idx, batch in enumerate(train_loader):\n", "\n", "        ### Prepare data\n", "        input_ids = batch['input_ids'].to(DEVICE)\n", "        attention_mask = batch['attention_mask'].to(DEVICE)\n", "        labels = batch['labels'].to(DEVICE)\n", "\n", "        ### Forward\n", "        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n", "        loss, logits = outputs['loss'], outputs['logits']\n", "\n", "        ### Backward\n", "        optim.zero_grad()\n", "        loss.backward()\n", "        optim.step()\n", "\n", "        ### Logging\n", "        if not batch_idx % 20:\n", "            print(f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d} | '\n", "                  f'Batch {batch_idx:04d}/{len(train_loader):04d} | '\n", "                  f'Loss: {loss:.4f}')\n", "\n", "    model.eval()\n", "\n", "    with torch.no_grad():\n", "        print(f'Training accuracy: '\n", "              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'\n", "              f'\\nValid accuracy: '\n", "              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')\n", "\n", "    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')\n", "\n", "print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')\n" ], "metadata": { "id": "EF2hspvSlWHX" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "### **Model evaluation**\n", "\n", "We evaluate the fine-tuned model on the **test** set:" ], "metadata": { "id": "9hL2WFI_KGsY" } },
{ "cell_type": "code", "source": [ "print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')" ], "metadata": { "id": "xQByjodClWO6" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "# **Exercise**\n", "\n", "Based on the above, run at least **3** experiments as follows:\n", "\n", "- Instantiate a **pre-trained BERT** model\n", "- **Fine-tune** the model for classification on the *IMDb movie review dataset*\n", "- **Tune hyperparameters** using the **validation** partition\n", "\n", "\n", "Report the **test** result of the model that obtained the best results." ], "metadata": { "id": "eec9alDSZ5dJ" } },
{ "cell_type": "code", "source": [ "# Write your code starting here" ], "metadata": { "id": "jSzk86DclWZl" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [], "metadata": { "id": "OPxckl5blWb7" }, "execution_count": null, "outputs": [] },
{ "cell_type": "code", "source": [], "metadata": { "id": "8N-TXghflXBr" }, "execution_count": null, "outputs": [] } ], "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "nbformat": 4, "nbformat_minor": 0 }