{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "Neural Networks for Natural Language, 2023\n", "\n", "---\n", "# Using Word Embeddings in a Classifier\n", "\n", "In this notebook we explore a pretrained collection of word embeddings and see how to use it to train a sentiment analysis classifier.\n", "\n", "\n", "---\n", "\n" ], "metadata": { "id": "t9ZNGOygrFsT" } }, { "cell_type": "markdown", "source": [ "Download the SBWC embeddings" ], "metadata": { "id": "-wlcQnjW3PqL" } }, { "cell_type": "code", "source": [ "! wget https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.bin.gz\n", "! gzip -d SBW-vectors-300-min5.bin.gz" ], "metadata": { "id": "U6Qas-YKZ2n5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Open them with the gensim embeddings library" ], "metadata": { "id": "K4UUqZJT3XAR" } }, { "cell_type": "code", "source": [ "from gensim.models import KeyedVectors\n", "wv = KeyedVectors.load_word2vec_format(\"./SBW-vectors-300-min5.bin\", binary=True)\n", "print(wv.vectors.shape)\n", "embeddings_size = wv.vectors.shape[1]" ], "metadata": { "id": "h9s_pZExZ8kj" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Try a few simple cases of interest.\n", "\n", "What do gensim's vectors look like?\n", "\n", "Which other words lie close to a target word?" 
], "metadata": { "id": "TIMI0SKZ7y2s" } }, { "cell_type": "code", "source": [ "print(wv['perro'])\n", "print(wv.most_similar('perro'))" ], "metadata": { "id": "UqDxdxgS3j0j" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Try some analogies using gensim's most_similar() method:\n", "\n", "rey - hombre + mujer ≈ reina\n", "\n", "parís - francia + uruguay ≈ montevideo\n", "\n", "Find at least four more analogy examples." ], "metadata": { "id": "JJUjIyQk6rxB" } }, { "cell_type": "code", "source": [ "# analogy computation" ], "metadata": { "id": "1-UBB8CzcQtg" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Try some word similarities using gensim's similarity() method.\n", "\n", "Consider the pairs:\n", "\n", "* perro - gato\n", "* frío - helado\n", "* democracia - monarquía\n", "* frío - caliente\n", "\n", "Write at least six more pairs.\n", "\n", "Which pairs should be closest according to human intuition?\n", "\n", "Does that hold in the embeddings?\n" ], "metadata": { "id": "LhriDr5a7jyy" } }, { "cell_type": "code", "source": [ "# similarity computation" ], "metadata": { "id": "oXWkQhFs5JDg" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Try computing distances with gensim's distance() method. How does it relate to the method used in the previous part?" ], "metadata": { "id": "wm213GTUlgU8" } }, { "cell_type": "code", "source": [ "# distance computation" ], "metadata": { "id": "MQCt2C8g5oOr" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Visualize with two dimensionality reduction techniques: PCA and t-SNE\n", "\n", "We will try a particular set of words from different classes.\n", "\n", "Run more tests with other words you find relevant." 
], "metadata": { "id": "VRz0CGy3_1b2" } }, { "cell_type": "code", "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "words = ['perro','gato','elefante','tiburón','loro','paloma','ballena','democracia','trabajo','economía','política','guerra','aerodinámico','rápido','lento','intenso','furioso','azul','rojo','verde','amarillo','naranja','lunes','martes','domingo','febrero','diciembre','comer','saltar','dormir','volar','salir','entrar']\n", "X = np.array([wv[w] for w in words])" ], "metadata": { "id": "pFFrf52x-jgO" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from sklearn.decomposition import PCA\n", "\n", "X_pca = PCA(n_components=2).fit_transform(X)\n", "\n", "fig, ax = plt.subplots(figsize=(10,10))\n", "ax.scatter(X_pca[:,0], X_pca[:,1])\n", "\n", "for i, w in enumerate(words):\n", "    ax.annotate(w, X_pca[i])" ], "metadata": { "id": "eRjaqZFT6m69" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from sklearn.manifold import TSNE\n", "\n", "X_tsne = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3).fit_transform(X)\n", "\n", "fig, ax = plt.subplots(figsize=(10,10))\n", "ax.scatter(X_tsne[:,0], X_tsne[:,1])\n", "\n", "for i, w in enumerate(words):\n", "    ax.annotate(w, X_tsne[i])\n" ], "metadata": { "id": "GXRJjKBL-EDB" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Download and load the sentiment analysis corpus" ], "metadata": { "id": "8fUBKGc_jyjk" } }, { "cell_type": "code", "source": [ "! wget 'https://eva.fing.edu.uy/mod/resource/view.php?id=194796' -O senti-corpus.zip\n", "! 
unzip senti-corpus.zip" ], "metadata": { "id": "u34lygcpDC1m" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import numpy as np\n", "import pandas as pd\n", "from collections import Counter\n", "from sklearn.metrics import precision_recall_fscore_support, accuracy_score\n", "\n", "train_df = pd.read_csv('./senti-train.tsv',sep='\\t')\n", "dev_df = pd.read_csv('./senti-dev.tsv',sep='\\t')\n", "test_df = pd.read_csv('./senti-test.tsv',sep='\\t')\n" ], "metadata": { "id": "Lgs3RHq1gARr" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "To use the corpus texts in a classifier, we need to do the following:\n", "\n", "1. Preprocess the texts (e.g., tokenize)\n", "2. Transform them into a vector representation (e.g., centroid)\n", "3. Get the corpus labels (which are 0 or 1) as a numpy array.\n" ], "metadata": { "id": "uoxzvpVIr8tw" } }, { "cell_type": "code", "source": [ "# preprocessing code\n", "\n", "train_tokens = ...\n", "dev_tokens = ...\n", "test_tokens = ...\n" ], "metadata": { "id": "yVnZQIv4jQlD" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# vectorization code\n", "\n", "train_v = ...\n", "dev_v = ...\n", "test_v = ...\n" ], "metadata": { "id": "YqUmpoZbk1lL" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "train_labels = np.array(train_df.iloc[:,1])\n", "dev_labels = np.array(dev_df.iloc[:,1])\n", "test_labels = np.array(test_df.iloc[:,1])\n", "\n", "print(\"train\",np.bincount(train_labels))\n", "print(\"dev\",np.bincount(dev_labels))\n", "print(\"test\",np.bincount(test_labels))\n" ], "metadata": { "id": "_oyreMwxjqC8" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The following code is a very simple evaluation script that takes a classifier, a dataset, and its expected labels, and prints Precision, Recall, 
F1, and Accuracy.\n", "\n", "Precision, Recall, and F1 are macro-averaged over the classes.\n", "\n", "We will use this evaluation script to compare all our results.\n" ], "metadata": { "id": "MvUAyYwLkUj6" } }, { "cell_type": "code", "source": [ "from sklearn.metrics import precision_recall_fscore_support, accuracy_score\n", "\n", "def evaluate(clf,vectors,labels):\n", "    pred = clf.predict(vectors)\n", "    p,r,f,s = precision_recall_fscore_support(labels,pred,average='macro')\n", "    a = accuracy_score(labels,pred)\n", "    print(\"P %s, R %s, F %s, A %s\" % (p,r,f,a))\n" ], "metadata": { "id": "Zzf_Ydy3nPs8" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Train classifiers in sklearn, trying to find the one that performs best on the dev data. For example: LogisticRegression, RandomForestClassifier, SVC" ], "metadata": { "id": "n15McgVEskFN" } }, { "cell_type": "code", "source": [ "clf1 = ... # build and train the classifier\n", "\n", "evaluate(clf1,dev_v,dev_labels)" ], "metadata": { "id": "ekZ_Itc_mDTK" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "clf2 = ...\n", "\n", "evaluate(clf2,dev_v,dev_labels)" ], "metadata": { "id": "xgfmagZyrJQA" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Evaluate the performance of the best classifier found on the test set" ], "metadata": { "id": "6fbo26IW8LLw" } }, { "cell_type": "code", "source": [ "clf_best = ... # the best classifier I found\n", "evaluate(clf_best,test_v,test_labels)" ], "metadata": { "id": "1MXVm09a8fCg" }, "execution_count": null, "outputs": [] } ] }