{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Doc2Vec Model\n", "\n", "Introduces Gensim's Doc2Vec model and demonstrates its use on the\n", "[Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)_.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Doc2Vec is a `core_concepts_model` that represents each\n", "`core_concepts_document` as a `core_concepts_vector`. This\n", "tutorial introduces the model and demonstrates how to train and assess it.\n", "\n", "Here's a list of what we'll be doing:\n", "\n", "0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n", "1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n", "2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n", "3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n", "4. Assess the model\n", "5. Test the model on the test corpus\n", "\n", "## Review: Bag-of-words\n", "\n", ".. Note:: Feel free to skip these review sections if you're already familiar with the models.\n", "\n", "You may be familiar with the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) from the\n", "`core_concepts_vector` section.\n", "This model transforms each document to a fixed-length vector of integers.\n", "For example, given the sentences:\n", "\n", "- ``John likes to watch movies. Mary likes movies too.``\n", "- ``John also likes to watch football games. Mary hates football.``\n", "\n", "The model outputs the vectors:\n", "\n", "- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n", "- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n", "\n", "Each vector has 10 elements, where each element counts the number of times a\n", "particular word occurred in the document.\n", "The order of elements is arbitrary.\n", "In the example above, the order of the elements corresponds to the words:\n", "``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n", "\n", "Bag-of-words models are surprisingly effective, but have several weaknesses.\n", "\n", "First, they lose all information about word order: \"John likes Mary\" and\n", "\"Mary likes John\" correspond to identical vectors. There is a solution: bag\n", "of [n-grams](https://en.wikipedia.org/wiki/N-gram)_\n", "models consider word phrases of length n to represent documents as\n", "fixed-length vectors to capture local word order but suffer from data\n", "sparsity and high dimensionality.\n", "\n", "Second, the model does not attempt to learn the meaning of the underlying\n", "words, and as a consequence, the distance between vectors doesn't always\n", "reflect the difference in meaning. The ``Word2Vec`` model addresses this\n", "second problem.\n", "\n", "## Review: ``Word2Vec`` Model\n", "\n", "``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\n", "vector space using a shallow neural network. The result is a set of\n", "word-vectors where vectors close together in vector space have similar\n", "meanings based on context, and word-vectors distant to each other have\n", "differing meanings. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "With the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.\n", "But what if we want to calculate a vector for the **entire document**?\n", "We could average the vectors for each word in the document - while this is quick and crude, it can often be useful.\n", "However, there is a better way...\n", "\n", "## Introducing: Paragraph Vector\n", "\n", ".. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.\n", "\n", "Le and Mikolov in 2014 introduced the [Doc2Vec algorithm](https://cs.stanford.edu/~quocle/paragraph_vector.pdf),\n", "which usually outperforms such simple averaging of ``Word2Vec`` vectors.\n", "\n", "The basic idea is: act as if a document has another floating word-like\n", "vector, which contributes to all training predictions, and is updated like\n", "other word-vectors, but we will call it a doc-vector. Gensim's\n", ":py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.\n", "\n", "There are two implementations:\n", "\n", "1. Paragraph Vector - Distributed Memory (PV-DM)\n", "2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n", "\n", ".. Important::\n", " Don't let the implementation details below scare you.\n", " They're advanced material: if it's too much, then move on to the next section.\n", "\n", "PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training\n", "a neural network on the synthetic task of predicting a center word based on an\n", "average of both context word-vectors and the full document's doc-vector.\n", "\n", "PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training\n", "a neural network on the synthetic task of predicting a target word just from\n", "the full document's doc-vector. (It is also common to combine this with\n", "skip-gram training, using both the doc-vector and nearby word-vectors to\n", "predict a single target word, but only one at a time.)\n", "\n", "## Prepare the Training and Test Data\n", "\n", "For this tutorial, we'll be training our model using the [Lee Background\n", "Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)\n", "included in gensim. This corpus contains 300 documents selected from the\n", "Australian Broadcasting Corporation’s news mail service, which provides text\n", "e-mails of headline stories and covers a number of broad topics.\n", "\n", "And we'll test our model by eye using the much shorter [Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf),\n", "which contains 50 documents.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/vip/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. 
See: https://github.com/urllib3/urllib3/issues/3020\n", " warnings.warn(\n" ] } ], "source": [ "import os\n", "import gensim\n", "# Set file names for train and test data\n", "test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\n", "lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')\n", "lee_test_file = os.path.join(test_data_dir, 'lee.cor')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define a Function to Read and Preprocess Text\n", "\n", "Below, we define a function to:\n", "\n", "- open the train/test file (with latin encoding)\n", "- read the file line-by-line\n", "- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc)\n", "\n", "The file we're reading is a **corpus**.\n", "Each line of the file is a **document**.\n", "\n", ".. Important::\n", " To train the model, we'll need to associate a tag/number with each document\n", " of the training corpus. In our case, the tag is simply the zero-based line\n", " number.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import smart_open\n", "\n", "def read_corpus(fname, tokens_only=False):\n", " with smart_open.open(fname, encoding=\"iso-8859-1\") as f:\n", " for i, line in enumerate(f):\n", " tokens = gensim.utils.simple_preprocess(line)\n", " if tokens_only:\n", " yield tokens\n", " else:\n", " # For training data, add tags\n", " yield gensim.models.doc2vec.TaggedDocument(tokens, [i])\n", "\n", "train_corpus = list(read_corpus(lee_train_file))\n", "test_corpus = list(read_corpus(lee_test_file, tokens_only=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the training corpus\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 
'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]), TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]\n" ] } ], "source": [ "print(train_corpus[:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the testing corpus looks like this:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 
'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]\n" ] } ], "source": [ "print(test_corpus[:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the testing corpus is just a list of lists and does not contain\n", "any tags.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the Model\n", "\n", "Now, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions,\n", "iterating over the training corpus 40 times. We set the minimum word count to\n", "2 in order to discard words with very few occurrences. (Without a variety of\n", "representative examples, retaining such infrequent words can often make a\n", "model worse!) Typical iteration counts in the published [Paragraph Vector paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)\n", "results, using 10s-of-thousands to millions of docs, are 10-20. More\n", "iterations take more time and eventually reach a point of diminishing\n", "returns.\n", "\n", "However, this is a very, very small dataset (300 documents) with shortish\n", "documents (a few hundred words). Adding training passes can sometimes help\n", "with such small datasets.\n", "\n", "\n" ] }
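, { "cell_type": "markdown", "metadata": {}, "source": [ "Gensim's :py:class:`~gensim.models.doc2vec.Doc2Vec` trains PV-DM by default. As a hedged aside (this variant model is shown for illustration only and is not used in the rest of the tutorial), you could select PV-DBOW instead by passing ``dm=0``:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# An illustrative PV-DBOW variant, not used below: dm=0 selects PV-DBOW,\n", "# and dbow_words=1 additionally trains word-vectors, skip-gram style.\n", "dbow_model = gensim.models.doc2vec.Doc2Vec(\n", "    vector_size=50, min_count=2, epochs=40, dm=0, dbow_words=1)" ] }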
, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-24 22:54:49,967 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec', 'datetime': '2025-03-24T22:54:49.967607', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a vocabulary\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-24 22:54:52,066 : INFO : collecting all words and their counts\n", "2025-03-24 22:54:52,072 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags\n", "2025-03-24 22:54:52,087 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words\n", "2025-03-24 22:54:52,088 : INFO : Creating a fresh vocabulary\n", "2025-03-24 22:54:52,098 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 3955 unique words (56.65% of original 6981, drops 3026)', 'datetime': '2025-03-24T22:54:52.098541', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-24 22:54:52,098 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 55126 word corpus (94.80% of original 58152, drops 3026)', 'datetime': '2025-03-24T22:54:52.098990', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-24 22:54:52,110 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-24 22:54:52,111 : INFO : sample=0.001 downsamples 46 most-common words\n", "2025-03-24 22:54:52,111 : INFO : Doc2Vec lifecycle event {'msg': 'downsampling leaves estimated 42390.98914085061 word corpus (76.9%% of prior 55126)', 'datetime': '2025-03-24T22:54:52.111709', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-24 22:54:52,126 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes\n", "2025-03-24 22:54:52,127 : INFO : resetting layer weights\n" ] } ], "source": [ "model.build_vocab(train_corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Essentially, the vocabulary is a list (accessible via\n", "``model.wv.index_to_key``) of all of the unique words extracted from the training corpus.\n", "Additional attributes for each word are available using the ``model.wv.get_vecattr()`` method.\n", "For example, to see how many times ``penalty`` appeared in the training corpus:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word 'penalty' appeared 4 times in the training corpus.\n" ] } ], "source": [ "print(f\"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.\")" ] }
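, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the \"list\" claim concrete, here is a small sketch (using the model built above) that peeks at the vocabulary's size and its head:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The vocabulary is ordered most-frequent-first, so the head of\n", "# index_to_key shows the most common words that survived min_count.\n", "print(len(model.wv.index_to_key))\n", "print(model.wv.index_to_key[:10])" ] }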
"source": [ "Next, train the model on the corpus.\n", "In the usual case, where Gensim installation found a BLAS library for optimized\n", "bulk vector operations, this training on this tiny 300 document, ~60k word corpus \n", "should take just a few seconds. (More realistic datasets of tens-of-millions\n", "of words or more take proportionately longer.) If for some reason a BLAS library \n", "isn't available, training uses a fallback approach that takes 60x-120x longer, \n", "so even this tiny training will take minutes rather than seconds. (And, in that \n", "case, you should also notice a warning in the logging letting you know there's \n", "something worth fixing.) So, be sure your installation uses the BLAS-optimized \n", "Gensim if you value your time.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-24 22:54:58,219 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-24T22:54:58.219164', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-24 22:54:58,255 : INFO : EPOCH 0: training on 58152 raw words (42668 effective words) took 0.0s, 1580960 effective words/s\n", "2025-03-24 22:54:58,276 : INFO : EPOCH 1: training on 58152 raw words (42635 effective words) took 0.0s, 2152780 effective words/s\n", "2025-03-24 22:54:58,296 : INFO : EPOCH 2: training on 58152 raw words (42747 effective words) took 0.0s, 2296811 effective words/s\n", "2025-03-24 22:54:58,316 : INFO : EPOCH 3: training on 58152 raw words (42743 effective words) took 0.0s, 2268792 effective words/s\n", "2025-03-24 22:54:58,336 : INFO : EPOCH 4: training on 58152 raw words (42638 effective words) took 0.0s, 2286067 effective words/s\n", "2025-03-24 22:54:58,356 : INFO : EPOCH 5: training on 58152 raw words (42605 effective words) took 0.0s, 2261777 effective words/s\n", "2025-03-24 22:54:58,377 : INFO : EPOCH 6: training on 58152 raw words (42786 effective words) took 0.0s, 2272592 effective words/s\n", "2025-03-24 22:54:58,397 : INFO : EPOCH 7: training on 58152 raw words (42687 effective words) took 0.0s, 2248662 effective words/s\n", "2025-03-24 22:54:58,417 : INFO : EPOCH 8: training on 58152 raw words (42650 effective words) took 0.0s, 2214889 effective words/s\n", "2025-03-24 22:54:58,436 : INFO : EPOCH 9: training on 58152 raw words (42591 effective words) took 0.0s, 2287620 effective words/s\n", "2025-03-24 22:54:58,456 : INFO : EPOCH 10: training on 58152 raw words (42640 effective words) took 0.0s, 2288285 effective words/s\n", "2025-03-24 22:54:58,476 : INFO : EPOCH 11: training on 58152 raw words (42691 effective words) took 0.0s, 2295925 effective words/s\n", "2025-03-24 22:54:58,496 : INFO : EPOCH 12: training on 58152 raw words (42644 effective words) took 0.0s, 2315286 effective words/s\n", "2025-03-24 22:54:58,516 : INFO : EPOCH 13: training on 58152 raw words (42720 effective words) took 0.0s, 2240261 effective words/s\n", "2025-03-24 22:54:58,536 : INFO : EPOCH 14: training on 58152 raw words (42732 effective words) took 0.0s, 2228222 effective words/s\n", "2025-03-24 22:54:58,556 : INFO : EPOCH 15: training on 58152 raw words (42615 effective words) took 0.0s, 2246008 effective words/s\n", "2025-03-24 22:54:58,576 : INFO 
: EPOCH 16: training on 58152 raw words (42729 effective words) took 0.0s, 2261033 effective words/s\n", "2025-03-24 22:54:58,597 : INFO : EPOCH 17: training on 58152 raw words (42777 effective words) took 0.0s, 2228578 effective words/s\n", "2025-03-24 22:54:58,616 : INFO : EPOCH 18: training on 58152 raw words (42788 effective words) took 0.0s, 2291595 effective words/s\n", "2025-03-24 22:54:58,637 : INFO : EPOCH 19: training on 58152 raw words (42706 effective words) took 0.0s, 2187658 effective words/s\n", "2025-03-24 22:54:58,657 : INFO : EPOCH 20: training on 58152 raw words (42722 effective words) took 0.0s, 2266154 effective words/s\n", "2025-03-24 22:54:58,677 : INFO : EPOCH 21: training on 58152 raw words (42587 effective words) took 0.0s, 2303146 effective words/s\n", "2025-03-24 22:54:58,697 : INFO : EPOCH 22: training on 58152 raw words (42603 effective words) took 0.0s, 2229386 effective words/s\n", "2025-03-24 22:54:58,717 : INFO : EPOCH 23: training on 58152 raw words (42828 effective words) took 0.0s, 2255075 effective words/s\n", "2025-03-24 22:54:58,737 : INFO : EPOCH 24: training on 58152 raw words (42680 effective words) took 0.0s, 2309034 effective words/s\n", "2025-03-24 22:54:58,757 : INFO : EPOCH 25: training on 58152 raw words (42660 effective words) took 0.0s, 2270588 effective words/s\n", "2025-03-24 22:54:58,777 : INFO : EPOCH 26: training on 58152 raw words (42722 effective words) took 0.0s, 2233475 effective words/s\n", "2025-03-24 22:54:58,797 : INFO : EPOCH 27: training on 58152 raw words (42668 effective words) took 0.0s, 2207922 effective words/s\n", "2025-03-24 22:54:58,817 : INFO : EPOCH 28: training on 58152 raw words (42715 effective words) took 0.0s, 2276403 effective words/s\n", "2025-03-24 22:54:58,838 : INFO : EPOCH 29: training on 58152 raw words (42551 effective words) took 0.0s, 2174325 effective words/s\n", "2025-03-24 22:54:58,857 : INFO : EPOCH 30: training on 58152 raw words (42704 effective words) took 0.0s, 2355866 effective words/s\n", "2025-03-24 22:54:58,878 : INFO : EPOCH 31: training on 58152 raw words (42691 effective words) took 0.0s, 2202828 effective words/s\n", "2025-03-24 22:54:58,898 : INFO : EPOCH 32: training on 58152 raw words (42619 effective words) took 0.0s, 2278456 effective words/s\n", "2025-03-24 22:54:58,919 : INFO : EPOCH 33: training on 58152 raw words (42681 effective words) took 0.0s, 2157899 effective words/s\n", "2025-03-24 22:54:58,939 : INFO : EPOCH 34: training on 58152 raw words (42650 effective words) took 0.0s, 2230896 effective words/s\n", "2025-03-24 22:54:58,959 : INFO : EPOCH 35: training on 58152 raw words (42784 effective words) took 0.0s, 2216821 effective words/s\n", "2025-03-24 22:54:58,982 : INFO : EPOCH 36: training on 58152 raw words (42668 effective words) took 0.0s, 1990425 effective words/s\n", "2025-03-24 22:54:59,005 : INFO : EPOCH 37: training on 58152 raw words (42619 effective words) took 0.0s, 1980901 effective words/s\n", "2025-03-24 22:54:59,028 : INFO : EPOCH 38: training on 58152 raw words (42754 effective words) took 0.0s, 2013189 effective words/s\n", "2025-03-24 22:54:59,051 : INFO : EPOCH 39: training on 58152 raw words (42687 effective words) took 0.0s, 1972521 effective words/s\n", "2025-03-24 22:54:59,051 : INFO : Doc2Vec lifecycle event {'msg': 'training on 2326080 raw words (1707385 effective words) took 0.8s, 2059638 effective words/s', 'datetime': '2025-03-24T22:54:59.051615', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] } ], "source": [ "model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can use the trained model to infer a vector for any piece of text\n", "by passing a list of words to the ``model.infer_vector`` function. This\n", "vector can then be compared with other vectors via cosine similarity.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.18305157 -0.32143816 -0.11223511 0.1451582 -0.12424907 -0.05481448\n", " 0.04999834 -0.04683854 -0.23850189 -0.13069412 0.19557707 -0.05232161\n", " 0.06482678 -0.01454878 -0.04377953 -0.1583473 0.12038469 0.17531975\n", " 0.0907402 -0.04079067 -0.01197778 -0.03637475 0.20755337 -0.05807645\n", " -0.00419346 0.04313023 -0.26443157 0.05070484 -0.11251434 -0.07298681\n", " 0.42898747 0.09477782 0.10651541 0.15182146 0.14760782 0.1478032\n", " 0.06508146 -0.24064566 -0.10539611 -0.00808867 0.04140319 0.00125086\n", " 0.09329521 -0.11701933 0.02833019 0.01827002 -0.08145288 -0.08292808\n", " 0.07526062 0.00223478]\n" ] } ], "source": [ "vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])\n", "print(vector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that ``infer_vector()`` does *not* take a string, but rather a list of\n", "string tokens, which should have already been tokenized the same way as the\n", "``words`` property of the original training document objects.\n", "\n", "Also note that because the underlying training/inference algorithms work on\n", "an iterative approximation problem that makes use of internal randomization,\n", "repeated inferences of the same text will return slightly different vectors.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assessing the Model\n", "\n", "To assess our new model, we'll first infer new vectors for each document of\n", "the training corpus, compare the inferred vectors with the training corpus,\n", "and then return the rank of each document based on self-similarity.\n", "Basically, we're pretending as if the training corpus is some new unseen data\n", "and then seeing how it compares with the trained model. The expectation is\n", "that we've likely overfit our model (i.e., all of the ranks will be less than\n", "2) and so we should be able to find similar documents very easily.\n", "Additionally, we'll keep track of the second ranks for a comparison of less\n", "similar documents.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# For each training document, re-infer a vector and record the rank of that\n", "# document's own tag among all doc-vectors, sorted by cosine similarity.\n", "ranks = []\n", "second_ranks = []\n", "for doc_id in range(len(train_corpus)):\n", " inferred_vector = model.infer_vector(train_corpus[doc_id].words)\n", " sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))\n", " rank = [docid for docid, sim in sims].index(doc_id)\n", " ranks.append(rank)\n", "\n", " second_ranks.append(sims[1])" ] }
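, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, hedged sanity check of the randomization noted earlier, re-inferring the same document twice yields similar but not identical vectors - which is also why the rank counts below vary between runs:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Two inferences of the same tokens differ slightly, because inference\n", "# itself is a stochastic, iterative optimization.\n", "v1 = model.infer_vector(train_corpus[0].words)\n", "v2 = model.infer_vector(train_corpus[0].words)\n", "print(np.linalg.norm(v1 - v2))  # small, but not exactly zero" ] }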
, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count how each document ranks with respect to the training corpus.\n", "\n", "NB. Results vary between runs due to random seeding and the very small corpus.\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({0: 291, 1: 9})\n" ] } ], "source": [ "import collections\n", "\n", "counter = collections.Counter(ranks)\n", "print(counter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basically, greater than 95% of the inferred documents are found to be most\n", "similar to themselves, and about 5% of the time a document is mistakenly found\n", "most similar to another document. Checking the inferred-vector against a\n", "training-vector is a sort of 'sanity check' as to whether the model is\n", "behaving in a usefully consistent manner, though not a real 'accuracy' value.\n", "\n", "This is great and not entirely surprising. We can take a look at an example:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»\n", "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec:\n", "\n", "MOST (299, 0.9518495798110962): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat 
french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»\n", "\n", "SECOND-MOST (104, 0.8065248131752014): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»\n", "\n", "MEDIAN (115, 0.23674415051937103): «australia is continuing to negotiate with the united states government in an effort to interview the australian david hicks who was captured fighting alongside taliban forces in afghanistan mr hicks is being held by the united states on board ship in the afghanistan region where the australian federal police and australian security intelligence organisation asio officials are trying to gain access foreign affairs minister alexander downer has also confirmed that the australian government is investigating reports that another australian has been fighting for taliban forces in afghanistan we often get reports of people going to different parts of the world and asking us to investigate them he said we always investigate sometimes it is impossible to find out we just don know in this case but it is not to say that we think there are lot of australians in afghanistan the only case we know is hicks mr downer says it is unclear when mr hicks will be back on 
australian soil but he is hopeful the americans will facilitate australian authorities interviewing him»\n", "\n", "LEAST (243, -0.1268550306558609): «four afghan factions have reached agreement on an interim cabinet during talks in germany the united nations says the administration which will take over from december will be headed by the royalist anti taliban commander hamed karzai it concludes more than week of negotiations outside bonn and is aimed at restoring peace and stability to the war ravaged country the year old former deputy foreign minister who is currently battling the taliban around the southern city of kandahar is an ally of the exiled afghan king mohammed zahir shah he will serve as chairman of an interim authority that will govern afghanistan for six month period before loya jirga or grand traditional assembly of elders in turn appoints an month transitional government meanwhile united states marines are now reported to have been deployed in eastern afghanistan where opposition forces are closing in on al qaeda soldiers reports from the area say there has been gun battle between the opposition and al qaeda close to the tora bora cave complex where osama bin laden is thought to be hiding in the south of the country american marines are taking part in patrols around the air base they have secured near kandahar but are unlikely to take part in any assault on the city however the chairman of the joint chiefs of staff general richard myers says they are prepared for anything they are prepared for engagements they re robust fighting force and they re absolutely ready to engage if that required he said»\n", "\n" ] } ], "source": [ "print('Document ({}): «{}»\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\n", "print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\n", "for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n", " print(u'%s %s: «%s»\\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice above that the most similar document (usually the same text) has a\n", "similarity score approaching 1.0. However, the similarity score for the\n", "second-ranked documents should be significantly lower (assuming the documents\n", "are in fact different), and the reasoning becomes obvious when we examine the\n", "text itself.\n" ] }
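, { "cell_type": "markdown", "metadata": {}, "source": [ "The stored doc-vectors can also be compared directly by tag, without re-inference. A quick sketch (tags ``0`` and ``1`` are simply the first two training documents):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cosine similarity between stored doc-vectors, looked up by tag.\n", "# A document compared with itself scores 1.0; unrelated pairs score lower.\n", "print(model.dv.similarity(0, 0))\n", "print(model.dv.similarity(0, 1))" ] }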
, { "cell_type": "markdown", "metadata": {}, "source": [ "We can run the next cell repeatedly to see a sampling of other target-document\n", "comparisons.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Document (128): «qantas has moved to assure travellers there will be no disruption to flights over the christmas period despite threats of industrial action qantas maintenance workers have rejected the airline proposals for wage freeze as negotiations over the dispute continue in the industrial relations commission qantas chief executive geoff dixon has expressed his disappointment at the maintenance workers actions mr dixon points out per cent of the airline workforce have already agreed in principle to accept wages freeze together with an incentives scheme mr dixon claims maintenance workers earn on average per cent above average weekly earnings and also receive generous staff travel benefits mr dixon has assured nervous travellers that even if the workers do go out on strike qantas flights will not be disrupted maintenance unions are refusing to soften their stance against qantas wage freeze proposal qantas and two maintenance unions are continuing negotiations in the industrial relations commission where unions produced leaked airline briefing paper which says qantas is prepared to escalate the strike to force resolution»\n", "\n", "Similar Document (135, 0.8998598456382751): «dispute which could threaten air services returns to the industrial relations commission today qantas maintenance workers have rejected the airline proposals for wages freeze the dispute involving maintenance workers has been running for around six months after lengthy negotiations last weekend qantas had sought ballot of the maintenance workers the unions claim per cent of the workforce voted against the company latest offer the national secretary of the australian manufacturing workers union amwu doug cameron did not rule out the grounding of qantas jets if the dispute continues and he says the company would only have itself to blame if qantas doesn come to the party think it inevitable that the industrial action will continue and that will be qantas responsibility he said»\n", "\n" ] } ], "source": [ "# Pick a random document from the corpus and infer a vector from the model\n", "import random\n", "doc_id = random.randint(0, len(train_corpus) - 1)\n", "\n", "# Compare and print the second-most-similar document\n", "print('Train Document ({}): «{}»\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\n", "sim_id = second_ranks[doc_id]\n", "print('Similar Document {}: «{}»\\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing the Model\n", "\n", "Using the same approach above, we'll infer the vector for a randomly chosen\n", "test document, and compare the document to our model by eye.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Document (15): «the bush administration has drawn up plans to escalate the war of words against iraq with new campaigns to step up pressure on baghdad and rally world opinion behind the us drive to oust president saddam 
hussein this week the state department will begin mobilising iraqis from across north america europe and the arab world training them to appear on talk shows write opinion articles and give speeches on reasons to end president saddam rule»\n", "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec:\n", "\n", "MOST (57, 0.7186471819877625): «afghanistan new interim government is to meet for the first time later today after an historic inauguration ceremony in the afghan capital kabul interim president hamid karzai and his fellow cabinet members are looking to start rebuilding afghanistan war ravaged economy mr karzai says he expects the reconstruction to cost many billions of dollars after years of war afghanistan must go from an economy of war to an economy of peace mr karzai said those people who ve earned living by taking the gun must be enabled with programs with plans with projects to put the gun aside and go to the various other forms of economic activity that can bring them livelihood he said»\n", "\n", "MEDIAN (136, 0.3209247589111328): «new report suggests the costs of an aging australian population have been exaggerated the report issued by the australia institute says detailed examination of population and health data shows an aging population will not create an unsustainable burden on shrinking workforce far from being an economic and social burden it found the majority of older people enjoyed healthy and independent lives many making financial contributions to their families and participating in voluntary community activities the paper challenges the assumption an older population will see health costs rise to unsustainable levels it says rising health costs are caused mainly by factors other than aging such as the growth of medical technology rising consumer demand and escalating prices»\n", "\n", "LEAST (129, -0.028218572959303856): «the governor general will issue statement this week to answer allegations about his response to alleged sexual abuse at queensland school dr peter hollingworth was the anglican archbishop of brisbane when teacher at toowoomba anglican school allegedly abused students there more than decade ago pressure has been mounting on dr hollingworth to speak out after public criticism of his role in responding to the claims of abuse spokeswoman says dr hollingworth is becoming concerned that if he does not respond publicly to the allegations he may jeopardise the standing of the position of governor general the spokeswoman says dr hollingworth will issue written statement in the next few days after obtaining legal advice four people were killed and eight others injured when fire broke out overnight at hotel in central paris fire service spokesperson says the fire which was brought under control within two hours could have been an act of arson the number of people staying in the hotel du palais at the time the fire was not immediately known the inferno began at around am in the elevator shaft of the six storey hotel next to the theatre du chatelet in paris first arrondissement the centre of the french capital the flames spread quickly via the shaft to the building roof firemen helped several hotel guests to safety through the windows of their rooms two of the victims were found asphyxiated on the fifth floor one of the injured was said to be in serious condition in hospital according to police one man was arrested at the scene and an inquiry has been opened the theatre was undamaged»\n", "\n" ] } ], "source": [ "# Pick a random document from the test corpus and 
infer a vector from the model\n", "doc_id = random.randint(0, len(test_corpus) - 1)\n", "inferred_vector = model.infer_vector(test_corpus[doc_id])\n", "sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))\n", "\n", "# Compare and print the most/median/least similar documents from the train corpus\n", "print('Test Document ({}): «{}»\\n'.format(doc_id, ' '.join(test_corpus[doc_id])))\n", "print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\n", "for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n", " print(u'%s %s: «%s»\\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Let's review what we've seen in this tutorial:\n", "\n", "0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n", "1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n", "2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n", "3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n", "4. Assess the model\n", "5. Test the model on the test corpus\n", "\n", "That's it! Doc2Vec is a great way to explore relationships between documents.\n", "\n", "## Additional Resources\n", "\n", "If you'd like to know more about the subject matter of this tutorial, check out the links below.\n", "\n", "* [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)\n", "* [Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)\n", "* [Dr. Michael D. Lee's Website](http://faculty.sites.uci.edu/mdlee)\n", "* [Lee Corpus](http://faculty.sites.uci.edu/mdlee/similarity-data/)\n", "* [IMDB Doc2Vec Tutorial](doc2vec-IMDB.ipynb)\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 1 }