{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Word2Vec Model\n", "==============\n", "\n", "Introduces Gensim's Word2Vec model and demonstrates its use on the `Lee Evaluation Corpus\n", "`_.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you missed the buzz, Word2Vec is a widely used algorithm based on neural\n", "networks, commonly referred to as \"deep learning\" (though word2vec itself is rather shallow).\n", "Using large amounts of unannotated plain text, word2vec learns relationships\n", "between words automatically. The output is a set of vectors, one per word,\n", "with remarkable linear relationships that allow us to do things like:\n", "\n", "* vec(\"king\") - vec(\"man\") + vec(\"woman\") =~ vec(\"queen\")\n", "* vec(\"Montreal Canadiens\") - vec(\"Montreal\") + vec(\"Toronto\") =~ vec(\"Toronto Maple Leafs\").\n", "\n", "Word2vec is very useful in `automatic text tagging\n", "`_\\ , recommender\n", "systems and machine translation.\n", "\n", "This tutorial:\n", "\n", "#. Introduces ``Word2Vec`` as an improvement over traditional bag-of-words\n", "#. Shows off a demo of ``Word2Vec`` using a pre-trained model\n", "#. Demonstrates training a new model from your own data\n", "#. Demonstrates loading and saving models\n", "#. Introduces several training parameters and demonstrates their effect\n", "#. Discusses memory requirements\n", "#. Visualizes Word2Vec embeddings by applying dimensionality reduction\n", "\n", "Review: Bag-of-words\n", "--------------------\n", "\n", ".. 
Note:: Feel free to skip these review sections if you're already familiar with the models.\n", "\n", "You may be familiar with the `bag-of-words model\n", "`_ from the\n", "`core_concepts_vector` section.\n", "This model transforms each document to a fixed-length vector of integers.\n", "For example, given the sentences:\n", "\n", "- ``John likes to watch movies. Mary likes movies too.``\n", "- ``John also likes to watch football games. Mary hates football.``\n", "\n", "The model outputs the vectors:\n", "\n", "- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n", "- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n", "\n", "Each vector has 11 elements, where each element counts the number of times a\n", "particular word occurred in the document.\n", "The order of elements is arbitrary.\n", "In the example above, the order of the elements corresponds to the words:\n", "``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n", "\n", "Bag-of-words models are surprisingly effective, but have several weaknesses.\n", "\n", "First, they lose all information about word order: \"John likes Mary\" and\n", "\"Mary likes John\" correspond to identical vectors. There is a partial solution: bag\n", "of `n-grams `__\n", "models consider word phrases of length n to represent documents as\n", "fixed-length vectors. This captures local word order, but suffers from data\n", "sparsity and high dimensionality.\n", "\n", "Second, the model does not attempt to learn the meaning of the underlying\n", "words, and as a consequence, the distance between vectors doesn't always\n", "reflect the difference in meaning. The ``Word2Vec`` model addresses this\n", "second problem.\n", "\n", "Introducing: the ``Word2Vec`` Model\n", "-----------------------------------\n", "\n", "``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\n", "vector space using a shallow neural network. 
The result is a set of\n", "word-vectors where vectors close together in vector space have similar\n", "meanings based on context, and word-vectors distant to each other have\n", "differing meanings. For example, ``strong`` and ``powerful`` would be close\n", "together and ``strong`` and ``Paris`` would be relatively far.\n", "\n", "There are two versions of this model, and the :py:class:`~gensim.models.word2vec.Word2Vec`\n", "class implements them both:\n", "\n", "1. Skip-grams (SG)\n", "2. Continuous-bag-of-words (CBOW)\n", "\n", ".. Important::\n", " Don't let the implementation details below scare you.\n", " They're advanced material: if it's too much, then move on to the next section.\n", "\n", "The `Word2Vec Skip-gram `__\n", "model, for example, takes in pairs (word1, word2) generated by moving a\n", "window across text data, and trains a 1-hidden-layer neural network on\n", "the synthetic task of predicting, for a given input word, a probability\n", "distribution over the words near it. A virtual `one-hot\n", "`__ encoding of words\n", "goes through a 'projection layer' to the hidden layer; these projection\n", "weights are later interpreted as the word embeddings. So if the hidden layer\n", "has 300 neurons, this network will give us 300-dimensional word embeddings.\n", "\n", "Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It\n", "is also a 1-hidden-layer neural network. The synthetic training task now uses\n", "the average of multiple input context words, rather than a single word as in\n", "skip-gram, to predict the center word. Again, the projection weights that\n", "turn one-hot words into averageable vectors, of the same width as the hidden\n", "layer, are interpreted as the word embeddings.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Word2Vec Demo\n", "-------------\n", "\n", "To see what ``Word2Vec`` can do, let's download a pre-trained model and play\n", "around with it. 
We will fetch the Word2Vec model trained on part of the\n", "Google News dataset, covering approximately 3 million words and phrases. Such\n", "a model can take hours to train, but since it's already available,\n", "downloading and loading it with Gensim takes minutes.\n", "\n", ".. Important::\n", " The model is approximately 2GB, so you'll need a decent network connection\n", " to proceed. Otherwise, skip ahead to the \"Training Your Own Model\" section\n", " below.\n", "\n", "You may also check out an `online word2vec demo\n", "`_ where you can try\n", "this vector algebra for yourself. That demo runs ``word2vec`` on the\n", "**entire** Google News dataset, of **about 100 billion words**.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:15:23,239 : INFO : loading projection weights from /Users/vip/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz\n", "2025-03-27 14:15:43,108 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /Users/vip/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2025-03-27T14:15:43.108564', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'load_word2vec_format'}\n" ] } ], "source": [ "import gensim.downloader as api\n", "wv = api.load('word2vec-google-news-300')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common operation is to retrieve the vocabulary of a model. 
That is trivial:\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word #0/3000000 is \n", "word #1/3000000 is in\n", "word #2/3000000 is for\n", "word #3/3000000 is that\n", "word #4/3000000 is is\n", "word #5/3000000 is on\n", "word #6/3000000 is ##\n", "word #7/3000000 is The\n", "word #8/3000000 is with\n", "word #9/3000000 is said\n" ] } ], "source": [ "for index, word in enumerate(wv.index_to_key):\n", "    if index == 10:\n", "        break\n", "    print(f\"word #{index}/{len(wv.index_to_key)} is {word}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily obtain vectors for terms the model is familiar with:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "vec_king = wv['king']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, the model is unable to infer vectors for unfamiliar words.\n", "This is one limitation of Word2Vec: if this limitation matters to you, check\n", "out the FastText model.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The word 'cameroon' does not appear in this model\n" ] } ], "source": [ "try:\n", "    vec_cameroon = wv['cameroon']\n", "except KeyError:\n", "    print(\"The word 'cameroon' does not appear in this model\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moving on, ``Word2Vec`` supports several word similarity tasks out of the\n", "box. 
You can see how the similarity intuitively decreases as the words get\n", "less and less similar.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'car'\t'minivan'\t0.69\n", "'car'\t'bicycle'\t0.54\n", "'car'\t'airplane'\t0.42\n", "'car'\t'cereal'\t0.14\n", "'car'\t'communism'\t0.06\n" ] } ], "source": [ "pairs = [\n", " ('car', 'minivan'), # a minivan is a kind of car\n", " ('car', 'bicycle'), # still a wheeled vehicle\n", " ('car', 'airplane'), # ok, no wheels, but still a vehicle\n", " ('car', 'cereal'), # ... and so on\n", " ('car', 'communism'),\n", "]\n", "for w1, w2 in pairs:\n", " print('%r\\t%r\\t%.2f' % (w1, w2, wv.similarity(w1, w2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the 5 most similar words to \"car\" or \"minivan\"\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('SUV', 0.8532192707061768), ('vehicle', 0.8175783753395081), ('pickup_truck', 0.7763689756393433), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565719485282898)]\n" ] } ], "source": [ "print(wv.most_similar(positive=['car', 'minivan'], topn=5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the below does not belong in the sequence?\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "car\n" ] } ], "source": [ "print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training Your Own Model\n", "-----------------------\n", "\n", "To start, you'll need some data for training the model. 
For the following\n", "examples, we'll use the `Lee Evaluation Corpus\n", "`_\n", "(which you `already have\n", "`_\n", "if you've installed Gensim).\n", "\n", "This corpus is small enough to fit entirely in memory, but we'll implement a\n", "memory-friendly iterator that reads it line-by-line to demonstrate how you\n", "would handle a larger corpus.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:16:43,057 : INFO : adding document #0 to Dictionary<0 unique tokens: []>\n", "2025-03-27 14:16:43,058 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)\n", "2025-03-27 14:16:43,058 : INFO : Dictionary lifecycle event {'msg': \"built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)\", 'datetime': '2025-03-27T14:16:43.058804', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "from gensim.test.utils import datapath\n", "from gensim import utils\n", "\n", "class MyCorpus:\n", "    \"\"\"An iterable that yields sentences (lists of str).\"\"\"\n", "\n", "    def __iter__(self):\n", "        corpus_path = datapath('lee_background.cor')\n", "        # open inside a with-block so the file is closed when iteration ends\n", "        with open(corpus_path) as fin:\n", "            for line in fin:\n", "                # assume there's one document per line, tokens separated by whitespace\n", "                yield utils.simple_preprocess(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any custom preprocessing (e.g. decoding a non-standard encoding,\n", "lowercasing, removing numbers, extracting named entities) can\n", "be done inside the ``MyCorpus`` iterator, and ``word2vec`` doesn’t need to\n", "know. 
All that is required is that the input yields one sentence (list of\n", "utf8 words) after another.\n", "\n", "Let's go ahead and train a model on our corpus. Don't worry about the\n", "training parameters much for now, we'll revisit them later.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:00,060 : INFO : collecting all words and their counts\n", "2025-03-27 14:17:00,063 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:17:00,127 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences\n", "2025-03-27 14:17:00,127 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:17:00,131 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-03-27T14:17:00.131400', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:00,131 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-03-27T14:17:00.131712', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:00,135 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-27 14:17:00,135 : INFO : sample=0.001 downsamples 51 most-common words\n", "2025-03-27 14:17:00,136 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 35935.33721568072 word corpus (72.8%% of prior 49335)', 'datetime': '2025-03-27T14:17:00.136217', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:00,142 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes\n", "2025-03-27 14:17:00,142 : INFO : resetting layer weights\n", "2025-03-27 14:17:00,143 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:17:00.143668', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:17:00,172 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:17:00.172864', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:00,230 : INFO : EPOCH 0: training on 58152 raw words (35936 effective words) took 0.1s, 638228 effective words/s\n", "2025-03-27 14:17:00,286 : INFO : EPOCH 1: training on 58152 raw words (35991 effective words) took 0.1s, 656303 effective words/s\n", "2025-03-27 14:17:00,339 : INFO : EPOCH 2: training on 58152 raw words (35921 effective words) took 0.1s, 692074 effective words/s\n", "2025-03-27 14:17:00,393 : INFO : EPOCH 3: training on 58152 raw words (35959 effective words) took 0.1s, 676909 effective words/s\n", "2025-03-27 14:17:00,447 : INFO : EPOCH 4: training on 58152 raw words (35979 effective words) took 0.1s, 684933 effective words/s\n", "2025-03-27 14:17:00,447 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (179786 effective words) took 0.3s, 654816 effective words/s', 'datetime': '2025-03-27T14:17:00.447817', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:00,448 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:17:00.448036', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "import gensim.models\n", "\n", "sentences = MyCorpus()\n", "model = gensim.models.Word2Vec(sentences=sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our model, we can use it in the same way as in the demo above.\n", "\n", "The main part of the model is ``model.wv``\\ , where \"wv\" stands for \"word vectors\".\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "vec_king = model.wv['king']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieving the vocabulary works the same way:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index, word in enumerate(model.wv.index_to_key):\n", "    if index == 10:\n", "        break\n", "    print(f\"word #{index}/{len(model.wv.index_to_key)} is {word}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Storing and loading models\n", "--------------------------\n", "\n", "You'll notice that training non-trivial models can take time. Once you've\n", "trained your model and it works as expected, you can save it to disk. 
That\n", "way, you don't have to spend time training it all over again later.\n", "\n", "You can store/load models using the standard gensim methods:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:15,337 : INFO : Word2Vec lifecycle event {'fname_or_handle': '/var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-03-27T14:17:15.337430', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'saving'}\n", "2025-03-27 14:17:15,339 : INFO : not storing attribute cum_table\n", "2025-03-27 14:17:15,345 : INFO : saved /var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4\n", "2025-03-27 14:17:15,346 : INFO : loading Word2Vec object from /var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4\n", "2025-03-27 14:17:15,347 : INFO : loading wv recursively from /var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4.wv.* with mmap=None\n", "2025-03-27 14:17:15,348 : INFO : setting ignored attribute cum_table to None\n", "2025-03-27 14:17:15,356 : INFO : Word2Vec lifecycle event {'fname': '/var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4', 'datetime': '2025-03-27T14:17:15.356060', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'loaded'}\n" ] } ], "source": [ "import tempfile\n", "\n", "with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:\n", " temporary_filepath = tmp.name\n", " model.save(temporary_filepath)\n", " #\n", " # The model is now safely stored in the filepath.\n", " # You can copy it to other machines, share it with others, etc.\n", " 
#\n", " # To load a saved model:\n", " #\n", " new_model = gensim.models.Word2Vec.load(temporary_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Saving and loading use pickle internally, optionally ``mmap``\\ ‘ing the model’s internal\n", "large NumPy matrices into virtual memory directly from disk files, for\n", "inter-process memory sharing.\n", "\n", "In addition, you can load models created by the original C tool, using both\n", "its text and binary formats::\n", "\n", " model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)\n", " # using gzipped/bz2 input works too, no need to unzip\n", " model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training Parameters\n", "-------------------\n", "\n", "``Word2Vec`` accepts several parameters that affect both training speed and quality.\n", "\n", "min_count\n", "---------\n", "\n", "``min_count`` is for pruning the internal dictionary. Words that appear only\n", "once or twice in a billion-word corpus are probably uninteresting typos and\n", "garbage. 
In addition, there’s not enough data to train meaningfully\n", "on those words, so it’s best to ignore them.\n", "\n", "The default value is ``min_count=5``.\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:26,709 : INFO : collecting all words and their counts\n", "2025-03-27 14:17:26,711 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:17:26,773 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences\n", "2025-03-27 14:17:26,774 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:17:26,776 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 retains 889 unique words (12.73% of original 6981, drops 6092)', 'datetime': '2025-03-27T14:17:26.776847', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:26,777 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 leaves 43776 word corpus (75.28% of original 58152, drops 14376)', 'datetime': '2025-03-27T14:17:26.777106', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:26,779 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-27 14:17:26,779 : INFO : sample=0.001 downsamples 55 most-common words\n", "2025-03-27 14:17:26,779 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 29691.39528319831 word corpus (67.8%% of prior 43776)', 'datetime': '2025-03-27T14:17:26.779815', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 
14:17:26,782 : INFO : estimated required memory for 889 words and 100 dimensions: 1155700 bytes\n", "2025-03-27 14:17:26,783 : INFO : resetting layer weights\n", "2025-03-27 14:17:26,783 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:17:26.783950', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:17:26,784 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 889 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:17:26.784135', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:26,838 : INFO : EPOCH 0: training on 58152 raw words (29735 effective words) took 0.1s, 555663 effective words/s\n", "2025-03-27 14:17:26,890 : INFO : EPOCH 1: training on 58152 raw words (29637 effective words) took 0.1s, 580644 effective words/s\n", "2025-03-27 14:17:26,942 : INFO : EPOCH 2: training on 58152 raw words (29668 effective words) took 0.1s, 579282 effective words/s\n", "2025-03-27 14:17:26,994 : INFO : EPOCH 3: training on 58152 raw words (29757 effective words) took 0.1s, 590708 effective words/s\n", "2025-03-27 14:17:27,046 : INFO : EPOCH 4: training on 58152 raw words (29649 effective words) took 0.1s, 580325 effective words/s\n", "2025-03-27 14:17:27,046 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (148446 effective words) took 0.3s, 566255 effective words/s', 'datetime': '2025-03-27T14:17:27.046456', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:27,046 : INFO : Word2Vec 
lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:17:27.046663', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "model = gensim.models.Word2Vec(sentences, min_count=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "vector_size\n", "-----------\n", "\n", "``vector_size`` is the number of dimensions (N) of the N-dimensional space that\n", "gensim Word2Vec maps the words onto.\n", "\n", "Bigger size values require more training data, but can lead to better (more\n", "accurate) models. Reasonable values are in the tens to hundreds.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:38,448 : INFO : collecting all words and their counts\n", "2025-03-27 14:17:38,452 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:17:38,515 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences\n", "2025-03-27 14:17:38,516 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:17:38,519 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-03-27T14:17:38.519961', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:38,520 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-03-27T14:17:38.520639', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 
14:17:38,524 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-27 14:17:38,524 : INFO : sample=0.001 downsamples 51 most-common words\n", "2025-03-27 14:17:38,525 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 35935.33721568072 word corpus (72.8%% of prior 49335)', 'datetime': '2025-03-27T14:17:38.525204', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:38,536 : INFO : estimated required memory for 1750 words and 200 dimensions: 3675000 bytes\n", "2025-03-27 14:17:38,537 : INFO : resetting layer weights\n", "2025-03-27 14:17:38,539 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:17:38.539299', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:17:38,539 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:17:38.539522', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:38,595 : INFO : EPOCH 0: training on 58152 raw words (35994 effective words) took 0.1s, 658978 effective words/s\n", "2025-03-27 14:17:38,651 : INFO : EPOCH 1: training on 58152 raw words (36034 effective words) took 0.1s, 653169 effective words/s\n", "2025-03-27 14:17:38,706 : INFO : EPOCH 2: training on 58152 raw words (35839 effective words) took 0.1s, 661644 effective words/s\n", "2025-03-27 14:17:38,761 : INFO : EPOCH 3: training on 58152 raw words (35837 effective words) took 0.1s, 658906 effective words/s\n", 
"2025-03-27 14:17:38,818 : INFO : EPOCH 4: training on 58152 raw words (35892 effective words) took 0.1s, 642252 effective words/s\n", "2025-03-27 14:17:38,819 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (179596 effective words) took 0.3s, 642691 effective words/s', 'datetime': '2025-03-27T14:17:38.819160', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:38,819 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:17:38.819391', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "# The default value of vector_size is 100.\n", "model = gensim.models.Word2Vec(sentences, vector_size=200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "workers\n", "-------\n", "\n", "``workers`` , the last of the major parameters (full list `here\n", "`_)\n", "is for training parallelization, to speed up training:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:43,033 : INFO : collecting all words and their counts\n", "2025-03-27 14:17:43,040 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:17:43,094 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences\n", "2025-03-27 14:17:43,094 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:17:43,099 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.07% of original 6981, drops 5231)', 'datetime': '2025-03-27T14:17:43.099101', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:43,099 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.84% of original 58152, drops 8817)', 'datetime': '2025-03-27T14:17:43.099482', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:43,103 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-27 14:17:43,104 : INFO : sample=0.001 downsamples 51 most-common words\n", "2025-03-27 14:17:43,104 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 35935.33721568072 word corpus (72.8%% of prior 49335)', 'datetime': '2025-03-27T14:17:43.104893', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:17:43,110 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes\n", "2025-03-27 14:17:43,111 : INFO : resetting layer weights\n", "2025-03-27 14:17:43,112 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:17:43.112633', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:17:43,112 : INFO : Word2Vec lifecycle event {'msg': 'training model with 4 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:17:43.112831', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:43,169 : INFO : EPOCH 0: training on 58152 raw words (35936 effective 
words) took 0.1s, 642622 effective words/s\n", "2025-03-27 14:17:43,224 : INFO : EPOCH 1: training on 58152 raw words (35991 effective words) took 0.1s, 678474 effective words/s\n", "2025-03-27 14:17:43,276 : INFO : EPOCH 2: training on 58152 raw words (35921 effective words) took 0.1s, 696209 effective words/s\n", "2025-03-27 14:17:43,331 : INFO : EPOCH 3: training on 58152 raw words (35959 effective words) took 0.1s, 673475 effective words/s\n", "2025-03-27 14:17:43,385 : INFO : EPOCH 4: training on 58152 raw words (35979 effective words) took 0.1s, 672566 effective words/s\n", "2025-03-27 14:17:43,385 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (179786 effective words) took 0.3s, 658780 effective words/s', 'datetime': '2025-03-27T14:17:43.385938', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:17:43,386 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:17:43.386162', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] } ], "source": [ "# default value of workers=3 (tutorial says 1...)\n", "model = gensim.models.Word2Vec(sentences, workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``workers`` parameter only has an effect if you have `Cython\n", "`_ installed. Without Cython, you’ll only be able to use\n", "one core because of the `GIL\n", "`_ (and ``word2vec``\n", "training will be `miserably slow\n", "`_\\ ).\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Memory\n", "------\n", "\n", "At its core, ``word2vec`` model parameters are stored as matrices (NumPy\n", "arrays). 
Each array holds **#vocabulary** (controlled by the ``min_count`` parameter)\n", "times **vector size** (the ``vector_size`` parameter) floats (single precision, i.e. 4 bytes each).\n", "\n", "Three such matrices are held in RAM (work is underway to reduce that number\n", "to two, or even one). So if your input contains 100,000 unique words and you\n", "ask for ``vector_size=200``\\ , the model will require approximately\n", "``100,000*200*4*3 bytes = 240,000,000 bytes = ~229 MiB``.\n", "\n", "A little extra memory is needed to store the vocabulary tree (100,000 words would\n", "take a few megabytes), but unless your words are extremely loooong strings, the memory\n", "footprint will be dominated by the three matrices above.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating\n", "----------\n", "\n", "``Word2Vec`` training is an unsupervised task, so there’s no good way to\n", "objectively evaluate the result. Evaluation depends on your end application.\n", "\n", "Google has released a test set of about 20,000 syntactic and semantic\n", "examples, following the “A is to B as C is to D” task. 
It is provided in\n", "the 'datasets' folder.\n", "\n", "For example, a syntactic analogy of the comparative type is ``bad:worse;good:?``.\n", "There are a total of 9 types of syntactic comparisons in the dataset, such as\n", "plural nouns and nouns of opposite meaning.\n", "\n", "The semantic questions contain five types of semantic analogies, such as\n", "capital cities (``Paris:France;Tokyo:?``) or family members\n", "(``brother:sister;dad:?``).\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gensim supports the same evaluation set, in exactly the same format:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:17:57,250 : INFO : Evaluating word analogies for top 300000 words in the model on /Users/vip/Library/Python/3.9/lib/python/site-packages/gensim/test/test_data/questions-words.txt\n", "2025-03-27 14:17:57,305 : INFO : capital-common-countries: 0.0% (0/6)\n", "2025-03-27 14:17:57,329 : INFO : capital-world: 0.0% (0/2)\n", "2025-03-27 14:17:57,373 : INFO : family: 0.0% (0/6)\n", "2025-03-27 14:17:57,407 : INFO : gram3-comparative: 0.0% (0/20)\n", "2025-03-27 14:17:57,431 : INFO : gram4-superlative: 0.0% (0/12)\n", "2025-03-27 14:17:57,457 : INFO : gram5-present-participle: 0.0% (0/20)\n", "2025-03-27 14:17:57,504 : INFO : gram6-nationality-adjective: 0.0% (0/30)\n", "2025-03-27 14:17:57,536 : INFO : gram7-past-tense: 0.0% (0/20)\n", "2025-03-27 14:17:57,581 : INFO : gram8-plural: 0.0% (0/30)\n", "2025-03-27 14:17:57,585 : INFO : Quadruplets with out-of-vocabulary words: 99.3%\n", "2025-03-27 14:17:57,586 : INFO : NB: analogies containing OOV words were skipped from evaluation! 
To change this behavior, use \"dummy4unknown=True\"\n", "2025-03-27 14:17:57,587 : INFO : Total accuracy: 0.0% (0/146)\n" ] }, { "data": { "text/plain": [ "(0.0,\n", " [{'section': 'capital-common-countries',\n", " 'correct': [],\n", " 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),\n", " ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),\n", " ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),\n", " ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),\n", " ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),\n", " ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},\n", " {'section': 'capital-world',\n", " 'correct': [],\n", " 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),\n", " ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},\n", " {'section': 'currency', 'correct': [], 'incorrect': []},\n", " {'section': 'city-in-state', 'correct': [], 'incorrect': []},\n", " {'section': 'family',\n", " 'correct': [],\n", " 'incorrect': [('HE', 'SHE', 'HIS', 'HER'),\n", " ('HE', 'SHE', 'MAN', 'WOMAN'),\n", " ('HIS', 'HER', 'MAN', 'WOMAN'),\n", " ('HIS', 'HER', 'HE', 'SHE'),\n", " ('MAN', 'WOMAN', 'HE', 'SHE'),\n", " ('MAN', 'WOMAN', 'HIS', 'HER')]},\n", " {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},\n", " {'section': 'gram2-opposite', 'correct': [], 'incorrect': []},\n", " {'section': 'gram3-comparative',\n", " 'correct': [],\n", " 'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),\n", " ('GOOD', 'BETTER', 'LONG', 'LONGER'),\n", " ('GOOD', 'BETTER', 'LOW', 'LOWER'),\n", " ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),\n", " ('GREAT', 'GREATER', 'LONG', 'LONGER'),\n", " ('GREAT', 'GREATER', 'LOW', 'LOWER'),\n", " ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),\n", " ('GREAT', 'GREATER', 'GOOD', 'BETTER'),\n", " ('LONG', 'LONGER', 'LOW', 'LOWER'),\n", " ('LONG', 'LONGER', 'SMALL', 'SMALLER'),\n", " ('LONG', 'LONGER', 'GOOD', 'BETTER'),\n", " ('LONG', 'LONGER', 'GREAT', 'GREATER'),\n", " ('LOW', 'LOWER', 'SMALL', 'SMALLER'),\n", " ('LOW', 'LOWER', 'GOOD', 
'BETTER'),\n", " ('LOW', 'LOWER', 'GREAT', 'GREATER'),\n", " ('LOW', 'LOWER', 'LONG', 'LONGER'),\n", " ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),\n", " ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),\n", " ('SMALL', 'SMALLER', 'LONG', 'LONGER'),\n", " ('SMALL', 'SMALLER', 'LOW', 'LOWER')]},\n", " {'section': 'gram4-superlative',\n", " 'correct': [],\n", " 'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),\n", " ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),\n", " ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),\n", " ('GOOD', 'BEST', 'GREAT', 'GREATEST'),\n", " ('GOOD', 'BEST', 'LARGE', 'LARGEST'),\n", " ('GOOD', 'BEST', 'BIG', 'BIGGEST'),\n", " ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),\n", " ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),\n", " ('GREAT', 'GREATEST', 'GOOD', 'BEST'),\n", " ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),\n", " ('LARGE', 'LARGEST', 'GOOD', 'BEST'),\n", " ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]},\n", " {'section': 'gram5-present-participle',\n", " 'correct': [],\n", " 'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),\n", " ('GO', 'GOING', 'PLAY', 'PLAYING'),\n", " ('GO', 'GOING', 'RUN', 'RUNNING'),\n", " ('GO', 'GOING', 'SAY', 'SAYING'),\n", " ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),\n", " ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),\n", " ('LOOK', 'LOOKING', 'SAY', 'SAYING'),\n", " ('LOOK', 'LOOKING', 'GO', 'GOING'),\n", " ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),\n", " ('PLAY', 'PLAYING', 'SAY', 'SAYING'),\n", " ('PLAY', 'PLAYING', 'GO', 'GOING'),\n", " ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),\n", " ('RUN', 'RUNNING', 'SAY', 'SAYING'),\n", " ('RUN', 'RUNNING', 'GO', 'GOING'),\n", " ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),\n", " ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),\n", " ('SAY', 'SAYING', 'GO', 'GOING'),\n", " ('SAY', 'SAYING', 'LOOK', 'LOOKING'),\n", " ('SAY', 'SAYING', 'PLAY', 'PLAYING'),\n", " ('SAY', 'SAYING', 'RUN', 'RUNNING')]},\n", " {'section': 'gram6-nationality-adjective',\n", " 'correct': [],\n", " 'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),\n", " 
('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),\n", " ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),\n", " ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),\n", " ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),\n", " ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),\n", " ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),\n", " ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),\n", " ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),\n", " ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),\n", " ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),\n", " ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),\n", " ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),\n", " ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),\n", " ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),\n", " ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),\n", " ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),\n", " ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),\n", " ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),\n", " ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),\n", " ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),\n", " ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]},\n", " {'section': 'gram7-past-tense',\n", " 'correct': [],\n", " 'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),\n", " ('GOING', 'WENT', 'PLAYING', 'PLAYED'),\n", " ('GOING', 'WENT', 'SAYING', 'SAID'),\n", " ('GOING', 'WENT', 'TAKING', 'TOOK'),\n", " ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),\n", " ('PAYING', 'PAID', 'SAYING', 'SAID'),\n", " ('PAYING', 'PAID', 'TAKING', 'TOOK'),\n", " ('PAYING', 'PAID', 'GOING', 'WENT'),\n", " ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),\n", " ('PLAYING', 
'PLAYED', 'TAKING', 'TOOK'),\n", " ('PLAYING', 'PLAYED', 'GOING', 'WENT'),\n", " ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),\n", " ('SAYING', 'SAID', 'TAKING', 'TOOK'),\n", " ('SAYING', 'SAID', 'GOING', 'WENT'),\n", " ('SAYING', 'SAID', 'PAYING', 'PAID'),\n", " ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),\n", " ('TAKING', 'TOOK', 'GOING', 'WENT'),\n", " ('TAKING', 'TOOK', 'PAYING', 'PAID'),\n", " ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),\n", " ('TAKING', 'TOOK', 'SAYING', 'SAID')]},\n", " {'section': 'gram8-plural',\n", " 'correct': [],\n", " 'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),\n", " ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),\n", " ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),\n", " ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),\n", " ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),\n", " ('CAR', 'CARS', 'CHILD', 'CHILDREN'),\n", " ('CAR', 'CARS', 'MAN', 'MEN'),\n", " ('CAR', 'CARS', 'ROAD', 'ROADS'),\n", " ('CAR', 'CARS', 'WOMAN', 'WOMEN'),\n", " ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),\n", " ('CHILD', 'CHILDREN', 'MAN', 'MEN'),\n", " ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),\n", " ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),\n", " ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),\n", " ('CHILD', 'CHILDREN', 'CAR', 'CARS'),\n", " ('MAN', 'MEN', 'ROAD', 'ROADS'),\n", " ('MAN', 'MEN', 'WOMAN', 'WOMEN'),\n", " ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),\n", " ('MAN', 'MEN', 'CAR', 'CARS'),\n", " ('MAN', 'MEN', 'CHILD', 'CHILDREN'),\n", " ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),\n", " ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),\n", " ('ROAD', 'ROADS', 'CAR', 'CARS'),\n", " ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),\n", " ('ROAD', 'ROADS', 'MAN', 'MEN'),\n", " ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),\n", " ('WOMAN', 'WOMEN', 'CAR', 'CARS'),\n", " ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),\n", " ('WOMAN', 'WOMEN', 'MAN', 'MEN'),\n", " ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]},\n", " {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []},\n", " {'section': 'Total 
accuracy',\n", " 'correct': [],\n", " 'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),\n", " ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),\n", " ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),\n", " ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),\n", " ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),\n", " ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN'),\n", " ('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),\n", " ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),\n", " ('HE', 'SHE', 'HIS', 'HER'),\n", " ('HE', 'SHE', 'MAN', 'WOMAN'),\n", " ('HIS', 'HER', 'MAN', 'WOMAN'),\n", " ('HIS', 'HER', 'HE', 'SHE'),\n", " ('MAN', 'WOMAN', 'HE', 'SHE'),\n", " ('MAN', 'WOMAN', 'HIS', 'HER'),\n", " ('GOOD', 'BETTER', 'GREAT', 'GREATER'),\n", " ('GOOD', 'BETTER', 'LONG', 'LONGER'),\n", " ('GOOD', 'BETTER', 'LOW', 'LOWER'),\n", " ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),\n", " ('GREAT', 'GREATER', 'LONG', 'LONGER'),\n", " ('GREAT', 'GREATER', 'LOW', 'LOWER'),\n", " ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),\n", " ('GREAT', 'GREATER', 'GOOD', 'BETTER'),\n", " ('LONG', 'LONGER', 'LOW', 'LOWER'),\n", " ('LONG', 'LONGER', 'SMALL', 'SMALLER'),\n", " ('LONG', 'LONGER', 'GOOD', 'BETTER'),\n", " ('LONG', 'LONGER', 'GREAT', 'GREATER'),\n", " ('LOW', 'LOWER', 'SMALL', 'SMALLER'),\n", " ('LOW', 'LOWER', 'GOOD', 'BETTER'),\n", " ('LOW', 'LOWER', 'GREAT', 'GREATER'),\n", " ('LOW', 'LOWER', 'LONG', 'LONGER'),\n", " ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),\n", " ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),\n", " ('SMALL', 'SMALLER', 'LONG', 'LONGER'),\n", " ('SMALL', 'SMALLER', 'LOW', 'LOWER'),\n", " ('BIG', 'BIGGEST', 'GOOD', 'BEST'),\n", " ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),\n", " ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),\n", " ('GOOD', 'BEST', 'GREAT', 'GREATEST'),\n", " ('GOOD', 'BEST', 'LARGE', 'LARGEST'),\n", " ('GOOD', 'BEST', 'BIG', 'BIGGEST'),\n", " ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),\n", " ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),\n", " ('GREAT', 'GREATEST', 'GOOD', 
'BEST'),\n", " ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),\n", " ('LARGE', 'LARGEST', 'GOOD', 'BEST'),\n", " ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),\n", " ('GO', 'GOING', 'LOOK', 'LOOKING'),\n", " ('GO', 'GOING', 'PLAY', 'PLAYING'),\n", " ('GO', 'GOING', 'RUN', 'RUNNING'),\n", " ('GO', 'GOING', 'SAY', 'SAYING'),\n", " ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),\n", " ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),\n", " ('LOOK', 'LOOKING', 'SAY', 'SAYING'),\n", " ('LOOK', 'LOOKING', 'GO', 'GOING'),\n", " ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),\n", " ('PLAY', 'PLAYING', 'SAY', 'SAYING'),\n", " ('PLAY', 'PLAYING', 'GO', 'GOING'),\n", " ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),\n", " ('RUN', 'RUNNING', 'SAY', 'SAYING'),\n", " ('RUN', 'RUNNING', 'GO', 'GOING'),\n", " ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),\n", " ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),\n", " ('SAY', 'SAYING', 'GO', 'GOING'),\n", " ('SAY', 'SAYING', 'LOOK', 'LOOKING'),\n", " ('SAY', 'SAYING', 'PLAY', 'PLAYING'),\n", " ('SAY', 'SAYING', 'RUN', 'RUNNING'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),\n", " ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),\n", " ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),\n", " ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),\n", " ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),\n", " ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),\n", " ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),\n", " ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),\n", " ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),\n", " ('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),\n", " ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),\n", " ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),\n", " ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('ISRAEL', 'ISRAELI', 'FRANCE', 
'FRENCH'),\n", " ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),\n", " ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),\n", " ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),\n", " ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),\n", " ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),\n", " ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),\n", " ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),\n", " ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),\n", " ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),\n", " ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'),\n", " ('GOING', 'WENT', 'PAYING', 'PAID'),\n", " ('GOING', 'WENT', 'PLAYING', 'PLAYED'),\n", " ('GOING', 'WENT', 'SAYING', 'SAID'),\n", " ('GOING', 'WENT', 'TAKING', 'TOOK'),\n", " ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),\n", " ('PAYING', 'PAID', 'SAYING', 'SAID'),\n", " ('PAYING', 'PAID', 'TAKING', 'TOOK'),\n", " ('PAYING', 'PAID', 'GOING', 'WENT'),\n", " ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),\n", " ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),\n", " ('PLAYING', 'PLAYED', 'GOING', 'WENT'),\n", " ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),\n", " ('SAYING', 'SAID', 'TAKING', 'TOOK'),\n", " ('SAYING', 'SAID', 'GOING', 'WENT'),\n", " ('SAYING', 'SAID', 'PAYING', 'PAID'),\n", " ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),\n", " ('TAKING', 'TOOK', 'GOING', 'WENT'),\n", " ('TAKING', 'TOOK', 'PAYING', 'PAID'),\n", " ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),\n", " ('TAKING', 'TOOK', 'SAYING', 'SAID'),\n", " ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),\n", " ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),\n", " ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),\n", " ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),\n", " ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),\n", " ('CAR', 'CARS', 'CHILD', 'CHILDREN'),\n", " ('CAR', 'CARS', 'MAN', 'MEN'),\n", " ('CAR', 'CARS', 'ROAD', 'ROADS'),\n", " ('CAR', 'CARS', 'WOMAN', 'WOMEN'),\n", " ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),\n", " ('CHILD', 'CHILDREN', 'MAN', 'MEN'),\n", " ('CHILD', 
'CHILDREN', 'ROAD', 'ROADS'),\n", " ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),\n", " ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),\n", " ('CHILD', 'CHILDREN', 'CAR', 'CARS'),\n", " ('MAN', 'MEN', 'ROAD', 'ROADS'),\n", " ('MAN', 'MEN', 'WOMAN', 'WOMEN'),\n", " ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),\n", " ('MAN', 'MEN', 'CAR', 'CARS'),\n", " ('MAN', 'MEN', 'CHILD', 'CHILDREN'),\n", " ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),\n", " ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),\n", " ('ROAD', 'ROADS', 'CAR', 'CARS'),\n", " ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),\n", " ('ROAD', 'ROADS', 'MAN', 'MEN'),\n", " ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),\n", " ('WOMAN', 'WOMEN', 'CAR', 'CARS'),\n", " ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),\n", " ('WOMAN', 'WOMEN', 'MAN', 'MEN'),\n", " ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.wv.evaluate_word_analogies(datapath('questions-words.txt'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ``evaluate_word_analogies`` method takes an `optional parameter\n", "`_\n", "``restrict_vocab``, which limits which test examples are considered.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.\n", "\n", "By default it uses the academic dataset WS-353, but you can create a dataset\n", "specific to your business based on it. The dataset contains word pairs together with\n", "human-assigned similarity judgments, which measure the relatedness or\n", "co-occurrence of two words. For example, 'coast' and 'shore' are very similar\n", "as they appear in the same context. 
At the same time 'clothes' and 'closet'\n", "are less similar because they are related but not interchangeable.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:01,673 : INFO : Skipping line #2 with OOV words: love\tsex\t6.77\n", "2025-03-27 14:18:01,675 : INFO : Skipping line #3 with OOV words: tiger\tcat\t7.35\n", "2025-03-27 14:18:01,675 : INFO : Skipping line #4 with OOV words: tiger\ttiger\t10.00\n", "2025-03-27 14:18:01,676 : INFO : Skipping line #5 with OOV words: book\tpaper\t7.46\n", "2025-03-27 14:18:01,676 : INFO : Skipping line #6 with OOV words: computer\tkeyboard\t7.62\n", "2025-03-27 14:18:01,677 : INFO : Skipping line #7 with OOV words: computer\tinternet\t7.58\n", "2025-03-27 14:18:01,679 : INFO : Skipping line #9 with OOV words: train\tcar\t6.31\n", "2025-03-27 14:18:01,680 : INFO : Skipping line #10 with OOV words: telephone\tcommunication\t7.50\n", "2025-03-27 14:18:01,681 : INFO : Skipping line #14 with OOV words: bread\tbutter\t6.19\n", "2025-03-27 14:18:01,681 : INFO : Skipping line #15 with OOV words: cucumber\tpotato\t5.92\n", "2025-03-27 14:18:01,682 : INFO : Skipping line #16 with OOV words: doctor\tnurse\t7.00\n", "2025-03-27 14:18:01,682 : INFO : Skipping line #18 with OOV words: student\tprofessor\t6.81\n", "2025-03-27 14:18:01,683 : INFO : Skipping line #19 with OOV words: smart\tstudent\t4.62\n", "2025-03-27 14:18:01,683 : INFO : Skipping line #20 with OOV words: smart\tstupid\t5.81\n", "2025-03-27 14:18:01,683 : INFO : Skipping line #21 with OOV words: company\tstock\t7.08\n", "2025-03-27 14:18:01,684 : INFO : Skipping line #22 with OOV words: stock\tmarket\t8.08\n", "2025-03-27 14:18:01,684 : INFO : Skipping line #23 with OOV words: stock\tphone\t1.62\n", "2025-03-27 14:18:01,685 : INFO : Skipping line #24 with OOV words: stock\tCD\t1.31\n", "2025-03-27 14:18:01,685 : INFO : Skipping line #25 with OOV words: 
stock\tjaguar\t0.92\n", "2025-03-27 14:18:01,685 : INFO : Skipping line #26 with OOV words: stock\tegg\t1.81\n", "2025-03-27 14:18:01,686 : INFO : Skipping line #27 with OOV words: fertility\tegg\t6.69\n", "2025-03-27 14:18:01,686 : INFO : Skipping line #28 with OOV words: stock\tlive\t3.73\n", "2025-03-27 14:18:01,686 : INFO : Skipping line #29 with OOV words: stock\tlife\t0.92\n", "2025-03-27 14:18:01,687 : INFO : Skipping line #30 with OOV words: book\tlibrary\t7.46\n", "2025-03-27 14:18:01,687 : INFO : Skipping line #32 with OOV words: wood\tforest\t7.73\n", "2025-03-27 14:18:01,687 : INFO : Skipping line #33 with OOV words: money\tcash\t9.15\n", "2025-03-27 14:18:01,687 : INFO : Skipping line #34 with OOV words: professor\tcucumber\t0.31\n", "2025-03-27 14:18:01,688 : INFO : Skipping line #35 with OOV words: king\tcabbage\t0.23\n", "2025-03-27 14:18:01,688 : INFO : Skipping line #36 with OOV words: king\tqueen\t8.58\n", "2025-03-27 14:18:01,688 : INFO : Skipping line #37 with OOV words: king\trook\t5.92\n", "2025-03-27 14:18:01,689 : INFO : Skipping line #38 with OOV words: bishop\trabbi\t6.69\n", "2025-03-27 14:18:01,689 : INFO : Skipping line #41 with OOV words: holy\tsex\t1.62\n", "2025-03-27 14:18:01,690 : INFO : Skipping line #42 with OOV words: fuck\tsex\t9.44\n", "2025-03-27 14:18:01,690 : INFO : Skipping line #43 with OOV words: Maradona\tfootball\t8.62\n", "2025-03-27 14:18:01,690 : INFO : Skipping line #44 with OOV words: football\tsoccer\t9.03\n", "2025-03-27 14:18:01,691 : INFO : Skipping line #45 with OOV words: football\tbasketball\t6.81\n", "2025-03-27 14:18:01,691 : INFO : Skipping line #46 with OOV words: football\ttennis\t6.63\n", "2025-03-27 14:18:01,691 : INFO : Skipping line #47 with OOV words: tennis\tracket\t7.56\n", "2025-03-27 14:18:01,692 : INFO : Skipping line #50 with OOV words: Arafat\tJackson\t2.50\n", "2025-03-27 14:18:01,692 : INFO : Skipping line #51 with OOV words: law\tlawyer\t8.38\n", "2025-03-27 14:18:01,692 : INFO : 
Skipping line #52 with OOV words: movie\tstar\t7.38\n", "2025-03-27 14:18:01,693 : INFO : Skipping line #53 with OOV words: movie\tpopcorn\t6.19\n", "2025-03-27 14:18:01,693 : INFO : Skipping line #54 with OOV words: movie\tcritic\t6.73\n", "2025-03-27 14:18:01,694 : INFO : Skipping line #55 with OOV words: movie\ttheater\t7.92\n", "2025-03-27 14:18:01,694 : INFO : Skipping line #56 with OOV words: physics\tproton\t8.12\n", "2025-03-27 14:18:01,694 : INFO : Skipping line #57 with OOV words: physics\tchemistry\t7.35\n", "2025-03-27 14:18:01,694 : INFO : Skipping line #58 with OOV words: space\tchemistry\t4.88\n", "2025-03-27 14:18:01,695 : INFO : Skipping line #59 with OOV words: alcohol\tchemistry\t5.54\n", "2025-03-27 14:18:01,695 : INFO : Skipping line #60 with OOV words: vodka\tgin\t8.46\n", "2025-03-27 14:18:01,695 : INFO : Skipping line #61 with OOV words: vodka\tbrandy\t8.13\n", "2025-03-27 14:18:01,695 : INFO : Skipping line #62 with OOV words: drink\tcar\t3.04\n", "2025-03-27 14:18:01,696 : INFO : Skipping line #63 with OOV words: drink\tear\t1.31\n", "2025-03-27 14:18:01,696 : INFO : Skipping line #64 with OOV words: drink\tmouth\t5.96\n", "2025-03-27 14:18:01,696 : INFO : Skipping line #65 with OOV words: drink\teat\t6.87\n", "2025-03-27 14:18:01,696 : INFO : Skipping line #66 with OOV words: baby\tmother\t7.85\n", "2025-03-27 14:18:01,697 : INFO : Skipping line #67 with OOV words: drink\tmother\t2.65\n", "2025-03-27 14:18:01,697 : INFO : Skipping line #68 with OOV words: car\tautomobile\t8.94\n", "2025-03-27 14:18:01,697 : INFO : Skipping line #69 with OOV words: gem\tjewel\t8.96\n", "2025-03-27 14:18:01,697 : INFO : Skipping line #70 with OOV words: journey\tvoyage\t9.29\n", "2025-03-27 14:18:01,698 : INFO : Skipping line #71 with OOV words: boy\tlad\t8.83\n", "2025-03-27 14:18:01,698 : INFO : Skipping line #72 with OOV words: coast\tshore\t9.10\n", "2025-03-27 14:18:01,698 : INFO : Skipping line #73 with OOV words: asylum\tmadhouse\t8.87\n", 
"2025-03-27 14:18:01,699 : INFO : Skipping line #74 with OOV words: magician\twizard\t9.02\n", "2025-03-27 14:18:01,699 : INFO : Skipping line #75 with OOV words: midday\tnoon\t9.29\n", "2025-03-27 14:18:01,699 : INFO : Skipping line #76 with OOV words: furnace\tstove\t8.79\n", "2025-03-27 14:18:01,699 : INFO : Skipping line #77 with OOV words: food\tfruit\t7.52\n", "2025-03-27 14:18:01,699 : INFO : Skipping line #78 with OOV words: bird\tcock\t7.10\n", "2025-03-27 14:18:01,700 : INFO : Skipping line #79 with OOV words: bird\tcrane\t7.38\n", "2025-03-27 14:18:01,700 : INFO : Skipping line #80 with OOV words: tool\timplement\t6.46\n", "2025-03-27 14:18:01,700 : INFO : Skipping line #81 with OOV words: brother\tmonk\t6.27\n", "2025-03-27 14:18:01,701 : INFO : Skipping line #82 with OOV words: crane\timplement\t2.69\n", "2025-03-27 14:18:01,701 : INFO : Skipping line #83 with OOV words: lad\tbrother\t4.46\n", "2025-03-27 14:18:01,701 : INFO : Skipping line #84 with OOV words: journey\tcar\t5.85\n", "2025-03-27 14:18:01,702 : INFO : Skipping line #85 with OOV words: monk\toracle\t5.00\n", "2025-03-27 14:18:01,702 : INFO : Skipping line #86 with OOV words: cemetery\twoodland\t2.08\n", "2025-03-27 14:18:01,702 : INFO : Skipping line #87 with OOV words: food\trooster\t4.42\n", "2025-03-27 14:18:01,702 : INFO : Skipping line #89 with OOV words: forest\tgraveyard\t1.85\n", "2025-03-27 14:18:01,703 : INFO : Skipping line #90 with OOV words: shore\twoodland\t3.08\n", "2025-03-27 14:18:01,703 : INFO : Skipping line #91 with OOV words: monk\tslave\t0.92\n", "2025-03-27 14:18:01,703 : INFO : Skipping line #92 with OOV words: coast\tforest\t3.15\n", "2025-03-27 14:18:01,703 : INFO : Skipping line #93 with OOV words: lad\twizard\t0.92\n", "2025-03-27 14:18:01,704 : INFO : Skipping line #94 with OOV words: chord\tsmile\t0.54\n", "2025-03-27 14:18:01,704 : INFO : Skipping line #95 with OOV words: glass\tmagician\t2.08\n", "2025-03-27 14:18:01,704 : INFO : Skipping line #96 with OOV 
words: noon\tstring\t0.54\n", "2025-03-27 14:18:01,705 : INFO : Skipping line #97 with OOV words: rooster\tvoyage\t0.62\n", "2025-03-27 14:18:01,705 : INFO : Skipping line #98 with OOV words: money\tdollar\t8.42\n", "2025-03-27 14:18:01,706 : INFO : Skipping line #99 with OOV words: money\tcash\t9.08\n", "2025-03-27 14:18:01,706 : INFO : Skipping line #100 with OOV words: money\tcurrency\t9.04\n", "2025-03-27 14:18:01,706 : INFO : Skipping line #101 with OOV words: money\twealth\t8.27\n", "2025-03-27 14:18:01,706 : INFO : Skipping line #103 with OOV words: money\tpossession\t7.29\n", "2025-03-27 14:18:01,707 : INFO : Skipping line #105 with OOV words: money\tdeposit\t7.73\n", "2025-03-27 14:18:01,707 : INFO : Skipping line #106 with OOV words: money\twithdrawal\t6.88\n", "2025-03-27 14:18:01,707 : INFO : Skipping line #107 with OOV words: money\tlaundering\t5.65\n", "2025-03-27 14:18:01,707 : INFO : Skipping line #109 with OOV words: tiger\tjaguar\t8.00\n", "2025-03-27 14:18:01,708 : INFO : Skipping line #110 with OOV words: tiger\tfeline\t8.00\n", "2025-03-27 14:18:01,708 : INFO : Skipping line #111 with OOV words: tiger\tcarnivore\t7.08\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:01,709 : INFO : Skipping line #112 with OOV words: tiger\tmammal\t6.85\n", "2025-03-27 14:18:01,709 : INFO : Skipping line #113 with OOV words: tiger\tanimal\t7.00\n", "2025-03-27 14:18:01,709 : INFO : Skipping line #114 with OOV words: tiger\torganism\t4.77\n", "2025-03-27 14:18:01,709 : INFO : Skipping line #115 with OOV words: tiger\tfauna\t5.62\n", "2025-03-27 14:18:01,709 : INFO : Skipping line #116 with OOV words: tiger\tzoo\t5.87\n", "2025-03-27 14:18:01,710 : INFO : Skipping line #117 with OOV words: psychology\tpsychiatry\t8.08\n", "2025-03-27 14:18:01,710 : INFO : Skipping line #118 with OOV words: psychology\tanxiety\t7.00\n", "2025-03-27 14:18:01,710 : INFO : Skipping line #119 with OOV words: psychology\tfear\t6.85\n", "2025-03-27 
14:18:01,710 : INFO : Skipping line #120 with OOV words: psychology\tdepression\t7.42\n", "2025-03-27 14:18:01,710 : INFO : Skipping line #121 with OOV words: psychology\tclinic\t6.58\n", "2025-03-27 14:18:01,711 : INFO : Skipping line #122 with OOV words: psychology\tdoctor\t6.42\n", "2025-03-27 14:18:01,711 : INFO : Skipping line #123 with OOV words: psychology\tFreud\t8.21\n", "2025-03-27 14:18:01,711 : INFO : Skipping line #124 with OOV words: psychology\tmind\t7.69\n", "2025-03-27 14:18:01,711 : INFO : Skipping line #125 with OOV words: psychology\thealth\t7.23\n", "2025-03-27 14:18:01,712 : INFO : Skipping line #126 with OOV words: psychology\tscience\t6.71\n", "2025-03-27 14:18:01,712 : INFO : Skipping line #127 with OOV words: psychology\tdiscipline\t5.58\n", "2025-03-27 14:18:01,712 : INFO : Skipping line #128 with OOV words: psychology\tcognition\t7.48\n", "2025-03-27 14:18:01,713 : INFO : Skipping line #129 with OOV words: planet\tstar\t8.45\n", "2025-03-27 14:18:01,713 : INFO : Skipping line #130 with OOV words: planet\tconstellation\t8.06\n", "2025-03-27 14:18:01,713 : INFO : Skipping line #131 with OOV words: planet\tmoon\t8.08\n", "2025-03-27 14:18:01,713 : INFO : Skipping line #132 with OOV words: planet\tsun\t8.02\n", "2025-03-27 14:18:01,714 : INFO : Skipping line #133 with OOV words: planet\tgalaxy\t8.11\n", "2025-03-27 14:18:01,714 : INFO : Skipping line #134 with OOV words: planet\tspace\t7.92\n", "2025-03-27 14:18:01,714 : INFO : Skipping line #135 with OOV words: planet\tastronomer\t7.94\n", "2025-03-27 14:18:01,714 : INFO : Skipping line #136 with OOV words: precedent\texample\t5.85\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #137 with OOV words: precedent\tinformation\t3.85\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #138 with OOV words: precedent\tcognition\t2.81\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #139 with OOV words: precedent\tlaw\t6.65\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #140 with OOV 
words: precedent\tcollection\t2.50\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #141 with OOV words: precedent\tgroup\t1.77\n", "2025-03-27 14:18:01,715 : INFO : Skipping line #142 with OOV words: precedent\tantecedent\t6.04\n", "2025-03-27 14:18:01,716 : INFO : Skipping line #143 with OOV words: cup\tcoffee\t6.58\n", "2025-03-27 14:18:01,716 : INFO : Skipping line #144 with OOV words: cup\ttableware\t6.85\n", "2025-03-27 14:18:01,716 : INFO : Skipping line #145 with OOV words: cup\tarticle\t2.40\n", "2025-03-27 14:18:01,716 : INFO : Skipping line #146 with OOV words: cup\tartifact\t2.92\n", "2025-03-27 14:18:01,717 : INFO : Skipping line #147 with OOV words: cup\tobject\t3.69\n", "2025-03-27 14:18:01,717 : INFO : Skipping line #148 with OOV words: cup\tentity\t2.15\n", "2025-03-27 14:18:01,717 : INFO : Skipping line #149 with OOV words: cup\tdrink\t7.25\n", "2025-03-27 14:18:01,717 : INFO : Skipping line #151 with OOV words: cup\tsubstance\t1.92\n", "2025-03-27 14:18:01,718 : INFO : Skipping line #152 with OOV words: cup\tliquid\t5.90\n", "2025-03-27 14:18:01,718 : INFO : Skipping line #153 with OOV words: jaguar\tcat\t7.42\n", "2025-03-27 14:18:01,718 : INFO : Skipping line #154 with OOV words: jaguar\tcar\t7.27\n", "2025-03-27 14:18:01,719 : INFO : Skipping line #157 with OOV words: energy\tlaboratory\t5.09\n", "2025-03-27 14:18:01,719 : INFO : Skipping line #158 with OOV words: computer\tlaboratory\t6.78\n", "2025-03-27 14:18:01,719 : INFO : Skipping line #159 with OOV words: weapon\tsecret\t6.06\n", "2025-03-27 14:18:01,719 : INFO : Skipping line #160 with OOV words: FBI\tfingerprint\t6.94\n", "2025-03-27 14:18:01,719 : INFO : Skipping line #161 with OOV words: FBI\tinvestigation\t8.31\n", "2025-03-27 14:18:01,720 : INFO : Skipping line #163 with OOV words: Mars\twater\t2.94\n", "2025-03-27 14:18:01,720 : INFO : Skipping line #164 with OOV words: Mars\tscientist\t5.63\n", "2025-03-27 14:18:01,720 : INFO : Skipping line #166 with OOV words: 
canyon\tlandscape\t7.53\n", "2025-03-27 14:18:01,720 : INFO : Skipping line #167 with OOV words: image\tsurface\t4.56\n", "2025-03-27 14:18:01,721 : INFO : Skipping line #168 with OOV words: discovery\tspace\t6.34\n", "2025-03-27 14:18:01,721 : INFO : Skipping line #169 with OOV words: water\tseepage\t6.56\n", "2025-03-27 14:18:01,721 : INFO : Skipping line #170 with OOV words: sign\trecess\t2.38\n", "2025-03-27 14:18:01,721 : INFO : Skipping line #172 with OOV words: mile\tkilometer\t8.66\n", "2025-03-27 14:18:01,722 : INFO : Skipping line #173 with OOV words: computer\tnews\t4.47\n", "2025-03-27 14:18:01,722 : INFO : Skipping line #174 with OOV words: territory\tsurface\t5.34\n", "2025-03-27 14:18:01,722 : INFO : Skipping line #175 with OOV words: atmosphere\tlandscape\t3.69\n", "2025-03-27 14:18:01,722 : INFO : Skipping line #176 with OOV words: president\tmedal\t3.00\n", "2025-03-27 14:18:01,722 : INFO : Skipping line #179 with OOV words: skin\teye\t6.22\n", "2025-03-27 14:18:01,723 : INFO : Skipping line #181 with OOV words: theater\thistory\t3.91\n", "2025-03-27 14:18:01,723 : INFO : Skipping line #182 with OOV words: volunteer\tmotto\t2.56\n", "2025-03-27 14:18:01,723 : INFO : Skipping line #183 with OOV words: prejudice\trecognition\t3.00\n", "2025-03-27 14:18:01,723 : INFO : Skipping line #184 with OOV words: decoration\tvalor\t5.63\n", "2025-03-27 14:18:01,724 : INFO : Skipping line #185 with OOV words: century\tyear\t7.59\n", "2025-03-27 14:18:01,724 : INFO : Skipping line #186 with OOV words: century\tnation\t3.16\n", "2025-03-27 14:18:01,724 : INFO : Skipping line #187 with OOV words: delay\tracism\t1.19\n", "2025-03-27 14:18:01,725 : INFO : Skipping line #191 with OOV words: minority\tpeace\t3.69\n", "2025-03-27 14:18:01,725 : INFO : Skipping line #194 with OOV words: deployment\tdeparture\t4.25\n", "2025-03-27 14:18:01,725 : INFO : Skipping line #195 with OOV words: deployment\twithdrawal\t5.88\n", "2025-03-27 14:18:01,725 : INFO : Skipping line #197 
with OOV words: announcement\tnews\t7.56\n", "2025-03-27 14:18:01,725 : INFO : Skipping line #198 with OOV words: announcement\teffort\t2.75\n", "2025-03-27 14:18:01,726 : INFO : Skipping line #199 with OOV words: stroke\thospital\t7.03\n", "2025-03-27 14:18:01,726 : INFO : Skipping line #200 with OOV words: disability\tdeath\t5.47\n", "2025-03-27 14:18:01,726 : INFO : Skipping line #201 with OOV words: victim\temergency\t6.47\n", "2025-03-27 14:18:01,726 : INFO : Skipping line #203 with OOV words: journal\tassociation\t4.97\n", "2025-03-27 14:18:01,726 : INFO : Skipping line #205 with OOV words: doctor\tliability\t5.19\n", "2025-03-27 14:18:01,727 : INFO : Skipping line #206 with OOV words: liability\tinsurance\t7.03\n", "2025-03-27 14:18:01,727 : INFO : Skipping line #207 with OOV words: school\tcenter\t3.44\n", "2025-03-27 14:18:01,727 : INFO : Skipping line #208 with OOV words: reason\thypertension\t2.31\n", "2025-03-27 14:18:01,727 : INFO : Skipping line #209 with OOV words: reason\tcriterion\t5.91\n", "2025-03-27 14:18:01,728 : INFO : Skipping line #210 with OOV words: hundred\tpercent\t7.38\n", "2025-03-27 14:18:01,728 : INFO : Skipping line #211 with OOV words: Harvard\tYale\t8.13\n", "2025-03-27 14:18:01,728 : INFO : Skipping line #212 with OOV words: hospital\tinfrastructure\t4.63\n", "2025-03-27 14:18:01,728 : INFO : Skipping line #213 with OOV words: death\trow\t5.25\n", "2025-03-27 14:18:01,728 : INFO : Skipping line #214 with OOV words: death\tinmate\t5.03\n", "2025-03-27 14:18:01,729 : INFO : Skipping line #215 with OOV words: lawyer\tevidence\t6.69\n", "2025-03-27 14:18:01,729 : INFO : Skipping line #218 with OOV words: word\tsimilarity\t4.75\n", "2025-03-27 14:18:01,729 : INFO : Skipping line #219 with OOV words: board\trecommendation\t4.47\n", "2025-03-27 14:18:01,729 : INFO : Skipping line #221 with OOV words: OPEC\tcountry\t5.63\n", "2025-03-27 14:18:01,729 : INFO : Skipping line #222 with OOV words: peace\tatmosphere\t3.69\n", "2025-03-27 
14:18:01,730 : INFO : Skipping line #224 with OOV words: territory\tkilometer\t5.28\n", "2025-03-27 14:18:01,730 : INFO : Skipping line #226 with OOV words: competition\tprice\t6.44\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:01,730 : INFO : Skipping line #227 with OOV words: consumer\tconfidence\t4.13\n", "2025-03-27 14:18:01,730 : INFO : Skipping line #228 with OOV words: consumer\tenergy\t4.75\n", "2025-03-27 14:18:01,731 : INFO : Skipping line #231 with OOV words: credit\tcard\t8.06\n", "2025-03-27 14:18:01,731 : INFO : Skipping line #233 with OOV words: hotel\treservation\t8.03\n", "2025-03-27 14:18:01,731 : INFO : Skipping line #234 with OOV words: grocery\tmoney\t5.94\n", "2025-03-27 14:18:01,731 : INFO : Skipping line #235 with OOV words: registration\tarrangement\t6.00\n", "2025-03-27 14:18:01,731 : INFO : Skipping line #236 with OOV words: arrangement\taccommodation\t5.41\n", "2025-03-27 14:18:01,732 : INFO : Skipping line #238 with OOV words: type\tkind\t8.97\n", "2025-03-27 14:18:01,732 : INFO : Skipping line #239 with OOV words: arrival\thotel\t6.00\n", "2025-03-27 14:18:01,732 : INFO : Skipping line #240 with OOV words: bed\tcloset\t6.72\n", "2025-03-27 14:18:01,732 : INFO : Skipping line #241 with OOV words: closet\tclothes\t8.00\n", "2025-03-27 14:18:01,732 : INFO : Skipping line #242 with OOV words: situation\tconclusion\t4.81\n", "2025-03-27 14:18:01,733 : INFO : Skipping line #243 with OOV words: situation\tisolation\t3.88\n", "2025-03-27 14:18:01,733 : INFO : Skipping line #244 with OOV words: impartiality\tinterest\t5.16\n", "2025-03-27 14:18:01,733 : INFO : Skipping line #245 with OOV words: direction\tcombination\t2.25\n", "2025-03-27 14:18:01,733 : INFO : Skipping line #246 with OOV words: street\tplace\t6.44\n", "2025-03-27 14:18:01,733 : INFO : Skipping line #247 with OOV words: street\tavenue\t8.88\n", "2025-03-27 14:18:01,734 : INFO : Skipping line #248 with OOV words: street\tblock\t6.88\n", 
"2025-03-27 14:18:01,734 : INFO : Skipping line #249 with OOV words: street\tchildren\t4.94\n", "2025-03-27 14:18:01,734 : INFO : Skipping line #250 with OOV words: listing\tproximity\t2.56\n", "2025-03-27 14:18:01,734 : INFO : Skipping line #251 with OOV words: listing\tcategory\t6.38\n", "2025-03-27 14:18:01,734 : INFO : Skipping line #252 with OOV words: cell\tphone\t7.81\n", "2025-03-27 14:18:01,735 : INFO : Skipping line #253 with OOV words: production\thike\t1.75\n", "2025-03-27 14:18:01,735 : INFO : Skipping line #254 with OOV words: benchmark\tindex\t4.25\n", "2025-03-27 14:18:01,735 : INFO : Skipping line #256 with OOV words: media\tgain\t2.88\n", "2025-03-27 14:18:01,735 : INFO : Skipping line #257 with OOV words: dividend\tpayment\t7.63\n", "2025-03-27 14:18:01,735 : INFO : Skipping line #258 with OOV words: dividend\tcalculation\t6.48\n", "2025-03-27 14:18:01,736 : INFO : Skipping line #259 with OOV words: calculation\tcomputation\t8.44\n", "2025-03-27 14:18:01,736 : INFO : Skipping line #260 with OOV words: currency\tmarket\t7.50\n", "2025-03-27 14:18:01,736 : INFO : Skipping line #261 with OOV words: OPEC\toil\t8.59\n", "2025-03-27 14:18:01,736 : INFO : Skipping line #262 with OOV words: oil\tstock\t6.34\n", "2025-03-27 14:18:01,737 : INFO : Skipping line #263 with OOV words: announcement\tproduction\t3.38\n", "2025-03-27 14:18:01,737 : INFO : Skipping line #264 with OOV words: announcement\twarning\t6.00\n", "2025-03-27 14:18:01,737 : INFO : Skipping line #265 with OOV words: profit\twarning\t3.88\n", "2025-03-27 14:18:01,737 : INFO : Skipping line #266 with OOV words: profit\tloss\t7.63\n", "2025-03-27 14:18:01,738 : INFO : Skipping line #267 with OOV words: dollar\tyen\t7.78\n", "2025-03-27 14:18:01,738 : INFO : Skipping line #268 with OOV words: dollar\tbuck\t9.22\n", "2025-03-27 14:18:01,738 : INFO : Skipping line #269 with OOV words: dollar\tprofit\t7.38\n", "2025-03-27 14:18:01,738 : INFO : Skipping line #270 with OOV words: 
dollar\tloss\t6.09\n", "2025-03-27 14:18:01,738 : INFO : Skipping line #271 with OOV words: computer\tsoftware\t8.50\n", "2025-03-27 14:18:01,739 : INFO : Skipping line #272 with OOV words: network\thardware\t8.31\n", "2025-03-27 14:18:01,739 : INFO : Skipping line #273 with OOV words: phone\tequipment\t7.13\n", "2025-03-27 14:18:01,739 : INFO : Skipping line #274 with OOV words: equipment\tmaker\t5.91\n", "2025-03-27 14:18:01,739 : INFO : Skipping line #275 with OOV words: luxury\tcar\t6.47\n", "2025-03-27 14:18:01,739 : INFO : Skipping line #277 with OOV words: report\tgain\t3.63\n", "2025-03-27 14:18:01,740 : INFO : Skipping line #278 with OOV words: investor\tearning\t7.13\n", "2025-03-27 14:18:01,740 : INFO : Skipping line #279 with OOV words: liquid\twater\t7.89\n", "2025-03-27 14:18:01,740 : INFO : Skipping line #280 with OOV words: baseball\tseason\t5.97\n", "2025-03-27 14:18:01,740 : INFO : Skipping line #283 with OOV words: marathon\tsprint\t7.47\n", "2025-03-27 14:18:01,741 : INFO : Skipping line #285 with OOV words: game\tdefeat\t6.97\n", "2025-03-27 14:18:01,741 : INFO : Skipping line #287 with OOV words: seafood\tsea\t7.47\n", "2025-03-27 14:18:01,741 : INFO : Skipping line #288 with OOV words: seafood\tfood\t8.34\n", "2025-03-27 14:18:01,741 : INFO : Skipping line #289 with OOV words: seafood\tlobster\t8.70\n", "2025-03-27 14:18:01,741 : INFO : Skipping line #290 with OOV words: lobster\tfood\t7.81\n", "2025-03-27 14:18:01,742 : INFO : Skipping line #291 with OOV words: lobster\twine\t5.70\n", "2025-03-27 14:18:01,742 : INFO : Skipping line #292 with OOV words: food\tpreparation\t6.22\n", "2025-03-27 14:18:01,742 : INFO : Skipping line #293 with OOV words: video\tarchive\t6.34\n", "2025-03-27 14:18:01,742 : INFO : Skipping line #298 with OOV words: championship\ttournament\t8.36\n", "2025-03-27 14:18:01,742 : INFO : Skipping line #299 with OOV words: fighting\tdefeating\t7.41\n", "2025-03-27 14:18:01,743 : INFO : Skipping line #301 with OOV words: 
day\tsummer\t3.94\n", "2025-03-27 14:18:01,743 : INFO : Skipping line #302 with OOV words: summer\tdrought\t7.16\n", "2025-03-27 14:18:01,743 : INFO : Skipping line #303 with OOV words: summer\tnature\t5.63\n", "2025-03-27 14:18:01,743 : INFO : Skipping line #304 with OOV words: day\tdawn\t7.53\n", "2025-03-27 14:18:01,743 : INFO : Skipping line #305 with OOV words: nature\tenvironment\t8.31\n", "2025-03-27 14:18:01,744 : INFO : Skipping line #306 with OOV words: environment\tecology\t8.81\n", "2025-03-27 14:18:01,744 : INFO : Skipping line #307 with OOV words: nature\tman\t6.25\n", "2025-03-27 14:18:01,744 : INFO : Skipping line #311 with OOV words: soap\topera\t7.94\n", "2025-03-27 14:18:01,744 : INFO : Skipping line #312 with OOV words: opera\tperformance\t6.88\n", "2025-03-27 14:18:01,744 : INFO : Skipping line #313 with OOV words: life\tlesson\t5.94\n", "2025-03-27 14:18:01,745 : INFO : Skipping line #315 with OOV words: production\tcrew\t6.25\n", "2025-03-27 14:18:01,745 : INFO : Skipping line #316 with OOV words: television\tfilm\t7.72\n", "2025-03-27 14:18:01,745 : INFO : Skipping line #317 with OOV words: lover\tquarrel\t6.19\n", "2025-03-27 14:18:01,745 : INFO : Skipping line #318 with OOV words: viewer\tserial\t2.97\n", "2025-03-27 14:18:01,745 : INFO : Skipping line #319 with OOV words: possibility\tgirl\t1.94\n", "2025-03-27 14:18:01,746 : INFO : Skipping line #320 with OOV words: population\tdevelopment\t3.75\n", "2025-03-27 14:18:01,746 : INFO : Skipping line #321 with OOV words: morality\timportance\t3.31\n", "2025-03-27 14:18:01,746 : INFO : Skipping line #322 with OOV words: morality\tmarriage\t3.69\n", "2025-03-27 14:18:01,746 : INFO : Skipping line #323 with OOV words: Mexico\tBrazil\t7.44\n", "2025-03-27 14:18:01,747 : INFO : Skipping line #324 with OOV words: gender\tequality\t6.41\n", "2025-03-27 14:18:01,747 : INFO : Skipping line #325 with OOV words: change\tattitude\t5.44\n", "2025-03-27 14:18:01,747 : INFO : Skipping line #327 with OOV 
words: opera\tindustry\t2.63\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #328 with OOV words: sugar\tapproach\t0.88\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #329 with OOV words: practice\tinstitution\t3.19\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #330 with OOV words: ministry\tculture\t4.69\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #331 with OOV words: problem\tchallenge\t6.75\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #332 with OOV words: size\tprominence\t5.31\n", "2025-03-27 14:18:01,748 : INFO : Skipping line #333 with OOV words: country\tcitizen\t7.31\n", "2025-03-27 14:18:01,749 : INFO : Skipping line #334 with OOV words: planet\tpeople\t5.75\n", "2025-03-27 14:18:01,749 : INFO : Skipping line #335 with OOV words: development\tissue\t3.97\n", "2025-03-27 14:18:01,749 : INFO : Skipping line #336 with OOV words: experience\tmusic\t3.47\n", "2025-03-27 14:18:01,749 : INFO : Skipping line #337 with OOV words: music\tproject\t3.63\n", "2025-03-27 14:18:01,750 : INFO : Skipping line #338 with OOV words: glass\tmetal\t5.56\n", "2025-03-27 14:18:01,750 : INFO : Skipping line #339 with OOV words: aluminum\tmetal\t7.83\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:01,750 : INFO : Skipping line #340 with OOV words: chance\tcredibility\t3.88\n", "2025-03-27 14:18:01,750 : INFO : Skipping line #341 with OOV words: exhibit\tmemorabilia\t5.31\n", "2025-03-27 14:18:01,750 : INFO : Skipping line #342 with OOV words: concert\tvirtuoso\t6.81\n", "2025-03-27 14:18:01,751 : INFO : Skipping line #343 with OOV words: rock\tjazz\t7.59\n", "2025-03-27 14:18:01,751 : INFO : Skipping line #344 with OOV words: museum\ttheater\t7.19\n", "2025-03-27 14:18:01,751 : INFO : Skipping line #345 with OOV words: observation\tarchitecture\t4.38\n", "2025-03-27 14:18:01,751 : INFO : Skipping line #347 with OOV words: preservation\tworld\t6.19\n", "2025-03-27 14:18:01,752 : INFO : Skipping line #348 with OOV words: 
admission\tticket\t7.69\n", "2025-03-27 14:18:01,752 : INFO : Skipping line #349 with OOV words: shower\tthunderstorm\t6.31\n", "2025-03-27 14:18:01,752 : INFO : Skipping line #350 with OOV words: shower\tflood\t6.03\n", "2025-03-27 14:18:01,752 : INFO : Skipping line #354 with OOV words: architecture\tcentury\t3.78\n", "2025-03-27 14:18:01,754 : INFO : Pearson correlation coefficient against /Users/vip/Library/Python/3.9/lib/python/site-packages/gensim/test/test_data/wordsim353.tsv: 0.1715\n", "2025-03-27 14:18:01,754 : INFO : Spearman rank-order correlation coefficient against /Users/vip/Library/Python/3.9/lib/python/site-packages/gensim/test/test_data/wordsim353.tsv: 0.1426\n", "2025-03-27 14:18:01,754 : INFO : Pairs with unknown words ratio: 83.0%\n" ] }, { "data": { "text/plain": [ "(PearsonRResult(statistic=0.17147094626235634, pvalue=0.19019767223751688),\n", " SignificanceResult(statistic=0.14264276870117013, pvalue=0.276933183632982),\n", " 83.0028328611898)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".. Important::\n", " Good performance on Google's or WS-353 test set doesn’t mean word2vec will\n", " work well in your application, or vice versa. It’s always best to evaluate\n", " directly on your intended task. 
For an example of how to use word2vec in a\n", " classifier pipeline, see this `tutorial\n", " `_.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Online training / Resuming training\n", "-----------------------------------\n", "\n", "Advanced users can load a model and continue training it with more sentences\n", "and `new vocabulary words `_:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:17,184 : INFO : loading Word2Vec object from /var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4\n", "2025-03-27 14:18:17,191 : INFO : loading wv recursively from /var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4.wv.* with mmap=None\n", "2025-03-27 14:18:17,192 : INFO : setting ignored attribute cum_table to None\n", "2025-03-27 14:18:17,201 : INFO : Word2Vec lifecycle event {'fname': '/var/folders/w_/5zj48w1d0xb7ycgdm6pk40v00000gn/T/gensim-model-k8l_1pq4', 'datetime': '2025-03-27T14:18:17.201501', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'loaded'}\n", "2025-03-27 14:18:17,202 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:17,202 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:17,203 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences\n", "2025-03-27 14:18:17,203 : INFO : Updating model with new vocabulary\n", "2025-03-27 14:18:17,207 : INFO : Word2Vec lifecycle event {'msg': 'added 0 new unique words (0.00% of original 13) and increased the count of 0 pre-existing words (0.00% of original 13)', 'datetime': '2025-03-27T14:18:17.207446', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:17,208 : INFO : deleting the raw counts dictionary of 13 items\n", "2025-03-27 14:18:17,208 : INFO : sample=0.001 downsamples 0 most-common words\n", "2025-03-27 14:18:17,208 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 0 word corpus (0.0%% of prior 0)', 'datetime': '2025-03-27T14:18:17.208794', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:17,216 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes\n", "2025-03-27 14:18:17,216 : INFO : updating layer weights\n", "2025-03-27 14:18:17,217 : INFO : Word2Vec lifecycle event {'update': True, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:17.217470', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:17,217 : WARNING : Effective 'alpha' higher than previous training cycles\n", "2025-03-27 14:18:17,218 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:17.218191', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:17,219 : INFO : EPOCH 0: training on 13 raw words (6 effective words) took 0.0s, 19329 effective words/s\n", "2025-03-27 14:18:17,221 : INFO : EPOCH 1: training on 13 raw words (4 effective words) took 0.0s, 19386 effective words/s\n", "2025-03-27 14:18:17,222 : INFO : EPOCH 2: training on 13 raw words (5 effective words) took 0.0s, 24470 effective words/s\n", "2025-03-27 
14:18:17,224 : INFO : EPOCH 3: training on 13 raw words (5 effective words) took 0.0s, 13587 effective words/s\n", "2025-03-27 14:18:17,225 : INFO : EPOCH 4: training on 13 raw words (6 effective words) took 0.0s, 51046 effective words/s\n", "2025-03-27 14:18:17,226 : INFO : Word2Vec lifecycle event {'msg': 'training on 65 raw words (26 effective words) took 0.0s, 3454 effective words/s', 'datetime': '2025-03-27T14:18:17.226029', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] } ], "source": [ "model = gensim.models.Word2Vec.load(temporary_filepath)\n", "more_sentences = [\n", " ['Advanced', 'users', 'can', 'load', 'a', 'model',\n", " 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],\n", "]\n", "model.build_vocab(more_sentences, update=True)\n", "model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)\n", "\n", "# cleaning up temporary file\n", "import os\n", "os.remove(temporary_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may need to tweak the ``total_words`` parameter to ``train()``,\n", "depending on what learning rate decay you want to simulate.\n", "\n", "Note that it’s not possible to resume training with models generated by the C\n", "tool and loaded with ``KeyedVectors.load_word2vec_format()``. You can still use them for\n", "querying/similarity, but information vital for training (the vocab tree) is\n", "missing there.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training Loss Computation\n", "-------------------------\n", "\n", "The parameter ``compute_loss`` can be used to toggle computation of loss\n", "while training the Word2Vec model. 
The computed loss is stored in the model\n", "attribute ``running_training_loss`` and can be retrieved using the function\n", "``get_latest_training_loss`` as follows:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:27,135 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:27,141 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:27,196 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences\n", "2025-03-27 14:18:27,197 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:27,209 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 6981 unique words (100.00% of original 6981, drops 0)', 'datetime': '2025-03-27T14:18:27.209211', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:27,209 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 58152 word corpus (100.00% of original 58152, drops 0)', 'datetime': '2025-03-27T14:18:27.209743', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:27,226 : INFO : deleting the raw counts dictionary of 6981 items\n", "2025-03-27 14:18:27,227 : INFO : sample=0.001 downsamples 43 most-common words\n", "2025-03-27 14:18:27,234 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 45723.4541622429 word corpus (78.6%% of prior 58152)', 'datetime': '2025-03-27T14:18:27.234194', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 
14:18:27,257 : INFO : estimated required memory for 6981 words and 100 dimensions: 9075300 bytes\n", "2025-03-27 14:18:27,258 : INFO : resetting layer weights\n", "2025-03-27 14:18:27,260 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:27.260906', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:27,261 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 6981 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:27.261234', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:27,389 : INFO : EPOCH 0: training on 58152 raw words (45629 effective words) took 0.1s, 359377 effective words/s\n", "2025-03-27 14:18:27,512 : INFO : EPOCH 1: training on 58152 raw words (45777 effective words) took 0.1s, 374436 effective words/s\n", "2025-03-27 14:18:27,629 : INFO : EPOCH 2: training on 58152 raw words (45684 effective words) took 0.1s, 394596 effective words/s\n", "2025-03-27 14:18:27,750 : INFO : EPOCH 3: training on 58152 raw words (45798 effective words) took 0.1s, 380392 effective words/s\n", "2025-03-27 14:18:27,865 : INFO : EPOCH 4: training on 58152 raw words (45627 effective words) took 0.1s, 398818 effective words/s\n", "2025-03-27 14:18:27,866 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (228515 effective words) took 0.6s, 377822 effective words/s', 'datetime': '2025-03-27T14:18:27.866304', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:27,866 : INFO : 
Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:27.866526', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1374090.375\n" ] } ], "source": [ "# instantiating and training the Word2Vec model\n", "model_with_loss = gensim.models.Word2Vec(\n", " sentences,\n", " min_count=1,\n", " compute_loss=True,\n", " hs=0,\n", " sg=1,\n", " seed=42,\n", ")\n", "\n", "# getting the training loss value\n", "training_loss = model_with_loss.get_latest_training_loss()\n", "print(training_loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Benchmarks\n", "----------\n", "\n", "Let's run some benchmarks to see the effect of the training loss computation code\n", "on training time.\n", "\n", "We'll use the following data for the benchmarks:\n", "\n", "#. Lee Background corpus: included in gensim's test data\n", "#. Text8 corpus. 
To demonstrate the effect of corpus size, we'll look at the\n", " first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "\n", "import gensim.models.word2vec\n", "import gensim.downloader as api\n", "import smart_open\n", "\n", "\n", "def head(path, size):\n", " with smart_open.open(path) as fin:\n", " return io.StringIO(fin.read(size))\n", "\n", "\n", "def generate_input_data():\n", " lee_path = datapath('lee_background.cor')\n", " ls = gensim.models.word2vec.LineSentence(lee_path)\n", " ls.name = '25kB'\n", " yield ls\n", "\n", " text8_path = api.load('text8').fn\n", " labels = ('1MB', '10MB', '50MB', '100MB')\n", " sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)\n", " for l, s in zip(labels, sizes):\n", " ls = gensim.models.word2vec.LineSentence(head(text8_path, s))\n", " ls.name = l\n", " yield ls\n", "\n", "\n", "input_data = list(generate_input_data())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now compare the training time taken for different combinations of input\n", "data and model training parameters like ``hs`` and ``sg``.\n", "\n", "For each combination, we repeat the test several times to obtain the mean and\n", "standard deviation of the test duration.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:51,580 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:51,581 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:51,591 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:51,591 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:51,598 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique 
words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:51.598350', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,599 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:51.598999', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,603 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:51,603 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:51,604 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:51.604219', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,612 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:51,612 : INFO : resetting layer weights\n", "2025-03-27 14:18:51,614 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:51.614136', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:51,614 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:51.614380', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:51,640 : INFO : EPOCH 0: training on 59890 raw words (32668 effective words) took 0.0s, 1316453 effective words/s\n", "2025-03-27 14:18:51,662 : INFO : EPOCH 1: training on 59890 raw words (32652 effective words) took 0.0s, 1591555 effective words/s\n", "2025-03-27 14:18:51,685 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.0s, 1532284 effective words/s\n", "2025-03-27 14:18:51,709 : INFO : EPOCH 3: training on 59890 raw words (32654 effective words) took 0.0s, 1410812 effective words/s\n", "2025-03-27 14:18:51,730 : INFO : EPOCH 4: training on 59890 raw words (32562 effective words) took 0.0s, 1637718 effective words/s\n", "2025-03-27 14:18:51,730 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163104 effective words) took 0.1s, 1403749 effective words/s', 'datetime': '2025-03-27T14:18:51.730871', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:51,731 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:51.731153', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:51,731 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:51,732 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:51,740 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:51,741 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:51,745 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 
9019)', 'datetime': '2025-03-27T14:18:51.745826', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,746 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:51.746259', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,750 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:51,750 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:51,750 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:51.750835', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,757 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:51,757 : INFO : resetting layer weights\n", "2025-03-27 14:18:51,758 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:51.758705', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:51,758 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:51.758926', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:51,782 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.0s, 1420191 effective words/s\n", "2025-03-27 14:18:51,806 : INFO : EPOCH 1: training on 59890 raw words (32567 effective words) took 0.0s, 1432107 effective words/s\n", "2025-03-27 14:18:51,828 : INFO : EPOCH 2: training on 59890 raw words (32610 effective words) took 0.0s, 1753289 effective words/s\n", "2025-03-27 14:18:51,852 : INFO : EPOCH 3: training on 59890 raw words (32579 effective words) took 0.0s, 1473119 effective words/s\n", "2025-03-27 14:18:51,874 : INFO : EPOCH 4: training on 59890 raw words (32632 effective words) took 0.0s, 1533059 effective words/s\n", "2025-03-27 14:18:51,874 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162905 effective words) took 0.1s, 1406975 effective words/s', 'datetime': '2025-03-27T14:18:51.874917', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:51,875 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:51.875212', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:51,875 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:51,876 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:51,885 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:51,885 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:51,889 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': 
'2025-03-27T14:18:51.889966', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:51,890 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:51.890416', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,894 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:51,894 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:51,895 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:51.895118', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:51,901 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:51,901 : INFO : resetting layer weights\n", "2025-03-27 14:18:51,902 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:51.902722', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:51,902 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:51.902937', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:51,925 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 1478911 effective words/s\n", "2025-03-27 14:18:51,946 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.0s, 1656404 effective words/s\n", "2025-03-27 14:18:51,968 : INFO : EPOCH 2: training on 59890 raw words (32603 effective words) took 0.0s, 1589053 effective words/s\n", "2025-03-27 14:18:51,991 : INFO : EPOCH 3: training on 59890 raw words (32598 effective words) took 0.0s, 1447401 effective words/s\n", "2025-03-27 14:18:52,013 : INFO : EPOCH 4: training on 59890 raw words (32581 effective words) took 0.0s, 1583229 effective words/s\n", "2025-03-27 14:18:52,014 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162877 effective words) took 0.1s, 1468784 effective words/s', 'datetime': '2025-03-27T14:18:52.014013', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,014 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.014241', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,014 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,015 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,024 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:52,024 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,028 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 
9019)', 'datetime': '2025-03-27T14:18:52.028376', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,028 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.028687', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,032 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:52,032 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:52,033 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.032997', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,038 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:52,039 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,040 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.040430', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,040 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.040733', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,062 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.0s, 1917026 effective words/s\n", "2025-03-27 14:18:52,082 : INFO : EPOCH 1: training on 59890 raw words (32567 effective words) took 0.0s, 1651781 effective words/s\n", "2025-03-27 14:18:52,104 : INFO : EPOCH 2: training on 59890 raw words (32654 effective words) took 0.0s, 1589118 effective words/s\n", "2025-03-27 14:18:52,126 : INFO : EPOCH 3: training on 59890 raw words (32647 effective words) took 0.0s, 1528835 effective words/s\n", "2025-03-27 14:18:52,148 : INFO : EPOCH 4: training on 59890 raw words (32528 effective words) took 0.0s, 1532328 effective words/s\n", "2025-03-27 14:18:52,148 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162913 effective words) took 0.1s, 1510800 effective words/s', 'datetime': '2025-03-27T14:18:52.148813', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,149 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.149018', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,149 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,149 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,158 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:52,158 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,163 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': 
'2025-03-27T14:18:52.163143', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,163 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.163573', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,167 : INFO : deleting the raw counts dictionary of 10781 items\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:52,167 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:52,168 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.168036', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,173 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:52,174 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,175 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.175021', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,175 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.175210', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,195 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 1665825 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.14458235104878744, 'train_time_std': 0.004779222957497731}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:52,216 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.0s, 1654171 effective words/s\n", "2025-03-27 14:18:52,236 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.0s, 1656253 effective words/s\n", "2025-03-27 14:18:52,257 : INFO : EPOCH 3: training on 59890 raw words (32601 effective words) took 0.0s, 1662924 effective words/s\n", "2025-03-27 14:18:52,278 : INFO : EPOCH 4: training on 59890 raw words (32650 effective words) took 0.0s, 1573516 effective words/s\n", "2025-03-27 14:18:52,278 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163054 effective words) took 0.1s, 1575040 effective words/s', 'datetime': '2025-03-27T14:18:52.278919', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,279 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.279157', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,279 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,279 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,288 : INFO : collected 10781 
word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:52,288 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,293 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:52.293132', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,293 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.293604', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,297 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:52,298 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:52,298 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.298438', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,304 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:52,304 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,305 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.305559', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,305 : INFO : Word2Vec lifecycle 
event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.305752', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,326 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 1637926 effective words/s\n", "2025-03-27 14:18:52,345 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.0s, 1788007 effective words/s\n", "2025-03-27 14:18:52,365 : INFO : EPOCH 2: training on 59890 raw words (32603 effective words) took 0.0s, 1649198 effective words/s\n", "2025-03-27 14:18:52,387 : INFO : EPOCH 3: training on 59890 raw words (32587 effective words) took 0.0s, 1589041 effective words/s\n", "2025-03-27 14:18:52,409 : INFO : EPOCH 4: training on 59890 raw words (32592 effective words) took 0.0s, 1545692 effective words/s\n", "2025-03-27 14:18:52,409 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162877 effective words) took 0.1s, 1572361 effective words/s', 'datetime': '2025-03-27T14:18:52.409519', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,409 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.409763', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,410 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,410 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,419 : INFO : collected 10781 word types from a corpus of 59890 raw 
words and 300 sentences\n", "2025-03-27 14:18:52,419 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,423 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:52.423991', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,424 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.424304', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,428 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:52,428 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:52,428 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.428605', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,429 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:52,450 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:52,456 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:52,456 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,457 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.457429', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,457 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:52,457 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.457822', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,494 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 900347 effective words/s\n", "2025-03-27 14:18:52,531 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.0s, 995122 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:52,569 : INFO : EPOCH 2: training on 59890 raw words (32630 effective words) took 0.0s, 870937 effective words/s\n", "2025-03-27 14:18:52,603 : INFO : EPOCH 3: training on 59890 raw words (32560 effective words) took 0.0s, 971933 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.131805419921875, 'train_time_std': 0.002030210634488772}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:52,640 : INFO : EPOCH 4: training on 59890 raw words (32594 effective words) took 0.0s, 913596 effective words/s\n", "2025-03-27 14:18:52,640 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162879 effective words) took 0.2s, 891681 effective words/s', 'datetime': '2025-03-27T14:18:52.640669', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,640 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.640935', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,641 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,641 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,650 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:52,650 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,654 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:52.654733', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,655 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.655048', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,658 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:52,659 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:52,659 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.659733', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,660 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:52,735 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:52,741 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:52,741 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,742 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.742647', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,742 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:52,743 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.743075', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,782 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.0s, 852306 effective words/s\n", "2025-03-27 14:18:52,821 : INFO : EPOCH 1: training on 59890 raw words (32567 effective words) took 0.0s, 853252 effective words/s\n", "2025-03-27 14:18:52,856 : INFO : EPOCH 2: training on 59890 raw words (32654 effective words) took 0.0s, 943540 effective words/s\n", "2025-03-27 14:18:52,897 : INFO : EPOCH 3: training on 59890 raw words (32527 effective words) took 0.0s, 813490 effective words/s\n", "2025-03-27 14:18:52,937 : INFO : EPOCH 4: training on 59890 raw words (32643 effective words) 
took 0.0s, 873387 effective words/s\n", "2025-03-27 14:18:52,937 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162908 effective words) took 0.2s, 837459 effective words/s', 'datetime': '2025-03-27T14:18:52.937817', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:52,938 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:52.938142', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:52,938 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:52,939 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:52,947 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:52,948 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:52,952 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:52.952422', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,952 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:52.952822', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,956 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:52,957 : INFO : sample=0.001 downsamples 45 
most-common words\n", "2025-03-27 14:18:52,957 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:52.957368', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:52,958 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:52,980 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:52,986 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:52,986 : INFO : resetting layer weights\n", "2025-03-27 14:18:52,987 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:52.987659', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:52,987 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:18:52,988 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:52.988106', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,030 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 790667 effective words/s\n", "2025-03-27 14:18:53,069 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.0s, 851394 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:53,107 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.0s, 892553 effective words/s\n", "2025-03-27 14:18:53,145 : INFO : EPOCH 3: training on 59890 raw words (32745 effective words) took 0.0s, 868695 effective words/s\n", "2025-03-27 14:18:53,182 : INFO : EPOCH 4: training on 59890 raw words (32469 effective words) took 0.0s, 902046 effective words/s\n", "2025-03-27 14:18:53,183 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163017 effective words) took 0.2s, 837489 effective words/s', 'datetime': '2025-03-27T14:18:53.182989', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,183 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:53.183281', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:53,184 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:53,184 : INFO : PROGRESS: at sentence #0, processed 0 words, 
keeping 0 word types\n", "2025-03-27 14:18:53,193 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:53,193 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:53,198 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:53.198678', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,199 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:53.199080', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,203 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:53,203 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:53,204 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:53.203998', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,204 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:53,226 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:53,231 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:53,232 : INFO : resetting layer weights\n", "2025-03-27 14:18:53,233 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': 
'2025-03-27T14:18:53.233346', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:53,233 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:53,233 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:53.233771', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,273 : INFO : EPOCH 0: training on 59890 raw words (32668 effective words) took 0.0s, 845196 effective words/s\n", "2025-03-27 14:18:53,311 : INFO : EPOCH 1: training on 59890 raw words (32652 effective words) took 0.0s, 889269 effective words/s\n", "2025-03-27 14:18:53,347 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.0s, 913981 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.25787798563639325, 'train_time_std': 0.02856204570085011}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:53,385 : INFO : EPOCH 3: training on 59890 raw words (32585 effective words) took 0.0s, 891907 effective words/s\n", "2025-03-27 14:18:53,424 : INFO : EPOCH 4: training on 59890 raw words (32720 effective words) took 0.0s, 852773 effective words/s\n", "2025-03-27 14:18:53,424 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163193 effective words) took 0.2s, 855372 effective words/s', 'datetime': '2025-03-27T14:18:53.424805', 
'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,425 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:53.425072', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:53,425 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:53,426 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:53,434 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:53,435 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:53,439 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:53.439390', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,439 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:53.439738', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,443 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:53,444 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:53,444 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:53.444336', 'gensim': '4.3.3', 'python': 
'3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,444 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:53,465 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:53,471 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:53,471 : INFO : resetting layer weights\n", "2025-03-27 14:18:53,472 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:53.472810', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:53,473 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:18:53,473 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:53.473253', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,510 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.0s, 1034826 effective words/s\n", "2025-03-27 14:18:53,547 : INFO : EPOCH 1: training on 59890 raw words (32617 effective words) took 0.0s, 892260 effective words/s\n", "2025-03-27 14:18:53,586 : INFO : EPOCH 2: training on 59890 raw words (32644 effective words) took 0.0s, 861061 effective words/s\n", "2025-03-27 14:18:53,623 : INFO : EPOCH 3: training on 59890 raw words (32565 effective words) took 0.0s, 908195 effective words/s\n", "2025-03-27 14:18:53,663 : INFO : EPOCH 4: training on 59890 raw words (32582 effective words) took 0.0s, 842844 effective words/s\n", "2025-03-27 14:18:53,663 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162925 effective words) took 0.2s, 856415 effective words/s', 'datetime': '2025-03-27T14:18:53.663719', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,664 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:53.664055', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:53,664 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:53,665 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:53,674 : INFO : 
collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:53,674 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:53,678 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:53.678634', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,679 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:53.679058', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,683 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:53,684 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:53,684 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:53.684493', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,685 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:53,706 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:53,712 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:53,713 : INFO : resetting layer weights\n", "2025-03-27 14:18:53,715 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:53.715007', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 
2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:53,715 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:53,715 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:53.715595', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,757 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.0s, 800675 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:53,799 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.0s, 857711 effective words/s\n", "2025-03-27 14:18:53,836 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.0s, 910614 effective words/s\n", "2025-03-27 14:18:53,874 : INFO : EPOCH 3: training on 59890 raw words (32745 effective words) took 0.0s, 877257 effective words/s\n", "2025-03-27 14:18:53,912 : INFO : EPOCH 4: training on 59890 raw words (32469 effective words) took 0.0s, 894268 effective words/s\n", "2025-03-27 14:18:53,912 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163017 effective words) took 0.2s, 829090 effective words/s', 'datetime': '2025-03-27T14:18:53.912462', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:53,912 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:53.912768', 'gensim': 
'4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:53,913 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:53,913 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:53,922 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:53,923 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:53,927 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:53.927393', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,927 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:53.927857', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,931 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:53,932 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:53,932 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:53.932671', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:53,938 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:53,938 : INFO : 
resetting layer weights\n", "2025-03-27 14:18:53,939 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:53.939746', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:53,940 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:53.940005', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:54,011 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 461777 effective words/s\n", "2025-03-27 14:18:54,084 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.1s, 453880 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.24313569068908691, 'train_time_std': 0.004160162300211008}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:54,155 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.1s, 463317 effective words/s\n", "2025-03-27 14:18:54,224 : INFO : EPOCH 3: training on 59890 raw words (32601 effective words) took 0.1s, 481066 effective words/s\n", "2025-03-27 14:18:54,294 : INFO : EPOCH 4: training on 59890 raw words (32650 effective words) took 0.1s, 476250 effective words/s\n", "2025-03-27 14:18:54,295 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163054 effective words) took 0.4s, 459665 effective words/s', 'datetime': '2025-03-27T14:18:54.295090', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 
03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:54,295 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:54.295386', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:54,295 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:54,296 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:54,305 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:54,305 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:54,309 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:54.309954', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,310 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:54.310263', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,314 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:54,314 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:54,314 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:54.314929', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,320 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:54,321 : INFO : resetting layer weights\n", "2025-03-27 14:18:54,321 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:54.321970', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:54,322 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:54.322227', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:54,399 : INFO : EPOCH 0: training on 59890 raw words (32596 effective words) took 0.1s, 428191 effective words/s\n", "2025-03-27 14:18:54,467 : INFO : EPOCH 1: training on 59890 raw words (32570 effective words) took 0.1s, 491364 effective words/s\n", "2025-03-27 14:18:54,531 : INFO : EPOCH 2: training on 59890 raw words (32503 effective words) took 0.1s, 509023 effective words/s\n", "2025-03-27 14:18:54,593 : INFO : EPOCH 3: training on 59890 raw words (32593 effective words) took 0.1s, 535172 effective words/s\n", "2025-03-27 14:18:54,655 : INFO : EPOCH 4: training on 59890 raw words (32640 effective words) took 0.1s, 532929 effective words/s\n", "2025-03-27 14:18:54,656 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162902 effective words) took 0.3s, 487995 effective words/s', 'datetime': '2025-03-27T14:18:54.656403', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:54,656 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:54.656699', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:54,657 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:54,657 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:54,666 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:54,666 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:54,670 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:54.670832', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,671 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:54.671227', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,675 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:54,676 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:54,676 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:54.676566', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:54,682 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:54,682 : INFO : resetting layer weights\n", "2025-03-27 14:18:54,683 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:54.683469', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:54,683 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:54.683891', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:54,753 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.1s, 474485 effective words/s\n", "2025-03-27 14:18:54,818 : INFO : EPOCH 1: training on 59890 raw words (32567 effective words) took 0.1s, 511538 effective words/s\n", "2025-03-27 14:18:54,885 : INFO : EPOCH 2: training on 59890 raw words (32610 effective words) took 0.1s, 491425 effective words/s\n", "2025-03-27 14:18:54,948 : INFO : EPOCH 3: training on 59890 raw words (32521 effective words) took 0.1s, 527326 effective words/s\n", "2025-03-27 14:18:55,013 : INFO : EPOCH 4: training on 59890 raw words (32622 effective words) took 0.1s, 514643 effective words/s\n", "2025-03-27 14:18:55,013 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162837 effective words) took 0.3s, 494667 effective words/s', 'datetime': '2025-03-27T14:18:55.013575', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:55,013 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:55.013856', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:55,014 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:55,015 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:55,023 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:55,024 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:55,028 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:55.028517', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,028 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:55.028966', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,033 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:55,033 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:55,033 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:55.033786', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,040 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:55,040 : INFO : resetting layer weights\n", "2025-03-27 14:18:55,041 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:55.041981', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:55,042 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:55.042224', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:55,108 : INFO : EPOCH 0: training on 59890 raw words (32596 effective words) took 0.1s, 503421 effective words/s\n", "2025-03-27 14:18:55,172 : INFO : EPOCH 1: training on 59890 raw words (32570 effective words) took 0.1s, 518834 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.3669389883677165, 'train_time_std': 0.011131627256245082}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:55,241 : INFO : EPOCH 2: training on 59890 raw words (32549 effective words) took 0.1s, 482330 effective words/s\n", "2025-03-27 14:18:55,303 : INFO : EPOCH 3: training on 59890 raw words (32574 effective words) took 0.1s, 529224 effective words/s\n", "2025-03-27 14:18:55,367 : INFO : EPOCH 4: training on 59890 raw words (32594 effective words) took 0.1s, 521298 effective words/s\n", "2025-03-27 14:18:55,367 
: INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162883 effective words) took 0.3s, 500972 effective words/s', 'datetime': '2025-03-27T14:18:55.367731', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:55,368 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:55.368013', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:55,368 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:55,368 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:55,377 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:55,377 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:55,382 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:55.382375', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,382 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:55.382774', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,386 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:55,387 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:55,387 : INFO : Word2Vec 
lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:55.387607', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,393 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:55,393 : INFO : resetting layer weights\n", "2025-03-27 14:18:55,394 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:55.394887', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:55,395 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:55.395139', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:55,460 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 505604 effective words/s\n", "2025-03-27 14:18:55,522 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.1s, 538641 effective words/s\n", "2025-03-27 14:18:55,585 : INFO : EPOCH 2: training on 59890 raw words (32568 effective words) took 0.1s, 521698 effective words/s\n", "2025-03-27 14:18:55,648 : INFO : EPOCH 3: training on 59890 raw words (32637 effective words) took 0.1s, 523902 effective words/s\n", "2025-03-27 14:18:55,714 : INFO : EPOCH 4: training on 59890 raw words (32606 effective words) took 0.1s, 505209 effective words/s\n", "2025-03-27 14:18:55,714 : INFO : Word2Vec lifecycle event {'msg': 
'training on 299450 raw words (163046 effective words) took 0.3s, 510619 effective words/s', 'datetime': '2025-03-27T14:18:55.714725', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:55,715 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:55.715046', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:55,715 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:55,715 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:55,724 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:55,725 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:55,729 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:55.729398', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,729 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:55.729804', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,733 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:55,734 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:55,734 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves 
estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:55.734606', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:55,740 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes\n", "2025-03-27 14:18:55,740 : INFO : resetting layer weights\n", "2025-03-27 14:18:55,741 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:55.741813', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:55,742 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:55.742140', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:55,807 : INFO : EPOCH 0: training on 59890 raw words (32596 effective words) took 0.1s, 503025 effective words/s\n", "2025-03-27 14:18:55,871 : INFO : EPOCH 1: training on 59890 raw words (32650 effective words) took 0.1s, 523388 effective words/s\n", "2025-03-27 14:18:55,934 : INFO : EPOCH 2: training on 59890 raw words (32617 effective words) took 0.1s, 526487 effective words/s\n", "2025-03-27 14:18:55,996 : INFO : EPOCH 3: training on 59890 raw words (32571 effective words) took 0.1s, 532672 effective words/s\n", "2025-03-27 14:18:56,059 : INFO : EPOCH 4: training on 59890 raw words (32548 effective words) took 0.1s, 523320 effective words/s\n", "2025-03-27 14:18:56,060 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162982 effective 
words) took 0.3s, 512957 effective words/s', 'datetime': '2025-03-27T14:18:56.060180', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:56,060 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:56.060455', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:56,061 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:56,061 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:56,070 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:56,070 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:56,075 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:56.075244', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,075 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:56.075720', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,079 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:56,080 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:56,080 : INFO : Word2Vec lifecycle event {'msg': 
'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:56.080727', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,081 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:56,102 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:56,107 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:56,108 : INFO : resetting layer weights\n", "2025-03-27 14:18:56,109 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:56.109593', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:56,109 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:18:56,110 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:56.110046', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:56,237 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 256919 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.3488934834798177, 'train_time_std': 0.0037324655418704265}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:56,381 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.1s, 227808 effective words/s\n", "2025-03-27 14:18:56,513 : INFO : EPOCH 2: training on 59890 raw words (32517 effective words) took 0.1s, 248850 effective words/s\n", "2025-03-27 14:18:56,642 : INFO : EPOCH 3: training on 59890 raw words (32769 effective words) took 0.1s, 255979 effective words/s\n", "2025-03-27 14:18:56,775 : INFO : EPOCH 4: training on 59890 raw words (32465 effective words) took 0.1s, 246746 effective words/s\n", "2025-03-27 14:18:56,775 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162846 effective words) took 0.7s, 244642 effective words/s', 'datetime': '2025-03-27T14:18:56.775931', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:56,776 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:56.776274', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:56,776 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:56,777 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:56,785 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:56,786 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:56,790 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:56.790655', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,791 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:56.791200', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,795 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:56,795 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:56,795 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:56.795865', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:56,796 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:56,817 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:56,823 : INFO : estimated required memory for 1762 words and 100 
dimensions: 3347800 bytes\n", "2025-03-27 14:18:56,823 : INFO : resetting layer weights\n", "2025-03-27 14:18:56,824 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:56.824623', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:56,824 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:56,825 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:56.825101', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:56,951 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 258942 effective words/s\n", "2025-03-27 14:18:57,084 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.1s, 247796 effective words/s\n", "2025-03-27 14:18:57,215 : INFO : EPOCH 2: training on 59890 raw words (32630 effective words) took 0.1s, 250316 effective words/s\n", "2025-03-27 14:18:57,343 : INFO : EPOCH 3: training on 59890 raw words (32560 effective words) took 0.1s, 256571 effective words/s\n", "2025-03-27 14:18:57,474 : INFO : EPOCH 4: training on 59890 raw words (32583 effective words) took 0.1s, 251418 effective words/s\n", "2025-03-27 14:18:57,474 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162868 effective words) took 0.6s, 250855 effective words/s', 'datetime': '2025-03-27T14:18:57.474621', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:57,474 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:57.474972', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:57,475 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:57,476 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:57,484 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:57,484 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:57,489 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:57.489178', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:57,489 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:57.489545', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:57,493 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:57,493 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:57,494 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:57.494512', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:57,495 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:57,516 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:57,522 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:57,522 : INFO : resetting layer weights\n", "2025-03-27 14:18:57,524 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:57.524596', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:57,524 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:57,525 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:57.525137', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:57,654 : INFO : EPOCH 0: training on 59890 raw words (32517 effective words) took 0.1s, 253960 effective words/s\n", "2025-03-27 14:18:57,784 : INFO : EPOCH 1: training on 59890 raw words (32567 effective words) took 0.1s, 252599 effective words/s\n", "2025-03-27 14:18:57,914 : INFO : EPOCH 2: training on 59890 raw words (32654 effective words) took 0.1s, 251962 effective words/s\n", "2025-03-27 14:18:58,043 : INFO : EPOCH 3: training on 59890 raw words (32527 effective words) took 0.1s, 254044 effective words/s\n", "2025-03-27 14:18:58,171 : INFO : EPOCH 
4: training on 59890 raw words (32640 effective words) took 0.1s, 256364 effective words/s\n", "2025-03-27 14:18:58,172 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162905 effective words) took 0.6s, 251806 effective words/s', 'datetime': '2025-03-27T14:18:58.172309', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:58,172 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:58.172581', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:58,173 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:58,173 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:58,182 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:58,182 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:58,186 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:58.186777', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,187 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:58.187133', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,190 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 
14:18:58,191 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:58,191 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:58.191600', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,192 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:58,212 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:58,218 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:58,219 : INFO : resetting layer weights\n", "2025-03-27 14:18:58,220 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:58.220697', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:58,221 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:18:58,221 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:58.221645', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:58,348 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 259381 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #6: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 0.7040122350056967, 'train_time_std': 0.00820130813052232}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:58,483 : INFO : EPOCH 1: training on 59890 raw words (32692 effective words) took 0.1s, 244142 effective words/s\n", "2025-03-27 14:18:58,614 : INFO : EPOCH 2: training on 59890 raw words (32559 effective words) took 0.1s, 250152 effective words/s\n", "2025-03-27 14:18:58,740 : INFO : EPOCH 3: training on 59890 raw words (32637 effective words) took 0.1s, 260157 effective words/s\n", "2025-03-27 14:18:58,870 : INFO : EPOCH 4: training on 59890 raw words (32567 effective words) took 0.1s, 252201 effective words/s\n", "2025-03-27 14:18:58,871 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162998 effective words) took 0.6s, 251178 effective words/s', 'datetime': '2025-03-27T14:18:58.871082', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:58,871 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:58.871448', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:58,872 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:58,872 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:58,881 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:58,881 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:58,885 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:58.885848', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,886 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:58.886244', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,890 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:58,890 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:58,890 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:58.890974', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:58,891 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:58,911 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:58,917 : INFO : estimated required memory for 1762 words and 100 
dimensions: 3347800 bytes\n", "2025-03-27 14:18:58,918 : INFO : resetting layer weights\n", "2025-03-27 14:18:58,919 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:58.919192', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:58,919 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:58,919 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:58.919622', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:59,047 : INFO : EPOCH 0: training on 59890 raw words (32543 effective words) took 0.1s, 261332 effective words/s\n", "2025-03-27 14:18:59,179 : INFO : EPOCH 1: training on 59890 raw words (32552 effective words) took 0.1s, 249947 effective words/s\n", "2025-03-27 14:18:59,307 : INFO : EPOCH 2: training on 59890 raw words (32603 effective words) took 0.1s, 256994 effective words/s\n", "2025-03-27 14:18:59,435 : INFO : EPOCH 3: training on 59890 raw words (32661 effective words) took 0.1s, 266036 effective words/s\n", "2025-03-27 14:18:59,568 : INFO : EPOCH 4: training on 59890 raw words (32544 effective words) took 0.1s, 247797 effective words/s\n", "2025-03-27 14:18:59,569 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (162903 effective words) took 0.6s, 250945 effective words/s', 'datetime': '2025-03-27T14:18:59.568984', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:18:59,569 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:18:59.569275', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:18:59,570 : INFO : collecting all words and their counts\n", "2025-03-27 14:18:59,570 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:18:59,579 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences\n", "2025-03-27 14:18:59,579 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:18:59,583 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1762 unique words (16.34% of original 10781, drops 9019)', 'datetime': '2025-03-27T14:18:59.583925', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:59,584 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 46084 word corpus (76.95% of original 59890, drops 13806)', 'datetime': '2025-03-27T14:18:59.584356', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:59,588 : INFO : deleting the raw counts dictionary of 10781 items\n", "2025-03-27 14:18:59,588 : INFO : sample=0.001 downsamples 45 most-common words\n", "2025-03-27 14:18:59,588 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 32610.61883565215 word corpus (70.8%% of prior 46084)', 'datetime': '2025-03-27T14:18:59.588755', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:18:59,589 : INFO : constructing a huffman tree from 1762 words\n", "2025-03-27 14:18:59,620 : INFO : built huffman tree with maximum node depth 13\n", "2025-03-27 14:18:59,626 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes\n", "2025-03-27 14:18:59,627 : INFO : resetting layer weights\n", "2025-03-27 14:18:59,628 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:18:59.628545', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:18:59,628 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:18:59,629 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:18:59.629050', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:18:59,767 : INFO : EPOCH 0: training on 59890 raw words (32648 effective words) took 0.1s, 237265 effective words/s\n", "2025-03-27 14:18:59,897 : INFO : EPOCH 1: training on 59890 raw words (32591 effective words) took 0.1s, 258327 effective words/s\n", "2025-03-27 14:19:00,040 : INFO : EPOCH 2: training on 59890 raw words (32623 effective words) took 0.1s, 229898 effective words/s\n", "2025-03-27 14:19:00,169 : INFO : EPOCH 3: training on 59890 raw words (32622 effective words) took 0.1s, 254299 effective words/s\n", "2025-03-27 14:19:00,296 : INFO : EPOCH 
4: training on 59890 raw words (32707 effective words) took 0.1s, 260828 effective words/s\n", "2025-03-27 14:19:00,296 : INFO : Word2Vec lifecycle event {'msg': 'training on 299450 raw words (163191 effective words) took 0.7s, 244578 effective words/s', 'datetime': '2025-03-27T14:19:00.296512', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:00,296 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:00.296800', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:00,297 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:00,304 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:00,320 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:00,320 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:00,329 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:00.329847', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,330 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:00.330408', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,339 : INFO : deleting the raw counts dictionary of 17251 items\n", 
"2025-03-27 14:19:00,340 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:00,340 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:00.340499', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,353 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:00,353 : INFO : resetting layer weights\n", "2025-03-27 14:19:00,356 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:00.356071', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:00,356 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:00.356414', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:00,417 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 2101477 effective words/s\n", "2025-03-27 14:19:00,483 : INFO : EPOCH 1: training on 175599 raw words (110178 effective words) took 0.1s, 1852705 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #7: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 0.7080506483713785, 'train_time_std': 0.01365019201133514}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:00,550 : INFO : EPOCH 2: training on 
175599 raw words (110145 effective words) took 0.1s, 1855558 effective words/s\n", "2025-03-27 14:19:00,612 : INFO : EPOCH 3: training on 175599 raw words (110095 effective words) took 0.1s, 2034363 effective words/s\n", "2025-03-27 14:19:00,676 : INFO : EPOCH 4: training on 175599 raw words (110334 effective words) took 0.1s, 1927943 effective words/s\n", "2025-03-27 14:19:00,677 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550746 effective words) took 0.3s, 1716942 effective words/s', 'datetime': '2025-03-27T14:19:00.677439', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:00,677 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:00.677720', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:00,678 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:00,684 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:00,700 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:00,700 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:00,709 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:00.709099', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,709 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:00.709526', 'gensim': 
'4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,718 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:00,718 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:00,718 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:00.718921', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:00,732 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:00,732 : INFO : resetting layer weights\n", "2025-03-27 14:19:00,734 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:00.734448', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:00,734 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:00.734656', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:00,795 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 2083994 effective words/s\n", "2025-03-27 14:19:00,858 : INFO : EPOCH 1: training on 175599 raw words (110105 effective words) took 0.1s, 1943798 effective words/s\n", "2025-03-27 14:19:00,920 : INFO : EPOCH 2: training on 175599 raw words (110001 
effective words) took 0.1s, 1985407 effective words/s\n", "2025-03-27 14:19:00,981 : INFO : EPOCH 3: training on 175599 raw words (110063 effective words) took 0.1s, 2057278 effective words/s\n", "2025-03-27 14:19:01,044 : INFO : EPOCH 4: training on 175599 raw words (110309 effective words) took 0.1s, 1950752 effective words/s\n", "2025-03-27 14:19:01,045 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550472 effective words) took 0.3s, 1775259 effective words/s', 'datetime': '2025-03-27T14:19:01.044993', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,045 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:01.045325', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:01,045 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:01,051 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:01,068 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:01,069 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:01,078 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:01.078487', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,078 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:01.078922', 'gensim': '4.3.3', 'python': '3.9.6 
(default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,088 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:01,089 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:01,089 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:01.089788', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,103 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:01,104 : INFO : resetting layer weights\n", "2025-03-27 14:19:01,106 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:01.106982', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:01,107 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:01.107438', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,166 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 2134222 effective words/s\n", "2025-03-27 14:19:01,226 : INFO : EPOCH 1: training on 175599 raw words (110178 effective words) took 0.1s, 2082488 effective words/s\n", "2025-03-27 14:19:01,291 : INFO : EPOCH 2: training on 175599 raw words (110145 effective words) took 
0.1s, 1889708 effective words/s\n", "2025-03-27 14:19:01,356 : INFO : EPOCH 3: training on 175599 raw words (110095 effective words) took 0.1s, 1916653 effective words/s\n", "2025-03-27 14:19:01,422 : INFO : EPOCH 4: training on 175599 raw words (110334 effective words) took 0.1s, 1884164 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:01,423 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550746 effective words) took 0.3s, 1747075 effective words/s', 'datetime': '2025-03-27T14:19:01.423061', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,423 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:01.423399', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:01,424 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:01,430 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:01,447 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:01,448 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:01,456 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:01.456865', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,457 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:01.457355', 
'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,466 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:01,466 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:01,467 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:01.467360', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,480 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:01,481 : INFO : resetting layer weights\n", "2025-03-27 14:19:01,483 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:01.483991', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:01,484 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:01.484306', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,551 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.1s, 1859671 effective words/s\n", "2025-03-27 14:19:01,615 : INFO : EPOCH 1: training on 175599 raw words (110214 effective words) took 0.1s, 1945305 effective words/s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec 
model #8: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.37556807200113934, 'train_time_std': 0.005800017401025353}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:01,679 : INFO : EPOCH 2: training on 175599 raw words (110137 effective words) took 0.1s, 2000347 effective words/s\n", "2025-03-27 14:19:01,740 : INFO : EPOCH 3: training on 175599 raw words (110323 effective words) took 0.1s, 2022381 effective words/s\n", "2025-03-27 14:19:01,808 : INFO : EPOCH 4: training on 175599 raw words (110174 effective words) took 0.1s, 1803673 effective words/s\n", "2025-03-27 14:19:01,809 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551132 effective words) took 0.3s, 1696299 effective words/s', 'datetime': '2025-03-27T14:19:01.809462', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,809 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:01.809798', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:01,810 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:01,816 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:01,833 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:01,833 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:01,841 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:01.841810', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,842 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:01.842316', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,851 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:01,851 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:01,852 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:01.852166', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:01,865 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:01,866 : INFO : resetting layer weights\n", "2025-03-27 14:19:01,868 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:01.868154', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:01,868 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:01.868398', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:01,932 : INFO : EPOCH 0: training on 175599 raw words (110344 
effective words) took 0.1s, 1945680 effective words/s\n", "2025-03-27 14:19:01,991 : INFO : EPOCH 1: training on 175599 raw words (110214 effective words) took 0.1s, 2100445 effective words/s\n", "2025-03-27 14:19:02,069 : INFO : EPOCH 2: training on 175599 raw words (110315 effective words) took 0.1s, 1542830 effective words/s\n", "2025-03-27 14:19:02,135 : INFO : EPOCH 3: training on 175599 raw words (110291 effective words) took 0.1s, 1910001 effective words/s\n", "2025-03-27 14:19:02,201 : INFO : EPOCH 4: training on 175599 raw words (110335 effective words) took 0.1s, 1878404 effective words/s\n", "2025-03-27 14:19:02,201 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551499 effective words) took 0.3s, 1657378 effective words/s', 'datetime': '2025-03-27T14:19:02.201566', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:02,201 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:02.201874', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:02,202 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:02,208 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:02,226 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:02,226 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:02,235 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:02.235730', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,236 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:02.236460', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,246 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:02,246 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:02,247 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:02.247052', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,261 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:02,261 : INFO : resetting layer weights\n", "2025-03-27 14:19:02,264 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:02.264156', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:02,264 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:02.264409', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:02,329 : INFO : EPOCH 0: training on 175599 raw words (109994 
effective words) took 0.1s, 1936984 effective words/s\n", "2025-03-27 14:19:02,391 : INFO : EPOCH 1: training on 175599 raw words (110178 effective words) took 0.1s, 1958318 effective words/s\n", "2025-03-27 14:19:02,452 : INFO : EPOCH 2: training on 175599 raw words (110145 effective words) took 0.1s, 2118558 effective words/s\n", "2025-03-27 14:19:02,512 : INFO : EPOCH 3: training on 175599 raw words (110284 effective words) took 0.1s, 2062050 effective words/s\n", "2025-03-27 14:19:02,572 : INFO : EPOCH 4: training on 175599 raw words (110256 effective words) took 0.1s, 2135420 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:02,573 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550857 effective words) took 0.3s, 1784250 effective words/s', 'datetime': '2025-03-27T14:19:02.573503', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:02,573 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:02.573850', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:02,574 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:02,580 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:02,597 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:02,597 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:02,606 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:02.606667', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,607 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:02.607298', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,616 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:02,617 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:02,617 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:02.617463', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:02,618 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:02,669 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:02,682 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:02,683 : INFO : resetting layer weights\n", "2025-03-27 14:19:02,685 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:02.685069', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:02,685 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:19:02,685 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:02.685539', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #9: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.3833769162495931, 'train_time_std': 0.00846858103415652}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:02,800 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.1s, 1025704 effective words/s\n", "2025-03-27 14:19:02,919 : INFO : EPOCH 1: training on 175599 raw words (110008 effective words) took 0.1s, 979062 effective words/s\n", "2025-03-27 14:19:03,032 : INFO : EPOCH 2: training on 175599 raw words (110417 effective words) took 0.1s, 1048598 effective words/s\n", "2025-03-27 14:19:03,147 : INFO : EPOCH 3: training on 175599 raw words (110226 effective words) took 0.1s, 1017945 effective words/s\n", "2025-03-27 14:19:03,263 : INFO : EPOCH 4: training on 175599 raw words (110314 effective words) took 0.1s, 1015572 effective words/s\n", "2025-03-27 14:19:03,263 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551249 effective words) took 0.6s, 953703 effective words/s', 'datetime': '2025-03-27T14:19:03.263766', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:03,264 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:03.264060', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:03,264 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:03,270 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:03,289 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:03,289 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:03,298 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:03.298333', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:03,298 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:03.298872', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:03,307 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:03,308 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:03,309 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:03.309155', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:03,310 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:03,360 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:03,373 : INFO : estimated required memory 
for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:03,374 : INFO : resetting layer weights\n", "2025-03-27 14:19:03,377 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:03.377743', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:03,378 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:03,378 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:03.378439', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:03,510 : INFO : EPOCH 0: training on 175599 raw words (110202 effective words) took 0.1s, 881600 effective words/s\n", "2025-03-27 14:19:03,634 : INFO : EPOCH 1: training on 175599 raw words (110115 effective words) took 0.1s, 965178 effective words/s\n", "2025-03-27 14:19:03,750 : INFO : EPOCH 2: training on 175599 raw words (110317 effective words) took 0.1s, 1007070 effective words/s\n", "2025-03-27 14:19:03,864 : INFO : EPOCH 3: training on 175599 raw words (110240 effective words) took 0.1s, 1028502 effective words/s\n", "2025-03-27 14:19:03,977 : INFO : EPOCH 4: training on 175599 raw words (110127 effective words) took 0.1s, 1033100 effective words/s\n", "2025-03-27 14:19:03,978 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551001 effective words) took 0.6s, 918528 effective words/s', 'datetime': '2025-03-27T14:19:03.978566', 'gensim': '4.3.3', 'python': '3.9.6 (default, 
Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:03,978 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:03.978861', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:03,979 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:03,985 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:04,002 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:04,003 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:04,011 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:04.011463', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,011 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:04.011908', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,021 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:04,022 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:04,022 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:04.022813', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) 
\\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,023 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:04,073 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:04,086 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:04,086 : INFO : resetting layer weights\n", "2025-03-27 14:19:04,088 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:04.088818', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:04,089 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:04,089 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:04.089273', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:04,210 : INFO : EPOCH 0: training on 175599 raw words (110086 effective words) took 0.1s, 973066 effective words/s\n", "2025-03-27 14:19:04,322 : INFO : EPOCH 1: training on 175599 raw words (110235 effective words) took 0.1s, 1045975 effective words/s\n", "2025-03-27 14:19:04,435 : INFO : EPOCH 2: training on 175599 raw words (109859 effective words) took 0.1s, 1034610 effective words/s\n", "2025-03-27 14:19:04,548 : INFO : EPOCH 3: training on 175599 raw words (110239 effective words) took 0.1s, 1036759 effective 
words/s\n", "2025-03-27 14:19:04,668 : INFO : EPOCH 4: training on 175599 raw words (109999 effective words) took 0.1s, 977604 effective words/s\n", "2025-03-27 14:19:04,668 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550418 effective words) took 0.6s, 949905 effective words/s', 'datetime': '2025-03-27T14:19:04.668977', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:04,669 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:04.669386', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:04,670 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:04,676 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:04,692 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:04,693 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:04,702 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:04.702622', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,703 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:04.703170', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,711 : INFO : deleting 
the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:04,712 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:04,713 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:04.713021', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:04,714 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:04,765 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:04,779 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:04,779 : INFO : resetting layer weights\n", "2025-03-27 14:19:04,781 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:04.781980', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:04,782 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:19:04,782 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:04.782606', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #10: {'train_data': '1MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.6986416975657145, 'train_time_std': 0.011644824822215011}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:04,915 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.1s, 889652 effective words/s\n", "2025-03-27 14:19:05,031 : INFO : EPOCH 1: training on 175599 raw words (110008 effective words) took 0.1s, 1012357 effective words/s\n", "2025-03-27 14:19:05,146 : INFO : EPOCH 2: training on 175599 raw words (110417 effective words) took 0.1s, 1024185 effective words/s\n", "2025-03-27 14:19:05,258 : INFO : EPOCH 3: training on 175599 raw words (110426 effective words) took 0.1s, 1046856 effective words/s\n", "2025-03-27 14:19:05,376 : INFO : EPOCH 4: training on 175599 raw words (110382 effective words) took 0.1s, 1010185 effective words/s\n", "2025-03-27 14:19:05,377 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551517 effective words) took 0.6s, 928410 effective words/s', 'datetime': '2025-03-27T14:19:05.377353', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:05,377 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:05.377648', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:05,378 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:05,384 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:05,402 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:05,402 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:05,411 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:05.411923', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:05,412 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:05.412388', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:05,421 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:05,422 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:05,423 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:05.423181', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:05,424 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:05,475 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:05,488 : INFO : estimated required memory 
for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:05,489 : INFO : resetting layer weights\n", "2025-03-27 14:19:05,492 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:05.492115', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:05,492 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:05,492 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:05.492718', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:05,614 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 959719 effective words/s\n", "2025-03-27 14:19:05,728 : INFO : EPOCH 1: training on 175599 raw words (110178 effective words) took 0.1s, 1043077 effective words/s\n", "2025-03-27 14:19:05,850 : INFO : EPOCH 2: training on 175599 raw words (110145 effective words) took 0.1s, 954730 effective words/s\n", "2025-03-27 14:19:05,965 : INFO : EPOCH 3: training on 175599 raw words (110095 effective words) took 0.1s, 1023340 effective words/s\n", "2025-03-27 14:19:06,081 : INFO : EPOCH 4: training on 175599 raw words (110138 effective words) took 0.1s, 1032725 effective words/s\n", "2025-03-27 14:19:06,081 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550550 effective words) took 0.6s, 934938 effective words/s', 'datetime': '2025-03-27T14:19:06.081811', 'gensim': '4.3.3', 'python': '3.9.6 (default, 
Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:06,082 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:06.082146', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:06,083 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:06,089 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:06,105 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:06,106 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:06,114 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:06.114534', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,114 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:06.114934', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,123 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:06,123 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:06,124 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:06.124273', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) 
\\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,125 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:06,176 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:06,190 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:06,190 : INFO : resetting layer weights\n", "2025-03-27 14:19:06,192 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:06.192266', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:06,192 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:06,192 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:06.192764', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:06,308 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.1s, 1016225 effective words/s\n", "2025-03-27 14:19:06,425 : INFO : EPOCH 1: training on 175599 raw words (110178 effective words) took 0.1s, 999736 effective words/s\n", "2025-03-27 14:19:06,539 : INFO : EPOCH 2: training on 175599 raw words (110145 effective words) took 0.1s, 1053245 effective words/s\n", "2025-03-27 14:19:06,662 : INFO : EPOCH 3: training on 175599 raw words (110095 effective words) took 0.1s, 958349 effective 
words/s\n", "2025-03-27 14:19:06,784 : INFO : EPOCH 4: training on 175599 raw words (110334 effective words) took 0.1s, 951374 effective words/s\n", "2025-03-27 14:19:06,785 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550746 effective words) took 0.6s, 929721 effective words/s', 'datetime': '2025-03-27T14:19:06.785431', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:06,785 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:06.785762', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:06,787 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:06,793 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:06,810 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:06,810 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:06,819 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:06.819499', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,819 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:06.819966', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,828 : INFO : deleting 
the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:06,829 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:06,829 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:06.829593', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:06,843 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:06,843 : INFO : resetting layer weights\n", "2025-03-27 14:19:06,845 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:06.845464', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:06,845 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:06.845695', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #11: {'train_data': '1MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.7054657141367594, 'train_time_std': 0.001835862447610594}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:07,074 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.2s, 500077 effective words/s\n", "2025-03-27 14:19:07,283 : INFO : EPOCH 1: training on 175599 raw words (110008 effective words) took 0.2s, 546161 effective words/s\n", "2025-03-27 
14:19:07,509 : INFO : EPOCH 2: training on 175599 raw words (110417 effective words) took 0.2s, 505302 effective words/s\n", "2025-03-27 14:19:07,721 : INFO : EPOCH 3: training on 175599 raw words (110001 effective words) took 0.2s, 536428 effective words/s\n", "2025-03-27 14:19:07,926 : INFO : EPOCH 4: training on 175599 raw words (110184 effective words) took 0.2s, 556727 effective words/s\n", "2025-03-27 14:19:07,927 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550894 effective words) took 1.1s, 509344 effective words/s', 'datetime': '2025-03-27T14:19:07.927484', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:07,927 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:07.927772', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:07,929 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:07,935 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:07,952 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:07,952 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:07,961 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:07.961552', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:07,962 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': 
'2025-03-27T14:19:07.962073', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:07,970 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:07,971 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:07,972 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:07.972034', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:07,985 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:07,986 : INFO : resetting layer weights\n", "2025-03-27 14:19:07,987 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:07.987794', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:07,988 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:07.988008', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:08,196 : INFO : EPOCH 0: training on 175599 raw words (109994 effective words) took 0.2s, 551191 effective words/s\n", "2025-03-27 14:19:08,405 : INFO : EPOCH 1: training on 175599 raw words (110054 effective words) took 0.2s, 544563 effective words/s\n", "2025-03-27 14:19:08,612 : INFO : EPOCH 
2: training on 175599 raw words (110371 effective words) took 0.2s, 553603 effective words/s\n", "2025-03-27 14:19:08,818 : INFO : EPOCH 3: training on 175599 raw words (110429 effective words) took 0.2s, 555476 effective words/s\n", "2025-03-27 14:19:09,028 : INFO : EPOCH 4: training on 175599 raw words (109966 effective words) took 0.2s, 543189 effective words/s\n", "2025-03-27 14:19:09,028 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550814 effective words) took 1.0s, 529333 effective words/s', 'datetime': '2025-03-27T14:19:09.028821', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:09,029 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:09.029108', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:09,029 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:09,036 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:09,052 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:09,053 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:09,061 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:09.061925', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:09,062 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:09.062502', 
'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:09,071 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:09,072 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:09,073 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:09.073258', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:09,087 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:09,087 : INFO : resetting layer weights\n", "2025-03-27 14:19:09,090 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:09.090310', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:09,090 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:09.090923', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:09,300 : INFO : EPOCH 0: training on 175599 raw words (110086 effective words) took 0.2s, 543498 effective words/s\n", "2025-03-27 14:19:09,506 : INFO : EPOCH 1: training on 175599 raw words (110309 effective words) took 0.2s, 557322 effective words/s\n", "2025-03-27 14:19:09,713 : INFO : EPOCH 2: training on 175599 raw words 
(110151 effective words) took 0.2s, 549361 effective words/s\n", "2025-03-27 14:19:09,928 : INFO : EPOCH 3: training on 175599 raw words (110293 effective words) took 0.2s, 528290 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:10,135 : INFO : EPOCH 4: training on 175599 raw words (110127 effective words) took 0.2s, 554604 effective words/s\n", "2025-03-27 14:19:10,135 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550966 effective words) took 1.0s, 527338 effective words/s', 'datetime': '2025-03-27T14:19:10.135916', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:10,136 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:10.136207', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:10,136 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:10,143 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:10,159 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:10,160 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:10,169 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:10.169759', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:10,170 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': 
'2025-03-27T14:19:10.170158', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:10,179 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:10,180 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:10,180 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:10.180287', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:10,193 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:10,193 : INFO : resetting layer weights\n", "2025-03-27 14:19:10,195 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:10.195457', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:10,195 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:10.195659', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #12: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 1.1165948708852131, 'train_time_std': 0.018046651526726992}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:10,404 : INFO 
: EPOCH 0: training on 175599 raw words (110086 effective words) took 0.2s, 544621 effective words/s\n", "2025-03-27 14:19:10,611 : INFO : EPOCH 1: training on 175599 raw words (110309 effective words) took 0.2s, 554978 effective words/s\n", "2025-03-27 14:19:10,823 : INFO : EPOCH 2: training on 175599 raw words (110151 effective words) took 0.2s, 543245 effective words/s\n", "2025-03-27 14:19:11,036 : INFO : EPOCH 3: training on 175599 raw words (110252 effective words) took 0.2s, 538233 effective words/s\n", "2025-03-27 14:19:11,264 : INFO : EPOCH 4: training on 175599 raw words (110114 effective words) took 0.2s, 501008 effective words/s\n", "2025-03-27 14:19:11,265 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550912 effective words) took 1.1s, 515013 effective words/s', 'datetime': '2025-03-27T14:19:11.265598', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:11,265 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:11.265944', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:11,266 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:11,272 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:11,289 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:11,289 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:11,298 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:11.298064', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:11,298 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:11.298530', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:11,307 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:11,308 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:11,308 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:11.308507', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:11,323 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:11,323 : INFO : resetting layer weights\n", "2025-03-27 14:19:11,325 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:11.325814', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:11,326 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:11.326102', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:11,544 : INFO : EPOCH 0: 
training on 175599 raw words (110284 effective words) took 0.2s, 523900 effective words/s\n", "2025-03-27 14:19:11,750 : INFO : EPOCH 1: training on 175599 raw words (110008 effective words) took 0.2s, 553140 effective words/s\n", "2025-03-27 14:19:11,956 : INFO : EPOCH 2: training on 175599 raw words (110566 effective words) took 0.2s, 555504 effective words/s\n", "2025-03-27 14:19:12,161 : INFO : EPOCH 3: training on 175599 raw words (110199 effective words) took 0.2s, 561021 effective words/s\n", "2025-03-27 14:19:12,377 : INFO : EPOCH 4: training on 175599 raw words (110181 effective words) took 0.2s, 526644 effective words/s\n", "2025-03-27 14:19:12,378 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551238 effective words) took 1.1s, 524113 effective words/s', 'datetime': '2025-03-27T14:19:12.378152', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:12,378 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:12.378459', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:12,379 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:12,385 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:12,402 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:12,403 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:12,411 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:12.411563', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 
'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:12,412 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:12.412011', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:12,421 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:12,422 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:12,422 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:12.422331', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:12,435 : INFO : estimated required memory for 4125 words and 100 dimensions: 5362500 bytes\n", "2025-03-27 14:19:12,435 : INFO : resetting layer weights\n", "2025-03-27 14:19:12,437 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:12.437582', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:12,437 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:12.437816', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:12,644 : INFO : EPOCH 0: training on 175599 raw 
words (109994 effective words) took 0.2s, 550841 effective words/s\n", "2025-03-27 14:19:12,849 : INFO : EPOCH 1: training on 175599 raw words (110177 effective words) took 0.2s, 554464 effective words/s\n", "2025-03-27 14:19:13,058 : INFO : EPOCH 2: training on 175599 raw words (110354 effective words) took 0.2s, 551496 effective words/s\n", "2025-03-27 14:19:13,264 : INFO : EPOCH 3: training on 175599 raw words (110208 effective words) took 0.2s, 559034 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:13,466 : INFO : EPOCH 4: training on 175599 raw words (110212 effective words) took 0.2s, 567565 effective words/s\n", "2025-03-27 14:19:13,467 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550945 effective words) took 1.0s, 535412 effective words/s', 'datetime': '2025-03-27T14:19:13.467077', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:13,467 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:13.467361', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:13,468 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:13,474 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:13,491 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:13,492 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:13,501 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:13.501364', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:13,501 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:13.501680', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:13,510 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:13,510 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:13,511 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:13.511120', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:13,512 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:13,564 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:13,577 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:13,577 : INFO : resetting layer weights\n", "2025-03-27 14:19:13,579 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:13.579412', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:13,579 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:19:13,579 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:13.579859', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #13: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 1.1103806495666504, 'train_time_std': 0.016719505657097004}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:14,057 : INFO : EPOCH 0: training on 175599 raw words (110086 effective words) took 0.5s, 234531 effective words/s\n", "2025-03-27 14:19:14,541 : INFO : EPOCH 1: training on 175599 raw words (110147 effective words) took 0.5s, 231500 effective words/s\n", "2025-03-27 14:19:15,022 : INFO : EPOCH 2: training on 175599 raw words (110141 effective words) took 0.5s, 232828 effective words/s\n", "2025-03-27 14:19:15,479 : INFO : EPOCH 3: training on 175599 raw words (109999 effective words) took 0.4s, 245978 effective words/s\n", "2025-03-27 14:19:15,963 : INFO : EPOCH 4: training on 175599 raw words (110137 effective words) took 0.5s, 232465 effective words/s\n", "2025-03-27 14:19:15,963 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550510 effective words) took 2.4s, 230954 effective words/s', 'datetime': '2025-03-27T14:19:15.963700', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:15,964 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:15.964041', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:15,964 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:15,970 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:15,987 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:15,987 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:15,995 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:15.995907', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:15,996 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:15.996303', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:16,005 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:16,006 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:16,007 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:16.007291', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:16,008 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:16,061 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:16,074 : INFO : estimated required memory 
for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:16,074 : INFO : resetting layer weights\n", "2025-03-27 14:19:16,076 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:16.076876', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:16,077 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:16,077 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:16.077339', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:16,575 : INFO : EPOCH 0: training on 175599 raw words (110086 effective words) took 0.5s, 224865 effective words/s\n", "2025-03-27 14:19:17,099 : INFO : EPOCH 1: training on 175599 raw words (110274 effective words) took 0.5s, 213216 effective words/s\n", "2025-03-27 14:19:17,611 : INFO : EPOCH 2: training on 175599 raw words (110178 effective words) took 0.5s, 218952 effective words/s\n", "2025-03-27 14:19:18,094 : INFO : EPOCH 3: training on 175599 raw words (110164 effective words) took 0.5s, 231315 effective words/s\n", "2025-03-27 14:19:18,610 : INFO : EPOCH 4: training on 175599 raw words (110248 effective words) took 0.5s, 217261 effective words/s\n", "2025-03-27 14:19:18,610 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550950 effective words) took 2.5s, 217498 effective words/s', 'datetime': '2025-03-27T14:19:18.610674', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:18,611 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:18.611020', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:18,612 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:18,618 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:18,637 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:18,638 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:18,648 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:18.648049', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:18,648 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:18.648902', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:18,664 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:18,666 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:18,667 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:18.667730', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) 
\\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:18,669 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:18,724 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:18,739 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:18,739 : INFO : resetting layer weights\n", "2025-03-27 14:19:18,741 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:18.741928', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:18,742 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:18,742 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:18.742796', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:19,295 : INFO : EPOCH 0: training on 175599 raw words (110135 effective words) took 0.5s, 202153 effective words/s\n", "2025-03-27 14:19:19,780 : INFO : EPOCH 1: training on 175599 raw words (110254 effective words) took 0.5s, 230955 effective words/s\n", "2025-03-27 14:19:20,265 : INFO : EPOCH 2: training on 175599 raw words (110128 effective words) took 0.5s, 230609 effective words/s\n", "2025-03-27 14:19:20,765 : INFO : EPOCH 3: training on 175599 raw words (110331 effective words) took 0.5s, 223910 effective 
words/s\n", "2025-03-27 14:19:21,272 : INFO : EPOCH 4: training on 175599 raw words (110240 effective words) took 0.5s, 220804 effective words/s\n", "2025-03-27 14:19:21,273 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551088 effective words) took 2.5s, 217823 effective words/s', 'datetime': '2025-03-27T14:19:21.273235', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:21,273 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:21.273626', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:21,275 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:21,281 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:21,299 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:21,300 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:21,308 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:21.308329', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:21,309 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:21.309045', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:21,318 : INFO : deleting 
the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:21,319 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:21,319 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:21.319423', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:21,320 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:21,372 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:21,386 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:21,386 : INFO : resetting layer weights\n", "2025-03-27 14:19:21,388 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:21.388349', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:21,388 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:19:21,388 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:21.388933', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #14: {'train_data': '1MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 2.602240562438965, 'train_time_std': 0.07494918217447978}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:21,878 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.5s, 228829 effective words/s\n", "2025-03-27 14:19:22,345 : INFO : EPOCH 1: training on 175599 raw words (110008 effective words) took 0.5s, 239658 effective words/s\n", "2025-03-27 14:19:22,863 : INFO : EPOCH 2: training on 175599 raw words (110417 effective words) took 0.5s, 217073 effective words/s\n", "2025-03-27 14:19:23,380 : INFO : EPOCH 3: training on 175599 raw words (110226 effective words) took 0.5s, 216231 effective words/s\n", "2025-03-27 14:19:23,894 : INFO : EPOCH 4: training on 175599 raw words (110223 effective words) took 0.5s, 217576 effective words/s\n", "2025-03-27 14:19:23,894 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (551158 effective words) took 2.5s, 219976 effective words/s', 'datetime': '2025-03-27T14:19:23.894697', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:23,895 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:23.895041', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:23,896 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:23,902 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:23,919 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:23,920 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:23,928 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:23.928886', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:23,929 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:23.929798', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:23,938 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:23,939 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:23,940 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:23.940030', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:23,941 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:24,046 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:24,060 : INFO : estimated required memory 
for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:24,060 : INFO : resetting layer weights\n", "2025-03-27 14:19:24,062 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:24.062748', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:24,063 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:24,063 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:24.063323', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:24,532 : INFO : EPOCH 0: training on 175599 raw words (110284 effective words) took 0.5s, 238898 effective words/s\n", "2025-03-27 14:19:24,982 : INFO : EPOCH 1: training on 175599 raw words (110209 effective words) took 0.4s, 249078 effective words/s\n", "2025-03-27 14:19:25,436 : INFO : EPOCH 2: training on 175599 raw words (110051 effective words) took 0.4s, 246134 effective words/s\n", "2025-03-27 14:19:25,880 : INFO : EPOCH 3: training on 175599 raw words (110192 effective words) took 0.4s, 252304 effective words/s\n", "2025-03-27 14:19:26,350 : INFO : EPOCH 4: training on 175599 raw words (110120 effective words) took 0.5s, 238913 effective words/s\n", "2025-03-27 14:19:26,351 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550856 effective words) took 2.3s, 240799 effective words/s', 'datetime': '2025-03-27T14:19:26.351171', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:26,351 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:26.351551', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:26,352 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:26,358 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:26,376 : INFO : collected 17251 word types from a corpus of 175599 raw words and 18 sentences\n", "2025-03-27 14:19:26,377 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:26,386 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4125 unique words (23.91% of original 17251, drops 13126)', 'datetime': '2025-03-27T14:19:26.386266', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:26,386 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 154201 word corpus (87.81% of original 175599, drops 21398)', 'datetime': '2025-03-27T14:19:26.386769', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:26,395 : INFO : deleting the raw counts dictionary of 17251 items\n", "2025-03-27 14:19:26,396 : INFO : sample=0.001 downsamples 40 most-common words\n", "2025-03-27 14:19:26,396 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 110199.4281334271 word corpus (71.5%% of prior 154201)', 'datetime': '2025-03-27T14:19:26.396473', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) 
\\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:26,397 : INFO : constructing a huffman tree from 4125 words\n", "2025-03-27 14:19:26,448 : INFO : built huffman tree with maximum node depth 15\n", "2025-03-27 14:19:26,462 : INFO : estimated required memory for 4125 words and 100 dimensions: 7837500 bytes\n", "2025-03-27 14:19:26,463 : INFO : resetting layer weights\n", "2025-03-27 14:19:26,465 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:26.465868', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:26,466 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:26,466 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4125 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:26.466460', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:26,988 : INFO : EPOCH 0: training on 175599 raw words (110242 effective words) took 0.5s, 214359 effective words/s\n", "2025-03-27 14:19:27,477 : INFO : EPOCH 1: training on 175599 raw words (110156 effective words) took 0.5s, 229004 effective words/s\n", "2025-03-27 14:19:27,978 : INFO : EPOCH 2: training on 175599 raw words (110017 effective words) took 0.5s, 222960 effective words/s\n", "2025-03-27 14:19:28,476 : INFO : EPOCH 3: training on 175599 raw words (109948 effective words) took 0.5s, 223899 effective 
words/s\n", "2025-03-27 14:19:28,995 : INFO : EPOCH 4: training on 175599 raw words (110492 effective words) took 0.5s, 216496 effective words/s\n", "2025-03-27 14:19:28,995 : INFO : Word2Vec lifecycle event {'msg': 'training on 877995 raw words (550855 effective words) took 2.5s, 217820 effective words/s', 'datetime': '2025-03-27T14:19:28.995607', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:28,995 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:28.995979', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:28,997 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:29,060 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #15: {'train_data': '1MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 2.5740624268849692, 'train_time_std': 0.08392968022586558}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:29,231 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:29,231 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:29,276 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:19:29.276958', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:29,277 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 
1788017, drops 84301)', 'datetime': '2025-03-27T14:19:29.277529', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:29,323 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:29,326 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:29,326 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:29.326667', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:29,392 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:29,393 : INFO : resetting layer weights\n", "2025-03-27 14:19:29,399 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:29.399444', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:29,400 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:29.400060', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:30,133 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 0.7s, 1712771 effective words/s\n", "2025-03-27 14:19:30,875 : INFO : EPOCH 1: training on 1788017 raw words (1241711 effective words) took 0.7s, 1692236 effective 
words/s\n", "2025-03-27 14:19:31,618 : INFO : EPOCH 2: training on 1788017 raw words (1242767 effective words) took 0.7s, 1692335 effective words/s\n", "2025-03-27 14:19:32,366 : INFO : EPOCH 3: training on 1788017 raw words (1242596 effective words) took 0.7s, 1817617 effective words/s\n", "2025-03-27 14:19:33,097 : INFO : EPOCH 4: training on 1788017 raw words (1242016 effective words) took 0.7s, 1718767 effective words/s\n", "2025-03-27 14:19:33,098 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210537 effective words) took 3.7s, 1679452 effective words/s', 'datetime': '2025-03-27T14:19:33.098221', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:33,098 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:33.098524', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:33,099 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:33,161 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:33,348 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:33,349 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:33,395 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:19:33.395792', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:33,396 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 
1788017, drops 84301)', 'datetime': '2025-03-27T14:19:33.396342', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:33,440 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:33,442 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:33,443 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:33.443114', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:33,510 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:33,510 : INFO : resetting layer weights\n", "2025-03-27 14:19:33,517 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:33.517153', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:33,517 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:33.517730', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:34,228 : INFO : EPOCH 0: training on 1788017 raw words (1242847 effective words) took 0.6s, 1922778 effective words/s\n", "2025-03-27 14:19:35,007 : INFO : EPOCH 1: training on 1788017 raw words (1241921 effective words) took 0.8s, 1613012 effective 
words/s\n", "2025-03-27 14:19:35,716 : INFO : EPOCH 2: training on 1788017 raw words (1242513 effective words) took 0.7s, 1774629 effective words/s\n", "2025-03-27 14:19:36,502 : INFO : EPOCH 3: training on 1788017 raw words (1241965 effective words) took 0.7s, 1719632 effective words/s\n", "2025-03-27 14:19:37,229 : INFO : EPOCH 4: training on 1788017 raw words (1242250 effective words) took 0.7s, 1728949 effective words/s\n", "2025-03-27 14:19:37,230 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211496 effective words) took 3.7s, 1673197 effective words/s', 'datetime': '2025-03-27T14:19:37.230318', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:37,230 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:37.230556', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:37,232 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:37,294 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:37,495 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:37,495 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:37,544 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:19:37.544365', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:37,544 : INFO : Word2Vec lifecycle event {'msg': 
'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:19:37.544797', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:37,591 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:37,593 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:37,594 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:37.594148', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:37,660 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:37,661 : INFO : resetting layer weights\n", "2025-03-27 14:19:37,667 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:37.667935', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:37,668 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:37.668368', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:38,432 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 0.7s, 1656861 effective words/s\n", "2025-03-27 14:19:39,215 : INFO : EPOCH 1: training on 1788017 
raw words (1241586 effective words) took 0.7s, 1729149 effective words/s\n", "2025-03-27 14:19:39,929 : INFO : EPOCH 2: training on 1788017 raw words (1241929 effective words) took 0.7s, 1762595 effective words/s\n", "2025-03-27 14:19:40,652 : INFO : EPOCH 3: training on 1788017 raw words (1243222 effective words) took 0.7s, 1892546 effective words/s\n", "2025-03-27 14:19:41,352 : INFO : EPOCH 4: training on 1788017 raw words (1242392 effective words) took 0.7s, 1797675 effective words/s\n", "2025-03-27 14:19:41,352 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210576 effective words) took 3.7s, 1685834 effective words/s', 'datetime': '2025-03-27T14:19:41.352623', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:41,352 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:41.352849', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:41,354 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:41,419 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #16: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 4.119072596232097, 'train_time_std': 0.012591949927743307}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:41,599 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:41,600 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:41,645 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': 
'2025-03-27T14:19:41.645198', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:41,645 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:19:41.645665', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:41,690 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:41,693 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:41,693 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:41.693432', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:41,759 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:41,759 : INFO : resetting layer weights\n", "2025-03-27 14:19:41,766 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:41.765986', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:41,766 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:41.766503', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:42,445 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 0.6s, 2014390 effective words/s\n", "2025-03-27 14:19:43,124 : INFO : EPOCH 1: training on 1788017 raw words (1242860 effective words) took 0.7s, 1855865 effective words/s\n", "2025-03-27 14:19:43,807 : INFO : EPOCH 2: training on 1788017 raw words (1242204 effective words) took 0.7s, 1842147 effective words/s\n", "2025-03-27 14:19:44,502 : INFO : EPOCH 3: training on 1788017 raw words (1241876 effective words) took 0.7s, 1807066 effective words/s\n", "2025-03-27 14:19:45,180 : INFO : EPOCH 4: training on 1788017 raw words (1242449 effective words) took 0.6s, 2039931 effective words/s\n", "2025-03-27 14:19:45,181 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210882 effective words) took 3.4s, 1819108 effective words/s', 'datetime': '2025-03-27T14:19:45.181010', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:45,181 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:45.181248', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:45,182 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:45,249 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:45,432 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:45,432 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:45,478 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 
'datetime': '2025-03-27T14:19:45.478328', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:45,478 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:19:45.478783', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:45,520 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:45,523 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:45,523 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:45.523597', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:45,588 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:45,588 : INFO : resetting layer weights\n", "2025-03-27 14:19:45,595 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:45.595256', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:45,595 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:45.595682', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:46,284 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 0.6s, 2006833 effective words/s\n", "2025-03-27 14:19:46,965 : INFO : EPOCH 1: training on 1788017 raw words (1242965 effective words) took 0.7s, 1847411 effective words/s\n", "2025-03-27 14:19:47,649 : INFO : EPOCH 2: training on 1788017 raw words (1242312 effective words) took 0.6s, 2022617 effective words/s\n", "2025-03-27 14:19:48,369 : INFO : EPOCH 3: training on 1788017 raw words (1242593 effective words) took 0.7s, 1748243 effective words/s\n", "2025-03-27 14:19:49,124 : INFO : EPOCH 4: training on 1788017 raw words (1242481 effective words) took 0.7s, 1663930 effective words/s\n", "2025-03-27 14:19:49,125 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211844 effective words) took 3.5s, 1759933 effective words/s', 'datetime': '2025-03-27T14:19:49.125496', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:49,125 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:49.125767', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:49,128 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:49,193 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:19:49,376 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:49,377 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:49,423 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 
'datetime': '2025-03-27T14:19:49.423081', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:19:49,423 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:19:49.423635', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:49,466 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:49,468 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:49,469 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:49.469070', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:49,535 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:19:49,535 : INFO : resetting layer weights\n", "2025-03-27 14:19:49,542 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:49.542067', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:49,542 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:49.542354', 'gensim': '4.3.3', 'python': 
'3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:50,225 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 0.6s, 2021864 effective words/s\n", "2025-03-27 14:19:50,900 : INFO : EPOCH 1: training on 1788017 raw words (1242857 effective words) took 0.7s, 1866103 effective words/s\n", "2025-03-27 14:19:51,578 : INFO : EPOCH 2: training on 1788017 raw words (1242430 effective words) took 0.6s, 2035898 effective words/s\n", "2025-03-27 14:19:52,256 : INFO : EPOCH 3: training on 1788017 raw words (1242244 effective words) took 0.7s, 1856255 effective words/s\n", "2025-03-27 14:19:52,933 : INFO : EPOCH 4: training on 1788017 raw words (1242861 effective words) took 0.7s, 1858738 effective words/s\n", "2025-03-27 14:19:52,934 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211885 effective words) took 3.4s, 1831479 effective words/s', 'datetime': '2025-03-27T14:19:52.934257', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:52,934 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:19:52.934507', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:19:52,936 : INFO : collecting all words and their counts\n", "2025-03-27 14:19:53,002 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #17: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 3.860673983891805, 'train_time_std': 0.06030935425040049}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"2025-03-27 14:19:53,182 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:19:53,182 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:19:53,227 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:19:53.227706', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:53,228 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:19:53.228055', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:53,271 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:19:53,273 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:19:53,274 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:19:53.274065', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:19:53,279 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:19:53,533 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:19:53,595 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:19:53,596 : INFO : resetting layer weights\n", "2025-03-27 14:19:53,603 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:19:53.603341', 
'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:19:53,603 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:19:53,603 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:19:53.603847', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:19:54,620 : INFO : EPOCH 0 - PROGRESS: at 71.51% examples, 881653 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:19:55,004 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 1.4s, 891457 effective words/s\n", "2025-03-27 14:19:56,080 : INFO : EPOCH 1 - PROGRESS: at 76.54% examples, 944255 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:19:56,392 : INFO : EPOCH 1: training on 1788017 raw words (1242965 effective words) took 1.3s, 940851 effective words/s\n", "2025-03-27 14:19:57,401 : INFO : EPOCH 2 - PROGRESS: at 71.51% examples, 888622 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:19:57,779 : INFO : EPOCH 2: training on 1788017 raw words (1242312 effective words) took 1.4s, 901340 effective words/s\n", "2025-03-27 14:19:58,791 : INFO : EPOCH 3 - PROGRESS: at 61.45% examples, 761651 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:19:59,381 : INFO : EPOCH 3: training on 1788017 raw words (1242593 effective words) took 1.6s, 779903 effective words/s\n", "2025-03-27 14:20:00,451 : INFO : EPOCH 4 - PROGRESS: at 72.63% examples, 900985 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:00,849 : INFO : EPOCH 4: 
training on 1788017 raw words (1242481 effective words) took 1.4s, 887509 effective words/s\n", "2025-03-27 14:20:00,850 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211844 effective words) took 7.2s, 857230 effective words/s', 'datetime': '2025-03-27T14:20:00.850377', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:00,850 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:00.850686', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:00,852 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:00,918 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:20:01,106 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:01,107 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:01,155 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:01.155452', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:01,155 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:01.155869', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:01,205 : INFO : deleting the raw counts dictionary of 73167 items\n", 
"2025-03-27 14:20:01,208 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:20:01,208 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:01.208476', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:01,213 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:20:01,484 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:20:01,546 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:20:01,547 : INFO : resetting layer weights\n", "2025-03-27 14:20:01,554 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:01.554519', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:01,554 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:20:01,555 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:01.555162', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:02,570 : INFO : EPOCH 0 - PROGRESS: at 71.51% examples, 883374 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:02,947 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 1.4s, 896840 effective words/s\n", "2025-03-27 14:20:03,960 : INFO : EPOCH 1 - PROGRESS: at 71.51% examples, 885495 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:20:04,347 : INFO : EPOCH 1: training on 1788017 raw words (1241621 effective words) took 1.4s, 892560 effective words/s\n", "2025-03-27 14:20:05,361 : INFO : EPOCH 2 - PROGRESS: at 72.07% examples, 891273 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:05,733 : INFO : EPOCH 2: training on 1788017 raw words (1242229 effective words) took 1.4s, 902241 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:20:06,804 : INFO : EPOCH 3 - PROGRESS: at 76.54% examples, 949470 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:07,116 : INFO : EPOCH 3: training on 1788017 raw words (1242362 effective words) took 1.3s, 944861 effective words/s\n", "2025-03-27 14:20:08,137 : INFO : EPOCH 4 - PROGRESS: at 71.51% examples, 877142 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:08,526 : INFO : EPOCH 4: training on 1788017 raw words (1241770 effective words) took 1.4s, 886335 effective words/s\n", "2025-03-27 14:20:08,526 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6209429 effective words) took 7.0s, 890689 effective words/s', 'datetime': '2025-03-27T14:20:08.526802', 'gensim': '4.3.3', 
'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:08,527 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:08.527024', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:08,530 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:08,592 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:20:08,773 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:08,774 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:08,819 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:08.819386', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:08,819 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:08.819791', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:08,864 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:08,866 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:20:08,866 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:08.866643', 'gensim': '4.3.3', 'python': '3.9.6 
(default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:08,873 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:20:09,197 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:20:09,259 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:20:09,260 : INFO : resetting layer weights\n", "2025-03-27 14:20:09,266 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:09.266823', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:09,267 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:20:09,267 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:09.267398', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:10,279 : INFO : EPOCH 0 - PROGRESS: at 70.39% examples, 871453 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:10,681 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 1.4s, 883162 effective words/s\n", "2025-03-27 14:20:11,692 : INFO : EPOCH 1 - PROGRESS: at 72.07% examples, 893668 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:12,063 : INFO : EPOCH 1: training on 1788017 raw words (1241621 effective words) took 1.4s, 904231 effective words/s\n", "2025-03-27 14:20:13,135 : INFO : EPOCH 2 - PROGRESS: at 75.98% examples, 940826 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:13,457 : INFO : EPOCH 2: training on 1788017 raw words (1242229 effective words) took 1.3s, 936103 effective words/s\n", "2025-03-27 14:20:14,471 : INFO : EPOCH 3 - PROGRESS: at 72.07% examples, 891517 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:14,842 : INFO : EPOCH 3: training on 1788017 raw words (1242362 effective words) took 1.4s, 902774 effective words/s\n", "2025-03-27 14:20:15,858 : INFO : EPOCH 4 - PROGRESS: at 71.51% examples, 882074 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:16,235 : INFO : EPOCH 4: training on 1788017 raw words (1241770 effective words) took 1.4s, 896748 effective words/s\n", "2025-03-27 14:20:16,236 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6209429 effective words) took 7.0s, 891022 effective words/s', 'datetime': '2025-03-27T14:20:16.236387', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:16,236 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:16.236633', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:16,240 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:16,304 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #18: {'train_data': '10MB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 7.767791589101155, 'train_time_std': 0.10568085894457598}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:20:16,486 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:16,487 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:16,534 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:16.534551', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:16,535 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:16.535011', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:16,577 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:16,580 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 
14:20:16,580 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:16.580978', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:16,587 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:20:16,847 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:20:16,910 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:20:16,911 : INFO : resetting layer weights\n", "2025-03-27 14:20:16,918 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:16.918096', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:16,918 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:20:16,918 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:16.918601', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:17,993 : INFO : EPOCH 0 - PROGRESS: at 76.54% examples, 943390 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:18,316 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 1.3s, 931950 effective words/s\n", "2025-03-27 14:20:19,386 : INFO : EPOCH 1 - PROGRESS: at 75.98% examples, 939100 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:19,708 : INFO : EPOCH 1: training on 1788017 raw words (1241621 effective words) took 1.3s, 934820 effective words/s\n", "2025-03-27 14:20:20,721 : INFO : EPOCH 2 - PROGRESS: at 70.39% examples, 870504 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:21,115 : INFO : EPOCH 2: training on 1788017 raw words (1242229 effective words) took 1.4s, 888447 effective words/s\n", "2025-03-27 14:20:22,125 : INFO : EPOCH 3 - PROGRESS: at 72.63% examples, 901411 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:22,495 : INFO : EPOCH 3: training on 1788017 raw words (1242394 effective words) took 1.4s, 905562 effective words/s\n", "2025-03-27 14:20:23,563 : INFO : EPOCH 4 - PROGRESS: at 76.54% examples, 952091 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:23,874 : INFO : EPOCH 4: training on 1788017 raw words (1242564 effective words) took 1.3s, 947046 effective words/s\n", "2025-03-27 14:20:23,874 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210255 effective words) took 7.0s, 892792 effective words/s', 'datetime': '2025-03-27T14:20:23.874702', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:23,874 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:23.874959', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:23,879 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:23,943 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:20:24,130 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:24,131 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:24,174 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:24.174799', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:24,175 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:24.175234', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:24,218 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:24,220 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:20:24,220 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:24.220409', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 
(clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:24,225 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:20:24,549 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:20:24,611 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:20:24,612 : INFO : resetting layer weights\n", "2025-03-27 14:20:24,619 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:24.619421', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:24,619 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:20:24,619 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:24.619964', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:25,640 : INFO : EPOCH 0 - PROGRESS: at 71.51% examples, 878285 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:20:26,016 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 1.4s, 894407 effective words/s\n", "2025-03-27 14:20:27,025 : INFO : EPOCH 1 - PROGRESS: at 71.51% examples, 888768 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:20:27,414 : INFO : EPOCH 1: training on 1788017 raw words (1242966 effective words) took 1.4s, 894323 effective words/s\n", "2025-03-27 14:20:28,427 : INFO : EPOCH 2 - PROGRESS: at 72.07% examples, 891779 words/s, 
in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:28,804 : INFO : EPOCH 2: training on 1788017 raw words (1242465 effective words) took 1.4s, 899439 effective words/s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:20:29,814 : INFO : EPOCH 3 - PROGRESS: at 70.95% examples, 880699 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:30,205 : INFO : EPOCH 3: training on 1788017 raw words (1242391 effective words) took 1.4s, 892229 effective words/s\n", "2025-03-27 14:20:31,218 : INFO : EPOCH 4 - PROGRESS: at 69.83% examples, 863486 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:20:31,624 : INFO : EPOCH 4: training on 1788017 raw words (1242515 effective words) took 1.4s, 880880 effective words/s\n", "2025-03-27 14:20:31,624 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211830 effective words) took 7.0s, 886790 effective words/s', 'datetime': '2025-03-27T14:20:31.624913', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:31,625 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:31.625143', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:31,628 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:31,692 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:20:31,875 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:31,875 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:31,921 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:31.921690', 'gensim': 
'4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:31,922 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:31.922110', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:31,965 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:31,967 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:20:31,967 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:31.967826', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:31,972 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:20:32,228 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:20:32,291 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:20:32,292 : INFO : resetting layer weights\n", "2025-03-27 14:20:32,298 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:32.298847', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:32,299 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:20:32,299 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:32.299428', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:33,372 : INFO : EPOCH 0 - PROGRESS: at 76.54% examples, 945487 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:33,696 : INFO : EPOCH 0: training on 1788017 raw words (1241493 effective words) took 1.3s, 931962 effective words/s\n", "2025-03-27 14:20:34,716 : INFO : EPOCH 1 - PROGRESS: at 73.18% examples, 900479 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:35,076 : INFO : EPOCH 1: training on 1788017 raw words (1242857 effective words) took 1.4s, 906717 effective words/s\n", "2025-03-27 14:20:36,089 : INFO : EPOCH 2 - PROGRESS: at 71.51% examples, 885375 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:36,465 : INFO : EPOCH 2: training on 1788017 raw words (1242586 effective words) took 1.4s, 899904 effective words/s\n", "2025-03-27 14:20:37,543 : INFO : EPOCH 3 - PROGRESS: at 74.86% examples, 922150 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:37,867 : INFO : EPOCH 3: training on 1788017 raw words (1242391 effective words) took 1.3s, 931478 effective words/s\n", "2025-03-27 14:20:38,880 : INFO : EPOCH 4 - PROGRESS: at 70.39% examples, 870743 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:20:39,290 : INFO : EPOCH 4: training on 1788017 raw words (1242515 effective words) took 1.4s, 878375 effective words/s\n", "2025-03-27 14:20:39,291 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211842 effective words) took 7.0s, 888473 effective words/s', 'datetime': '2025-03-27T14:20:39.291133', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 
16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:39,291 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:39.291370', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:39,295 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:39,358 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #19: {'train_data': '10MB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 7.6848710378011065, 'train_time_std': 0.04688368226648027}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:20:39,543 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:39,544 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:39,590 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:39.590675', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:39,591 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:39.591104', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:39,636 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:39,639 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 
14:20:39,639 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:39.639480', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:39,705 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:20:39,705 : INFO : resetting layer weights\n", "2025-03-27 14:20:39,711 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:39.711555', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:39,711 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:39.711964', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:40,728 : INFO : EPOCH 0 - PROGRESS: at 37.43% examples, 467615 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:41,733 : INFO : EPOCH 0 - PROGRESS: at 79.33% examples, 490302 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:42,235 : INFO : EPOCH 0: training on 1788017 raw words (1242524 effective words) took 2.5s, 494105 effective words/s\n", "2025-03-27 14:20:43,246 : INFO : EPOCH 1 - PROGRESS: at 37.99% examples, 476926 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:44,253 : INFO : EPOCH 1 - PROGRESS: at 80.45% examples, 498135 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:44,727 : INFO : EPOCH 1: training on 1788017 raw words (1242992 effective 
words) took 2.5s, 500277 effective words/s\n", "2025-03-27 14:20:45,816 : INFO : EPOCH 2 - PROGRESS: at 40.78% examples, 501766 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:46,821 : INFO : EPOCH 2 - PROGRESS: at 83.24% examples, 510392 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:47,231 : INFO : EPOCH 2: training on 1788017 raw words (1241482 effective words) took 2.4s, 509267 effective words/s\n", "2025-03-27 14:20:48,264 : INFO : EPOCH 3 - PROGRESS: at 35.75% examples, 439991 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:49,266 : INFO : EPOCH 3 - PROGRESS: at 76.54% examples, 470409 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:49,832 : INFO : EPOCH 3: training on 1788017 raw words (1242989 effective words) took 2.6s, 479684 effective words/s\n", "2025-03-27 14:20:50,844 : INFO : EPOCH 4 - PROGRESS: at 35.75% examples, 448303 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:51,846 : INFO : EPOCH 4 - PROGRESS: at 77.09% examples, 478541 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:52,456 : INFO : EPOCH 4: training on 1788017 raw words (1242012 effective words) took 2.6s, 474853 effective words/s\n", "2025-03-27 14:20:52,457 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211999 effective words) took 12.7s, 487409 effective words/s', 'datetime': '2025-03-27T14:20:52.457156', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:52,457 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:20:52.457422', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:20:52,460 : INFO : collecting all words and their counts\n", "2025-03-27 14:20:52,529 : INFO : PROGRESS: at sentence #0, processed 
0 words, keeping 0 word types\n", "2025-03-27 14:20:52,723 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:20:52,724 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:20:52,770 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:20:52.770170', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:52,770 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:20:52.770638', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:52,814 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:20:52,816 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:20:52,816 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:20:52.816968', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:20:52,886 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:20:52,886 : INFO : resetting layer weights\n", "2025-03-27 14:20:52,893 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:20:52.893368', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 
'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:20:52,893 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:20:52.893748', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:20:53,903 : INFO : EPOCH 0 - PROGRESS: at 34.64% examples, 436387 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:54,913 : INFO : EPOCH 0 - PROGRESS: at 73.74% examples, 456498 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:55,596 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 2.7s, 461574 effective words/s\n", "2025-03-27 14:20:56,679 : INFO : EPOCH 1 - PROGRESS: at 37.99% examples, 471254 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:57,680 : INFO : EPOCH 1 - PROGRESS: at 74.86% examples, 462127 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:20:58,355 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 2.7s, 461687 effective words/s\n", "2025-03-27 14:20:59,388 : INFO : EPOCH 2 - PROGRESS: at 35.75% examples, 439228 words/s, in_qsize 5, out_qsize 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:21:00,399 : INFO : EPOCH 2 - PROGRESS: at 76.54% examples, 468108 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:01,018 : INFO : EPOCH 2: training on 1788017 raw words (1242458 effective words) took 2.7s, 468111 effective words/s\n", "2025-03-27 14:21:02,029 : INFO : EPOCH 3 - PROGRESS: at 35.20% examples, 442624 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:03,043 : INFO : EPOCH 3 - PROGRESS: at 76.54% examples, 472871 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:21:03,601 : INFO : EPOCH 3: training on 1788017 raw words (1242659 effective 
words) took 2.6s, 482680 effective words/s\n", "2025-03-27 14:21:04,613 : INFO : EPOCH 4 - PROGRESS: at 37.43% examples, 469315 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:05,620 : INFO : EPOCH 4 - PROGRESS: at 77.65% examples, 480692 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:06,165 : INFO : EPOCH 4: training on 1788017 raw words (1242362 effective words) took 2.6s, 486060 effective words/s\n", "2025-03-27 14:21:06,166 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6212746 effective words) took 13.3s, 468087 effective words/s', 'datetime': '2025-03-27T14:21:06.166504', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:06,166 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:21:06.166724', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:21:06,169 : INFO : collecting all words and their counts\n", "2025-03-27 14:21:06,234 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:21:06,412 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:21:06,413 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:21:06,457 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:21:06.457098', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:06,457 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, 
drops 84301)', 'datetime': '2025-03-27T14:21:06.457469', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:06,497 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:21:06,500 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:21:06,500 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:21:06.500513', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:06,566 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:21:06,566 : INFO : resetting layer weights\n", "2025-03-27 14:21:06,572 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:21:06.572631', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:21:06,572 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:21:06.572905', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:07,588 : INFO : EPOCH 0 - PROGRESS: at 37.43% examples, 467351 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:08,600 : INFO : EPOCH 0 - PROGRESS: at 78.21% examples, 481809 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:09,180 : INFO : EPOCH 0: 
training on 1788017 raw words (1241447 effective words) took 2.6s, 477530 effective words/s\n", "2025-03-27 14:21:10,195 : INFO : EPOCH 1 - PROGRESS: at 37.99% examples, 475539 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:11,199 : INFO : EPOCH 1 - PROGRESS: at 79.89% examples, 494299 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:11,735 : INFO : EPOCH 1: training on 1788017 raw words (1242045 effective words) took 2.5s, 487829 effective words/s\n", "2025-03-27 14:21:12,806 : INFO : EPOCH 2 - PROGRESS: at 40.22% examples, 504571 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:13,806 : INFO : EPOCH 2 - PROGRESS: at 82.12% examples, 509572 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:14,245 : INFO : EPOCH 2: training on 1788017 raw words (1242434 effective words) took 2.4s, 508671 effective words/s\n", "2025-03-27 14:21:15,322 : INFO : EPOCH 3 - PROGRESS: at 37.43% examples, 465975 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:16,334 : INFO : EPOCH 3 - PROGRESS: at 78.21% examples, 481105 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:16,863 : INFO : EPOCH 3: training on 1788017 raw words (1241664 effective words) took 2.6s, 486904 effective words/s\n", "2025-03-27 14:21:17,879 : INFO : EPOCH 4 - PROGRESS: at 37.99% examples, 474989 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:18,890 : INFO : EPOCH 4 - PROGRESS: at 77.65% examples, 478825 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:19,455 : INFO : EPOCH 4: training on 1788017 raw words (1242591 effective words) took 2.6s, 480933 effective words/s\n", "2025-03-27 14:21:19,456 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210181 effective words) took 12.9s, 482034 effective words/s', 'datetime': '2025-03-27T14:21:19.456224', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 
14:21:19,456 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:21:19.456453', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:21:19,458 : INFO : collecting all words and their counts\n", "2025-03-27 14:21:19,524 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #20: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 13.387713034947714, 'train_time_std': 0.23221319096515025}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:21:19,704 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:21:19,704 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:21:19,748 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:21:19.748657', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:19,748 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:21:19.748985', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:19,789 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:21:19,792 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:21:19,792 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 
word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:21:19.792379', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:19,857 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:21:19,858 : INFO : resetting layer weights\n", "2025-03-27 14:21:19,864 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:21:19.864097', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:21:19,864 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:21:19.864377', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:20,903 : INFO : EPOCH 0 - PROGRESS: at 36.87% examples, 450488 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:21,919 : INFO : EPOCH 0 - PROGRESS: at 73.18% examples, 444579 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:22,684 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 2.8s, 441624 effective words/s\n", "2025-03-27 14:21:23,700 : INFO : EPOCH 1 - PROGRESS: at 37.99% examples, 474836 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:24,702 : INFO : EPOCH 1 - PROGRESS: at 80.45% examples, 497906 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:25,172 : INFO : EPOCH 1: training on 1788017 raw words (1242045 effective words) took 2.5s, 500791 effective words/s\n", "2025-03-27 14:21:26,181 : INFO : EPOCH 2 - PROGRESS: at 
37.99% examples, 478385 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:27,184 : INFO : EPOCH 2 - PROGRESS: at 80.45% examples, 500075 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:27,660 : INFO : EPOCH 2: training on 1788017 raw words (1242786 effective words) took 2.5s, 501285 effective words/s\n", "2025-03-27 14:21:28,674 : INFO : EPOCH 3 - PROGRESS: at 36.31% examples, 454572 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:29,675 : INFO : EPOCH 3 - PROGRESS: at 76.54% examples, 474916 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:30,238 : INFO : EPOCH 3: training on 1788017 raw words (1242773 effective words) took 2.6s, 483636 effective words/s\n", "2025-03-27 14:21:31,256 : INFO : EPOCH 4 - PROGRESS: at 34.08% examples, 425030 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:32,267 : INFO : EPOCH 4 - PROGRESS: at 74.30% examples, 457382 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:32,897 : INFO : EPOCH 4: training on 1788017 raw words (1242769 effective words) took 2.7s, 468785 effective words/s\n", "2025-03-27 14:21:32,897 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211820 effective words) took 13.0s, 476605 effective words/s', 'datetime': '2025-03-27T14:21:32.897908', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:32,898 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:21:32.898163', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:21:32,899 : INFO : collecting all words and their counts\n", "2025-03-27 14:21:32,966 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:21:33,173 : INFO : collected 73167 word types from a 
corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:21:33,174 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:21:33,218 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:21:33.218171', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:33,218 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:21:33.218609', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:33,259 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:21:33,261 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:21:33,261 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:21:33.261906', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:33,326 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:21:33,327 : INFO : resetting layer weights\n", "2025-03-27 14:21:33,333 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:21:33.333482', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:21:33,333 : INFO : Word2Vec lifecycle event 
{'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:21:33.333772', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:34,411 : INFO : EPOCH 0 - PROGRESS: at 37.99% examples, 474109 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:35,414 : INFO : EPOCH 0 - PROGRESS: at 74.86% examples, 462905 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:36,030 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 2.6s, 472908 effective words/s\n", "2025-03-27 14:21:37,040 : INFO : EPOCH 1 - PROGRESS: at 37.99% examples, 477448 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:38,053 : INFO : EPOCH 1 - PROGRESS: at 79.33% examples, 490126 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:38,647 : INFO : EPOCH 1: training on 1788017 raw words (1242483 effective words) took 2.6s, 476274 effective words/s\n", "2025-03-27 14:21:39,718 : INFO : EPOCH 2 - PROGRESS: at 34.64% examples, 435034 words/s, in_qsize 5, out_qsize 0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:21:40,722 : INFO : EPOCH 2 - PROGRESS: at 76.54% examples, 474509 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:41,288 : INFO : EPOCH 2: training on 1788017 raw words (1241607 effective words) took 2.6s, 482611 effective words/s\n", "2025-03-27 14:21:42,361 : INFO : EPOCH 3 - PROGRESS: at 39.11% examples, 490200 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:21:43,375 : INFO : EPOCH 3 - PROGRESS: at 78.21% examples, 481950 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:43,930 : INFO : EPOCH 3: training on 1788017 raw words (1242530 effective words) took 2.6s, 482588 effective words/s\n", "2025-03-27 14:21:44,953 : INFO : EPOCH 4 - PROGRESS: at 37.43% examples, 
463825 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:45,961 : INFO : EPOCH 4 - PROGRESS: at 78.77% examples, 484527 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:46,480 : INFO : EPOCH 4: training on 1788017 raw words (1242091 effective words) took 2.5s, 488735 effective words/s\n", "2025-03-27 14:21:46,481 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211665 effective words) took 13.1s, 472471 effective words/s', 'datetime': '2025-03-27T14:21:46.481039', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:46,481 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:21:46.481255', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:21:46,482 : INFO : collecting all words and their counts\n", "2025-03-27 14:21:46,548 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:21:46,728 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:21:46,728 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:21:46,772 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:21:46.772794', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:46,773 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:21:46.773165', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 
03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:46,814 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:21:46,816 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:21:46,816 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:21:46.816398', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:46,880 : INFO : estimated required memory for 20167 words and 100 dimensions: 26217100 bytes\n", "2025-03-27 14:21:46,880 : INFO : resetting layer weights\n", "2025-03-27 14:21:46,887 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:21:46.887353', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:21:46,887 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:21:46.887649', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:47,976 : INFO : EPOCH 0 - PROGRESS: at 40.78% examples, 503211 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:48,986 : INFO : EPOCH 0 - PROGRESS: at 79.89% examples, 489488 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:49,502 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 2.5s, 487864 effective words/s\n", "2025-03-27 14:21:50,579 
: INFO : EPOCH 1 - PROGRESS: at 40.78% examples, 508745 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:21:51,582 : INFO : EPOCH 1 - PROGRESS: at 83.24% examples, 514709 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:52,001 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 2.4s, 511109 effective words/s\n", "2025-03-27 14:21:53,086 : INFO : EPOCH 2 - PROGRESS: at 40.78% examples, 504113 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:54,097 : INFO : EPOCH 2 - PROGRESS: at 83.80% examples, 513482 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:54,490 : INFO : EPOCH 2: training on 1788017 raw words (1241885 effective words) took 2.4s, 512810 effective words/s\n", "2025-03-27 14:21:55,524 : INFO : EPOCH 3 - PROGRESS: at 39.11% examples, 480273 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:56,534 : INFO : EPOCH 3 - PROGRESS: at 82.12% examples, 501594 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:56,969 : INFO : EPOCH 3: training on 1788017 raw words (1242417 effective words) took 2.5s, 502780 effective words/s\n", "2025-03-27 14:21:57,978 : INFO : EPOCH 4 - PROGRESS: at 37.99% examples, 477682 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:58,981 : INFO : EPOCH 4 - PROGRESS: at 77.09% examples, 478946 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:21:59,558 : INFO : EPOCH 4: training on 1788017 raw words (1241899 effective words) took 2.6s, 481345 effective words/s\n", "2025-03-27 14:21:59,558 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211468 effective words) took 12.7s, 490221 effective words/s', 'datetime': '2025-03-27T14:21:59.558472', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:21:59,558 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:21:59.558749', 'gensim': '4.3.3', 
'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:21:59,560 : INFO : collecting all words and their counts\n", "2025-03-27 14:21:59,627 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #21: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 13.367362340291342, 'train_time_std': 0.21309599541612576}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:21:59,807 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:21:59,807 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:21:59,851 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:21:59.851860', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:59,852 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:21:59.852268', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:59,893 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:21:59,894 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:21:59,895 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:21:59.895322', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 
11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:21:59,901 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:22:00,230 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:22:00,292 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:22:00,292 : INFO : resetting layer weights\n", "2025-03-27 14:22:00,299 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:22:00.299574', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:22:00,300 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:22:00,300 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:22:00.300246', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:22:01,309 : INFO : EPOCH 0 - PROGRESS: at 16.76% examples, 212756 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:02,314 : INFO : EPOCH 0 - PROGRESS: at 35.20% examples, 220941 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:03,347 : INFO : EPOCH 0 - PROGRESS: at 52.51% examples, 216307 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:04,380 : INFO : EPOCH 0 - PROGRESS: at 70.95% examples, 216654 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:05,382 : INFO : EPOCH 0 - PROGRESS: at 88.83% examples, 217864 words/s, in_qsize 6, 
out_qsize 0\n", "2025-03-27 14:22:06,005 : INFO : EPOCH 0: training on 1788017 raw words (1242982 effective words) took 5.7s, 218179 effective words/s\n", "2025-03-27 14:22:07,088 : INFO : EPOCH 1 - PROGRESS: at 17.32% examples, 204627 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:08,093 : INFO : EPOCH 1 - PROGRESS: at 35.75% examples, 216434 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:09,165 : INFO : EPOCH 1 - PROGRESS: at 52.51% examples, 208606 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:10,212 : INFO : EPOCH 1 - PROGRESS: at 70.95% examples, 210008 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:11,214 : INFO : EPOCH 1 - PROGRESS: at 88.27% examples, 211222 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:11,883 : INFO : EPOCH 1: training on 1788017 raw words (1242552 effective words) took 5.9s, 211678 effective words/s\n", "2025-03-27 14:22:12,916 : INFO : EPOCH 2 - PROGRESS: at 15.64% examples, 193892 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:13,924 : INFO : EPOCH 2 - PROGRESS: at 34.64% examples, 214454 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:14,984 : INFO : EPOCH 2 - PROGRESS: at 54.19% examples, 218786 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:22:16,002 : INFO : EPOCH 2 - PROGRESS: at 73.74% examples, 222888 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:17,044 : INFO : EPOCH 2 - PROGRESS: at 92.74% examples, 223718 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:17,420 : INFO : EPOCH 2: training on 1788017 raw words (1241603 effective words) took 5.5s, 224579 effective words/s\n", "2025-03-27 14:22:18,464 : INFO : EPOCH 3 - PROGRESS: at 15.64% examples, 191662 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:19,478 : INFO : EPOCH 3 - PROGRESS: at 32.96% examples, 202937 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:20,544 : INFO : EPOCH 3 - PROGRESS: at 52.51% examples, 210943 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:21,546 : INFO 
: EPOCH 3 - PROGRESS: at 72.07% examples, 217611 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:22,580 : INFO : EPOCH 3 - PROGRESS: at 89.39% examples, 215890 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:23,174 : INFO : EPOCH 3: training on 1788017 raw words (1242417 effective words) took 5.7s, 216249 effective words/s\n", "2025-03-27 14:22:24,234 : INFO : EPOCH 4 - PROGRESS: at 15.64% examples, 188292 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:25,288 : INFO : EPOCH 4 - PROGRESS: at 34.08% examples, 203667 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:26,327 : INFO : EPOCH 4 - PROGRESS: at 52.51% examples, 208781 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:27,380 : INFO : EPOCH 4 - PROGRESS: at 70.95% examples, 209982 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:28,391 : INFO : EPOCH 4 - PROGRESS: at 88.83% examples, 212123 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:29,087 : INFO : EPOCH 4: training on 1788017 raw words (1242479 effective words) took 5.9s, 210388 effective words/s\n", "2025-03-27 14:22:29,088 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6212033 effective words) took 28.8s, 215782 effective words/s', 'datetime': '2025-03-27T14:22:29.088544', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:22:29,088 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:22:29.088811', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:22:29,091 : INFO : collecting all words and their counts\n", "2025-03-27 14:22:29,156 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:22:29,336 : INFO : collected 73167 word types from 
a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:22:29,336 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:22:29,382 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:22:29.382032', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:29,382 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:22:29.382533', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:29,423 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:22:29,425 : INFO : sample=0.001 downsamples 38 most-common words\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:22:29,425 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:22:29.425987', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:29,430 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:22:29,686 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:22:29,749 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:22:29,749 : INFO : resetting layer weights\n", "2025-03-27 14:22:29,756 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:22:29.756842', 'gensim': 
'4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:22:29,757 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. \n", "2025-03-27 14:22:29,757 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:22:29.757381', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:22:30,857 : INFO : EPOCH 0 - PROGRESS: at 17.32% examples, 212898 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:31,859 : INFO : EPOCH 0 - PROGRESS: at 36.31% examples, 224622 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:32,918 : INFO : EPOCH 0 - PROGRESS: at 55.87% examples, 225521 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:33,928 : INFO : EPOCH 0 - PROGRESS: at 75.42% examples, 228766 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:34,974 : INFO : EPOCH 0 - PROGRESS: at 94.41% examples, 228102 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:35,258 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 5.4s, 228691 effective words/s\n", "2025-03-27 14:22:36,267 : INFO : EPOCH 1 - PROGRESS: at 16.76% examples, 212735 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:37,273 : INFO : EPOCH 1 - PROGRESS: at 35.20% examples, 220834 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:38,298 : INFO : EPOCH 1 - PROGRESS: at 52.51% examples, 216800 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:39,308 : INFO : EPOCH 1 - PROGRESS: at 69.83% examples, 214685 words/s, in_qsize 6, out_qsize 0\n", 
"2025-03-27 14:22:40,335 : INFO : EPOCH 1 - PROGRESS: at 87.71% examples, 215294 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:41,032 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 5.8s, 215481 effective words/s\n", "2025-03-27 14:22:42,061 : INFO : EPOCH 2 - PROGRESS: at 15.64% examples, 194291 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:43,068 : INFO : EPOCH 2 - PROGRESS: at 32.40% examples, 201414 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:44,124 : INFO : EPOCH 2 - PROGRESS: at 50.84% examples, 206380 words/s, in_qsize 6, out_qsize 1\n", "2025-03-27 14:22:45,139 : INFO : EPOCH 2 - PROGRESS: at 68.72% examples, 208141 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:46,189 : INFO : EPOCH 2 - PROGRESS: at 87.15% examples, 210585 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:46,911 : INFO : EPOCH 2: training on 1788017 raw words (1242242 effective words) took 5.9s, 211585 effective words/s\n", "2025-03-27 14:22:48,012 : INFO : EPOCH 3 - PROGRESS: at 17.32% examples, 212992 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:49,025 : INFO : EPOCH 3 - PROGRESS: at 34.64% examples, 213254 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:50,031 : INFO : EPOCH 3 - PROGRESS: at 53.63% examples, 219764 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:51,055 : INFO : EPOCH 3 - PROGRESS: at 72.63% examples, 221533 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:22:52,072 : INFO : EPOCH 3 - PROGRESS: at 91.06% examples, 222518 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:52,533 : INFO : EPOCH 3: training on 1788017 raw words (1242739 effective words) took 5.6s, 223710 effective words/s\n", "2025-03-27 14:22:53,626 : INFO : EPOCH 4 - PROGRESS: at 17.32% examples, 213953 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:54,674 : INFO : EPOCH 4 - PROGRESS: at 34.08% examples, 206905 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:55,743 : INFO : EPOCH 4 - 
PROGRESS: at 52.51% examples, 208930 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:22:56,753 : INFO : EPOCH 4 - PROGRESS: at 70.39% examples, 210546 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:57,776 : INFO : EPOCH 4 - PROGRESS: at 87.15% examples, 209398 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:22:58,575 : INFO : EPOCH 4: training on 1788017 raw words (1242497 effective words) took 6.0s, 207863 effective words/s\n", "2025-03-27 14:22:58,576 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6212745 effective words) took 28.8s, 215574 effective words/s', 'datetime': '2025-03-27T14:22:58.576174', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:22:58,576 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:22:58.576394', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:22:58,579 : INFO : collecting all words and their counts\n", "2025-03-27 14:22:58,644 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:22:58,848 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:22:58,849 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:22:58,900 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:22:58.900091', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:58,900 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus 
(95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:22:58.900613', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:58,946 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:22:58,949 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:22:58,949 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:22:58.949432', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:22:58,956 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:22:59,225 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:22:59,291 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:22:59,291 : INFO : resetting layer weights\n", "2025-03-27 14:22:59,298 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:22:59.298391', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:22:59,298 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:22:59,299 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:22:59.299026', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:23:00,332 : INFO : EPOCH 0 - PROGRESS: at 15.64% examples, 193590 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:01,362 : INFO : EPOCH 0 - PROGRESS: at 34.08% examples, 209166 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:02,365 : INFO : EPOCH 0 - PROGRESS: at 52.51% examples, 215035 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:03,374 : INFO : EPOCH 0 - PROGRESS: at 72.07% examples, 220419 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:04,419 : INFO : EPOCH 0 - PROGRESS: at 91.06% examples, 221782 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:04,881 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 5.6s, 222967 effective words/s\n", "2025-03-27 14:23:05,970 : INFO : EPOCH 1 - PROGRESS: at 17.32% examples, 203356 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:06,981 : INFO : EPOCH 1 - PROGRESS: at 36.31% examples, 218432 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:07,988 : INFO : EPOCH 1 - PROGRESS: at 55.31% examples, 222844 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:09,002 : INFO : EPOCH 1 - PROGRESS: at 72.07% examples, 217886 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:10,011 : INFO : EPOCH 1 - PROGRESS: at 90.50% examples, 219950 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:10,530 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 5.6s, 220249 effective words/s\n", "2025-03-27 14:23:11,618 : INFO : EPOCH 2 - PROGRESS: 
at 17.32% examples, 215086 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:12,620 : INFO : EPOCH 2 - PROGRESS: at 34.64% examples, 215388 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:13,623 : INFO : EPOCH 2 - PROGRESS: at 51.40% examples, 212593 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:14,632 : INFO : EPOCH 2 - PROGRESS: at 70.39% examples, 216723 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:15,648 : INFO : EPOCH 2 - PROGRESS: at 88.83% examples, 218681 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:16,262 : INFO : EPOCH 2: training on 1788017 raw words (1241885 effective words) took 5.7s, 219204 effective words/s\n", "2025-03-27 14:23:17,350 : INFO : EPOCH 3 - PROGRESS: at 17.32% examples, 203542 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:18,407 : INFO : EPOCH 3 - PROGRESS: at 35.75% examples, 210582 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:19,445 : INFO : EPOCH 3 - PROGRESS: at 52.51% examples, 206976 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:20,466 : INFO : EPOCH 3 - PROGRESS: at 70.39% examples, 208498 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:21,496 : INFO : EPOCH 3 - PROGRESS: at 86.03% examples, 204819 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:22,382 : INFO : EPOCH 3: training on 1788017 raw words (1242427 effective words) took 6.1s, 203295 effective words/s\n", "2025-03-27 14:23:23,467 : INFO : EPOCH 4 - PROGRESS: at 15.64% examples, 194817 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:24,469 : INFO : EPOCH 4 - PROGRESS: at 32.96% examples, 205702 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:25,511 : INFO : EPOCH 4 - PROGRESS: at 50.84% examples, 208163 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:26,545 : INFO : EPOCH 4 - PROGRESS: at 69.27% examples, 210100 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:27,602 : INFO : EPOCH 4 - PROGRESS: at 87.71% examples, 211727 words/s, in_qsize 5, out_qsize 0\n", 
"2025-03-27 14:23:28,277 : INFO : EPOCH 4: training on 1788017 raw words (1242420 effective words) took 5.8s, 213101 effective words/s\n", "2025-03-27 14:23:28,278 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211999 effective words) took 29.0s, 214357 effective words/s', 'datetime': '2025-03-27T14:23:28.278416', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:23:28,278 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:23:28.278634', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:23:28,282 : INFO : collecting all words and their counts\n", "2025-03-27 14:23:28,347 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #22: {'train_data': '10MB', 'compute_loss': True, 'sg': 1, 'hs': 1, 'train_time_mean': 29.573920567830402, 'train_time_std': 0.09235685986903587}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:23:28,543 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:23:28,543 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:23:28,594 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:23:28.594829', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:28,595 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, 
drops 84301)', 'datetime': '2025-03-27T14:23:28.595323', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:28,642 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 14:23:28,645 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:23:28,645 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:23:28.645667', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:28,650 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:23:28,975 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:23:29,038 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:23:29,038 : INFO : resetting layer weights\n", "2025-03-27 14:23:29,045 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:23:29.045236', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:23:29,045 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:23:29,045 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:23:29.045847', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:23:30,132 : INFO : EPOCH 0 - PROGRESS: at 17.32% examples, 204007 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:31,139 : INFO : EPOCH 0 - PROGRESS: at 36.31% examples, 219279 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:32,195 : INFO : EPOCH 0 - PROGRESS: at 55.87% examples, 222168 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:33,201 : INFO : EPOCH 0 - PROGRESS: at 74.86% examples, 224657 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:34,221 : INFO : EPOCH 0 - PROGRESS: at 93.85% examples, 225989 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:34,562 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 5.5s, 225663 effective words/s\n", "2025-03-27 14:23:35,642 : INFO : EPOCH 1 - PROGRESS: at 17.32% examples, 205079 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:36,647 : INFO : EPOCH 1 - PROGRESS: at 36.31% examples, 219886 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:37,705 : INFO : EPOCH 1 - PROGRESS: at 55.87% examples, 222445 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:38,744 : INFO : EPOCH 1 - PROGRESS: at 74.30% examples, 221401 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:39,795 : INFO : EPOCH 1 - PROGRESS: at 92.74% examples, 220796 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:40,178 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 5.6s, 221538 effective words/s\n", "2025-03-27 14:23:41,261 : INFO : EPOCH 2 - PROGRESS: at 17.32% examples, 215973 words/s, in_qsize 6, out_qsize 
0\n", "2025-03-27 14:23:42,273 : INFO : EPOCH 2 - PROGRESS: at 36.31% examples, 224987 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:43,338 : INFO : EPOCH 2 - PROGRESS: at 55.87% examples, 225300 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:44,343 : INFO : EPOCH 2 - PROGRESS: at 74.86% examples, 227082 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:45,345 : INFO : EPOCH 2 - PROGRESS: at 91.62% examples, 223356 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:45,839 : INFO : EPOCH 2: training on 1788017 raw words (1241885 effective words) took 5.6s, 221943 effective words/s\n", "2025-03-27 14:23:46,935 : INFO : EPOCH 3 - PROGRESS: at 15.64% examples, 191522 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:47,959 : INFO : EPOCH 3 - PROGRESS: at 34.08% examples, 208525 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:48,995 : INFO : EPOCH 3 - PROGRESS: at 50.84% examples, 205920 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:50,009 : INFO : EPOCH 3 - PROGRESS: at 70.39% examples, 212855 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:51,056 : INFO : EPOCH 3 - PROGRESS: at 87.71% examples, 211642 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:51,767 : INFO : EPOCH 3: training on 1788017 raw words (1241878 effective words) took 5.9s, 211720 effective words/s\n", "2025-03-27 14:23:52,797 : INFO : EPOCH 4 - PROGRESS: at 15.64% examples, 194282 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:23:53,814 : INFO : EPOCH 4 - PROGRESS: at 34.08% examples, 210635 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:54,816 : INFO : EPOCH 4 - PROGRESS: at 52.51% examples, 216218 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:55,831 : INFO : EPOCH 4 - PROGRESS: at 70.39% examples, 215664 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:56,843 : INFO : EPOCH 4 - PROGRESS: at 87.71% examples, 215287 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:23:57,515 : INFO : EPOCH 4: training on 1788017 
raw words (1242104 effective words) took 5.7s, 216408 effective words/s\n", "2025-03-27 14:23:57,515 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211134 effective words) took 28.5s, 218164 effective words/s', 'datetime': '2025-03-27T14:23:57.515810', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:23:57,516 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:23:57.516042', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:23:57,520 : INFO : collecting all words and their counts\n", "2025-03-27 14:23:57,584 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:23:57,762 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:23:57,763 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:23:57,808 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:23:57.808180', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:57,808 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:23:57.808609', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:57,853 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 
14:23:57,856 : INFO : sample=0.001 downsamples 38 most-common words\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:23:57,856 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:23:57.856809', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:23:57,861 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:23:58,118 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:23:58,179 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:23:58,180 : INFO : resetting layer weights\n", "2025-03-27 14:23:58,186 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:23:58.186951', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:23:58,187 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n", "2025-03-27 14:23:58,187 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:23:58.187545', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:23:59,288 : INFO : EPOCH 0 - PROGRESS: at 15.64% examples, 191771 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:00,291 : INFO : EPOCH 0 - PROGRESS: at 32.40% examples, 200629 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:01,340 : INFO : EPOCH 0 - PROGRESS: at 49.16% examples, 199931 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:02,369 : INFO : EPOCH 0 - PROGRESS: at 65.92% examples, 199042 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:03,390 : INFO : EPOCH 0 - PROGRESS: at 82.68% examples, 200281 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:04,354 : INFO : EPOCH 0: training on 1788017 raw words (1242954 effective words) took 6.1s, 203718 effective words/s\n", "2025-03-27 14:24:05,385 : INFO : EPOCH 1 - PROGRESS: at 15.64% examples, 193874 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:06,393 : INFO : EPOCH 1 - PROGRESS: at 34.64% examples, 214739 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:07,456 : INFO : EPOCH 1 - PROGRESS: at 54.19% examples, 218728 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:08,464 : INFO : EPOCH 1 - PROGRESS: at 73.18% examples, 221753 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:09,464 : INFO : EPOCH 1 - PROGRESS: at 89.94% examples, 219393 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:10,014 : INFO : EPOCH 1: training on 1788017 raw words (1242313 effective words) took 5.7s, 219779 effective words/s\n", "2025-03-27 14:24:11,103 : INFO : EPOCH 2 - PROGRESS: at 17.32% examples, 214683 words/s, in_qsize 6, out_qsize 
0\n", "2025-03-27 14:24:12,107 : INFO : EPOCH 2 - PROGRESS: at 36.31% examples, 225205 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:13,159 : INFO : EPOCH 2 - PROGRESS: at 55.87% examples, 226355 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:14,169 : INFO : EPOCH 2 - PROGRESS: at 75.42% examples, 229377 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:15,213 : INFO : EPOCH 2 - PROGRESS: at 94.41% examples, 228653 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:15,500 : INFO : EPOCH 2: training on 1788017 raw words (1241885 effective words) took 5.4s, 229095 effective words/s\n", "2025-03-27 14:24:16,594 : INFO : EPOCH 3 - PROGRESS: at 17.32% examples, 214048 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:17,605 : INFO : EPOCH 3 - PROGRESS: at 36.87% examples, 227656 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:18,605 : INFO : EPOCH 3 - PROGRESS: at 54.19% examples, 222850 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:19,631 : INFO : EPOCH 3 - PROGRESS: at 72.63% examples, 222214 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:20,632 : INFO : EPOCH 3 - PROGRESS: at 91.06% examples, 223718 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:21,105 : INFO : EPOCH 3: training on 1788017 raw words (1242427 effective words) took 5.5s, 224337 effective words/s\n", "2025-03-27 14:24:22,181 : INFO : EPOCH 4 - PROGRESS: at 17.32% examples, 205864 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:23,205 : INFO : EPOCH 4 - PROGRESS: at 36.31% examples, 218292 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:24,280 : INFO : EPOCH 4 - PROGRESS: at 54.19% examples, 213706 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:25,306 : INFO : EPOCH 4 - PROGRESS: at 73.18% examples, 216894 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:26,340 : INFO : EPOCH 4 - PROGRESS: at 91.06% examples, 216769 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:26,803 : INFO : EPOCH 4: training on 1788017 
raw words (1242005 effective words) took 5.7s, 218309 effective words/s\n", "2025-03-27 14:24:26,803 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6211584 effective words) took 28.6s, 217065 effective words/s', 'datetime': '2025-03-27T14:24:26.803628', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:24:26,803 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:24:26.803883', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n", "2025-03-27 14:24:26,808 : INFO : collecting all words and their counts\n", "2025-03-27 14:24:26,873 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2025-03-27 14:24:27,051 : INFO : collected 73167 word types from a corpus of 1788017 raw words and 179 sentences\n", "2025-03-27 14:24:27,051 : INFO : Creating a fresh vocabulary\n", "2025-03-27 14:24:27,095 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 20167 unique words (27.56% of original 73167, drops 53000)', 'datetime': '2025-03-27T14:24:27.095541', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:24:27,096 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1703716 word corpus (95.29% of original 1788017, drops 84301)', 'datetime': '2025-03-27T14:24:27.096039', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:24:27,137 : INFO : deleting the raw counts dictionary of 73167 items\n", "2025-03-27 
14:24:27,139 : INFO : sample=0.001 downsamples 38 most-common words\n", "2025-03-27 14:24:27,140 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 1242287.3013176506 word corpus (72.9%% of prior 1703716)', 'datetime': '2025-03-27T14:24:27.140079', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'prepare_vocab'}\n", "2025-03-27 14:24:27,145 : INFO : constructing a huffman tree from 20167 words\n", "2025-03-27 14:24:27,476 : INFO : built huffman tree with maximum node depth 18\n", "2025-03-27 14:24:27,537 : INFO : estimated required memory for 20167 words and 100 dimensions: 38317300 bytes\n", "2025-03-27 14:24:27,538 : INFO : resetting layer weights\n", "2025-03-27 14:24:27,544 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-03-27T14:24:27.544972', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'build_vocab'}\n", "2025-03-27 14:24:27,545 : WARNING : Both hierarchical softmax and negative sampling are activated. This is probably a mistake. You should set either 'hs=0' or 'negative=0' to disable one of them. 
\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-03-27 14:24:27,545 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 20167 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-03-27T14:24:27.545476', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:24:28,685 : INFO : EPOCH 0 - PROGRESS: at 17.32% examples, 204337 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:29,732 : INFO : EPOCH 0 - PROGRESS: at 35.75% examples, 212112 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:30,802 : INFO : EPOCH 0 - PROGRESS: at 54.19% examples, 212146 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:31,847 : INFO : EPOCH 0 - PROGRESS: at 72.63% examples, 213075 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:32,888 : INFO : EPOCH 0 - PROGRESS: at 89.39% examples, 210613 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:33,460 : INFO : EPOCH 0: training on 1788017 raw words (1241447 effective words) took 5.9s, 212198 effective words/s\n", "2025-03-27 14:24:34,530 : INFO : EPOCH 1 - PROGRESS: at 15.64% examples, 197883 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:35,534 : INFO : EPOCH 1 - PROGRESS: at 33.52% examples, 210361 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:36,543 : INFO : EPOCH 1 - PROGRESS: at 49.72% examples, 206729 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:37,543 : INFO : EPOCH 1 - PROGRESS: at 68.16% examples, 210629 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:38,558 : INFO : EPOCH 1 - PROGRESS: at 85.47% examples, 211176 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:39,426 : INFO : EPOCH 1: training on 1788017 raw words (1242045 effective words) took 5.9s, 210499 effective words/s\n", "2025-03-27 14:24:40,515 : INFO : EPOCH 2 - PROGRESS: 
at 17.32% examples, 203589 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:41,521 : INFO : EPOCH 2 - PROGRESS: at 35.75% examples, 215809 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:42,585 : INFO : EPOCH 2 - PROGRESS: at 54.19% examples, 214991 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:43,588 : INFO : EPOCH 2 - PROGRESS: at 72.07% examples, 215764 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:44,597 : INFO : EPOCH 2 - PROGRESS: at 88.27% examples, 212761 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:45,250 : INFO : EPOCH 2: training on 1788017 raw words (1242434 effective words) took 5.8s, 213619 effective words/s\n", "2025-03-27 14:24:46,379 : INFO : EPOCH 3 - PROGRESS: at 17.32% examples, 207081 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:47,380 : INFO : EPOCH 3 - PROGRESS: at 35.75% examples, 217856 words/s, in_qsize 6, out_qsize 0\n", "2025-03-27 14:24:48,387 : INFO : EPOCH 3 - PROGRESS: at 54.19% examples, 220326 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:49,400 : INFO : EPOCH 3 - PROGRESS: at 70.95% examples, 215779 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:50,453 : INFO : EPOCH 3 - PROGRESS: at 89.39% examples, 216415 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:51,022 : INFO : EPOCH 3: training on 1788017 raw words (1241608 effective words) took 5.7s, 217640 effective words/s\n", "2025-03-27 14:24:52,126 : INFO : EPOCH 4 - PROGRESS: at 15.64% examples, 191154 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:53,142 : INFO : EPOCH 4 - PROGRESS: at 32.40% examples, 199016 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:54,143 : INFO : EPOCH 4 - PROGRESS: at 49.72% examples, 204345 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:55,150 : INFO : EPOCH 4 - PROGRESS: at 67.04% examples, 205138 words/s, in_qsize 5, out_qsize 0\n", "2025-03-27 14:24:56,197 : INFO : EPOCH 4 - PROGRESS: at 85.47% examples, 208219 words/s, in_qsize 5, out_qsize 0\n", 
"2025-03-27 14:24:57,031 : INFO : EPOCH 4: training on 1788017 raw words (1243257 effective words) took 5.9s, 209163 effective words/s\n", "2025-03-27 14:24:57,032 : INFO : Word2Vec lifecycle event {'msg': 'training on 8940085 raw words (6210791 effective words) took 29.5s, 210630 effective words/s', 'datetime': '2025-03-27T14:24:57.032068', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'train'}\n", "2025-03-27 14:24:57,032 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec', 'datetime': '2025-03-27T14:24:57.032289', 'gensim': '4.3.3', 'python': '3.9.6 (default, Nov 11 2024, 03:15:38) \\n[Clang 16.0.0 (clang-1600.0.26.6)]', 'platform': 'macOS-14.7.4-arm64-arm-64bit', 'event': 'created'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Word2vec model #23: {'train_data': '10MB', 'compute_loss': False, 'sg': 1, 'hs': 1, 'train_time_mean': 29.58455030123393, 'train_time_std': 0.45526659500085614}\n", " train_data compute_loss sg hs train_time_mean train_time_std\n", "4 25kB True 1 0 0.366939 0.011132\n", "5 25kB False 1 0 0.348893 0.003732\n", "6 25kB True 1 1 0.704012 0.008201\n", "7 25kB False 1 1 0.708051 0.013650\n", "0 25kB True 0 0 0.144582 0.004779\n", "1 25kB False 0 0 0.131805 0.002030\n", "2 25kB True 0 1 0.257878 0.028562\n", "3 25kB False 0 1 0.243136 0.004160\n", "12 1MB True 1 0 1.116595 0.018047\n", "13 1MB False 1 0 1.110381 0.016720\n", "14 1MB True 1 1 2.602241 0.074949\n", "15 1MB False 1 1 2.574062 0.083930\n", "8 1MB True 0 0 0.375568 0.005800\n", "9 1MB False 0 0 0.383377 0.008469\n", "10 1MB True 0 1 0.698642 0.011645\n", "11 1MB False 0 1 0.705466 0.001836\n", "20 10MB True 1 0 13.387713 0.232213\n", "21 10MB False 1 0 13.367362 0.213096\n", "22 10MB True 1 1 29.573921 0.092357\n", "23 10MB False 1 1 29.584550 0.455267\n", "16 10MB True 0 0 4.119073 0.012592\n", "17 10MB False 0 0 3.860674 0.060309\n", "18 10MB 
True 0 1 7.767792 0.105681\n", "19 10MB False 0 1 7.684871 0.046884\n" ] } ], "source": [ "# Temporarily reduce logging verbosity\n", "logging.root.level = logging.ERROR\n", "\n", "import time\n", "import numpy as np\n", "import pandas as pd\n", "\n", "train_time_values = []\n", "seed_val = 42\n", "sg_values = [0, 1]\n", "hs_values = [0, 1]\n", "\n", "fast = True\n", "if fast:\n", "    input_data_subset = input_data[:3]\n", "else:\n", "    input_data_subset = input_data\n", "\n", "\n", "for data in input_data_subset:\n", "    for sg_val in sg_values:\n", "        for hs_val in hs_values:\n", "            for loss_flag in [True, False]:\n", "                time_taken_list = []\n", "                for i in range(3):\n", "                    start_time = time.time()\n", "                    w2v_model = gensim.models.Word2Vec(\n", "                        data,\n", "                        compute_loss=loss_flag,\n", "                        sg=sg_val,\n", "                        hs=hs_val,\n", "                        seed=seed_val,\n", "                    )\n", "                    time_taken_list.append(time.time() - start_time)\n", "\n", "                time_taken_list = np.array(time_taken_list)\n", "                time_mean = np.mean(time_taken_list)\n", "                time_std = np.std(time_taken_list)\n", "\n", "                model_result = {\n", "                    'train_data': data.name,\n", "                    'compute_loss': loss_flag,\n", "                    'sg': sg_val,\n", "                    'hs': hs_val,\n", "                    'train_time_mean': time_mean,\n", "                    'train_time_std': time_std,\n", "                }\n", "                print(\"Word2vec model #%i: %s\" % (len(train_time_values), model_result))\n", "                train_time_values.append(model_result)\n", "\n", "train_times_table = pd.DataFrame(train_time_values)\n", "train_times_table = train_times_table.sort_values(\n", "    by=['train_data', 'sg', 'hs', 'compute_loss'],\n", "    ascending=[False, False, True, False],\n", ")\n", "print(train_times_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualising Word Embeddings\n", "---------------------------\n", "\n", "The word embeddings learned by the model can be visualised by reducing\n", "the dimensionality of the word vectors to two dimensions using t-SNE.\n", "\n", "Visualisations can be used to notice semantic and syntactic trends in the data.\n", "\n", 
"Example:\n", "\n", "* Semantic: words like cat, dog, cow, etc. have a tendency to lie close by\n", "* Syntactic: words like run, running or cut, cutting lie close together.\n", "\n", "Vector relations like vKing - vMan = vQueen - vWoman can also be noticed.\n", "\n", ".. Important::\n", " The model used for the visualisation is trained on a small corpus. Thus\n", " some of the relations might not be so clear.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import IncrementalPCA # inital reduction\n", "from sklearn.manifold import TSNE # final reduction\n", "import numpy as np # array handling\n", "\n", "\n", "def reduce_dimensions(model):\n", " num_dimensions = 2 # final num dimensions (2D, 3D, etc)\n", "\n", " # extract the words & their vectors, as numpy arrays\n", " vectors = np.asarray(model.wv.vectors)\n", " labels = np.asarray(model.wv.index_to_key) # fixed-width numpy strings\n", "\n", " # reduce using t-SNE\n", " tsne = TSNE(n_components=num_dimensions, random_state=0)\n", " vectors = tsne.fit_transform(vectors)\n", "\n", " x_vals = [v[0] for v in vectors]\n", " y_vals = [v[1] for v in vectors]\n", " return x_vals, y_vals, labels\n", "\n", "\n", "x_vals, y_vals, labels = reduce_dimensions(model)\n", "\n", "def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):\n", " from plotly.offline import init_notebook_mode, iplot, plot\n", " import plotly.graph_objs as go\n", "\n", " trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)\n", " data = [trace]\n", "\n", " if plot_in_notebook:\n", " init_notebook_mode(connected=True)\n", " iplot(data, filename='word-embedding-plot')\n", " else:\n", " plot(data, filename='word-embedding-plot.html')\n", "\n", "\n", "def plot_with_matplotlib(x_vals, y_vals, labels):\n", " import matplotlib.pyplot as plt\n", " import random\n", "\n", " random.seed(0)\n", "\n", " plt.figure(figsize=(12, 12))\n", " plt.scatter(x_vals, y_vals)\n", "\n", " #\n", " # Label randomly subsampled 50 data points\n", " #\n", " indices = list(range(len(labels)))\n", " selected_indices = random.sample(indices, 50)\n", " for i in selected_indices:\n", " plt.annotate(labels[i], (x_vals[i], y_vals[i]))\n", "\n", "try:\n", " get_ipython()\n", "except Exception:\n", " plot_function = plot_with_matplotlib\n", "else:\n", " plot_function = plot_with_matplotlib\n", "\n", "plot_function(x_vals, 
y_vals, labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n", "----------\n", "\n", "In this tutorial we learned how to train word2vec models on your own data\n", "and how to evaluate them. We hope that you too will find this popular tool\n", "useful in your machine learning tasks!\n", "\n", "Links\n", "-----\n", "\n", "- API docs: :py:mod:`gensim.models.word2vec`\n", "- `Original C toolkit and word2vec papers by Google `_.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 1 }