Question

I'm still new to MySQL, Gensim, and Word2Vec, and I'm learning how to use them through a personal project.

My data comes from web scraping, so it is not hard-coded. (I use an Instagram account to collect hashtag data from multiple posts, so my data consists of Instagram hashtags.)

I'm trying to use that data in the code below:

import pymysql.cursors
import re
from gensim.models import Word2Vec

# Connect to the database
connection = pymysql.connect(host=secrets[0],
                             user=username,
                             password=password,
                             db='test',
                             charset='charsetExample',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    # Open a cursor on the database connection
    with connection.cursor() as cursor:
        # 'caption' is the column, 'posts' is the table
        cursor.execute("SELECT caption FROM posts LIMIT 1000")
        data = cursor.fetchall()

    # List of lowercase captions, one per post
    captions = [d['caption'].lower() for d in data]
    # One list of hashtags per caption, without the leading '#'
    hashtags = [re.findall(r"#([A-Za-z_0-9]+)", caption) for caption in captions]
    # Drop captions that contain no hashtags
    hashtags = [tags for tags in hashtags if tags]

    # Word2Vec expects a list of token lists, not a list of raw strings
    model = Word2Vec(hashtags, min_count=1)
    res = model.wv.most_similar("fitness")

    print(captions)
    print(res)

finally:
    connection.close()

This is the part I'm working on and am not sure how to do:

res = model.wv.most_similar("fitness")

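For now, I'm using the most_similar() method to see how it works. What I want to do is pass my own data as the argument in most_similar("value"), that is, each hashtag I get by scraping the Instagram website.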

Thanks!

Answer 1

OK, so you have to train the word2vec model yourself. All you need to do is make sure your hashtags are lowercase and do not include the # sign.

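Now group the hashtags by post. If a post has the hashtags #red, #Wine, #party, turn it into a list that looks like this: [red, wine, party]. Repeat this for every post and append each post's list to a new list. The output should be a list of lists: [[red, wine, party], [post_2_hashtags], ...]. You can now feed this into the word2vec model and train it with the following lines of code: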

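import gensim

model = gensim.models.Word2Vec(
    documents,
    vector_size=150,  # this argument is named `size` in gensim versions before 4.0
    window=10,
    min_count=2,
    workers=10)
# The constructor has already trained on `documents`; this call runs 10 further epochs
model.train(documents, total_examples=len(documents), epochs=10)
model.save("word2vec.model")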
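As a rough sketch of how the list of lists passed to Word2Vec could be built from raw captions: the regular expression is the one from the commented-out line in the question, while the helper name `captions_to_documents` and the sample captions are made up for illustration.

```python
import re

# Same pattern as in the question: letters, digits and underscores after '#'.
HASHTAG_RE = re.compile(r"#([A-Za-z_0-9]+)")


def captions_to_documents(captions):
    """Turn raw captions into one list of lowercase hashtags per post."""
    documents = []
    for caption in captions:
        # Lowercase first, then capture each tag without its leading '#'.
        tags = HASHTAG_RE.findall(caption.lower())
        if tags:  # skip posts that contain no hashtag at all
            documents.append(tags)
    return documents


# Made-up captions, for illustration only.
sample = ["Great night #red #Wine #party", "no tags here", "#Fitness #gym"]
print(captions_to_documents(sample))
```

Posts without hashtags are dropped here because an empty list gives the model nothing to learn from.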

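documents is the list of lists created in the previous step. You can then load the model with model = gensim.models.Word2Vec.load("word2vec.model"). The rest is the same: you still use the most_similar() method to get the most similar words (in this case, hashtags).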
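To call most_similar() for every scraped hashtag rather than one hard-coded word, a loop along these lines might work. This is only a sketch: `neighbours_for_each` is a hypothetical helper, and it assumes the object passed in behaves like gensim's `model.wv` (supports `in` and `most_similar(key, topn=...)`).

```python
def neighbours_for_each(wv, hashtags, topn=5):
    """Map each hashtag the model knows to its `topn` most similar hashtags."""
    results = {}
    for tag in hashtags:
        # Apply the same preprocessing that was used for training.
        tag = tag.lstrip("#").lower()
        # Tags missing from the vocabulary (for example, filtered out by
        # min_count) would raise KeyError in most_similar, so skip them.
        if tag in wv:
            results[tag] = wv.most_similar(tag, topn=topn)
    return results


# Usage, assuming gensim is installed and "word2vec.model" was saved earlier:
# from gensim.models import Word2Vec
# model = Word2Vec.load("word2vec.model")
# for tag, similar in neighbours_for_each(model.wv, ["#fitness", "#Wine"]).items():
#     print(tag, similar)
```

Each value in the returned dictionary is a list of (hashtag, similarity) pairs.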

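The only thing you need to know about is the vector size (the size parameter of Word2Vec, named vector_size in gensim 4.0 and later). You define it before training. If you have a lot of data, set it to a larger number; if you have little data, set it to a smaller number. This is something you have to work out yourself, because you are the only one who can see your data. Try different values for the size parameter and evaluate the model with the most_similar() method.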
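One possible way to run that experiment is to train one model per candidate size and compare the neighbours of a hashtag you know well. The sketch below assumes gensim 4.x (hence `vector_size`); the helper names, the `train` hook, and the candidate sizes in the usage comment are all illustrative choices, not recommendations.

```python
def train_word2vec(documents, vector_size):
    """Train a model and return its word vectors (assumes gensim 4.x)."""
    # Imported here so compare_sizes itself has no hard dependency on gensim.
    from gensim.models import Word2Vec
    model = Word2Vec(documents, vector_size=vector_size,
                     window=10, min_count=2, workers=4)
    return model.wv


def compare_sizes(documents, sizes, probe, train=train_word2vec, topn=5):
    """Return {size: neighbours of `probe`} so the lists can be compared by eye."""
    report = {}
    for size in sizes:
        wv = train(documents, size)
        # An empty list marks a probe tag that is not in the vocabulary.
        report[size] = wv.most_similar(probe, topn=topn) if probe in wv else []
    return report


# Usage, with `documents` being the list of hashtag lists described earlier:
# for size, neighbours in compare_sizes(documents, [50, 150, 300], "fitness").items():
#     print(size, neighbours)
```

Judging which neighbour list looks most sensible is still a manual step, since only you know what related hashtags should look like in your data.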

I hope this is clear enough :)

How to use data scraped from a website with Word2Vec in Gensim
