Question

I have read a lot of posts, but with no luck.

So far I have tried .split() and regex.

Note: I am running this code on repl.it/.

import math

documents = [
  ["It is going to rain today"],
  ["Today I am not going outside"],
  ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100  # length of doc

dp = 4

# -- Setup --
all_words = []  # all instances
for doc in documents:
  for s in doc:
     words = s.split()
     print(words)
  all_words.append(words)
all_words = sorted(all_words)  # alphabeticalise
all_words = list(dict.fromkeys(all_words))  # remove duplicates

print('All Words')
print(all_words)
print()


print('Binary Scoring')
for doc in documents:
  scoring = []
  for word in all_words:
    if word in doc:
      scoring.append(1)
    else:
      scoring.append(0)
  print("\"" + doc + "\" = " + scoring)
print()

Error:

['It', 'is', 'going', 'to', 'rain', 'today']
['Today', 'I', 'am', 'not', 'going', 'outside']
['I', 'am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    import BagofWords
  File "/home/runner/DeepLearning/BagofWords.py", line 21, in <module>
    all_words = list(dict.fromkeys(all_words))  # remove duplicates
TypeError: unhashable type: 'list'

Answer 1

The split itself seems to work fine:

for doc in documents:
  words = doc[0].split(' ')
  print(words)

The rest of the code is wrong, though.

Here is the corrected code:

import re
import math

documents = [
  ["It is going to rain today"],
  ["Today I am not going outside"],
  ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100  # length of doc

dp = 4

# -- Setup --
all_words = []  # all instances
for doc in documents:
  words = doc[0].split(' ')
  print(words)
  all_words.append(words)  # one list of words per document

print('All Words')
print(all_words)
print()


print('Binary Scoring')
for doc in documents:
  scoring = 0
  # Count the words of the first document that occur in this document.
  # Note: `in` on a string is a substring test, not a whole-word test.
  for word in all_words[0]:
    if word in doc[0]:
      scoring = scoring + 1

  print("\"" + doc[0] + "\" = " + str(scoring))
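The code above prints a single count per document. If a 0/1 vector with one entry per vocabulary word is wanted instead, as the "Binary Scoring" heading in the question suggests, it could look roughly like the sketch below. The helper name `binary_scores` and the set-based vocabulary are assumptions made for illustration and are not part of the original post.

```python
documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"],
]

def binary_scores(documents, vocabulary):
    """Map each document string to a list of 0/1 flags, one per vocabulary word."""
    scores = {}
    for doc in documents:
        text = doc[0]
        # Compare against whole words so that "I" does not match inside "It".
        present = set(text.split())
        scores[text] = [1 if word in present else 0 for word in vocabulary]
    return scores

# Sorted list of the unique words across all documents.
vocabulary = sorted({word for doc in documents for word in doc[0].split()})

scores = binary_scores(documents, vocabulary)
for text, vector in scores.items():
    print(f'"{text}" = {vector}')
```

Matching is case-sensitive here, so "Today" and "today" count as different words, just as in the question's code.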

Answer 2

Complete working code:

import math
import itertools

documents = [
  ["It is going to rain today"],
  ["Today I am not going outside"],
  ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100  # length of doc

dp = 4

# -- Setup --
all_words = []  # all instances
for doc in documents:
  for s in doc:
     words = s.split()
     print(words)
     all_words.append(words)
all_words = list(itertools.chain.from_iterable(all_words))
all_words = sorted(all_words)  # alphabeticalise
all_words = list(dict.fromkeys(all_words))

print(all_words, "\n")

print('Binary Scoring')
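for doc in documents:
  scoring = []
  doc_words = doc[0].split()  # compare whole words, not substrings
  for word in all_words:
    if word in doc_words:
      scoring.append(1)
    else:
      scoring.append(0)
  # doc is a list and scoring is a list, so convert both before concatenating
  print("\"" + doc[0] + "\" = " + str(scoring))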

See my other answer for an explanation.
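To show why the flattening step matters, here is a small sketch that reproduces the TypeError from the question in isolation. The variable names `nested` and `flat` are made up for this example.

```python
nested = [["It", "is"], ["It", "is"]]  # what all_words.append(words) produces
flat = ["It", "is", "It", "is"]        # what the flattened list looks like

# Dictionary keys must be hashable, and lists are not, so this raises.
try:
    dict.fromkeys(nested)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Strings are hashable, so the same call works on the flat list
# and keeps the first occurrence of each word.
unique = list(dict.fromkeys(flat))
print(unique)
```

The exact wording of the error message can differ between Python versions, but it should mention the unhashable list type.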

Answer 3

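You have a list of lists of strings, so you have to iterate over each inner list to get at the strings (I am assuming the inner lists can be of any length).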

for doc in documents:
  for s in doc:
     words = s.split()
     print(words)

This splits each document into words and prints them.

Output:

['It', 'is', 'going', 'to', 'rain', 'today']
['Today', 'I', 'am', 'not', 'going', 'outside']
['I', 'am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
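Building on the loop above, one possible way to collect the words into a single flat list is to use `extend` instead of `append`, so that the later `dict.fromkeys` call only sees strings. This is a minimal sketch; the function name `build_vocabulary` is an assumption for illustration.

```python
documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"],
]

def build_vocabulary(documents):
    """Return the sorted list of unique words across all documents."""
    all_words = []
    for doc in documents:
        for s in doc:
            # extend adds each word individually; append would add the
            # whole list as a single (unhashable) element.
            all_words.extend(s.split())
    # Sort first, then drop duplicates while keeping that order.
    return list(dict.fromkeys(sorted(all_words)))

vocabulary = build_vocabulary(documents)
print(vocabulary)
```

Because the sort is case-sensitive, capitalised words such as "I", "It" and "Today" come before the lowercase ones.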

Extracting words from strings into a list | Python
