I have read a lot of posts on this, but had no luck. So far I have tried .split() and regex.
Note: I am running this code on repl.it.
import math
documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100 # length of doc
dp = 4
# -- Setup --
all_words = [] # all instances
for doc in documents:
    for s in doc:
        words = s.split()
        print(words)
        all_words.append(words)
all_words = sorted(all_words) # alphabeticalise
all_words = list(dict.fromkeys(all_words)) # remove duplicates
print('All Words')
print(all_words)
print()
print('Binary Scoring')
for doc in documents:
    scoring = []
    for word in all_words:
        if word in doc:
            scoring.append(1)
        else:
            scoring.append(0)
    print("\"" + doc + "\" = " + scoring)
    print()
Error:
['It', 'is', 'going', 'to', 'rain', 'today']
['Today', 'I', 'am', 'not', 'going', 'outside']
['I', 'am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
Traceback (most recent call last):
File "main.py", line 6, in <module>
import BagofWords
File "/home/runner/DeepLearning/BagofWords.py", line 21, in <module>
all_words = list(dict.fromkeys(all_words)) # remove duplicates
TypeError: unhashable type: 'list'
Answer 0 (score: 0)
The split itself seems to work fine:
for doc in documents:
    words = doc[0].split(' ')
    print(words)
The rest of your code has problems, though.
Here is a corrected version:
import re
import math
documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100 # length of doc
dp = 4
# -- Setup --
all_words = [] # all instances
for doc in documents:
    words = doc[0].split(' ')
    print(words)
    all_words.append(words) # all_words is a list of word lists
print('All Words')
print(all_words)
print()
print('Binary Scoring')
for doc in documents:
    # count how many words of the first document occur in this document
    scoring = 0
    for word in all_words[0]:
        if word in doc[0]: # substring test on the sentence string
            scoring += 1
    print("\"" + doc[0] + "\" = " + str(scoring))
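One caveat worth checking with `word in doc[0]`: on a string, `in` tests for a substring, not for a whole word. A minimal sketch of the difference follows; the `sentence` variable is only there for illustration.

```python
sentence = "It is going to rain today"

# On a string, `in` is a substring test: "I" occurs inside "It".
substring_hit = "I" in sentence
# On the list of tokens, `in` compares whole words.
token_hit = "I" in sentence.split()

print(substring_hit, token_hit)
```

If whole-word matching is what you want, testing against `sentence.split()` is probably the safer choice.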
Answer 1 (score: 0)
Complete working code:
import math
import itertools
documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"]
]
docs = 1000
words_per_doc = 100 # length of doc
dp = 4
# -- Setup --
all_words = [] # all instances
for doc in documents:
    for s in doc:
        words = s.split()
        print(words)
        all_words.append(words)
# flatten the list of word lists into one list of strings
all_words = list(itertools.chain.from_iterable(all_words))
all_words = sorted(all_words) # alphabeticalise
all_words = list(dict.fromkeys(all_words)) # remove duplicates
print(all_words, "\n")
print('Binary Scoring')
for doc in documents:
    scoring = []
    doc_words = doc[0].split() # compare whole words, not substrings
    for word in all_words:
        if word in doc_words:
            scoring.append(1)
        else:
            scoring.append(0)
    # doc is a list and scoring is a list, so convert before concatenating
    print("\"" + doc[0] + "\" = " + str(scoring))
See my other answer for the explanation.
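For context on the `TypeError` in the question: `list.append` adds each word list as a single element, so `all_words` ends up as a list of lists, and `dict.fromkeys` cannot use a list as a key because lists are unhashable. Here is a minimal sketch of the difference, using a shortened `documents` list and the illustrative names `nested` and `flat`:

```python
documents = [["It is going to rain today"], ["Today I am not going outside"]]

nested = []  # built with append: a list of word lists
flat = []    # built with extend: a flat list of strings
for doc in documents:
    for s in doc:
        nested.append(s.split())
        flat.extend(s.split())

try:
    dict.fromkeys(nested)  # each key would be a list, which is unhashable
    error = None
except TypeError as exc:
    error = str(exc)

# Strings are hashable, so de-duplicating the flat list works.
unique_sorted = sorted(dict.fromkeys(flat))
print(error)
print(unique_sorted)
```

Using `extend` in the loop is an alternative to flattening afterwards with `itertools.chain.from_iterable`; either should give a flat list of strings.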
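Answer 2 (score: -1)
You have a list of lists of strings, so you have to iterate over each inner list to get at the strings (I am assuming the inner lists can be of any length).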
for doc in documents:
    for s in doc:
        words = s.split()
        print(words)
This splits each document into its words and prints them.
Output:
['It', 'is', 'going', 'to', 'rain', 'today']
['Today', 'I', 'am', 'not', 'going', 'outside']
['I', 'am', 'going', 'to', 'watch', 'the', 'season', 'premiere']
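Building on that loop, the binary scoring from the question could be sketched roughly as follows. The function name `binary_bag_of_words` is hypothetical, and the sketch assumes case-sensitive matching on whitespace-separated tokens (so `Today` and `today` count as different words).

```python
def binary_bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of lists of strings."""
    # Tokenize every string of every document into one word list per document.
    tokenized = [[word for s in doc for word in s.split()] for doc in documents]

    # Sorted list of the unique words across all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    vectors = []
    for tokens in tokenized:
        present = set(tokens)  # whole-word membership, not substring matching
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors


documents = [
    ["It is going to rain today"],
    ["Today I am not going outside"],
    ["I am going to watch the season premiere"],
]
vocabulary, vectors = binary_bag_of_words(documents)
print(vocabulary)
for doc, vector in zip(documents, vectors):
    print('"' + doc[0] + '" = ' + str(vector))
```

Each vector has one entry per vocabulary word, in the same sorted order as `vocabulary`.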