Question

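I'm trying to find word frequencies across multiple files, and ultimately return a list that contains, for each file, a list of word-frequency tuples.
For example: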

[[File1 word frequencies], [File2 word frequencies], ... [FileN word frequencies]]

I'm actually only extracting words from a subset of the lines in each file. This code works:

titles = sc.textFile(text_file) \
.glom() \
.map(lambda x: " ".join(x)) \
.flatMap(lambda x: x.split("PMID- ")) \
.map(lambda x: x[x.find('TI ')+5:]) \
.map(lambda x: x[:(x.find('  - ')-3)]) \
.map(lambda x: x.replace('.','').replace(',','').replace('?','').replace('-',' ').replace('   ',' ').replace('   ',' ').lower())

title_word_freq = titles.flatMap(lambda x: x.split()) \
.map(lambda x: (x,1)) \
.reduceByKey(lambda x,y:x+y) \
.map(lambda x:(x[1],x[0])) \
.sortByKey(ascending=False)

As expected, it returns the word-frequency list for the file text_file, or for all *.txt files in the directory that text_file specifies.

What I want to do, however, is run it independently on each file in the directory. I've tried a few approaches, such as:

t_files = sc.wholeTextFiles("PATH")
indiv_files = t_files.map(lambda x: x[1])
word_counts = indiv_files.map(getWordCounts)

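where getWordCounts is a function defined as the previous code block, modified in various ways (for example, with .glom() removed). I can't get it to work unless I include a transformation that combines all of the files.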

Any suggestions?
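
One possible direction, sketched under some assumptions: if getWordCounts calls RDD methods such as flatMap or reduceByKey, it cannot run inside indiv_files.map(...), because each element there is a plain Python string and Spark does not support nesting RDD operations inside a transformation. A workaround is to do the per-file counting in plain Python and let Spark distribute the work only across files. The names get_word_counts and word_counts_per_file below are hypothetical, the slicing logic is carried over from the code above, and ties are broken alphabetically (an arbitrary choice made for deterministic output).

```python
from collections import Counter


def get_word_counts(text):
    """Count title words in one file's full text using plain Python only.

    Mirrors the slicing logic from the question; no RDD calls are made,
    so it is safe to use inside a Spark transformation.
    """
    # wholeTextFiles yields the raw text, so join the lines the way
    # .glom() + " ".join(...) did in the original pipeline.
    joined = " ".join(text.splitlines())
    counts = Counter()
    for record in joined.split("PMID- "):
        title = record[record.find("TI ") + 5:]
        title = title[:title.find("  - ") - 3]
        for ch in ".,?":
            title = title.replace(ch, "")
        title = title.replace("-", " ").lower()
        counts.update(title.split())
    # (count, word) tuples, highest count first; ties sorted by word.
    return [(n, w) for w, n in sorted(counts.items(),
                                      key=lambda kv: (-kv[1], kv[0]))]


def word_counts_per_file(sc, path):
    """Return [(file_path, [(count, word), ...]), ...], one entry per file.

    Assumes sc is an existing SparkContext and path is a directory.
    """
    return sc.wholeTextFiles(path).mapValues(get_word_counts).collect()
```

Calling word_counts_per_file(sc, "PATH") keeps the file path next to each list; adding .values() before .collect() would give just the list of lists. Note that wholeTextFiles reads each file into memory as a single record, so this fits best when the individual files are reasonably small.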

Pyspark: independent word counts for many files

0 answers: