Question

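I have a space-separated text file in the format: [string] [string] [int] [int]. I am trying to extract the second column of this file, which contains words separated by "_", and find the most frequently occurring word (across all lines of the text file). So far I have the following Scala code:

val wc = file .map(l => l.split(" ")) .map{ case Array(a,b,c,d) => (b,1) } .map{ case (k,v) => k.split("_") }

Displaying a single entry of this RDD with wc.first() gives the format Array[String] = Array(word1, word2,...)

The next step for me is to use the map function to extract (key, value) pairs from the above data structure so that each word becomes a key. Then I can reduce this output to find the word with the most occurrences. What can I use to accomplish this step? Is there a better approach?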

Answer 1

Apart from the sorting step, your problem is the classic WordCount problem in big data. Here is an example:

val result =  file
        // splitting each line by space, followed by selecting the second column 
        // and splitting the second column text by "_" character
        .flatMap(line => line.split("\\s+")(1).split("_"))
        // now each line of the rdd is a single word, 
        // so we map each word to the (key, value) pair of (word, 1) 
        .map(word => (word, 1))
        // finally we reduce the (key, value) pairs by key and sum the values
        .reduceByKey((num1 , num2) => num1 + num2)
        // as you need the most common word, we sort rdd descending by values 
        .sortBy(_._2, false)
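
The main change from the attempt in the question is flatMap in place of map: map keeps one array of words per line, while flatMap flattens them into a single collection of words. Below is a minimal sketch of that difference using plain Scala collections instead of an RDD. The `lines` value is made-up sample data, and `List` only stands in for the RDD here, so treat it as an illustration rather than Spark code.

```scala
object FlatMapVsMap {
  // Hypothetical sample lines in the [string] [string] [int] [int] format
  val lines: List[String] = List("I hello_hi_world 1 1", "you hi_world 1 1")

  // map: one Array[String] per line, like the wc.first() result in the question
  val nested: List[Array[String]] =
    lines.map(line => line.split("\\s+")(1).split("_"))

  // flatMap: a single flat list of words, one element per word
  val words: List[String] =
    lines.flatMap(line => line.split("\\s+")(1).split("_"))

  // Each word paired with 1, ready to be summed per key
  val pairs: List[(String, Int)] = words.map(word => (word, 1))

  def main(args: Array[String]): Unit = {
    println(nested.map(_.toList)) // nested: one inner list per input line
    println(words)                // flat: all words in one list
    println(pairs)
  }
}
```

With the nested form, a further map would still see whole arrays, which is why the (word, 1) pairs could not be built directly in the question's version.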

Assuming the input text file contains:

I hello_hi_world 1 1
you hi_world 1 1
he hi 2 3
she world_hi 4 5

then

println(result.take(3).toList)

will print the first three records of the result RDD, which are the most frequent words in the second column of the input file:

List((hi,4), (world,3), (hello,1))
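
If only the single most frequent word is needed, a full sort is not strictly required. The sketch below reproduces the counting steps on the four sample lines with plain Scala collections (no Spark) and picks the maximum with maxBy. The object name `TopWordsLocal` is hypothetical. On an RDD, `top` or `takeOrdered` with an explicit `Ordering` could play a similar role, though that variant is not shown or tested here.

```scala
object TopWordsLocal {
  // The four sample lines from the answer above
  val lines: Seq[String] = Seq(
    "I hello_hi_world 1 1",
    "you hi_world 1 1",
    "he hi 2 3",
    "she world_hi 4 5"
  )

  // Same steps as the RDD pipeline, on a local Seq:
  // split on whitespace, take column 2, split on "_", count per word
  val counts: Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+")(1).split("_"))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Only the most frequent word is needed, so maxBy replaces the sort
  val mostCommon: (String, Int) = counts.maxBy(_._2)

  def main(args: Array[String]): Unit = {
    println(mostCommon) // (hi,4)
  }
}
```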

Spark: Find the most common value from a column of a text file
