Question

I am using Spark 1.5.2 with the Java API. Is there a way to create a DataFrame that holds, for each document, all of its words and their counts in a single row?

So far I have been able to use org.apache.spark.sql.functions.explode to turn each word of a document's text into a new row.

I can then use the following code to create a new DataFrame with one row per document and word, together with the word count:

df = df.orderBy("doc_id").groupBy(df.col("doc_id"), df.col("word")).count();

Output:

+------+-----------+-----+
|doc_id|       word|count|
+------+-----------+-----+
|doc_1 |       game|    2|
|doc_1 |       life|    1|
|doc_1 |everlasting|    1|
|doc_1 |      learn|    1|
|doc_2 |    special|    1|
|doc_2 |     moment|    1|
|doc_2 |       time|    1|
|doc_3 | unexamined|    1|
|doc_3 |       life|    1|
|doc_3 |      worth|    1|
|doc_3 |       live|    1|
+------+-----------+-----+

How can I create a DataFrame in the following format instead:

+------+----------------------------------------+
|doc_id|word_counts                             |
+------+----------------------------------------+
|doc_1 |{game=2, life=1, everlasting=1, learn=1}|
|doc_2 |{special=1, moment=1, time=1}           |
+------+----------------------------------------+

Thanks. Any ideas are greatly appreciated.

Answer 1

You can drop down to the RDD and use aggregateByKey.
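The answer gives no code, so here is a minimal sketch of what it might look like in Scala, assuming the exploded DataFrame has string columns doc_id and word as in the question. The object name WordCountsPerDoc, the helpers addWord and mergeCounts, and the local[*] master are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object; names are chosen for illustration only.
object WordCountsPerDoc {

  // seqOp: fold one word into the per-document map of counts.
  def addWord(acc: Map[String, Long], word: String): Map[String, Long] =
    acc + (word -> (acc.getOrElse(word, 0L) + 1L))

  // combOp: merge two partial maps built on different partitions.
  def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (m, (w, c)) => m + (w -> (m.getOrElse(w, 0L) + c)) }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to make the sketch self-contained.
    val conf = new SparkConf().setAppName("word-counts-per-doc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // One row per word occurrence, i.e. the shape produced by explode.
    val exploded = sc.parallelize(Seq(
      ("doc_1", "game"), ("doc_1", "game"), ("doc_1", "life"),
      ("doc_2", "special"), ("doc_2", "moment"), ("doc_2", "time")
    )).toDF("doc_id", "word")

    // Drop to the RDD, key by doc_id, and aggregate the words into a map.
    val wordCounts = exploded.rdd
      .map(row => (row.getString(0), row.getString(1)))
      .aggregateByKey(Map.empty[String, Long])(addWord, mergeCounts)
      .toDF("doc_id", "word_counts")

    wordCounts.show(false)
    sc.stop()
  }
}
```

Because aggregateByKey builds the map directly from the word occurrences, the intermediate groupBy(...).count() step is not needed in this variant. The same method is available on JavaPairRDD for the Java API.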

Answer 2

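I would not use explode in the first place. If you start with one document per row, you can compute the counts directly, for example with ML transformers. A very simple example looks like this: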

import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.CountVectorizer

val df = sc.parallelize(Seq(
  ("doc_1", "game game life everlasting learn"),
  ("doc_2", "special moment time unexamined"),
  ("doc_3", "life worth live")
)).toDF("doc_id", "text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val tokenized = tokenizer.transform(df)

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(tokenized)

val counted = cvModel.transform(tokenized)

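At this point you already have the counts for each document. Keeping the tokens explicitly in every row is rather wasteful, but it can be done with a small UDF: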

import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.sql.functions.udf

def vectorsToMaps(vocabulary: Array[String]) = {
  udf((v: Vector) => {
    val sv = v.toSparse
    sv.indices.map(i => (vocabulary(i) -> sv(i))).toMap
  })
}

counted.select(vectorsToMaps(cvModel.vocabulary)($"features")
  .alias("freqs"))
  .show(3, false)

// +------------------------------------------------------------------+
// |freqs                                                             |
// +------------------------------------------------------------------+
// |Map(game -> 2.0, life -> 1.0, learn -> 1.0, everlasting -> 1.0)   |
// |Map(moment -> 1.0, special -> 1.0, unexamined -> 1.0, time -> 1.0)|
// |Map(life -> 1.0, live -> 1.0, worth -> 1.0)                       |
// +------------------------------------------------------------------+

Spark DataFrame word count per document, single row per document
