
Time: 2017-01-03 22:46:09

Tags: apache-spark pyspark apache-spark-sql apache-spark-ml


from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
  ("1", "doc_1", "fruit is good for you"),
  ("2", "doc_2", "you should eat fruit and veggies"),
  ("2", "doc_3", "kids eat fruit but not veggies")
]).toDF(["month","doc_id", "text"])
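
df.show()
+-----+------+--------------------+
|month|doc_id|                text|
+-----+------+--------------------+
|    1| doc_1|fruit is good for...|
|    2| doc_2|you should eat fr...|
|    2| doc_3|kids eat fruit bu...|
+-----+------+--------------------+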

I want to count words by month. So far I have taken the CountVectorizer approach:

tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
tokenized = tokenizer.transform(df)

cvModel = CountVectorizer().setInputCol("words").setOutputCol("features").fit(tokenized)
counted = cvModel.transform(tokenized)
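
counted.show()
+-----+------+--------------------+--------------------+--------------------+
|month|doc_id|                text|               words|            features|
+-----+------+--------------------+--------------------+--------------------+
|    1| doc_1|fruit is good for...|[fruit, is, good,...|(12,[0,3,4,7,8],[...|
|    2| doc_2|you should eat fr...|[you, should, eat...|(12,[0,1,2,3,9,11...|
|    2| doc_3|kids eat fruit bu...|[kids, eat, fruit...|(12,[0,1,2,5,6,10...|
+-----+------+--------------------+--------------------+--------------------+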


What I want is a result grouped by month, something like this:

month  word   count
1      fruit  1
1      is     1
2      fruit  2
2      kids   1
2      eat    2

1 Answer:

Answer 0 (score: 2)

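There is no built-in mechanism for Vector* aggregation, but you don't need one here. Once you have the tokenized data, you can simply explode and aggregate: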

from pyspark.sql.functions import explode

# One row per (month, word) token, then count the occurrences of each pair.
(tokenized
    .select("month", explode("words").alias("word"))
    .groupBy("month", "word")
    .count())


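If you prefer to limit the results to the fitted vocabulary, you can add a filter, for example:

from pyspark.sql.functions import col, explode

(tokenized
    .select("month", explode("words").alias("word"))
    .where(col("word").isin(cvModel.vocabulary))
    .groupBy("month", "word")
    .count())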
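
To make the explode-then-count logic concrete, here is a minimal plain-Python sketch of the same idea, run on the three sample documents from the question without Spark. The helper name `count_words_by_month` is made up for illustration, and the tokenization (lowercase, split on whitespace) is an approximation of what `Tokenizer` does, not its actual implementation.

```python
from collections import Counter

# Sample rows copied from the question: (month, doc_id, text).
rows = [
    ("1", "doc_1", "fruit is good for you"),
    ("2", "doc_2", "you should eat fruit and veggies"),
    ("2", "doc_3", "kids eat fruit but not veggies"),
]


def count_words_by_month(rows):
    """Hypothetical helper: count (month, word) pairs across all rows."""
    counts = Counter()
    for month, _doc_id, text in rows:
        # Approximate Tokenizer: lowercase the text and split on whitespace.
        for word in text.lower().split():
            # Each token contributes one (month, word) pair, which is what
            # explode followed by groupBy("month", "word").count() tallies.
            counts[(month, word)] += 1
    return counts


counts = count_words_by_month(rows)
for (month, word), n in sorted(counts.items()):
    print(month, word, n)
```

The pairs listed in the question's desired output, such as month 2 with "fruit" counted twice, appear among the printed rows; the Spark version produces the same tallies as a DataFrame.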

* Since Spark 2.4 we have access to Summarizer, but it is not useful here.
