Counting words after grouping records

Date: 2018-04-19 14:16:04

Tags: pyspark


Note: although the answer provided works, it can be slow on larger datasets. Take a look at this for a faster solution.

I have a DataFrame containing labeled documents, for example:

df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])

What I want is to group the DataFrame by label and compute a simple word count within each group. My problem is that I am not sure how to do this in PySpark. As a first step, I split the text so that each document becomes a list of tokens:

from collections import Counter
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

def get_tokens(text):
    # Return the lowercased tokens of a document (empty list for null text).
    if text is None:
        return list()
    return text.lower().split()

udf_get_tokens = F.udf(get_tokens, ArrayType(StringType()))

df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .show()

which gives

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|[hello, how, are,...|
|    1|[hello, how, are,...|
|    2|[hello, are, you,...|
|    2|[hello, how, is, it]|
|    3|[hello, how, are,...|
|    3|[hello, how, are,...|
|    4|[hello, how, is, ...|
+-----+--------------------+

I know how to count words over the whole DataFrame, but I do not know how to proceed with groupby() and reduceByKey().

I was thinking about doing a partial count of the words in the DataFrame:

df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
    .toDF(schema=['label', 'text'])\
    .show()

which gives:

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|[[are,2], [hello,...|
|    1|[[are,1], [hello,...|
|    2|[[are,1], [hello,...|
|    2|[[how,1], [it,1],...|
|    3|[[are,1], [hello,...|
|    3|[[are,1], [hello,...|
|    4|[[you,1], [today,...|
+-----+--------------------+

But how do I aggregate these partial counts?
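For illustration, one way to finish this RDD approach would be to merge the per-document Counter objects per label with reduceByKey, since adding two Counters sums their counts. A minimal sketch, assuming the udf_get_tokens defined above:

# Adding two Counters sums their per-word counts; reduceByKey merges per label.
counts = df_.select('label', udf_get_tokens(F.col('text')).alias('tokens'))\
    .rdd\
    .map(lambda row: (row[0], Counter(row[1])))\
    .reduceByKey(lambda a, b: a + b)\
    .map(lambda kv: (kv[0], list(kv[1].items())))

counts.toDF(schema=['label', 'text']).show()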

1 answer:

Answer 0 (score: 1):

Rather than using a udf, you should use pyspark.ml.feature.Tokenizer to split the text. (Depending on what you are doing, you may also find StopWordsRemover useful; see the sketch after the tokenizer example below.)

For example:

from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokens = tokenizer.transform(df_)
tokens.show(truncate=False)
+-----+---------------------------+----------------------------------+
|label|text                       |tokens                            |
+-----+---------------------------+----------------------------------+
|1    |hello how are are you today|[hello, how, are, are, you, today]|
|1    |hello how are you          |[hello, how, are, you]            |
|2    |hello are you here         |[hello, are, you, here]           |
|2    |how is it                  |[how, is, it]                     |
|3    |hello how are you          |[hello, how, are, you]            |
|3    |hello how are you          |[hello, how, are, you]            |
|4    |hello how is it you today  |[hello, how, is, it, you, today]  |
+-----+---------------------------+----------------------------------+
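If you do want to filter stop words, a minimal sketch of chaining StopWordsRemover after the tokenizer (the "filtered" output column name is just illustrative):

from pyspark.ml.feature import StopWordsRemover

# Drops common English stop words from each token list.
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
filtered = remover.transform(tokens)
filtered.select("label", "filtered").show(truncate=False)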

Then you can explode() the tokens and do a groupBy() to get the count of each word:

import pyspark.sql.functions as f
token_counts = tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token").count()\
    .orderBy("label", "token")
token_counts.show(truncate=False, n=10)
+-----+-----+-----+
|label|token|count|
+-----+-----+-----+
|1    |are  |3    |
|1    |hello|2    |
|1    |how  |2    |
|1    |today|1    |
|1    |you  |2    |
|2    |are  |1    |
|2    |hello|1    |
|2    |here |1    |
|2    |how  |1    |
|2    |is   |1    |
+-----+-----+-----+
only showing top 10 rows

If you want all of the tokens and counts on one row per label, just do another groupBy() and combine the token and count columns into a single column with pyspark.sql.functions.struct(), collecting them with pyspark.sql.functions.collect_list():

tokens.select("label", f.explode("tokens").alias("token"))\
    .groupBy("label", "token")\
    .count()\
    .groupBy("label")\
    .agg(f.collect_list(f.struct(f.col("token"), f.col("count"))).alias("text"))\
    .orderBy("label")\
    .show(truncate=False)
+-----+----------------------------------------------------------------+
|label|text                                                            |
+-----+----------------------------------------------------------------+
|1    |[[hello,2], [how,2], [are,3], [today,1], [you,2]]               |
|2    |[[you,1], [hello,1], [here,1], [are,1], [it,1], [how,1], [is,1]]|
|3    |[[are,2], [you,2], [how,2], [hello,2]]                          |
|4    |[[today,1], [hello,1], [it,1], [you,1], [how,1], [is,1]]        |
+-----+----------------------------------------------------------------+
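As a side note, on Spark 2.4+ the same array of structs can be collapsed into a proper map column with pyspark.sql.functions.map_from_entries. A sketch, reusing the token_counts DataFrame from above:

# map_from_entries turns an array of (key, value) structs into a map,
# giving one {word: count} map per label (requires Spark 2.4+).
token_counts.groupBy("label")\
    .agg(f.map_from_entries(
        f.collect_list(f.struct("token", "count"))).alias("counts"))\
    .orderBy("label")\
    .show(truncate=False)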