Counting words after grouping records (part 2)

Date: 2018-04-20 11:02:22

Tags: apache-spark, pyspark

I already have an answer that does what I want to achieve, but the problem is how slow it is. The dataset is not very big: about 50 GB in total, and the affected part is probably only 5 to 10 GB of data. The following does what I need, but it is slow, and by slow I mean it ran for an hour without terminating.

from pyspark.ml.feature import Tokenizer
from pyspark.sql import functions as F

df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
tokens = tokenizer.transform(df_)

# token_counts: per-label token counts via explode (assumed; not shown above)
token_counts = tokens.withColumn('token', F.explode('tokens'))\
    .groupby('label', 'token')\
    .count()

token_counts.groupby('label')\
    .agg(F.collect_list(F.struct(F.col('token'), F.col('count'))).alias('text'))\
    .show(truncate=False)

This gives me the token counts for each label:

+-----+----------------------------------------------------------------+
|label|text                                                            |
+-----+----------------------------------------------------------------+
|3    |[[are,2], [how,2], [hello,2], [you,2]]                          |
|1    |[[today,1], [how,2], [are,3], [you,2], [hello,2]]               |
|4    |[[hello,1], [how,1], [is,1], [today,1], [you,1], [it,1]]        |
|2    |[[hello,1], [are,1], [you,1], [here,1], [is,1], [how,1], [it,1]]|
+-----+----------------------------------------------------------------+

However, I suspect this call to explode() is too expensive.

I don't know for sure, but it might be faster to count the tokens within each "document" first and then merge the results in a groupBy():

from collections import Counter

# udf_get_tokens (not shown here) is assumed to be a UDF that splits the text into tokens
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
    .toDF(schema=['label', 'text'])\
    .show()

which gives the per-document counts:

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|[[are,2], [hello,...|
|    1|[[are,1], [hello,...|
|    2|[[are,1], [hello,...|
|    2|[[how,1], [it,1],...|
|    3|[[are,1], [hello,...|
|    3|[[are,1], [hello,...|
|    4|[[you,1], [today,...|
+-----+--------------------+
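
The merge I have in mind would look something like this (just a rough sketch that reuses the tokens DataFrame produced by the Tokenizer above and adds the per-document Counter objects together with reduceByKey):

from collections import Counter

merged = (
    tokens                                               # output of the Tokenizer above
    .rdd
    .map(lambda row: (row.label, Counter(row.tokens)))   # per-document token counts
    .reduceByKey(lambda a, b: a + b)                      # Counter addition merges counts per label
    .map(lambda kv: (kv[0], list(kv[1].items())))
    .toDF(schema=['label', 'text'])
)
merged.show(truncate=False)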

Is there a way to merge these token counts in a more efficient way?

1 answer:

Answer 0 (score: 2)

If the groups defined by id are fairly large, the obvious target for improvement is the shuffle size: instead of shuffling the raw text, shuffle only the labels paired with compact vectors. First, vectorize the input:

from pyspark.ml.feature import CountVectorizer
from pyspark.ml import Pipeline

pipeline_model = Pipeline(stages=[
    Tokenizer(inputCol='text', outputCol='tokens'),
    CountVectorizer(inputCol='tokens', outputCol='vectors')
]).fit(df_)

df_vec = pipeline_model.transform(df_).select("label", "vectors")
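
The aggregation below expects a key-value RDD of (label, vector) pairs; rdd can be built from df_vec along these lines:

# Key the vectors by label so that aggregateByKey can combine them per group
rdd = df_vec.rdd.map(lambda row: (row.label, row.vectors))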

Then aggregate:

from pyspark.ml.linalg import SparseVector, DenseVector
from collections import defaultdict

def seq_func(acc, v):
    if isinstance(v, SparseVector):
        for i in v.indices:
            acc[int(i)] += v[int(i)]
    if isinstance(v, DenseVector): 
        for i in range(len(v)):
            acc[int(i)] += v[int(i)]
    return acc

def comb_func(acc1, acc2):
    for k, v in acc2.items():
        acc1[k] += v
    return acc1

aggregated = rdd.aggregateByKey(defaultdict(int), seq_func, comb_func)

and map back to the desired output:

vocabulary = pipeline_model.stages[-1].vocabulary

def f(x, vocabulary=vocabulary):
    # For list of tuples use  [(vocabulary[i], float(v)) for i, v in x.items()]
    return {vocabulary[i]: float(v) for i, v in x.items()}


aggregated.mapValues(f).toDF(["id", "text"]).show(truncate=False)
# +---+-------------------------------------------------------------------------------------+
# |id |text                                                                                 |
# +---+-------------------------------------------------------------------------------------+
# |4  |[how -> 1.0, today -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]           |
# |3  |[how -> 2.0, hello -> 2.0, are -> 2.0, you -> 2.0]                                   |
# |1  |[how -> 2.0, hello -> 2.0, are -> 3.0, you -> 2.0, today -> 1.0]                     |
# |2  |[here -> 1.0, how -> 1.0, are -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]|
# +---+-------------------------------------------------------------------------------------+

This is only worth trying if the text part is fairly large; otherwise all the required conversions between DataFrame rows and Python objects can end up more expensive than collect_list.