Filter by specific words in a Spark DataFrame

Date: 2016-11-19 21:21:20

Tags: apache-spark apache-spark-sql spark-dataframe

I have a Spark DataFrame that contains the following data:

    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |text                                                                                                                                               |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE                                          |
    |Serasi ade haha @AdeRais "@SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'."        |
    |Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram  ? Music: This I Believe (The Creed) - Hillsong…                          |

The DataFrame has a single column, 'text', which contains words marked with a #, for example '#shutUpAndDANCE'.

I am trying to read each word and filter, so that I am left with only a list of the words that carry a hash.

Code:

#Gets only those rows containing
hashtagList = sqlContext.sql("SELECT text FROM tweetstable WHERE text LIKE '%#%'")
print hashtagList.show(100, truncate=False)

#Process Rows to get the words
hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" ")).collect() 
print hashtagList

The output is:

[[u'Know', u'what', u'you', u"don't", u'do', u'at', u'1:30', u'when', u'you', u"can't", u'sleep?', u'Music', u'shopping.', u'Now', u'I', u'want', u'to', u'dance.', u'#shutUpAndDANCE'], [...]]

Is there a way I can filter out everything else and keep only the # words in my map stage?

hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" "))<ADD SOMETHING HERE TO FETCH ONLY #>.collect()
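For context, a minimal sketch of what could slot into that spot, assuming a Spark 1.x DataFrame where .map is available directly (as in the code above), is one more map that keeps only the words starting with '#':

# Hypothetical sketch, not an accepted solution: keep only the '#' words in each row
hashtagList = hashtagList.map(lambda p: p.text) \
                         .map(lambda x: x.split(" ")) \
                         .map(lambda words: [w for w in words if w.startswith("#")]) \
                         .collect()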

2 answers:

Answer 0 (score: 1)

Try this:

from __future__ import print_function  # __future__ imports must come before any other import
from pyspark.sql import Row

str = "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE Serasi ade haha @AdeRais @SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'.Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram? Music: This I Believe (The Creed) - Hillsong"
df = spark.createDataFrame([Row(str)]);
words = df.rdd.flatMap(list).flatMap(lambda line: line.split()).filter(lambda word: word.startswith("#"));
words.foreach(print)
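Note that foreach(print) prints on the executors rather than the driver when this runs on a cluster. A small follow-up, assuming the same 'words' RDD as above, could collect the matching hashtags back to the driver before printing:

# Bring the matching hashtags back to the driver, then print them locally
for word in words.collect():
    print(word)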

Answer 1 (score: 1)

Use:

>>> from pyspark.sql.functions import split, explode, col
>>>
>>> df.select(explode(split("text", "\\s+")).alias("word")) \
...     .where(col("word").startswith("#"))
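To actually see the result, a hedged continuation, assuming the same 'df' with a 'text' column, could append an action such as show():

>>> df.select(explode(split("text", "\\s+")).alias("word")) \
...     .where(col("word").startswith("#")) \
...     .show(truncate=False)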