Filter by specific words in a Spark DataFrame

Date: 2016-11-19 21:21:20

Tags: apache-spark apache-spark-sql spark-dataframe

I have a Spark DataFrame that contains the following data:

    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |text                                                                                                                                               |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE                                          |
    |Serasi ade haha @AdeRais "@SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'."        |
    |Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram  ? Music: This I Believe (The Creed) - Hillsong…                          |

The DataFrame has a single column, 'text', which contains words marked with a #, for example '#shutUpAndDANCE'.

I am trying to read each word and filter, so that I am left with only a list of the words that carry a hash.

Code:

#Gets only those rows containing
hashtagList = sqlContext.sql("SELECT text FROM tweetstable WHERE text LIKE '%#%'")
print hashtagList.show(100, truncate=False)

#Process Rows to get the words
hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" ")).collect() 
print hashtagList

The output is:

[[u'Know', u'what', u'you', u"don't", u'do', u'at', u'1:30', u'when', u'you', u"can't", u'sleep?', u'Music', u'shopping.', u'Now', u'I', u'want', u'to', u'dance.', u'#shutUpAndDANCE'], [...]]

Is there a way I can filter out everything else and keep only the # words in my map stage?

hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" "))<ADD SOMETHING HERE TO FETCH ONLY #>.collect()
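For context, a minimal sketch of what could slot into that spot, assuming a Spark 1.x DataFrame where .map is available directly (as in the code above), is one more map that keeps only the words starting with '#':

# Hypothetical sketch, not an accepted solution: keep only the '#' words in each row
hashtagList = hashtagList.map(lambda p: p.text) \
                         .map(lambda x: x.split(" ")) \
                         .map(lambda words: [w for w in words if w.startswith("#")]) \
                         .collect()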

2 answers:

Answer 0 (score: 1)

Try this:

from __future__ import print_function  # __future__ imports must come before any other import
from pyspark.sql import Row

str = "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE Serasi ade haha @AdeRais @SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'.Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram? Music: This I Believe (The Creed) - Hillsong"
df = spark.createDataFrame([Row(str)]);
words = df.rdd.flatMap(list).flatMap(lambda line: line.split()).filter(lambda word: word.startswith("#"));
words.foreach(print)
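Note that foreach(print) prints on the executors rather than the driver when this runs on a cluster. A small follow-up, assuming the same 'words' RDD as above, could collect the matching hashtags back to the driver before printing:

# Bring the matching hashtags back to the driver, then print them locally
for word in words.collect():
    print(word)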

Answer 1 (score: 1)

Use:

>>> from pyspark.sql.functions import split, explode, col
>>>
>>> df.select(explode(split("text", "\\s+")).alias("word")) \
...     .where(col("word").startswith("#"))
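To actually see the result, a hedged continuation, assuming the same 'df' with a 'text' column, could append an action such as show():

>>> df.select(explode(split("text", "\\s+")).alias("word")) \
...     .where(col("word").startswith("#")) \
...     .show(truncate=False)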