Filter an RDD based on some criteria

Time: 2017-04-11 09:04:30

Tags: python filter rdd

I have an RDD like the one below -

[[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'103=0.799738137456', u'104=-0.619426440691', u'105=-0.505799761741', u'106=1.06018348173', u'107=-0.203731351216', u'108=0.242253668965', u'109=20170411', u'110=14:47:54'], [u'100=NO', u'101=OR', u'102=1.09790894815', u'103=-0.591742622246', u'104=0.60404467739', u'105=-0.729487378829', u'106=-0.41507842821', u'107=-1.01921955205', u'108=-0.153191948561', u'109=20170411', u'110=14:47:56'], [u'100=NO', u'101=OR', u'102=-0.0845031955962', u'103=0.428040384808', u'104=0.0579505934162', u'105=0.893705789837', u'106=-0.544258436965', u'107=1.10990090862', u'108=0.740638990995', u'109=20170411', u'110=14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'103=-0.275962563807', u'104=-0.728142212616', u'105=2.04751448847', u'106=2.10361125056', u'107=0.588650303087', u'108=-0.693327897822', u'109=20170411', u'110=14:48:00']]

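I want to remove everything before the "=" symbol (and the symbol itself) from every element at every index of the RDD.

I tried the following example -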

rdd.filter(lambda x : str(x[6]).split("=",1)[-1])

But I want to remove these characters from all indices of the RDD, not just one.

Expected RDD -

[[u'NO', u'OR', u'-0.00955461556684', u'0.799738137456', u'-0.619426440691', u'-0.505799761741', u'1.06018348173', u'-0.203731351216', u'0.242253668965', u'20170411', u'14:47:54'], [u'NO', u'OR', u'1.09790894815', u'-0.591742622246', u'0.60404467739', u'-0.729487378829', u'-0.41507842821', u'-1.01921955205', u'-0.153191948561', u'20170411', u'14:47:56'], [u'NO', u'OR', u'-0.0845031955962', u'0.428040384808', u'0.0579505934162', u'0.893705789837', u'-0.544258436965', u'1.10990090862', u'0.740638990995', u'20170411', u'14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'-0.275962563807', u'-0.728142212616', u'2.04751448847', u'2.10361125056', u'0.588650303087', u'-0.693327897822', u'20170411', u'14:48:00']]

2 answers:

Answer 0: (score: 3)

You need to modify the data rather than just filter it, so filter does not seem to be the right tool.

Try a nested list comprehension with sc.parallelize:

 RDD = sc.parallelize([[i.split('=')[1] for i in j] for j in RDD.toLocalIterator()])
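
Note that toLocalIterator brings every row back to the driver before the data is parallelized again. If the data should stay distributed, the same per-element split can probably be applied with map instead. A minimal sketch, assuming every element contains at least one "="; strip_keys is a hypothetical helper name and the sample row is shortened from the question's data:

```python
def strip_keys(row):
    """Return the row with everything up to and including the first '=' removed."""
    # split("=", 1) splits only on the first '=', so a value that itself
    # contains '=' is kept intact
    return [item.split("=", 1)[1] for item in row]


# Plain-Python check on a shortened row from the question (no Spark needed)
sample = [u'100=NO', u'101=OR', u'102=-0.00955461556684', u'110=14:47:54']
cleaned = strip_keys(sample)
print(cleaned)

# On the RDD itself, the helper would be applied to each row like this:
# RDD = RDD.map(strip_keys)
```

Because map is a transformation, nothing is computed until an action such as collect() or take() is called on the result.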

Answer 1: (score: 0)

Hello, I am new to programming, but I think this could also be solved with a regular expression. I tried something like this:

import re
test=[[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'103=0.799738137456', u'104=-0.619426440691', u'105=-0.505799761741', u'106=1.06018348173', u'107=-0.203731351216', u'108=0.242253668965', u'109=20170411', u'110=14:47:54'], [u'100=NO', u'101=OR', u'102=1.09790894815', u'103=-0.591742622246', u'104=0.60404467739', u'105=-0.729487378829', u'106=-0.41507842821', u'107=-1.01921955205', u'108=-0.153191948561', u'109=20170411', u'110=14:47:56'], [u'100=NO', u'101=OR', u'102=-0.0845031955962', u'103=0.428040384808', u'104=0.0579505934162', u'105=0.893705789837', u'106=-0.544258436965', u'107=1.10990090862', u'108=0.740638990995', u'109=20170411', u'110=14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'103=-0.275962563807', u'104=-0.728142212616', u'105=2.04751448847', u'106=2.10361125056', u'107=0.588650303087', u'108=-0.693327897822', u'109=20170411', u'110=14:48:00']]
result = re.sub(r"[u]'\d+", r"", test)
print(result)

But it gives an error: expected string or bytes-like object. I would be happy if someone could explain why.
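
The error most likely comes from the third argument: re.sub expects a single string there, while test is a list of lists. Separately, the u' prefix is only how Python 2 displays unicode strings and is not part of the string contents, so the pattern r"[u]'\d+" would not match anything even on a single element. A minimal sketch that applies the substitution to each element instead; the pattern ^\d+= and the shortened test data are assumptions made for illustration:

```python
import re

# Shortened version of the data from the answer (two rows, three fields each)
test = [[u'100=NO', u'101=OR', u'102=-0.00955461556684'],
        [u'100=NO', u'101=ORCL', u'102=1.20406493416']]

# Match the leading digits followed by '=' at the start of each string
key_prefix = re.compile(r"^\d+=")

# re.sub works on one string at a time, so loop over rows and elements
result = [[key_prefix.sub("", item) for item in row] for row in test]
print(result)
```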