Spark - Parsing a JSON file that contains additional text

Date: 2017-04-03 22:07:36

Tags: json scala apache-spark apache-spark-sql spark-dataframe

My JSON file has many lines, and each line looks like this:

Mon Jan 20 00:00:00 -0800 2014, {"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]}, 

/ space added for clarity /

{"v":"1.1","pv":"7963ee21-0d09-4924-b315-ced4adad425f","r":"v3","t":"tmzdtcom","a":[{"i":15,"u":"ll-media.tmz.com/2012/10/03/100312-alyson-stoner-then-480w.jpg","w":523,"h":480,"x":503,"y":651,"lt":"none","af":false}],"rf":"http://www.zergnet.com/news/128786/stars-whove-changed-a-lot-since-you-last-saw-them","p":"www.tmz.com/photos/2007/12/20/740-memba-them/images/2012/10/03/100312-alyson-stoner-then-jpg/","fs":true,"tr":0.7,"ac":{},"vp":{"ii":false,"w":1915,"h":1102},"sc":{"w":1920,"h":1200,"d":1},"pid":239,"vid":1,"ss":"0.5"}

I tried the following:

Approach 1:

val value1 = sc.textFile(filename).map(_.substring(32))

val df = sqlContext.read.json(value1)

Here I am trying to skip the text at the beginning of each line; the first 32 characters cover the timestamp prefix up to and including the ", " separator. In this case I only get the first JSON object from each line.

That is:

{"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]}

Approach 2:

val df = sqlContext.read.json(sc.wholeTextFiles(filename).values) 

In this case I just get the output as a single corrupt record.
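A quick way to see what Spark actually produced (a sketch; _corrupt_record is the default name of Spark's corrupt-record column):

val broken = sqlContext.read.json(sc.wholeTextFiles(filename).values)
broken.printSchema()  // root |-- _corrupt_record: string (nullable = true)
broken.show(1, false) // one row holding the raw, unparsed file contents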

Can you tell me what is going wrong here and how I can parse this kind of file?

1 Answer:

Answer 0 (score: 1):

Your sqlContext.read.json only works on JSON entries that appear complete on a single line of the file, not spread out or "pretty-printed."
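To illustrate (an editorial sketch, not part of the original answer), input where each element is one complete JSON document parses cleanly:

val ok = sqlContext.read.json(sparkContext.parallelize(Seq(
   """{"a":1,"b":"x"}""",
   """{"a":2,"b":"y"}"""
)))
ok.show()  // two rows, columns a and b

Your best bet for the file above, then, is to do the following: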

val jsonRDD = sparkContext.wholeTextFiles(fileName).map(_._2)

As described in the documentation, wholeTextFiles returns an RDD[(String, String)], where the first entry of each Tuple2 is the file name and the second is its contents. Only the second is what you care about, so you access the contents with ._2.
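For instance (illustrative only), you can inspect the tuple structure directly:

import org.apache.spark.rdd.RDD

// Each element pairs a file path with that file's entire contents;
// .map(_._2) (equivalently .values) keeps just the contents.
val pairs: RDD[(String, String)] = sparkContext.wholeTextFiles(fileName)
pairs.take(1).foreach { case (path, contents) =>
   println(s"$path starts with: ${contents.take(60)}")
}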

You can then convert the RDD to a DataFrame and parse the string contents with from_json (Spark 2.1+), which turns a JSON string column into a struct according to a schema (to_json goes the opposite direction, from a struct to a JSON string):

import sqlContext.implicits._                    // for .toDF and the $"..." column syntax
import org.apache.spark.sql.functions.from_json

val jsonDF = sparkContext
   .wholeTextFiles(fileName)
   .map(_._2)                           // keep only the file contents
   .toDF("json")
   .select(from_json($"json", schema))  // schema is defined below
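Here schema is a StructType describing the JSON payload, which from_json needs up front. A minimal sketch, with a deliberately partial field selection (illustrative, covering just a few fields of the first object in the question):

import org.apache.spark.sql.types._

// Partial schema: extend it with whichever fields you need.
val schema = new StructType()
   .add("cl", StringType)
   .add("ip", StringType)
   .add("cc", StringType)
   .add("hk", ArrayType(StringType))

Bear in mind that from_json yields null for a string that does not parse as a single JSON document matching the schema, so the contents still need to be one document per row; for the multi-object lines in the question, a per-line split like the one sketched earlier may be the more workable route.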