Reading multiple JSON files in Spark

Date: 2016-04-25 08:58:13

Tags: apache-spark

I have a list of JSON files which I would like to load in parallel.

I can't use read.json("*") because the files are not in the same folder and there is no specific pattern I can implement.

I've tried sc.parallelize(fileList).select(hiveContext.read.json), but the Hive context, as expected, doesn't exist on the executors.

Any ideas?

4 answers:

Answer 0 (score: 3)

Looks like I found the solution:

// textFile accepts a comma-separated list of paths
val text = sc.textFile("file1,file2....")
// read.json can parse an RDD[String] of JSON lines
val df = sqlContext.read.json(text)
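
For an arbitrary list of paths, a minimal sketch of the same idea (fileList below is a hypothetical Seq[String] of full paths) joins them into the comma-separated string that textFile expects:

// Hypothetical list of JSON file paths gathered elsewhere
val fileList = Seq("/data/a/file1.json", "/data/b/file2.json")

// textFile splits the comma-separated string into individual paths
val text = sc.textFile(fileList.mkString(","))
val df = sqlContext.read.json(text)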

Answer 1 (score: 1)

PySpark solution:

from pyspark.sql import SparkSession

# a single SparkSession is enough; its SparkContext is reused below
spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

# textFile accepts a comma-separated list of paths
text = sc.textFile("file1,file2...")
df = spark.read.json(text)

Answer 2 (score: 1)

The function json(paths: String*) takes a variable number of arguments. (documentation)

So you can change your code like this:

sqlContext.read.json(file1, file2, ...)
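
For example, a collected sequence of paths (the paths here are hypothetical) can be expanded into the varargs call with ": _*":

val paths = Seq("/data/a/1.json", "/data/b/2.json")

// json(paths: String*) is a varargs method, so a Seq expands with ": _*"
val df = sqlContext.read.json(paths: _*)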

Answer 3 (score: 0)

Alternatively, you can pass a directory as the argument:

$ cat 1.json
{"x": 1.0, "y": 2.0}
{"x": 1.5, "y": 1.0}
$ sudo -u hdfs hdfs dfs -put 1.json /tmp/test

$ cat 2.json
{"x": 3.0, "y": 4.0}
{"x": 1.8, "y": 7.0}
$ sudo -u hdfs hdfs dfs -put 2.json /tmp/test

sqlContext.read.json("/tmp/test").show()
+---+---+
|  x|  y|
+---+---+
|1.0|2.0|
|1.5|1.0|
|3.0|4.0|
|1.8|7.0|
+---+---+    
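
The path argument is also expanded with Hadoop glob patterns, so the same read can be narrowed to JSON files only:

// Hadoop glob patterns are expanded before reading (matches 1.json and 2.json above)
sqlContext.read.json("/tmp/test/*.json").show()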