Spark: reading files with a delimiter other than newline

Date: 2014-08-12 08:22:58

Tags: apache-spark

I am using Apache Spark 1.0.1. I have many files that are delimited by the UTF-8 character \u0001 rather than the usual newline \n. How can I read such files in Spark? That is, the default delimiter of sc.textFile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.

5 answers:

Answer 0: (score: 10)

You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, e.g.,

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
println(lines.collect.mkString("Array(", ", ", ")"))

For example, my input is a file containing a single line aXbXcXd. The code above will output

Array(a, b, c, d)
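For the delimiter in the question, the same approach should carry over by swapping in "\u0001" and the HDFS glob from the question; this is an untested sketch along the lines of the answer above, reusing the imports already shown:

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001")
// Read every file under the project directory and keep only the delimited text
val records = sc
  .newAPIHadoopFile("hdfs:///myproject/*", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }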

Answer 1: (score: 7)

In the Spark shell, I extracted the data following Setting textinputformat.record.delimiter in spark:

$ spark-shell
...
scala> import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.LongWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

scala> val conf = new Configuration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml

scala> conf.set("textinputformat.record.delimiter", "\u0001")

scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
data: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:19

sc.newAPIHadoopFile("mydata.txt", ...) returns an RDD[(LongWritable, Text)], where the first element of each pair is the starting offset of the record and the second is the actual text delimited by "\u0001".
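To keep only the delimited text as an RDD[String], one can then map over the pairs, along the same lines as the previous answer (a minimal sketch):

val lines = data.map { case (_, text) => text.toString }
lines.take(3).foreach(println)   // inspect the first few "\u0001"-delimited records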

Answer 2: (score: 5)

In Python, this can be achieved as follows:

rdd = sc.newAPIHadoopFile(YOUR_FILE, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": YOUR_DELIMITER}).map(lambda l:l[1])

Answer 3: (score: 0)

Here is a ready-to-use version of Chad's and @zsxwing's answers for Scala users, which can be used this way:

sc.textFile("some/path.txt", "\u0001")

The following snippet creates an additional textFile method implicitly attached to the SparkContext by means of an implicit class (in order to replicate SparkContext's default textFile method):

package com.whatever

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object Spark {

  implicit class ContextExtensions(val sc: SparkContext) extends AnyVal {

    def textFile(
        path: String,
        delimiter: String,
        maxRecordLength: String = "1000000"
    ): RDD[String] = {

      val conf = new Configuration(sc.hadoopConfiguration)

      // This configuration sets the record delimiter:
      conf.set("textinputformat.record.delimiter", delimiter)
      // and this one limits the size of one record:
      conf.set("mapreduce.input.linerecordreader.line.maxlength", maxRecordLength)

      sc.newAPIHadoopFile(
          path,
          classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
          conf
        )
        .map { case (_, text) => text.toString }
    }
  }
}

which can be used this way:

import com.whatever.Spark.ContextExtensions
sc.textFile("some/path.txt", "\u0001")

Note the additional setting mapreduce.input.linerecordreader.line.maxlength, which limits the maximum size of a record. This comes in handy when reading from a corrupted file whose records could be too long to fit in memory (which is more likely to happen when playing with the record delimiter).

With this setting, reading a corrupted file throws a catchable exception instead of making a mess of memory and bringing down the SparkContext.
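As a hedged illustration of that last point, assuming the ContextExtensions import above is in scope and that the path is only a placeholder, the failure could be handled with a plain Try:

import scala.util.{Failure, Success, Try}

// Wrap the action so that a record-too-long error surfaces as a Failure instead of killing the job
Try(sc.textFile("some/path.txt", "\u0001").count()) match {
  case Success(n) => println(s"Read $n records")
  case Failure(e) => println(s"Could not read the file: ${e.getMessage}")
}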

Answer 4: (score: 0)

If you are using the Spark context, the following code helped me:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "delimiter")
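A hedged sketch of how this might be combined with plain sc.textFile for the question's delimiter; whether sc.textFile honors this property depends on the Hadoop version backing the cluster, so treat it as an assumption to verify:

// Assumption: the LineRecordReader used by sc.textFile also reads textinputformat.record.delimiter
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\u0001")
val records = sc.textFile("hdfs:///myproject/*")
println(records.count())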