Read a CSV file and populate a BigQuery table with the data

Date: 2017-07-31 09:03:57

Tags: google-cloud-dataflow

Here is code that is supposed to read from a CSV file and write both to another CSV file and to BigQuery:

import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
parser = argparse.ArgumentParser()
parser.add_argument('--input',
                  dest='input',
                  default='gs://dataflow-samples/shakespeare/kinglear.txt',
                  help='Input file to process.')
parser.add_argument('--output',
                  dest='output',
                  required=True,
                  help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)
# Read the text file[pattern] into a PCollection.
lines = p | 'read' >> ReadFromText(known_args.input)
lines | beam.Map(lambda x: x.split(','))
lines | 'write' >> WriteToText(known_args.output)
lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
# Actually run the pipeline (all operations above are deferred).
result = p.run()

It writes to the output file successfully, but it fails to do the same for the BigQuery table (xxxx:yyyy.aaaa).

Here is the message that is shown:

WARNING:root:A task failed with exception.
'unicode' object has no attribute 'iteritems'

The rows contained in the CSV file are never written to BigQuery, even though the schema matches and the BigQuery table is empty. I suspect this is because the data must be converted to JSON format. What corrections must be made to this code for it to work? Could you provide the lines of code I have to add?

1 Answer:

Answer 0 (score: 0)

Look at the following lines:

1: lines = p | 'read' >> ReadFromText(known_args.input)
2: lines | beam.Map(lambda x: x.split(','))
3: lines | 'write' >> WriteToText(known_args.output)
4: lines | 'write2' >> beam.io.Write(beam.io.BigQuerySink('xxxx:yyyy.aaaa'))
  1. lines is defined as the PCollection of lines read from the text file.
  2. Creates a new PCollection by splitting each line. However, the result is never assigned to anything, so this step effectively does nothing.
  3. Writes the original lines to the text file (so each output line is an original input line, not the split values).
  4. Writes the lines read from the input to BigQuery. Each element is still a raw string rather than a dictionary of fields, so the sink fails when it tries to iterate over the row's items; that is where the 'unicode' object has no attribute 'iteritems' warning comes from.
  5. If you look at the BigQuery tornadoes example, you can see that (1) you need to convert each line into a dictionary with a field for each column, and (2) you need to provide a schema matching that dictionary to the BigQuerySink. For example:

    def to_table_row(x):
      values = x.split(',')
      return { 'field1': values[0], 'field2': values[1] }

    lines = p | 'read' >> ReadFromText(known_args.input)
    lines | 'write' >> WriteToText(known_args.output)
    (lines
        | 'ToTableRows' >> beam.Map(to_table_row)
        | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
            'xxxx:yyyy.aaaa',
            schema='field1:INTEGER, field2:INTEGER')))
    
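Putting the pieces together, here is a minimal end-to-end sketch of the corrected pipeline. It assumes a two-column CSV; the table spec 'xxxx:yyyy.aaaa' and the field names field1/field2 are placeholders carried over from the question, and the schema uses STRING because to_table_row emits the raw split strings (use INTEGER only if the columns really hold integers).

    import argparse
    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    def to_table_row(line):
      # Turn one CSV line into the dictionary the BigQuery sink expects,
      # one key per column.
      values = line.split(',')
      return {'field1': values[0], 'field2': values[1]}

    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True, help='Input CSV file.')
    parser.add_argument('--output', required=True, help='Output text file.')
    known_args, pipeline_args = parser.parse_known_args()

    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    lines = p | 'read' >> ReadFromText(known_args.input)
    # Write the raw lines to the text output, as in the original code.
    lines | 'write' >> WriteToText(known_args.output)
    # Convert each line to a dict before handing it to the BigQuery sink.
    (lines
        | 'ToTableRows' >> beam.Map(to_table_row)
        | 'write2' >> beam.io.Write(beam.io.BigQuerySink(
            'xxxx:yyyy.aaaa',  # placeholder project:dataset.table
            schema='field1:STRING, field2:STRING')))
    p.run().wait_until_finish()

In later Beam SDKs, beam.io.WriteToBigQuery supersedes BigQuerySink, but the requirement is the same: each row must be a dictionary whose keys match the schema you supply.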