Sqoop import: composite and textual primary keys

Posted: 2016-05-13 09:27:04

Tags: hadoop hive sqoop hortonworks-data-platform

Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1

The source database schema is on SQL Server, and it contains several tables whose primary key is either:

  • a varchar, or
  • composite: two varchar columns, or one varchar + one int column, or two int columns. There is one large table (with ? rows) whose PK has three columns: one int + two varchar columns.

According to the Sqoop documentation:

Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

The first question is about 'manually choose a splitting column': how can I sacrifice the PK and use only one of its columns, or am I missing some concept?

The SQL Server table is (only two columns, which together form the composite primary key):

ChassiNo    varchar(8)  Unchecked
ECU_Name    nvarchar(15)    Unchecked

I proceeded with the import; the source table has 7909097 records:

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|'  --table ChassiECU --num-mappers 8  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

Worrisome warnings, and incorrect mapper input and record counts:

16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=1168400
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1128
                HDFS: Number of bytes written=209961941
                HDFS: Number of read operations=32
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Job Counters
                Launched map tasks=8
                Other local map tasks=8
                Total time spent by all maps in occupied slots (ms)=62785
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=62785
                Total vcore-seconds taken by all map tasks=62785
                Total megabyte-seconds taken by all map tasks=128583680
        Map-Reduce Framework
                Map input records=15818167
                Map output records=15818167
                Input split bytes=1128
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=780
                CPU time spent (ms)=45280
                Physical memory (bytes) snapshot=2219433984
                Virtual memory (bytes) snapshot=20014182400
                Total committed heap usage (bytes)=9394716672
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.

Created the table:

CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
`ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'  LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';

Terrible result (with no errors). The PROBLEM: 15818167 vs 7909097 (SQL Server) records:

 > select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     14         14        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167

Surprisingly, if the composite key includes an int (which then gets used for splitting), I get either an accurate count or a mismatch of fewer than 10 records, but I am still worried even about those!

What should I do?

1 Answer:

Answer 0 (score: 6):

Specify the split column manually. The split column is not necessarily equal to the PK: you can have a composite PK and still use some int column as the split column. You can specify any integer column, or even a simple function (a simple function such as substring or cast, not an aggregate or analytic one). Ideally, the split column should be an evenly distributed integer.
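A minimal sketch of how that looks on the command line, reusing the connection details from the question (SomeIntColumn is a placeholder for whatever evenly distributed integer column you pick):

# --split-by tells Sqoop which column to use for partitioning the data
# between mappers, independently of the table's PK
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass --as-textfile \
    --table ChassiECU --split-by SomeIntColumn --num-mappers 8 \
    --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016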

For example, if your split column contains 10 rows with the value -1 and 10M rows with values from 10000 to 10000000, and num-mappers=8, then Sqoop will divide the dataset unevenly between mappers:

  • the 1st mapper will get the few rows with -1,
  • the 2nd through 7th mappers will get 0 rows,
  • the 8th mapper will get almost 10M rows,

which will result in data skew, and the 8th mapper will run forever or even fail. I have also gotten duplicates when using a non-integer split column with MS-SQL, so use an integer split column. The skew comes from the way Sqoop computes split boundaries, sketched below.
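This is a sketch of Sqoop's general boundary calculation, not output from this job (SplitCol is a placeholder):

-- Sqoop first asks the database for the boundaries of the split column:
SELECT MIN(SplitCol), MAX(SplitCol) FROM ChassiECU;
-- It then cuts [min, max] into num-mappers equal sub-ranges and hands
-- each mapper one sub-range as a WHERE condition, roughly:
--   SELECT ... WHERE SplitCol >= lo AND SplitCol < hi
-- A few outlier values (like the -1 rows above) stretch [min, max], so
-- most sub-ranges contain hardly any real rows while one sub-range
-- contains almost all of them.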

In your case, for a table with only two varchar columns, you can:

(1) add a surrogate int PK and use it as the split column, or

(2) split the data manually with custom queries using WHERE clauses, running sqoop several times with num-mappers=1, or

(3) apply some deterministic, non-aggregate integer function to a varchar column, for example cast(substr(...) as int) or second(timestamp_col), and use that as the split column.
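Hedged sketches of options (1) and (2); the surrogate column name and the ChassiNo cut-off are invented for the example and would need to be adapted:

-- Option (1): add a surrogate integer key on the SQL Server side
-- (SurrogateID is a hypothetical name):
ALTER TABLE ChassiECU ADD SurrogateID INT IDENTITY(1,1);

# ...then import with the surrogate as the split column:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --table ChassiECU --split-by SurrogateID --num-mappers 8 \
    --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016

# Option (2): partition manually with a WHERE clause and one mapper per
# run; note that Sqoop requires the $CONDITIONS token in any free-form
# --query, and --target-dir instead of --warehouse-dir:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE ChassiNo < 'M' AND \$CONDITIONS" \
    --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU \
    --num-mappers 1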