日期:2014-05-16  浏览次数:20582 次

Sqoop源码分析(四) Sqoop中通过hadoop mapreduce从关系型数据库import数据分析

??????本博客属原创文章,转载请务必注明出处:http://guoyunsky.iteye.com/blogs/1213966/?????????

? ? ???欢迎加入Hadoop超级群:?180941958

?????????? Sqoop中一大亮点就是可以通过hadoop的mapreduce从关系型数据库中导入数据到HDFS,如此可以加快导入时间.一直想了解MapReduce,所以也仔细的阅读了下相关代码,整理成这篇博客.

??????????? .原理:

? ? ? ? ? ? ?Sqoop在import时,需要制定split-by参数.Sqoop根据不同的split-by参数值来进行切分,然后将切分出来的区域分配到不同map中.每个map中再处理数据库中获取的一行一行的值,写入到HDFS中.同时split-by根据不同的参数类型有不同的切分方法,如比较简单的int型,Sqoop会取最大和最小split-by字段值,然后根据传入的num-mappers来确定划分几个区域。比如select max(split_by),min(split-by) from得到的max(split-by)和min(split-by)分别为1000和1,而num-mappers为2的话,则会分成两个区域(1,500)和(501-100),同时也会分成2个sql给2个map去进行导入操作,分别为select XXX from table where split-by>=1 and split-by<500和select XXX from table where split-by>=501 and split-by<=1000.最后每个map各自获取各自SQL中的数据进行导入工作。

??????

???????二.mapreduce job所需要的各种参数在Sqoop中的实现

?????????? 1)InputFormatClass
??????????????com.cloudera.sqoop.mapreduce.db.DataDrivenDBInputFormat
?????????? 2)OutputFormatClass
?????????????? 1)TextFile
?????????????????? com.cloudera.sqoop.mapreduce.RawKeyTextOutputFormat
?????????????? 2)SequenceFile
???????????????????org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat?
?????????????? 3)AvroDataFile
?????????????????? com.cloudera.sqoop.mapreduce.AvroOutputFormat
????????? 3)Mapper
???????????? 1)TextFile
???????????????? com.cloudera.sqoop.mapreduce.TextImportMapper
???????????? 2)SequenceFile
???????????????? com.cloudera.sqoop.mapreduce.SequenceFileImportMapper
?????????? ?3)AvroDataFile
???????????????? com.cloudera.sqoop.mapreduce.AvroImportMapper
???????? 4)taskNumbers
??????????? 1)mapred.map.tasks (对应num-mappers参数)
????????????2)job.setNumReduceTasks(0);


???????这里以我命令行:import --connect jdbc:mysql://localhost/sqoop_datas? --username root --password 123456 --query "select sqoop_1.id as foo_id, sqoop_2.id as bar_id from sqoop_1 ,sqoop_2? WHERE $CONDITIONS" --target-dir /tmp/sqoop/foo2 -split-by sqoop_1.id?? --hadoop-home=/home/guoyun/Downloads/hadoop-0.20.2-CDH3B4? --num-mappers 2

?????? 注:红色部分参数,后接根据我的命令衍生的参数值

????? 1)设置Input

?????? DataDrivenImportJob.configureInputFormat(Job job, String tableName,String tableClassName, String splitByCol)
?????????? a)DBConfiguration.configureDB(Configuration conf, String driverClass,
?????????? String dbUrl, String userName, String passwd, Integer fetchSize)

??????????????? 1).mapreduce.jdbc.driver.class com.mysql.jdbc.Driver
??????????????? 2).mapreduce.jdbc.url? jdbc:mysql://localhost/sqoop_datas
??????????????? 3).mapreduce.jdbc.username? root
??????????????? 4).mapreduce.jdbc.password? 123456
??????????????? 5).mapreduce.jdbc.fetchsize<