日期:2014-05-16  浏览次数:20622 次

Hive Join 优化 翻译


翻译自?https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-AutoConversiontoSMBMapJoin

目录结构

Join Optimization ----Join 调优
	Improvements to the Hive Optimizer ----Hive的优化
	Star Join Optimization ----星型结构的优化
		Star Schema Example ----例子
		Prior Support for MAPJOIN  ----MapJoin的预备知识
			Limitations of Prior Implementation  ----使用限制
		Enhancements for Star Joins  ----优化星型join
			Optimize Chains of Map Joins  ----优化一连串的Map join
				Current and Future Optimizations ----当前以及未来的优化方向
			Optimize Auto Join Conversion ----自动join转换
				Current Optimization ----当前的优化方向
				Auto Conversion to SMB Map Join ----自动转换成SMS join
			Generate Hash Tables on the Task Side ----在task side生成hash表
				Pros and Cons of Client-Side Hash Tables ----优缺点
				Task-Side Generation of Hash Tables ----在task siade 生成hash表
					Further Options for Optimization----优化方向

?

Improvements to the Hive Optimizer

?

Hive automatically recognizes various use cases and optimizes for them. Hive 0.11 improves the optimizer for these cases: ? hive可以自动优化,在0.11里面改进了一些优化用例

  • Joins where one side fits in memory. In the new optimization: ?join的一边适合放进内存,有新的优化方案
    • that side is loaded into memory as a hash table?把表按照hash表的形式读进内存
    • only the larger table needs to be scanned?只扫描大表
    • fact tables have a smaller footprint in memory?fact表只使用少量内存
  • Star-schema joins 星型join
  • Hints are no longer needed for many cases. 在很多情况下不再需要HINT
  • Map joins are automatically picked up by the optimizer. Map join 自动优化

Star Join Optimization

Star Schema Example

?

Select count(*) cnt
From store_sales ss
     join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
Where
     t.t_hour = 8
     t.t_minute >= 30
     hd.hd_dep_count = 2
order by cnt;
?

?

DW 常用的star schema,这个属于BI 基本概念了就不解释了。

Prior Support for MAPJOIN

?

The default value for hive.auto.convert.join was false in Hive 0.10.0. ?Hive 0.11.0 changed the default to true (HIVE-3297).hive.auto.convert.join 在0.10默认是fales,到了0.11就是变成了true

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:

MAPJOINs ?把小表以hash map的形式读进内存,然后和大表匹配key,以下是各个阶段的分工

  • Local work:本地
    • read records via standard ta