Hadoop Partitioner Functions

metooxi


Category: Hadoop

Partition functions

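Users of MapReduce specify the number of reduce tasks, and therefore the number of reduce output files, that they want (R). Intermediate data is divided among these tasks by applying a partitioning function to the intermediate key. The default partitioning function uses hashing (for example, hash(key) mod R), which tends to produce fairly well-balanced partitions. Sometimes, however, it is useful to partition by some other function of the key. For example, when the output keys are URLs, we may want all entries for a single host to end up in the same output file. To support cases like this, the user of the MapReduce library can supply a custom partitioning function. Using "hash(Hostname(urlkey)) mod R" as the partitioning function, for instance, places all URLs from the same host in the same output file.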
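The two formulas above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class name `PartitionSketch` and both method names are hypothetical, `String.hashCode()` stands in for the unspecified `hash` function, and `java.net.URI` is assumed to be good enough for extracting the hostname.

```java
import java.net.URI;

public class PartitionSketch {

    // hash(key) mod R; the mask clears the sign bit so the result is never negative.
    public static int defaultPartition(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
    }

    // hash(Hostname(urlkey)) mod R: only the host part of the URL feeds the hash.
    public static int hostPartition(String urlKey, int r) {
        String host = URI.create(urlKey).getHost();
        // Falling back to the whole key when no host is found is an assumption of this sketch.
        String basis = (host != null) ? host : urlKey;
        return (basis.hashCode() & Integer.MAX_VALUE) % r;
    }

    public static void main(String[] args) {
        int r = 4; // number of reduce tasks / output files
        String a = "http://example.com/index.html";
        String b = "http://example.com/docs/page.html";
        // Same host, so both URLs are guaranteed to land in the same partition.
        System.out.println(hostPartition(a, r) == hostPartition(b, r));
        // With the default function the two URLs may or may not share a partition.
        System.out.println(defaultPartition(a, r) + " " + defaultPartition(b, r));
    }
}
```

Hashing only the hostname trades some balance for locality: a host with very many URLs makes its partition correspondingly larger.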

Every partition function must extend the Partitioner class:

package org.apache.hadoop.mapreduce;

/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * @see Reducer
 */
public abstract class Partitioner<KEY, VALUE> {
  
  /** 
   * Get the partition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on all or a subset of the key.</p>
   *
   * @param key the key to be partitioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  
}

The default HashPartitioner serves as an example:

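package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  @Override
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    // Clear the sign bit so the result is never negative, then take it mod R.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}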
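The `& Integer.MAX_VALUE` mask in HashPartitioner matters because `hashCode()` can be negative and Java's `%` keeps the sign of the dividend. The small demo below, with a hypothetical class name `SignBitDemo` and an arbitrary R of 3, compares the masked expression with a plain `%` and with `Math.abs`.

```java
public class SignBitDemo {

    // Same expression as the return statement in HashPartitioner.getPartition.
    public static int masked(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 3;
        int negative = Integer.valueOf(-7).hashCode(); // an Integer hashes to its own value, -7

        System.out.println(negative % r);                    // -1: not a valid partition number
        System.out.println(masked(negative, r));             // 1
        System.out.println(Math.abs(Integer.MIN_VALUE) % r); // -2: Math.abs overflows for MIN_VALUE
        System.out.println(masked(Integer.MIN_VALUE, r));    // 0
    }
}
```

The mask always yields a value in the range 0 to R-1, which is what a partition number has to be.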

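Job job = new Job(conf, "Process Name");  // deprecated in Hadoop 2.x in favor of Job.getInstance(conf, "Process Name")

job.setPartitionerClass(cls);  // cls: the Class object of a Partitioner subclass

Calling setPartitionerClass makes the job use a custom Partitioner instead of the default HashPartitioner.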
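As a rough sketch of what such a custom Partitioner might look like, the example below partitions on a subset of the key, as the Javadoc above allows. Everything here is hypothetical: the `PrefixPartitioner` name, the `userId#date` key layout, and the nested `Partitioner` class, which is only a stand-in with the same shape as `org.apache.hadoop.mapreduce.Partitioner` so that the sketch compiles without Hadoop on the classpath. In a real job the key would be a Writable type such as `Text` rather than `String`.

```java
public class PrefixPartitionerSketch {

    // Stand-in mirroring the abstract class shown earlier; replace with the
    // real org.apache.hadoop.mapreduce.Partitioner when Hadoop is available.
    public abstract static class Partitioner<KEY, VALUE> {
        public abstract int getPartition(KEY key, VALUE value, int numPartitions);
    }

    // Hypothetical partitioner: only the part of the key before '#' feeds the
    // hash, so all records of one user go to the same reduce task.
    public static class PrefixPartitioner<V> extends Partitioner<String, V> {
        @Override
        public int getPartition(String key, V value, int numPartitions) {
            int cut = key.indexOf('#');
            String prefix = (cut >= 0) ? key.substring(0, cut) : key;
            return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void main(String[] args) {
        Partitioner<String, Integer> p = new PrefixPartitioner<>();
        int first = p.getPartition("user42#2012-03-01", 1, 8);
        int second = p.getPartition("user42#2012-03-02", 2, 8);
        System.out.println(first == second); // true: same prefix, same partition
        // With Hadoop available, registration would look like:
        //   job.setPartitionerClass(PrefixPartitioner.class);
    }
}
```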


2012-03-01 10:45

