Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I am trying to run a high-memory job on a Hadoop cluster (0.20.203). I modified the mapred-site.xml to enforce some memory limits.

  <property>
    <name>mapred.cluster.max.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapred.cluster.max.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapred.cluster.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapred.cluster.reduce.memory.mb</name>
    <value>2048</value>
  </property>

In my job, I am specifying how much memory I will need. Unfortunately, even though I am running my process with -Xmx2g (the job will run just fine with this much memory as a console application) I need to request much more memory for my mapper (as a subquestion, why is this?) or it is killed.

val conf = new Configuration()
conf.set("mapred.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC");
conf.set("mapred.job.map.memory.mb", "4096");
conf.set("mapred.job.reduce.memory.mb", "1024");

The reducer needs hardly any memory since I am performing an identity reducer.

  class IdentityReducer[K, V] extends Reducer[K, V, K, V] {
    override def reduce(key: K,
        values: java.lang.Iterable[V],
        context:Reducer[K,V,K,V]#Context) {
      for (v <- values) {
        context write (key, v)
      }
    }
  }

However, the reducer is still using a lot of memory. Is it possible to give the reducer different JVM arguments than the mapper? Hadoop kills the reducer and claims it is using 3960 MB of memory! And the reducers end up failing the job. How is this possible?

TaskTree [pid=10282,tipID=attempt_201111041418_0005_r_000000_0] is running beyond memory-limits.
Current usage : 4152717312bytes.
Limit : 1073741824bytes.
Killing task.

UPDATE: even when I specify a streaming job with cat as the mapper and uniq as the reducer and -Xms512M -Xmx1g -XX:+UseSerialGC my tasks take over 2g of virtual memory! This seems extravagant at 4x the max heap size.

TaskTree [pid=3101,tipID=attempt_201111041418_0112_m_000000_0] is running beyond memory-limits.
Current usage : 2186784768bytes.
Limit : 2147483648bytes.
Killing task.

Update: the original JIRA for changing the configuration format for memory usage specifically mentions that Java users are mostly interested in physical memory to prevent thrashing. I think this is exactly what I want: I don't want a node to spin up a mapper if there is inadequate physical memory available. However, these options all seem to have been implemented as virtual memory constraints, which are difficult to manage.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
156 views
Welcome To Ask or Share your Answers For Others

1 Answer

Check your ulimit. From Cloudera, on version 0.20.2, but a similar issue probably applies for later versions:

...if you set mapred.child.ulimit, it's important that it must be more than two times the heap size value set in mapred.child.java.opts. For example, if you set a 1G heap, set mapred.child.ulimit to 2.5GB. Child processes are now guaranteed to fork at least once, and the fork momentarily requires twice the overhead in virtual memory.

It's also possible that setting mapred.child.java.opts programmatically is "too late"; you might want to verify it really is going into effect, and put it in your mapred-site.xml if not.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...