Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am new to Hadoop and am trying to process the Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but such a file can only be processed by a single mapper, since only one mapper can decompress it. This seems to limit the parallelism of the processing. Is there an alternative? For example, decompressing the XML file, splitting it into multiple chunks, and recompressing each chunk with gzip.

I read about Hadoop and gzip at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.


He who fights with dragons for too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back into you…

1 Answer

A file compressed with the GZIP codec cannot be split because of the way the codec works: decompression has to start from the beginning of the stream, so there is no way to begin reading at an arbitrary offset. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.

There are at least three ways around that limitation:

  1. As a preprocessing step: Uncompress the file and recompress it using a splittable codec (e.g. bzip2, or LZO with an index)
  2. As a preprocessing step: Uncompress the file, split into smaller sets and recompress. (See this)
  3. Use this patch for Hadoop (which I wrote) that allows for a way around this: Splittable Gzip
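Option 2 can be sketched with standard command-line tools. This is a minimal illustration, not a drop-in recipe: the dump file name, chunk size, and HDFS path below are all assumptions you would adjust for your setup.

```shell
# Sketch of option 2: uncompress the single gzip stream, cut it into
# chunks, and recompress each chunk so Hadoop can run one mapper per file.
# "enwiki-dump.xml.gz" and the 512m chunk size are hypothetical values.
DUMP=enwiki-dump.xml.gz
if [ -f "$DUMP" ]; then
    # Decompress the stream and split it into 512 MB pieces
    # named chunk_aa, chunk_ab, ...
    gunzip -c "$DUMP" | split -b 512m - chunk_
    # Recompress each piece; each chunk_*.gz is then an independent
    # (non-splittable, but small) input file.
    gzip chunk_*
    # hadoop fs -put chunk_*.gz /wiki/input   # upload; one mapper per file
fi
```

One caveat: a plain byte-based split like this can cut an XML record in half at a chunk boundary, which is why a record-aware splitting tool (such as the one linked above) is preferable for XML input.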

HTH

