Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am new to Hadoop and am trying to process the Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but such a file can only be processed by a single mapper, since only one mapper can decompress it. This seems to limit the parallelism of the processing. Is there an alternative? For example, decompressing the XML file, splitting it into multiple chunks, and recompressing each chunk with gzip.

I read about Hadoop and gzip at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.


He who fights with dragons for too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back into you…

1 Answer

A file compressed with the GZIP codec cannot be split because of the way the codec works: decompression has to start from the beginning of the stream, so there is no way to begin reading at an arbitrary offset. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.

There are at least three ways around that limitation:

  1. As a preprocessing step: Uncompress the file and recompress it using a splittable codec (e.g. bzip2, or LZO with an index)
  2. As a preprocessing step: Uncompress the file, split into smaller sets and recompress. (See this)
  3. Use this patch for Hadoop (which I wrote) that allows for a way around this: Splittable Gzip
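Option 2 can be sketched with standard command-line tools. This is a minimal illustration, not a drop-in recipe: the dump file name, chunk size, and HDFS path below are all assumptions you would adjust for your setup.

```shell
# Sketch of option 2: uncompress the single gzip stream, cut it into
# chunks, and recompress each chunk so Hadoop can run one mapper per file.
# "enwiki-dump.xml.gz" and the 512m chunk size are hypothetical values.
DUMP=enwiki-dump.xml.gz
if [ -f "$DUMP" ]; then
    # Decompress the stream and split it into 512 MB pieces
    # named chunk_aa, chunk_ab, ...
    gunzip -c "$DUMP" | split -b 512m - chunk_
    # Recompress each piece; each chunk_*.gz is then an independent
    # (non-splittable, but small) input file.
    gzip chunk_*
    # hadoop fs -put chunk_*.gz /wiki/input   # upload; one mapper per file
fi
```

One caveat: a plain byte-based split like this can cut an XML record in half at a chunk boundary, which is why a record-aware splitting tool (such as the one linked above) is preferable for XML input.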

HTH

