File/data compression brings two major benefits: it reduces the space needed to store files and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant.
Hadoop supports the following compression types and codecs:
- gzip – org.apache.hadoop.io.compress.GzipCodec
- bzip2 – org.apache.hadoop.io.compress.BZip2Codec
- LZO – com.hadoop.compression.lzo.LzopCodec
- Snappy – org.apache.hadoop.io.compress.SnappyCodec
- Deflate – org.apache.hadoop.io.compress.DeflateCodec
These codecs ship with the CDH parcel by default, so no separate installation is needed. The exception is LZO, which requires installing the GPL Extras parcel.
To enable a compression codec, go to Cloudera Manager – HDFS – Configuration, search for 'compression', and add the codec you want to use to the list of compression codecs.
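As a quick sanity check on a cluster node, the `hadoop checknative` command reports which native compression libraries this Hadoop build can actually load (output varies by build and OS, so treat the listed libraries as illustrative):

```
# List native library availability (zlib, snappy, lz4, bzip2, etc.)
# Run on any host with the Hadoop client installed
hadoop checknative -a
```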
If you are asked to run a MapReduce job with a particular compression type:

To compress intermediate map output:
hadoop jar hadoop-examples-.jar sort "-Dmapreduce.map.output.compress=true" "-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec"
To compress the final job output:
hadoop jar hadoop-examples-.jar sort "-Dmapreduce.output.fileoutputformat.compress=true" "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
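Switching codecs only means changing the codec class. As a sketch (keeping the jar name placeholder from the commands above), the same sort job with Snappy for both intermediate and final output would look like:

```
hadoop jar hadoop-examples-.jar sort \
  "-Dmapreduce.map.output.compress=true" \
  "-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec" \
  "-Dmapreduce.output.fileoutputformat.compress=true" \
  "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
```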
Which compression type to choose:
- GZip uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently.
- Snappy and LZO are better choices for hot data, which is accessed frequently. Snappy often performs better than LZO.
- BZip2 can also produce more compression than GZip for some types of files, at the cost of some speed when compressing and decompressing.
- For MapReduce, if you need your compressed data to be splittable, use BZip2 or LZO: BZip2 files are splittable as-is, and LZO files become splittable once they have been indexed with the LZO indexer.
- Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split.
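The ratio-versus-speed trade-off described above is easy to observe locally with the standalone gzip and bzip2 tools (a small sketch; the file name is arbitrary):

```shell
# Create a compressible sample file (~290 KB of repetitive numeric text)
seq 1 50000 > sample.txt

# Compress with gzip and bzip2, keeping the original for comparison
gzip  -c sample.txt > sample.txt.gz
bzip2 -c sample.txt > sample.txt.bz2

# Compare sizes: both are far smaller than the original; bzip2 typically
# compresses further than gzip, at the cost of slower (de)compression
wc -c sample.txt sample.txt.gz sample.txt.bz2
```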
Problem Scenarios:
- Enable Snappy compression for MapReduce jobs.
- Install and enable the LZO codec.
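For the LZO scenario, a rough outline follows (the parcel name and codec classes are taken from Cloudera's GPL Extras packaging; verify the exact parcel version against your CDH release):

```
# 1. In Cloudera Manager: Hosts -> Parcels -> download, distribute and
#    activate the GPL Extras (GPLEXTRAS) parcel matching your CDH version.
# 2. In CM -> HDFS -> Configuration, search 'compression' and append the
#    LZO codec classes to the compression codecs list:
#      com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
# 3. Deploy client configuration, restart the affected services, then test:
hadoop jar hadoop-examples-.jar sort \
  "-Dmapreduce.map.output.compress=true" \
  "-Dmapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzopCodec" \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
```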
With that, we have covered the objective "Install new type of I/O compression library in cluster".
—
Use the comments section below to post your doubts, questions, and feedback.
Please follow my blog to get notified of more certification related posts, exam tips, etc.