Benchmarking is the process of stress testing the resources of a cluster. It is very useful for understanding the performance of your cluster and for checking whether it performs as expected before taking it live.
Here we are going to test the speed at which files are read from and written to HDFS, the time taken by mappers/reducers to process a given amount of data, and other performance measures.
We can easily benchmark the cluster by running the test jars bundled with the Cloudera distribution.
These are available in the Hadoop installation directory:
/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-*.jar
/opt/cloudera/parcels/CDH/jars/hadoop-test-*.jar
When you run the jar file without any arguments, it will show you the list of programs available.
Below are the four benchmark programs used most often; let's take a look at each of them.
- Teragen
- Terasort
- Teravalidate
- TestDFSIO
The first three benchmarks are commonly used together to evaluate cluster performance; collectively they are called the TeraSort Benchmark Suite.
Terasort Benchmark Suite:
The TeraSort benchmark is the most widely known Hadoop benchmark. It tests both the HDFS and MapReduce layers of a Hadoop cluster and consists of three MapReduce programs.
There are three steps involved in Terasort benchmarking suite:
1. Generating the input data via TeraGen.
2. Running the actual TeraSort on the input data.
3. Validating the sorted output data via TeraValidate.
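The three steps above can be sketched as one shell sequence. This is a dry-run sketch that echoes the commands instead of executing them, since they need a live cluster; the jar path and HDFS directories are illustrative and will differ per installation.

```shell
# Dry-run sketch of the TeraSort suite. Replace 'echo' with direct
# execution on a real cluster. Paths below are illustrative.
JAR=/opt/cloudera/parcels/CDH/jars/hadoop-examples.jar
ROWS=10000000   # 10,000,000 rows x 100 bytes/row = ~1 GB

GEN="hadoop jar $JAR teragen $ROWS /hadoop/teragen"
SORT="hadoop jar $JAR terasort /hadoop/teragen /hadoop/terasort"
VAL="hadoop jar $JAR teravalidate /hadoop/terasort /hadoop/teravalidate"

# Step 1: generate input, Step 2: sort it, Step 3: validate the sort.
echo "$GEN"
echo "$SORT"
echo "$VAL"
```

Running the three commands in this order matters: TeraSort consumes TeraGen's output directory, and TeraValidate consumes TeraSort's.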
Teragen:
This program is available in hadoop-examples.jar (you can also use hadoop-*examples*.jar).
When run with no arguments, it asks you to specify the number of rows and the output directory.
Syntax: hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen <num rows> <output dir>
Each row is 100 bytes in size, so to generate 1 GB of data the number of rows is 10000000 (10,000,000 x 100 bytes = 1 GB), and the output will be stored in the HDFS directory /hadoop/teragen.
To change the block size of the generated data, pass the argument "-D dfs.block.size=<size in bytes>".
I’m generating 1GB of data and storing it in /hadoop/teragen hdfs directory.
[root@master ~]# hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen 10000000 /hadoop/teragen
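Since dfs.block.size takes its value in bytes, it helps to compute the number first. A sketch for a 256 MB block size (the jar path and output directory are illustrative):

```shell
# dfs.block.size expects bytes, so convert 256 MB first.
BLOCK=$((256 * 1024 * 1024))
echo "$BLOCK"   # 268435456

# The teragen invocation with the custom block size (shown, not run,
# since it needs a live cluster):
echo "hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen -D dfs.block.size=$BLOCK 10000000 /hadoop/teragen"
```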
Terasort:
The data generated from Teragen can be used as the input data for Terasort.
Syntax: hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar terasort <input dir> <output dir>
[root@master ~]# sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar terasort /hadoop/teragen /hadoop/terasort
Teravalidate:
TeraValidate checks the sorted output of TeraSort to ensure that the keys are sorted within each file. If anything is wrong with the sorted output, its reducer reports the problem.
Syntax: hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teravalidate <terasort output dir> <teravalidate output dir>
[root@master ~]# sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teravalidate /hadoop/terasort /hadoop/teravalidate
TestDFSIO:
The TestDFSIO benchmark is an I/O test, i.e., a read and write test for HDFS. It is useful for identifying the read/write speed of HDFS, i.e., of all the DataNode disks, and for understanding how fast the cluster is in terms of I/O.
When run without arguments, it shows the usage details.
Write Test:
Always run the write test first, then run the read test using the files generated by the write test.
Here, I'm running a write test that creates 3 files, each 100 MB in size, i.e., a total of 300 MB is written to HDFS. TestDFSIO writes the files to /benchmarks/TestDFSIO on HDFS, and the benchmark results are stored in a local file, TestDFSIO_results.log.
[root@master ~]# hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-test-2.6.0-mr1-cdh5.12.1.jar TestDFSIO -write -nrFiles 3 -size 100MB
Read Test:
[root@master ~]# hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-test-2.6.0-mr1-cdh5.12.1.jar TestDFSIO -read -nrFiles 3 -size 100MB
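After each run, the headline numbers can be pulled out of TestDFSIO_results.log with standard text tools. The log content below is a made-up illustration (the real values, and possibly the exact field labels, will differ on your cluster):

```shell
# Extract the throughput figure from a TestDFSIO results log.
# The sample log written here is illustrative, not real output.
cat > TestDFSIO_results.log <<'EOF'
----- TestDFSIO ----- : write
       Number of files: 3
Total MBytes processed: 300
     Throughput mb/sec: 35.2
EOF

TP=$(grep 'Throughput' TestDFSIO_results.log | awk -F: '{print $2}' | tr -d ' ')
echo "Write throughput: $TP mb/sec"
```

Comparing this figure between write and read runs, and across repeated runs, is a quick way to spot slow DataNode disks.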
Problem Scenarios:
- Run the TeraSort benchmarking suite on 4 TB of data and store the results in an HDFS output directory.
- Run TestDFSIO to test HDFS performance and store the results in an HDFS directory.
- Generate 1 TB of data with a block size of 256 MB using TeraGen.
- Troubleshoot any errors encountered in the benchmarking jobs and run them successfully.
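For the first scenario, the TeraGen row count can be derived the same way as in the 1 GB example, since each row is a fixed 100 bytes. A sketch (decimal units, matching the earlier example; the output directory is illustrative):

```shell
# 4 TB at 100 bytes per row, using decimal units as in the 1 GB example.
ROWS_4TB=$((4 * 1000 * 1000 * 1000 * 1000 / 100))
echo "$ROWS_4TB"   # 40000000000

# The corresponding teragen command (shown, not run):
echo "sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen $ROWS_4TB /hadoop/teragen4tb"
```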
Note:
This is one of the important exam topics, and you can definitely expect a question on benchmarking in the exam.
To know more about these benchmarks, check out this wonderful post, which explains them in detail.
That covers how to benchmark the cluster.
—
Use the comments section below to post your doubts, questions and feedback.
Please follow my blog to get notified of more certification related posts, exam tips, etc.