In HDFS, the blocks of a file are distributed among the DataNodes according to the replication factor. When you add a new DataNode, it only starts receiving and storing blocks of new files; existing blocks are not moved to it automatically. Though this sounds all right, the cluster is not balanced from an administrative point of view.
HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage.
For example, if the cluster utilization is 70% and the balancer threshold is 10%, the balancer will run until every DataNode's utilization is between 60% and 80% (that is, the cluster utilization minus or plus the threshold).
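The arithmetic behind that example can be sketched in a few lines of shell. This is only an illustration: `balanced_range` and `is_balanced` are hypothetical helper names, not HDFS commands, and the sketch assumes whole-number percentages, whereas the real balancer works with fractional utilization.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HDFS); whole-number percentages only.

# Print the lowest and highest DataNode utilization that count as balanced.
balanced_range() {
    cluster_util=$1
    threshold=$2
    echo "$((cluster_util - threshold)) $((cluster_util + threshold))"
}

# Succeed (exit status 0) when a node is within the threshold of the cluster.
is_balanced() {
    node_util=$1
    cluster_util=$2
    threshold=$3
    diff=$((node_util - cluster_util))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ "$diff" -le "$threshold" ]
}

balanced_range 70 10    # prints: 60 80

# Check a few sample node utilizations against a 70% cluster, 10% threshold.
for node_util in 55 65 85; do
    if is_balanced "$node_util" 70 10; then
        echo "node at ${node_util}%: balanced"
    else
        echo "node at ${node_util}%: needs rebalancing"
    fi
done
```

With a 70% cluster and a 10% threshold, the nodes at 55% and 85% fall outside the 60-80% range, so the balancer would keep moving blocks until they are inside it.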
To run the balancer, go to CM – HDFS – Instances – Balancer – Actions – Rebalance
The balancer starts by analyzing the utilization of the DataNodes rack by rack, and then node by node.
Once the balancing is complete, the balancer will stop automatically.
To change the balancer's threshold, go to CM – HDFS – Configuration – Scope ‘Balancer’ – Rebalancing Threshold and enter the desired threshold percentage.
The property ‘Maximum Concurrent Moves’ sets the maximum number of threads used by the DataNode balancer.
It is a throttling mechanism to prevent the balancer from taking too many resources from the DataNode and interfering with normal cluster operations.
Increasing the value lets the balancing process complete more quickly. Decreasing it makes rebalancing slower, but less likely to affect the resource utilization of other services.
To run balancer at specific bandwidth:
By default, the balancer runs at 10 MB per second; you can raise this limit to make the balancer complete its work faster.
$ hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
The bandwidth you specify here is the maximum number of bytes per second that each DataNode in the cluster will use for balancing.
Ensure that you have adequate bandwidth available before changing the balancer bandwidth. A higher bandwidth helps the balancer run faster, but it may hinder the cluster’s performance if sufficient bandwidth is not available.
- Balance the cluster with a 15% threshold for the nodes.
- Run the balancer with 100 MB bandwidth.
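From the command line, these two tasks might look roughly like the sketch below. It assumes that 100 MB means 100 × 1024 × 1024 bytes per second and that the `hdfs` client is on the PATH of a user allowed to run admin commands. The `run` wrapper is a hypothetical helper added here so the script only prints the commands on a machine without Hadoop installed. The bandwidth is set first so that the new limit is already in place when the balancer starts.

```shell
#!/bin/sh
# 100 MB expressed in bytes (assumption: 1 MB = 1024 * 1024 bytes).
BANDWIDTH_BYTES=$((100 * 1024 * 1024))
# Balancer threshold, as a percentage.
THRESHOLD_PERCENT=15

# Hypothetical wrapper: print each command, and execute it only
# when the hdfs client is actually installed on this machine.
run() {
    echo "+ $*"
    if command -v hdfs >/dev/null 2>&1; then
        "$@"
    fi
}

# Raise the per-DataNode balancer bandwidth.
run hdfs dfsadmin -setBalancerBandwidth "$BANDWIDTH_BYTES"

# Balance until every DataNode is within 15% of the cluster utilization.
run hdfs balancer -threshold "$THRESHOLD_PERCENT"
```

A bandwidth set with `-setBalancerBandwidth` is applied to the running DataNodes without a restart, but it is not persisted in the configuration, so it is worth checking the value configured in CM as well.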
Thus we covered how to rebalance the cluster.