Hadoop

Rebalance the cluster

In HDFS, the blocks of each file are distributed among the DataNodes according to the replication factor. When you add a new DataNode, it starts receiving and storing blocks of new files only; existing blocks are not moved onto it. That sounds fine, but from an administrative point of view the cluster is no longer balanced: the new node is nearly empty while the older nodes stay heavily used.

HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage.

For example, if the cluster utilization is 70% and the balancer threshold is 10%, the balancer runs until the utilization of every DataNode is between 60% and 80% (cluster utilization minus or plus the threshold).
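
The target range in this example is simple arithmetic. Here is a minimal shell sketch of it, assuming whole-number percentages. The function name `balance_window` is hypothetical and only illustrates the calculation; it is not part of HDFS.

```shell
# Hypothetical helper (not part of HDFS): print the utilization range the
# balancer aims for. $1 = cluster utilization in %, $2 = threshold in %.
# Assumes whole-number percentages.
balance_window() {
  lower=$(( $1 - $2 ))
  upper=$(( $1 + $2 ))
  # Utilization cannot be below 0% or above 100%.
  if [ "$lower" -lt 0 ]; then lower=0; fi
  if [ "$upper" -gt 100 ]; then upper=100; fi
  echo "${lower}-${upper}"
}

balance_window 70 10   # prints 60-80
```

A DataNode whose utilization falls outside the printed range is one the balancer would still try to move blocks to or from.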

To run the balancer, go to Cloudera Manager (CM) – HDFS – Instances – Balancer – Actions – Rebalance

The balancer first analyzes the utilization of the DataNodes rack by rack and then node by node.

Once the balancing is complete, the balancer will stop automatically.

To change the balancer threshold:

CM – HDFS – Configuration – Scope ‘Balancer’ – rebalancing threshold – enter the desired threshold ratio
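
Outside Cloudera Manager, the same threshold can be passed on the command line with the `-threshold` option of `hdfs balancer`. This is a minimal sketch, assuming it runs as an HDFS superuser on a host with the HDFS client configured. The wrapper function `run_balancer` and the `DRY_RUN` variable are hypothetical conveniences, added only so the command can be previewed before it is run.

```shell
# Hypothetical wrapper around the real `hdfs balancer -threshold <percent>` command.
# With DRY_RUN=1 it only prints the command instead of running it.
run_balancer() {
  threshold=${1:-10}   # threshold in percent; 10 is the HDFS default
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hdfs balancer -threshold ${threshold}"
  else
    hdfs balancer -threshold "${threshold}"
  fi
}

DRY_RUN=1 run_balancer 10   # prints: hdfs balancer -threshold 10
```

A smaller threshold gives a more evenly balanced cluster but means more blocks to move and a longer run.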

The property ‘Maximum Concurrent Moves’ sets the maximum number of threads used by the DataNode balancer.

It is a throttling mechanism that prevents the balancer from taking too many resources from the DataNode and interfering with normal cluster operations.
Increasing the value lets the balancing process complete more quickly; decreasing it makes rebalancing slower but less likely to affect the resource utilization of other services.

To run the balancer at a specific bandwidth:

By default, the balancer runs at 10 MB per second; you can raise this limit to make the balancer complete its work faster.

$ hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>

The bandwidth you specify here is the maximum number of bytes per second that each DataNode in the cluster will use for balancing.

Ensure that you have adequate bandwidth available before changing the balancer bandwidth. A higher value helps the balancer run faster, but it may hurt the cluster’s performance if sufficient network bandwidth is not available.
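
Because the command expects bytes per second, a value given in megabytes has to be converted first. This is a minimal sketch, assuming 1 MB means 1024 × 1024 bytes. The helper `mb_to_bytes` is hypothetical, and the `hdfs dfsadmin` line is left commented out because it changes a live cluster.

```shell
# Hypothetical helper: convert megabytes to bytes (1 MB taken as 1024 * 1024 bytes).
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

bandwidth=$(mb_to_bytes 50)   # example value: 50 MB per second
echo "$bandwidth"             # prints 52428800

# Apply the value on a real cluster (requires HDFS superuser privileges):
# hdfs dfsadmin -setBalancerBandwidth "$bandwidth"
```

A value set with `-setBalancerBandwidth` takes effect at runtime and is not persistent on the DataNode, so a restarted DataNode goes back to its configured bandwidth.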

 

Problem Scenarios:

  • Balance the cluster with 15% threshold for the nodes.
  • Run the balancer with 100MB bandwidth

 

Thus we covered how to rebalance the cluster.

Use the comments section below to post your doubts, questions and feedback.


6 thoughts on “Rebalance the cluster”

    1. There’s no configuration available to set balancer’s bandwidth in Cloudera Manager.

      But you can use the below command in the balancer node to set the bandwidth, before rebalancing the cluster.

      hdfs dfsadmin -setBalancerBandwidth <newbandwidth>

      newbandwidth is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation.

  1. Hi Kannan
    Is there a difference between running the command below from the command line
    and setting DataNode Balancing Bandwidth in Cloudera Manager?

    dfsadmin -setBalancerBandwidth

    Which one is correct for changing balancing bandwidth before running hdfs balancer?

    1. Hi Venkat,

      Both of them will work.

      If you set it in CM, then by default all balancer jobs will run with the specified bandwidth.

      If you set it from the command line, that particular balancer run uses the given value, and the rest take the bandwidth from the configuration.

      Hope this helps.
