How to increase the HDFS capacity of AWS Elastic Mapreduce EMR cluster

In this tutorial, we’re going to see how to increase the hdfs capacity of a running EMR cluster.

Sometime back, we received an alert that HDFSutilization was high on one of our cluster. Upon checking, the usage is an expected one but we under provisioned the storage capacity during the creation of the cluster and its capacity has to be increased.

The straightforward solution from AWS support team is to add another core node to the cluster and hdfs capacity will be automatically increased.

Even in EMR documentation [1] , they recommend below steps for HDFS resizing.

  • If the calculated HDFS capacity value is smaller than your data, you can increase the amount of HDFS storage in the following ways:
  • Creating a cluster with additional EBS volumes or adding instance groups with attached EBS volumes to an existing cluster
  • Adding more core nodes
  • Choosing an EC2 instance type with greater storage capacity

Not a ideal long term solution

Adding another instance just for the disk resource doesn’t make sense as you’ll be charged for the instance. Why would you pay for the compute capacity, when all you need is storage resource. Also it’s neither a cost effective nor a long term solution.

Moreover we created the core node instance group with the disk size of 32GB and so you can’t create a new core instance group, only increase the count of existing instance groups.

So to increase the hdfs capacity by 100+ GB, we have to add 4 more core nodes (32GB *4 = 128GB) which is a dumb move.  To know more about instance groups , check https://www.hadoopandcloud.com/hadoop/aws-emr-uniform-instance-groups/

I performed hdfs expansion activity multiple times in Cloudera distribution by adding additional disks, so this should be definitely doable in EMR. After further discussion with EMR SMEs, we figured out two solutions to increase the capacity.

  • Modify the EBS volume of the core node – On the fly change, no service restart required.
  • Add a new EBS volume to the core node – Increase HDFS fault tolerance, requires service restart.

Solution 1: Modify the EBS volume of the core node

In this approach, we can increase the hdfs capacity by increasing the ebs volume size. This change will be applied on the fly and it doesn’t require any service restarts.

In the below example, cluster is provisioned with 1 core node with disk capacity of 50GB and hdfs 45 GB. Let’s increase it to 100GB.

Namenode UI

Filesystem information before the changes:

check filesystem usage
df -Ph: Display about filesystem information and free space.


 To modify the EBS volume of core node:

  1. Open EMR console: https://console.aws.amazon.com/elasticmapreduce/home
  2. Select your cluster –> inside your cluster, click on ‘Hardware’ tab –> Click in Instance group ID [for core] –> from the list of core instance, for the core node for which volume size is to be increased, click ‘View All’ under ‘EBS volume per instance’ column –> click on volume Id will take you to EC2 volume page.
  3. Click ‘Actions’ –> click ‘Modify Volume’ –> enter the required size (100G) –> click Modify
  4. This modification will take some time, wait till the State of volume is in ‘in-use – completed (100%)’
ebs volume resizing
EBS volume modification

After volume modification, check if the resize was successful by running lsblk in the server. lsblk command displays your available disk devices and their mount points.

lsblk
lsblk: lists information about all available or the specified block devices

We can see here that xvdb volume is increased to 100G, but partition xvdb2 (/mnt), HDFS is still at 45G. So we should extend the partition use the available space.

To extend the partition:

$ sudo growpart devicename partition-number

disk growpart
growpart: extend a partition in a partition table to fill available space

Now the /mnt is resized to 95G.

But this change is not reflecting in the filesystem yet.

filesystem usage

Because we only extended the partition, so we have to extend the filesystem to reflect the changes.

$ sudo xfs_growfs device

aws growfs
growfs: expands a File System

Ref https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html to know more about EBS volume resizing.

After resize:

After the filesystem resize, we can see the hdfs capacity is increased in the namenode ui.

emr hdfs capacity

Thus, We have successfully increased the hdfs capacity without any service restarts. We applied this change in our production cluster and it worked like a charm.

In this way, we can vertically scale the EBS volume of the existing core node and increase the hdfs capacity.

In part 2, I’ll write about how we can increase the hdfs capacity by adding a new EBS volume to the core node which will increase fault tolerance to the cluster.

If you like this post, please share your valuable comments and feedback which are always welcome! Also feel free to subscribe to blog to get notified of new posts by email.

Leave a Reply

Your email address will not be published. Required fields are marked *