In this post, I listed some of the key cloudwatch metrics that can be used for the monitoring AWS EMR – Elastic Mapreduce cluster monitoring.
As you know, EMR can be used either as a transient cluster or long running cluster (EMR introduction). If it’s transient, then the cluster will run till the lifetime of the job and terminate once the job ends. For the transient ones, you wouldn’t require any monitoring as the cluster itself lives for a short time.
In my company, We use EMR as a persistent cluster, it is always up with a master and core nodes. This requires dedicated monitoring because if something goes wrong with the cluster then multiple production jobs would be affected.
So, I’ve created cloudwatch alarms for the following metrics in each of the clusters. Though cloudwatch provides 39 metrics (job flow metrics) for the cluster, We would need only a handful of them which covers the critical aspects of the cluster.
Note: Image shows 156 Job flow metrics, it’s for 4 clusters (39*4).
This is an alarm based on EC2 instance metrics not EMR. I included this because EMR master node is the single point of failure and so it’s recommended to create more than adequate alarms covering the master node.
This alarm checks the ‘StatusCheckFailed’ metrics of EC2 instance which covers both instance and system status checks. If the metric value becomes > 1, it means either one of the status checks failed, which most likely requires the stop/start of the instance and when you stop the master node, the instance followed by cluster are terminated. Since EMR doesn’t offer high availability for the master node, you’ve to create a new cluster.
So, setting up this alarm notification will help you to notice the failure at the earliest, inform the stakeholders and plan to migrate to the new cluster.
The percentage of data(core) nodes that are communicating to master. Ideally this metric value should be 100%. If it’s less than 100, then one or multiple data nodes are not reporting to the master. This usually happens when the core node is down or terminated.
So, this alarm notification will help you to identity the issues with core nodes and take appropriate action. The metric value will become 100% once the core node is brought up.
If any of the running core nodes are terminated, EMR will automatically spin up a new core node. Even though you have the desired number of core nodes running, the terminated core node will still remain part of the cluster and EMR will consider the terminated one as a decommissioned node.
This will reduce the percent of LiveDataNodes metric as the metric is basically the ratio of [MR active Nodes / MR total Nodes]. You can still change the alarm for metric threshold as <80%, but that doesn’t look nice.
So, in an unlikely scenario of where your core node is decommissioned and LiveDataNodes report <100%, you can configure the cloudwatch alarm for CoreNodesRunning < number of corenodes in your cluster.
The NodeManager runs healthcheck services to determine the health of the node it is executing on. The services perform healthchecks on the disks of the nodes. If any health check fails, the NodeManager marks the node as unhealthy and communicates this to the ResourceManager, which then stops assigning containers to the node. By default, it uses disk checker as the healthcheck.
The disk checker checks the state of the disks in nodemanager, whether the disks have enough free space, proper permissions, filesystem is not in a read-only mode, etc. If any of the check fails, then the node would be reported to Resourcemanager as unhealthy.
Resource Manager will stop assigning containers to the node if the node is reported unhealthy for more than 45 minutes.
You can set up a notification for this alarm to warn you of unhealthy nodes before the 45-minute timeout is reached and take proactive actions before Resourcemanager stops assigning containers to the node.
Alternatively, you can also adjust the yarn property “yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb” as per your requirements, which is one of the main health checks. You can read more about his on hadoop.apache.org.
The HDFSUtilization metric is the percentage of disk space currently used in HDFS. If the usage becomes high, such as 85% of capacity used, you need to increase the hdfs capacity or clean up unwanted files.
If the hdfs replication factor is 3, then the usage will increase faster as each file will be replicated thrice causing 3x disk space. So having this alarm notification will help you in managing the hdfs in good health
AppsPending metric tells you the number of applications submitted to YARN that are in a pending/accepted state. Sometimes couple of rogue jobs could run forever and block the cluster. The jobs submitted during that time would be in accepted state.
It’s normal for 4 or 5 jobs to be in accepted state any time. Being a cluster administrator, you would know the schedule of the jobs and approximate pending apps at a given time. But if there are 20+ jobs are in accepted state, it means either someone is running any adhoc job or some job gone rogue/idle, blocked the cluster.
Having this metric alarm notification will help you to identify any such unusual number of jobs in pending state in the cluster.
These are some of the metrics I found very helpful for monitoring the EMR cluster. I’ve configured alarm notification to email subscription in SNS topics. You can integrate the alarm notifications with pagerduty, based on your SLAs and for quick turnaround.
For any questions or feedback, please post it in the comments section below.
If you found this post useful, kindly like, share the post, and subscribe to my blog for more useful posts. Cheers!