Script to find the long running hadoop jobs

YARN Resource Manager long running applications

Being a Hadoop admin, one of your roles is to track long running Hadoop jobs and take appropriate action on them. If it’s a small cluster and only 4 or 5 jobs are running most of the time, you can monitor them in the Resource Manager UI. But if the cluster is huge and 100+ jobs are running, you can’t manually check the run time of each job to find the long running ones; that’s counterproductive. And when you manage multiple clusters, checking for long running jobs becomes even more tedious.

I faced the same scenario at my work, where some jobs got stuck and ran for more than a day, blocking the core node resources, which often left multiple other jobs sitting in the ACCEPTED state.

So I wrote the Python script below, which prints the list of jobs that have been running for more than the specified number of hours.

Github link: https://github.com/KannanAK/python/blob/master/print_yarn_running_apps.py

# This is the script to get the list of applications which are running for more than N hours

import json, urllib.request
# REST API URL of the Resource Manager, filtered to fetch only the applications in RUNNING state
rm="http://cluster_url:8088/ws/v1/cluster/apps?states=RUNNING"

# Setting the threshold. In RM, time duration is measured in milliseconds
threshold=3600000
# Given 1 hour as threshold. You can change it as per requirements.
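# (Side note, not part of the original script: the threshold could also be derived from an
#  hours value instead of hard-coding milliseconds, e.g. threshold = hours_limit * 60 * 60 * 1000
#  with a hypothetical hours_limit variable set to 1.)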

# Calling the RM API and storing the response as JSON. decode('utf8') is added because Python
# versions below 3.6 require a str (not bytes) for json.loads.
# You can inspect the result by printing the variable data in the Python interpreter.
with urllib.request.urlopen(rm) as response:
    data=json.loads(response.read().decode('utf8'))
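# (For reference, the parsed response roughly has this shape -- inferred from the fields used below:
#  {"apps": {"app": [{"id": "...", "name": "...", "elapsedTime": 1234567,
#                     "trackingUrl": "http://...", ...}, ...]}})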

print ("Please find the list of long running jobs.")

# The json has a dictionary key 'apps' whose value holds the applications (again as nested key-value pairs).
# Now we iterate through each app and check whether its elapsed time is more than our threshold (1 hour).
# If an app has been running for more than an hour, its details are printed; if none has, a message says so.

found = False
for running_apps in data['apps']['app']:
    if running_apps['elapsedTime'] > threshold:
        found = True
        print ("\nApp Name: {}".format(running_apps['name']))
        print ("Application id: {}".format(running_apps['id']))
        # Elapsed time is given in milliseconds. So it is divided by 1000, then by 60, then by 60 again to convert it to hours.
        print ("Total elapsed time: {} hours".format(round(running_apps['elapsedTime']/1000/60/60)))
        print ("Tracking Url: ", running_apps['trackingUrl'])

if not found:
    print ("No long running jobs!")


This will give us output in the following format.

Please find the list of long running jobs.

App Name: Testjob
Application id: application_123456789123_123
Total elapsed time: 3 hours
Tracking Url: http://randomnode:port/proxy/application_123456789123_123

The above script runs properly only when there are jobs in the RUNNING state. If no jobs are running, the API returns an empty 'apps' element, and the script throws an error when it tries to loop through data['apps']['app'].

To avoid the exception, we can change the logic as below.

if data['apps'] is None:
    print ("No jobs are running in the cluster now.")
else:
    for running_apps in data['apps']['app']:
        # ... same loop body as above ...

If no jobs are running, then the script will produce the following output.

Please find the list of long running jobs.

No jobs are running in the cluster now.

Automation use cases:

The above script only prints out the jobs that have been running for longer than the specified number of hours. A couple of ways to take it further:

  • We can schedule this script in crontab to check for long running jobs every hour and send an email notification to the Ops team or to the concerned application team.
  • We can automatically kill the long running applications by running the 'yarn application -kill' CLI command with running_apps['id'] via the subprocess module, and then send an email notification (a rough sketch follows below).
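As a rough sketch of the second idea (and only a sketch: the kill_and_notify helper, the SMTP host, and the sender/recipient addresses are placeholders of mine, not part of the original script), the loop body could shell out to the YARN CLI and then send a notification:

import subprocess, smtplib
from email.message import EmailMessage

def kill_and_notify(app):
    # Kill the application via the YARN CLI: yarn application -kill <application id>
    subprocess.run(["yarn", "application", "-kill", app['id']], check=True)

    # Build a short notification mail for the Ops / application team (placeholder addresses).
    msg = EmailMessage()
    msg['Subject'] = "Killed long running YARN app {}".format(app['id'])
    msg['From'] = "hadoop-admin@example.com"
    msg['To'] = "ops-team@example.com"
    msg.set_content("App {} ({}) exceeded the threshold and was killed.\n"
                    "Tracking URL: {}".format(app['name'], app['id'], app['trackingUrl']))

    # Send through your SMTP relay (placeholder hostname).
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

For the crontab idea, an hourly entry pointing at the script (for example 0 * * * * /usr/bin/python3 /path/to/print_yarn_running_apps.py, with the path adjusted to wherever you keep it) would be enough. In practice you would also want a dry-run option or a much higher threshold before letting anything kill jobs automatically.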
