A job or application can fail for many reasons: code errors, environment problems, missing files, permissions, MapReduce/YARN configuration, resource allocation, and even server I/O or network issues.
So the first thing to do when a job fails is to read the error message and correlate it with what your job was doing.
If an application fails before it is even submitted, the problem is probably on the server/node the job is launched from, or that node may be unable to contact the ResourceManager.
If the application fails in the mappers or reducers, the error message will appear in the task logs and also in the ResourceManager (RM) UI for the job.
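As a rough illustration of reading the logs with a purpose, here is a minimal Python sketch that scans log text for a few well-known error strings and labels each hit. The signature table, the category names, and the function name classify_log are assumptions made for this example, not an official list; real logs will contain many other messages.

```python
import re

# Hypothetical signature table: the patterns are strings that commonly show up
# in Hadoop/Java logs; the category labels are made up for this example.
SIGNATURES = [
    (re.compile(r"java\.lang\.OutOfMemoryError|Java heap space"), "memory"),
    (re.compile(r"AccessControlException|Permission denied"), "permissions"),
    (re.compile(r"Connection refused|ConnectException"), "connectivity"),
    (re.compile(r"FileNotFoundException|No such file or directory"), "missing file"),
]


def classify_log(log_text):
    """Return (line_number, category, line) for each line matching a signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for pattern, category in SIGNATURES:
            if pattern.search(line):
                hits.append((number, category, line.strip()))
                break  # first matching category wins for this line
    return hits
```

The log text itself could come from the output of yarn logs -applicationId <application id> saved to a file. The sketch only narrows down where to look; you still have to read the surrounding lines to understand the actual failure.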
The most common memory-related errors in the mapper stage are:
· Java out-of-memory exception (java.lang.OutOfMemoryError)
· Java heap space error
In those cases, increase both the mapper heap size (mapreduce.map.java.opts) and the mapper container memory (mapreduce.map.memory.mb).
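The two settings have to move together, because the JVM heap must fit inside the YARN container. The Python sketch below derives a matching pair of values from a chosen container size. The function name and the 0.8 heap fraction are assumptions for illustration (roughly 80% is a commonly quoted rule of thumb, not a fixed requirement); mapreduce.map.java.opts is the property that carries the mapper JVM options such as -Xmx.

```python
def map_memory_settings(container_mb, heap_fraction=0.8):
    """Return matching container and heap settings for map tasks.

    heap_fraction is an assumed rule of thumb: the JVM heap gets this share
    of the container, and the rest is left for non-heap memory.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
    }
```

For example, map_memory_settings(2048) pairs a 2048 MB container with a 1638 MB heap. For jobs that accept generic options, such values can typically be supplied as -D properties at submit time instead of editing the cluster configuration.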
If the job fails during the final stages, after the mapper and reducer tasks have completed, the likely reason is that the job failed to write its output. Check where it is writing and the permissions on the output directory/files.
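To make the permission check concrete, here is a small Python sketch that reads one line of hdfs dfs -ls -d output and reports whether the write bit applies to a given user. The function name can_write and the sample path are hypothetical, and the check is deliberately simplified: it ignores ACLs, the HDFS superuser, and the execute bit that is also needed to create files inside a directory.

```python
def can_write(ls_line, user, groups=()):
    """Check the write bit that applies to `user` in one `hdfs dfs -ls -d` line.

    Simplified on purpose: ACLs and superuser privileges are not considered.
    """
    fields = ls_line.split()
    # Expected layout: mode, replication, owner, group, size, date, time, path
    mode, owner, group = fields[0], fields[2], fields[3]
    if user == owner:
        return mode[2] == "w"  # owner write bit
    if group in groups:
        return mode[5] == "w"  # group write bit
    return mode[8] == "w"      # "other" write bit


# Hypothetical listing line for an output directory
line = "drwxr-xr-x   - hdfs hadoop          0 2017-03-01 10:15 /user/hdfs/output"
print(can_write(line, "hdfs"))             # owner of the directory
print(can_write(line, "bob", ["hadoop"]))  # a group member
```

With the sample line, the owner has the write bit and the group member does not, which is exactly the kind of mismatch that makes a job fail only when it tries to write its output.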
Note: Troubleshooting a job sometimes requires tuning the YARN configuration, which takes a firm understanding of YARN internals, containers, MapReduce, etc. In the exam, however, you may be asked to identify the reason for a failure and troubleshoot it, which doesn't involve tuning the cluster but simply applying reasoning.
So the key to solving these issues is going through the logs, understanding the error, and troubleshooting accordingly.
Use the comments section below to post your doubts, questions and feedback.
Please follow my blog to get notified of more certification related posts, exam tips, etc.