Data Locality optimization is the method of running computation closer to the node where the actual data resides.
Since Hadoop is dealing with large amount of data, network bandwidth is one of the valuable resources for them.
Hadoop does its best to run the map task on a node where the blocks of input data resides, so that it won’t consume much of cluster bandwidth.
There are possible cases available with respect to data locality optimization.
As you know, the datasets are stored as blocks in datanodes which are accessible via HDFS. When you run the mapreduce job, the job scheduler will look for the free slots in nodemanager where the HDFS block resides.
If all nodes having data block replicas are running other tasks, then the job scheduler will look for a node in the same rack where the nodes with data block resides. This way the data transfer will be happening between two machines in the same rack.
There is also an assumption that two machines in the same rack have more bandwidth and lower latency between each other than two machines in two different racks.
In occasional cases, when both the local node and nodes in same rack are running other tasks, then the scheduler will assign the map tasks to the node which is not part of the rack and doesn’t have the HDFS block either.
In order to access the data required for map task, this node will copy the blocks from the nodes in other rack. This will consume lot of network bandwidth compared to the data local, rack local methods.