DistCp (distributed copy) is a tool used for large inter/intra-cluster copying of HDFS data. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
For large data sets it is much faster and more efficient than the ordinary “cp” command, because the copy runs in parallel across map tasks.
To copy data in the same cluster between different HDFS directories:
# hadoop distcp /source_path /user/destination_path
To copy between two different clusters, address each NameNode by its RPC port (commonly 8020), not the 50070 web UI port:
# hadoop distcp hdfs://cluster1:8020/source_path hdfs://cluster2:8020/destination_path
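The commands above can be wrapped in a small helper when the same copy is repeated. The following is only a sketch: the function name `distcp_cmd`, the host names, the port 8020, and the paths are placeholders for illustration, not values from a real cluster. It prints the command instead of running it, so the result can be reviewed first. The `-update` and `-m` options are standard DistCp options; the map count of 10 is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a DistCp command line and print it for review.
# Arguments: source URI, destination URI, optional maximum number of map tasks.
distcp_cmd() {
  local src="$1" dst="$2" maps="${3:-20}"
  # -update copies only files that are missing or different at the target.
  # -m sets the maximum number of simultaneous map tasks.
  echo "hadoop distcp -update -m ${maps} ${src} ${dst}"
}

# Placeholder cluster addresses; replace with your own NameNode hosts and ports.
distcp_cmd hdfs://cluster1:8020/source_path hdfs://cluster2:8020/destination_path 10
```

Once the printed command looks right, it can be run directly. Note that with `-update` DistCp copies the contents of the source directory into the target, which differs from the default behaviour, so check the resulting layout on a small directory first.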
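DistCp runs as a map-only job: it uses mappers but no reducers. Avoid changing the source while the copy is running. If a source file is moved, deleted, or still being written before it has been copied, the copy of that file is likely to fail.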