Efficiently copy data within a cluster/between clusters

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying of HDFS data. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.

It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

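For large data sets it is much faster and more efficient than the ordinary “cp” command, because the copy runs in parallel across the cluster instead of through a single client.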
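The number of map tasks, and so how finely the file list is split up, can be capped with DistCp's -m option (maximum number of simultaneous copies). Below is a minimal sketch, assuming a hypothetical helper named build_distcp_maps that only prints the command so it can be reviewed before it is run. The value 20 is an arbitrary example, not a recommendation.

```shell
#!/usr/bin/env bash
# Sketch only: build_distcp_maps is a hypothetical helper, not part of Hadoop.
# It prints (does not execute) a DistCp command that caps the map-task count.
build_distcp_maps() {
  local maps="$1" src="$2" dst="$3"
  # -m sets the maximum number of simultaneous copies (map tasks).
  printf 'hadoop distcp -m %s %s %s\n' "$maps" "$src" "$dst"
}

# Example: print a command that uses at most 20 map tasks.
build_distcp_maps 20 /source_path /user/destination_path
```

Printing the command first is a design choice for this sketch. Once the output looks right, it can be pasted into a shell on a node with the Hadoop client installed.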

To copy data in the same cluster between different HDFS directories:

# hadoop distcp /source_path /user/destination_path

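To copy between two different clusters, address each NameNode by its RPC port (commonly 8020). Port 50070 is the NameNode web UI port in Hadoop 2.x and does not accept hdfs:// connections:

# hadoop distcp hdfs://cluster1:8020/source_path hdfs://cluster2:8020/destination_path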
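For repeated copies between clusters, DistCp's -update and -overwrite options control what happens to files that already exist at the destination. The sketch below assumes a hypothetical helper named distcp_between that only prints the command. The NN_PORT value of 8020 is also an assumption (a common NameNode RPC port) and should be replaced with the cluster's own setting.

```shell
#!/usr/bin/env bash
# Sketch only: distcp_between is a hypothetical helper, not part of Hadoop.
# It prints (does not execute) a cross-cluster DistCp command for a given mode:
#   update    -> -update    (copy files that are missing or differ at the target)
#   overwrite -> -overwrite (copy everything, replacing existing target files)
distcp_between() {
  local mode="$1" src="$2" dst="$3" flag
  case "$mode" in
    update)    flag="-update" ;;
    overwrite) flag="-overwrite" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo "hadoop distcp $flag $src $dst"
}

# Assumption: NameNode RPC port; substitute the value used by your clusters.
NN_PORT=8020
distcp_between update "hdfs://cluster1:${NN_PORT}/source_path" "hdfs://cluster2:${NN_PORT}/destination_path"
```

With either option, DistCp copies the contents of the source directory into the destination rather than the source directory itself, so it is worth checking the resulting layout on a small test path first.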

 

Note:

DistCp runs as a map-only job; it uses no reducers. If the source content is modified while DistCp is running, the job is likely to fail, so avoid writing to the source path during the copy.
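One common way to protect a copy from changes at the source is to copy from an HDFS snapshot, which is a read-only, point-in-time view of a directory. The sketch below assumes that snapshots are an option on the cluster. The helper name snapshot_copy_plan and the snapshot name distcp_snap are hypothetical, and the helper only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: snapshot_copy_plan is a hypothetical helper, not part of Hadoop.
# It prints (does not execute) the commands for copying from a snapshot
# instead of the live directory.
snapshot_copy_plan() {
  local src="$1" dst="$2" snap="$3"
  # An HDFS administrator must first mark the directory as snapshottable.
  echo "hdfs dfsadmin -allowSnapshot $src"
  # Take a read-only, point-in-time snapshot of the source directory.
  echo "hdfs dfs -createSnapshot $src $snap"
  # Copy from the snapshot path, which does not change while the job runs.
  echo "hadoop distcp $src/.snapshot/$snap $dst"
}

snapshot_copy_plan /source_path /user/destination_path distcp_snap
```

The first command needs HDFS administrator rights, so on a shared cluster this step may have to be requested rather than run directly.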
