Understanding HDFS commands with examples

Karthik Sharma
Jun 1, 2021

Hadoop Distributed File System (HDFS) is the file system of Hadoop, designed for storing very large files on clusters of commodity hardware. Generally, when a dataset outgrows the storage capacity of a single machine, it becomes necessary to partition it across a number of separate machines. File systems that manage storage across a network of machines are called distributed file systems.

In this post, we will discuss the different HDFS commands with examples. Most of these commands behave much like their Unix counterparts.

1. mkdir:

This command is similar to the Unix mkdir and is used to create a directory in HDFS.

hdfs dfs [-mkdir [-p] <path> …]

Without -p, the command throws an error if the directory already exists or if the intermediate directories don't exist. The -p (parent) option overcomes this: it ignores an already-existing directory and also creates any missing intermediate directories.
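
For example, the following creates a nested directory in a single step (the path /user/demo/logs is hypothetical):

hdfs dfs -mkdir -p /user/demo/logs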

2. ls:

This command is used for listing the directories and files under the given path in HDFS (or under the user's home directory when no path is given).

hdfs dfs [generic options] -ls [-d] [-h] [-R] [<path> …]

  • -d is used to list the directories as plain files.
  • -h is used to print file sizes in a human-readable format.
  • -R is used to recursively list the contents of the directories.
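
For example, to recursively list everything under a hypothetical /user/demo directory with human-readable sizes:

hdfs dfs -ls -h -R /user/demo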

3. cat:

This command is used for displaying the contents of a file on the console.

hdfs dfs [-cat [-ignoreCrc] <src> …]

  • -ignoreCrc option will disable the checksum verification.
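
Example, assuming a file /user/demo/sample.txt exists in HDFS:

hdfs dfs -cat /user/demo/sample.txt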

4. appendToFile:

This command appends the contents of one or more local source files to the given destination file in HDFS. If the destination file doesn't exist, the command creates it automatically.

hdfs dfs [-appendToFile <localsrc> … <dst>]

We can also append a local file to an existing file in HDFS.
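
For example, to append two hypothetical local files to a (hypothetical) HDFS file:

hdfs dfs -appendToFile log1.txt log2.txt /user/demo/combined.log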

5. checksum:

This command is used to obtain the checksum information of a file in HDFS.

hdfs dfs [-checksum <src> …]
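
Example, with a hypothetical path:

hdfs dfs -checksum /user/demo/sample.txt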

6. chgrp, chown & chmod:

  • chgrp command is used to change the group of a file or a path.
  • chown command is used to change the owner of a file or a path.
  • chmod command is used to change the permissions of a file.

hdfs dfs [-chgrp [-R] GROUP PATH…]
hdfs dfs [-chmod [-R] <MODE[,MODE]… | OCTALMODE> PATH…]
hdfs dfs [-chown [-R] [OWNER][:[GROUP]] PATH…]

  • -R option is used to modify the files recursively.
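
For example, assuming a hypothetical /user/demo directory, a user hadoop, and a group analysts:

hdfs dfs -chgrp -R analysts /user/demo
hdfs dfs -chown -R hadoop:analysts /user/demo
hdfs dfs -chmod -R 755 /user/demo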

7. put:

This command is used to copy files from the local file system to HDFS. It is closely related to copyFromLocal: copyFromLocal restricts the source to a local file reference, while put is more general and can also read its input from stdin.

hdfs dfs -put [-f] [-p] [-l] [-d] [- | <localsrc> …] <dst>

  • -f overwrites the destination if it already exists.
  • -p preserves access and modification times, ownership, and permissions.
  • -d skips creation of a temporary file with the suffix ._COPYING_.
  • -l allows the DataNode to lazily persist the file to disk and forces a replication factor of 1.
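
For example, to copy a hypothetical local file sales.csv into /user/demo, overwriting any existing copy:

hdfs dfs -put -f sales.csv /user/demo/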

8. get:

This command is used to copy files from HDFS to the local file system. With copyToLocal the destination is restricted to a local file reference, whereas get has no such restriction.

hdfs dfs [generic options] -get [-f] [-p] [-ignoreCrc] [-crc] <src> … <localdst>
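
For example, to copy a hypothetical HDFS file into the current local directory:

hdfs dfs -get /user/demo/sample.txt .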

9. copyFromLocal:

This command is similar to the put command except that the source is restricted to the local file system. It is used to copy a file from the local file system to HDFS.

hdfs dfs [generic options] -copyFromLocal [-f] [-p] [-l] [-d] <localsrc> … <dst>

  • -f overwrites the destination if it already exists.
  • -p preserves access and modification times, ownership, and permissions.
  • -d skips creation of a temporary file with the suffix ._COPYING_.
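
Example, with hypothetical paths:

hdfs dfs -copyFromLocal -f sales.csv /user/demo/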

10. copyToLocal:

This command is similar to the get command except that the destination is restricted to the local file system. It is used to copy a file from HDFS to the local file system.

hdfs dfs -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] URI <localdst>
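
Example, with hypothetical paths:

hdfs dfs -copyToLocal /user/demo/sample.txt /tmp/sample.txt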

11. moveFromLocal:

This command moves a file from the local file system to HDFS: once the copy succeeds, the local copy is deleted.

hdfs dfs -moveFromLocal <localsrc> <dst>
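
Example, with hypothetical paths (note that sales.csv is removed from the local disk after the copy succeeds):

hdfs dfs -moveFromLocal sales.csv /user/demo/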

12. count:

This command is used to count the number of directories, files, and bytes under the path that matches the provided file pattern.

hdfs dfs -count [-q] [-h] [-v] [-x] [-t [<storage type>]] [-u] [-e] <paths>

The count command output has the columns DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME.

  • -h option shows sizes in human readable format.
  • -q means show quotas, the output is QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.
  • -u limits the output to show quotas and usage only. The output is QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, PATHNAME.
  • -v option displays a header line.
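
For example, to show quotas and counts for a hypothetical directory, with human-readable sizes:

hdfs dfs -count -q -h /user/demo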

13. cp:

This command is used for copying files from one directory to another directory within the HDFS.

hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI …] <dest>
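
For example, to copy a file between two hypothetical HDFS directories, overwriting the destination if it exists:

hdfs dfs -cp -f /user/demo/sample.txt /user/archive/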

14. df:

This command is used to show the capacity, free space, and used space of the HDFS file system.

hdfs dfs -df [-h] URI [URI …]
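
Example, in human-readable form:

hdfs dfs -df -h /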

15. du:

This command is used to show the amount of space, in bytes, used by the files that match the specified file pattern.

hdfs dfs -du [-s] [-h] [-v] [-x] URI [URI …]

  • -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, calculation is done by going 1-level deep from the given path.
  • -h option will format file sizes in a “human-readable” fashion (e.g. 64.0m instead of 67108864).
  • -v option will display the names of columns as a header line.

The du command output format is “size disk_space_consumed_with_all_replicas full_path_name”. Since the replication factor of my cluster is 1, we can notice that both values are the same.
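
For example, to show an aggregate, human-readable summary for a hypothetical directory:

hdfs dfs -du -s -h /user/demo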

16. find:

This command finds all files that match the specified expression and applies the selected actions to them. If no path is specified, it defaults to the current working directory; if no expression is specified, it defaults to -print.

hdfs dfs -find <path> … <expression> …
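
For example, to find all .txt files under a hypothetical directory and print their paths:

hdfs dfs -find /user/demo -name "*.txt" -print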

17. getmerge:

This is one of the most useful commands on HDFS when trying to read the output files of a MapReduce job. It merges a list of files in a directory on HDFS into a single file on the local file system.

hdfs dfs -getmerge [-nl] <src> <localdst>

  • -nl option is used to add a newline at the end of each file.
  • -skip-empty-file can be used to avoid unwanted newline characters in the case of empty files.
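
For example, to merge the part files of a hypothetical MapReduce output directory into one local file, adding a newline after each file:

hdfs dfs -getmerge -nl /user/demo/output merged.txt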

18. head:

This command displays the first kilobyte of the file on the console.

hdfs dfs -head URI

Example: hdfs dfs -head pathname

19. tail:

This command displays the last kilobyte of the file on the console.

hdfs dfs -tail URI

Example: hdfs dfs -tail pathname

20. mv:

This command is used to move files from one location to another within HDFS.

hdfs dfs -mv URI [URI …] <dest>
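
Example, with hypothetical paths:

hdfs dfs -mv /user/demo/sample.txt /user/archive/sample.txt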

21. rm:

This command is used to remove a file or directory from HDFS.

hdfs dfs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI …]

  • By default, rm removes only files; directories cannot be deleted without -r.
  • -skipTrash option bypasses the trash and immediately deletes the source.
  • -f option suppresses the error message when the file does not exist.
  • -r option is used to recursively delete directories.
  • -safely option requires a safety confirmation before deleting a directory containing more files than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories.
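
For example, to recursively delete a hypothetical directory, bypassing the trash:

hdfs dfs -rm -r -skipTrash /user/demo/old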

22. stat:

This command is used to print statistics about the file or directory in the specified format. Format accepts the file size in bytes (%b), the group name of the owner (%g), the file name (%n), the block size (%o), the replication (%r), the username of the owner (%u), and the modification date (%y, %Y).

hdfs dfs -stat [format] <path> …
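
For example, to print the name, block size, replication, and modification date of a hypothetical file:

hdfs dfs -stat "%n %o %r %y" /user/demo/sample.txt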

23. setrep:

This command is used to change the replication factor of a file to a specific count, overriding the default replication factor in HDFS. If the path is a directory, the command recursively changes the replication factor of all files under the directory tree.

hdfs dfs -setrep [-R] [-w] <numReplicas> <path>

  • -w option requests that the command wait for the replication to complete. This can potentially take a very long time.
  • -R option is accepted for backwards compatibility. It has no effect.
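
For example, to set the replication factor of a hypothetical file to 3 and wait for re-replication to finish:

hdfs dfs -setrep -w 3 /user/demo/sample.txt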

24. touch:

This command is used to update the access and modification times of the file specified by the URI to the current time. If the file does not exist, a zero-length file is created at the URI with the current time as its timestamp.

hdfs dfs -touch [-a] [-m] [-t TIMESTAMP] [-c] URI [URI …]

  • -a option to change only the access time.
  • -m option to change only the modification time.
  • -t option to specify timestamp (in format yyyyMMddHHmmss) instead of current time.
  • -c option to not create file if it does not exist.

Examples: hdfs dfs -touch pathname
hdfs dfs -touch -m -t 20180809230000 pathname
hdfs dfs -touch -t 20180809230000 pathname
hdfs dfs -touch -a pathname

25. touchz:

This command is used to create a file of zero length. An error is returned if the file exists with non-zero length.

hdfs dfs -touchz URI [URI …]

Ex: hdfs dfs -touchz pathname

Hope you like this post. Happy Learning!!
