Why are there so many different ways to measure disk usage?

Adding up numbers is easy. The problem is, there are many different numbers to add.

How much disk space does a file use?

The basic idea is that a file containing n bytes uses n bytes of disk space, plus a bit for some control information: the file’s metadata (permissions, timestamps, etc.), and a bit of overhead for the information that the system needs to find where the file is stored. However, there are many complications.

Microscopic complications

Think of each file as a series of books in a library. Smaller files make up just one volume, but larger files consist of many volumes, like an encyclopedia. In order to be able to locate the files, there is a card catalog which references every volume. Each volume has a bit of overhead due to the covers. If a file is very small, this overhead is relatively large. Also the card catalog itself takes up some room.

Going a bit more technical, in a typical simple filesystem, the space is divided into blocks. A typical block size is 4KiB. Each file takes up an integer number of blocks. Unless the file size is a multiple of the block size, the last block is only partially used. So a 1-byte file and a 4096-byte file both take up one block, whereas a 4097-byte file takes up two blocks. You can observe this with ls or du: if your filesystem has a 4KiB block size, then ls -s and du will report 4KiB for a 1-byte file.
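The rounding described above is plain ceiling division. The following is a minimal sketch, assuming a 4096-byte block size and ignoring metadata and indirect blocks; the function names are made up for illustration.

```python
def blocks_needed(size_bytes: int, block_size: int = 4096) -> int:
    """Number of whole blocks needed to hold size_bytes of data."""
    # Ceiling division: a partially filled last block still costs a full block.
    return -(-size_bytes // block_size)


def allocated_bytes(size_bytes: int, block_size: int = 4096) -> int:
    """Space taken by the data blocks alone (no metadata, no indirect blocks)."""
    return blocks_needed(size_bytes, block_size) * block_size


for size in (1, 4096, 4097):
    print(size, blocks_needed(size), allocated_bytes(size))
# prints:
# 1 1 4096
# 4096 1 4096
# 4097 2 8192
```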

If a file is large, then additional blocks are needed just to store the list of blocks that make up the file (these are indirect blocks; more sophisticated filesystems may optimize this in the form of extents). Those don’t show in the file size as reported by ls -l or GNU du --apparent-size. du and ls -s, which report disk usage as opposed to size, do account for them.

Some filesystems try to reuse the free space left in the last block to pack several file tails in the same block. Some filesystems (such as ext4 since Linux 3.8) use 0 blocks for tiny files (just a few bytes) that entirely fit in the inode.

Macroscopic complications

Generally, as seen above, the total size reported by du is the sum of the sizes of the blocks or extents used by the file.

The size reported by du may be smaller if the file is compressed. Unix systems traditionally support a crude form of compression: if a file block contains only null bytes, then instead of storing a block of zeroes, the filesystem can omit that block altogether. A file with omitted blocks like this is called a sparse file. Sparse files are not automatically created when a file contains a large series of null bytes; the application must arrange for the file to become sparse.
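One way to see the difference between size and disk usage is to create a sparse file and compare the two numbers. This is a rough sketch: the helper names are invented here, st_blocks is assumed to be in 512-byte units (true on Linux and most Unix systems), and the allocated figure depends on whether the filesystem actually supports holes.

```python
import os
import tempfile


def size_and_usage(path: str) -> tuple[int, int]:
    """Return (apparent size, allocated bytes) for path."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units, whatever the filesystem block size is.
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path: str, size_bytes: int) -> None:
    """Create a file of size_bytes without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)  # extends the file; the gap is a hole if supported


with tempfile.TemporaryDirectory() as tmp:
    sparse = os.path.join(tmp, "sparse.bin")
    make_sparse_file(sparse, 1024 * 1024)
    apparent, allocated = size_and_usage(sparse)
    print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

The first number corresponds to what ls -l shows and the second to what du shows; on a filesystem with sparse file support the second is typically much smaller.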

Some filesystems such as btrfs and zfs support general-purpose compression.

Advanced complications

Two major features of very modern filesystems such as zfs and btrfs make the relationship between file size and disk usage significantly more distant: snapshots and deduplication.

Snapshots are a frozen state of the filesystem at a certain date. Filesystems that support this feature can contain multiple snapshots taken at different dates. These snapshots take room, of course. At one extreme, if you delete all the files from the active version of the filesystem, the filesystem won’t become empty if there are snapshots remaining.

Any file or block that hasn’t changed since a snapshot was taken, or between two snapshots, exists identically in the snapshot and in the active version or the other snapshot. This is implemented via copy-on-write. In some edge cases, it’s possible that deleting a file on a full filesystem will fail due to insufficient available space, because removing that file would require making a copy of a block in the directory, and there’s no more room for even that one block.

Deduplication is a storage optimization technique that consists of avoiding storing identical blocks. With typical data, looking for duplicates isn’t always worth the effort. Both zfs and btrfs support deduplication as an optional feature.

Why is the total from du different from the sum of the file sizes?

As we’ve seen above, the size reported by du for each file is normally the sum of the sizes of the blocks or extents used by the file. Note that by default, ls -l lists sizes in bytes, but du lists sizes in KiB, or in 512-byte units (sectors) on some more traditional systems (du -k forces the use of kilobytes). Most modern unices support ls -lh and du -h to use “human-readable” numbers using K, M, G, etc. suffixes (for KiB, MiB, GiB) as appropriate.

When you run du on a directory, it sums up the disk usage of all the files in the directory tree, including the directories themselves. A directory contains data (the names of the files, and a pointer to where each file’s metadata is), so it needs a bit of storage space. A small directory will take up one block; a larger directory will require more blocks. The amount of storage used by a directory sometimes depends not only on the files it contains but also on the order in which they were inserted and in which some files were removed (with some filesystems, this can leave holes, a compromise between disk space and performance), but the difference will be tiny (an extra block here and there). When you run ls -ld /some/directory, the directory’s size is listed. (Note that the “total NNN” line at the top of the output from ls -l is an unrelated number: it’s the sum of the sizes in blocks of the listed items, expressed in KiB or sectors.)

Keep in mind that du includes dot files which ls doesn’t show unless you use the -A or -a option.

Sometimes du reports less than the expected sum. This happens if there are hard links inside the directory tree: du counts each file only once.
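The hard link rule can be imitated by remembering which (device, inode) pairs have already been counted. The sketch below is not how du is implemented, only an illustration of the idea; tree_usage is a hypothetical helper and st_blocks is assumed to be in 512-byte units.

```python
import os
import stat
import tempfile


def tree_usage(root: str) -> tuple[int, int, int]:
    """Return (bytes counting each inode once, naive bytes, repeated links)."""
    seen = set()
    unique = naive = repeats = 0
    stack = [root]
    while stack:
        path = stack.pop()
        st = os.lstat(path)          # lstat: do not follow symbolic links
        usage = st.st_blocks * 512
        naive += usage
        key = (st.st_dev, st.st_ino)  # identifies the file, whatever its name
        if key in seen:
            repeats += 1              # another hard link to a file already counted
        else:
            seen.add(key)
            unique += usage
        if stat.S_ISDIR(st.st_mode):
            with os.scandir(path) as entries:
                stack.extend(entry.path for entry in entries)
    return unique, naive, repeats


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.bin")
    with open(original, "wb") as f:
        f.write(b"x" * 10000)
    os.link(original, os.path.join(tmp, "alias.bin"))  # second name, same inode
    unique, naive, repeats = tree_usage(tmp)
    print(f"counted once: {unique}, naive sum: {naive}, repeated links: {repeats}")
```

The first figure plays the role of the du total, while the naive sum is what you get by adding up per-file numbers without noticing the shared inode.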

On some file systems like ZFS on Linux, du does not report the full disk space occupied by extended attributes of a file.

Beware that if there are mount points under a directory, du will count all the files on these mount points as well, unless given the -x option. So if for instance you want the total size of the files in your root filesystem, run du -x /, not du /.
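The idea behind du -x can be sketched by comparing device numbers: a directory whose st_dev differs from that of the starting point belongs to another filesystem. This is an illustrative approximation with a made-up function name, not a reimplementation of du.

```python
import os
import tempfile


def files_on_same_filesystem(root: str):
    """Yield file paths under root, skipping directories on other filesystems."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # subdirectories that are mount points of a different filesystem.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "a.txt"), "w") as f:
        f.write("hello\n")
    print(list(files_on_same_filesystem(tmp)))
```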

If a filesystem is mounted to a non-empty directory, the files in that directory are hidden by the mounted filesystem. They still occupy their space, but du won’t find them.

Deleted files

When a file is deleted, this only removes the directory entry, not necessarily the file itself. Several conditions must all be met in order to actually delete a file and thus reclaim its disk space:

  • The file’s link count must drop to 0: if a file has multiple hard links, removing one doesn’t affect the others.
  • As long as the file is open by some process, the data remains. Only when all processes have closed the file is the file deleted. The output of fuser -m or lsof on a mount point includes the processes that have a file open on that filesystem, even if the file is deleted.
  • Even if no process has the deleted file open, the file’s space may not be reclaimed if that file is the backend of a loop device. losetup -a (as root) can tell you which loop devices are currently set up and on what file. The loop device must be destroyed (with losetup -d) before the disk space can be reclaimed.
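The second condition is easy to reproduce: a file that has been unlinked but is still open keeps its data until the descriptor is closed. A small sketch, assuming a local Unix filesystem (network filesystems such as NFS can behave differently); the function name and file name are arbitrary.

```python
import os
import tempfile


def unlink_while_open(directory: str, payload: bytes) -> tuple[int, int]:
    """Delete a file while holding it open; return (link count, size) via the descriptor."""
    path = os.path.join(directory, "doomed.tmp")
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.unlink(path)              # removes the directory entry only
        st = os.fstat(f.fileno())    # the inode is still reachable through the descriptor
        # Leaving the with-block closes the file; only then can the space be reclaimed.
        return st.st_nlink, st.st_size


with tempfile.TemporaryDirectory() as tmp:
    links, size = unlink_while_open(tmp, b"x" * 1000)
    print(f"link count: {links}, size still held: {size} bytes")
```

While the descriptor is open, the file no longer appears in any directory listing, so du cannot see it, yet df still counts its blocks as used.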

If you delete a file in some file managers or GUI environments, it may be put into a trash area where it can be undeleted. As long as the file can be undeleted, its space is still consumed.

What are these numbers from df exactly?

A typical filesystem contains:

  • Blocks containing file (including directories) data and some metadata (including indirect blocks, and extended attributes on some filesystems).
  • Free blocks.
  • Blocks that are reserved to the root user.
  • Superblocks and other control information.
  • Inodes
  • A journal

Only the first kind is reported by du. When it comes to df, what goes into the “used”, “available” and total columns depends on the filesystem (of course used blocks (including indirect ones) are always in the “used” column, and unused blocks are always in the “available” column).

Filesystems in the ext2/ext3/ext4 family reserve 5% of the space for the root user by default. This is useful on the root filesystem, to keep the system going if it fills up (in particular for logging, and to let the system administrator store a bit of data while fixing the problem). Even for data partitions such as /home, keeping that reserved space is useful because an almost-full filesystem is prone to fragmentation. Linux tries to avoid fragmentation (which slows down file access, especially on rotating mechanical devices such as hard disks) by pre-allocating many consecutive blocks when a file is being written, but if there are not many consecutive blocks, that can’t work.
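The numbers df prints come from the statvfs system call, and the root reservation shows up as the gap between free blocks and blocks available to ordinary users. The following sketch uses Python’s os.statvfs; the dictionary keys are labels chosen for this example, and what each filesystem puts into these counters varies as described above.

```python
import os


def df_numbers(path: str = "/") -> dict[str, int]:
    """Return df-style byte counts for the filesystem containing path."""
    vfs = os.statvfs(path)
    unit = vfs.f_frsize                  # size of the units the counts are expressed in
    total = vfs.f_blocks * unit
    free = vfs.f_bfree * unit            # free blocks, including those reserved for root
    available = vfs.f_bavail * unit      # free blocks an unprivileged user can use
    return {
        "total": total,
        "used": total - free,
        "available": available,
        "reserved_for_root": free - available,
    }


for label, value in df_numbers("/").items():
    print(f"{label:>18}: {value / 2**30:.2f} GiB")
```

This also explains why “used” plus “available” often adds up to less than the total on ext2/ext3/ext4.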

Traditional filesystems, up to and including ext4 but not btrfs, reserve a fixed number of inodes when the filesystem is created. This significantly simplifies the design of the filesystem, but has the downside that the number of inodes needs to be sized properly: with too many inodes, space is wasted; with too few inodes, the filesystem may run out of inodes before running out of space. The command df -i reports how many inodes are in use and how many are available (filesystems where the concept is not applicable may report 0).
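The inode counts shown by df -i are available from the same statvfs call. A brief sketch, with an invented helper name:

```python
import os


def inode_usage(path: str = "/") -> tuple[int, int]:
    """Return (inodes in use, total inodes) for the filesystem containing path."""
    vfs = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the number still free.
    # Filesystems without a fixed inode table may report 0 for both.
    return vfs.f_files - vfs.f_ffree, vfs.f_files


used, total = inode_usage("/")
print(f"inodes used: {used} of {total}")
```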

Running tune2fs -l on the volume containing an ext2/ext3/ext4 filesystem reports some statistics including the total number and number of free inodes and blocks.

Another feature that can confuse matters is subvolumes (supported in btrfs, and in zfs under the name datasets). Multiple subvolumes share the same space, but have separate directory tree roots.

If a filesystem is mounted over the network (NFS, Samba, etc.) and the server exports a portion of that filesystem (e.g. the server has a /home filesystem, and exports /home/bob), then df on a client reflects the data for the whole filesystem, not just for the part that is exported and mounted on the client.

What’s using the space on my disk?

As we’ve seen above, the total size reported by df does not always take all the control data of the filesystem into account. Use filesystem-specific tools to get the exact size of the filesystem if needed. For example, with ext2/ext3/ext4, run tune2fs -l and multiply the block size by the block count.

When you create a filesystem, it normally fills up the available space on the enclosing partition or volume. Sometimes you might end up with a smaller filesystem when you’ve been moving filesystems around or resizing volumes.

On Linux, lsblk presents a nice overview of the available storage volumes. For additional information or if you don’t have lsblk, use specialized volume management or partitioning tools to check what partitions you have. On Linux, there’s lvs, vgs, pvs for LVM, fdisk for traditional PC-style (“MBR”) partitions (as well as GPT on recent systems), gdisk for GPT partitions, disklabel for BSD disklabels, Parted, etc. Under Linux, cat /proc/partitions gives a quick summary. Typical installations have at least two partitions or volumes used by the operating system: a filesystem (sometimes more), and a swap volume.

Some computers have a partition containing the BIOS or other diagnostic software. Computers with UEFI have a dedicated bootloader partition.

Finally, note that most computer programs use units based on powers of 1024 = 2^10 (because programmers love binary and powers of 2). So 1 kB = 1024 B, 1 MB = 1048576 B, 1 GB = 1073741824 B, 1 TB = 1099511627776 B, … Officially, these units are known as kibibyte (KiB), mebibyte (MiB), etc., but most software just reports k or kB, M or MB, etc. On the other hand, hard disk manufacturers systematically use metric (1000-based) units. So that 1 TB drive is only 931 GiB or 0.909 TiB.
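The conversion between the manufacturer’s metric units and the binary units most software reports is a single division. A quick sketch of the arithmetic for a drive sold as 1 TB (10^12 bytes); the function name is made up.

```python
def to_binary_units(n_bytes: int) -> dict[str, float]:
    """Express a byte count in 1024-based units."""
    return {
        "KiB": n_bytes / 2**10,
        "MiB": n_bytes / 2**20,
        "GiB": n_bytes / 2**30,
        "TiB": n_bytes / 2**40,
    }


one_tb = 10**12  # a "1 TB" drive as labelled by the manufacturer
units = to_binary_units(one_tb)
print(round(units["GiB"], 2))  # 931.32
print(round(units["TiB"], 3))  # 0.909
```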

A short summary of complications to calculating file sizes and disk spaces:

  • The space the file takes on disk is the number of blocks it takes multiplied by the size of each block, plus the inode it takes. A 1-byte file will take at least 1 block, 1 inode and one directory entry.

    But it could take only 1 additional directory entry if the file is a hard link to another file. It would be just another reference to the same set of blocks.

  • The size of the contents of the file. This is what ls displays.
  • Free disk space is not the size of the largest file you can fit in or the sum of all file content sizes that will fit on the disk. It’s somewhere in between. It depends on the number of files (taking up inodes), the block size, and how closely each file’s contents fill blocks completely.

This is just scratching the surface of file systems and it is overly simplified. Also remember that different file systems operate differently.

stat is very helpful for spotting some of this information. Here are some examples of how to use stat and what it is good for: http://landoflinux.com/linux_stat_command_examples.html

df is generally used to see what the file systems are, how full each is and where they’re mounted. Very useful when you’re running out of space in a file system, and maybe want to shift things around among the file systems, or buy a bigger disk, etc.

du shows details of how much cumulative storage each of one’s directories is consuming (sort of like windirstat in Windows). Great for finding where you’re hogging up space when trying to do file cleanup.

Aside from small numerical differences explained by others, I think the du and df utilities serve very different purposes.
