Sparse files with Python
Written by Barry Warsaw in technology on Sat 14 January 2017. Tags: linux, python,
For a project at work we create sparse files, which on Linux and other POSIX systems represent empty blocks more efficiently. Let's say you have a file that's a gibibyte [1] in size, but which contains mostly zeros, i.e. the NUL byte. It would be inefficient to write out all those zeros, so file systems that support sparse files just write some metadata to represent them. The real, non-zero data is then written wherever it occurs. These runs of zero bytes are called "holes".
Sparse files are used in many situations, such as disk images and database files, so having an efficient representation is pretty important. When the file is read, the operating system transparently turns those holes into the correct number of zero bytes, so software reading sparse files generally doesn't have to do anything special. It just reads data as normal, and the OS gives it zeros for the holes.
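To make this concrete, here is a minimal Python sketch, assuming a POSIX system. It seeks past the end of a new file before writing a small payload, which leaves a hole in front of the data on file systems that support sparse files, and then reads the whole file back. The helper name write_with_hole, the file name, and the payload are made up for illustration.

```python
import os
import tempfile


def write_with_hole(path, hole_size, payload):
    """Write payload at offset hole_size, leaving the bytes before it unwritten."""
    with open(path, "wb") as fp:
        # Seeking past the end writes nothing; the skipped range becomes
        # a hole if the file system supports sparse files.
        fp.seek(hole_size)
        fp.write(payload)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "holey")  # hypothetical file name
    write_with_hole(path, 1_000_000, b"real data")

    # An ordinary read needs no special handling for the hole.
    with open(path, "rb") as fp:
        data = fp.read()

    hole_reads_as_zeros = data[:1_000_000] == bytes(1_000_000)
    tail = data[1_000_000:]
    print(len(data), hole_reads_as_zeros, tail)
```

The reader never asks whether the file is sparse; it just gets zero bytes followed by the payload.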
You can create a sparse file right from the shell:
$ truncate -s 1000000 /tmp/sparse
Now /tmp/sparse is a file containing one million zeros. It actually consumes almost no space on disk (just some metadata), but for most intents and purposes, the file is one million bytes in size:
$ ls -l /tmp/sparse
-rw-rw-r-- 1 barry barry 1000000 Jan 14 11:36 /tmp/sparse
$ wc -c /tmp/sparse
1000000 /tmp/sparse
The commands ls and wc don't really know or care that the file is sparse; they just keep working as if it weren't.
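Python can see both views of the file through os.stat(). The sketch below is illustrative rather than definitive: it assumes a platform that exposes st_blocks (Linux and most Unix systems do, Windows does not) and that st_blocks counts 512-byte units, which is the Linux convention but is not guaranteed by POSIX. The helper name sizes is made up for this example.

```python
import os
import tempfile


def sizes(path):
    """Return (apparent size, bytes allocated on disk) for path."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sparse")  # hypothetical file name
    with open(path, "wb") as fp:
        # Extend the empty file to one million bytes without writing data,
        # much like `truncate -s 1000000` does from the shell.
        fp.truncate(1_000_000)

    apparent, allocated = sizes(path)
    print("apparent size: ", apparent)
    print("allocated size:", allocated)
```

On a file system that supports sparse files, the allocated size should come out far smaller than the apparent size; on one that doesn't, the two will be roughly equal.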
But sometimes you do need to know that a file contains holes. A common case is when you want to copy the file to some other location, say on a different file system. A naive use of cp will fill in those holes, so a command like this …