The versatile and ubiquitous tar utility

Tue 25 June 2013
By Stephen Cripps

In GNU\Linux

tags: linux, tar, storage, bash


What is tar?

The TAR (Tape ARchiver) utility was originally released in 1979 with the seventh edition of the Unix operating system. Despite its age, tar is still used everywhere. Now granted, tar is not the same program it was when it was released, but the function it performs is the same.

Initially developed to write data to sequential I/O devices for tape backup purposes, tar is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures. (Wikipedia, June 2013)

As per the Unix philosophy, tar performs one function, and does so very well. It takes a list of files and spits out a single stream of data. What I want to cover is some of the cool things you can do with tar.

Command Syntax

tar [operation-flag] [option-flags] <list of files>

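The operation flag tells tar what it should be doing with the data given to it, and the option flags adjust how it does it. Note the following operations:

  • x: extract the files from the data stream
  • c: create a data stream given a list of file/directory names
  • t: print out the file names as they are found in the data stream

And the following options:

  • z: compress (or decompress) the data stream with gzip
  • C: (upper-case) requires a directory path as argument; tar changes to that directory before extracting (or before reading the files to archive)
  • f: requires an archive file name as argument; tar reads from or writes to that file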

For example, a typical tar command might look like the following when you want to store some files in a tar archive.

> tar czf myArchive.tar.gz somefiles/

f tells tar you want the files to be stored in an archive named "myArchive.tar.gz"

c tells tar you are creating an archive

z tells tar that you want to apply compression to the archive, so it passes the data stream through gzip as it writes the file to disk.

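Everything else is a file or directory name that you would like to include in the archive. If you don't specify f, GNU tar as built on most Linux distributions will write the data stream to stdout (other builds may default to a tape device, in which case f - asks for stdout explicitly), which allows you to do the following command: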

> tar c somefiles/ > myArchive.tar

I personally find this easier to read, since the file you wish to save the archive to appears after the list of files you wish to archive.

Similarly, extracting an archive is:

> tar xf myArchive.tar
(or)
> cat myArchive.tar | tar x
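
As a rough end-to-end sketch of the operations above, the following creates, lists, and extracts a small archive. The directory and file names (somefiles, restored, and so on) are made up for illustration, and everything happens in a throwaway temporary directory.

```shell
#!/bin/sh
# Hypothetical names throughout; work in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1

mkdir -p somefiles/sub restored
printf 'hello\n' > somefiles/a.txt
printf 'world\n' > somefiles/sub/b.txt

# c: create, z: gzip, f: write to the named archive file
tar czf myArchive.tar.gz somefiles/

# t: list member names without extracting anything
listing=$(tar tzf myArchive.tar.gz)
printf '%s\n' "$listing"

# x: extract; -C switches to the target directory first
tar xzf myArchive.tar.gz -C restored

# diff -r exits non-zero if the restored tree differs from the original
diff -r somefiles restored/somefiles && roundtrip=ok
```

Listing with t before extracting is a cheap way to check what paths an archive will create.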

The "Data Stream"

The data stream is actually just a file format that tar uses to describe the data. For each file in the stream, there is a block of bytes containing information about the file, such as its name, size, user attributes and so on. What makes it a stream is the fact that the file can be processed without knowing where the end is.
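
To make that header block a little more concrete, here is a small sketch that peeks at the first header of an archive. It assumes the ustar layout, where the name occupies the first 100 bytes of the header and the size is stored as octal text starting at byte offset 124. The file names are invented for the example.

```shell
#!/bin/sh
# Hypothetical file names; --format=ustar keeps the header layout predictable.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'hello\n' > hello.txt            # 6 bytes of content

tar cf demo.tar --format=ustar hello.txt

# Name field: first 100 bytes of the header, padded with NUL bytes
name_field=$(head -c 100 demo.tar | tr -d '\0')

# Size field: 11 octal digits starting at offset 124 (tail -c +N is 1-based)
size_octal=$(tail -c +125 demo.tar | head -c 11)

# A leading 0 makes printf read the digits as octal
size_bytes=$(printf '%d' "0$size_octal")

echo "name=$name_field size=$size_bytes"
```

Because the size sits in the header, tar knows how many bytes of file data follow before the next header, which is what lets it process the stream front to back.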

What I mean by that is you can feed the data stream into tar for extraction, and it will immediately begin extracting files before it even knows where the end of the archive is! This trait is a big part of what makes it so useful, and a big part of what makes it so different from other archive formats like zip.

The tar file format also does not have an index. You cannot just ask it for a file and receive it immediately. If you request a file from tar, it starts reading through the archive linearly and keeps going until it finds a file with the name you were looking for.
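
For illustration, requesting a single member looks like the sketch below (archive and member names are hypothetical). The member must be named exactly as it appears in the listing, and tar still reads the stream from the start to find it.

```shell
#!/bin/sh
# Hypothetical names; build a small archive with three members.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p somefiles out
for n in 1 2 3; do
    printf 'file %s\n' "$n" > "somefiles/$n.txt"
done
tar cf myArchive.tar somefiles/

# Extract only one member into out/; the others are skipped over
tar xf myArchive.tar -C out somefiles/2.txt

extracted=$(find out -type f | sort)
echo "$extracted"
```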

A word on data stream formats

Tar has gone through several different file formats in order to overcome limitations with the original implementation. The newest of these, the POSIX.1-2001 (pax) format, places no practical limits on file size or filename/path length, and is supported by modern versions of GNU tar. You can specify the tar format with:

> tar c --format=posix ...
(or)
> tar c --posix ...

Another modern format is "gnu", which includes some neat features, like being able to specify the length of the tape and split the archive across tapes in an interactive way. These extensions are not a part of POSIX, however, and have some limitations (albeit unlikely to be a problem) on the file UIDs.

I have had to deal with tar file format limits myself when using FreeBSD's tar, which is also the standard tar on Apple's OS X: the format I was using there had an 8GiB file size limit.

This linear nature is a reminder of tar's roots: writing to tape drives had to be linear, because they didn't have the ability to randomly access data. A sequential data stream allows you to do some cool things:

The tar-pipe

Should you need to transfer a large number of small files across computers, you may find using tar faster than tools that focus on copying files. With scp, for example, there is a lot of overhead for every file transferred.

> tar c somefiles/ | ssh myuser@myserver "tar x"

When you pipe data into an ssh command, ssh passes it along as stdin for the specified command on the remote host. In this case, we are piping to "tar x" on the remote host, which will extract the data stream into the remote user's home directory, since that is where ssh starts the command.

If you want to extract the data to somewhere other than the home directory, you can use the -C option flag, like so:

> tar c somefiles/ | ssh myuser@myserver "tar x -C /tmp/somedir"
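
The same pattern can be tried without a remote host. This sketch swaps the ssh hop for a local subshell, which is my stand-in rather than anything from the commands above, and the directory names are hypothetical. It also passes f - on both sides, which spells out stdout and stdin explicitly for tar builds whose default archive is not the standard streams.

```shell
#!/bin/sh
# Hypothetical names; a local stand-in for the ssh tar-pipe.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p somefiles/nested dest
printf 'payload\n' > somefiles/nested/data.txt

# Left side writes the stream to stdout; right side reads it from stdin.
# Over ssh, the right side would instead run on the remote host.
tar cf - somefiles/ | ( cd dest && tar xf - )

copied=$(cat dest/somefiles/nested/data.txt)
echo "$copied"
```

Nothing is written to disk between the two tar processes; the archive only ever exists inside the pipe.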

Adding a progress bar to tar

Although you can view the filenames as they are being processed by tar with the -v flag, it doesn't give you a definitive view of how much progress tar has made, and doesn't contain an option to do so. Instead, we can use pv (pipe viewer) to monitor the data stream as it enters or leaves tar.

The easiest example of this is using pv to pipe the archive into a tar extract command:

> pv somearchive.tar | tar x

In this case, pv knows the file size, which allows it to give you an estimate of the time remaining. Creating an archive and knowing the ETA is a little more difficult; check the next section for details. You can still use pv to create an archive without knowing the total archive size, in which case it shows throughput and elapsed time but no ETA:

> tar c somefiles/ | pv > somearchive.tar

Tar-pipe with pv bash function

If you want to know how long the tar command is going to take, you can get a pretty good idea if you know the total size of the files, and then tell pv the amount of data you expect to pass through the pipe.

For example, you can use the du command to get the total file size beforehand:

> du -sc content/ draft/ output/
828     content/
1656    draft/
1328    output/
3812    total

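The -sc flags tell du to print a single summary line per argument and a grand total at the end. Note that GNU du reports these numbers in 1024-byte blocks by default, not bytes, while pv -s expects bytes unless you add a suffix. With the k suffix, your tar command would be:

> tar c content/ draft/ output/ | pv -s 3812k > myArchive.tar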

Putting together what we know, we can turn this into a nifty tar-pipe shell script which includes a progress bar. The script writes the archive to stdout, so if you just want to create an archive file, you would add > filename.tar when calling it.

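#!/bin/bash

# Sum the size in bytes of every argument (du -b is a GNU option).
size=0
for file in "$@" ; do
    size=$(( size + $( du -sb -- "$file" | cut -f 1 ) ))
done
# Write the archive to stdout; pv draws its progress bar on stderr.
tar c "$@" | pv -s "${size}"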

I wrote the previous script without knowing about du's -c option, but I've also tested it much more thoroughly and know that it works. The for loop runs du on every file argument and accumulates the sum in the size variable. The tar command at the end writes the archive through pv to stdout.
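
For comparison, here is a sketch of the same idea using du's -c grand total instead of the loop. This is a variation and has not had the testing the script above has: it assumes GNU du (for -b), and the function name tarpv is hypothetical. It falls back to plain cat when pv is not installed, so the pipeline still works without a progress bar.

```shell
#!/bin/sh
# tarpv: hypothetical helper; writes a tar stream of its arguments to stdout.
tarpv() {
    # -s: one line per argument, -c: add a "total" line, -b: size in bytes
    tarpv_size=$(du -scb -- "$@" | tail -n 1 | cut -f 1)
    if command -v pv >/dev/null 2>&1; then
        tar cf - "$@" | pv -s "$tarpv_size"
    else
        tar cf - "$@" | cat      # no pv available: pass the stream through
    fi
}

# Usage: tarpv content/ draft/ output/ > myArchive.tar
```

The estimate ignores tar's own headers and padding, so the bar may finish slightly past 100%.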

Applying compression

The tar file format is not actually compressed; it's just turning many files into a single stream of data. The advantage of this is that you can just pipe this stream into any number of compression utilities.

For example, to compress an archive with gzip:

> tar c somefiles/ | gzip > myArchive.tar.gz

Tar also has flags to specify compression (such as z for gzip), rather than piping it through the command externally. But choosing to pipe through an external compression utility allows you to use pigz, which is a multi-threaded implementation of gzip and much faster on computers with more than one core.

Finally, adding a progress bar will still work, as long as you put pv at the point in the pipe where the size it has been told matches the data it sees: the uncompressed stream (between tar and the compressor) when creating an archive, and the compressed file when extracting.
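
A minimal sketch of the external-compressor approach, with names invented for the example. It picks pigz when it is installed and falls back to gzip otherwise; both write gzip-format output, so plain gzip -dc can decompress either.

```shell
#!/bin/sh
# Hypothetical names; compress a tar stream with an external tool.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p somefiles restored
printf 'compress me\n' > somefiles/notes.txt

# Prefer the multi-threaded pigz if present, otherwise use gzip
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

tar cf - somefiles/ | "$compressor" > myArchive.tar.gz

# Decompress to stdout and hand the plain tar stream back to tar
gzip -dc myArchive.tar.gz | ( cd restored && tar xf - )

restored_text=$(cat restored/somefiles/notes.txt)
echo "$restored_text"
```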

Splitting up the Archive

Say you want to take a very large archive and split it over some media of a fixed size; simply take the data stream and pipe it through the split command.

For example:

> tar c somefiles/ | split -b 1G -d - myprefix-somefiles.tar-

Check out the man page for split for more information. Typically I'm only concerned about:

  • -b: Size of each output file; use a single letter suffix such as K, M, G or T. Without -b, split cuts its input by line count, which is not what you want for binary data.
  • -d: Use numbers for the suffix
  • -: Read the data from stdin, i.e. the pipe
  • myprefix..: the prefix for each of the output files

Or putting together everything from above:

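> tar c somefiles/ | pv | pigz | \
    ( cd somedirectory && split -b 1G -d - myprefix-somefiles.tar.gz- )

Wrapping cd and split in parentheses runs them in a subshell, so you can move into another directory for the output files without changing the directory of your own shell. Using && means split only runs if the cd succeeded.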

Of course, splitting things should make you nervous: you're now depending on each of these files to remain intact if you want to be able to recreate the original archive. You can always use something like Parchive (par2), which will generate parity files capable of repairing damaged files.
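
Putting the pieces back together is just concatenation in suffix order. The sketch below splits a small archive into 4096-byte pieces and restores from them. The sizes and names are made up for the demonstration, and -d is assumed to be available as in GNU split.

```shell
#!/bin/sh
# Hypothetical names and sizes; split an archive, then rebuild it.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p somefiles parts restored
# About 20 KB of sample data so the archive spans several pieces
head -c 20000 /dev/urandom > somefiles/blob.bin

# Split the stream into pieces named myprefix-somefiles.tar-00, -01, ...
tar cf - somefiles/ | ( cd parts && split -b 4096 -d - myprefix-somefiles.tar- )

# The shell expands the glob in sorted order, matching the numeric suffixes
cat parts/myprefix-somefiles.tar-* | ( cd restored && tar xf - )

pieces=$(ls parts | wc -l)
# cmp exits non-zero if the restored file differs from the original
cmp somefiles/blob.bin restored/somefiles/blob.bin && intact=yes
```

The glob ordering only holds while every suffix has the same width, so for very many pieces a longer suffix (split's -a option) is worth considering.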

Resources

https://blogs.oracle.com/janp/entry/how_the_scp_protocol_works

http://www.gnu.org/software/tar/

http://ftp.gnu.org/old-gnu/Manuals/tar-1.12/html_node/tar_117.html

http://www.linfo.org/pipes.html

http://unix.stackexchange.com/questions/45709/behavior-of-stdin-stdout-in-conjunction-with-subshells-and-cd-command
