Glob Expressions

Note: This is a description taken from the Linux glob library. VMoo has its own implementation for these expressions, so there might be some differences (especially because it is not used to match files ;-).

NAME

glob - Globbing pathnames

DESCRIPTION

Long ago, in Unix V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterwards this became a shell built-in.

These days there is also a library routine glob(3) that will perform this function for a user program.

The rules are as follows (POSIX 1003.2, 3.13).

WILDCARD MATCHING

A string is a wildcard pattern if it contains one of the characters `?', `*' or `['. Globbing is the operation that expands a wildcard pattern into the list of pathnames matching the pattern. Matching is defined by:

A `?' (not between brackets) matches any single character.

A `*' (not between brackets) matches any string, including the empty string.
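The two rules above can be tried out on plain strings. The following is only a sketch that uses Python's standard fnmatch module as a stand-in; it is not VMoo's implementation, and the sample strings are made up for illustration.

```python
from fnmatch import fnmatchcase

# fnmatchcase(string, pattern) tests one string against one wildcard
# pattern, case-sensitively and without looking at any files.
print(fnmatchcase("cat", "c?t"))     # True:  `?' stands for exactly one character
print(fnmatchcase("ct", "c?t"))      # False: `?' cannot match nothing
print(fnmatchcase("ct", "c*t"))      # True:  `*' may match the empty string
print(fnmatchcase("carrot", "c*t"))  # True:  `*' may match any longer string
```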

Character classes: An expression `[...]' where the first character after the leading `[' is not an `!' matches a single character, namely any of the characters enclosed by the brackets. The string enclosed by the brackets cannot be empty; therefore `]' can be allowed between the brackets, provided that it is the first character. (Thus, `[][!]' matches the three characters `[', `]' and `!'.)

Ranges: There is one special convention: two characters separated by `-' denote a range. (Thus, `[A-Fa-f0-9]' is equivalent to `[ABCDEFabcdef0123456789]'.) One may include `-' in its literal meaning by making it the first or last character between the brackets. (Thus, `[]-]' matches just the two characters `]' and `-', and `[--/]' matches the three characters `-', `.', `/'.)
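The bracket examples given so far can be checked one character at a time. This is a sketch only: matching_chars is a hypothetical helper written for this illustration, and Python's fnmatch module stands in for the real matcher, which may differ in details.

```python
from fnmatch import fnmatchcase

def matching_chars(pattern, candidates):
    """Return those characters of `candidates` that `pattern` matches."""
    return "".join(c for c in candidates if fnmatchcase(c, pattern))

# `]' placed first in the brackets is an ordinary member of the set.
print(matching_chars("[][!]", "[]!a"))           # []!

# Three ranges in one bracket expression.
print(matching_chars("[A-Fa-f0-9]", "Gg9fA0z"))  # 9fA0

# `-' placed last is literal; `--/' is the range from `-' to `/'.
print(matching_chars("[]-]", "]-a"))             # ]-
print(matching_chars("[--/]", "+,-./0"))         # -./
```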

Complementation: An expression `[!...]' matches a single character, namely any character that is not matched by the expression obtained by removing the first `!' from it. (Thus, `[!]a-]' matches any single character except `]', `a' and `-'.)
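A short sketch of complementation, again using Python's fnmatch module purely as an illustration (the candidate string is arbitrary):

```python
from fnmatch import fnmatchcase

# `[!]a-]' rejects `]', `a' and `-' and accepts every other single character.
candidates = "]a-bZ"
kept = [c for c in candidates if fnmatchcase(c, "[!]a-]")]
print(kept)  # ['b', 'Z']

# A complemented set still has to consume exactly one character.
print(fnmatchcase("", "[!]a-]"))    # False
print(fnmatchcase("bZ", "[!]a-]"))  # False
```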

One can remove the special meaning of `?', `*' and `[' by preceding them by a backslash, or, in case this is part of a shell command line, enclosing them in quotes. Between brackets these characters stand for themselves. Thus, `[[?*\]' matches the four characters `[', `?', `*' and `\'.
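The bracket rule offers a second way to quote a wildcard character: wrap it in brackets. The sketch below shows this with Python's fnmatch and glob modules. Note the assumption: Python's fnmatch does not implement the backslash convention described above, so only the bracket form is demonstrated, and the file name is invented.

```python
import glob
from fnmatch import fnmatchcase

# Between brackets the wildcard characters stand for themselves.
# The Python literal "[[?*\\]" is the five-character-set pattern [[?*\]
four = [c for c in "[?*\\x" if fnmatchcase(c, "[[?*\\]")]
print(four)  # ['[', '?', '*', '\\']

# glob.escape() quotes a literal string by bracketing `?', `*' and `['.
quoted = glob.escape("what?.txt")
print(quoted)                            # what[?].txt
print(fnmatchcase("what?.txt", quoted))  # True:  the literal `?' matches
print(fnmatchcase("whatX.txt", quoted))  # False: `?' is no longer a wildcard
```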

EMPTY LISTS

The nice and simple rule given above: `expand a wildcard pattern into the list of matching pathnames' was the original Unix definition. It allowed one to have patterns that expand into an empty list, as in

 xv -wait 0 *.gif *.jpg

where perhaps no *.gif files are present (and this is not an error). However, POSIX requires that a wildcard pattern is left unchanged when it is syntactically incorrect, or the list of matching pathnames is empty. With bash one can force the classical behaviour with `shopt -s nullglob' (very old versions of bash used the setting allow_null_glob_expansion=true instead).

(Similar problems occur elsewhere. E.g., where old scripts have

 rm `find . -name "*~"`

new scripts require

 rm -f nosuchfile `find . -name "*~"`

to avoid error messages from rm called with an empty argument list.)
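The difference between the two expansion rules can be sketched in a few lines. The function names expand_classic and expand_posix are hypothetical, chosen for this illustration; Python's glob.glob happens to follow the classical rule and returns an empty list when nothing matches.

```python
import glob
import os
import tempfile

def expand_classic(pattern):
    """Original Unix rule: an unmatched pattern expands to an empty list."""
    return sorted(glob.glob(pattern))

def expand_posix(pattern):
    """POSIX rule: an unmatched pattern is left unchanged."""
    return expand_classic(pattern) or [pattern]

with tempfile.TemporaryDirectory() as tmp:
    # One .jpg file and no .gif files, as in the xv example.
    open(os.path.join(tmp, "photo.jpg"), "w").close()
    base = glob.escape(tmp)  # keep the directory name itself literal
    gif = os.path.join(base, "*.gif")
    jpg = os.path.join(base, "*.jpg")

    print(expand_classic(gif))         # []
    print(expand_posix(gif) == [gif])  # True: the pattern comes back as-is
    print([os.path.basename(p) for p in expand_classic(jpg)])  # ['photo.jpg']
```

A caller using the POSIX rule has to be prepared to receive the unexpanded pattern as if it were a name, which is what makes the `rm -f nosuchfile' idiom necessary.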

NOTES

Regular expressions: Note that wildcard patterns are not regular expressions, although they are a bit similar. First of all, they match filenames, rather than text, and secondly, the conventions are not the same: e.g., in a regular expression `*' means zero or more copies of the preceding thing.
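To make the difference concrete, the sketch below feeds the same two-character pattern `a*' to a wildcard matcher and to a regular expression engine. Python's fnmatch and re modules are used here only as convenient stand-ins.

```python
import re
from fnmatch import fnmatchcase

# As a wildcard pattern, `a*' means "an a followed by any string".
print(fnmatchcase("abc", "a*"))          # True
print(fnmatchcase("", "a*"))             # False: the leading a is required

# As a regular expression, `a*' means "zero or more copies of a".
print(bool(re.fullmatch("a*", "abc")))   # False
print(bool(re.fullmatch("a*", "")))      # True

# The regular expression spelling of the wildcard `a*' is `a.*'.
print(bool(re.fullmatch("a.*", "abc")))  # True
```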

Now that regular expressions have bracket expressions where the negation is indicated by a `^', POSIX has declared the effect of a wildcard pattern `[^...]' to be undefined.

Character classes and Internationalization: Of course ranges were originally meant to be ASCII ranges, so that `[ -%]' stands for `[ !"#$%]' and `[a-z]' stands for "any lowercase letter". Some Unix implementations generalized this so that a range X-Y stands for the set of characters with code between the codes for X and for Y. However, this requires the user to know the character coding in use on the local system, and moreover, is not convenient if the collating sequence for the local alphabet differs from the ordering of the character codes. Therefore, POSIX extended the bracket notation greatly, both for wildcard patterns and for regular expressions. In the above we saw three types of item that can occur in a bracket expression: namely (i) the negation, (ii) explicit single characters, and (iii) ranges.

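POSIX specifies ranges in an internationally more useful way and adds three more types: named character classes, collating symbols, and equivalence classes. The named character classes are: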

       [:alnum:]  [:alpha:]  [:blank:]  [:cntrl:]
       [:digit:]  [:graph:]  [:lower:]  [:print:]
       [:punct:]  [:space:]  [:upper:]  [:xdigit:]

so that one can say `[[:lower:]]' instead of `[a-z]', and have things work in Denmark, too, where there are three letters past `z' in the alphabet. These character classes are defined by the LC_CTYPE category in the current locale.
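The Danish case can be illustrated with a small sketch. Everything here is an assumption made for the example: POSIX_CLASS and in_class are hypothetical names, and the classes are approximated with Python's Unicode-aware string predicates rather than with the LC_CTYPE tables that POSIX specifies.

```python
import re

# Hypothetical approximation of a few named classes (not LC_CTYPE based).
POSIX_CLASS = {
    "alpha": str.isalpha,
    "digit": str.isdigit,
    "lower": str.islower,
    "upper": str.isupper,
    "space": str.isspace,
}

def in_class(ch, name):
    """True if the single character `ch` belongs to the named class."""
    return POSIX_CLASS[name](ch)

# The three Danish letters after z: ae ligature, o with stroke, a with ring.
danish = "\u00e6\u00f8\u00e5"

for ch in "z" + danish + "Z":
    in_ascii_range = re.fullmatch("[a-z]", ch) is not None
    print(ascii(ch), in_ascii_range, in_class(ch, "lower"))
```

Only `z' falls inside the ASCII range, while the class test also accepts the three Danish letters and still rejects the capital `Z'.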