
User-defined Functions

Complicated awk programs can often be simplified by defining your own functions. User-defined functions can be called just like built-in ones (see section Function Calls), but it is up to you to define them, that is, to tell awk what they should do.

Function Definition Syntax

Definitions of functions can appear anywhere between the rules of an awk program. Thus, the general form of an awk program is extended to include sequences of rules and user-defined function definitions. There is no need in awk to put the definition of a function before all uses of the function. This is because awk reads the entire program before starting to execute any of it.
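As a minimal sketch of this point, the following program calls a function in a rule that appears before the function's definition. The function name double is made up for the illustration, and the program is wrapped in a shell command only so that it can be run as shown.

```shell
out=$(awk '
# The rule comes first and calls double, which is defined further down.
BEGIN { print double(21) }

# double is a hypothetical helper used only for this illustration.
function double(n)
{
     return 2 * n
}')
echo "$out"
```

This should print 42, since awk has already read the definition of double by the time the BEGIN rule runs.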

The definition of a function named name looks like this:

function name(parameter-list)
{
     body-of-function
}

name is the name of the function to be defined. A valid function name is like a valid variable name: a sequence of letters, digits and underscores, not starting with a digit. Within a single awk program, any particular name can be used in only one way: as a variable, as an array, or as a function.

parameter-list is a list of the function's arguments and local variable names, separated by commas. When the function is called, the argument names are used to hold the argument values given in the call. The local variables are initialized to the empty string. A function cannot have two parameters with the same name.

The body-of-function consists of awk statements. It is the most important part of the definition, because it says what the function should actually do. The argument names exist to give the body a way to talk about the arguments; local variables, to give the body places to keep temporary values.

Argument names are not distinguished syntactically from local variable names; instead, the number of arguments supplied when the function is called determines how many argument variables there are. Thus, if three argument values are given, the first three names in parameter-list are arguments, and the rest are local variables.

It follows that if the number of arguments is not the same in all calls to the function, some of the names in parameter-list may be arguments on some occasions and local variables on others. Another way to think of this is that omitted arguments default to the null string.

Usually when you write a function you know how many names you intend to use for arguments and how many you intend to use as local variables. It is conventional to place some extra space between the arguments and the local variables, to document how your function is supposed to be used.
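The following sketch shows how the same name can act as a local variable in one call and as an argument in another. The function glue and its parameter names are hypothetical; a and b are meant as arguments, and sep is meant as a local variable.

```shell
out=$(awk '
# glue is a hypothetical helper: a and b are meant as arguments,
# sep is meant as a local variable (note the extra space before it).
function glue(a, b,    sep)
{
     if (sep == "")
          sep = "-"
     return a sep b
}

BEGIN {
     print glue("x", "y")        # sep starts out empty, so "-" is used
     print glue("x")             # b is omitted and defaults to ""
     print glue("x", "y", "+")   # a third value turns sep into an argument
}')
echo "$out"
```

The three calls should print `x-y', `x-' and `x+y'. The last call works, but it goes against the intent that the extra space documents.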

During execution of the function body, the arguments and local variable values hide or shadow any variables of the same names used in the rest of the program. The shadowed variables are not accessible in the function definition, because their names refer to the arguments and local variables for as long as the function body is executing. All other variables used in the awk program can be referenced or set normally in the function's body.

The arguments and local variables last only as long as the function body is executing. Once the body finishes, you can once again access the variables that were shadowed while the function was running.
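Here is a small sketch of shadowing, under the assumption of a hypothetical function show whose parameter n has the same name as a variable used in the BEGIN rule. The variable total is not a parameter, so the function can set it normally.

```shell
out=$(awk '
# show is a hypothetical function whose parameter has the same name
# as a variable used elsewhere in the program.
function show(n)
{
     n = n + 1          # changes only the parameter n
     total = total + n  # total is not shadowed, so this sets the global
     return n
}

BEGIN {
     n = 10
     total = 0
     r = show(5)
     print r, n, total
}')
echo "$out"
```

This should print `6 10 6': the outer n keeps its value of 10, while total is changed by the function.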

The function body can contain expressions which call functions. They can even call this function, either directly or by way of another function. When this happens, we say the function is recursive.
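As a sketch of recursion by way of another function, the two hypothetical functions below call each other until the argument reaches zero. They assume a non-negative whole number as input and are meant only as an illustration, not as a practical way to test parity.

```shell
out=$(awk '
# is_even and is_odd are hypothetical functions that call each other.
function is_even(n)
{
     if (n == 0)
          return 1
     return is_odd(n - 1)
}

function is_odd(n)
{
     if (n == 0)
          return 0
     return is_even(n - 1)
}

BEGIN { print is_even(10), is_odd(7), is_even(3) }')
echo "$out"
```

This should print `1 1 0'.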

In many awk implementations, including gawk, the keyword function may be abbreviated func. However, POSIX only specifies the use of the keyword function. This actually has some practical implications. If gawk is in POSIX-compatibility mode (see section Command Line Options), then the following statement will not define a function:

func foo() { a = sqrt($1) ; print a }

Instead it defines a rule that, for each record, concatenates the value of the variable `func' with the return value of the function `foo'. If the resulting string is non-null, the action is executed. This is probably not what was desired. (awk accepts this input as syntactically valid, since functions may be used before they are defined in awk programs.)

To ensure that your awk programs are portable, always use the keyword function when defining a function.

Function Definition Examples

Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.

function myprint(num)
{
     printf "%6.3g\n", num
}

To illustrate, here is an awk rule which uses our myprint function:

$3 > 0     { myprint($3) }

This program prints, in our special format, the third field of every input record in which that field is a positive number. Therefore, when given:

 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24

this program, using our function to format the results, prints:

   5.6
  21.2

Here is a function that deletes all the elements in an array:

function delarray(a,    i)
{
    for (i in a)
       delete a[i]
}

When working with arrays, it is often necessary to delete all the elements in an array and start over with a new list of elements (see section The delete Statement). Instead of having to repeat this loop everywhere in your program that you need to clear out an array, your program can just call delarray.
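A call to delarray might look like the following sketch. The array name seen and the helper count are assumptions made for the illustration; count simply reports how many elements an array holds.

```shell
out=$(awk '
function delarray(a,    i)
{
     for (i in a)
          delete a[i]
}

# count is a hypothetical helper that returns the number of elements.
function count(a,    i, n)
{
     n = 0
     for (i in a)
          n++
     return n
}

BEGIN {
     seen["x"] = 1; seen["y"] = 2; seen["z"] = 3
     before = count(seen)
     delarray(seen)
     print before, count(seen)
}')
echo "$out"
```

This should print `3 0', showing that the array passed to delarray is empty afterwards.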

Here is an example of a recursive function. It takes a string as an input parameter, and returns the string in backwards order.

function rev(str, start)
{
    if (start == 0)
        return ""

    return (substr(str, start, 1) rev(str, start - 1))
}

If this function is in a file named `rev.awk', we can test it this way:

$ echo "Don't Panic!" |
> gawk --source '{ print rev($0, length($0)) }' -f rev.awk
-| !cinaP t'noD
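The `--source' option used above is specific to gawk. A more portable sketch keeps the rule and the function in a single program. The wrapper reverse is a hypothetical addition that supplies the starting position, so that callers need to pass only the string.

```shell
out=$(echo "Don't Panic" | awk '
function rev(str, start)
{
     if (start == 0)
          return ""

     return (substr(str, start, 1) rev(str, start - 1))
}

# reverse is a hypothetical wrapper that supplies the starting position.
function reverse(str)
{
     return rev(str, length(str))
}

{ print reverse($0) }')
echo "$out"
```

Given the input shown, this should print `cinaP t'noD'.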

Here is an example that uses the built-in function strftime. (See section Functions for Dealing with Time Stamps, for more information on strftime.) The C ctime function takes a timestamp and returns it as a string, formatted in a well-known fashion. Here is an awk version:

# ctime.awk
#
# awk version of C ctime(3) function

function ctime(ts,    format)
{
    format = "%a %b %d %H:%M:%S %Z %Y"
    if (ts == 0)
        ts = systime()       # use current time as default
    return strftime(format, ts)
}

Calling User-defined Functions

Calling a function means causing the function to run and do its job. A function call is an expression, and its value is the value returned by the function.

A function call consists of the function name followed by the arguments in parentheses. The arguments you write in the call are awk expressions; each time the call is executed, these expressions are evaluated, and their values become the actual arguments. For example, here is a call to foo with three arguments (the first being a string concatenation):

foo(x y, "lose", 4 * z)

Caution: whitespace characters (spaces and tabs) are not allowed between the function name and the open-parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.

When a function is called, it is given a copy of the values of its arguments. This is known as call by value. The caller may use a variable as the expression for the argument, but the called function does not know this: it only knows what value the argument had. For example, if you write this code:

foo = "bar"
z = myfunc(foo)

then you should not think of the argument to myfunc as being "the variable foo." Instead, think of the argument as the string value, "bar".

If the function myfunc alters the values of its local variables, this has no effect on any other variables. Thus, if myfunc does this:

function myfunc(str)
{
  print str
  str = "zzz"
  print str
}

to change its first argument variable str, this does not change the value of foo in the caller. The role of foo in calling myfunc ended when its value, "bar", was computed. If str also exists outside of myfunc, the function body cannot alter this outer value, because it is shadowed during the execution of myfunc and cannot be seen or changed from there.
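The following sketch puts the pieces together in a complete program. It uses a variation of myfunc that returns its changed parameter instead of printing it, which is an assumption made so that the result can be shown on one line.

```shell
out=$(awk '
# A variation of myfunc that returns the changed parameter.
function myfunc(str)
{
     str = "zzz"     # changes only the local copy
     return str
}

BEGIN {
     foo = "bar"
     z = myfunc(foo)
     print foo, z
}')
echo "$out"
```

This should print `bar zzz': the function saw the value "bar" and changed its own copy, while foo in the caller is untouched.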

However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually called call by reference. Changes made to an array parameter inside the body of a function are visible outside that function. This can cause subtle bugs if a function modifies an array that the caller did not expect to change. For example:

function changeit(array, ind, nvalue)
{
     array[ind] = nvalue
}

BEGIN {
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
}

This program prints `a[1] = 1, a[2] = two, a[3] = 3', because changeit stores "two" in the second element of a.
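To contrast the two kinds of parameter passing side by side, here is a sketch with a hypothetical function update that assigns to both an array parameter and a scalar parameter.

```shell
out=$(awk '
# update is a hypothetical function taking one array and one scalar.
function update(arr, val)
{
     arr["k"] = "new"   # visible to the caller: arrays are not copied
     val = "new"        # not visible to the caller: val is a copy
}

BEGIN {
     a["k"] = "old"
     v = "old"
     update(a, v)
     print a["k"], v
}')
echo "$out"
```

This should print `new old': the array element was changed through the parameter, while the scalar v kept its value.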

Some awk implementations allow you to call a function that has not been defined, and only report a problem at run-time when the program actually tries to call the function. For example:

BEGIN {
    if (0)
        foo()
    else
        bar()
}
function bar() { ... }
# note that `foo' is not defined

Since the condition of the `if' statement is never true, foo is never called, so it is not really a problem that foo has not been defined. Usually, though, it is a problem if a program calls an undefined function.

If `--lint' has been specified (see section Command Line Options), gawk reports calls to undefined functions.

The return Statement

The body of a user-defined function can contain a return statement. This statement returns control to the rest of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:

return [expression]

The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.

A return statement with no value expression is assumed at the end of every function definition. So if control reaches the end of the function body, then the function returns an unpredictable value. awk will not warn you if you use the return value of such a function.

Sometimes, you want to write a function for what it does, not for what it returns. Such a function corresponds to a void function in C or to a procedure in Pascal. Thus, it may be appropriate to not return any value; you should simply bear in mind that if you use the return value of such a function, you do so at your own risk.
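When the caller does rely on the result, one way to avoid depending on an undefined value is to end every path through the function with an explicit return. The function sign below is a hypothetical illustration of that habit.

```shell
out=$(awk '
# sign is a hypothetical function that returns -1, 0, or 1.
# Every path through the body ends in a return with a value.
function sign(x)
{
     if (x < 0)
          return -1
     if (x > 0)
          return 1
     return 0
}

BEGIN { print sign(-3), sign(0), sign(12) }')
echo "$out"
```

This should print `-1 0 1'.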

Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:

function maxelt(vec,   i, ret)
{
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}

You call maxelt with one argument, which is an array name. The local variables i and ret are not intended to be arguments; while there is nothing to stop you from passing two or three arguments to maxelt, the results would be strange. The extra space before i in the function parameter list indicates that i and ret are not supposed to be arguments. This is a convention that you should follow when you define functions.
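To see what "strange" can mean here, the following sketch calls maxelt once as intended and once with extra arguments. The array v and its values are made up for the illustration.

```shell
out=$(awk '
function maxelt(vec,   i, ret)
{
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}

BEGIN {
     v[1] = 3; v[2] = 7; v[3] = 5
     print maxelt(v)            # intended use
     print maxelt(v, 0, 1000)   # ret starts at 1000, which is not in v
}')
echo "$out"
```

The first call should print 7. The second should print 1000, a value that is not in the array at all, because the third argument preloads ret.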

Here is a program that uses our maxelt function. It loads an array, calls maxelt, and then reports the maximum number in that array:

awk '
function maxelt(vec,   i, ret)
{
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}

# Load all fields of each record into nums.
{
     for (i = 1; i <= NF; i++)
          nums[NR, i] = $i
}

END {
     print maxelt(nums)
}'

Given the following input:

 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225

our program tells us (predictably) that 99385 is the largest number in our array.

