Internationalizing Shell Scripts

By Thomas Hedden

1. Motivation: why internationalize?

Many people are surprised to learn that a very large percentage of the revenue of major American software developers comes from sales of their software outside the United States. As of 1991, this percentage ranged from around 45% for Apple up to close to 60% for companies such as Digital and Microsoft (Tuthill 1993: 6). Since the effort and cost of localizing a software application is considerably less than the effort and cost of developing it, it is foolish not to pursue foreign markets. However, foreign customers now expect that software be fully localized, and time-to-market is crucial in the software business. Therefore, it is very important that software applications be written in such a way that they can be localized easily and quickly. That is where internationalization becomes important.

2. Internationalization and localization

The terms internationalization and localization have special meanings in the software development community (although many people use them loosely in other senses). Internationalization refers to "designing and producing software that can easily be adapted to local markets" (Tuthill 1993: xxi), while localization is "the process of actually adapting the potentially useful internationalized software to meet the needs of one or more users in a particular geographical area" (Madell et al. 1994: 1-2). Thus, localization is basically translation of the user interface into a foreign language and setting any necessary environment variables affecting codeset, sorting order, etc. By and large, software developers need concern themselves only with internationalization, that is, making it easy for localizers to adapt the software to specific locales.

As was mentioned above, internationalization is important because it should be possible to localize software easily and quickly. Localizers tend not to have very much technical expertise, and will generally make a mess of a program if allowed to. Those localizers who do have technical expertise tend to charge rates which are similar to those of developers. Thus, the objective of internationalization is to make it possible for a localizer or translator to translate the user interface without having an unusual amount of technical expertise.

3. Information about internationalizing shell scripts not readily available

Although a number of books about internationalization have been written (a few are mentioned in the references at the end of this paper), they tend to ignore shell script programming. This is not surprising, since shell scripts have not tended to be sold commercially the way C programs are. However, the advent of the powerful programming capabilities of the Korn shell and of various shell programming tools is changing this situation.

4. Examples of programming habits which cause problems for localizers

Sometimes the best way to explain how to do something correctly is to show incorrect examples, and explain what is wrong with them. Here are a few examples of problems, and an explanation of how to avoid them.

4.1. Concatenation of strings

An obvious example of strings which are difficult for localizers to deal with is a phrase built by concatenating several phrases together.

echo "Please enter the name of the user whose"
echo "files you wish to delete: \c"

Since word order in foreign languages may not be the same as in English, this requires the localizer to put the phrases together, translate the combined sentence, and then break the translation up again into two phrases.
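
One way to avoid the split is to keep the whole prompt in a single string. The following is only a sketch: the variable name prompt_delete_user and the function name show_prompt are hypothetical, and printf is used in place of the "\c" escape, on the assumption that the shell provides it (POSIX shells do).

```shell
# Hypothetical variable name; the whole sentence is one translatable unit.
prompt_delete_user="Please enter the name of the user whose files you wish to delete: "

# printf '%s' prints the prompt without a trailing newline, so the
# cursor stays on the same line without relying on echo's "\c".
show_prompt() {
	printf '%s' "${prompt_delete_user}"
}

show_prompt
```

The localizer then receives one complete sentence and is free to reorder the words as the target language requires.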

An even worse habit is putting together messages in which one element is constant, and another changes:

echo "Invalid \c"
if [ -f "${name}" ]; then
	echo "filename."
elif [ -d "${name}" ]; then
	echo "directory."
else
	echo "choice."
fi

Since in many foreign languages words have "gender", and since the gender of the translated terms may not be the same, it is impossible for the translation of the word "Invalid \c" to be correct in every case. Rather, one should write:

if [ -f "${name}" ]; then
	echo "Invalid filename."
elif [ -d "${name}" ]; then
	echo "Invalid directory."
else
	echo "Invalid choice."
fi

4.2. Reuse of strings

Since creating messages can be somewhat time-consuming, programmers often check to see whether a particular string has already been defined, and then reuse it. This is very dangerous, since terms and phrases are sometimes translated differently depending on the context. The classic example is the German translation of the word "file" for the Macintosh platform. When it means a collection of data on a storage device, it is translated as "Datei"; when it means the name of the "File" menu, then it is translated as "Ablage". Thus, any new requirement for a user interface message should be produced as a completely new string.
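
In a message file this means giving each context its own entry, even when the English wording coincides. A minimal sketch, assuming hypothetical variable names (menu_file, noun_file) and a hypothetical load_messages function; the German words are the ones cited above:

```shell
# Hypothetical message names: one variable per context, even where the
# English wording is the same.
load_messages() {
	case "$1" in
		ger)
			menu_file="Ablage"	# title of the "File" menu
			noun_file="Datei"	# a collection of data on a storage device
			;;
		*)
			menu_file="File"
			noun_file="file"
			;;
	esac
}

load_messages ger
printf '%s / %s\n' "${menu_file}" "${noun_file}"
```

Had a single variable been reused for both contexts, the German message file could not have supplied two different translations.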

4.3. Assumptions about the format of locale-dependent data

Shell scripts which make assumptions about the format of the output of commands such as the date command will fail in a foreign language environment. For example, determining the day of the week by piping the date command to the cut command will fail if the date is output in a foreign language. If a routine makes assumptions about the format of the output from such environment-dependent commands, then either it should be changed so that it is independent of the environment, or the environment should be explicitly defined before the routine is executed. In that case the previous value can be "remembered" and restored after the environment-dependent routine has been executed:

environment_dependent_routine()
	{
	commands...
	}
remember_lang=${LANG}
export LANG=english
environment_dependent_routine
export LANG=${remember_lang}
(etc.)
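
The other option, making the routine independent of the environment, might look like the following sketch. It assumes a date command that supports the POSIX %u and %a conversions; the function names day_of_week and day_name_english are hypothetical.

```shell
# Hypothetical helper: print the day of the week as a number
# (1 = Monday ... 7 = Sunday), which does not depend on the language.
day_of_week() {
	date +%u
}

# Alternative sketch: force the POSIX locale for this one command only,
# so nothing has to be remembered and restored afterwards.
day_name_english() {
	LC_ALL=C date +%a
}

day_of_week
day_name_english
```

Setting LC_ALL on the command line of a single command affects only that command, and LC_ALL takes precedence over LANG where both are set.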

4.4. Difficulty in understanding what needs to be translated

Programmers tend not to understand how difficult it is for non-programmers to understand exactly what parts of a program need to be translated and what should be left alone. If localizers or translators ask for guidance about this, programmers frequently bark out a sentence or two such as "Tell them to translate everything in double quotes". However, it is almost never possible to explain what to do in such simple terms, or in any terms that a non-programmer could apply consistently without making mistakes.

4.5. Unnecessary additional complexity

The following example is taken from Kochan and Wood (1990: 184). In this case the use of a variable ($name) followed by a contraction (the apostrophe s) makes it difficult for a non-programmer to understand how to handle the variable.

who | grep "^$name " > /dev/null || echo "$name's not logged on"

The localizer has probably been told to ignore words which begin with a dollar sign. Does that mean to leave "$name's" as it is? Or should just "$name" be left as it is, and "'s" be treated as part of the rest of the message?
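
One possible way to remove the ambiguity is to keep the whole sentence in a printf format string, in which %s marks the place where the name goes. This is only a sketch; the variable name msg_not_logged_on and the function name report_not_logged_on are hypothetical.

```shell
# Hypothetical message variable; %s stands for the user name.
msg_not_logged_on="%s is not logged on"

report_not_logged_on() {
	# $1 is the user name. The translatable text contains no shell
	# variable syntax, only the %s placeholder.
	printf "${msg_not_logged_on}\n" "$1"
}

report_not_logged_on fred
```

The localizer now sees a complete sentence with a single placeholder, and can move the placeholder wherever the target language needs it.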

5. An example of how to isolate user interface strings

The following example is taken from Kochan and Wood (1990: 173).

$ cat rem
# 
# Remove someone from the phone book -- version 3
# 
if [ "$#" -ne 1 ]
then
	echo "Incorrect number of arguments."
	echo "Usage: rem name"
	exit 1
fi
(etc.)

We want to put the strings which appear on the screen in a form which can be given to a localizer who is not familiar with shell script programming. The first step is to isolate the strings into variables. In order to preserve the readability of the program, it is important to use mnemonic variable names (similar to the names of C library functions such as isalpha(), etc.).

$ cat rem
# 
# Remove someone from the phone book -- version tom-1
# 
# User interface strings:
# 
inc_no_arg="Incorrect number of arguments."
usage="Usage: rem name"
if [ "$#" -ne 1 ]
then
	echo "${inc_no_arg}"
	echo "${usage}"
	exit 1
fi
(etc.)

This is readable, but is still a little harder to read if you are interested in the exact wording of the message, since you have to go back and forth in the program. Therefore, if you are still stubbing out the program and deciding exactly what message should be printed, it is probably better to leave the strings hard-coded, and then put them into variables once the program becomes more stable. After the program is stable, it actually becomes more readable, rather than less, since later the programmer will be more interested in reviewing the functionality, and less so in the actual message. There is an additional advantage that not only is it easier to translate the messages when they are isolated like this, but it becomes easier to make changes required for other reasons. Since many programmers are not native speakers of English, it might be necessary for the user interface messages to be made more understandable, or checked for spelling mistakes, and this need not be done by a programmer.

To localize the above file, one could simply send the entire file to the localizer. However, the localizer may be confused by the code down below, or may actually try to translate it (or, in the worst case, may try to steal it!). Therefore, it is best not to send the code to the localizer at all. It is possible simply to cut out the section of the program which contains the strings and send that to the localizer, but a better solution is to put the user interface messages in a separate file.

$ cat rem_msg
# 
# User interface messages for the program "rem"
# 
inc_no_arg="Incorrect number of arguments."
usage="Usage: rem name"

$ cat rem
# 
# Remove someone from the phone book -- version tom-2
# 
# Read in user interface strings.
# 
# The following one error message cannot be put in the
# message file, since it is testing for whether the
# message file exists and has the correct permissions;
# if the message file is missing or has the wrong
# permissions then the error message would not be
# available if it were in that file.
no_msg_file="Error: message file not found or has wrong permissions."
if [ -f rem_msg -a -r rem_msg -a -x rem_msg ]; then
	. ./rem_msg
else
	echo "${no_msg_file}"
	exit 3
fi
if [ "$#" -ne 1 ]
then
	echo "${inc_no_arg}"
	echo "${usage}"
	exit 1
fi
(etc.)

As written, the test above requires that the file containing the user interface messages have both read and execute permission, although the dot command itself only needs to be able to read the file. It is also necessary to test for the existence of the message file, since the program will not work correctly without it.

Now the file rem_msg can be sent to a localizer. However, once we get back the translated file, another issue confronts us: The original shell script ran only in English. Should the shell script now run only in the foreign language? It would be nice if the shell script could run in either language. This requires that two message files be made available to the program:

$ cat rem_msg.eng
# 
# User interface messages for the program "rem"
# 
inc_no_arg="Incorrect number of arguments."
usage="Usage: rem name"

$ cat rem_msg.ger
# 
# User interface messages for the program "rem"
# 
inc_no_arg="Ungültige Anzahl der Parameter."
usage="Verwendung: rem Name"

Now we have to decide how the language should be chosen. There are a number of ways this could be done: we could define the message file in the program and let the user change it by editing the file, or we could explicitly ask the user every time the program is run. However, the best solution is to make use of the existing shell environment variable LANG to make this decision automatically. (A completely rigorous solution would check other environment variables such as LC_ALL, etc., but that is outside the scope of this paper.)

$ cat rem
# 
# Remove someone from the phone book -- version tom-3
# 
# User interface strings:
# 
# The following one error message cannot be put in the
# message file, since it is testing for whether the
# message file exists and has the correct permissions;
# if the message file is missing or has the wrong
# permissions then the error message would not be
# available if it were in that file.
no_msg_file="Error: message file not found or has wrong permissions."
lang=${LANG:=english}
case ${lang} in
	[Ee][Nn][Gg]*	)	rem_msg=rem_msg.eng ;;
	[Gg][Ee][Rr]*	)	rem_msg=rem_msg.ger ;;
	*			)	rem_msg=rem_msg.eng ;;
esac
if [ -f ${rem_msg} -a -r ${rem_msg} -a -x ${rem_msg} ]; then
	. ./${rem_msg}
else
	echo "${no_msg_file}"
	exit 3
fi
if [ "$#" -ne 1 ]
then
	echo "${inc_no_arg}"
	echo "${usage}"
	exit 1
fi
(etc.)
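
The selection logic can be extended to look at more than LANG. The following is only a sketch: it assumes the usual POSIX precedence (LC_ALL, then LC_MESSAGES, then LANG), and it assumes that a German locale may be named either in the older style (german) or in the language_TERRITORY style (de_DE). The function name pick_msg_file and the variable locale_name are hypothetical.

```shell
# Hypothetical helper: print the name of the message file that matches
# the locale name given as the first argument.
pick_msg_file() {
	case "$1" in
		[Gg][Ee][Rr]* | de | de[_.@]*)	echo rem_msg.ger ;;
		*)	echo rem_msg.eng ;;
	esac
}

# The first non-empty value wins: LC_ALL, then LC_MESSAGES, then LANG;
# if none of them is set, fall back to English.
locale_name=${LC_ALL:-${LC_MESSAGES:-${LANG:-english}}}
rem_msg=$(pick_msg_file "${locale_name}")
echo "${rem_msg}"
```

Passing the locale name as an argument keeps the matching rules in one place, so support for another language needs only one more case pattern and one more message file.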

6. Other steps necessary to run foreign-language shell scripts

If the environment variables such as LANG, LC_ALL, etc., are set correctly, then most UNIX systems will correctly display foreign characters. However, if you are connecting to a UNIX host via a telnet session, note that the special characters will not display correctly unless your communications software allows you to change the codepage. Multi-byte languages such as Japanese and Chinese also require special fonts and input-method editors to enable foreign-language capability.

References

Blinn, Bruce. 1996. Portable Shell Programming. An Extensive Collection of Bourne Shell Examples. Upper Saddle River, NJ: Prentice Hall PTR.

Kochan, Stephen G., and Patrick H. Wood. 1990. UNIX Shell Programming, revised edn. Carmel, IN: Hayden Books.

Madell, Tom, Clark Parsons, and John Abegg. 1994. Developing and Localizing International Software. Englewood Cliffs, NJ: PTR Prentice Hall.

O'Donnell, Sandra Martin. 1994. Programming for the World. A Guide to Internationalization. Englewood Cliffs, NJ: PTR Prentice Hall.

Rosenblatt, Bill. 1993. Learning the Korn Shell. Sebastopol, CA: O'Reilly & Associates, Inc.

Tuthill, Bill. 1993. Solaris® International Developer's Guide. Mountain View, CA: SunSoft Press.
