Sanitizing file names

Gentics CMS offers a central translation table that lets you configure how file names are sanitized.

1 Overview

Gentics CMS uses the $SANITIZE_CHARACTER array from the node.conf file to transform special characters in the filenames of pages, images and files, and in folder paths. This configuration is also used by the Aloha Editor headerids plugin to generate header ids from text contents.

The purpose of this feature is to replace special characters, in a meaningful way, with characters that are allowed in a filename.

For example: the character “è” is used in many languages, so it makes sense to replace it with “e” in the filename, because it would be lost otherwise. The default settings for sanitizing characters are:


		'é' => 'e',
		'è' => 'e',
		'ë' => 'e',
		'ê' => 'e',
		'à' => 'a',
		'ä' => 'ae',
		'â' => 'a',
		'Ä' => 'Ae',
		'ù' => 'u',
		'ü' => 'ue',
		'û' => 'u',
		'Ü' => 'Ue',
		'ö' => 'oe',
		'ô' => 'o',
		'Ö' => 'Oe',
		'ï' => 'i',
		'î' => 'i',
		'ß' => 'ss'

This will transform strings as follows:


	"äöï 23.jpg" => "aeoei_23.jpg"
	"ia 23$%.html" => "ia_23__.html"

2 Standard behavior

If the configuration is not changed, filenames and paths are sanitized as follows:

  1. Replace all characters listed in the map above by their specific replacement.
  2. Replace all characters that are not allowed with the standard replacement character (see Setting the standard replacement character). By default all alphanumeric characters, including “_”, and all of “.-()[]{}$/” are allowed and will not be replaced.

Further Rules:

  • All leading and trailing whitespace will be removed
  • If the name is empty, then the resulting filename will begin with “1”
  • If the name starts with a dot “.”, sanitizing puts a “1” before the dot. (This is because certain Apache installations would treat a filename that starts with a dot as a hidden file.)
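
The steps and rules above can be modelled in a few lines of Python. This is only an illustrative sketch, not the CMS implementation: the function name sanitize, the order in which the rules are applied, and the reading of “alphanumeric” as ASCII letters and digits are all assumptions.

```python
import string

# Default replacement map, copied from the list in the Overview.
SANITIZE_CHARACTER = {
    "é": "e", "è": "e", "ë": "e", "ê": "e",
    "à": "a", "ä": "ae", "â": "a", "Ä": "Ae",
    "ù": "u", "ü": "ue", "û": "u", "Ü": "Ue",
    "ö": "oe", "ô": "o", "Ö": "Oe",
    "ï": "i", "î": "i", "ß": "ss",
}

# Assumption: "alphanumeric" means ASCII letters and digits only.
ALLOWED = set(string.ascii_letters + string.digits + "_.-()[]{}$/")


def sanitize(name, replacement="_"):
    """Hypothetical model of the default sanitizing behavior."""
    # Further rule: remove leading and trailing whitespace.
    name = name.strip()
    # Step 1: apply the replacement map.
    for char, substitute in SANITIZE_CHARACTER.items():
        name = name.replace(char, substitute)
    # Step 2: replace every character that is not allowed.
    name = "".join(ch if ch in ALLOWED else replacement for ch in name)
    # Further rules: empty names and names starting with a dot get a "1" prefix.
    if not name or name.startswith("."):
        name = "1" + name
    return name


print(sanitize("äöï 23.jpg"))  # aeoei_23.jpg
print(sanitize(".htaccess"))   # 1.htaccess
```

Under these assumptions the first call reproduces the “äöï 23.jpg” example from the Overview.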

3 Configuration

If you make any modifications, the form validation for the page properties is turned off, but the input will still be sanitized as specified.

3.1 Sanitize characters list

You can redefine the predefined set of replacements or add new ones in the “node.conf” file like this:

/Node/etc/node.conf

	$SANITIZE_CHARACTER["ï"] = "i";
	$SANITIZE_CHARACTER["ä"] = "ae";

Do not replace any character with “/” or “\”, since those are separators for path names.

Make sure to use UTF-8 encoding for the configuration.

When using replacement characters other than alphanumeric (including “_”) and all of “.-()[]{}$/”, make sure that the replacement characters are also listed as allowed characters (see below).
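
Before deploying a custom list, the two warnings above can be checked mechanically. The following Python sketch is a hypothetical helper, not part of Gentics CMS: check_replacements and its parameters are made-up names, and the default allowed set assumes that “alphanumeric” means ASCII letters and digits.

```python
import string

# Assumption: the default allowed set is ASCII alphanumerics plus "_.-()[]{}$/".
DEFAULT_ALLOWED = set(string.ascii_letters + string.digits + "_.-()[]{}$/")


def check_replacements(mapping, extra_allowed=()):
    """Return a list of problems found in a replacement map (hypothetical helper)."""
    allowed = DEFAULT_ALLOWED | set(extra_allowed)
    problems = []
    for source, target in mapping.items():
        # Path separators must never be used as replacements.
        if "/" in target or "\\" in target:
            problems.append(f"{source}: replacement contains a path separator")
            continue
        # Every replacement character must itself be allowed.
        missing = sorted(set(target) - allowed)
        if missing:
            problems.append(f"{source}: {''.join(missing)} is not an allowed character")
    return problems


print(check_replacements({"ï": "i", "ä": "ae"}))  # []
print(check_replacements({"|": "/"}))             # one problem reported
```

An empty list means the map passes both checks under the stated assumptions.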

3.2 Allowing other characters

You can specifically allow other characters, so that they will not be replaced.

Use this at your own risk. Also do not add more than 9216 characters.

This does not work with Java 1.5. Use this feature only with Java 1.6 and above.

/Node/etc/node.conf

$SANITIZE_ALLOWED_CHARACTERS = array(
	',' , 'µ' // allow , and µ in filenames
);

3.3 Setting the standard replacement character

All characters that are not allowed in filenames are replaced with an underscore by default. You can however redefine the standard replacement character like this:

/Node/etc/node.conf

$SANITIZE_REPLACEMENT_CHARACTER = "-"; //use '-' instead of '_'

Use a “safe” character. The character should be safe for use in URLs and on the filesystem (not the path separator). Good replacement characters are “_” or “-”.

4 Examples

4.1 Replacing with special characters

You can also use the sanitize characters list together with the list of allowed characters like this:

/Node/etc/node.conf

$SANITIZE_CHARACTER["(c)"] = "©"; //replace (c) with ©

$SANITIZE_ALLOWED_CHARACTERS = array(
	'©' //Allow copyrights in filenames.
		//(Works, but not a good idea.)
);

This produces the following result:


	"(c) 2014 Gentics.jpg" => "©_2014_Gentics.jpg"

4.2 Replacing a character that is allowed by default

If you want to replace a character that is allowed by default, e.g. ( and ), with the default replacement character, you need to add it to the sanitize characters list.

/Node/etc/node.conf

$SANITIZE_CHARACTER["("] = "_";
$SANITIZE_CHARACTER[")"] = "_";

4.3 Using hyphens instead of underscores

It’s considered best practice to use hyphens (-) instead of underscores (_) in URLs. For new installations you should add the settings below.

To use hyphens (-) instead of underscores (_), configure the following:

/Node/etc/node.conf

$SANITIZE_CHARACTER[" "] = "-"; // replace space with - instead of _
$SANITIZE_CHARACTER["_"] = "-"; // replace _ with -
$SANITIZE_REPLACEMENT_CHARACTER = "-"; // replace not allowed characters with -
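
The effect of these three settings can be illustrated with a small Python model. It is a sketch under assumptions, not the CMS implementation: the function sanitize, its parameters, and the ASCII-only reading of “alphanumeric” are hypothetical, and the rules for empty names and leading dots are left out for brevity.

```python
import string

# Assumption: the default allowed set is ASCII alphanumerics plus "_.-()[]{}$/".
ALLOWED = set(string.ascii_letters + string.digits + "_.-()[]{}$/")


def sanitize(name, characters, replacement):
    """Hypothetical model: apply the map, then replace disallowed characters."""
    name = name.strip()
    for source, target in characters.items():
        name = name.replace(source, target)
    return "".join(ch if ch in ALLOWED else replacement for ch in name)


# Mirrors the node.conf settings above.
hyphen_map = {" ": "-", "_": "-"}

print(sanitize("my file_name%.html", hyphen_map, "-"))  # my-file-name-.html
# Without the explicit "_" entry the underscore survives, because it is allowed by default.
print(sanitize("my file_name%.html", {}, "-"))          # my-file_name-.html
```

In this model the explicit “_” entry is what removes existing underscores; changing the standard replacement character alone only affects characters that are not allowed.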