Version 5, last updated by Xander Jansen at September 01, 2011 UTC
International Character Handling
Assumptions
As of version beta 0.1.17 filesender can handle text in character sets other than US-ASCII. The code is based on the following assumptions:
- The Flash UI handles text as Unicode and uses UTF-8 as encoding for transmitting text strings containing international characters to the backend
- The PHP backend and database store strings with international characters in UTF-8 encoding
- Browser, Mail client and OS can handle UTF-8 encoded strings where appropriate
All handling of strings that might contain international characters should be UTF-8 aware/transparent. At the moment this concerns the 'filesubject', 'filemessage' and 'fileoriginalname' variables.
Database considerations
The filesender PostgreSQL database must use either the SQL_ASCII or the UTF8 encoding. This can be checked with psql -l, which lists all databases together with their encoding.
Character set distinctions
When handling one of the relevant variables (mainly in outgoing mails and when downloading) the following distinction should be made:
1. string contains only US-ASCII characters (plain text, no special characters)
2. string contains (multibyte) UTF-8 encoded characters from the ISO-8859-1 character set only
3. string contains (multibyte) UTF-8 encoded characters from other character sets
Category 2 (UTF-8 encoded ISO-8859-1) is handled differently on some occasions, either to adhere to existing standards or to avoid multibyte encoding where a single byte encoding can be used. In these cases the multibyte UTF-8 string is converted (iconv) to a single byte encoding in the ISO-8859-1 character set (with, where possible, the appropriate charset labels).
Strings from category 3 (UTF-8 encoded 'other' character sets) are handled and labeled as UTF-8.
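The three-way distinction above can be sketched as follows. This is a minimal illustration in Python, not the filesender PHP code: the function names classify and to_wire are hypothetical, and Python's codecs stand in for the iconv conversion mentioned above.

```python
def classify(utf8_bytes: bytes) -> int:
    """Return 1, 2 or 3 for the categories listed above.

    The input is assumed to be a valid UTF-8 byte string, as stored
    by the backend. (Hypothetical helper, for illustration only.)
    """
    text = utf8_bytes.decode("utf-8")
    try:
        text.encode("ascii")
        return 1  # only US-ASCII characters
    except UnicodeEncodeError:
        pass
    try:
        text.encode("iso-8859-1")
        return 2  # every character exists in ISO-8859-1
    except UnicodeEncodeError:
        return 3  # at least one character outside ISO-8859-1


def to_wire(utf8_bytes: bytes) -> tuple:
    """Return (charset label, encoded bytes) for a stored string.

    Category 2 is converted to single byte ISO-8859-1,
    category 3 is passed on unchanged and labeled UTF-8.
    """
    category = classify(utf8_bytes)
    if category == 1:
        return ("us-ascii", utf8_bytes)
    if category == 2:
        text = utf8_bytes.decode("utf-8")
        return ("iso-8859-1", text.encode("iso-8859-1"))
    return ("utf-8", utf8_bytes)
```

Note that a character such as the euro sign is not part of ISO-8859-1, so a string containing it falls into category 3 in this sketch.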
The "Content-Disposition" problem
At the moment a string is not labeled as UTF-8 encoded when a file whose name contains special characters is downloaded. The download process uses the HTTP Content-Disposition header with a filename parameter. This header is only defined for filenames containing characters from the (single byte) ISO-8859-1 character set.
Although standardisation efforts are under way to make it possible to use other character sets, there is not yet a universally adopted way of transmitting a UTF-8 encoded filename (see http://greenbytes.de/tech/webdav/draft-ietf-httpbis-content-disp-latest.html for details). The current filesender code therefore relies on what is called 'encoding sniffing' in browsers such as Firefox, Chrome and Safari. Internet Explorer can only handle ISO-8859-1 (single byte) encoded filenames.
For now this means that a filename containing international characters outside ISO-8859-1 will look 'funny' when downloaded with Internet Explorer, and possibly with other browsers besides Firefox, Chrome and Safari. The file itself is nevertheless downloaded correctly and is usable.
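To make the problem concrete, the following Python sketch builds the header value in two ways. The first function mirrors the behaviour described above as far as it can be inferred from this page (ISO-8859-1 bytes where possible, otherwise unlabeled UTF-8 bytes that the browser has to sniff). The second shows the extended filename* parameter proposed in the draft linked above. Both function names are hypothetical, and neither is taken from the filesender code.

```python
from urllib.parse import quote


def legacy_disposition(filename: str) -> bytes:
    """Header value with an unlabeled filename parameter.

    Filenames that fit in ISO-8859-1 are sent in that single byte
    encoding. Anything else is sent as raw UTF-8 bytes without a
    label, so the browser has to guess the encoding.
    """
    try:
        raw = filename.encode("iso-8859-1")
    except UnicodeEncodeError:
        raw = filename.encode("utf-8")
    # Escape backslash and double quote inside the quoted string.
    raw = raw.replace(b"\\", b"\\\\").replace(b'"', b'\\"')
    return b'attachment; filename="' + raw + b'"'


def extended_disposition(filename: str) -> str:
    """Header value using the filename* form from the draft.

    The name is UTF-8 encoded, percent-escaped and explicitly
    labeled with its charset.
    """
    return "attachment; filename*=UTF-8''" + quote(filename, safe="")
```

With the first form a browser that does not sniff the encoding interprets the UTF-8 bytes as ISO-8859-1, which produces the 'funny' filenames mentioned above.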
Known issues
- Sometimes the Flash UI gets confused by certain international characters and will display text (including text with only ASCII characters) in a garbled way. This appears to be related to uploading, with Gears, files whose names contain characters from the Unicode 'Combining Diacritical Marks' set. It is commonly seen with files uploaded from a Mac, since the Mac HFS+ file system enforces the use of 'decomposed Unicode' characters in filenames (see http://code.google.com/p/macfuse/issues/detail?id=139#c2 for a nice explanation). This does not affect functionality (uploads and downloads use the correct text and filenames).
- Some of the PHP code in the back-end might appear to be unneeded (most notably the utf8_encode before json_decode parts).
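The 'decomposed Unicode' filenames mentioned in the first item can be illustrated with a short Python sketch. It only shows what such a name looks like and how it could be detected or recomposed. Nothing on this page says that filesender performs any normalisation, and the helper name has_combining_marks is hypothetical.

```python
import unicodedata


def has_combining_marks(name: str) -> bool:
    """True if the name contains a character from the Unicode
    'Combining Diacritical Marks' block (U+0300 to U+036F)."""
    return any(0x0300 <= ord(ch) <= 0x036F for ch in name)


# A precomposed name: 'e with acute' is one code point, U+00E9.
composed = "caf\u00e9.txt"

# The decomposed (NFD) form, as HFS+ stores it: 'e' followed by
# the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", composed)

# Recomposing (NFC) yields the single code point form again.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Both forms render identically, but the decomposed name is one code point longer and contains a combining mark, which is the kind of input that seems to trigger the display problem.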
