BCA Server upgrades - possible problems with character sets

Discussion about BCA's Internet Hosting Service
David Gibson
DG test
Posts: 622
Joined: Thu 16 Mar 2006 23:45

Post by David Gibson » Fri 15 Dec 2017 16:44

If you are using non-7-bit characters on your web pages, BCA's server upgrades may give rise to a number of subtle problems. Non-7-bit characters include the pound sign (i.e. the HTML entity &pound;) and all accented characters, which can reach your site if, for example, you have a <FORM> that allows a user to enter a name and address.

If your pound symbols, £, are being replaced by Â£ or ï¿½ or disappearing completely, read on...

BCA's britiac3 server runs PHP 5.4.45. Some of the problems reported here will arise only at a later upgrade, because they first occur in PHP 5.6. However, the problem with htmlentities() already occurs (from PHP 5.4) - see my earlier posting in March 2016 at http://british-caving.org.uk/phpBB3/vie ... =31&t=1348

The origin of the problem appears to be that, from PHP 5.6.0, the default_charset setting defaults to "UTF-8" instead of being left empty. Unless you override it, every web page that is parsed by PHP will have a header added to the output, specifying Content-Type: text/html; charset=utf-8. You can test this using the PHP function get_headers().
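As a rough sketch of that test (the helper name and the URL are placeholders of mine, not anything on BCA's servers), the following picks the charset out of the header lines that get_headers() returns in its default, non-associative form:

```php
<?php
// Hypothetical helper: find the charset declared in a list of raw header
// lines, such as the array returned by get_headers($url).
function charset_from_headers($headers)
{
    foreach ($headers as $line) {
        // Match e.g. "Content-Type: text/html; charset=utf-8" (case-insensitive).
        if (preg_match('/^Content-Type:.*charset=\s*"?([^\s";]+)/i', $line, $m)) {
            return strtolower($m[1]);
        }
    }
    return null; // no Content-Type header declared a charset
}

// What PHP will declare if the script does not override it.
echo 'default_charset = "', ini_get('default_charset'), '"', "\n";

// Usage against a live page (placeholder URL, replace with your own):
// $headers = get_headers('http://www.example.org/page.php');
// echo charset_from_headers($headers), "\n";
```

If the second call reports utf-8 for a page whose text is actually stored as ANSI, you are likely to see the problems described below.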

The following text refers to the pound sign, but the description applies to accented characters as well. It refers to the ANSI character set, which is equivalent to Windows 1252. This is similar to ISO-8859-1, but in the range 0x80 to 0x9F it contains printable characters (the euro sign, 'curly' quotes and a few accented letters such as Š and œ) where ISO-8859-1 has only control codes. Reportedly, browsers are likely to default to Windows 1252 when ISO-8859-1 is specified. But whether that applies to non-Windows browsers, I do not know.
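To make the difference concrete, here is a small sketch (the function name is my own, and it assumes PHP's iconv extension is loaded, which it normally is). It decodes the same byte under both character sets and prints the result as hex-encoded UTF-8. The byte 0x80 is the euro sign in Windows 1252 but a control code in ISO-8859-1, whereas the pound sign, 0xA3, is the same in both.

```php
<?php
// Decode a single-byte string from the named charset and return the
// result as hex-encoded UTF-8, so the difference shows up in plain ASCII.
function utf8_hex_of($byte, $charset)
{
    return bin2hex(iconv($charset, 'UTF-8', $byte));
}

echo utf8_hex_of("\x80", 'CP1252'), "\n";     // e282ac = U+20AC, the euro sign
echo utf8_hex_of("\x80", 'ISO-8859-1'), "\n"; // c280   = U+0080, a control code
echo utf8_hex_of("\xA3", 'CP1252'), "\n";     // c2a3   = U+00A3, the pound sign
echo utf8_hex_of("\xA3", 'ISO-8859-1'), "\n"; // c2a3   = the same pound sign
```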

The effects of the problem include the following...
  • If you have a script that includes a pound sign that has been encoded in the ANSI character set (as the single byte 0xA3), it will not be valid UTF-8, which encodes the pound sign as two bytes. Your browser probably displays such invalid bytes as a black diamond containing a question mark (the Unicode replacement character, U+FFFD). The UTF-8 code for that character (0xEF 0xBF 0xBD) can be interpreted, by an ANSI system, as ï¿½, so you may find your £ symbols appearing as ï¿½ in your database.
  • If you have a FORM on your web page, and the user enters a pound sign, it will be encoded in UTF-8 as the two bytes 0xC2 0xA3. If this is later displayed with the ANSI character set it will appear as Â£. This could happen if your web server generates an email containing the data, or if you write it to a database or text file and process it later, as ANSI text.
  • If your data contains the ANSI pound sign 0xA3 and you use the PHP function htmlentities() to attempt to convert it to &pound; for display in your browser, that will not work because the default character set for htmlentities() is now (from PHP 5.4) UTF-8, in which a lone 0xA3 is not a valid character. The function does not report an error. According to the PHP manual it returns an empty string when the input is not valid in the given encoding (unless ENT_IGNORE or ENT_SUBSTITUTE is set), so the text containing your pound signs vanishes from your output. You need to tell htmlentities() what character set you are using, e.g. htmlentities($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'). Note that, from PHP 5.6, this problem might 'go away' because from PHP 5.6 the default character set for htmlentities() follows the system's default_charset setting.
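The three effects above can be reproduced in a few lines. This is only a sketch: the sample strings and helper names are made up for illustration, and it assumes the iconv extension is loaded.

```php
<?php
$ansi = "Price: \xA3" . '10';     // pound sign as the single ANSI byte 0xA3
$utf8 = "Price: \xC2\xA3" . '10'; // pound sign as the UTF-8 pair 0xC2 0xA3

// 1. The ANSI byte is not valid UTF-8. preg_match() with the /u modifier
//    returns 1 only if the subject string is well-formed UTF-8.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// 2. UTF-8 bytes read as if they were ANSI: every byte becomes one character.
//    The result is returned as UTF-8 so that it can be printed.
function read_as_ansi($bytes)
{
    return iconv('CP1252', 'UTF-8', $bytes);
}

var_dump(is_valid_utf8($ansi));   // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
echo read_as_ansi($utf8), "\n";   // Price: Â£10

// 3. htmlentities() told to expect UTF-8 cannot convert the ANSI byte;
//    told the real charset, it produces the entity.
$wrong = htmlentities($ansi, ENT_COMPAT | ENT_HTML401, 'UTF-8');
$right = htmlentities($ansi, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
var_dump($wrong);                 // no &pound; comes back
echo $right, "\n";                // Price: &pound;10
```

Passing the charset explicitly, as in the last call, behaves the same whichever PHP version the server is running, which is why I would do it regardless of the default.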
Note that this is a PHP problem, not a web server problem 'as such'. As far as I know, Apache does not try to force a character set on you - but PHP does. The required character set must be set on the server. This can be done in a number of places.
  • Execute the PHP function header('Content-type: text/html;charset=ISO-8859-1'); at the start of each page, to set the character set.
  • Put the directive default_charset="ISO-8859-1" into the file .user.ini but note that this file is processed only by the CGI/FastCGI SAPI - it works on britiac3, but not on my localhost, or on britiac2, for example.
  • Put the directive php_value default_charset ISO-8859-1 into your .htaccess file, but that will cause a 500 Server Error if it is executed in the CGI/FastCGI SAPI, so it works on my localhost, for example, but not on britiac3.
  • Add default_charset="ISO-8859-1" to php.ini if you have sufficient admin privileges.
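Since the option that works depends on how PHP is running, a short diagnostic page can save some trial and error. The sketch below is my own illustration: the helper name is hypothetical, and the mapping from SAPI name to file is an assumption based on the behaviour described above (it treats any SAPI whose name contains "cgi" as one that reads .user.ini).

```php
<?php
// Hypothetical helper: suggest where a per-directory default_charset
// setting should go, given the name reported by php_sapi_name().
function charset_config_file($sapi)
{
    if (strpos($sapi, 'cgi') !== false) {
        return '.user.ini'; // e.g. "cgi-fcgi": php_value in .htaccess gives a 500 error
    }
    if (strpos($sapi, 'apache') === 0) {
        return '.htaccess'; // e.g. "apache2handler": .user.ini is ignored
    }
    return 'php.ini';       // anything else, e.g. "cli"
}

echo 'SAPI: ', php_sapi_name(), "\n";
echo 'default_charset: ', ini_get('default_charset'), "\n";
echo 'Per-directory setting goes in: ', charset_config_file(php_sapi_name()), "\n";
```

Run it through the web server rather than from the command line, because the command line uses a different SAPI from the one that serves your pages.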
If you are migrating from britiac2 (or, like me, you have a localhost set-up running PHP as a module) to britiac3, and you want to use all the same files, you can create a .user.ini file (which will be ignored by britiac2) but, additionally, put a suitable <If> clause in .htaccess (which will be ignored by britiac3). (See my further notes).

For further notes, see my web page at http://caves.org.uk/charset_test.html. (But that page was written as an aide memoire and test page for myself, so it might not make easy or exciting reading.)
Last edited by David Gibson on Sat 16 Dec 2017 11:16, edited 1 time in total.
Reason: minor layout changes