latin1 and cp1252 inconsistent?

Post by b***@yelp.com
Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345

Post by b***@yelp.com
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

Post by b***@yelp.com
When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'

b***@yelp.com

2012-11-16 23:27:54 UTC

Post by b***@yelp.com
Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.
"""

However, it's reasonable for 0x81 to decode to U+0081, because the Unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.
"""

Post by Ian Kelly
You can use a non-strict error handling scheme to prevent the error.

b'hello \x81 world'.decode('cp1252', 'replace')

'hello \ufffd world'

This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.
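
A small sketch of the round-trip concern, for illustration only (the variable names are not from the thread): the 'replace' and 'ignore' handlers cannot be inverted, whereas a latin1 decode keeps every byte recoverable.

```python
data = b'hello \x81 world'

# Decode the same bytes three ways.
replaced = data.decode('cp1252', 'replace')   # 0x81 becomes U+FFFD
ignored = data.decode('cp1252', 'ignore')     # 0x81 is dropped
as_latin1 = data.decode('latin-1')            # 0x81 becomes U+0081

# Try to get the original bytes back from each result.
replace_round_trips = replaced.encode('cp1252', 'replace') == data
ignore_round_trips = ignored.encode('cp1252') == data
latin1_round_trips = as_latin1.encode('latin-1') == data

print(replace_round_trips, ignore_round_trips, latin1_round_trips)
```

Only the latin1 path is expected to reproduce the input bytes, which is the property being asked for here.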

Dave Angel

2012-11-17 00:05:32 UTC

(Double-spaced nonsense deleted. Google Groups strikes again.)
This creates a non-reversible encoding, and loss of data, which isn't
acceptable for my application.

So tell us more about your application. If you have data which is
invalid, and you encode it to some other form, you have to expect that
it won't be reversible. But maybe your data isn't really characters at
all, and you're just trying to manipulate bytes?

Without a use case, we really can't guess. The fact that you are
waffling between latin1 and 1252 indicates this isn't really character data.

Also, while you're at it, please specify the Python version and OS
you're on. You haven't given us any code to guess it from.

--
DaveA

Ian Kelly

2012-11-17 00:20:24 UTC

Post by b***@yelp.com
They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.

It gets murkier than that. I don't want to spend time hunting down
the relevant documents, so I'll just quote from Wikipedia:

"""
In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
unassigned code values thus provides for 256 characters via every
possible 8-bit value.
"""

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History

Post by Ian Kelly
You can use a non-strict error handling scheme to prevent the error.

b'hello \x81 world'.decode('cp1252', 'replace')

'hello \ufffd world'

This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.

Well, what characters would you have these bytes decode to,
considering that they're undefined? If the string is really CP-1252,
then the presence of undefined characters in the document does not
signify "data". They're just junk bytes, possibly indicative of data
corruption. If on the other hand the string is really Latin-1, and
you *know* that it is Latin-1, then you should probably forget the
aliasing recommendation and just decode it as Latin-1.

Apparently this Latin-1 -> CP-1252 encoding aliasing is already
commonly performed by modern user agents. What do IE and Firefox do
when presented with a Latin-1 encoding and undefined CP-1252 codings?

Dennis Lee Bieber

2012-11-18 06:48:01 UTC

Post by b***@yelp.com
Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

This quote only states that those positions do not represent
displayable glyphs, and indicates that 8859 is only concerned with
codings for display. It does NOT say they are "undefined".

--
Wulfraed Dennis Lee Bieber AF6VN
***@ix.netcom.com HTTP://wlfraed.home.netcom.com/

Nobody

2012-11-17 00:33:14 UTC

It goes on to say:

The requirement to treat certain encodings as other encodings according
to the table above is a willful violation of the W3C Character Model
specification, motivated by a desire for compatibility with legacy
content. [CHARMOD]

IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.

Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.

If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.
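
A minimal sketch of that post-processing idea, assuming Python 3. The names C1_TO_CP1252 and decode_latin1_as_cp1252 are made up for this example. The table is derived from Python's own cp1252 codec, so the five bytes it leaves undefined simply stay as C1 control characters.

```python
# Map each C1 code point that cp1252 defines to its cp1252 character.
# Bytes that cp1252 leaves undefined raise UnicodeDecodeError and are
# skipped, so they remain C1 controls after translation.
C1_TO_CP1252 = {}
for byte in range(0x80, 0xA0):
    try:
        C1_TO_CP1252[byte] = bytes([byte]).decode('cp1252')
    except UnicodeDecodeError:
        pass


def decode_latin1_as_cp1252(data):
    """Decode bytes as Latin-1, then remap C1 codes to cp1252 characters."""
    return data.decode('latin-1').translate(C1_TO_CP1252)


print(ascii(decode_latin1_as_cp1252(b'\x80 hello \x81')))
```

Because every byte still maps to a distinct character, the result can be converted back to the original bytes by inverting the table before encoding as Latin-1.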

Ian Kelly

2012-11-17 01:08:36 UTC

Post by Nobody
If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.

Or just create a custom codec by taking the one in
Lib/encodings/cp1252.py and modifying it slightly.

>>> import codecs
>>> import cp1252a
>>> codecs.register(lambda n: cp1252a.getregentry() if n == "cp1252a" else None)
>>> b'\x81\x8d\x8f\x90\x9d'.decode('cp1252a')
'♕♖♗♘♙'
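
The cp1252a module itself is not shown in the thread. As a rough, self-contained sketch of the same idea, the table from the stock codec can be patched in memory instead of copying the file. This version maps the five undefined bytes to the C1 controls of equal value (rather than to chess pieces), and the codec name "cp1252a" is only borrowed from the session above.

```python
import codecs
from encodings import cp1252

# The stock table marks undefined bytes with U+FFFE; replace each of those
# entries with the code point numerically equal to the byte value.
decoding_table = ''.join(
    chr(byte) if char == '\ufffe' else char
    for byte, char in enumerate(cp1252.decoding_table)
)
encoding_table = codecs.charmap_build(decoding_table)


def _decode(data, errors='strict'):
    return codecs.charmap_decode(data, errors, decoding_table)


def _encode(text, errors='strict'):
    return codecs.charmap_encode(text, errors, encoding_table)


def _search(name):
    # Codec search functions return None for names they do not handle.
    if name == 'cp1252a':
        return codecs.CodecInfo(name='cp1252a', encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(ascii(b'\x80\x81\x8d\x8f\x90\x9d'.decode('cp1252a')))
```

With all 256 bytes defined, decoding never fails and encoding the result restores the original bytes.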

b***@yelp.com

2012-11-17 16:56:46 UTC

Post by Nobody
IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.

Yes that's exactly what it says.

Post by Nobody
Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.

"should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1 but raise an exception with cp1252. I see this as a sign that a small refactoring of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

"""
There are 65 code points set aside in the Unicode Standard for compatibility with the C0
and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
equal to its corresponding Unicode code point.
"""

IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2 ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

Ian Kelly

2012-11-17 18:08:49 UTC

Post by b***@yelp.com
http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme. It is also noted that the "BestFit" document is
not authoritative at:

http://www.iana.org/assignments/charset-reg/windows-1252

Post by b***@yelp.com
This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the unicode-point of equal value.
This is exactly the section which allows latin1 to decode 0x81 to U+81, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2 ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to the control codes for those
characters. CP-1252 does not; the reason those characters are left
undefined is to allow for future expansion, such as when Microsoft
added the Euro sign at 0x80.

Since we're talking about conversion from bytes to Unicode, I think
the most authoritative source we could possibly reference would be the
official ISO 10646 conversion tables for the character sets in
question. I understand those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
the cp1252 mapping leaves those five codes undefined. This would seem
to indicate that Python is correctly decoding CP-1252 according to the
Unicode standard.
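
One way to check this against a given Python build, as a quick sketch rather than an authoritative comparison of the mapping files, is to decode each byte in the C1 range with both codecs. The list names are illustrative only.

```python
undefined_in_cp1252 = []   # bytes cp1252 refuses to decode
remapped_in_cp1252 = []    # bytes cp1252 decodes differently from latin1

for byte in range(0x80, 0xA0):
    raw = bytes([byte])
    try:
        char = raw.decode('cp1252')
    except UnicodeDecodeError:
        undefined_in_cp1252.append(byte)
    else:
        if char != raw.decode('latin-1'):
            remapped_in_cp1252.append(byte)

print([hex(b) for b in undefined_in_cp1252])
print(len(remapped_in_cp1252))
```

On CPython this is expected to list 0x81, 0x8d, 0x8f, 0x90 and 0x9d as undefined, with the other 27 bytes of the range remapped to printable characters.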

Ian Kelly

2012-11-17 18:13:51 UTC

Post by b***@yelp.com
http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""
This does not sound like appropriate behavior for a generalized
conversion scheme. It is also noted that the "BestFit" document is
not authoritative at:
http://www.iana.org/assignments/charset-reg/windows-1252

I meant to also comment on the first link, but forgot. As that
document is published by the W3C, I understand it to be specific to
the Web, which Python is not. Hence I think the more general Unicode
specification is more appropriate for Python.

Nobody

2012-11-17 19:15:15 UTC