Discussion:
unicodedata name for \u000a
Ken Beesley
2004-08-21 19:24:04 UTC
Newbie question: on unicodedata.name


If I do

import unicodedata
unicodedata.name(u"a")
or
unicodedata.name(u"\u0061")

I get
'LATIN SMALL LETTER A'

as expected; but when I follow that with

unicodedata.name(u"\u000a")

I get

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: no such name

There is, of course, a Unicode name for \u000a,
which is 'LINE FEED' or perhaps 'LINE FEED (A)'.

Is there a gap in unicodedata? or in my understanding?

Thanks,

Ken
Christos "TZOTZIOY" Georgiou
2004-08-21 19:38:57 UTC
On Sat, 21 Aug 2004 21:24:04 +0200, rumours say that Ken Beesley
<***@xrce.xerox.com> might have written:

[snip]
Post by Ken Beesley
unicodedata.name(u"\u000a")
I get
File "<stdin>", line 1, in ?
ValueError: no such name
There is, of course, a Unicode name for \u000a,
which is 'LINE FEED' or perhaps 'LINE FEED (A)'.
Is there a gap in unicodedata? or in my understanding?
It seems that all control characters (u"\u0000" to u"\u001f") have no
names in unicodedata. Don't know if this is an omission (ie bug) or
intentional.
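
That observation is easy to check with a loop. The following is only a sketch, written for Python 3 (chr() instead of the u"..." literals used in this thread); the variable name unnamed is just for illustration.

```python
import unicodedata

# Collect every C0 control (U+0000..U+001F) for which name() raises ValueError.
unnamed = []
for cp in range(0x20):
    try:
        unicodedata.name(chr(cp))
    except ValueError:
        unnamed.append(cp)

print(len(unnamed))  # 32: none of the C0 controls has a name
```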
--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
Peter Kleiweg
2004-08-21 19:45:53 UTC
Post by Ken Beesley
There is, of course, a Unicode name for \u000a,
No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf
--
Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/~kleiweg/ls.html
Tor Iver Wilhelmsen
2004-08-22 08:53:17 UTC
Post by Peter Kleiweg
No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf
Quoting that document:

Alias names are those for ISO/IEC 6429:1992.
Commonly used alternative aliases are also shown.

000A LF <control>
= LINE FEED (LF)

So the authors of unicodedata.name() could have picked either
'<control>', the ASCII name 'LF' or the alternative 'LINE FEED (LF)'.
Not picking any of them seems strange, and as the OP pointed out,
leads to an error even though the "C0 Controls" part of that page *is*
part of Unicode.
Martin v. Löwis
2004-08-22 09:39:10 UTC
Post by Tor Iver Wilhelmsen
000A LF <control>
= LINE FEED (LF)
So the authors of unicodedata.name() could have picked either
'<control>', the ASCII name 'LF' or the alternative 'LINE FEED (LF)'.
No. <control> is not a character name. The unicodedata.name function
returns the official character name, so it MUST NOT return an alias
(which rules out your second alternative).
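
For readers on a much newer Python than the one discussed here: the name/alias distinction can be seen from both directions. This is a hedged sketch that assumes Python 3.3 or later, where unicodedata.lookup() also resolves formal name aliases; name() itself still reports only official names.

```python
import unicodedata

# Assumption: Python 3.3+, where lookup() consults the alias table as well.
lf = unicodedata.lookup("LINE FEED")

print(repr(lf))
# name() does not fall back to aliases, so the default is returned instead.
print(unicodedata.name(lf, None))
```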
Post by Tor Iver Wilhelmsen
Not picking any of them seems strange, and as the OP pointed out,
leads to an error even though the "C0 Controls" part of that page *is*
part of Unicode.
Yes. However, this strangeness originates from the Unicode
specification. Control characters simply do not have a name.

If you want to know whether a code point is an unassigned character,
check whether unicodedata.category returns "Cn".
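
A small sketch of that check, assuming Python 3. The general category is exposed as unicodedata.category(); the helper describe() is a made-up name for illustration, and U+0378 is used as an example of a code point that is unassigned at the time of writing.

```python
import unicodedata

def describe(c):
    """Return (name or None, general category) for a one-character string."""
    return unicodedata.name(c, None), unicodedata.category(c)

print(describe("a"))       # ('LATIN SMALL LETTER A', 'Ll')
print(describe("\n"))      # (None, 'Cc'): assigned control character, no name
print(describe("\u0378"))  # unassigned code points report category 'Cn'
```

So a missing name alone does not mean "unassigned": a control character is "Cc", while an unassigned code point is "Cn".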

Regards,
Martin
Tor Iver Wilhelmsen
2004-08-22 13:41:13 UTC
Post by Martin v. Löwis
No. <control> is not a character name. The unicodedata.name function
returns the official character name, so it MUST NOT return an alias
(which rules out your second alternative).
Then why not return None or the empty string instead of raising an
exception?
Martin v. Löwis
2004-08-22 15:15:20 UTC
Post by Tor Iver Wilhelmsen
Then why not return None or the empty string instead of raising an
exception?
Why does a dictionary lookup raise a KeyError instead of returning
None or an empty string? It's easy enough to add a function that
does what you want:

import unicodedata

def name(c):
    # Return the official Unicode name of c, or None if it has none.
    try:
        return unicodedata.name(c)
    except ValueError:
        return None

Python reports failures through exceptions, not through special
return values. It might have been an option initially to return
None. Now, it cannot be changed for backwards compatibility.

Regards,
Martin
Peter Otten
2004-08-22 16:00:54 UTC
Post by Tor Iver Wilhelmsen
Post by Martin v. Löwis
No. <control> is not a character name. The unicodedata.name function
returns the official character name, so it MUST NOT return an alias
(which rules out your second alternative).
Then why not return None or the empty string instead of raising an
exception?
What's wrong with
>>> import unicodedata
>>> unicodedata.name(u"\u000a", "my default value")
'my default value'
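
The parallel with dictionaries can be spelled out side by side. This is just a sketch for Python 3; the dictionary table and the variables strict_dict and strict_name are made up for illustration.

```python
import unicodedata

table = {"a": 1}

# The strict forms raise on a miss...
try:
    table["b"]
except KeyError:
    strict_dict = "KeyError"

try:
    unicodedata.name("\n")
except ValueError:
    strict_name = "ValueError"

print(strict_dict, strict_name)

# ...while the forgiving forms accept a fallback value.
print(table.get("b"), unicodedata.name("\n", None))
```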

Peter

Ken Beesley
2004-08-22 09:39:53 UTC
Post by Ken Beesley
There is, of course, a Unicode name for \u000a,
Peter Kleiweg wrote:

No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf



OK. I see that for 000A there is not now an official Unicode name in
4.0, and that "LINE FEED (LF)" is an alias. Such an alias, shown in
uppercase letters, indicates that it _was_ the name of the character in
The Unicode Standard, Version 1.0. See The Unicode Standard 4.0, p. 415
("Aliases"). This seems odd. One intuitively assumes that any defined
Unicode character has a Unicode name.
Martin v. Löwis
2004-08-22 09:52:26 UTC
Post by Ken Beesley
OK. I see that for 000A there is not now an official Unicode name in
4.0, and that "LINE FEED (LF)" is an alias. Such an alias, shown in
uppercase letters, indicates that it _was_ the name of the character in
The Unicode Standard, Version 1.0. See The Unicode Standard 4.0, p. 415
("Aliases"). This seems odd. One intuitively assumes that any defined
Unicode character has a Unicode name.
Indeed, this intuition is wrong. Other Unicode characters that don't
have names are:
- surrogates (U+D800..U+DFFF); it is debatable whether these are
characters, though
- private use characters (U+E000..U+F8FF, U+F0000..U+FFFFD,
U+100000..U+10FFFD).
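
A quick way to see this for one sample code point from each range, sketched for Python 3 (where a lone surrogate can be written in a str literal); the labels in samples are only for illustration.

```python
import unicodedata

# One representative code point from each nameless range.
samples = {
    "surrogate": "\ud800",
    "private use (BMP)": "\ue000",
    "private use (plane 15)": "\U000f0000",
    "private use (plane 16)": "\U00100000",
}

for label, c in samples.items():
    # Print the category and the name (None when there is no name).
    print(label, unicodedata.category(c), unicodedata.name(c, None))
```

Surrogates report category "Cs" and private use characters "Co", which distinguishes them from unassigned code points ("Cn").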

Regards,
Martin