unicodedata name for \u000a

Discussion:

Ken Beesley

2004-08-21 19:24:04 UTC

Newbie question: on unicodedata.name

If I do

import unicodedata
unicodedata.name(u"a")
or
unicodedata.name(u"\u0061")

I get
'LATIN SMALL LETTER A'

as expected; but when I follow that with

unicodedata.name(u"\u000a")

I get

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: no such name

There is, of course, a Unicode name for \u000a,
which is 'LINE FEED' or perhaps 'LINE FEED (A)'.

Is there a gap in unicodedata? or in my understanding?

Thanks,

Ken

Christos "TZOTZIOY" Georgiou

2004-08-21 19:38:57 UTC

On Sat, 21 Aug 2004 21:24:04 +0200, rumours say that Ken Beesley
<***@xrce.xerox.com> might have written:

[snip]

Post by Ken Beesley
unicodedata.name(u"\u000a")
I get
File "<stdin>", line 1, in ?
ValueError: no such name
There is, of course, a Unicode name for \u000a,
which is 'LINE FEED' or perhaps 'LINE FEED (A)'.
Is there a gap in unicodedata? or in my understanding?

It seems that all control characters (u"\u0000" to u"\u001f") have no
names in unicodedata. Don't know if this is an omission (ie bug) or
intentional.
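The observation is easy to check. The following is a minimal sketch, assuming Python 3 (where chr() covers all of Unicode); has_name is a hypothetical helper introduced only for this illustration, and it relies on the optional default argument of unicodedata.name.

```python
import unicodedata

def has_name(ch):
    """Return True if unicodedata reports an official name for ch."""
    # With a default supplied, unicodedata.name returns it instead of raising.
    return unicodedata.name(ch, None) is not None

# Collect any C0 control code points (U+0000..U+001F) that do have a name.
c0_named = [cp for cp in range(0x00, 0x20) if has_name(chr(cp))]

print(c0_named)        # the list of named C0 controls, if any
print(has_name(u"a"))  # an ordinary letter, for comparison
```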

--
TZOTZIOY, I speak England very best,
"Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek

Peter Kleiweg

2004-08-21 19:45:53 UTC

Post by Ken Beesley
There is, of course, a Unicode name for \u000a,

No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf

--
Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/~kleiweg/ls.html

Tor Iver Wilhelmsen

2004-08-22 08:53:17 UTC

Post by Peter Kleiweg
No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf

Quoting that document:

Alias names are those for ISO/IEC 6429:1992.
Commonly used alternative aliases are also shown.

000A LF <control>
= LINE FEED (LF)

So the authors of unicodedata.name() could have picked either
'<control>', the ASCII name 'LF' or the alternative 'LINE FEED (LF)'.
Not picking any of them seems strange, and as the OP pointed out,
leads to an error even though the "C0 Controls" part of that page *is*
part of Unicode.

Martin v. Löwis

2004-08-22 09:39:10 UTC

Post by Tor Iver Wilhelmsen
000A LF <control>
= LINE FEED (LF)
So the authors of unicodedata.name() could have picked either
'<control>', the ASCII name 'LF' or the alternative 'LINE FEED (LF)'.

No. <control> is not a character name. The unicodedata.name function
returns the official character name, so it MUST NOT return an alias
(which rules out your second alternative).

Post by Tor Iver Wilhelmsen
Not picking any of them seems strange, and as the OP pointed out,
leads to an error even though the "C0 Controls" part of that page *is*
part of Unicode.

Yes. However, this strangeness originates from the Unicode
specification. Control characters simply do not have a name.

If you want to know whether a code point is an unassigned character,
check whether unicodedata.category returns "Cn".
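A minimal sketch of that check, assuming Python 3 and using unicodedata.category, the function that reports a character's general category. The helper describe is hypothetical and exists only for this illustration. U+FFFF is used as the sample because it is a noncharacter, which is permanently unassigned.

```python
import unicodedata

def describe(ch):
    """Return (general category, official name or None) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)

print(describe(u"\u000a"))  # a control character: assigned, but nameless
print(describe(u"\u0061"))  # an ordinary letter with a name
print(describe(u"\uffff"))  # a noncharacter: category "Cn", no name
```

A missing name alone therefore does not mean the code point is unassigned; the category distinguishes the two cases.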

Regards,
Martin

Tor Iver Wilhelmsen

2004-08-22 13:41:13 UTC

Post by Martin v. Löwis
No. <control> is not a character name. The unicodedata.name function
returns the official character name, so it MUST NOT return an alias
(which rules out your second alternative).

Then why not return None or the empty string instead of raising an
exception?

Martin v. Löwis

2004-08-22 15:15:20 UTC

Post by Tor Iver Wilhelmsen
Then why not return None or the empty string instead of raising an
exception?

Why does a dictionary lookup raise a KeyError instead of returning
None or an empty string? It's easy enough to add a function that
does what you want:

import unicodedata

def name(c):
    # Return the official name of c, or None if it has none.
    try:
        return unicodedata.name(c)
    except ValueError:
        return None

Python reports failures through exceptions, not through special
return values. It might have been an option initially to return
None. Now, it cannot be changed for backwards compatibility.

Regards,
Martin

Peter Otten

2004-08-22 16:00:54 UTC

Post by Tor Iver Wilhelmsen

Then why not return None or the empty string instead of raising an
exception?

What's wrong with

>>> import unicodedata
>>> unicodedata.name(u"\u000a", "my default value")
'my default value'

Peter

Ken Beesley

2004-08-22 09:39:53 UTC

Post by Ken Beesley
There is, of course, a Unicode name for \u000a,

Peter Kleiweg wrote:

No, there isn't. Check
http://www.unicode.org/charts/PDF/U0000.pdf

OK. I see that for 000A there is not now an official Unicode name in
4.0, and that "LINE FEED (LF)" is an alias. Such an alias, shown in
uppercase letters, indicates that it _was_ the name of the character in
The Unicode Standard, Version 1.0. See The Unicode Standard 4.0, p. 415
("Aliases"). This seems odd. One intuitively assumes that any defined
Unicode character has a Unicode name.

Martin v. Löwis

2004-08-22 09:52:26 UTC

Post by Ken Beesley
OK. I see that for 000A there is not now an official Unicode name in
4.0, and that "LINE FEED (LF)" is an alias. Such an alias, shown in
uppercase letters, indicates that it _was_ the name of the character in
The Unicode Standard, Version 1.0. See The Unicode Standard 4.0, p. 415
("Aliases"). This seems odd. One intuitively assumes that any defined
Unicode character has a Unicode name.

Indeed, this intuition is wrong. Other Unicode characters that don't
have names are:
- surrogates (U+D800..U+DFFF); it is debatable whether these are
characters, though
- private use characters (U+E000..U+F8FF, U+F0000..U+FFFFD,
U+100000..U+10FFFD).
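These cases can be probed in the same way. This is a minimal sketch, assuming Python 3, where chr() accepts any code point including lone surrogates; category_and_name is a hypothetical helper introduced only for this illustration.

```python
import unicodedata

def category_and_name(cp):
    """Return (general category, official name or None) for code point cp."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, None)

# One sample code point from each range listed above.
samples = [
    ("surrogate", 0xD800),
    ("BMP private use", 0xE000),
    ("plane 15 private use", 0xF0000),
    ("plane 16 private use", 0x100000),
]
for label, cp in samples:
    print(label, "U+%04X" % cp, category_and_name(cp))
```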

Regards,
Martin