Discussion:
how to transfer my utf8 code saved in a file to gbk code
higer
2009-06-07 12:55:08 UTC
My file contains such strings :
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

I want to read the content of this file and convert it to the
corresponding GBK encoding, a Chinese character encoding.
Every time I try to convert it, I get the same output no
matter which method I use.
It seems that when Python reads it, Python takes '\' as an
ordinary character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then output
'correctly', but that's not what I want.

Can anyone help me?


Thanks in advance.
R. David Murray
2009-06-07 14:13:45 UTC
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
Post by higer
I want to read the content of this file and convert it to the
corresponding GBK encoding, a Chinese character encoding.
You'll have to convert it from hex-escape into UTF8 first, then.

Perhaps better would be to write the original input files in UTF8,
since it sounds like that is what you were intending to do.
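
A minimal sketch of that suggestion, assuming Python 3 (the thread itself uses Python 2.5) and a temporary file path invented for the illustration. If the file holds the characters themselves encoded as UTF-8, the sample takes 9 bytes on disk and no unescaping step is needed before converting to GBK:

```python
import os
import tempfile

text = "\u65e5\u671f\uff1a"  # the three characters from the original post's sample

# Hypothetical path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")

# Write the characters themselves, encoded as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to see what is really on disk.
with open(path, "rb") as f:
    raw = f.read()

# Decode from UTF-8, then re-encode as GBK.
gbk_bytes = raw.decode("utf-8").encode("gbk")
print(len(raw), len(gbk_bytes))
```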

--
R. David Murray http://www.bitdance.com
IT Consulting System Administration Python Programming
John Machin
2009-06-07 23:47:22 UTC
Post by R. David Murray
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
MRAB
2009-06-08 00:20:43 UTC
Post by John Machin
Post by R. David Murray
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
Maybe he means that the file itself is in ASCII.
John Machin
2009-06-08 00:32:58 UTC
Post by MRAB
Post by John Machin
Post by R. David Murray
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
Maybe he means that the file itself is in ASCII.
Maybe indeed, but only so because hex escape codes are by design in
ASCII. "in ASCII" is redundant ... I can't imagine how the OP parsed
"ASCII <omitted 'because it is'> encoded" given that his native
tongue's grammar varies from that of English in several interesting
ways :-)
higer
2009-06-08 02:36:57 UTC
Post by MRAB
Post by John Machin
Post by R. David Murray
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
Maybe he means that the file itself is in ASCII.
Yes, my file itself is in ASCII.
R. David Murray
2009-06-08 05:10:28 UTC
Post by John Machin
Post by R. David Murray
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
Well, you are correct that the OP might have had trouble parsing my
English. My English is more or less valid ("[the file] is _in_ ASCII",
i.e. consists of ASCII characters, "encoded as hexadecimal escape codes",
which specifies the encoding used). But better perhaps would have been
to just say that the data is encoded as hexadecimal escape sequences.

--David
John Machin
2009-06-07 15:25:15 UTC
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
Post by higer
I want to read the content of this file and convert it to the
corresponding GBK encoding, a Chinese character encoding.
Every time I try to convert it, I get the same output no
matter which method I use.
It seems that when Python reads it, Python takes '\' as an
ordinary character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then output
'correctly', but that's not what I want.
Can anyone help me?
try this:

utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this 日期: is that what
you expected?
gbk_data = unicode_data.encode('gbk')
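
The 'string-escape' codec used above exists only in Python 2. As a rough sketch, assuming Python 3 and input made up solely of ASCII text with \xNN escapes, the same three steps could look like this (going through 'unicode_escape' and 'latin-1' is one possible substitute, not the method from the post):

```python
# The 36 ASCII characters as they sit in the file: backslash, x, two hex digits.
your_data = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

# Step 1: turn each 4-character escape into the single byte it names.
# unicode_escape yields code points 0-255; latin-1 maps them back to bytes.
utf8_data = your_data.decode("unicode_escape").encode("latin-1")

# Step 2: interpret those 9 bytes as UTF-8 to get text.
unicode_data = utf8_data.decode("utf-8")

# Step 3: encode the text as GBK.
gbk_data = unicode_data.encode("gbk")

print(len(your_data), len(utf8_data), len(gbk_data))
```

Note that 'unicode_escape' also interprets other escapes such as \n or \u, so this substitute is only safe when the input contains nothing but the \xNN form.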

If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')

(2) show us the source of the script that you used
(3) Tell us what "doesn't work" means in this case
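
The evidence requested in (1) answers one question: does the file hold literal \xNN escapes or raw bytes? A small sketch of that check, assuming Python 3; the helper name `describe` and its return strings are invented for illustration:

```python
import re

# Matches one literal escape such as \xe6 (four ASCII characters).
HEX_ESCAPE = re.compile(rb"\\x[0-9a-fA-F]{2}")

def describe(data: bytes) -> str:
    """Hypothetical helper: guess whether data holds hex escapes or raw bytes."""
    if HEX_ESCAPE.search(data):
        return "escaped text"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown bytes"
    return "valid UTF-8"

print(describe(rb"\xe6\x97\xa5"))  # literal backslashes, 12 bytes
print(describe(b"\xe6\x97\xa5"))   # three raw bytes
```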

Cheers,
John
higer
2009-06-08 02:32:59 UTC
Post by John Machin
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I certainly get the correct result, but how do I
get the right answer when reading from the file?
Post by John Machin
Post by higer
I want to read the content of this file and convert it to the
corresponding GBK encoding, a Chinese character encoding.
Every time I try to convert it, I get the same output no
matter which method I use.
It seems that when Python reads it, Python takes '\' as an
ordinary character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then output
'correctly', but that's not what I want.
Can anyone help me?
utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this 日期: is that what
you expected?
You are right, the result is 日期, which is just what I expect. If you save the
string in a variable, you can surely get the correct result. But it is
just a sample, so I gave a short string. What if there are many characters in
a file?
Post by John Machin
gbk_data = unicode_data.encode('gbk')
I have tried the method you just told me, but unfortunately it
does not work (mess code).
Post by John Machin
(1) give us some unambiguous hard evidence about the contents of your
e.g. # assuming Python 2.x
My Python version is 2.5.2
Post by John Machin
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')
The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
Post by John Machin
(2) show us the source of the script that you used
def UTF8ToChnWords():
    f = open("123.txt","rb")
    content=f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

    pass
if __name__ == '__main__':
    UTF8ToChnWords()
Post by John Machin
(3) Tell us what "doesn't work" means in this case
It doesn't work because no matter how we deal with it, we keep
getting a 36-byte string, not a 9-byte one. Thus, we cannot get the correct
answer.
Post by John Machin
Cheers,
John
Thank you very much,
higer
Mark Tolonen
2009-06-08 05:58:16 UTC
Post by higer
Post by John Machin
Post by higer
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I certainly get the correct result, but how do I
get the right answer when reading from the file?
Did you create this file? If it is 36 characters, it contains literal
backslash characters, not the 9 bytes that would correctly encode as UTF-8.
If you created the file yourself, show us the code.
Post by higer
Post by John Machin
Post by higer
I want to read the content of this file and convert it to the
corresponding GBK encoding, a Chinese character encoding.
Every time I try to convert it, I get the same output no
matter which method I use.
It seems that when Python reads it, Python takes '\' as an
ordinary character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then output
'correctly', but that's not what I want.
Can anyone help me?
utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this 日期: is that what
you expected?
You are right, the result is 日期, which is just what I expect. If you save the
string in a variable, you can surely get the correct result. But it is
just a sample, so I gave a short string. What if there are many characters in
a file?
Post by John Machin
gbk_data = unicode_data.encode('gbk')
I have tried the method you just told me, but unfortunately it
does not work (mess code).
How are you determining this is 'mess code'? How are you viewing the
result? You'll need to use a viewer that understands GBK encoding, such as
Notepad on a Chinese-language version of Windows.
Post by higer
Post by John Machin
(1) give us some unambiguous hard evidence about the contents of your
e.g. # assuming Python 2.x
My Python version is 2.5.2
Post by John Machin
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')
'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
Post by John Machin
(2) show us the source of the script that you used
f = open("123.txt","rb")
content=f.read()
print repr(content)
print len(content)
print content.count("\\")
print content.count("x")
Try:

utf8data = content.decode('string-escape')
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print len(gbkdata),repr(gbkdata)
open("456.txt","wb").write(gbkdata)

The print should give:

6 '\xc8\xd5\xc6\xda\xa3\xba'

This is correct for GBK encoding. 456.txt should contain the 6 bytes of GBK
data. View the file with a program that understands GBK encoding.
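
If no GBK-aware viewer is at hand, one way to check those six bytes is to decode them again in Python. A short sketch, assuming Python 3; it also shows why the same bytes look like garbage when a program assumes a different encoding (Latin-1 is used here purely as an example of a wrong guess):

```python
gbk_data = b"\xc8\xd5\xc6\xda\xa3\xba"  # the six bytes expected for the sample

# Decoding with the right codec recovers the original three characters.
as_gbk = gbk_data.decode("gbk")

# Decoding with a wrong codec yields six unrelated Latin-1 characters.
as_latin1 = gbk_data.decode("latin-1")

print(as_gbk == "\u65e5\u671f\uff1a")
print(len(as_gbk), len(as_latin1))
```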

-Mark
higer
2009-06-08 06:15:39 UTC
Thank you Mark,
that works.

Decoding the content with 'string-escape' first is the key
point, so I can get the Chinese characters now.
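
For the earlier question about files with many characters, the same steps could be applied line by line so the whole file never has to sit in memory. A sketch, assuming Python 3, that no escape sequence or multi-byte character is split across a line break, and that no line ends in a stray backslash; `convert_escaped_file` is a made-up helper name, and temporary files stand in for 123.txt and 456.txt:

```python
import os
import tempfile

def convert_escaped_file(src_path, dst_path):
    """Hypothetical helper: read \\xNN escapes line by line, write GBK bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Unescape to the real UTF-8 bytes (Python 3 stand-in for 'string-escape').
            utf8_bytes = line.decode("unicode_escape").encode("latin-1")
            # Decode UTF-8 to text, then encode that text as GBK.
            dst.write(utf8_bytes.decode("utf-8").encode("gbk"))

# Demonstration with temporary stand-ins for the thread's 123.txt and 456.txt.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "123.txt")
dst_path = os.path.join(workdir, "456.txt")
with open(src_path, "wb") as f:
    f.write(rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a" + b"\n")

convert_escaped_file(src_path, dst_path)

with open(dst_path, "rb") as f:
    result = f.read()
print(len(result))
```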




Regards,
-higer
