[ converting string to unicode in python ]
I am trying to convert a string type to Unicode in Python. I want it to work for any non-english string, for example Japanese, Chinese or Spanish.
For example, japanese_var has some japanese characters [ドキュメントを翻訳します].
Printing it would give,
'\x83h\x83L\x83\x85\x83\x81\x83\x93\x83g\x82\xf0\x96|\x96\xf3\x82\xb5\x82\xdc\x82\xb7'
Checking its type,
type(japanese_var)
<type 'str'>
How can I convert it to type 'unicode'?
Should i use japanese_var.decode('mbcs')? What could be the consequences of using this code as i will be using it on different OS platforms & different foreign Locale?
I am using python 2.5.4
I am reading the parameter which can be any non-english string of a file from its properties.
Answer 1
You need to know the encoding of the input string. There is no reliable universal solution.
The encoding should be available from the source of the input string. For instance, if you're taking text from a web page, the encoding should be indicated as part of the HTTP Content-Type
, either as a HTTP response header from the server or as <meta>
tag in the page source.
Once you know the encoding, use the decode
method.
This string appears to be Shift-JIS:
>>> x = '\x83h\x83L\x83\x85\x83\x81\x83\x93\x83g\x82\xf0\x96|\x96\xf3\x82\xb5\x82\xdc\x82\xb7'
>>> print x.decode( "shift-jis" )
ドキュメントを翻訳します
Answer 2
It worked for me by passing "mbcs" to decode for any locale.
Thanks guys for your help.