
[ Printing a unicode string versus printing each character: why doesn't the output match? ]

s = u'\U0001031e\U0001031d\U0001015c\U0001015d\U00010170\U0001014b\U00010169\U0001016a\U0001016c\xa6\U0001d32c\U0001010c\U0001013a\U00010109\U0001010b\U0001010d\U0001010f\U0001011c\U0001d1ca\U000201b3\U0001016d\U00010184\U00010184'
print s
for a in s: print a, repr(a)

I'm on Mac OS X Lion with Python 2.7.2. The string prints fine, but the loop prints '???' along with incorrect u'\u' Unicode values.

Here are the values printed:

u'\ud800' u'\udf1e' u'\ud800' u'\udf1d' u'\ud800' u'\udd5c' u'\ud800' u'\udd5d' u'\ud800' u'\udd70' u'\ud800' u'\udd4b' u'\ud800' u'\udd69' u'\ud800' u'\udd6a' u'\ud800' u'\udd6c' u'\xa6' u'\ud834' u'\udf2c' u'\ud800' u'\udd0c' u'\ud800' u'\udd3a' u'\ud800' u'\udd09' u'\ud800' u'\udd0b' u'\ud800' u'\udd0d' u'\ud800' u'\udd0f' u'\ud800' u'\udd1c' u'\ud834' u'\uddca' u'\ud840' u'\uddb3' u'\ud800' u'\udd6d' u'\ud800' u'\udd84' u'\ud800' u'\udd84'

Answer 1


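You are printing Unicode characters outside the Basic Multilingual Plane (code points above \uffff, which take 4 bytes in UTF-8) on a UCS-2 (narrow) Python build. Such a build stores each of these characters internally as two code units, a UTF-16 surrogate pair: a lead value between \uD800 and \uDBFF followed by a trail value in the \uDC00 to \uDFFF range. Printing the whole string works because each pair stays together when the string is encoded for output. Looping over the string yields each surrogate separately, and a lone surrogate is not a displayable character, which is why you see '???' and values such as u'\ud800' instead of the code points you wrote.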
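The pair values in the question's output follow directly from the UTF-16 encoding rule. As a minimal sketch (the function name `to_surrogate_pair` is made up for illustration), the arithmetic looks like this:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (lead, trail) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    lead = 0xD800 + (offset >> 10)       # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return lead, trail

# First character of s in the question: \U0001031e
lead, trail = to_surrogate_pair(0x1031E)
print("%04x %04x" % (lead, trail))       # d800 df1e
```

This matches the first two values the loop printed, u'\ud800' and u'\udf1e'.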

You need to recompile Python with wide Unicode support, or upgrade to Python 3.3, which has a new internal Unicode representation that switches between 1-, 2- and 4-byte-wide characters as needed.
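If neither recompiling nor upgrading is practical, one possible workaround is to walk the string and rejoin each lead/trail pair before printing. This is only a sketch and not part of the advice above; the helper name `iter_code_points` is hypothetical.

```python
def iter_code_points(text):
    """Yield one string per character, keeping surrogate pairs together."""
    i = 0
    while i < len(text):
        is_pair = (
            u'\ud800' <= text[i] <= u'\udbff'
            and i + 1 < len(text)
            and u'\udc00' <= text[i + 1] <= u'\udfff'
        )
        step = 2 if is_pair else 1
        yield text[i:i + step]
        i += step

# A lead/trail pair followed by one ordinary character.
sample = u'\ud800\udf1e\xa6'
chunks = list(iter_code_points(sample))
print(len(chunks))                       # 2
```

On a narrow build, looping with `for a in iter_code_points(s)` instead of `for a in s` should then hand each complete character to print, rather than half of a pair.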

You can test your Python's Unicode support by looking at sys.maxunicode: if that value is 65535 you have a narrow (default) build; in a wide build the value is 1114111.
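A quick way to run that check is shown in this small sketch; the function name `unicode_build` is an illustrative choice, and the parameter exists only so the logic can be exercised with either value:

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """Return 'narrow' for a UCS-2 build (maxunicode 65535), else 'wide'."""
    return "narrow" if maxunicode == 0xFFFF else "wide"

# Report the running interpreter's limit and which kind of build that implies.
print("%d %s" % (sys.maxunicode, unicode_build()))
```

On Python 3.3 and later this always reports the wide value, since the narrow/wide distinction no longer exists there.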

The default Python on Mac OS X is a narrow Unicode build; most recent Linux distributions enable the wide Unicode flag.