[ Printing a unicode string versus printing each character: why don't they match? ]
s = u'\U0001031e\U0001031d\U0001015c\U0001015d\U00010170\U0001014b\U00010169\U0001016a\U0001016c\xa6\U0001d32c\U0001010c\U0001013a\U00010109\U0001010b\U0001010d\U0001010f\U0001011c\U0001d1ca\U000201b3\U0001016d\U00010184\U00010184'
print s
for a in s: print a, repr(a)
I am using Mac OS X Lion with Python 2.7.2. The string prints fine, but the loop prints '???' along with incorrect u'\u' Unicode values.
Here are the values printed:
u'\ud800' u'\udf1e' u'\ud800' u'\udf1d' u'\ud800' u'\udd5c' u'\ud800' u'\udd5d' u'\ud800' u'\udd70' u'\ud800' u'\udd4b' u'\ud800' u'\udd69' u'\ud800' u'\udd6a' u'\ud800' u'\udd6c' u'\xa6' u'\ud834' u'\udf2c' u'\ud800' u'\udd0c' u'\ud800' u'\udd3a' u'\ud800' u'\udd09' u'\ud800' u'\udd0b' u'\ud800' u'\udd0d' u'\ud800' u'\udd0f' u'\ud800' u'\udd1c' u'\ud834' u'\uddca' u'\ud840' u'\uddb3' u'\ud800' u'\udd6d' u'\ud800' u'\udd84' u'\ud800' u'\udd84'
Answer 1
You are printing Unicode characters outside the Basic Multilingual Plane (code points above \uffff) on a narrow (UCS-2) Python build. Such a build stores each of these characters internally as two code units, a UTF-16 surrogate pair. One side effect is that iterating over the string yields the two halves separately: a lead surrogate in the range \uD800 to \uDBFF, followed by a trail surrogate in the range \uDC00 to \uDFFF. Neither half is a valid character on its own, which is why the loop prints '???'.
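The arithmetic behind a surrogate pair can be sketched as follows. This is only an illustration of the standard UTF-16 encoding rule, not code from Python itself; the function names to_surrogate_pair and from_surrogate_pair are made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (lead, trail) UTF-16 surrogates."""
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    lead = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    trail = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> 0xDC00..0xDFFF
    return lead, trail


def from_surrogate_pair(lead, trail):
    """Recombine a (lead, trail) surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


# First character of the string in the question: U+1031E
print("%04x %04x" % to_surrogate_pair(0x1031E))  # d800 df1e
```

This matches the first two values in the question's output, u'\ud800' and u'\udf1e'.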
To fix this, either recompile your Python with wide Unicode support (configure with --enable-unicode=ucs4), or upgrade to Python 3.3, which has a new internal Unicode representation that switches between 1-, 2- and 4-byte-wide characters as needed.
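If neither rebuilding nor upgrading is practical, one possible workaround is to rejoin the surrogate halves while looping. The helper below is a minimal sketch and not part of the original answer; the name iter_code_points is hypothetical. It assumes the string contains well-formed pairs and passes any stray surrogate through unchanged.

```python
def iter_code_points(text):
    """Yield one item per character, joining UTF-16 surrogate pairs.

    On a narrow build each yielded item for a non-BMP character is a
    two-unit string; on a wide build (or Python 3.3+) the string holds no
    surrogate pairs, so every item is simply one character.
    """
    i = 0
    n = len(text)
    while i < n:
        unit = text[i]
        is_lead = u'\ud800' <= unit <= u'\udbff'
        if is_lead and i + 1 < n and u'\udc00' <= text[i + 1] <= u'\udfff':
            yield unit + text[i + 1]   # lead + trail form one character
            i += 2
        else:
            yield unit                 # ordinary BMP character (or stray surrogate)
            i += 1
```

In the question's loop, this would mean iterating over iter_code_points(s) instead of s, so that each printed item is a complete character.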
You can test your Python's Unicode support by looking at sys.maxunicode; if that value is equal to 65535 you have a narrow (default) build, while in a wide build the value is 1114111.
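As a quick sketch, the check can be wrapped in a small function. The name unicode_build and its optional parameter are made up for this example; only sys.maxunicode comes from the standard library.

```python
import sys


def unicode_build(maxunicode=None):
    """Return 'narrow' or 'wide' for the given (or current) sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:      # 65535: characters above U+FFFF use surrogate pairs
        return "narrow"
    return "wide"                 # 1114111 (0x10FFFF): full Unicode range


# Python 3.3 and later always report 1114111, so this prints "wide" there.
print(unicode_build())
```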
The default Python on Mac OS X is a narrow Unicode build; most recent Linux distributions enable the wide Unicode flag.