encode()/decode()
A Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0–255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding. Python's default encoding is 'ASCII'. However, UTF-8 is probably the most commonly supported encoding.
UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:
- If the code point is <128, it’s represented by the corresponding byte value.
- If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
- Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
- It can handle any Unicode code point.
- A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as
strcpy()
and sent through protocols that can’t handle zero bytes. - A string of ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
- If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
Converting from bytes to str and back.
bstr = b'qwerty'
type(bstr)
<class 'bytes'>
str = bstr.decode('utf-8')
type(str)
<class 'str'>
The method encode() returns an encoded version of the string.
bstr2 = str.encode('utf-8')
type(bstr2)
<class 'bytes'>
References: https://docs.python.org/2/howto/unicode.html