Strings and their encoding determine which languages a program can support.
- Many languages have symbols whose binary representation needs more than the 8 bits ASCII provides.
- An encoding gives semantics to a sequence of bytes.
- Unicode is a table that maps every character to a numeric code point.
- Since there are more than 100,000 symbols, 8 bits are not enough; the sketch below shows a few code points.
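A minimal sketch in Python (the sample characters are arbitrary choices, not from the original notes) that prints each character's code point, showing that only ASCII fits in the 0-127 range:

```python
# Print the Unicode code point of a few characters.
for ch in ["A", "é", "见"]:
    print(ch, hex(ord(ch)))

# Expected output:
# A 0x41     -> within ASCII (0-127), fits in one byte
# é 0xe9     -> outside ASCII, though still below 256
# 见 0x89c1  -> needs more than 8 bits
```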
What is UTF-8?
- A Unicode code point needs up to 21 bits; a fixed-width encoding such as UTF-32 pads every code point to 32 bits.
- The idea of UTF-8 is variable-width encoding: common ASCII symbols need just one byte (8 bits), while other symbols use 2, 3, or 4 bytes.
- For ASCII-heavy text, UTF-8 is more memory efficient than UTF-16, which uses a minimum of two bytes per symbol (see the sketch below).
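A minimal sketch in Python comparing encoded sizes; the sample characters are assumptions chosen to cover the 1-, 2-, 3-, and 4-byte UTF-8 cases:

```python
# Compare how many bytes UTF-8 and UTF-16 need per character.
for ch in ["A", "é", "见", "😀"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no BOM, for clarity
    print(ch, len(utf8), "UTF-8 bytes,", len(utf16), "UTF-16 bytes")

# Expected output:
# A 1 UTF-8 bytes, 2 UTF-16 bytes
# é 2 UTF-8 bytes, 2 UTF-16 bytes
# 见 3 UTF-8 bytes, 2 UTF-16 bytes
# 😀 4 UTF-8 bytes, 4 UTF-16 bytes
```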
How do Programs Use an Encoding?
- A program picks an encoding for its strings, usually through a library or the language runtime.
- UTF-8 is the most common encoding in use.
- The number of bytes in an encoded character is determined by the number of leading 1 bits in its first byte.
- 见/見 code points: U+89C1/U+898B, which UTF-8 encodes as the bytes E8 A7 81 / E8 A6 8B (see the sketch below).
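A minimal sketch in Python (the helper name utf8_length_from_first_byte is hypothetical) that counts the leading 1 bits of the first byte to find how many bytes the encoded character occupies:

```python
def utf8_length_from_first_byte(b: int) -> int:
    """Return the byte length of a UTF-8 character from its first byte."""
    if b < 0b1000_0000:
        return 1                    # 0xxxxxxx -> ASCII, single byte
    n = 0
    while b & 0x80:                 # count leading 1 bits
        n += 1
        b = (b << 1) & 0xFF
    return n                        # 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4

for ch in ["见", "見"]:
    data = ch.encode("utf-8")
    print(ch, hex(ord(ch)), data.hex(" "), "->",
          utf8_length_from_first_byte(data[0]), "bytes")

# Expected output:
# 见 0x89c1 e8 a7 81 -> 3 bytes
# 見 0x898b e8 a6 8b -> 3 bytes
```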