Notes on String & Encoding Techniques

Strings and their encoding determine which languages a program can support.

Introduction

  • Many languages have symbols that cannot fit in the 8 bits of an ASCII-style single-byte representation.
  • An encoding adds semantics to a sequence of bytes: it tells us how to interpret those bytes as characters.
  • Unicode is a table that assigns every character a numeric code point (a few are printed in the snippet after this list).
  • Since there are well over 100,000 assigned code points, 8 bits are not enough.
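
A small Python sketch, purely for illustration, of the character-to-code-point mapping using the built-in ord() function (the sample characters are my own choice):

    # Unicode assigns each character a numeric code point.
    # ord() returns the code point; chr() maps it back to a character.
    for ch in ["A", "é", "见"]:
        cp = ord(ch)
        print(f"{ch!r} -> U+{cp:04X} ({cp})")

    # 'A'  -> U+0041 (65)      fits in the 7-bit ASCII range
    # 'é'  -> U+00E9 (233)     outside 7-bit ASCII, still below 256
    # '见' -> U+89C1 (35265)   needs more than 8 bits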

What is UTF-8

  • A Unicode code point needs up to 21 bits; a fixed-width representation (UTF-32) pads each code point to 32 bits.
  • The idea of UTF-8 is a variable-length encoding of symbols: the common ASCII symbols use just one byte, and other symbols use 2, 3, or 4 bytes.
  • For ASCII-heavy text, UTF-8 is more memory-efficient than UTF-16, which uses a minimum of two bytes per symbol (compare the byte counts in the snippet after this list).
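
A minimal Python sketch comparing byte counts for the same characters under UTF-8 and UTF-16; the specific characters are chosen only as examples:

    # Byte counts for the same characters in UTF-8 vs UTF-16.
    for ch in ["A", "é", "€", "见", "😀"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")   # big-endian, no byte-order mark
        print(f"{ch!r}: UTF-8 = {len(utf8)} byte(s) {utf8.hex()}, "
              f"UTF-16 = {len(utf16)} byte(s) {utf16.hex()}")

    # 'A' : UTF-8 = 1 byte  (41),       UTF-16 = 2 bytes (0041)
    # 'é' : UTF-8 = 2 bytes (c3a9),     UTF-16 = 2 bytes (00e9)
    # '€' : UTF-8 = 3 bytes (e282ac),   UTF-16 = 2 bytes (20ac)
    # '见': UTF-8 = 3 bytes (e8a781),   UTF-16 = 2 bytes (89c1)
    # '😀': UTF-8 = 4 bytes (f09f9880), UTF-16 = 4 bytes (d83dde00)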

How do Programs Use an Encoding?

  • The program picks a library (or relies on the language's built-in string type) to encode and decode strings.
  • UTF-8 is the most common encoding in use.
  • The number of bytes in an encoded character is indicated by the number of leading 1 bits in its first byte, as shown in the sketch below.
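
A sketch of how a decoder could derive the sequence length from the leading bits of the first byte; utf8_sequence_length is a hypothetical helper written here only to illustrate the rule, not any particular library's API:

    def utf8_sequence_length(first_byte: int) -> int:
        """Number of bytes in a UTF-8 sequence, read from the leading
        1 bits of its first byte (illustrative helper)."""
        if first_byte < 0b1000_0000:      # 0xxxxxxx -> 1 byte (ASCII)
            return 1
        if first_byte >> 5 == 0b110:      # 110xxxxx -> 2 bytes
            return 2
        if first_byte >> 4 == 0b1110:     # 1110xxxx -> 3 bytes
            return 3
        if first_byte >> 3 == 0b11110:    # 11110xxx -> 4 bytes
            return 4
        raise ValueError("continuation byte or invalid leading byte")

    data = "A见".encode("utf-8")          # b'A\xe8\xa7\x81'
    print(utf8_sequence_length(data[0]))  # 1  (0x41 = 0xxxxxxx)
    print(utf8_sequence_length(data[1]))  # 3  (0xE8 = 1110xxxx)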

Test

见 / 見
Code points: U+89C1 / U+898B
UTF-8 bytes: E8 A7 81 / E8 A6 8B
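
The values above can be checked with a short Python snippet (assuming a UTF-8-capable terminal):

    # Code points and UTF-8 bytes for the test characters.
    for ch in ["见", "見"]:
        print(f"{ch}: U+{ord(ch):04X}, UTF-8 = {ch.encode('utf-8').hex()}")

    # 见: U+89C1, UTF-8 = e8a781
    # 見: U+898B, UTF-8 = e8a68b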
