Notes on String & Encoding Techniques

Strings and their encoding determine which languages a program can support.

Introduction

  • Many languages use symbols that cannot be represented in 7-bit ASCII, or even in a single 8-bit byte.
  • An encoding adds semantics to a sequence of bytes: it defines which character each byte pattern stands for.
  • Unicode is a table that maps every character to a number, called its code point.
  • Since Unicode defines more than 100k characters, 8 bits are not enough to number them all.
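
As a rough illustration of the points above, Python's built-in `ord` returns a character's code point, and `int.bit_length` shows how many bits that number needs. The sample characters and the helper name `code_point_bits` are chosen here for illustration only.

```python
# Sample characters picked for illustration: ASCII, Latin-1, CJK, emoji.
samples = ["A", "é", "见", "😀"]


def code_point_bits(ch: str) -> int:
    """Return the number of bits needed to store the code point of ch."""
    return ord(ch).bit_length()


for ch in samples:
    # ord() gives the Unicode code point as an integer.
    print(f"{ch!r}: U+{ord(ch):04X} needs {code_point_bits(ch)} bits")
```

Only the first sample fits in 7 bits; the CJK character and the emoji need 16 and 17 bits respectively.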

What is UTF-8

  • A Unicode code point needs up to 21 bits; a fixed-width encoding (UTF-32) pads every code point to 32 bits.
  • The idea of UTF-8 is a variable-length encoding. The common ASCII symbols use just one byte (8 bits), and other symbols use 2, 3, or 4 bytes.
  • UTF-8 is more memory-efficient than UTF-16 for ASCII-heavy text, because UTF-16 uses at least two bytes per symbol. For text dominated by CJK characters, UTF-16 (2 bytes each) can be smaller than UTF-8 (3 bytes each).
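
The size trade-off can be checked with a small sketch that encodes the same text three ways using Python's standard codecs. The helper name `encoded_sizes` and the sample strings are assumptions made for this example; the little-endian codec variants are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text for three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants skip the byte-order mark, so only the
        # characters themselves are counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for text in ["hello", "见見", "😀"]:
    print(text, encoded_sizes(text))
```

For `"hello"` UTF-8 is the smallest, for the two CJK characters UTF-16 is smaller, and for the emoji all three encodings take four bytes.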

How do Programs Use an Encoding?

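  • The program picks an encoding for its strings, usually through the language's string library.
  • UTF-8 is the most commonly used encoding.
  • The number of bytes in an encoded character is given by the number of leading 1 bits in its first byte: a leading 0 means a single byte (ASCII), while the prefixes 110, 1110, and 11110 mean 2, 3, and 4 bytes. Continuation bytes start with 10.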
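
The lead-byte rule for UTF-8 can be sketched as follows. This is a minimal illustration, not a full decoder: the function names `utf8_sequence_length` and `split_utf8` are made up for this example, and the code assumes well-formed input (it does not check continuation bytes).

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, given its first byte."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_utf8(data: bytes) -> list:
    """Split well-formed UTF-8 bytes into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_utf8("a见😀".encode("utf-8")))
```

A production decoder would also validate continuation bytes and reject overlong forms; in practice `bytes.decode("utf-8")` does this.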

Test

见/見
Unicode = U+89C1 / U+898B (UTF-8 bytes: E8 A7 81 / E8 A6 8B)
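
The code points and UTF-8 bytes of the two test characters can be checked with a short Python sketch. The helper name `describe` is an assumption for this example, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
def describe(ch: str) -> str:
    """Format a character's code point and its UTF-8 bytes in hex."""
    utf8 = ch.encode("utf-8")
    return f"{ch}: U+{ord(ch):04X}, UTF-8 = {utf8.hex(' ').upper()}"


for ch in "见見":
    print(describe(ch))
# 见: U+89C1, UTF-8 = E8 A7 81
# 見: U+898B, UTF-8 = E8 A6 8B
```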
