Quick Read on UTF-8 in Golang

Raw Strings

We create a “raw string”, enclosed by back quotes, so it can contain only literal text; escape sequences such as \n are not interpreted.
Regular strings, enclosed by double quotes, can contain escape sequences.

package main

import (
	"fmt"
)

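func main() {
	// Raw string: backslashes are kept verbatim, so this prints go\\n.
	fmt.Println(`go\\n`)
	// Interpreted string: \\ is an escaped backslash, so this prints escapedGo\n.
	fmt.Println("escapedGo\\n")
}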

Output

go\\n
escapedGo\n

  • A raw string is always valid UTF-8 because it is part of the Go source code, which is always UTF-8.
  • Strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
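
The "almost always" above can be checked with a small sketch. It assumes a helper named isUTF8 (a hypothetical name, not from the original post) that wraps utf8.ValidString from the standard library, and uses the \xff byte escape as one way to put a non-UTF-8 byte into a string literal.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isUTF8 is an illustrative helper: it reports whether s consists
// entirely of valid UTF-8 sequences.
func isUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	raw := `héllo`        // raw literal: bytes come straight from the UTF-8 source file
	escaped := "h\xffllo" // \xff injects a single byte that is not valid UTF-8

	fmt.Println(isUTF8(raw))     // true
	fmt.Println(isUTF8(escaped)) // false
}
```

Only the interpreted literal can break UTF-8 validity here, because a raw string has no escape mechanism for inserting arbitrary bytes.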

Rune

  • 32-bit integer (an alias for int32)
  • Used to represent a Unicode code point; UTF-8 encodes each code point in up to 4 bytes
  • Example
    • The Unicode code point U+0061 is the lower case Latin letter ‘A’: a.
    • There are at least two ways to write the letter à:
      • The single code point U+00E0
      • The letter a (U+0061) followed by the combining grave accent (U+0300)

So there are different byte sequences for the same character!
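
As a rough illustration of the two spellings of à, the sketch below prints the bytes of each form and compares them. The helper describe is a hypothetical name introduced for this example; it relies only on len and utf8.RuneCountInString from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is an illustrative helper: it returns the byte length
// and the number of code points (runes) in s.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00E0" // à as the single code point U+00E0
	combining := "a\u0300"  // a (U+0061) followed by combining grave accent (U+0300)

	fmt.Printf("% x\n", precomposed)      // c3 a0
	fmt.Printf("% x\n", combining)        // 61 cc 80
	fmt.Println(describe(precomposed))    // 2 1
	fmt.Println(describe(combining))      // 3 2
	fmt.Println(precomposed == combining) // false
}
```

Both strings render as the same character, yet they compare as unequal because string comparison in Go works on bytes.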

How does Golang handle literals?

  • A string literal with no byte-level escapes always holds a valid UTF-8 sequence.
  • This is always true for a raw string literal such as
    `aLiteral`

Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters.
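
A minimal sketch of the difference between indexing and iterating, assuming a helper called runesOf (a name made up for this example) that collects the code points produced by a range loop:

```go
package main

import "fmt"

// runesOf is an illustrative helper: it decodes s into its code points
// using a range loop, which steps through the string one rune at a time.
func runesOf(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "\u00E0 la" // "à la", with à written as the single code point U+00E0

	fmt.Println(len(s))          // 5: à occupies two bytes
	fmt.Printf("%x\n", s[0])     // c3: the first byte of à, not the character
	fmt.Println(len(runesOf(s))) // 4: à, space, l, a
}
```

Indexing with s[i] always returns a single byte, so a range loop (or a conversion to []rune) is the usual way to walk a string character by character.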
