How Do Strings Work in Golang?

Akash Jain
May 17, 2020

A computer understands only two things, 0 and 1. Every programming language knows how to show these 0s and 1s on a monitor as numbers, letters, or special symbols, and it does so based on their type. Declaring a variable explicitly tells the compiler its type: whether we will store an integer, a string, or a float in it. When we read a value from a variable, the program fetches it from memory as 0s and 1s and then, based on the variable's type, converts those bits into that type and gives the output.

var age int;age = 26

In RAM (memory) it is stored as `11010`, but when the program fetches it, it first looks at the type and converts the bits accordingly; in this case it converts them into 26.
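To see this bit pattern for yourself, here is a minimal sketch. The helper name `binaryOf` is made up for this illustration; it only asks `fmt` to print the same value in base 2.

```go
package main

import "fmt"

// binaryOf is a hypothetical helper that returns the base-2 text form of n.
func binaryOf(n int) string {
	return fmt.Sprintf("%b", n)
}

func main() {
	var age int
	age = 26
	fmt.Println(binaryOf(age)) // 11010
	fmt.Println(age)           // 26
}
```

The value in memory is the same in both cases; only the way we ask for it to be displayed changes.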

Modern computers support many languages. They can show every symbol that exists in the world, whether it is a Chinese, Spanish, or any other language's symbol. To do so, the computer has a list of all symbols, each with its own decimal number (this list is the Unicode standard, and the number is called a code point). You can see the list here.

The `$` symbol is stored as 100100 (36), and 㿝 is stored as 11111111011101 (16349).
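These numbers can be checked in Go, where a character literal is simply its number. This is a small sketch; `codePoint` is a hypothetical helper name, and the second symbol is written with the escape `\u3FDD` (hex for 16349) so the example does not depend on how the source file is saved.

```go
package main

import "fmt"

// codePoint is a hypothetical helper that returns the decimal number
// of a symbol together with that number written in binary.
func codePoint(r rune) (int, string) {
	return int(r), fmt.Sprintf("%b", r)
}

func main() {
	symbols := []rune{'$', '\u3FDD'} // '\u3FDD' is the 㿝 symbol from the text
	for _, r := range symbols {
		n, bits := codePoint(r)
		fmt.Printf("%c decimal=%d binary=%s\n", r, n, bits)
	}
}
```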

What is UTF-8 encoding

UTF stands for Unicode Transformation Format. The ‘8’ means it uses 8-bit blocks to represent a character.

UTF-8 is just a method of representing that binary number so that any symbol can be sent in the minimum number of chunks. In UTF-8, 1 chunk means 8 bits of data. To send the $ symbol, software converts its number to the binary 100100 and sends it in one chunk of 8 bits: `00100100`.

The 㿝 symbol has the binary 11111111011101, so to encode it in UTF-8 we need 3 chunks, that is 3*8 = 24 bits: `11100011` `10111111` `10011101`.

You will see some extra bits of data here. They are there so that anyone reading this series of bits knows how many chunks belong to one symbol.
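The chunks above can be printed directly, because a Go string already holds its text as UTF-8 bytes. A minimal sketch, where `utf8Bits` is a helper name invented for this example:

```go
package main

import (
	"fmt"
	"strings"
)

// utf8Bits is a hypothetical helper that returns the UTF-8 bytes of s
// as space-separated groups of 8 bits.
func utf8Bits(s string) string {
	parts := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		parts = append(parts, fmt.Sprintf("%08b", s[i]))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(utf8Bits("$"))      // 00100100
	fmt.Println(utf8Bits("\u3FDD")) // 11100011 10111111 10011101
}
```

In the 3-chunk result, the leading `1110` of the first chunk and the leading `10` of the other two are the extra bits; the remaining bits are the symbol's own binary number.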

Figure 1.1: UTF-8 Bytes Allocation

As the picture above shows, to encode any symbol in UTF-8 we can simply follow these steps:

  • Find the decimal number of the symbol
  • Convert that decimal number to a binary number
  • Find out how many bytes (chunks) the encoding needs
  • Replace each x with the appropriate binary digit
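The steps above can be sketched by hand in Go. This is only an illustration, not how you would encode in practice: `encodeUTF8` is a hypothetical helper, it skips validation of invalid values, and the bit patterns in the comments are the standard UTF-8 layouts. The result is cross-checked against `utf8.EncodeRune` from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that follows the steps by hand.
// It does no validation (surrogates, values above U+10FFFF) for brevity.
func encodeUTF8(cp uint32) []byte {
	switch {
	case cp < 0x80: // fits in 7 bits: 0xxxxxxx
		return []byte{byte(cp)}
	case cp < 0x800: // fits in 11 bits: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(cp>>6),
			0x80 | byte(cp&0x3F),
		}
	case cp < 0x10000: // fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(cp>>12),
			0x80 | byte((cp>>6)&0x3F),
			0x80 | byte(cp&0x3F),
		}
	default: // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(cp>>18),
			0x80 | byte((cp>>12)&0x3F),
			0x80 | byte((cp>>6)&0x3F),
			0x80 | byte(cp&0x3F),
		}
	}
}

func main() {
	for _, cp := range []uint32{36, 16349} {
		manual := encodeUTF8(cp)
		// Cross-check against the standard library encoder.
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, rune(cp))
		fmt.Printf("%d -> %08b (matches stdlib: %v)\n",
			cp, manual, bytes.Equal(manual, buf[:n]))
	}
}
```

In real code you would normally let Go do this for you, for example with `utf8.EncodeRune` or simply `string(r)`.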

String in Golang

string is its own data type; internally it is a sequence of bytes.

https://play.golang.org/p/eEQeskF6sxL

Shocking? You must be thinking the output should look like `H e l l o`. As I said earlier, a string is a sequence of bytes, and byte is an alias of uint8, so each byte stores only the integer value of a character.

H => 72, e => 101, l => 108, l => 108, o => 111
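The playground code is not reproduced here, so the following is only a sketch of one way to get these numbers, assuming the string is read one index at a time. `byteValues` is a hypothetical helper name.

```go
package main

import "fmt"

// byteValues is a hypothetical helper that returns the numeric value
// of every byte in s.
func byteValues(s string) []int {
	vals := make([]int, 0, len(s))
	for i := 0; i < len(s); i++ {
		vals = append(vals, int(s[i])) // s[i] is a byte (uint8)
	}
	return vals
}

func main() {
	fmt.Println(byteValues("Hello")) // [72 101 108 108 111]
}
```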

In Go, strings use UTF-8 encoding. Let's see how the `Hello` string is stored in RAM (memory).

Notice that the first bit of every byte is 0. This means every character uses one byte (one chunk).
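Here is a small sketch that prints those bits and checks the first bit of each byte. The helper `allSingleByte` is made up for this example; it tests the highest bit with the mask `0x80`.

```go
package main

import "fmt"

// allSingleByte is a hypothetical helper. It reports whether the first
// (highest) bit of every byte in s is 0, which means every character
// is encoded in exactly one byte.
func allSingleByte(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i]&0x80 != 0 {
			return false
		}
	}
	return true
}

func main() {
	s := "Hello"
	for i := 0; i < len(s); i++ {
		fmt.Printf("%c %08b\n", s[i], s[i])
	}
	fmt.Println(allSingleByte(s)) // true
}
```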

Let's look at another string, one that has a special character:

https://play.golang.org/p/a5Py9bLApdm

“HelᏝo” is a 5-character word in which one character is special, with the decimal number 5085. But when we iterate over it we get 7 decimal numbers. Confused? First let's look at the memory allocation.

The special character Ꮭ uses 3 bytes in UTF-8 encoding; please refer to image 1.1.

When we iterate over a string by index, it is read byte by byte, so the loop reads the 3 bytes of the special character as 3 different characters instead of a single one. That is why 3 different characters were printed (2 of them are non-visible). To avoid this issue we have the concept of `rune`. Actually, I would call it a trick instead of a concept.
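The mismatch between 5 characters and 7 bytes can be reproduced with a short sketch. It assumes an index-based loop, and `counts` is a hypothetical helper name. The special character is written as `\u13DD` (hex for 5085).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper that returns the number of bytes
// and the number of characters (runes) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hel\u13DDo" // the "HelᏝo" string from the text
	for i := 0; i < len(s); i++ {
		fmt.Printf("%d: %d\n", i, s[i]) // one line per byte
	}
	nBytes, nRunes := counts(s)
	fmt.Println(nBytes, nRunes) // 7 5
}
```

The three bytes at indexes 3, 4, and 5 (225, 143, 157) together form the single special character.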

rune is an alias of int32. As I said, a UTF-8 character takes 1 to 4 bytes, so Go uses int32, which is large enough for the biggest character. When we do a range loop over this string, it gives us the whole int32 (the rune) instead of single bytes.

https://play.golang.org/p/3MwMdPzEkrA

Interestingly, the index jumps from 3 to 6. The range loop walks over the underlying bytes but understands UTF-8 encoding, so it decodes each character into a rune and we get the exact string back.
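Since the playground code is not shown here, this is a minimal sketch of a range loop over the same string. `runeStarts` is a hypothetical helper that collects the byte index where each character begins.

```go
package main

import "fmt"

// runeStarts is a hypothetical helper that returns the byte index
// at which each character of s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "Hel\u13DDo" // the "HelᏝo" string from the text
	for i, r := range s {
		// i is the byte index, r is the decoded rune (int32).
		fmt.Printf("%d: %c (%d)\n", i, r, r)
	}
	fmt.Println(runeStarts(s)) // [0 1 2 3 6]
}
```

The loop runs 5 times, once per character, and the special character is reported as the single number 5085.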
