22 Encoding Data


Summary

  • A bit is the simplest piece of information stored by a computer: it is either 0 or 1.

  • A byte consists of eight bits.

  • The simplest code for storing characters in strings is the American Standard Code for Information Interchange, or ASCII. Each byte (actually only 7 bits) stores one character.

  • A more complicated scheme for storing characters is called Unicode. This code aims to store every character ever used by humans, including newer symbols such as emoji.


At the heart of a modern digital computer are transistors, small semiconductor devices used to amplify or switch electrical current. A processor such as the M2 packs billions of transistors onto a single chip. Ultimately, however, each transistor ends up sending either a low or a high voltage signal.

This is treated as being either a 0 or a 1, a quantity of information known as a bit.

A bit is short for a binary digit, and is either 0 or 1.

Just as decimal digits can be placed together to create a number with many more possibilities, such as 8675309, binary digits can be placed together to create longer pieces of information.

A byte consists of 8 bits.

When a decimal number is written, the position of the digit indicates which power of ten it is representing. For instance, \[ 536 = 5 \cdot 10^2 + 3 \cdot 10^1 + 6 \cdot 10^0. \]

In a similar fashion, each digit of a number written in binary represents either 0 or 1 times 2 raised to the power of that digit's position, where the rightmost position is zero and positions increase as you move to the left.

For instance, the byte 00100100 is equal to the number \[ 0 \cdot 2^7 + 0 \cdot 2^6 + 1 \cdot 2^5 + 0 \cdot 2^4 + 0 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0 = 32 + 4 = 36 \] in decimal representation.
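This conversion is easy to check in R. The base function strtoi interprets a string of digits in a given base and returns the decimal value:

strtoi("00100100", base = 2)   # the byte from above, read as binary
## [1] 36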

Most memory sizes in computers are measured in bytes. In 2022, as I type this, the computer I am using has a main memory of 32 GB (gigabytes), which is \(32 \cdot 10^9\) bytes.

With a single byte (8 bits), it is possible to represent an integer from 0 to 255. With 4 bits (half a byte, also known as a nibble) one can represent a number from 0 to 15. These 16 numbers form the basis of hexadecimal notation.
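These ranges are easy to verify in R, since the largest value a byte can hold is \(2^8 - 1\) and the largest a nibble can hold is \(2^4 - 1\):

c(2^8 - 1, 2^4 - 1)   # largest values for a byte and for a nibble
## [1] 255  15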

A hexadecimal number is written in base 16. The digits of hexadecimal are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.

Recall our byte from earlier, 00100100, which can be written in nibbles as 0010 0100. The first nibble is 2, and the second is 4, so the hexadecimal representation is 24. To avoid confusion with the decimal number 24, the prefix 0x (or sometimes a suffix h) is attached, so \[ \texttt{0x24} = \texttt{36} \] indicates that the hexadecimal number 24 equals the decimal number 36.

For example, the hexadecimal digit \(B\) stands for 11 in decimal, and \(5\) is 5 in decimal. So \[ \texttt{0xB5} = 11 \cdot 16 + 5 = \texttt{181}. \]
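R understands hexadecimal notation directly: a literal written with the 0x prefix is read as a number, and strtoi works here as well. A quick check of the calculation above:

0xB5                      # hexadecimal literal
## [1] 181
strtoi("B5", base = 16)   # same conversion from a string
## [1] 181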

22.1 Representing characters using ASCII

Whenever a variable holds a string, it consists of a sequence of characters. But computers, as we have seen, only hold numbers. So a code must be used to turn characters into numbers.

An early code of the digital age was the American Standard Code for Information Interchange, or ASCII for short.

The American Standard Code for Information Interchange (ASCII for short) assigns the numbers 0 to 127 to characters. These include upper and lower case letters in the Roman alphabet, together with numbers and the symbols typically found on a keyboard.

The charToRaw function takes a character string and returns the raw bytes, in this case ASCII values, used to store that string.

charToRaw("Mark Huber")
##  [1] 4d 61 72 6b 20 48 75 62 65 72

Note that the characters are printed using their hexadecimal values. Here 0x4d is \(4 \cdot 16 + 13 = 77\) in decimal, indicating that 77 is the ASCII code for the capital letter "M".
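Wrapping charToRaw in as.integer converts those raw bytes into decimal codes, which is handy for the exercises at the end of this chapter:

as.integer(charToRaw("M"))   # decimal ASCII code for "M"
## [1] 77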

22.2 ISO-Latin

ASCII was created as a standard in the U.S., where the extra marks on letters known as accents and diacritics are rarely used. Europe, however, is another story, and it soon became obvious that more characters would be needed to handle its languages.

ISO-Latin-1 (also known as Latin-1 and ISO 8859-1) is a code that uses all 8 bits of a byte to store each character.

That eighth bit means that there are 256 possible characters in ISO-Latin-1, as opposed to 128 possible characters in ASCII. This is enough to gain coverage for many European and African languages.
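In R, the iconv function can re-encode a string into ISO-Latin-1, which makes the single-byte storage visible. A minimal sketch, assuming the script itself is saved in UTF-8:

x <- iconv("é", from = "UTF-8", to = "latin1")   # re-encode the accented letter
charToRaw(x)                                     # a single byte in Latin-1
## [1] e9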

The ISO is the International Organization for Standardization, and is a nongovernmental organization that includes member countries from around the world. Its goal is to create standards that balance the need for efficiency with the need for expression.

22.3 Unicode

The languages encoded so far use a fixed-size alphabet to build words, but many human languages use individual symbols for words instead. The top two spoken languages in the world are English and Mandarin, each with over a billion speakers. ISO-Latin-1 works great for English, but Mandarin has a repertoire of about 50,000 characters, of which 20,000 are regularly used.

The third most common language in the world is Hindi, with over six hundred million speakers. The official alphabet contains 46 characters, and none of these symbols are in ISO-Latin-1.

We’re going to need a bigger code!

The idea of a universal code, or Unicode, goes back to the 1980s.

Unicode is a system intended to be universal: its goal is to encode every character used by any human writing system. Implementations of the system, which typically store characters using a variable number of bytes, are called Unicode Transformation Formats, or UTF for short.

An ambitious idea to be sure! The inventors started with modern languages and worked backward to add symbols and languages to the system as needed. It is by far the most widely used standard for storing textual data.

The standard itself is decided by the Unicode Consortium, a nonprofit organization made up of tech companies with an interest in processing text effectively.

Some countries or their agencies have been allowed to join the consortium at various times.

22.3.1 UTF-8

By far the most common Unicode Transformation Format is the UTF-8 standard. This standard uses only 8 bits (one byte) to store an ASCII character.

The way it works is as follows. Suppose that the eighth (highest order) bit of the byte is 0. Then the remaining seven bits encode a character in exactly the same way as ASCII does.

For example, consider U+0041. The low-order byte is hexadecimal 41, which is decimal 65, the ASCII code for A. In R, the initial U becomes the escape character \U, and the + is dropped. This can be illustrated with the cat function.

cat("\U0041")
## A
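The base functions utf8ToInt and intToUtf8 convert between a character and its Unicode code point as a decimal integer, which makes the connection to ASCII explicit:

utf8ToInt("A")   # the code point of "A" in decimal
## [1] 65
intToUtf8(65)    # and back again
## [1] "A"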

Another way to say that the eighth bit is 0 is to write x for all the bits that can be 0 or 1, and 0 for the bit that has to be 0. A space between the nibbles of the byte helps in the reading. So a number that is 0 through 127 in binary is of the form \[ \texttt{0xxx xxxx}. \]

What if the eighth bit of the first byte is 1? Here’s where things get interesting. First, UTF-8 looks at how many leading 1s appear before the first 0. That count sets the number of bytes that determine the character.

So for \[ \texttt{1110 xxxx} \] there are three 1s, which means that the character is determined by three bytes (including the initial one). For \[ \texttt{1111 0xxx} \] there are four bytes total, and that is the largest number of bytes a UTF-8 encoded character can have.

After the first byte, these extra bytes all have the form \[ \texttt{10xx xxxx}, \] which means that each extra byte adds only 6 bits. So the largest number of bits a UTF-8 character can use is \(3 + 3(6) = 21\).

So all UTF-8 encodings take one of the following forms:

0xxx xxxx
110x xxxx 10xx xxxx
1110 xxxx 10xx xxxx 10xx xxxx
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
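The charToRaw function shows these multi-byte patterns in action. The letter é is U+00E9, which lies above the ASCII range and so needs the two-byte form; as a sketch, assuming a UTF-8 session:

charToRaw("\U00E9")   # é: leading byte 1100 0011, continuation byte 1010 1001
## [1] c3 a9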

So that means that UTF-8 could possibly encode as many as \(2^{21} = 2\,097\,152\) characters. Because of the way those 21 bits are used, it actually encodes slightly fewer possibilities, \(1\,111\,998\) to be precise.

As of this writing in November 2022, the Unicode standard is at Version 15.0, which defines \(149\,186\) characters. This is unlikely to grow too much more in the future, so for now UTF-8 has plenty of room.

There are other transformation formats for Unicode, UTF-16 and UTF-32 being examples. However, given that the world is unlikely to need anything more than UTF-8 for a while, it should remain the dominant way of encoding Unicode characters for the foreseeable future.

For the latest information on Unicode, as well as a sampling of the many characters that can be created, check out https://home.unicode.org/.

Questions

Find the decimal equivalent of the following hexadecimal numbers: \[ \texttt{4d, 87, A4, FF} \]

Find the binary equivalent of the following hexadecimal numbers: \[ \texttt{4d, 87, A4, FF} \]

Find the decimal equivalent of the following binary numbers: \[ \texttt{10101010, 00000000, 11111111, 01011110} \]

Find the hexadecimal equivalent of the following binary numbers: \[ \texttt{10101010, 00000000, 11111111, 01011110} \]

Using charToRaw, find the ASCII values for the following characters as decimal values.

  1. "A"

  2. "Z"

  3. "a"

  4. "z"

  5. "%"

  6. " "

What character is U+0072?

How many bytes does U+0072 require for encoding in UTF-8?

Given the UTF-8 value 0xC0A6, how many bits are used to encode the Unicode character?