Even something as basic as text requires a well-coordinated, clearly defined system to appear in web browsers. In this post, we’ll explain the basics of text storage and encoding and discuss how they help put engaging words on your site. We’ll focus on one technology central to text on the web: UTF-8.
Binary: How Computers Store Information
In binary, all data is represented as sequences of 1s and 0s.
- The most basic unit of binary data is a bit, a single 1 or 0
- A byte consists of 8 bits
- Every digital asset you’ve ever encountered is built on bytes, which are strung together in a way that makes sense to computers
- When we refer to file sizes, we’re referencing the number of bytes
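As a small illustration of the points above, the following Python sketch prints the eight bits behind each byte of a two-byte value and converts a size in bytes to bits. The sample data `b"Hi"` and the variable names are chosen only for this example.

```python
# A bytes object holding two bytes (sample data for illustration only).
data = b"Hi"

for byte in data:
    # format(value, "08b") renders an integer as eight binary digits.
    print(byte, format(byte, "08b"))

BITS_PER_BYTE = 8
size_in_bytes = len(data)                    # what a "file size" counts
size_in_bits = size_in_bytes * BITS_PER_BYTE
print(size_in_bytes, "bytes =", size_in_bits, "bits")
```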
ASCII: Converting Symbols to Binary
ASCII is an early character encoding standard that maps characters, mainly those used in English, to binary sequences that computers can process.
- Each ASCII character is stored in one byte, so the number of characters it can represent is capped by the number of unique byte values
- There are 256 different ways to group eight 1s and 0s together. Standard ASCII uses only 7 of the 8 bits and defines 128 characters; 8-bit "extended ASCII" variants use all 256 values
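To make the character-to-byte mapping concrete, here is a minimal Python sketch. It relies on Python's built-in `ascii` codec, which covers the 128 characters of standard ASCII; the sample characters are arbitrary.

```python
char = "A"
code = ord(char)                  # numeric value assigned to "A"
bits = format(code, "08b")        # the same value as eight binary digits
encoded = char.encode("ascii")    # a single byte
print(char, code, bits, encoded)

possible_byte_values = 2 ** 8     # ways to arrange eight 1s and 0s
standard_ascii_size = 2 ** 7      # standard ASCII uses only 7 bits
print(possible_byte_values, standard_ascii_size)

# Characters outside the ASCII range cannot be encoded with this codec.
try:
    "\u00e9".encode("ascii")      # U+00E9, e with acute accent
    ascii_can_encode_e_acute = True
except UnicodeEncodeError:
    ascii_can_encode_e_acute = False
print(ascii_can_encode_e_acute)
```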
UTF-8 vs. UTF-16
UTF-8 and UTF-16 are both Unicode encodings. They differ in the number of bytes they need to store a character in a binary string: UTF-8 uses one to four bytes per character, while UTF-16 uses either two or four.
- The binary output for any given character also differs between the two, because each uses a different algorithm to map code points to binary strings.
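The difference is easy to see by encoding the same characters both ways. The sketch below uses Python's `utf-8` and `utf-16-be` codecs (big-endian UTF-16 without a byte order mark, chosen here so the byte counts are not inflated by a BOM). The four sample characters are arbitrary picks from different parts of Unicode.

```python
# Sample characters, written as escapes: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
    sizes[ord(ch)] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}", "utf-8:", utf8.hex(), "utf-16:", utf16.hex())
```

The hex strings differ even where the byte counts match, which reflects the different mapping algorithms.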
Unicode: A Way to Store Every Symbol, Ever
Unicode assigns a unique code, called a code point, to each character.
- Unicode is a more sophisticated system than ASCII: it has room for over a million code points (1,114,112 in total), more than enough to account for every character in any language.
- So, we now have a standardized way of representing every character used by every human language in a single library.
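Code points can be inspected directly in Python, where `ord` returns a character's code point and `chr` does the reverse. This sketch prints a few in the conventional `U+XXXX` notation along with their official Unicode names; the sample characters are arbitrary.

```python
import sys
import unicodedata

# Sample characters: A, e-acute, a CJK ideograph, an emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    code_point = ord(ch)
    print(f"U+{code_point:04X}", unicodedata.name(ch))

# chr() maps a code point back to its character.
letter = chr(0x41)

# Code points run from 0 to sys.maxunicode (0x10FFFF), inclusive.
total_code_points = sys.maxunicode + 1
print(letter, total_code_points)
```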
UTF-8 is a Unicode character encoding method
UTF-8 takes the code point for a given Unicode character and translates it into a string of binary.
- It also does the reverse, reading in binary digits and converting them back to characters.
- Currently, it is the most popular encoding method on the internet because it can efficiently store text containing any character.
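Both directions can be sketched with Python's built-in `str.encode` and `bytes.decode`. The sample word is arbitrary; it was picked because it mixes one-byte and two-byte characters.

```python
text = "caf\u00e9"                     # "cafe" with an acute accent on the e

# Characters to bytes.
encoded = text.encode("utf-8")
bits = " ".join(format(b, "08b") for b in encoded)
print(encoded.hex(), bits)

# Bytes back to characters.
decoded = encoded.decode("utf-8")
print(decoded == text)
```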
The Final Piece of the Puzzle
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character.
- The first 128 characters in the Unicode library, which match ASCII, are represented as one byte, and characters that appear later are encoded as two-, three-, and four-byte binary units.
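For readers curious how the byte count is decided, here is a hand-written sketch of the UTF-8 bit layout, checked against Python's built-in encoder. The helper name `utf8_encode` is hypothetical and exists only for this illustration; in practice `str.encode("utf-8")` does this work. Code points up to U+007F take one byte, up to U+07FF two, up to U+FFFF three, and the rest four.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against the built-in encoder at each length boundary.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", len(utf8_encode(cp)), "byte(s)")
```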
UTF-8 Characters in Web Development
UTF-8 is the most common character encoding method used on the internet today, and it is the default character set for HTML5.
- Over 95% of all websites, likely including your own, store characters this way.
- Since it’s now the standard method for encoding text on the web, all your site pages and databases should use UTF-8.
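One way to check whether page content really is UTF-8 is to try decoding its bytes. The sketch below is a minimal example under stated assumptions: `is_valid_utf8` is a hypothetical helper, and the HTML snippet is made up for illustration. It declares its encoding with the standard `<meta charset="utf-8">` tag.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A made-up page that declares UTF-8 and contains non-ASCII text.
page = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    '<head><meta charset="utf-8"><title>Caf\u00e9</title></head>\n'
    "<body>na\u00efve</body>\n"
    "</html>\n"
)

utf8_bytes = page.encode("utf-8")
# The same word saved in a legacy single-byte encoding instead.
latin1_bytes = "Caf\u00e9".encode("latin-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

A clean decode does not prove the text was intended as UTF-8, but a failed decode is a reliable sign that a page or database export was saved in a different encoding.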