How to represent language data in a computer?
See also: Section 1.4 ‘Digital writing’ in Glass, Lelia, Markus Dickinson, Chris Brew, and Detmar Meurers. 2024. Language and Computers. 2nd ed. Textbooks in Language Sciences 14. Berlin: Language Science Press. https://doi.org/10.5281/zenodo.12730906
Character encoding
A character is the smallest component of a text. Letters, digits, punctuation marks, emojis and other symbols are all characters, and so are spaces, newlines and other invisible components of a text.
Computer systems ultimately represent all information in binary format, that is, zeros and ones. We therefore need to agree on how to represent characters in binary code. Ideally, every computer and every program should interpret these codes in the same way, so that, for instance, 01100001 is always interpreted as the small letter a.
Here is a small part of a long-standing standard code table, ASCII, in which each character is stored in eight bits (= one byte).
decimal code | hex code | binary code | character |
---|---|---|---|
... | ... | ... | ... |
33 | 21 | 00100001 | ! |
34 | 22 | 00100010 | " |
35 | 23 | 00100011 | # |
... | ... | ... | ... |
65 | 41 | 01000001 | A |
66 | 42 | 01000010 | B |
67 | 43 | 01000011 | C |
... | ... | ... | ... |
97 | 61 | 01100001 | a |
98 | 62 | 01100010 | b |
99 | 63 | 01100011 | c |
... | ... | ... | ... |
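The rows of the table can be reproduced in Python. The following is a minimal sketch using the built-in functions `ord` and `format`; the helper name `code_row` is made up for this illustration.

```python
def code_row(ch):
    """Return (decimal code, hex code, 8-bit binary code, character)."""
    code = ord(ch)  # the character's code as an integer
    return (code, format(code, "02x"), format(code, "08b"), ch)

# Print a few rows in the same column order as the table above.
for ch in '!"#ABCabc':
    dec, hx, bn, c = code_row(ch)
    print(f"{dec} | {hx} | {bn} | {c} |")

# chr() goes the other way, from a code to a character.
print(chr(0b01100001))
```

The functions `ord` and `chr` are each other's inverse, so `chr(ord(ch))` gives back the original character.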
Eight bits give only 256 distinct codes, which is not enough to represent all the writing systems in the world. Encoding a character in several bytes together (called multibyte encoding) is needed to accommodate all the world's alphabets, emojis and more. This has been achieved in Unicode, an encoding standard which is supported by most modern computer systems and currently has a repertoire of about 150 000 characters.
There are different ways of grouping the bytes necessary for every Unicode character. The encoding known as UTF-8 is a variable-length encoding which uses one to four bytes per character. It has become the standard on many systems and is used in over 98 % of all web pages. In Python 3, strings are sequences of Unicode characters, and source files are assumed to be UTF-8 by default. Python can also read files that have other encodings, but it is recommended to encode files as UTF-8 whenever possible. It is not necessary to know the binary codes of characters, as long as you stick to UTF-8.
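To see the variable length of UTF-8 in practice, one can encode a few characters and count the resulting bytes. This is only a sketch: the helper name `utf8_hex` and the choice of sample characters are made up for this illustration.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as a hex string."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

# Samples written as escapes: a, e with acute accent, a runic letter, an emoji.
samples = ["a", "\u00e9", "\u16a0", "\U0001F600"]
for ch in samples:
    print(ch, len(ch.encode("utf-8")), "byte(s):", utf8_hex(ch))

# Decoding with the same encoding restores the text ...
data = "\u00e9".encode("utf-8")
print(data.decode("utf-8"))
# ... while decoding with a different encoding produces garbage.
print(data.decode("latin-1"))
```

When reading or writing files, passing `encoding="utf-8"` to `open()` makes the choice explicit instead of relying on a system default.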
Even if Unicode can represent just about all characters, not every font has the necessary shapes for all characters of all alphabets, so choosing a suitable font is important for rendering a text correctly. Here are some Runic characters: ᚠᚢᚦᚨᚱᚲ. Whether these are displayed correctly depends on whether your browser is using a font with the appropriate shapes.
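Even when a font cannot display a character, the character itself is still present in the text and can be identified by its Unicode code point and name. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration, and the runic letters are written as escapes so that the code does not depend on the font of your editor.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]

# The six runic characters shown above, written as escape sequences.
runes = "\u16a0\u16a2\u16a6\u16a8\u16b1\u16b2"
for code_point, name in describe(runes):
    print(code_point, name)
```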
Line endings
A file can be considered as a sequence of bits, arranged in bytes (groups of 8 bits), which can be interpreted as characters, as discussed above. If we want to distinguish lines (also called records) in a file, we need an agreement on which characters delimit lines. Again, there is a standardization problem. The most used character for indicating a new line is the linefeed, often abbreviated as LF and written in code as \n. However, DOS/Windows files use a sequence of two characters, carriage return and linefeed (CRLF or \r\n), which complicates matters. Fortunately, current versions of Python can read either line ending format.
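The difference between the two conventions is visible at the byte level. The following sketch counts the line endings in some raw bytes and shows that Python's `str.splitlines()` treats both conventions alike; the helper name `count_line_endings` and the sample data are made up for this illustration.

```python
def count_line_endings(data):
    """Count LF-only and CRLF line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # every CRLF also contains an LF
    return {"LF": lf_only, "CRLF": crlf}

unix_data = b"one\ntwo\n"         # lines end in LF
windows_data = b"one\r\ntwo\r\n"  # lines end in CRLF

print(count_line_endings(unix_data))
print(count_line_endings(windows_data))

# After decoding, splitlines() yields the same lines for both files.
print(unix_data.decode("utf-8").splitlines())
print(windows_data.decode("utf-8").splitlines())
```

When a file is opened in text mode with the default settings, Python translates CRLF to \n while reading, so most programs never see the difference.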
Note: There is a historical reason for CRLF. Old-fashioned typewriters needed two operations to start a new line: (1) the carriage containing the paper had to be returned horizontally so that typing could resume at the left margin, and (2) the paper had to be fed upwards so that the new line came underneath the previous one.
Questions and exercises
- Open a text editor (such as Emacs or TextEdit on macOS). Check whether there are preferences for character encoding and set these to UTF-8 if necessary. Create, edit, save and reopen a plain text file that contains characters outside the English alphabet, such as æ ø å é ñ ἠ. Check whether the characters display correctly.
- Some characters may be hard to distinguish, for instance different dashes, which could be a hyphen, minus sign, etc. (‐ - ⁃ -) or quotes, apostrophe and prime (' ʼ ′ ʹ). Try to find a way to examine if they are the same, or if they just look the same but are different.
- Read articles about the Mars rover parachute, which used a binary color coding for its message. Do you see that the parachute code is a simplified version of Unicode?