Writing systems and language conventions
Scholars in language, literature, history, philosophy, religion and other humanities disciplines work with texts, not only in English but in many languages. It is therefore important to consider the large variety of writing systems and some of the language conventions which present challenges for digital scholarship. Only a few examples will be mentioned here.
Writing systems
See also: Section 1.3 ‘Writing systems’ in Glass, Lelia, Markus Dickinson, Chris Brew, and Detmar Meurers. 2024. Language and Computers. 2nd ed. Textbooks in Language Sciences 14. Berlin: Language Science Press. https://doi.org/10.5281/zenodo.12730906
There are thousands of languages and hundreds of scripts in which they are written. Many scripts have special rules. Greek uses a different form of sigma at the end of a word (ς) than elsewhere in the word (σ), for instance in Ὀδυσσεύς. French has the œ ligature, as in cœur and sœur, but one can also write oe instead. Here are a few other issues which can make language processing a challenging task.
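The sigma rule can be observed in software. The following is a minimal sketch using only Python's built-in string methods; the variable names are illustrative. It assumes a Python 3 interpreter, whose lower() method applies the final-sigma rule when converting from uppercase.

```python
# Uppercase Greek has one sigma (Σ); lowercase has σ and word-final ς.
word = "Ὀδυσσεύς"

shouted = word.upper()      # all three sigmas become Σ
lowered = shouted.lower()   # the last Σ becomes ς, the others σ

print(shouted)
print(lowered)

# For searching and comparison, casefold() maps both lowercase
# forms of sigma to the same character.
same_letter = "ς".casefold() == "σ".casefold()
print(same_letter)
```

A program that lowercases Greek text one character at a time, without looking at the position in the word, cannot make this distinction.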
Capitalization
Latin had only capital letters, but most modern languages using the Latin alphabet make a distinction between capitals (uppercase) and small (lowercase) letters. The conventions for capitalization vary. In English, proper names and words at the beginning of a sentence are capitalized. Making a difference between Rose as a proper name and rose as a common noun may be a good thing, but that difference may be obscured if Rose occurs at the beginning of a sentence. In German, all nouns are also capitalized. Some other writing systems, like Chinese, are unicase or unicameral, i.e. make no case distinction.
Note: the terms uppercase and lowercase come from the two cases (trays) in which movable type was kept in a printer's shop: capital letters in the upper case, small letters in the lower case.
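A short sketch of what case folding does to the Rose/rose distinction, and of how a unicase script behaves. It uses only built-in Python string methods; the sample strings are illustrative.

```python
# Folding case merges the proper name and the common noun.
words = ["Rose", "rose", "ROSE"]
folded = {w.casefold() for w in words}
print(folded)  # a single entry: {'rose'}

# Chinese characters have no case: upper() and lower() leave them
# unchanged, and neither isupper() nor islower() is true.
han = "中文"
unchanged = han.upper() == han and han.lower() == han
print(unchanged, han.isupper(), han.islower())
```

Whether to fold case before counting or searching is therefore a research decision: it helps to match sentence-initial Rose with rose, but it discards the information that distinguishes the name from the noun.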
Punctuation
In English, punctuation characters such as the period, question mark, exclamation mark and quotation marks are attached to words, so they need to be separated from them in order to treat the words by themselves. Furthermore, a period not only signals the end of a sentence, but can also end an abbreviation, serve as a decimal point, act as a separator in Norwegian dates, etc. This is a challenge for detecting the beginning and end of sentences.
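To illustrate the problem, here is a deliberately naive sentence splitter, sketched with Python's standard re module. The sample text and the rule (split at whitespace that follows a period, question mark or exclamation mark) are assumptions made for this example, not a recommended method.

```python
import re

text = "Dr. Smith paid 3.50 kroner on 17.05.2022. Then he left."

# Naive rule: a sentence ends at . ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

for s in sentences:
    print(repr(s))
# 'Dr.'
# 'Smith paid 3.50 kroner on 17.05.2022.'
# 'Then he left.'
```

The text contains two sentences, but the rule finds three, because the period after the abbreviation Dr is followed by a space. The decimal point and the date separators survive only because no whitespace follows them.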
Words
We generally agree that a string of letters delimited by whitespace can be considered a word, but do numbers such as 2022 count as words? Do spacing and punctuation always signal the end of a word? In English, a colon cannot be inside a word, but in Swedish it can be inside an abbreviation (S:t Eriksplan). Are periods at the end of abbreviations (etc.) a part of the word? Is a genitive like Mary’s a single word, or is it two words? Is a contraction like didn’t a single word, or should we consider it as two words, did and n’t, and should we perhaps expand n’t to not? Compounds can sometimes be written with or without hyphens. Is seven-to-five a single word or three words?
The answer may depend not only on the morphological theory of the language in question, but also on the practical purpose of the research: What is the goal of text processing in a particular research context? If the goal is to count all occurrences of names such as Mary, for instance, then splitting Mary’s into two words may be practical. If the goal is to find all street names, then keeping the apostrophe in Vilhelm Bjerknes’ vei and keeping S:t together may be useful.
For practical purposes, many computer systems adopt a simple default definition: a word is only made up of characters that are letters of an alphabet, digits or the underscore _. All other characters are word delimiters. Clearly, this default is not sufficient for certain applications and we need to use a more refined approach depending on the research goal.
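In regular expressions this default definition corresponds to the character class \w. The sketch below, using Python's standard re module, compares it with one possible refinement that keeps an apostrophe or colon inside a word. The sample sentence and the refined pattern are assumptions chosen for illustration; a real project would adapt the pattern to its own research goal.

```python
import re

text = "Mary’s train didn’t stop at S:t Eriksplan in 2022."

# Default definition: letters, digits and underscore only.
default_tokens = re.findall(r"\w+", text)
print(default_tokens)
# ['Mary', 's', 'train', 'didn', 't', 'stop', 'at', 'S', 't',
#  'Eriksplan', 'in', '2022']

# Illustrative refinement: allow ’ ' or : between word characters.
refined_tokens = re.findall(r"\w+(?:[’':]\w+)*", text)
print(refined_tokens)
# ['Mary’s', 'train', 'didn’t', 'stop', 'at', 'S:t',
#  'Eriksplan', 'in', '2022']
```

The default splits Mary’s, didn’t and S:t into fragments, which suits a count of the name Mary but not a search for street names. The refinement keeps them together, at the cost of no longer matching Mary on its own.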
Alphabetic sorting
In alphabetic sorting (collation), language-specific conventions may apply. Here are only a few of the conventions.
- Norwegian: aa is collated with å
- Spanish: ñ is a separate letter; sensible comes before señor
- French: accented characters are not separate letters: denier and dénier are at the same place in the dictionary. The œ ligature is at the same place as oe.
- Dutch: the digraph ij is usually written as the two characters i and j, but sometimes as the single ligature character ĳ and sometimes (historically and in names) as y. If written as two letters, both must be capitalized at the beginning of the word, e.g. IJzer.
- German: ß is collated with ss. It is often substituted by SS in uppercase.
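Default string sorting in most programming languages compares character codes and knows nothing of these conventions. The sketch below contrasts Python's default sort with a hand-written sort key for the Spanish rule. The helper spanish_key and the extra word sepia are hypothetical, introduced only for this illustration; the key ignores accented vowels and is not a full collation algorithm.

```python
import unicodedata

# Spanish alphabet order: ñ is a separate letter between n and o.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Map each letter to its position in the Spanish alphabet.
    Characters outside the alphabet sort after all letters."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [RANK.get(ch, len(RANK)) for ch in normalized]

words = ["sepia", "señor", "sensible"]

by_code_point = sorted(words)
by_spanish_rule = sorted(words, key=spanish_key)

print(by_code_point)    # ['sensible', 'sepia', 'señor']
print(by_spanish_rule)  # ['sensible', 'señor', 'sepia']

# The German convention for uppercase ß is built into upper().
print("Straße".upper())  # STRASSE
```

By character code, ñ comes after every unaccented letter, so señor ends up after sepia. For real projects, a library that implements language-specific collation tables is a safer choice than a hand-written key.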
Hyphenation
Some languages, like English, allow splitting words with a hyphen at line breaks. This makes it difficult to find out if super-majority was intended to be supermajority or not, for instance.
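The ambiguity becomes visible when a program tries to undo line-break hyphenation. The following sketch, using Python's standard re module, simply deletes every hyphen that stands directly before a line break. The function name and sample strings are assumptions for this example.

```python
import re

def join_hyphenated(text):
    """Naively remove a hyphen directly followed by a line break."""
    return re.sub(r"-\n", "", text)

broken_word = "a super-\nmajority voted"
broken_compound = "a seven-\nto-five shift"

print(join_hyphenated(broken_word))      # a supermajority voted
print(join_hyphenated(broken_compound))  # a sevento-five shift
```

The first result is right only if the author intended supermajority. The second is wrong in any case, because the hyphen belonged to the compound. Deciding between the two readings needs additional knowledge, such as a word list.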
Questions and exercises
- Which different kinds of knowledge are necessary to understand language? How much 'common sense' and knowledge about the world is necessary?
- Which roles does capitalization play in various languages? Compare how English and Norwegian capitalize proper names consisting of multiple words. What about names of languages in English and Norwegian? What about capitalization in German? What about Chinese?
- Which possible goals of natural language processing might influence how to define a word?
- Can you tell the difference between ij and ĳ? How many characters does each have? What may be the consequences of using a ligature, such as ﬄ as one character versus ffl as three separate characters?