It should be made clear that there is a difference between an alphabet and the character set used for its visual representation. For literate people the alphabet carries much more information than can be seen on paper. The division of words into syllables and further into phonemes is intricately connected to the alphabet that a person has used to learn the language in question. This internal process, which is important for understanding the language and for fluent mapping between sounds, semantics and text, is not necessarily connected to the shape of the alphabet.
While the claim that the alphabet has substance beyond a given visual representation may not seem clear at first, we have all observed some evidence for it. Compare, for example, the shape of an upper case R and a lower case r. A person uneducated in the appearance of the modern Latin alphabet would probably say that they represent different letters. However, as we know, they both map to the same phoneme and should, in many cases, be understood exactly the same way. With a little imagination one can sketch an imaginary world where the phonetic and grammatical structure of English would be exactly the same, but where the visual appearance of the alphabet is totally different.
In some languages two or more separate writing systems are used. In some, like Serbo-Croatian, where the Cyrillic and Latin alphabets are used, the alphabets share some characters. In others, like Japanese, the character sets are complete and independent, but are, curiously enough, often used together. Japanese can be, and often is, written in three different ways. The pronunciation of a word can be written using either the phoneme-based Latin alphabet or the syllable-based 46-character Kana script. These techniques are often used with computers, because the number of Latin and Kana characters is small enough to allow the construction of relatively compact keyboards, and because the computer allows the correct written form of the word to be chosen from a list of words matching the entered pronunciation. The third way is to write the word-level Kanji characters directly. When we consider the multitude of shapes that a word can take in Japanese, we should notice that the number of different representations listed above should be multiplied by at least two, because the hand-written representations may be substantially different. Thus it seems that at least some people can handle significantly more than one or two writing systems for one language.
While the interrelation of language and written text on the whole is fascinating, it is somewhat beside the point. For the purposes of this study we are only interested in the outbound translation of language into the actions necessary for inputting data that can be shown as Latin characters on a computer display.
When using a keyboard, the mapping is between language and the series of commands needed to make the motor system press a certain key or combination of keys. When using a pen interface equipped with character recognition software, one maps language into the series of commands needed for moving the pen. These two input modalities are different enough to persuade one to believe that any reasonably straightforward motor activity can be used for writing after some practice. The length of the practice period and the general convenience of the motor activity are another question. The empirical part of this study tries to measure these variables for one unorthodox text input method.
Some psychologists claim that when we read, the access to the meaning associated with a visual representation is more visual than phonological [Lukatela and Turvey1998]. The current understanding of the reading process is that we have two somewhat separate mechanisms for decoding text into semantics: one visual, the other phonological.
The phonological mechanism means first associating the text with phonemes and then accessing the meaning of the word using the sound of the word as a search key. The phonological mechanism is the one we usually learn first. The visual mechanism is learned with years of further practice. It offers a shortcut where the visual representation of a word can be used directly to access the meaning. Which of the two mechanisms dominates the reading process of a given reader in a given language depends on several variables. In addition to the experience of the reader, the language is important. Native English speakers have been claimed to use the visual mechanism, while native speakers of Serbo-Croatian have been claimed to use the phonological mechanism [Lukatela and Turvey1998].
It is not clear whether a similar visual shortcut from meaning to appearance operates in the opposite direction when we write. However, a visual representation for a new alphabet is probably in order even if the alphabet is used only for writing and not for reading. The visual appearance can be helpful when utilizing skill transfer in teaching a new alphabet: if the new characters resemble characters known to the user, the learning process will probably benefit as long as the meanings do not conflict. Also, having a visual appearance for the characters helps to describe and discuss them in tangible terms.
Karlsdottir also lists some undesirable qualities for a handwriting alphabet: acute angles, high curvature of strokes and unnecessary connecting strokes between letters. In general this covers all trajectories that cause the pen to slow down. The slowdown has two important causes. The more obvious one is that if the pen has to suddenly reverse its direction, the average speed of the pen will drop because the instantaneous speed must pass through zero. The more intricate reason is that some strokes are more difficult to draw than others. The connecting strokes are difficult because there are so many of them and because writers do not get as much training per connection as they do for each character.
Karlsdottir studied the performance of students in grades 3 to 6 using four different cursive writing styles. She observed that when speed was important the students tended to change their style so that no stroke was absolutely straight, but none was too curved either. Acute angles were replaced with loops and some strokes were simplified by shortening or dropping some curve features. In addition she points out that the styles that are easier to keep close to the model under maximum writing speed are the ones that support a regular hand movement pattern.
Models like GOMS, either in its general form [Card et al.1983] or augmented with Fitts' law for more specific handling of pointing and clicking performance, have been used. In order to accurately predict something as complex as handwriting, these models need to be calibrated for a specific task. Thus the first thing to do is to conduct a small pilot experiment to gain some insight into the task of writing with the text input method that we are interested in. After we have some data to feed into our models, we can make predictions on user performance and build a larger, statistically significant experiment to validate our predictions. The computed predictions alone are not enough, because the models do not tell us everything; GOMS, for example, does not predict the initial learning rate for motor skills [John1995].
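As an illustration of the kind of prediction such a calibrated model produces, the sketch below evaluates the Shannon formulation of Fitts' law, MT = a + b log2(D/W + 1). The constants a and b below are placeholders only; as noted above, they would have to be estimated from pilot data for the specific device and task.

    import math

    def fitts_mt(distance, width, a=0.1, b=0.15):
        # Predicted movement time in seconds for a single pointing action,
        # using the Shannon formulation of Fitts' law.
        # a (intercept) and b (slope) are placeholder values; real values
        # must be fitted from pilot measurements for the device in question.
        index_of_difficulty = math.log2(distance / width + 1)  # in bits
        return a + b * index_of_difficulty

    # Example: a 5 mm wide target located 40 mm from the current pen position.
    print(f"predicted movement time: {fitts_mt(40, 5):.3f} s")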
The situation is not as simple as that when we look more closely at what exactly constitutes the ``close resemblance'' that ensures strong transfer. Certainly it means that the features that are critical to performing the two tasks share common characteristics in the internal representation of the tasks in the human mind and motor system. Now, however, we are getting close to the real problem: we simply do not know enough about ourselves as information processors to be able to list the critical features of a task in order of importance [Lintern1991]. Thus, we are left with the choice of embarking on a long journey of empirical trial-and-error testing or being satisfied with the vague notion of general similarity.
In the case of learning to write with a new alphabet, we would naturally be interested in using the considerable skill that most literate people have in working with an existing alphabet. The idea has been used by developers of text input methods (for examples, see Unistrokes and Graffiti in Chapter 3).
The common characters can be found by running text through programs that count the desired frequencies. Computations like this have been performed by people working with natural languages, data compression, cryptology and text input methods. While previously gathered data is available, we chose to gather some more to verify earlier findings and to make a point about the variety of text input tasks. The new data is gathered from two sources. The first is the Project Gutenberg text collection [Gutenberg1999] and the second is the Linux kernel source tree, which consists mainly of C code [Red Hat1998]. This data is compared to the table of character and digram frequencies in English given by Soukoreff and MacKenzie [Soukoreff and MacKenzie1995].
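As a sketch of such a counting program (the file name and encoding below are assumptions, not properties of the actual samples), the character frequencies can be computed in a single pass over the text:

    from collections import Counter

    def character_frequencies(path):
        # Count relative character frequencies in a text file.
        # Reading line by line keeps line feeds in the counts, so control
        # characters such as lf show up in the results as in Table 2.1.
        counts = Counter()
        with open(path, encoding="latin-1") as f:
            for line in f:
                counts.update(line)
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    freqs = character_frequencies("sample.txt")  # placeholder file name
    for ch, freq in sorted(freqs.items(), key=lambda item: -item[1])[:35]:
        print(repr(ch), round(freq, 6))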
Table 2.1: The 35 most frequent characters in the three samples (sp = space, lf = line feed, cr = carriage return, tab = tabulator).

              Kernel                Gutenberg            Soukoreff
Rank   Character Frequency    Character Frequency    Character Frequency
   1   sp        0.145832     sp        0.152503     sp        0.186550
   2   e         0.054217     e         0.087374     e         0.108321
   3   t         0.041608     t         0.061606     t         0.079711
   4   i         0.036502     a         0.054373     a         0.066101
   5   lf        0.034443     o         0.053207     h         0.062808
   6   r         0.032816     n         0.047330     o         0.053881
   7   s         0.032546     i         0.045921     s         0.049366
   8   n         0.032239     s         0.043563     n         0.048965
   9   tab       0.031355     r         0.041673     r         0.047798
  10   a         0.029416     h         0.041089     i         0.041987
  11   0         0.028855     d         0.030009     l         0.036380
  12   o         0.027240     l         0.028298     d         0.035168
  13   d         0.025076     lf        0.021562     u         0.024981
  14   _         0.023130     cr        0.020194     w         0.023349
  15   c         0.021343     u         0.019984     m         0.020149
  16   l         0.017964     c         0.017858     c         0.019151
  17   f         0.017488     m         0.016829     g         0.017733
  18   u         0.016313     f         0.016643     y         0.017043
  19   ,         0.015705     w         0.014504     f         0.014561
  20   *         0.014940     p         0.013297     b         0.013218
  21   p         0.014098     ,         0.013222     p         0.012472
  22   )         0.013029     g         0.013181     k         0.008703
  23   (         0.013008     y         0.012954     v         0.008059
  24   h         0.011776     .         0.011150     j         0.001296
  25   m         0.011410     b         0.010172     x         0.001119
  26   -         0.011201     <         0.008856     q         0.000615
  27   ;         0.010696     >         0.008795     z         0.000503
  28   =         0.009123     v         0.006582
  29   x         0.008942     k         0.005138
  30   b         0.008808     /         0.004391
  31   g         0.008708     1         0.003691
  32   E         0.008210     I         0.003555
  33   /         0.007591     "         0.003353
  34   S         0.007483     0         0.003045
  35   1         0.007348     -         0.002898
 Sum             0.830475               0.938818               1.0
Of the three frequency listings in Table 2.1 only the Soukoreff data is claimed to be representative of English text (without punctuation, numbers and other symbols not listed in Table 2.1). The Soukoreff sample consisted of 107 199 characters.
While the Gutenberg sample is quite a bit larger, consisting of 970 428 426 characters, it may not be representative of English. Firstly, not all of the texts are in English: in addition to English the sample contains at least Latin, French, Italian and Spanish. Secondly, the Gutenberg data also contains several HTML files and files consisting mainly of numbers (like pi to the millionth decimal). Thus the Gutenberg sample is more of a mix of natural and computer languages spiced up with some numeric data. The text mixture in the Gutenberg sample is closest to something that a person writing email, HTML pages and spreadsheets would type into her desktop computer. The most important feature of the Gutenberg sample is that it gives some idea of the frequencies of the period, the comma and other punctuation characters not included in Soukoreff's data.
The kernel sample is another mixture of languages. It has a significant bias towards character frequencies typical of C code, but it also contains natural language (mostly English), GNU make compatible makefiles and a small portion of other text files related to C and assembler programming. In the 40 919 330 characters that were counted, non-alphabet characters are much more frequent than in the two other samples. The kernel sample represents rather closely the kind of text that a person would type when coding a portable operating system kernel.
The thing to be learned from Table 2.1 is that the lower case alphabet and the space are not the only frequently needed characters in text input. On the other hand, they are a very good guess for the most frequent characters even in C.
Table 2.2 lists the top 35 digrams in the same data from which Table 2.1 is derived. The fact that only 11 of the 35 top-ranking digrams shown are shared by all three samples suggests that optimizing a writing system for digrams is more difficult than optimizing it for single characters. Furthermore, the benefit is likely to be smaller, because the transitions are likely to be less complex and time consuming than the characters themselves.
Table 2.2: The 35 most frequent digrams in the three samples (sp = space, lf = line feed, cr = carriage return, tab = tabulator).

              Kernel                Gutenberg            Soukoreff
Rank   Digram    Frequency    Digram    Frequency    Digram    Frequency
   1   sp-sp     0.046290     e-sp      0.024210     e-sp      0.045746
   2   lf-tab    0.014489     cr-lf     0.020183     sp-t      0.036492
   3   tab-tab   0.012626     sp-t      0.019995     t-h       0.035205
   4   0-0       0.011660     t-h       0.018517     h-e       0.029431
   5   e-sp      0.010648     h-e       0.018147     d-sp      0.024505
   6   i-n       0.010423     sp-a      0.014644     t-sp      0.021856
   7   ;-lf      0.009324     d-sp      0.013859     s-sp      0.020783
   8   ,-sp      0.008969     s-sp      0.012394     sp-a      0.017556
   9   lf-sp     0.008284     i-n       0.011860     sp-w      0.016669
  10   t-sp      0.007525     t-sp      0.011857     sp-s      0.014888
  11   sp-*      0.007232     e-r       0.011415     a-n       0.014701
  12   r-e       0.006916     ,-sp      0.011181     r-sp      0.013834
  13   d-e       0.006789     a-n       0.010802     sp-h      0.012947
  14   s-t       0.006214     sp-o      0.009338     e-r       0.012257
  15   e-r       0.005906     n-sp      0.009125     n-d       0.011315
  16   sp-i      0.005486     r-e       0.009007     y-sp      0.010923
  17   sp-t      0.005392     sp-s      0.008744     h-a       0.010858
  18   n-t       0.005294     sp-w      0.008480     n-sp      0.010746
  19   0-x       0.005201     sp-h      0.008319     r-e       0.010625
  20   =-sp      0.005129     n-d       0.008223     o-u       0.010401
  21   sp-0      0.004910     sp-sp     0.008165     i-n       0.010354
  22   s-sp      0.004829     o-n       0.007448     sp-f      0.009878
  23   s-e       0.004794     r-sp      0.007162     sp-b      0.009636
  24   d-sp      0.004713     e-n       0.007000     sp-m      0.008171
  25   sp-(      0.004539     a-t       0.006904     sp-c      0.008059
  26   -->       0.004535     sp-i      0.006657     h-i       0.007686
  27   *-sp      0.004530     y-sp      0.006639     o-r       0.007574
  28   sp-s      0.004495     o-u       0.006632     a-r       0.007481
  29   ---       0.004367     e-d       0.006575     e-n       0.007453
  30   o-n       0.004332     o-r       0.006458     a-t       0.007322
  31   n-e       0.004323     h-a       0.006257     n-g       0.007192
  32   a-t       0.004307     o-sp      0.006136     e-d       0.007154
  33   sp-=      0.004242     t-o       0.005901     s-t       0.007033
  34   t-e       0.004212     e-s       0.005815     sp-o      0.006725
  35   t-h       0.004131     t-e       0.005674     sp-l      0.006688
 Sum             0.267056               0.359723               0.500163
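The digram counts in Table 2.2 can be produced with the same kind of pass over the text, sliding a window of two consecutive characters; the sketch below is again only illustrative and the file name is a placeholder:

    from collections import Counter

    def digram_frequencies(path):
        # Count relative frequencies of consecutive character pairs.
        counts = Counter()
        with open(path, encoding="latin-1") as f:
            text = f.read()
        for first, second in zip(text, text[1:]):
            counts[(first, second)] += 1
        total = sum(counts.values())
        return {pair: n / total for pair, n in counts.items()}

    digrams = digram_frequencies("sample.txt")  # placeholder file name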
While optimizing a writing method for digrams may be difficult, it may also be worthwhile, for reasons that can be seen in Figure 2.1. Digrams less frequent than the 256th most frequent are rather rare in all three samples. The 256 most frequent digrams account for 64% of the text in the kernel data, 81% in the Gutenberg data, and 97% in Soukoreff's data. Thus, at least for a task-specific writing method, the set of digrams that needs to be taken into account to make the optimization worthwhile is rather small compared to the number of all possible digrams. A big problem, however, is making a universal writing method that is well suited for all writing tasks. From the data discussed above, it seems that writing tasks can be too diverse to allow significant advantages from digram optimization. If we optimize for speed in English text input, C language writing speed will probably suffer, and vice versa.
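The coverage figures quoted above can be checked directly from such a frequency table; a minimal sketch, reusing the digram_frequencies function from the previous listing:

    def coverage(freqs, n=256):
        # Fraction of the text covered by the n most frequent items in a
        # dictionary that maps an item to its relative frequency.
        return sum(sorted(freqs.values(), reverse=True)[:n])

    digrams = digram_frequencies("sample.txt")  # from the sketch above
    print(f"top 256 digrams cover {coverage(digrams):.0%} of the sample")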
Even though it seems that there is no way to make a universally optimal writing method, the character and digram frequency data is not useless. When designing a writing method for a specific task, we should generate similar figures for the expected text and optimize the method accordingly. Also, with a general purpose method we probably should try to find the most frequent writing tasks and see if the expected text allows optimization. If it seems to be possible to make a writing method better for a task without significant penalty in other tasks, one should obviously do the optimization.
There are differences in the frequencies at which different characters appear in texts. Some characters constitute over 10% of a given text while others almost never appear. Similar results hold for digrams. While it is very unusual for any digram to be more frequent than 2%, the top 256 digrams cover more than half of all normal texts. When designing a writing method one should try to make the frequent characters and digrams easy and fast to write.