

Issues in Text Input

Writing, in the sense of manually transferring data from the human mind to electronic gadgets, has two equally important sides: human information processing on one side of the writing interface and the computer on the other. Both have their own characteristics, which must be accommodated in a successful writing method. In this chapter we discuss some findings in psychology and linguistics along with the technical nature of the data that must be transferred through a given writing method.

The alphabet and the character sets

It should be made clear that there is a difference between an alphabet and the character set used for its visual representation. For literate people the alphabet carries much more information than can be seen on paper. The division of words into syllables and further into phonemes is intricately connected to the alphabet that a person has used to learn the language in question. This internal process, which is important for understanding the language and for fluent mapping between sounds, semantics and text, is not necessarily connected to the shape of the alphabet.

The claim that the alphabet has substance beyond a given visual representation may not seem clear at first, but we have all observed some evidence for it. Compare, for example, the shape of an upper case R and a lower case r. A person unfamiliar with the appearance of the modern Latin alphabet would probably say that they represent different letters. However, as we know, they both map to the same phoneme and should, in many cases, be understood in exactly the same way. With a little imagination one can sketch a world where the phonetic and grammatical structure of English is exactly the same, but where the visual appearance of the alphabet is totally different.

In some languages two or more separate writing systems are used. In some, such as Serbo-Croatian, where the Cyrillic and Latin alphabets are used, the alphabets share some characters. In others, such as Japanese, the character sets are complete and independent, but are, curiously enough, often used together. Japanese can be, and often is, written in three different ways. The pronunciation of a word can be written using either the phoneme based Latin alphabet or the syllable based 46-character Kana script. These techniques are often used with computers because the number of Latin and Kana characters is small enough to allow construction of relatively compact keyboards and because the computer allows the correct written form of the word to be chosen from a list of words matching the entered pronunciation. The third way is to write the word level Kanji characters directly. When we consider the multitude of shapes that a word can take in Japanese, we should notice that the number of different representations listed above should be multiplied by at least two, because the hand-written representations may be substantially different. Thus it seems that at least some people can handle significantly more than one or two writing systems for one language.

While the interrelation of language and written text on the whole is fascinating, it is somewhat beside the point. For the purposes of this study we are only interested in the outbound translation of language into the actions necessary for inputting data that can be shown as Latin characters on a computer display.

When using a keyboard, the mapping is between language and the series of commands needed for causing the motor system to press a certain key or a combination of keys. When using a pen interface equipped with character recognition software, one maps language into the series of commands needed for moving the pen. These two input modalities are different enough to suggest that any reasonably straightforward motor activity can be used for writing after some practice. The length of the practice period and the general convenience of the motor activity are another question. The empirical part of this study tries to measure these variables for one unorthodox text input method.

The importance of the visual experience

One would expect that learning an alphabet without seeing the characters would be more difficult than doing it in the traditional way where a seen figure is copied by hand until it can be produced without a prototype for reference.

Some psychologists claim that when we read, the access to the meaning that is associated with a visual representation is more visual than phonological [Lukatela and Turvey 1998]. The current understanding of the reading process is that we have two somewhat separate mechanisms for decoding text into semantics. One is visual and the other phonological.

The phonological mechanism means first associating the text with phonemes and then accessing the meaning of the word using the sound of the word as a search key. The phonological mechanism is the one we usually learn first. The visual mechanism is learned with years of further practice. It offers a shortcut where the visual representation of a word can be used directly to access the meaning. Which of the two mechanisms dominates the reading process of a given reader reading in a given language depends on several variables. In addition to the experience of the reader, the language is important. Native English speakers have been claimed to use the visual mechanism, while native speakers of Serbo-Croatian use the phonological mechanism [Lukatela and Turvey 1998].

It is not clear whether a similar visual shortcut from meaning to appearance operates in the opposite direction when we write. However, a visual representation for a new alphabet is probably in order even if the alphabet is used only for writing and not for reading. The visual appearance can be helpful when utilizing skill transfer in teaching a new alphabet. If the new characters resemble characters known to the user, the learning process will probably benefit as long as the meanings do not conflict. Also, having a visual appearance for the characters makes it easier to describe and discuss them in tangible terms.

Good characteristics of a handwriting alphabet

According to Ragnheidur Karlsdottir [Karlsdottir 1997], the two most important qualities of a handwriting alphabet are the attainable writing speed and the legibility of the resulting text. Legibility has a different meaning when we are developing an alphabet for text input. The computer reads the input on-line, which means that the action of writing must be standardized and identifiable by the computer, while the visual appearance of the trajectories used for input is largely irrelevant. The computer can synthesize any appearance for the alphabet once it has been properly recognized. Speed, on the other hand, retains its meaning in text input and is very important.

Karlsdottir also lists some undesirable qualities for a handwriting alphabet. These are acute angles, high curvature of strokes and unnecessary connecting strokes between letters. In general this includes all trajectories that cause the pen to slow down. The slowdown has two main causes. The more obvious one is that if the pen has to suddenly reverse its direction, the average speed of the pen will drop because the instantaneous speed must pass through zero. The more subtle one is that some strokes are more difficult to draw than others. The connecting strokes are difficult because there are so many of them and because writers do not have as much training per connection as they have for each character.

Karlsdottir studied the performance of students in grades 3 to 6 using four different cursive writing styles. She observed that when speed was important the students tended to change their style so that no strokes were absolutely straight, but not too curved either. Acute angles were replaced with loops and some strokes were simplified by shortening or dropping some curve features. In addition she points out that the styles that are easier to keep close to the model under maximum writing speed are the ones that support a regular hand movement pattern.

Modeling writing methods

If we were to construct a new writing method, it would be appropriate to test it before putting it into use. A sensible first step could be to create a theoretical model that can be used to predict the performance of the method in human use. Models of human performance do exist.

Models like GOMS, either in a general form [Card et al. 1983] or with more specific handling of pointing and clicking performance using Fitts' law, have been used. In order to accurately predict something as complex as handwriting, these models need to be calibrated for a specific task. Thus the first thing to do is to conduct a small pilot experiment to gain some insight into the task of writing using the text input method that we are interested in. After we have some data to feed to our models, we can make predictions on user performance and build a larger experiment, capable of yielding statistically significant results, to validate our predictions. The computed predictions alone are not enough, because the models do not tell us everything. GOMS, for example, does not predict the initial learning rate for motor skills [John 1995].
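As a rough illustration of what calibrating such a model for a specific task could look like, the following sketch fits the two constants of Fitts' law to pilot measurements by ordinary least squares. It assumes the commonly used Shannon formulation, MT = a + b log2(D/W + 1); the text does not say which formulation is meant, and the function names and the trial format are hypothetical.

```python
import math


def index_of_difficulty(distance, width):
    """Index of difficulty in bits (Shannon formulation, an assumption)."""
    return math.log2(distance / width + 1.0)


def predict_movement_time(a, b, distance, width):
    """Fitts' law prediction: MT = a + b * ID."""
    return a + b * index_of_difficulty(distance, width)


def calibrate(trials):
    """Least-squares fit of the constants (a, b) from pilot trials.

    trials: iterable of (distance, width, measured_time) tuples with at
    least two distinct indices of difficulty.
    """
    trials = list(trials)
    ids = [index_of_difficulty(d, w) for d, w, _ in trials]
    times = [t for _, _, t in trials]
    n = len(trials)
    mean_id = sum(ids) / n
    mean_t = sum(times) / n
    # Slope and intercept of the regression line of time on ID.
    sxx = sum((x - mean_id) ** 2 for x in ids)
    sxy = sum((x - mean_id) * (t - mean_t) for x, t in zip(ids, times))
    b = sxy / sxx
    a = mean_t - b * mean_id
    return a, b
```

The fitted constants only describe pointing movements of the kind measured in the pilot; as noted above, they say nothing about the initial learning rate.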

Skill transfer

It is a well established fact that skill acquired for one task can be used in other similar tasks. Generally, the more the tasks resemble each other, the more likely the skill transfer is to succeed.

The situation is not as simple as that when we look more closely at what exactly constitutes the "close resemblance" that ensures strong transfer. Certainly it means that the features that are critical to performing the two tasks share common characteristics in the internal representation of the tasks in the human mind and motor system. Now, however, we are getting close to the real problem. We simply do not know enough about ourselves as information processors to be able to list the critical features of a task in order of importance [Lintern 1991]. Thus, we are left with the choice of embarking on a long journey of empirical trial and error testing or being satisfied with the vague notion of general similarity.

In the case of learning to write with a new alphabet, we would naturally be interested in using the considerable skill that most literate people have in working with existing alphabets. The idea has been used by developers of text input methods (for examples see Unistrokes and Graffiti in Chapter 3).

Character and digram frequencies

When constructing a writing method one has to make some compromises. A significant cause of compromises is the shortage of easy and simple characters. There is a very limited supply of simple and distinct shapes, movements or sounds that can be used in writing. One must choose which characters are associated with these easy and simple actions. To make the writing method fast and robust, one is inclined to assign the simple actions to the most common characters. This way the average speed and ease of the writing method is closer to the optimum.
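The idea of giving the simplest actions to the most common characters can be sketched as a greedy pairing. The sketch assumes that each action has a single known cost (for example, an average execution time) that is independent of the neighbouring characters; the cost values and all names below are hypothetical.

```python
def assign_actions(char_freq, action_cost):
    """Pair the most frequent characters with the cheapest actions.

    char_freq:   dict mapping character -> relative frequency
    action_cost: dict mapping action name -> cost (hypothetical units)
    Returns a dict mapping character -> action name.
    """
    if len(action_cost) < len(char_freq):
        raise ValueError("need at least one action per character")
    # Characters from most to least frequent, actions from cheapest up.
    chars = sorted(char_freq, key=char_freq.get, reverse=True)
    actions = sorted(action_cost, key=action_cost.get)
    return dict(zip(chars, actions))


def expected_cost(assignment, char_freq, action_cost):
    """Frequency-weighted average cost per character."""
    return sum(char_freq[c] * action_cost[a] for c, a in assignment.items())
```

Under the stated assumption this pairing minimizes the expected cost per character. It ignores transitions between characters, which are discussed separately under digram frequencies.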

The common characters can be found by running text through programs that count the desired frequencies. Computations like this have been performed by people working with natural languages, data compression, cryptology and text input methods. While previously gathered data is available, we chose to gather some more to verify earlier findings and to make a point about how much text input tasks vary. The new data is gathered from two sources. The first is the Project Gutenberg text collection [Gutenberg 1999] and the second is the Linux kernel source tree, which consists mainly of C code [Red Hat 1998]. This data is compared to the table of character and digram frequencies in English given by Soukoreff and MacKenzie [Soukoreff and MacKenzie 1995].

Character frequencies

Table 2.1 lists the top 35 characters in three text samples. Kernel stands for the Linux kernel source tree version 2.0.36 distributed with Red Hat Linux 5.2 [Red Hat 1998]. Gutenberg is the Project Gutenberg etext base as found in the mirror on April 20th 1999. Three Gutenberg files having a ".txt" extension contained zeroes (integer value 0, not character "0"). This was taken as an indication that the files were not plain text, and the files were excluded from the data. Similarly, no files with name extensions other than ".txt" were included. The data labeled as Soukoreff is not completely Soukoreff and MacKenzie's original work. They only added the figures for the space character to a table gathered by Mayzner and Tresselt [Mayzner and Tresselt 1965]. The Soukoreff sample contains only 27 characters, while the other two were computed over all 256 possible eight bit values.
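A counting program of the kind described above might look roughly like the following sketch. It is not the program used for this study: the function names, the directory walk and the default ".txt" filter are assumptions made for illustration. It counts all 256 eight bit values and skips files that contain a zero byte.

```python
from collections import Counter
from pathlib import Path


def is_plain_text(data):
    """Treat a file as plain text only if it contains no zero bytes."""
    return 0 not in data


def count_bytes(data, counts=None):
    """Add the byte values (0..255) found in data to counts."""
    counts = Counter() if counts is None else counts
    counts.update(data)  # iterating over bytes yields integers
    return counts


def count_tree(root, suffix=".txt"):
    """Count byte values in all files under root that end in suffix."""
    counts = Counter()
    for path in Path(root).rglob("*" + suffix):
        if path.is_file():
            data = path.read_bytes()
            if is_plain_text(data):
                count_bytes(data, counts)
    return counts


def top_characters(counts, k=35):
    """Return the k most frequent byte values with relative frequencies."""
    total = sum(counts.values())
    return [(bytes([value]), n / total) for value, n in counts.most_common(k)]
```

Calling `top_characters(count_tree("some/directory"))` on a directory of text files would give a ranked list in the same form as one column pair of Table 2.1.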

Table 2.1: Top 35 character frequencies.
            Kernel              Gutenberg            Soukoreff
Rank  Character Frequency  Character Frequency  Character Frequency
 1         sp   0.145832        sp   0.152503        sp   0.186550
 2         e    0.054217        e    0.087374        e    0.108321
 3         t    0.041608        t    0.061606        t    0.079711
 4         i    0.036502        a    0.054373        a    0.066101
 5         lf   0.034443        o    0.053207        h    0.062808
 6         r    0.032816        n    0.047330        o    0.053881
 7         s    0.032546        i    0.045921        s    0.049366
 8         n    0.032239        s    0.043563        n    0.048965
 9         tab  0.031355        r    0.041673        r    0.047798
10         a    0.029416        h    0.041089        i    0.041987
11         0    0.028855        d    0.030009        l    0.036380
12         o    0.027240        l    0.028298        d    0.035168
13         d    0.025076        lf   0.021562        u    0.024981
14         _    0.023130        cr   0.020194        w    0.023349
15         c    0.021343        u    0.019984        m    0.020149
16         l    0.017964        c    0.017858        c    0.019151
17         f    0.017488        m    0.016829        g    0.017733
18         u    0.016313        f    0.016643        y    0.017043
19         ,    0.015705        w    0.014504        f    0.014561
20         *    0.014940        p    0.013297        b    0.013218
21         p    0.014098        ,    0.013222        p    0.012472
22         )    0.013029        g    0.013181        k    0.008703
23         (    0.013008        y    0.012954        v    0.008059
24         h    0.011776        .    0.011150        j    0.001296
25         m    0.011410        b    0.010172        x    0.001119
26         -    0.011201        <    0.008856        q    0.000615
27         ;    0.010696        >    0.008795        z    0.000503
28         =    0.009123        v    0.006582            
29         x    0.008942        k    0.005138            
30         b    0.008808        /    0.004391            
31         g    0.008708        1    0.003691            
32         E    0.008210        I    0.003555            
33         /    0.007591        "    0.003353            
34         S    0.007483        0    0.003045            
35         1    0.007348        -    0.002898            
Sum             0.830475             0.938818             1.0 

Of the three frequency listings in Table 2.1 only the Soukoreff data is claimed to be representative of English text (without punctuation, numbers and other symbols not listed in Table 2.1). The Soukoreff sample consisted of 107 199 characters.

While the Gutenberg sample is considerably larger, consisting of 970 428 426 characters, it may not be representative of English. First, not all of the texts were in English. In addition to English the sample contains at least Latin, French, Italian and Spanish. Second, the Gutenberg data also contains several HTML files and files consisting mainly of numbers (like pi to the millionth decimal). Thus the Gutenberg sample is more of a mix of natural and computer languages spiced up with some numeric data. The text mixture in the Gutenberg sample is closest to something that a person writing email, HTML pages and spreadsheets would type into her desktop computer. The most important feature of the Gutenberg sample is that it gives some idea of the frequencies of period, comma and other punctuation characters not included in Soukoreff's data.

The kernel sample is another mixture of languages. It has a significant bias towards character frequencies typical of C code, but it also contains natural language (mostly English), GNU make compatible makefiles and a small portion of other text files related to C and assembler programming. In the 40 919 330 characters that were counted, non-alphabet characters are much more frequent than in the two other samples. The kernel sample represents rather closely the kind of text that a person would type when coding a portable operating system kernel.

The lesson of Table 2.1 is that the lower case alphabet and space are not the only frequently needed characters in text input. On the other hand, they are a very good guess for the most frequent characters even in C.

Digram frequencies

The character frequencies should be the first thing to be considered when designing a writing method. The second thing to look at is the transitions between characters. As Karlsdottir's study [Karlsdottir 1997] showed, in a sequential writing method such as regular handwriting, the inter-character transitions are one of the biggest factors in writing speed and a major source of errors.

Table 2.2 lists the top 35 digrams in the same data from which Table 2.1 is derived. The fact that of the 35 top ranking digrams shown only 11 are shared by all three suggests that optimizing a writing system for digrams is more difficult than it is for single characters. Furthermore, the benefit is likely to be smaller because the transitions are likely to be less complex and time consuming than the characters.
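The comparison of top ranking digrams across samples can be sketched as follows. The sketch counts every adjacent pair of symbols in a sequence and intersects the top-k sets of several samples; how ties at rank k are broken and whether digrams are counted across file boundaries are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def count_digrams(data):
    """Count adjacent symbol pairs in a string or bytes object."""
    return Counter(zip(data, data[1:]))


def top_digrams(counts, k=35):
    """Return the set of the k most frequent digrams."""
    return {pair for pair, _ in counts.most_common(k)}


def shared_top_digrams(samples, k=35):
    """Digrams that appear in the top k of every sample."""
    tops = [top_digrams(count_digrams(sample), k) for sample in samples]
    if not tops:
        return set()
    return set.intersection(*tops)
```

Applied to three text samples with `k=35`, the size of the returned set corresponds to the number of digrams shared by all three columns of Table 2.2.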

Table 2.2: Top 35 digram frequencies.
            Kernel              Gutenberg            Soukoreff
Rank  Digram    Frequency  Digram    Frequency  Digram    Frequency
 1      sp-sp   0.046290    e-sp   0.024210      e-sp   0.045746
 2      lf-tab  0.014489    cr-lf  0.020183      sp-t   0.036492
 3      tab-tab 0.012626    sp-t   0.019995      t-h    0.035205
 4      0-0     0.011660    t-h    0.018517      h-e    0.029431
 5      e-sp    0.010648    h-e    0.018147      d-sp   0.024505
 6      i-n     0.010423    sp-a   0.014644      t-sp   0.021856
 7      ;-lf    0.009324    d-sp   0.013859      s-sp   0.020783
 8      ,-sp    0.008969    s-sp   0.012394      sp-a   0.017556
 9      lf-sp   0.008284    i-n    0.011860      sp-w   0.016669
10      t-sp    0.007525    t-sp   0.011857      sp-s   0.014888
11      sp-*    0.007232    e-r    0.011415      a-n    0.014701
12      r-e     0.006916    ,-sp   0.011181      r-sp   0.013834
13      d-e     0.006789    a-n    0.010802      sp-h   0.012947
14      s-t     0.006214    sp-o   0.009338      e-r    0.012257
15      e-r     0.005906    n-sp   0.009125      n-d    0.011315
16      sp-i    0.005486    r-e    0.009007      y-sp   0.010923
17      sp-t    0.005392    sp-s   0.008744      h-a    0.010858
18      n-t     0.005294    sp-w   0.008480      n-sp   0.010746
19      0-x     0.005201    sp-h   0.008319      r-e    0.010625
20      =-sp    0.005129    n-d    0.008223      o-u    0.010401
21      sp-0    0.004910    sp-sp  0.008165      i-n    0.010354
22      s-sp    0.004829    o-n    0.007448      sp-f   0.009878
23      s-e     0.004794    r-sp   0.007162      sp-b   0.009636
24      d-sp    0.004713    e-n    0.007000      sp-m   0.008171
25      sp-(    0.004539    a-t    0.006904      sp-c   0.008059
26      -->     0.004535    sp-i   0.006657      h-i    0.007686
27      *-sp    0.004530    y-sp   0.006639      o-r    0.007574
28      sp-s    0.004495    o-u    0.006632      a-r    0.007481
29      ---     0.004367    e-d    0.006575      e-n    0.007453
30      o-n     0.004332    o-r    0.006458      a-t    0.007322
31      n-e     0.004323    h-a    0.006257      n-g    0.007192
32      a-t     0.004307    o-sp   0.006136      e-d    0.007154
33      sp-=    0.004242    t-o    0.005901      s-t    0.007033
34      t-e     0.004212    e-s    0.005815      sp-o   0.006725
35      t-h     0.004131    t-e    0.005674      sp-l   0.006688
Sum             0.267056           0.359723             0.500163

While optimizing a writing method for digrams may be difficult, it may also be worthwhile, for reasons that can be seen in Figure 2.1. Digrams less frequent than the 256th most frequent are rather rare in all three samples. The 256 most frequent digrams account for 64% of the text in the kernel data, 81% in the Gutenberg data, and 97% in Soukoreff's data. Thus, at least in a task-specific writing method, the set of digrams that needs to be taken into account to make the optimization worthwhile is rather small compared to the number of all possible digrams. A big problem, however, is making a universal writing method that is well suited for all writing tasks. From the data discussed above, it seems that writing tasks can be too diverse to allow significant advantages from digram optimization. If we optimize for speed in English text input, C language writing speed will probably suffer, and vice versa.
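The coverage figures quoted above are cumulative shares of the most frequent items. A minimal sketch of that computation, assuming the digram counts are already available as a `Counter` (the function name is hypothetical):

```python
from collections import Counter


def coverage(counts, k=256):
    """Fraction of all occurrences covered by the k most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total
```

The same function works for single characters, which is how both kinds of curve in a plot like Figure 2.1 could be produced by evaluating it for k = 1, 2, ..., 256.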

Figure 2.1: Character and digram frequencies down to the 256th most frequent item.

Even though it seems that there is no way to make a universally optimal writing method, the character and digram frequency data is not useless. When designing a writing method for a specific task, we should generate similar figures for the expected text and optimize the method accordingly. Also, with a general purpose method we probably should try to find the most frequent writing tasks and see if the expected text allows optimization. If it seems to be possible to make a writing method better for a task without significant penalty in other tasks, one should obviously do the optimization.

Summary


When text is input into a computer, the input itself does not need to be legible to humans. As long as the computer interprets the input correctly, any visual appearance can be generated for human viewing. However, there are other features of handwriting that are relevant to text input. Firstly, we must remember that under time pressure handwriting converges towards a style where no sharp corners or straight lines exist. A writing method should allow this to happen without penalty. Secondly, the alphabet used for text input should have some visual appearance so that it can be easily taught and discussed.

There are differences in the frequencies at which different characters appear in texts. Some characters constitute over 10% of a given text while others almost never appear. Similar results hold for digrams. While it is very unusual for any digram to be more frequent than 2%, the top 256 digrams covered more than half of the text in all three samples studied. When designing a writing method one should try to make the frequent characters and digrams easy and fast to write.
