Version 5, last updated by simonjudge@work at Nov 21 16:18 2008 UTC
Disambiguation process
This page may need updating as we understand more.
Components:
- Word-list : 'Dictionary' of words to search through. Wordlist.txt in DKey software
- .dict file: keycode to word mappings
- Frequency-list: frequency of these words within a certain corpus
- Corpus: lump of text to get frequency list from.
- Key mapping: List of keys (i.e. physical things on a keyboard) to letters, e.g. 1-abc, 2-def, 3-ghij (we could add a level of abstraction so that physical keys get mapped to these numbers; this allows some re-assignment without rebuilding the lookup from the wordlist)
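A minimal sketch of how the frequency list and the two-level key mapping could fit together. The physical key names (F1 to F3) and the function names are made up for illustration, and the corpus is a toy string; nothing here is taken from the DKey software.

```python
import re
from collections import Counter

# Logical key number -> letters (the example mapping from the list above).
KEY_LETTERS = {"1": "abc", "2": "def", "3": "ghij"}

# Hypothetical abstraction layer: physical key -> logical key number.
# Re-assigning a physical key only changes this table, not the lookup.
PHYSICAL_TO_LOGICAL = {"F1": "1", "F2": "2", "F3": "3"}

# Inverted mapping: letter -> logical key number.
LETTER_KEY = {letter: key for key, letters in KEY_LETTERS.items() for letter in letters}


def frequency_list(corpus):
    """Count how often each word occurs in the corpus text."""
    return Counter(re.findall(r"[a-z]+", corpus.lower()))


def keycode(word):
    """Logical key sequence for a word (raises KeyError for unmapped letters)."""
    return "".join(LETTER_KEY[letter] for letter in word)


def logical_sequence(physical_presses):
    """Translate physical key presses into the logical key sequence."""
    return "".join(PHYSICAL_TO_LOGICAL[press] for press in physical_presses)


freqs = frequency_list("A bad cab. A bad dad had a big fig.")
print(freqs["bad"], keycode("bad"), logical_sequence(["F1", "F1", "F2"]))
```

Because the lookup is keyed on logical numbers, swapping which physical key produces "1" would not require regenerating anything derived from the word-list.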
Disambiguating Process:
- A sequence of keys is pressed (e.g. 123)
- The set of words that could match that key sequence is displayed in the disambiguation list (dis-list), ordered by likelihood (frequency).
- The user continues pressing keys until the word is complete. If the correct word is highlighted in the dis-list, the user presses 'space' to select it.
- If the word is further down the dis-list, the user presses next to cycle to it, then space to select it.
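The steps above can be sketched as a small session object. This is only an illustration: the class, its method names and the lookup table are hypothetical, and the table is keyed without the leading 0 that appears in the keycode examples elsewhere on this page.

```python
# Hypothetical lookup table: key sequence -> words, most frequent first.
DIS_LIST = {"8439": ["they", "view", "tidy"]}


class Session:
    """Tracks the keys pressed so far and the highlighted dis-list entry."""

    def __init__(self, table):
        self.table = table
        self.keys = ""    # key sequence entered for the current word
        self.index = 0    # position of the highlighted candidate
        self.text = []    # words accepted so far

    def candidates(self):
        """Words matching the current key sequence, in frequency order."""
        return self.table.get(self.keys, [])

    def press(self, key):
        """Add a key press and reset the highlight to the top candidate."""
        self.keys += key
        self.index = 0

    def next(self):
        """Cycle the highlight to the following candidate, wrapping around."""
        cands = self.candidates()
        if cands:
            self.index = (self.index + 1) % len(cands)

    def space(self):
        """Accept the highlighted candidate and start a new word."""
        cands = self.candidates()
        word = cands[self.index] if cands else self.keys
        self.text.append(word)
        self.keys, self.index = "", 0
        return word


session = Session(DIS_LIST)
for key in "8439":
    session.press(key)
session.next()            # move the highlight from "they" to "view"
print(session.space())
```

Falling back to the raw key sequence when nothing matches is an arbitrary choice made for this sketch.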
Disambiguation Methods
T9 (Tapir::Exact Only):
- Parse the corpus for full words
- Generate keycode list:
  - parse words; for each word (e.g. they):
    - look up the keycode for each letter and build up the word letter by letter (e.g. t, th, the, they)
    - add keycode::string to the keycode list in order of frequency of occurrence (e.g. 08 t, 084 th, 0843 the, 08439 they)
- Compile key-code list for all parsed words
  - e.g. 08439 they view tidy
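A rough sketch of the list generation described above. It makes several assumptions that this page does not confirm: the standard phone keypad layout (which does reproduce the 08439 example), a leading 0 that marks the start of a word, and part-word frequencies obtained by summing the frequencies of every word that starts with that part-word. The word counts are invented.

```python
from collections import defaultdict

# Assumed layout: the standard phone keypad, which gives they -> 8439.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}
WORD_START = "0"  # assumption: the leading 0 in the examples marks a word start


def keycode(string):
    """Key code for a letter string, e.g. 'they' -> '08439'."""
    return WORD_START + "".join(LETTER_KEY[ch] for ch in string)


def build_keycode_list(word_freqs):
    """Map each key code to its letter strings, most frequent first."""
    # Assumption: a part-word's frequency is the sum over all words it starts.
    string_freq = defaultdict(int)
    for word, freq in word_freqs.items():
        for end in range(1, len(word) + 1):
            string_freq[word[:end]] += freq

    entries = defaultdict(list)
    for string, freq in string_freq.items():
        entries[keycode(string)].append((freq, string))

    # Sort by descending frequency, then alphabetically to break ties.
    return {code: [s for _, s in sorted(pairs, key=lambda p: (-p[0], p[1]))]
            for code, pairs in entries.items()}


table = build_keycode_list({"they": 50, "view": 20, "tidy": 5})
print(table["08439"])
print(table["084"])
```

With these toy counts, "t" is shared by "they" and "tidy", so its summed frequency puts it ahead of "v" under key code 08.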
WHAT TO DO ABOUT CHAR-COMBINATIONS (unigrams)? E.g. AR of ARE: are these taken care of by parsing the individual letters of each word? When adding char-combinations to the list, e.g. th, do you add the frequencies?
NOT SURE TAPIR DOES THIS 'PROPERLY' - CHECK.
Tapir:
The tapir method is different from T9 for a number of reasons:
- 'Next' iterates over all words with the entered prefix (e.g. th -> the, that, there, their, these), including words longer than the number of keys already pressed.
- There is a cost equation which can be altered to change the balance between looking at the prediction list (to choose next) and minimising keypresses.
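To illustrate the two points above, here is a sketch of prefix-based completion together with a stand-in cost function. The actual Tapir cost equation is not given on this page, so `cost` and its `look_weight` parameter are purely hypothetical, as are the keypad layout (standard phone keypad assumed) and the word counts.

```python
# Assumed layout: the standard phone keypad.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def keycode(word):
    """Key sequence for a whole word, e.g. 'the' -> '843'."""
    return "".join(LETTER_KEY[ch] for ch in word)


def completions(keys, word_freqs):
    """All words whose key code starts with `keys`, most frequent first."""
    matches = [w for w in word_freqs if keycode(w).startswith(keys)]
    return sorted(matches, key=lambda w: (-word_freqs[w], w))


def cost(word, presses, word_freqs, look_weight=1.0):
    """Hypothetical cost of choosing `word` after `presses` letter keys.

    Counts the letter keys plus one 'next' per list position, and charges
    `look_weight` for every list entry the user has to read.
    """
    rank = completions(keycode(word)[:presses], word_freqs).index(word)
    return presses + rank + look_weight * (rank + 1)


FREQS = {"the": 100, "that": 60, "there": 30, "their": 25, "these": 20}
print(completions("84", FREQS))
print(cost("their", 2, FREQS), cost("their", 5, FREQS))
```

Raising `look_weight` makes typing more keys (and so reading a shorter list) look cheaper; setting it to 0 counts keypresses only.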
Prefer Exact:
The lookup list stores each key sequence together with the list of letter sequences that begin with exactly that sequence (i.e. the sequence is a prefix of the word). DESCRIBE PROCESS. TBC.
Both:
Sequences are stored in order of probability - determined by parsing the corpus...
Most Probable:
Multitap:
-----
Questions:
- Can we build a lookup list from just words rather than part-words?
- Is the generated list just letter-strings starting with space?