ngramwords
Written by and copyright Neil Smith (neil@wimp.freeuk.com)
Version 1.0, 8 August 2005
Introduction
The program reads a set of words and generates an n-gram model of the language. It then uses that model to generate new random words. The idea is that the preceding few letters in a word determine what the next letter could be.
Let's say we're looking at bigrams, sequences of two letters (n = 2). If we take all the words in the language sample we've got, we can list all the bigrams that occur in them. For each bigram, we can also list the letters that come after it, with 'end of word' recorded as a possible successor. We end up with a list of all the bigrams in the language sample, how frequent they are, and what letters follow them. We also keep a list of the initial bigrams, so we know how words are allowed to start. This is our model of the language.
To generate new words, we pick a random starting bigram from the list of initial bigrams. This gives us the first two letters of our word. We then look up that bigram in our main list of bigrams, which gives us a list of letters that can follow it. We pick one of those at random, and that gives us the third letter of our word. We then take the bigram formed by the second and third letters and look it up in the list of bigrams; from this, we generate the fourth letter. We then use the third and fourth letters to generate the fifth, and so on until we choose an 'end of word' marker.
Using larger values for n means that the generated words conform more closely to the words in the language sample, but there is a tendency to recycle the existing words if the sample is small. I find that using trigrams (n = 3) works well when there are a few hundred words.
Installation and Use
Copy the To invoke the program, call it with:
where input-file is a text file containing a list of words to build the language model from. Words must be separated by whitespace. If any words contain any characters outside the range [A-Z]|[a-z], including accented characters, those characters must be listed in the
Options
Ligatures and accented characters
Many languages, when transliterated into the Latin alphabet, use more than one Latin letter for each native letter. To accommodate this, the program allows each element of the n-gram to be a multi-letter 'token'. This is done by including in the input file a line such as:
(This must be on a separate line in the input file.) This will mean that 'ú' will be recognized as a valid character in this language, and that 'th' and 'ch' will each be treated as a single letter. See the sample input file for an example.
Lines in the input file that start with a # character are treated as comments and ignored by the program.
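The procedure described in the Introduction, together with the multi-letter tokens described above, can be sketched in a few lines of Python. This is only an illustration and not the program's own source: the function names (tokenize, build_model, generate) are made up for this example, the token list is passed in as an argument rather than read from the input file, and it assumes that "pick at random" means picking in proportion to how often each successor was seen.

```python
import random
from collections import defaultdict

END = None  # stands for the 'end of word' marker


def tokenize(word, multi_tokens=()):
    """Split a word into tokens, trying the longest multi-letter tokens first."""
    ordered = sorted(multi_tokens, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for tok in ordered:
            if word.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No multi-letter token matches here, so take a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def build_model(words, n=2, multi_tokens=()):
    """Return (initial n-grams, successors of each n-gram).

    Duplicates are kept in both lists, so a uniform random choice from
    them is weighted by frequency (an assumption of this sketch).
    """
    initials = []
    successors = defaultdict(list)
    for word in words:
        tokens = tokenize(word, multi_tokens)
        if len(tokens) < n:
            continue  # too short to contain even one n-gram
        initials.append(tuple(tokens[:n]))
        padded = tokens + [END]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            successors[gram].append(padded[i + n])
    return initials, successors


def generate(initials, successors, rng=random):
    """Generate one word by walking the model until 'end of word' is chosen."""
    gram = rng.choice(initials)
    word = list(gram)
    while True:
        nxt = rng.choice(successors[gram])
        if nxt is END:
            return "".join(word)
        word.append(nxt)
        gram = gram[1:] + (nxt,)  # slide the window along by one token
```

For example, build_model(["cat", "cab"], n=2) yields one initial bigram, ('c', 'a'), seen twice, with successors 't' and 'b', so generate can only return "cat" or "cab". This is the recycling of existing words that a small sample produces.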
This page maintained by Neil Smith (webmaster@wimp.freeuk.com)