N-grams (n = 1 to 4) were created from an English-language corpus of news, blog and Twitter text. Stopwords were kept, but very infrequent n-grams were removed for performance reasons. Unigrams are included because the app also has an auto-complete feature: given just a few characters, it suggests completions.
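The table-building step described above can be sketched as follows. This is a minimal illustration, not the app's actual code; the tokenized corpus, the `max_n` limit and the `min_count` pruning threshold are stand-ins for whatever the real pipeline used.

```python
from collections import Counter

def build_ngram_counts(tokens, max_n=4, min_count=2):
    """Count n-grams (n = 1..max_n) over a token list, then drop
    very infrequent n-grams to keep the tables small and fast."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    # Prune n-grams seen fewer than min_count times (performance/memory)
    return {n: Counter({g: c for g, c in cnt.items() if c >= min_count})
            for n, cnt in counts.items()}

# Toy "corpus" standing in for the news/blogs/Twitter text
tokens = "the cat sat on the mat the cat sat".split()
counts = build_ngram_counts(tokens, max_n=4, min_count=2)
```

With `min_count=2`, the bigram `("the", "cat")` (seen twice) survives while `("sat", "on")` (seen once) is pruned.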
A variant of the Stupid Backoff method (Brants et al., 2007) was implemented: if the model does not find a matching quadgram, it “backs off” to the tri-, bi- or unigram model.
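The back-off lookup can be sketched like this. It is a simplified illustration of the idea, assuming the n-gram tables are dicts of `Counter` objects keyed by tuple; the toy counts below are invented, and the full Stupid Backoff score (with a fixed back-off discount) is reduced here to “take the most frequent continuation at the longest matching order”.

```python
from collections import Counter

# Toy n-gram tables (counts[n] maps n-gram tuples to frequencies);
# in the app these come from the news/blogs/Twitter corpus.
counts = {
    1: Counter({("the",): 3, ("cat",): 2, ("sat",): 2}),
    2: Counter({("the", "cat"): 2, ("cat", "sat"): 2}),
    3: Counter({("the", "cat", "sat"): 2}),
    4: Counter(),
}

def predict_next(context, counts, max_n=4):
    """Stupid Backoff-style lookup: try the longest n-gram whose
    prefix matches the end of the context; if no match, back off
    to the next shorter order, down to unigrams."""
    for n in range(max_n, 0, -1):
        prefix = tuple(context[-(n - 1):]) if n > 1 else ()
        cands = Counter({g[-1]: c for g, c in counts[n].items()
                         if g[:-1] == prefix})
        if cands:
            return cands.most_common(1)[0][0]
    return None
```

For the context `["the", "cat"]` no quadgram matches, so the lookup backs off to the trigram table and suggests `"sat"`; for an unseen context it falls all the way back to the most frequent unigram.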
Furthermore, words in the first part of a bigram, trigram or quadgram (i.e., all words except the last) that were not among the ~6,000 most frequent unigrams were masked with “&lt;UNK&gt;”. The same masking was applied to the user’s input string. Thus, if the user inputs “I saw the Titanic”, it might be matched to “saw the Gladiator movie” (suggestion = “movie”), provided that both “Titanic” and “Gladiator” were masked.
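The masking step can be sketched as below. The four-word `VOCAB` is a stand-in for the ~6,000 most frequent unigrams; in the app, masking is applied to the prefix words of stored n-grams (the suggested last word stays unmasked) and to the user's input before lookup.

```python
# Stand-in for the ~6000 most frequent unigrams
VOCAB = {"i", "saw", "the", "movie"}

def mask(tokens, vocab=VOCAB):
    """Replace out-of-vocabulary words with <UNK> so that rare words
    in the user's input and in stored n-gram prefixes line up."""
    return ["<UNK>" if t.lower() not in vocab else t.lower() for t in tokens]

masked_input = mask("I saw the Titanic".split())
```

Both “I saw the &lt;UNK&gt;” and the stored prefix “saw the &lt;UNK&gt;” now share the tail “saw the &lt;UNK&gt;”, so the quadgram “saw the &lt;UNK&gt; movie” matches and “movie” is suggested.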