Literature on spell checking?

Asked: 2011-05-31 17:18:32

Tags: nlp machine-learning spell-checking

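I was wondering whether there is a list of literature on how to implement spell checking. The one example I could find is Peter Norvig's "How to Write a Spelling Corrector" (http://norvig.com/spell-correct.html), which is quite impractical.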

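There are a few things I am interested in:

  • Building a spell checker without resorting to a dictionary (by using an existing corpus or an N-gram dump, e.g. the Google NGram dump).
  • Context-sensitive spell checking.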

2 answers:

Answer 0 (score: 1)

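Here is a classic paper: Church & Gale (1991). There has been less work on context-sensitive error correction, but two papers that may be worth a look are Golding (1995) and Carlson & Fette (2007).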

Answer 1 (score: 0)

Quoted from the link below:

How does it Work?
The Basic Model
The basic technology works as follows: The documents that the search engine is providing access to are added both to the search index and a language model. The language model stores seen phrases and maintains statistics about them. When a query is submitted, the src/QuerySpellCheck.java class looks for possible character edits, namely substitutions, insertions, replacements, transpositions, and deletions, that make the query a better fit for the language model. So if you type 'Gretski' as a query, and the underlying data is data from rec.sport.hockey, the language model will be much more familiar with the mildly edited 'Gretzky' and will suggest it as an alternative.
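The idea in the quoted paragraph can be sketched in a few lines of Python. This is only an illustration and not LingPipe's implementation: the names `train`, `edits1`, and `suggest` are hypothetical, the "language model" is reduced to plain word counts over the indexed documents, and candidates are limited to at most two character edits.

```python
from collections import Counter
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def train(documents):
    """Count every lowercase token seen in the indexed documents (assumed toy model)."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

def edits1(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + substitutes + inserts)

def known(words, counts):
    """Keep only the strings that were actually seen in the data."""
    return {w for w in words if w in counts}

def suggest(query, counts):
    """Return the most frequent seen word within two edits, else the query itself."""
    word = query.lower()
    near = edits1(word)
    # Prefer candidates one edit away; fall back to two edits.
    candidates = (known(near, counts)
                  or known((e2 for e1 in near for e2 in edits1(e1)), counts))
    if not candidates:
        return word
    best = max(candidates, key=lambda w: counts[w])
    return best if counts[best] > counts[word] else word

# Made-up stand-in for rec.sport.hockey posts.
posts = ["Gretzky scored twice last night",
         "Wayne Gretzky is the greatest",
         "gretzky gretzky"]
print(suggest("Gretski", train(posts)))  # gretzky
```

Note that 'Gretski' is two substitutions away from 'Gretzky', which is why the sketch falls back to two-edit candidates. The suggestion comes back lowercased because the toy model folds case.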
Domain Sensitivity
The big advantage of this approach over dictionary-based spell checking is that the corrections are motivated by data in the search index. So "trt" will be corrected to "tort" in a legal domain, "tart" in a cooking domain, and "TRt" in a bio-informatics domain. On Google, there is no suggested correction, presumably because of web domains "trt.com", Thessaly Radio Television as well as Turkiye Radyo Televizyon, both aka TRT, etc.
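To make the domain effect concrete, here is a small sketch in which the same query is corrected against two different sets of counts. Everything in it is assumed for illustration: the two tiny corpora are invented, `osa_distance` and `correct` are hypothetical names, and the ranking (smallest edit distance first, then highest count) is one plausible choice rather than the method the quoted text describes.

```python
from collections import Counter

def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, adjacent transpose."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def correct(query, counts, max_distance=1):
    """Pick the closest, then most frequent, word seen in this domain's data."""
    if query in counts:
        return query  # the data already supports the query as typed
    scored = [(osa_distance(query, w), -n, w) for w, n in counts.items()]
    scored = [s for s in scored if s[0] <= max_distance]
    return min(scored)[2] if scored else query

# Invented one-sentence "domains".
legal = Counter("the tort claim alleges a tort of negligence under tort law".split())
cooking = Counter("bake the tart until the tart crust is golden then serve the tart".split())

print(correct("trt", legal))    # tort
print(correct("trt", cooking))  # tart
```

The early return for queries that already occur in the data mirrors the point made about Google: when "trt" itself is well attested, no correction is offered.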
Context-Sensitive Correction
Both Yahoo and Google perform context-sensitive correction. For instance, the query frod (an Old English term from the German meaning wise or experienced) has a suggested correction of ford (the automotive company, among others), whereas the query frod baggins has the corrected query frodo baggins (a 20th century English fictional character). That's the Yahoo behavior. Google doesn't correct frod baggins, even though there are about 785 hits for it versus 820,000 for Frodo Baggins. On the other hand, Google does correct frdo and frdo baggins. Amazon behaves similarly, but MSN corrects frd baggins to ford baggins rather than frodo baggins.
LingPipe's model supports exactly this kind of context-sensitive correction.
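A rough sketch of how context can change the suggestion, under loud assumptions: this is not LingPipe's model, the corpus is invented, `train`, `score`, and `correct` are hypothetical names, the context model is a word bigram model with add-one smoothing, and candidate words come from the standard library's `difflib.get_close_matches`, which uses a similarity ratio as a stand-in for a real edit-distance channel.

```python
from collections import Counter
import difflib
import itertools
import math

def train(sentences):
    """Collect unigram and bigram counts, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def score(tokens, unigrams, bigrams):
    """Log-probability of the token sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens)
    return sum(math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
               for prev, word in zip(padded, padded[1:]))

def correct(query, unigrams, bigrams):
    """Choose the combination of per-word candidates that the bigram model likes best."""
    vocab = [w for w in unigrams if w != "<s>"]
    options = []
    for token in query.lower().split():
        close = difflib.get_close_matches(token, vocab, n=5, cutoff=0.6)
        options.append(close or [token])  # keep unknown tokens with no near match
    best = max(itertools.product(*options),
               key=lambda combo: score(combo, unigrams, bigrams))
    return " ".join(best)

# Invented query log: "ford" is more common alone, "frodo" precedes "baggins".
unigrams, bigrams = train(["ford trucks", "ford dealers", "ford recalls trucks",
                           "frodo baggins", "frodo baggins leaves"])

print(correct("frod", unigrams, bigrams))          # ford
print(correct("frod baggins", unigrams, bigrams))  # frodo baggins
```

On its own, "frod" goes to the more frequent "ford"; followed by "baggins", the bigram count for "frodo baggins" outweighs that frequency advantage. Scoring every combination is exponential in query length, so a real system would need a search strategy such as beam search.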

Read this great tutorial.
