2004
DOI: 10.1016/j.csl.2003.09.001

Contemporaneous text as side-information in statistical language modeling

Cited by 4 publications (1 citation statement). References 18 publications (13 reference statements).
“…Informally, the insight here is that the initial decodings of p and q, particularly in portions of high confidence, carry useful information about (1) the genres of p and q (e.g., English email), (2) the particular topics covered in p and q (e.g., oil futures), and (3) the particular n-grams that tend to recur in p and q specifically. For example, for (2), one could use a search-engine query to retrieve a small corpus of documents that appear similar to the first-pass decodings of p and q, and use them to help build "story-specific" language models Pr1 and Pr2 [10] that better predict the n-grams of documents on these topics and hence can retrieve more accurate versions of p and q on a second pass.…”
Section: Smoothed N-gram Language Models (mentioning)
confidence: 99%
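The quoted statement describes a two-pass adaptation scheme: use a first-pass decoding to retrieve similar documents, build a "story-specific" language model from them, and rescore candidates on a second pass. Below is a minimal, illustrative Python sketch of that idea, assuming an add-one-smoothed bigram model interpolated with a background model; the names InterpolatedBigramLM and rescore, the interpolation weight, and the toy data are assumptions for illustration, not the method of the cited paper.

```python
import math
from collections import Counter

def bigram_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

class InterpolatedBigramLM:
    """Bigram model interpolating a small story-specific corpus with a large
    background corpus: Pr(w|h) = lam * Pr_story(w|h) + (1 - lam) * Pr_bg(w|h)."""
    def __init__(self, story_sents, background_sents, lam=0.6, vocab_size=50000):
        self.s_uni, self.s_bi = bigram_counts(story_sents)
        self.b_uni, self.b_bi = bigram_counts(background_sents)
        self.lam = lam
        self.V = vocab_size

    def _cond(self, uni, bi, h, w):
        # Add-one smoothed conditional probability Pr(w | h).
        return (bi[(h, w)] + 1.0) / (uni[h] + self.V)

    def prob(self, h, w):
        return (self.lam * self._cond(self.s_uni, self.s_bi, h, w)
                + (1.0 - self.lam) * self._cond(self.b_uni, self.b_bi, h, w))

    def sentence_logprob(self, sent):
        tokens = ["<s>"] + list(sent) + ["</s>"]
        return sum(math.log(self.prob(h, w)) for h, w in zip(tokens, tokens[1:]))

def rescore(candidates, lm):
    """Second pass: pick the candidate decoding the adapted model prefers."""
    return max(candidates, key=lm.sentence_logprob)

# Illustrative usage: 'retrieved' stands in for documents returned by a
# search-engine query against the first-pass decoding (hypothetical toy data).
retrieved = [["oil", "futures", "rose", "sharply"], ["futures", "markets", "opened"]]
background = [["the", "cat", "sat"], ["markets", "were", "quiet"]]
lm = InterpolatedBigramLM(retrieved, background)
candidates = [["oil", "futures", "rose"], ["boil", "fixtures", "rows"]]
print(rescore(candidates, lm))
```

The interpolation is what keeps the small retrieved corpus from dominating: where the story-specific counts are sparse, the estimate falls back toward the background model, which matches the spirit of using contemporaneous text as side-information rather than as a replacement corpus.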