Automatic segmentation of text into structured records

Borkar, Vinayak; Deshmukh, Kaustubh; Sarawagi, Sunita

doi:10.1145/375663.375682

Cited by 127 publications

(46 citation statements)

References 18 publications

(20 reference statements)


“…In contrast, Borkar et al [20] replaced each word in each input address with symbols based on a simple regular expression grouping, e.g. 3-digit number, 5-digit number, single character, multi-character word, mixed alphanumeric word. These symbols contain much less semantic information than the lexicon-based symbols used in Febrl, although they have the advantage of not requiring look-up tables (lexicons).…”

Section: Discussion | mentioning

confidence: 99%
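
The symbol mapping described in the statement above can be illustrated with a minimal sketch. The class names, the exact patterns, and the catch-all `NUMBER` class are assumptions made for illustration; they are not taken from Borkar et al or from Febrl.

```python
import re

# Hypothetical symbol classes, loosely following the groupings named in the
# quoted statement. Order matters: the first pattern that matches the whole
# token wins, so specific classes come before general ones.
PATTERNS = [
    ("3-DIGIT", re.compile(r"\d{3}")),
    ("5-DIGIT", re.compile(r"\d{5}")),
    ("NUMBER", re.compile(r"\d+")),          # assumed catch-all for other numbers
    ("CHAR", re.compile(r"[A-Za-z]")),       # single character
    ("WORD", re.compile(r"[A-Za-z]+")),      # multi-character word
    ("ALNUM", re.compile(r"[A-Za-z0-9]+")),  # mixed alphanumeric word
]


def symbolize(text):
    """Replace each alphanumeric token in `text` with a coarse symbol."""
    symbols = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        for name, pattern in PATTERNS:
            if pattern.fullmatch(token):
                symbols.append(name)
                break
    return symbols


print(symbolize("221B Baker St W 40076"))
```

No look-up table is involved: the symbol for a token depends only on its shape, which is the trade-off the statement describes.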

“…Statistical models, particularly hidden Markov models, have been used extensively in the computer science fields of speech recognition and natural language processing to help solve problems such as word-sense disambiguation and part-of-speech tagging [14]. More recently, hidden Markov and related models have been applied to the problem of extracting structured information from unstructured text [15-20]. …”

Section: Introduction | mentioning

confidence: 99%

“…This paper describes an implementation of lexicon-based tokenisation with hidden Markov models for name and address standardisation – an approach strongly influenced by the work of Borkar et al [20]. This implementation is part of a free, open source [21] record linkage package known as Febrl (Freely extensible biomedical record linkage) [22].…”

Section: Introduction | mentioning

confidence: 99%

“…Such zero probabilities can cause problems when the model is presented with new data, so smoothing techniques are used to assign small probabilities (in this case 0.01) to all unencountered observation symbols for all states. Traditionally Laplace smoothing is used [26], but Borkar et al have also described the use of absolute discounting as an alternative when there are a large number of distinct observation symbols [20]. The Febrl package offers both types of smoothing.…”

Section: Introduction | mentioning

confidence: 99%
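
The two smoothing schemes named in the statement above can be compared with a small sketch for the emission counts of a single HMM state. These are generic textbook forms, not the exact formulas of Borkar et al or the Febrl implementation; the function names, the discount `d`, and the example counts are assumptions.

```python
from collections import Counter


def laplace(counts, vocab):
    """Add-one smoothing: every symbol in vocab gets one extra pseudo-count."""
    total = sum(counts.values())
    return {s: (counts.get(s, 0) + 1) / (total + len(vocab)) for s in vocab}


def absolute_discount(counts, vocab, d=0.5):
    """Subtract a fixed discount d from each seen count and share the freed
    probability mass equally among unseen symbols.

    Assumes 0 < d < 1, at least one observation, and that every key of
    `counts` appears in `vocab`.
    """
    total = sum(counts.values())
    seen = [s for s in vocab if counts.get(s, 0) > 0]
    unseen = [s for s in vocab if counts.get(s, 0) == 0]
    if not unseen:
        # Nothing to redistribute: fall back to relative frequencies.
        return {s: counts[s] / total for s in vocab}
    freed = d * len(seen) / total
    probs = {s: (counts[s] - d) / total for s in seen}
    probs.update({s: freed / len(unseen) for s in unseen})
    return probs


# Hypothetical emission counts for one state.
vocab = ["WORD", "NUMBER", "CHAR", "ALNUM"]
counts = Counter({"WORD": 6, "NUMBER": 2})
print(laplace(counts, vocab))
print(absolute_discount(counts, vocab))
```

With add-one smoothing the mass given to unseen symbols grows with the vocabulary size, whereas the discounting variant fixes it at `d * len(seen) / total`, which is one reason discounting is attractive when there are many distinct observation symbols.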


Preparation of name and address data for record linkage using hidden Markov models

Churches, Christen, Lim et al. 2002

BMC Med Inform Decis Mak


Background: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).

Methods: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems.

Results: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed.

Conclusion: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.


“…For example, home addresses are usually given in unstructured form and need to be split into address, city number, postal code, etc. This kind of data preprocessing makes the warehouse easier to search and removes the data inconsistencies that arise when the same home address is stored as several different records in the warehouse [Borkar 2001 …”

Section: Pročišćavanje Podataka (Data Cleaning) | unclassified