Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data 2001
DOI: 10.1145/375663.375682

Automatic segmentation of text into structured records

Abstract: In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific ru…

Citations: cited by 127 publications (46 citation statements)
References: 18 publications (20 reference statements)
“…In contrast, Borkar et al. [20] replaced each word in each input address with symbols based on a simple regular-expression grouping, e.g. 3-digit number, 5-digit number, single character, multi-character word, mixed alphanumeric word. These symbols contain much less semantic information than the lexicon-based symbols used in Febrl, although they have the advantage of not requiring look-up tables (lexicons).…”
Section: Discussion (mentioning)
confidence: 99%
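
The regular-expression grouping described in the statement above can be sketched roughly as follows. This is only an illustration of the idea, not the code of Borkar et al. or of Febrl: the symbol names (`NUM3`, `NUM5`, `NUM`, `CHAR`, `WORD`, `ALNUM`, `OTHER`), the tokenization rule, and the sample address are all assumptions made for the example.

```python
import re

# Hypothetical symbol classes, tried in order; the names are illustrative only.
SYMBOL_PATTERNS = [
    ("NUM3", re.compile(r"\d{3}")),          # 3-digit number
    ("NUM5", re.compile(r"\d{5}")),          # 5-digit number
    ("NUM", re.compile(r"\d+")),             # any other all-digit token
    ("CHAR", re.compile(r"[A-Za-z]")),       # single character
    ("WORD", re.compile(r"[A-Za-z]+")),      # multi-character word
    ("ALNUM", re.compile(r"[A-Za-z0-9]+")),  # mixed alphanumeric word
]


def to_symbol(token):
    """Return the first symbol class whose pattern matches the whole token."""
    for name, pattern in SYMBOL_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "OTHER"  # punctuation and anything else


def symbolize(address):
    """Split an address into tokens and pair each token with its symbol."""
    # Assumed tokenization: alphanumeric runs, plus each punctuation mark alone.
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", address)
    return [(token, to_symbol(token)) for token in tokens]


if __name__ == "__main__":
    for token, symbol in symbolize("221B Baker St, London 94301"):
        print(f"{token:8s} {symbol}")
```

No look-up table is involved: each symbol depends only on the shape of the token, which is the trade-off the statement points out relative to lexicon-based symbols.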
“…Statistical models, particularly hidden Markov models, have been used extensively in the computer science fields of speech recognition and natural language processing to help solve problems such as word-sense disambiguation and part-of-speech tagging [14]. More recently, hidden Markov and related models have been applied to the problem of extracting structured information from unstructured text [15-20]. …”
Section: Introduction (mentioning)
confidence: 99%
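
As a rough illustration of how a hidden Markov model can label the parts of an address, the sketch below runs Viterbi decoding over a sequence of token symbols. It is a plain first-order HMM, not the model of the cited paper, and every state name and probability in it is invented for the example; in practice these parameters would be estimated from labelled training records.

```python
import math

# Hypothetical address fields used as hidden states.
STATES = ["house_no", "street", "city", "zip"]

# All probabilities below are made up for illustration.
START = {"house_no": 0.7, "street": 0.2, "city": 0.05, "zip": 0.05}
TRANS = {
    "house_no": {"house_no": 0.05, "street": 0.85, "city": 0.05, "zip": 0.05},
    "street": {"house_no": 0.02, "street": 0.50, "city": 0.45, "zip": 0.03},
    "city": {"house_no": 0.02, "street": 0.03, "city": 0.35, "zip": 0.60},
    "zip": {"house_no": 0.10, "street": 0.10, "city": 0.10, "zip": 0.70},
}
EMIT = {
    "house_no": {"NUM": 0.60, "ALNUM": 0.30, "WORD": 0.05, "NUM5": 0.05},
    "street": {"WORD": 0.85, "ALNUM": 0.05, "NUM": 0.05, "NUM5": 0.05},
    "city": {"WORD": 0.90, "ALNUM": 0.04, "NUM": 0.03, "NUM5": 0.03},
    "zip": {"NUM5": 0.90, "NUM": 0.05, "ALNUM": 0.03, "WORD": 0.02},
}


def viterbi(observations):
    """Return the most probable state sequence for a list of token symbols."""
    if not observations:
        return []
    # Log-probabilities avoid numeric underflow on long sequences.
    first = observations[0]
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            current[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    symbols = ["ALNUM", "WORD", "WORD", "WORD", "NUM5"]  # e.g. 221B Baker St London 94301
    print(list(zip(symbols, viterbi(symbols))))
```

The decoder assigns one field label per token, so consecutive tokens with the same label form one segment of the structured record.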
“…For example, home addresses are usually given in unstructured form and need to be split into address, city number, postal code, etc. This kind of data preprocessing makes the warehouse easier to search and removes the data inconsistency that arises when the same home address is stored as several different records in the warehouse [Borkar 2001 …”
Section: Data Cleaning (unclassified)