Address standardization with latent semantic association

Guo, Honglei; Zhu, Hengyue; Guo, Zhiyong; Zhang, XiaoXun; Su, Zhong

doi:10.1145/1557019.1557144

Cited by 12 publications

(8 citation statements)

References 19 publications


“…For instance in (Kothari et al, 2010) a new address standardization technique for countries such as India is shown which is able to deal with substantial variants in address structures within a country. Also for addresses, (Guo et al, 2009) proposes latent semantic association instead of the traditional rule and dictionary-based approaches for data standardization. The data standardization step is typically applied before matching and data consolidation algorithms are applied to match records with the intent to find duplicates.…”

Section: DQ Methods For Improvement (mentioning)

confidence: 99%

A classification of data quality assessment and improvement methods

Woodall, Oberhofer, Borek

2014

IJIQ


Data quality (DQ) assessment and improvement in larger information systems would often not be feasible without using suitable "DQ methods", which are algorithms that can be automatically executed by computer systems to detect and/or correct problems in datasets. Currently, these methods are already essential, and they will be of even greater importance as the quantity of data in organisational systems grows. This paper provides a review of existing methods for both DQ assessment and improvement and classifies them according to the DQ problem and problem context. Six gaps have been identified in the classification, where no current DQ methods exist, and these show where new methods are required as a guide for future research and DQ tool development.


“…-Company. We also crawl addresses on two food review websites and one company information query website. This database contains 10k company addresses.…”

Section: Datasets and Metrics (mentioning)

confidence: 99%

“…The process of translating manually written addresses into a certain digital format is known as address standardization. There are also some researches on Address Standardization [4,6,8]. A method based on trie-tree and finite state machine is proposed in [10] which focuses on the problem of inaccurate word segmentation.…”

Section: Table 3 An Example Of Web Contexts (mentioning)

confidence: 99%

DeepAM: Deep Semantic Address Representation for Address Matching

Shan, Yang, et al. 2019

Lecture Notes in Computer Science


Address matching is a crucial task in various location-based businesses like take-out services and express delivery, which aims at identifying addresses referring to the same location in address databases. It is a challenging one due to various possible ways to express the address of a location, especially in Chinese. Traditional address matching approaches relying on string similarities and learning matching rules to identify addresses referring to the same location, could hardly solve the cases with redundant, incomplete or unusual expression of addresses. In this paper, we propose to map every address into a fixed-size vector in the same vector space using state-of-the-art deep sentence representation techniques and then measure the semantic similarity between addresses in this vector space. The attention mechanism is also applied to the model to highlight important features of addresses in their semantic representations. Last but not least, we novelly propose to get rich contexts for addresses from the web through web search engines, which could strongly enrich the semantic meaning of addresses that could be learned. Our empirical study conducted on two real-world address datasets demonstrates that our approach greatly improves both precision (up to 5%) and recall (up to 8%) of the state-of-the-art existing methods.


“…The aim of this model is to extract hidden (unknown) information from a string of visible parameters. Particularly novel is the work of Guo et al (2009), which analyzes postal addresses using a model of Latent Semantic Association (LaSA). LaSA model is built to minimize the human efforts and the size of the control data.…”

Section: Introduction (mentioning)

confidence: 99%

“…Some tests to solve the problem of normalization of addresses were done years ago, but the greatest difficulty was the necessary computing power, not very developed at that time (Fernández et al, 1993). Current processors overcome this difficulty and, besides, new studies emerge every day analyzing the feasibility of different algorithms for data management (Navarro et al, 2003;Patman and Thompson, 2003;Christen and Belacic, 2005;Guo et al, 2009). Although these studies are not applied to Bibliometrics, they employ different techniques that can be used for present and future improvements.…”

Section: Introduction (mentioning)

confidence: 99%

Towards the automation of address identification

et al. 2012


A new semi-automatic method is presented to standardize or codify addresses, in order to produce bibliometric indicators from bibliographic databases. The hypothesis is that this new method is very trustworthy to normalize authors' addresses, easy and quick to obtain. As a way to test the method, a set of already hand-coded data is chosen to verify its reliability: 136,821 Spanish documents (2006-2008) downloaded previously from the Web of Science database. Unique addresses from this set were selected to produce a list of keywords representing various institutional sectors. Once the list of terms is obtained, addresses are standardized with this information and the result is compared to the previous hand-coded data. Some tests are done to analyze possible association between both systems (automatic and hand-coding), calculating measures of recall and precision, and some statistical directional and symmetric measures. The outcome shows a good relation between both methods. Although these results are quite general, this first overview of the address at the institutional sector level is a good way to develop a second approach for the selection of particular centers. This system has some new features because it provides a method based on the previous non-existence of master tables and it has a certain impact on the automation of tasks. The validity of the hypothesis has been proved taking into account not only the statistical measures, but also considering that the obtaining of general and detailed scientific output is less time-consuming and will be even less due to the feedback of the master tables reused for the same kind of data.
