Fast Phonetic Similarity Search over Large Repositories

Tissot, Hegler; Peschl, Gabriel; Fabro, Marcos Didonet Del

doi:10.1007/978-3-319-10085-2_6

Cited by 4 publications

(6 citation statements)

References 12 publications

“…However, to avoid any information loss, the parameters t and e are the original input token and entry, not the phonetic representation used along the previous step. The approach does not define a new similarity function, but it uses existing ones, such as the Jaro-Winkler or String Sim [13] metrics. The result of the similarity function is used in two filtering rules.…”

Section: Filtering the Results (mentioning)

confidence: 99%

Integrating Approximate String Matching with Phonetic String Similarity

Ferri, Tissot, Fabro

2018

Advances in Databases and Information Systems

Self Cite

Well-defined dictionaries of tagged entities are used in many tasks to identify entities where the scope is limited and there is no need to use machine learning. One common solution is to encode the input dictionary into Trie trees to find matches on an input text. However, the size of the dictionary and the presence of spelling errors on the input tokens have a negative influence on such solutions. We present an approach that transforms the dictionary and each input token into a compact well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72 percent smaller than a non-phonetic Trie. We perform inexact matching over this representation to filter a set of initial results. Lastly, we apply a second similarity measure to filter the best result to annotate a given entity. The experiments showed that it achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.

“…One of the earliest systems for calculating phonetic similarity is Soundex, first used to classify and disambiguate personal names in studies of the United States Census in the 1930s (Stephenson 1974). Soundex-like systems that calculate phonetic similarity based on orthography alone are used in informatics applications such as information retrieval and spell-check (Philips 1990;Tissot, Peschl, and Fabro 2014). Approaches to quantifying phonetic similarity that specifically involve phonetic features have been used by linguists to study synchronic language variation (Ladefoged 1969), diachronic language change (Nerbonne 2010), and sound patterning in phonology (Mielke 2012).…”

Section: Background and Related Research (mentioning)

confidence: 99%

Poetic Sound Similarity Vectors Using Phonetic Features

Parrish

2021

AIIDE

A procedure that uses phonetic transcriptions of words to produce a continuous vector-space model of phonetic sound similarity is presented. The vector dimensions of words in the model are calculated using interleaved phonetic feature bigrams, a novel method that captures similarities in sound that are difficult to model with orthographic or phonemic information alone. Measurements of similarity between items in the resulting vector space are shown to perform well on established tests for predicting phonetic similarity. Additionally, a number of applications of vector arithmetic and nearest-neighbor search are presented, demonstrating potential uses of the vector space in experimental poetry and procedural content generation.

“…ED operates between two input strings – ED(w_1, w_2) – and returns the minimum number of operations (single-character edits) required to transform string w_1 into w_2. Other examples and variations of string similarity metrics include Jaro-Winkler Distance [9], Hamming Distance [13], and String Sim [14]. However, string distance measures tend to ignore the relative likelihood errors.…”

Section: Approximate String Match (mentioning)

confidence: 99%
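
The edit distance described in the statement above can be illustrated with a minimal sketch. It assumes the Levenshtein variant (insertion, deletion, and substitution each cost 1); the quoted text only says "single-character edits", and the function name `edit_distance` is a hypothetical choice, not taken from the cited paper.

```python
def edit_distance(w1: str, w2: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn w1 into w2 (Levenshtein distance,
    assumed here as the edit model)."""
    # prev[j] holds the distance between the first i-1 characters of w1
    # and the first j characters of w2.
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]  # turning w1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(w2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # substitute c1 with c2, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the dynamic-programming table are kept at a time, so memory grows with the length of `w2` rather than with the product of both lengths.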

“…Soundex [15] is an example of a phonetic matching scheme initially designed for English that uses codes based on the sound of each letter to translate a string into a canonical form of at most four characters, preserving the first letter. In addition, phonetic similarity metrics are able to assign a high score even though comparing dissimilar pairs of strings that produce similar sounds [14, 16]. As the result, phonetically similar entries will have the same (or similar) keys and they can be indexed for efficient search using some hashing method.…”

Section: Approximate String Match (mentioning)

confidence: 99%
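
As an illustration of the statement above, the following sketch encodes names with the common American Soundex rules and groups entries by key in a hash map. It is an assumption-laden example, not the cited authors' code: the zero-padding to exactly four characters follows the usual Soundex convention (the quote only says "at most four"), and the names `soundex` and `build_index` are hypothetical.

```python
from collections import defaultdict

# Consonant groups of American Soundex; vowels, H, W, and Y carry no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the four-character Soundex key of word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]                  # the first letter is preserved as-is
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != last:
            key += code
        last = code                   # a vowel resets the run, so a repeat is kept
    return (key + "000")[:4]          # pad with zeros, then cut to four characters


def build_index(entries):
    """Group entries by Soundex key so that similar-sounding entries collide."""
    index = defaultdict(list)
    for entry in entries:
        index[soundex(entry)].append(entry)
    return index


index = build_index(["Robert", "Rupert", "Ashcraft"])
print(index["R163"])  # ['Robert', 'Rupert']
```

A lookup for a misspelt token then reduces to computing its key and reading one bucket, which is what makes hashing on the phonetic form efficient.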

“…In addition, fast similarity search approaches have been proposed in order to match free text against large dictionaries or databases, being supported by either indexed database structures [14, 19, 20] or Trie-based (prefix index) approximate matching [21–23]. In an initial experiment, Fuzzy Keyword Search [22] has proved to be efficient by combining Trie-based search with string similarity functions.…”

Section: Approximate String Match (mentioning)

confidence: 99%
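
One way to picture Trie-based approximate matching is the sketch below, which walks a prefix tree while carrying one row of the edit-distance table per node and prunes branches that can no longer match. This is a generic technique shown under assumptions; it is not the Fuzzy Keyword Search algorithm from the cited work, and `TrieNode`, `build_trie`, `fuzzy_search`, and the sample dictionary entries are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> child node
        self.word = None    # set to the full entry when one ends at this node


def build_trie(words):
    """Insert every dictionary entry into a prefix tree and return its root."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def fuzzy_search(root, query, max_dist):
    """Return (entry, distance) pairs whose edit distance to query is <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: if every cell already exceeds the budget, no descendant can match.
        if min(row) <= max_dist:
            for nxt, child in node.children.items():
                walk(child, nxt, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results, key=lambda r: (r[1], r[0]))


trie = build_trie(["aspirin", "aspirine", "ibuprofen", "paracetamol"])
print(fuzzy_search(trie, "asprin", 1))  # [('aspirin', 1)]
```

Because entries that share a prefix share the table rows computed for it, the search avoids recomputing the distance from scratch for every dictionary entry.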

Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

Tissot, Dobson

2019

J Biomed Semant

Self Cite

Background: There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings. Results: Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. Conclusion: We present a hybrid approach to efficiently perform similarity match that overcomes the loss of information inherent in using either exact match search or string-based similarity search methods.
