Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.582

Exploring the Language of Data

Abstract: We set out to uncover the unique grammatical properties of an important yet so far underresearched type of natural language text: that of short labels typically found within structured datasets. We show that such labels obey a specific type of abbreviated grammar that we call the Language of Data, with properties significantly different from the kinds of text typically addressed in computational linguistics and NLP, such as 'standard' written language or social media messages. We analyse orthography, parts of …

Cited by 5 publications (4 citation statements)
References 9 publications
“…Khishigsuren et al (2022) used results from in-depth, local field studies to better understand the meaning of family relations in order to produce accurate kinship terminologies in no less than 600 languages. In Bella et al (2020), an about 10-thousand-word formal lexicon of Scottish Gaelic was co-created by local language experts, including locally specific terms not directly translatable to English or most other languages.…”
Section: Linguistic Diversity (mentioning)
confidence: 99%
“…LiveLanguage and the UKC integrate both existing third-party resources and linguistic data collected through collaborations with universities. Examples of such collaborations include Mongolia (Batsuren et al, 2019), Scotland (Bella et al, 2020), India (Chandran Nair et al, 2022), Palestine (Khalilia et al, 2023), and South Africa (Dibitso et al, 2019). Striving to ensure that such collaborations are beneficial to local speaker communities and to avoid exploitative practices, LiveLanguage collaborations adopt a methodology based on co-creation and local empowerment, with the following characteristics: (a) representatives of local communities are leading the formulation of problems and needs, as well as the subsequent specifications of the language resources to be developed; (b) tools, infrastructure, and know-how are provided to local communities if needed, in order to embed solutions sustainably; (c) intellectual property rights stay with the local community; (d) language resources are integrated into the global LiveLanguage ecosystem, giving worldwide visibility to the results through the UKC database and the LiveLanguage data catalog.…”
Section: Addressing Epistemic Injustice In Language Technology: the L... (mentioning)
confidence: 99%
“…The objective is to build the EG, what we call EG Generation, integrating the ETG with the data resources. To do that, the ETG and datasets are provided in input to a specific data mapping tool, called KarmaLinker, which consists of the Karma data integration tool [10] extended to perform Natural Language Processing on short sentences (i.e., what we usually call the language of data) [1]. The process is described in some detail in [7].…”
Section: The Process (mentioning)
confidence: 99%
“…During this phase the prior knowledge codified in the UKC is heavily exploited (see the previous section for details). This activity is performed with the help of the Word Sense Disambiguation (WSD) component SCROLL, a multilingual NLP library and pipeline which is specialised to handle the Language of Data, as defined in [14], namely the type of NLP sentences that are usually found in data. At the moment SCROLL supports seven languages (including various European languages but also Mongolian and Chinese) but, as we have found out, because of the similarity of many languages, of the relatively simple structure of the language of data, and of the fact that the system processing is in control of the Data Scientist which validates each step, SCROLL can also be useful in various other similar languages (where similar here means not diverse, with language diversity being defined as in [12]).…”
Section: Processing Diversity (mentioning)
confidence: 99%