2016
DOI: 10.7494/csci.2016.17.1.23

Adapting a Constituency Parser to User-Generated Content in Polish Opinion Mining

Abstract: The paper focuses on adapting NLP tools for Polish (e.g., morphological analyzers and parsers) to user-generated content (UGC). The authors discuss two rule-based techniques applied to improve the tools' efficiency: preprocessing (text normalization) and parser adaptation (modified segmentation and parsing rules). A new solution for handling out-of-vocabulary words (OOVs), based on inflectional translation, is also offered.

Cited by 2 publications (1 citation statement)
References 14 publications
“…Due to the nature of this kind of data, the texts commonly include many mistakes and problematic phenomena. A linguistic analysis of the data as well as the results of similar research (see e.g (Pluwak et al, 2016)) helped us to define distinctive features, which are related to various text levels such as: notation (e.g. lack of diacritics, spelling mistakes, typos, omissions of capital letters, incorrectly connected or disconnected segments, lack of or poor punctuation), morphology and syntax (e.g.…”
Section: Computer-mediated Communication Corpus
Citation type: mentioning
confidence: 99%