Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/p17-1078
Neural Word Segmentation with Rich Pretraining

Abstract: Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model, pretraining the most important submodule using rich external sources. …

Cited by 104 publications (74 citation statements)
References 28 publications
“…For word-based models, segmentation is necessary. We take two segmentors with different performances, including the Jieba segmentor and the model of Yang et al (2017), which we name Jieba and YZ, respectively. To verify their accuracy, we manually segment the first 100 sentences from the test set.…”
Section: Methods
confidence: 99%
“…As a result, OntoNotes is leveraged for studying oracle situations where gold segmentation is given. We use the neural word segmentor of Yang et al (2017a) to automatically segment the development and test sets for word-based NER. In particular, for the OntoNotes and MSRA datasets, we train the segmentor using gold segmentation on their respective training sets.…”
Section: Experimental Settings
confidence: 99%
“…We benefit from this as we perform a search in the space of complete outputs and there is a combinatorial explosion in the output space for a linear increase in the input space (Doppa et al., 2014). The pretraining of the edge vectors with external knowledge in the form of morphological constraints is effective in reducing the task-specific training size (Yang et al., 2017; Andor et al., 2016).…”
Section: Results
confidence: 99%