2019
DOI: 10.48550/arxiv.1909.00204
Preprint

NEZHA: Neural Contextualized Representation for Chinese Language Understanding

Abstract: Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora. In this technical report, we present our practice of pre-training language models named NEZHA (NEural contextualiZed representation for CHinese lAnguage understanding) on Chinese corpora and fine-tuning them for Chinese NLU tasks. The current version of NEZHA is based on BERT [1] with …
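The abstract describes the standard pre-train-then-fine-tune workflow. As a minimal sketch of the fine-tuning side, assuming the Hugging Face Transformers library and using the public bert-base-chinese checkpoint as a stand-in for a NEZHA-style Chinese encoder (the checkpoint, example sentences, and labels are illustrative, not taken from the report):

```python
# Minimal fine-tuning sketch for a BERT-style Chinese encoder on a
# sentence-classification NLU task. Illustrative only: the checkpoint
# stands in for NEZHA, and the optimizer loop is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-chinese"  # stand-in for a NEZHA-style Chinese encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Two toy sentences (positive / negative) and their labels.
batch = tokenizer(["这部电影很好看", "服务太差了"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # forward pass returns loss and logits
outputs.loss.backward()                  # one backward step; optimizer.step() omitted
```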

Cited by 32 publications (26 citation statements)
References 20 publications
“…To capture the sequential features in languages, previous PLMs adopt position embedding in either input representations (Devlin et al., 2019; Lan et al., 2020) or attention weights (Yang et al., 2019; Wei et al., 2019; Ke et al., 2020). For the input-level position embedding, the inputs of the first layer are $h_i^{\mathrm{in},0} = h_i^{\mathrm{in},0} + P_i$, where $P_i$ is the embedding of the $i$-th position.…”
Section: Lattice-BERT
confidence: 99%
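To make the distinction in this statement concrete, here is a small PyTorch sketch, not drawn from any of the cited papers: part (a) adds a learned absolute position embedding to the first-layer inputs, while part (b) injects positional information inside attention via a learned relative-distance bias on the scores (related in spirit to the attention-level schemes cited, though NEZHA itself uses a functional, sinusoid-based relative encoding):

```python
# Contrast of input-level vs. attention-level positional information.
# Illustrative shapes and parameters; a single attention "layer" is shown.
import torch
import torch.nn as nn

d_model, max_len, seq_len = 64, 128, 16
tokens = torch.randn(2, seq_len, d_model)       # (batch, seq_len, hidden)

# (a) Input-level: h_i^{in,0} = h_i^{in,0} + P_i, added once before layer 1.
P = nn.Embedding(max_len, d_model)
h0 = tokens + P(torch.arange(seq_len))

# (b) Attention-level: a bias indexed by the clipped relative distance j - i,
# added to the q·k scores inside every attention layer instead of to the inputs.
k = 8                                            # clipping window for relative distance
rel_bias = nn.Embedding(2 * k + 1, 1)
idx = torch.arange(seq_len)
dist = (idx[None, :] - idx[:, None]).clamp(-k, k) + k   # (seq, seq), values in [0, 2k]
scores = tokens @ tokens.transpose(-1, -2) / d_model ** 0.5
scores = scores + rel_bias(dist).squeeze(-1)     # bias broadcast over the batch
attn = scores.softmax(dim=-1)
```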
“…From the perspective of reporting strategies, we report the performance of base-size models together with lite-size models. As far as we know, all previous Chinese PLMs only report base- or large-size settings (Wei et al., 2019; Diao et al., 2020; Cui et al., 2020). Thus, followers have to implement at least a 12-layer pre-training model to make a fair comparison.…”
Section: A Ethical Considerations
confidence: 99%
“…In this way, we can obtain ten different base models. In addition, we also replace the encoder of the different models with different pretrained language models, including BERT, RoBERTa-wwm-ext [7], and NEZHA [15]. Accordingly, other kinds of base models can be trained.…”
Section: Model Enhancement Techniques
confidence: 99%
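A minimal sketch of this encoder-swapping idea, assuming the Hugging Face Transformers library; the checkpoint identifiers are illustrative (a NEZHA checkpoint would be added the same way if one is available in your environment), and the classifier head is a generic stand-in for the cited paper's task-specific model:

```python
# Train the same task head on top of interchangeable pretrained encoders.
import torch.nn as nn
from transformers import AutoModel

encoder_checkpoints = [
    "bert-base-chinese",            # BERT
    "hfl/chinese-roberta-wwm-ext",  # RoBERTa-wwm-ext
    # a NEZHA checkpoint could be listed here as well, if available
]

class Classifier(nn.Module):
    """Task-specific head on top of an interchangeable pretrained encoder."""
    def __init__(self, checkpoint: str, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] vector
        return self.head(hidden)

# One base model per encoder; each can then be fine-tuned and ensembled.
models = [Classifier(ckpt) for ckpt in encoder_checkpoints]
```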
“…The statistics of these datasets are presented in Experimental Settings. Among these text classification datasets, we use the pre-trained model NEZHA-base (Wei et al., 2019) as our baseline on the Chinese dataset iflytek, and BERT-base (Devlin et al., 2018) as our baseline model on the other English datasets. According to the text length distribution of the different datasets, as well as the maximum sequence length allowed by BERT, we set the hyperparameters as shown in Table 4.…”
confidence: 99%
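One way to operationalize that choice, as a hedged sketch rather than the cited authors' actual procedure: pick each dataset's maximum sequence length from its token-length distribution and cap it at the 512-token limit of BERT-style encoders. The percentile rule, helper name, and checkpoint below are illustrative assumptions:

```python
# Illustrative helper for choosing a per-dataset max sequence length from
# the token-length distribution, capped at BERT's 512-token limit.
# The percentile, cap, and checkpoint are assumptions, not values from the paper.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def choose_max_length(texts, percentile=95, cap=512):
    lengths = [len(tokenizer.encode(t)) for t in texts]
    return int(min(np.percentile(lengths, percentile), cap))

# Usage: max_len = choose_max_length(train_texts), then pass
# tokenizer(batch, truncation=True, padding="max_length", max_length=max_len).
```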