2022
DOI: 10.48550/arxiv.2203.06378
Preprint

MarkBERT: Marking Word Boundaries Improves Chinese BERT

Abstract: We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units, however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to character level when encountering out-of-vocabulary (OOV) words. Different from existing works, MarkBERT keeps the vocabulary being Chinese characters and inserts boundary markers between contiguous words. Such design enables the model to handle any words in the same way, no ma…
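To make the boundary-marker idea concrete, here is a minimal Python sketch of MarkBERT-style input construction under stated assumptions: the vocabulary stays at the character level, and a special marker token is inserted between contiguous segmented words. The marker string "[unused1]", the helper name, and the example segmentation are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch (illustrative, not the paper's exact recipe): keep a
# character-level vocabulary and insert a marker token between contiguous
# segmented words. "[unused1]" is an assumed marker string.

def insert_boundary_markers(words, marker="[unused1]"):
    """Turn a pre-segmented sentence into a character sequence with a
    boundary marker between every pair of adjacent words."""
    tokens = []
    for i, word in enumerate(words):
        tokens.extend(word)            # split each word into its characters
        if i < len(words) - 1:
            tokens.append(marker)      # mark the boundary to the next word
    return tokens

# Example with a hypothetical segmentation (a real pipeline would obtain it
# from a Chinese word segmenter such as jieba or a CWS model).
print(insert_boundary_markers(["标记", "词", "边界"]))
# ['标', '记', '[unused1]', '词', '[unused1]', '边', '界']

Because every word, in-vocabulary or OOV, is expanded to characters plus a marker, the model handles all words uniformly while still receiving explicit word-boundary information.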

Cited by 5 publications (4 citation statements)
References 13 publications

“…(2) The need for data augmentation: they need to train a Doc2query model to provide the exact matching signal for improving the BERT re-ranker while our strategy does not need any extra overhead in terms of data augmentation. A few recent, but less related examples are Al-Hajj et al [23], who experiment with the use of different supervised signals into the input of the cross-encoder to emphasize target words in context and Li et al [24], who insert boundary markers into the input between contiguous words for Chinese named entity recognition. Additionally, there are various studies that show modifying cross-encoder re-ranker inputs by adding additional information represented as splitter tokens can improve their effectiveness [25][26][27][28].…”
Section: Modifying the Input of Re-rankers (mentioning)
confidence: 99%
“…(2) The need for data augmentation: they need to train a Doc2query model to provide the exact matching signal for improving the BERT re-ranker while our strategy does not need any extra overhead in terms of data augmentation. A few recent, but less related examples are Al-Hajj et al [4], who experiment with the use of different supervised signals into the input of the cross-encoder to emphasize target words in context and Li et al [30], who insert boundary markers into the input between contiguous words for Chinese named entity recognition. Numerical information in Transformer models.…”
Section: Related Work (mentioning)
confidence: 99%
“…(2) The need for data augmentation: they need to train a Doc2query model to provide the exact matching signal for improving the BERT re-ranker while our strategy does not need any extra overhead in terms of data augmentation. A few recent, but less related examples are Al-Hajj et al [23], who experiment with the use of different supervised signals into the input of the cross-encoder to emphasize target words in context and Li et al [24], who insert boundary markers into the input between contiguous words for Chinese named entity recognition.…”
Section: Cross-encoder Cat (CE Cat) (mentioning)
confidence: 99%
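The citation statements above refer to improving cross-encoder re-rankers by adding splitter tokens that carry extra information in the input. Below is a hedged sketch of that general idea, assuming a hypothetical "[EM]" token and a naive exact-match rule; it is not the method of any of the cited papers.

# Hedged sketch of the "splitter token" idea: prefix document tokens that
# exactly match a query term with an extra token so the re-ranker sees the
# match signal. The token name, matching rule, and input template are
# illustrative assumptions.

def build_reranker_input(query, doc_tokens, splitter="[EM]"):
    """Insert a splitter token before exact lexical matches, then assemble a
    simple query [SEP] document cross-encoder input string."""
    query_terms = set(query.lower().split())
    marked = []
    for tok in doc_tokens:
        if tok.lower() in query_terms:
            marked.append(splitter)    # signal an exact lexical match
        marked.append(tok)
    return f"{query} [SEP] " + " ".join(marked)

print(build_reranker_input("chinese bert", ["MarkBERT", "improves", "Chinese", "BERT", "models"]))
# chinese bert [SEP] MarkBERT improves [EM] Chinese [EM] BERT models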