Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law 2021
DOI: 10.1145/3462757.3466104
Legal Norm Retrieval with Variations of the BERT Model Combined with TF-IDF Vectorization

Abstract: In this work, we examine variations of the BERT model on the statute law retrieval task of the COLIEE competition. This includes approaches to leverage BERT's contextual word embeddings, fine-tuning the model, combining it with TF-IDF vectorization, adding external knowledge to the statutes, and data augmentation. Our ensemble of Sentence-BERT with two different TF-IDF representations and document enrichment exhibits the best performance on this task regarding the F2 score. This is followed by a fine-tuned LEGAL…
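The abstract's ensemble idea — scoring statutes with multiple lexical representations and fusing the scores — can be illustrated with a minimal sketch. This is not the authors' implementation: the statute snippets and query are invented, the Sentence-BERT component is omitted, and only the "two different TF-IDF representations" part (here assumed to be word and character n-grams) is shown, fused by summing cosine similarities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy statute collection and query (invented for illustration).
statutes = [
    "A contract requires offer and acceptance by both parties.",
    "A minor may rescind a contract before reaching the age of majority.",
]
query = ["Can a minor cancel a contract?"]

# Two different TF-IDF representations: word n-grams and character n-grams.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

# Fuse by summing cosine similarities across representations.
scores = 0
for vec in (word_vec, char_vec):
    doc_mat = vec.fit_transform(statutes)
    q_mat = vec.transform(query)
    scores = scores + cosine_similarity(q_mat, doc_mat)[0]

# Rank statutes by the fused score, highest first.
ranking = scores.argsort()[::-1]
```

In the paper's setting, a Sentence-BERT similarity score would be added to the fusion in the same way, and retrieval quality would be evaluated with the F2 score used by COLIEE.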

Cited by 23 publications (3 citation statements)
References 11 publications
“…Interestingly, from this result, pretrained models tend to achieve higher performance than the non-pretrained model (i.e., Attentive CNN). Table 4.7 presents the final performance on the test set after ensembling with the lexical score by the optimal value of α. Paraformer outperforms other models, achieves state-of-the-art results in Precision (0.7901) and Macro-F2 (0.7407), and surpasses the current state-of-the-art system by Wehnert et al. [63]. The best recall belongs to the systems of Nguyen et al. [45] and Wehnert et al. [63].…”
Section: Methods (mentioning)
confidence: 84%
“…Several approaches are available to transform textual data into a numeric format. For our case, we have selected the count vectorizer (CV) [18] and TF-IDF vectorizer [19] for this purpose due to their high effectiveness in the area of NLP.…”
Section: Data Preparation (mentioning)
confidence: 99%
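The two vectorizers named in the citation above differ only in weighting: a count vectorizer records raw term frequencies, while a TF-IDF vectorizer scales those counts by inverse document frequency, down-weighting terms that occur in many documents. A minimal sketch with scikit-learn (the documents are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the court held the claim", "the claim was dismissed"]

# Count vectorizer: raw term counts per document.
cv = CountVectorizer()
counts = cv.fit_transform(docs)

# TF-IDF vectorizer: same vocabulary, counts reweighted by
# inverse document frequency and L2-normalized per document.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
```

Both produce sparse document-term matrices over the same vocabulary; only the cell values differ.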
“…Nowadays, many BERT language models take advantage of their underlying transformer approach to produce a specific BERT model fine-tuned for NER tasks in different languages (Souza et al., 2019; Labusch et al., 2019; Jia et al., 2020; Taher et al., 2020). There is also research on BERT in the legal domain that applies it to various legal tasks such as topic modeling (Silveira et al., 2021), legal norm retrieval (Wehnert et al., 2021), and legal case retrieval (Shao et al., 2020).…”
Section: Introduction (mentioning)
confidence: 99%