Data sharing is a central aspect of judicial systems. The openly accessible documents can make the judiciary system more transparent. On the other hand, the published legal documents can contain much sensitive information about the involved persons or companies. For this reason, the anonymization of these documents is obligatory to prevent privacy breaches. General Data Protection Regulation (GDPR) and other modern privacy-protecting regulations have strict definitions of private data containing direct and indirect identifiers. In legal documents, there is a wide range of attributes regarding the involved parties. Moreover, legal documents can contain additional information about the relations between the involved parties and rare events. Hence, the personal data can be represented by a sparse matrix of these attributes. The application of Named Entity Recognition methods is essential for a fair anonymization process but is not enough. Machine learning-based methods should be used together with anonymization models, such as differential privacy, to reduce re-identification risk. On the other hand, the information content (utility) of the text should be preserved. This paper aims to summarize and highlight the open and symmetrical problems from the fields of structured and unstructured text anonymization. The possible methods for anonymizing legal documents discussed and illustrated by case studies from the Hungarian legal practice.
One of the most time-consuming parts of an attorney’s job is finding similar legal cases. Categorization of legal documents by their subject matter can significantly increase the discoverability of digitalized court decisions. This is a multi-label classification problem, where each relatively long text can fit into more than one legal category. The proposed paper shows a solution where this multi-label classification problem is decomposed into more than a hundred binary classification problems. Several approaches have been tested, including different machine-learning and text-augmentation techniques to produce a practically applicable model. The proposed models and the methodologies were encapsulated and deployed as a digital-twin into a production environment. The performance of the created machine learning-based application reaches and could also improve the human-experts performance on this monotonous and labor-intensive task. It could increase the e-discoverability of the documents by about 50%.
Many machine learning-based document processing applications have been published in recent years. Applying these methodologies can reduce the cost of labor-intensive tasks and induce changes in the company’s structure. The artificial intelligence-based application can replace the application of trainees and free up the time of experts, which can increase innovation inside the company by letting them be involved in tasks with greater added value. However, the development cost of these methodologies can be high, and usually, it is not a straightforward task. This paper presents a survey result, where a machine learning-based legal text labeler competed with multiple people with different legal domain knowledge. The machine learning-based application used binary SVM-based classifiers to resolve the multi-label classification problem. The used methods were encapsulated and deployed as a digital twin into a production environment. The results show that machine learning algorithms can be effectively utilized for monotonous but domain knowledge- and attention-demanding tasks. The results also suggest that embracing the machine learning-based solution can increase discoverability and enrich the value of data. The test confirmed that the accuracy of a machine learning-based system matches up with the long-term accuracy of legal experts, which makes it applicable to automatize the working process.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.