2020
DOI: 10.48550/arxiv.2002.06071
Preprint

FQuAD: French Question Answering Dataset

Abstract: Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, the Machine Reading Comprehension task has made significant progress. However, most of the results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French native Reading Comprehension dataset that consists of 25,000+ questions […]

Cited by 7 publications (8 citation statements)
References 12 publications
“…The score in Table 4 is significantly lower (10 absolute F1 points) than that reported by (d'Hoffschmidt et al., 2020) on the FQuAD hidden test set. This can be explained by a number of factors, including: the hyperparameter setup (not reported in the FQuAD paper); the use of additional answers for computing evaluation scores on the hidden test set (although, for reference, this factor only accounts for 3-4 points of difference on the SQuAD dev set).…”
Section: Dataset Analysis
confidence: 66%
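The effect of "additional answers" mentioned above can be made concrete with a small sketch. The snippet below is not the official SQuAD/FQuAD evaluation script (which also normalizes punctuation and articles); it is a simplified illustration, with invented example strings, of how scoring each prediction against every available gold answer and keeping the maximum can only raise the reported F1.

```python
# Simplified SQuAD-style scoring sketch (assumption: whitespace tokenization,
# no punctuation/article normalization as in the real evaluation script).
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against all available gold answers and keep the best one."""
    return max(token_f1(prediction, ref) for ref in references)


if __name__ == "__main__":
    prediction = "le général de Gaulle"           # invented example
    single_gold = ["Charles de Gaulle"]
    extra_gold = ["Charles de Gaulle", "le général de Gaulle"]
    # With a single reference the prediction gets partial credit (~0.57);
    # with an additional matching reference it gets full credit (1.0).
    print(best_f1(prediction, single_gold))
    print(best_f1(prediction, extra_gold))
```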
“…In (d'Hoffschmidt et al., 2020), the authors rely on CamemBERT (Martin et al., 2019) for their evaluations, but do not report the hyperparameters used. For all our experiments, we use batch size = 8, learning rate = 3e-5, n_epochs = 2, max_seq_len = 384, doc_stride = 128.…”
Section: Dataset Analysis
confidence: 99%
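For context, the sketch below shows how hyperparameters of this kind would typically map onto the Hugging Face transformers API when fine-tuning CamemBERT for extractive QA. It is an illustration under assumptions (the checkpoint name camembert-base, the toy question/context, and the output directory are not from the cited paper), not the authors' actual training code, and the FQuAD training data itself is omitted.

```python
# Sketch: mapping the quoted hyperparameters onto the transformers API.
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    TrainingArguments,
)

MODEL_NAME = "camembert-base"  # assumed CamemBERT checkpoint on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

# Hyperparameters as reported in the citing paper.
training_args = TrainingArguments(
    output_dir="camembert-fquad",
    per_device_train_batch_size=8,   # batch size = 8
    learning_rate=3e-5,              # learning rate = 3e-5
    num_train_epochs=2,              # n_epochs = 2
)

# max_seq_len = 384 and doc_stride = 128 apply at tokenization time: long contexts
# are split into overlapping 384-token windows with a 128-token stride.
question = "Qui a écrit Les Misérables ?"                                   # invented example
context = "Les Misérables est un roman de Victor Hugo publié en 1862. " * 40
features = tokenizer(
    question,
    context,
    max_length=384,
    stride=128,
    truncation="only_second",
    return_overflowing_tokens=True,
    padding="max_length",
)
print(f"{len(features['input_ids'])} windows of up to 384 tokens")
# A Trainer would then be built from `model`, `training_args`, and the
# preprocessed FQuAD training set (start/end positions per window).
```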
“…Question answering (QA) was evaluated on FQuAD (French Question Answering Dataset) [30], a dataset inspired by its English equivalent, SQuAD [31]. The models were evaluated on the validation subset, which contains 3,188 human-curated question-answer pairs based on 768 high-quality French Wikipedia articles.…”
Section: Question Answering (QA)
confidence: 99%
“…GPT-3 (davinci) was not evaluated on this task for cost reasons, as OpenAI did not support our request for funding at the time of writing. The results may be contrasted with a fine-tuned version of CamemBERT [32], which yields an F1 of 88% and an exact match of 78% on this dataset [30].…”
Section: Model
confidence: 99%
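As a usage illustration, a CamemBERT checkpoint fine-tuned on FQuAD can be queried through the standard transformers question-answering pipeline. The sketch below is hedged: the checkpoint identifier and the question/context pair are assumptions for demonstration, not the models or data used in the cited evaluation.

```python
# Minimal sketch of extractive QA inference with a FQuAD fine-tuned checkpoint.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="illuin/camembert-base-fquad",  # assumed Hub identifier for a FQuAD fine-tune
)

result = qa(
    question="Quand la tour Eiffel a-t-elle été construite ?",  # invented example
    context=(
        "La tour Eiffel a été construite entre 1887 et 1889 "
        "pour l'Exposition universelle de Paris."
    ),
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'entre 1887 et 1889'}
```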
“…That poses a problem for many languages; even though a number of datasets are available in English [6, 31-33], resources in other languages are rather scarce. While a few other languages have received some attention, such as Chinese [8, 18], French [11], and German [27], and while some multilingual datasets exist [2, 23, 25], many languages, such as Portuguese, still lag behind. Question answering (QA) in non-English languages suffers from an additional difficulty: many, and in some cases most, of the documents used to answer questions are only available in English.…”
Section: Introduction
confidence: 99%