2020
DOI: 10.1109/tg.2018.2883773

Dual Indicators to Analyze AI Benchmarks: Difficulty, Discrimination, Ability, and Generality

Cited by 16 publications (10 citation statements) · References 24 publications
“…Finally, some performance results are obtained by specializing on some subpockets of problems, while failing to achieve results across all problems. There are already some proposals for defining a generality dimension 40 . We leave this more comprehensive analysis for future work; there is much to be explored in the activity that happens below the SOTA front.…”
Section: Discussion (mentioning)
confidence: 99%
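The generality dimension referenced above is one of the paper's dual indicators (difficulty and discrimination for benchmark items; ability and generality for AI systems). Below is a minimal sketch of the idea, assuming a binary response matrix; the indicator formulas are simplified stand-ins, not the paper's exact definitions.

```python
import numpy as np

# Binary response matrix R: rows are AI systems, columns are benchmark items;
# R[i, j] = 1 if system i solves item j. Values are illustrative only.
R = np.array([
    [1, 1, 1, 1, 0, 0],  # a "general" system: clean cut-off by difficulty
    [1, 0, 1, 0, 1, 0],  # a "specialised" system: scattered successes
    [1, 1, 0, 0, 0, 0],
], dtype=float)

# Item-side indicator: difficulty as the failure rate across systems.
difficulty = 1.0 - R.mean(axis=0)

# Agent-side indicator: ability as a difficulty-weighted success score,
# so solving hard items counts for more than solving easy ones.
ability = (R * difficulty).sum(axis=1) / difficulty.sum()

for i, (a, row) in enumerate(zip(ability, R)):
    # Generality proxy: agreement with the ideal step profile of an agent
    # that solves every item up to its ability level and nothing beyond.
    ideal = (difficulty <= a).astype(float)
    generality = (row == ideal).mean()
    print(f"system {i}: ability={a:.2f}, generality={generality:.2f}")
```

A system whose successes cut off cleanly at its ability level scores high generality; one that specialises in scattered "subpockets" of problems scores low, which is exactly the behaviour the quoted passage wants to analyse below the SOTA front.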
“…First, Felten et al (2018) measure AI progress on one particular platform, the Electronic Frontier Foundation (EFF) 4, which is restricted to a more limited set of AI benchmarks. The benchmarks in the present framework further rely on our own previous analysis and annotation of papers (Hernández-Orallo, 2017b; Martínez-Plumed et al, 2018; Martinez-Plumed and Hernandez-Orallo, 2018) as well as on open resources such as Papers With Code 5, which include data and results from a comprehensive set of AI benchmarks, challenges, competitions and tasks. This ensures a broad coverage of AI tasks, also providing insight into AI performance in cognitive abilities that go beyond perception, such as language processing, planning, information retrieval or automated deduction/induction.…”
Section: Related Work (mentioning)
confidence: 99%
“…For the present framework we generate a comprehensive repository of AI benchmarks (Martínez-Plumed et al, 2020a,b) based on our previous compilation, analysis and annotation of AI papers and benchmarking results (Hernández-Orallo, 2017a; Martínez-Plumed et al, 2018; Martinez-Plumed and Hernandez-Orallo, 2018; Martínez-Plumed et al, 2020a,b) as well as open resources such as Papers With Code 16 (the largest, up-to-date, free and open repository of machine learning code and results), which includes data from several AI-related repositories (e.g., EFF 17, NLPprogress 18, SQuAD 19, RedditSota 20, etc.). All these repositories draw on data from multiple (verified) sources, including academic literature, review articles and code platforms focused on machine learning and AI. For the purposes of this study, from the aforementioned sources we track the reported evaluation results (when available or when sufficient data is provided) on different metrics of AI performance across separate AI benchmarks (e.g., datasets, competitions, challenges, awards, etc.)…”
mentioning
confidence: 99%
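The tracking this passage describes amounts to recovering a best-so-far frontier per benchmark from reported results. A minimal sketch, assuming results have already been extracted into flat records; the field names and values below are illustrative placeholders, not a real export schema from Papers With Code.

```python
from collections import defaultdict

# Evaluation results as extracted records (placeholder values for illustration).
results = [
    {"benchmark": "SQuAD", "year": 2017, "metric": "F1", "value": 82.3},
    {"benchmark": "SQuAD", "year": 2018, "metric": "F1", "value": 88.5},
    {"benchmark": "SQuAD", "year": 2018, "metric": "F1", "value": 85.1},  # below the front
    {"benchmark": "SQuAD", "year": 2019, "metric": "F1", "value": 93.0},
]

by_benchmark = defaultdict(list)
for r in results:
    by_benchmark[r["benchmark"]].append(r)

# Walk each benchmark's results in chronological order and keep only the
# entries that advance the best-so-far ("SOTA") front.
for bench, rows in by_benchmark.items():
    best = float("-inf")
    for r in sorted(rows, key=lambda r: r["year"]):
        if r["value"] > best:
            best = r["value"]
            print(f"{bench} {r['year']}: new best {r['metric']} = {best}")
```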
“…The key idea was defining intelligence test items using algorithmic information theory (Hernández-Orallo and Minaya-Collado, 1998), an approach that was followed by many other proposals in the next two decades, from the very influential "universal intelligence" (Legg and Hutter, 2007) to the recent "measure of intelligence" (Chollet, 2019). However, while some of these proposals have had an important impact on the understanding of what intelligence is, and of its relation to compression (Dowe et al, 2011), difficulty (Hernández-Orallo, 2015; Hernandez-Orallo, 2015) and generality (Martinez-Plumed and Hernandez-Orallo, 2018), the adoption of some of these tests (or associated definitions) in practice has been very limited.…”
Section: Any (mentioning)
confidence: 99%
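The link between test items and algorithmic information theory mentioned in this passage is commonly operationalised with a compressor: Kolmogorov complexity is uncomputable, but the compressed length of an item's description gives a crude upper bound. A minimal sketch, using zlib as a stand-in compressor.

```python
import zlib

def complexity_proxy(s: str) -> int:
    """Crude upper bound on algorithmic (Kolmogorov) complexity:
    the length of the zlib-compressed encoding of s."""
    return len(zlib.compress(s.encode("utf-8"), level=9))

# Candidate test items: a highly regular pattern compresses to far fewer
# bytes than a patternless string of the same length.
items = ["ababababababababababababab",
         "qzjxkvwpmtrshdlgnbycfoeaiu"]
for item in items:
    print(item, "->", complexity_proxy(item))
```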
“…Whereas the development of measurement instruments that follow adversarial testing is still incipient, and has not progressed significantly since (Hernández-Orallo and Dowe, 2010), it adapts according to one or more dimensions, as per the transitional and universal cases in Figure 2. Assuming each dimension is defined by a difficulty metric (Mishra et al, 2013; Hernandez-Orallo, 2015; Martinez-Plumed and Hernandez-Orallo, 2018; Martínez-Plumed et al, 2019; Hernández-Orallo, 2020), we have a multidimensional space in which adversarial testing can derive the location of the testee. By doing this, similarities and clustering are calculated in this space, with no need to explore all n(n−1)/2 combinations when n agents are being analysed.…”
Section: Building Behavioural Taxonomies (mentioning)
confidence: 99%
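A minimal sketch of the idea in this last passage: once adaptive testing has placed each of the n agents at a point in a d-dimensional difficulty space, behavioural groups can be found by clustering those points directly instead of comparing all n(n−1)/2 agent pairs. The coordinates are made up for illustration, and a tiny hand-rolled k-means stands in for any clustering method.

```python
import numpy as np

# Agents as points in a 2-D ability space: two behavioural groups, 5 each.
rng = np.random.default_rng(0)
agents = np.vstack([rng.normal([0.2, 0.8], 0.05, size=(5, 2)),
                    rng.normal([0.9, 0.3], 0.05, size=(5, 2))])

# Tiny k-means (k=2), seeded with one agent from each group. Each Lloyd
# iteration costs O(n*k) distance computations, so the quadratic blow-up
# of all-pairs comparison is avoided as n grows.
centres = agents[[0, 5]].copy()
for _ in range(10):
    dists = np.linalg.norm(agents[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centres = np.array([agents[labels == j].mean(axis=0) for j in range(2)])

print(labels)  # e.g. [0 0 0 0 0 1 1 1 1 1]: two behavioural clusters
```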