“…For the present framework we generate a comprehensive repository of AI benchmarks (Martínez-Plumed et al., 2020a,b) based on our previous compilation, analysis and annotation of AI papers and benchmarking results (Hernández-Orallo, 2017a; Martínez-Plumed et al., 2018; Martínez-Plumed and Hernández-Orallo, 2018; Martínez-Plumed et al., 2020a,b), as well as open resources such as Papers With Code (the largest up-to-date, free and open repository of machine learning code and results), which includes data from several AI-related repositories (e.g., EFF, NLP-progress, SQuAD, RedditSota, etc.). All these repositories draw on data from multiple (verified) sources, including academic literature, review articles and code platforms focused on machine learning and AI. For the purposes of this study, from the aforementioned sources we track the reported evaluation results (when available or when sufficient data is provided) on different metrics of AI performance across separate AI benchmarks (e.g., datasets, competitions, challenges, awards, etc.)…”
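As a rough illustration of this collection step, the sketch below queries the public Papers With Code REST API for evaluation tables (leaderboards) and prints the reported metric values per benchmark. The base URL and the endpoint and field names (/api/v1/, evaluations, results, metrics) follow the API's public documentation but should be read as assumptions of this sketch, not as the tooling actually used in the study.

```python
# Hedged sketch: pull reported benchmark results from Papers With Code.
# Endpoint paths and field names are assumptions based on the publicly
# documented REST API (https://paperswithcode.com/api/v1/), not the
# original study's pipeline.
import requests

BASE = "https://paperswithcode.com/api/v1"


def fetch_all(endpoint: str) -> list[dict]:
    """Follow the API's page-based pagination and collect every record."""
    records, url = [], f"{BASE}/{endpoint}/"
    while url:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload.get("results", []))
        url = payload.get("next")  # None on the last page
    return records


# Hypothetical example: walk a few evaluation tables and report, for each,
# the task, the dataset, and the metric values each paper achieved on it.
for table in fetch_all("evaluations")[:5]:
    rows = fetch_all(f"evaluations/{table['id']}/results")
    for row in rows[:3]:
        print(table.get("task"), table.get("dataset"), row.get("metrics"))
```

Grouping results by (task, dataset, metric) in this way is one plausible route to the per-benchmark performance series the passage describes; the slicing to a handful of tables and rows is only to keep the example's output short.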