Efficient and Robust Model Benchmarks with Item Response Theory and Adaptive Testing.

Authors

Hao Song, Peter Flach

DOI:

https://doi.org/10.9781/ijimai.2021.02.009

Keywords:

Item Response Theory, Adaptive Testing, Model Evaluation, Benchmark

Abstract

Progress in predictive machine learning is typically measured on the basis of performance comparisons on benchmark datasets. Traditionally, such empirical evaluations are carried out on large numbers of datasets, but this is becoming increasingly hard due to computational requirements and the often large number of alternative methods to compare against. In this paper we investigate adaptive approaches to achieve better efficiency in model benchmarking. For a large collection of datasets, rather than training and testing a given approach on every individual dataset, we seek methods that allow us to pick only a few representative datasets to quantify a model’s goodness, and to extrapolate from these to its performance on the other datasets. To this end, we adapt existing approaches from psychometrics: specifically, Item Response Theory and Adaptive Testing. Both are well-founded frameworks designed for educational tests. We propose modifications to meet the requirements of machine learning experiments, and present experimental results to validate the approach.
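
As a rough illustration of the approach sketched in the abstract, the Python snippet below runs adaptive testing over a pool of benchmark datasets using a binary two-parameter logistic (2PL) IRT model: datasets play the role of test items, the model under evaluation has a latent "ability", and the next dataset to run is the one with maximum Fisher information at the current ability estimate. This is a minimal sketch under stated assumptions, not the authors' implementation: the item parameters a and b, the evaluate callback, and the binary pass/fail outcome (the paper builds on richer IRT variants) are illustrative placeholders; in practice the item parameters would first be calibrated from historical results of many models on the same datasets.

import numpy as np


def p_correct(theta, a, b):
    """2PL probability that a model with ability theta 'passes' a dataset
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of an item at ability theta (higher = more informative)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)


def map_ability(responses, grid=np.linspace(-4, 4, 401)):
    """MAP estimate of ability over a grid, with a standard-normal prior.
    `responses` is a list of (a, b, y) triples with y in {0, 1}."""
    log_post = -0.5 * grid ** 2  # log of the N(0, 1) prior, up to a constant
    for a, b, y in responses:
        p = p_correct(grid, a, b)
        log_post += y * np.log(p) + (1 - y) * np.log(1.0 - p)
    return grid[np.argmax(log_post)]


def adaptive_benchmark(evaluate, a, b, budget=5):
    """Adaptively pick `budget` datasets for a new model.
    `evaluate(i)` returns 1/0 for dataset i (e.g. accuracy above a threshold);
    a[i], b[i] are pre-calibrated item parameters for dataset i."""
    remaining = list(range(len(a)))
    responses, theta = [], 0.0
    for _ in range(budget):
        # next item = still-unseen dataset that is most informative at the current theta
        i = max(remaining, key=lambda j: item_information(theta, a[j], b[j]))
        remaining.remove(i)
        responses.append((a[i], b[i], evaluate(i)))
        theta = map_ability(responses)
    return theta  # p_correct(theta, a, b) then extrapolates to the unseen datasets


# Toy run with hypothetical calibrated parameters for 20 datasets.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=20)   # discriminations
b = rng.normal(0.0, 1.0, size=20)    # difficulties
true_theta = 0.8                     # ability of the model being benchmarked
theta_hat = adaptive_benchmark(
    lambda i: int(rng.random() < p_correct(true_theta, a[i], b[i])), a, b)
print("estimated ability:", theta_hat)

The budget of five datasets here is arbitrary; in a real benchmark it would be a trade-off between compute cost and the precision of the ability estimate.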

Published

2021-03-01

How to Cite

Song, H. and Flach, P. (2021). Efficient and Robust Model Benchmarks with Item Response Theory and Adaptive Testing. International Journal of Interactive Multimedia and Artificial Intelligence, 6(5), 110–118. https://doi.org/10.9781/ijimai.2021.02.009