Investigating the Comparability of Ability Parameter Estimation in Computerized Adaptive and Paper-Pencil Tests

Document Type: Original Article



This study investigated the comparability of ability parameter estimates from computerized adaptive testing (CAT) and paper-pencil testing, and sought the optimal CAT algorithm for high-stakes tests by comparing two ability estimation methods (maximum likelihood and expected a posteriori) and two test termination criteria (fixed standard error and fixed test length). The target population consisted of examinees in the mathematics and engineering subgroup of the 2013 Iranian university entrance exam. One thousand examinees were selected by random sampling, and the mathematics questions were analyzed using the 3-parameter logistic model. Forty data sets equal in size to the real data were simulated, and post hoc CAT simulation was applied. The results indicated a strong correlation between the CAT and paper-pencil ability estimates on the mathematics subscale. Furthermore, the bias, the average absolute difference, and the root mean squared difference between the CAT and paper-pencil ability estimates showed that CAT ability estimates obtained with expected a posteriori are consistent with the estimates based on the whole exam. These results suggest that CAT is capable of recovering ability on the mathematics subscale. It was concluded that expected a posteriori estimation combined with a fixed standard error stopping rule of 0.3 was the optimal algorithm for suitable reliability, appropriate test length, and recovery of the ability estimates in CAT on the mathematics subscale.
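
The procedure summarized above can be illustrated with a minimal Python sketch of a post hoc CAT simulation under the 3-parameter logistic model, using expected a posteriori (EAP) estimation and a fixed standard error stopping rule of 0.3. This is not the authors' software. The item parameters, the standard normal prior, the quadrature grid, the maximum-information item selection rule, and all function names (`p3pl`, `eap`, `post_hoc_cat`, `compare`) are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

# Quadrature grid and (unnormalised) standard normal prior: assumptions.
GRID = np.linspace(-4.0, 4.0, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric, no D = 1.7)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def eap(resp, a, b, c):
    """EAP ability estimate and posterior SD for a 0/1 response vector."""
    p = p3pl(GRID[:, None], a, b, c)                  # shape (grid, items)
    like = np.prod(np.where(resp == 1, p, 1.0 - p), axis=1)
    post = like * PRIOR
    post = post / post.sum()
    theta_hat = np.sum(GRID * post)
    se = np.sqrt(np.sum((GRID - theta_hat) ** 2 * post))
    return theta_hat, se

def item_info(theta, a, b, c):
    """Fisher information of 3PL items at theta."""
    p = p3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def post_hoc_cat(resp, a, b, c, se_stop=0.3, max_items=None):
    """Re-administer already observed responses adaptively.

    Stops when the posterior SD falls to se_stop or max_items is reached.
    Item selection by maximum information is an assumption of this sketch.
    """
    n_items = len(a)
    max_items = n_items if max_items is None else max_items
    available = np.ones(n_items, dtype=bool)
    used = []
    theta, se = 0.0, np.inf
    while len(used) < max_items and se > se_stop:
        info = np.where(available, item_info(theta, a, b, c), -np.inf)
        nxt = int(np.argmax(info))
        available[nxt] = False
        used.append(nxt)
        idx = np.array(used)
        theta, se = eap(resp[idx], a[idx], b[idx], c[idx])
    return theta, se, len(used)

def compare(theta_cat, theta_full):
    """Bias, mean absolute difference, RMSD and correlation of two estimates."""
    d = theta_cat - theta_full
    return {
        "bias": float(np.mean(d)),
        "mad": float(np.mean(np.abs(d))),
        "rmsd": float(np.sqrt(np.mean(d ** 2))),
        "r": float(np.corrcoef(theta_cat, theta_full)[0, 1]),
    }

# Illustrative data only: a hypothetical 40-item pool and 50 simulated examinees.
rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 40)
b = rng.normal(0.0, 1.0, 40)
c = np.full(40, 0.2)
true_theta = rng.normal(0.0, 1.0, 50)
data = (rng.random((50, 40)) < p3pl(true_theta[:, None], a, b, c)).astype(int)

full = np.array([eap(r, a, b, c)[0] for r in data])           # whole-test estimate
cat = np.array([post_hoc_cat(r, a, b, c)[0] for r in data])   # CAT estimate
print(compare(cat, full))
```

Replacing `se_stop` with a large `max_items` cap, or setting `se_stop=0.0` with a fixed `max_items`, gives the fixed-length variant of the stopping rule under the same assumptions.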

