The Effect of Sample Size and Test Length on Equated Scores and Error of Equating: The Case of Iranian National Tests

Document Type: Original Article



This study assessed the effect of sample size and test length on the equated scores and equating error of the kernel method (using chained and post-stratification approaches), as well as the merits and demerits of this method compared with classical equating techniques. The population and sample consisted of examinees who took Iranian national tests (TOLIMO and the Comprehensive Tests of the Iran Educational Testing Service) administered in 2012-2013. Each TOLIMO form had 123 items, including 17 anchor items. For the Comprehensive Tests, only items from the general-domain subjects common to mathematics and physics, science, and humanities were used. To investigate the effect of sample size on equating accuracy, three samples of 200, 500, and 1000 examinees were randomly drawn from the data of all participants and analyzed. To examine the effect of test length, a 40-item test (10 items from each subject) was randomly selected from the general subjects of the Comprehensive Tests. Thus, for the Comprehensive Tests, a 100-item test and a 40-item test were each analyzed with samples of different sizes. The equating design appropriate for TOLIMO was the NEAT (non-equivalent groups with anchor test) design, whereas for the Comprehensive Tests it was the homogeneous groups design. The equating methods applied were the mean, linear, equipercentile, circle-arc, and kernel methods. Overall, the larger the sample of examinees, the lower the standard error of equating. The findings also showed that when both sample size and test length increased, the fit of the kernel smoothing improved as well. In general, with small samples the kernel method is more advantageous than the classical equating methods.
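The relationship between sample size and standard error of equating reported above can be illustrated with a minimal sketch. It is not the authors' analysis. It assumes synthetic number-correct scores on two 40-item forms, two independent random groups, linear equating only, and a bootstrap estimate of the standard error. The function names (`linear_equate`, `simulate_scores`, `bootstrap_see`), the success probabilities, and the number of bootstrap replications are all hypothetical choices made for illustration.

```python
import random
import statistics


def linear_equate(x, scores_x, scores_y):
    """Map score x on form X to the form Y scale by matching means and SDs."""
    mu_x, mu_y = statistics.mean(scores_x), statistics.mean(scores_y)
    sd_x, sd_y = statistics.stdev(scores_x), statistics.stdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


def simulate_scores(n_people, n_items, p_correct, rng):
    """Synthetic number-correct scores (assumption: every item has the same p_correct)."""
    return [
        sum(rng.random() < p_correct for _ in range(n_items))
        for _ in range(n_people)
    ]


def bootstrap_see(x, scores_x, scores_y, n_boot, rng):
    """Bootstrap standard error of the linearly equated score at x."""
    estimates = []
    for _ in range(n_boot):
        # Resample each group with replacement and re-estimate the equated score.
        boot_x = rng.choices(scores_x, k=len(scores_x))
        boot_y = rng.choices(scores_y, k=len(scores_y))
        estimates.append(linear_equate(x, boot_x, boot_y))
    return statistics.stdev(estimates)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic "populations" for two 40-item forms; these are not the study's data.
pop_x = simulate_scores(5000, 40, 0.55, rng)
pop_y = simulate_scores(5000, 40, 0.60, rng)

see_by_n = {}
for n in (200, 500, 1000):  # sample sizes named in the abstract
    sample_x = rng.sample(pop_x, n)
    sample_y = rng.sample(pop_y, n)
    see_by_n[n] = bootstrap_see(20, sample_x, sample_y, 200, rng)
    print(n, round(see_by_n[n], 3))
```

Under these assumptions the estimated standard error shrinks as the sample grows, roughly in proportion to one over the square root of the sample size. The sketch says nothing about the kernel, equipercentile, or circle-arc methods, which would need their own estimators.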

