A Comparison between Benchmarking and Bookmarking for Classifying Performance Levels in a Large-scale Mathematics Assessment

Document Type: Original Article


Assistant Professor, Research Institute for Education



Objective: Standard setting is an assessment technique for creating valid classifications of examinees. The present study examined the effect of two standard setting methods, benchmarking and bookmarking, on the results of a large-scale study designed to assess mathematics learning among sixth-grade students in Tehran.
Methods: The two methods were compared using data from a provincial large-scale assessment administered to 9,720 sixth-grade students in Tehran. The students answered 264 mathematics items, and their responses were analyzed using plausible values.
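The plausible-values approach mentioned above can be illustrated with a minimal sketch. This is not the study's estimation procedure: it assumes a simple Rasch model with a standard-normal ability prior and approximates one examinee's posterior on a grid before sampling from it; the response pattern and item difficulties are hypothetical.

```python
import math
import random

def plausible_values(responses, difficulties, n_pv=5, seed=1):
    """Draw plausible values for one examinee under a Rasch model.

    The posterior over ability theta (standard-normal prior times the
    Bernoulli likelihood of the observed 0/1 responses) is approximated
    on a grid from -4 to 4 logits, then sampled n_pv times.
    """
    grid = [-4.0 + 0.1 * i for i in range(81)]
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # N(0, 1) prior, up to a constant
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch success probability
            log_w += math.log(p) if x == 1 else math.log(1.0 - p)
        weights.append(math.exp(log_w))
    rng = random.Random(seed)
    return rng.choices(grid, weights=weights, k=n_pv)

# Hypothetical examinee: 3 of 4 items correct, difficulties in logits.
pvs = plausible_values([1, 1, 0, 1], [-1.0, 0.0, 0.5, 1.0])
print(pvs)
```

In operational assessments the plausible values are drawn per student and analyzed as multiple imputations, so that group-level statistics reflect measurement uncertainty rather than a single point estimate.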
Results: Applying the benchmark method showed that 75, 48, 18, and 2 percent of students attained the minimum scores of the low, intermediate, high, and advanced levels, respectively. In addition, 23.9 percent of items were located in the same level as that identified by content experts. In contrast, the quality of the content experts' classification under bookmarking was criticized after comparing successive averages with the standard deviations of the item location parameters. Moreover, examining the effect of five response probabilities (0.52, 0.57, 0.62, 0.67, and 0.75) on the classification of students indicated that, despite the literature's recommendation of a response probability of 0.67, the lowest response probability (0.52) produced the most realistic results; even so, this remains a stricter standard than the benchmarking method.
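The relationship between a response probability (RP) and the resulting bookmark cut score can be sketched as follows. This is an illustrative calculation, not the study's procedure: under a Rasch model, the cut score is the ability at which the bookmarked item is answered correctly with probability RP, and the item difficulty used here is a hypothetical value.

```python
import math

def bookmark_cut_score(b, rp):
    """Ability (theta, in logits) at which a Rasch item with difficulty b
    is answered correctly with probability rp.

    Solving 1 / (1 + exp(-(theta - b))) = rp gives
    theta = b + ln(rp / (1 - rp)).
    """
    return b + math.log(rp / (1.0 - rp))

# Hypothetical bookmarked item difficulty (logits); not taken from the study.
b = 0.0
for rp in (0.52, 0.57, 0.62, 0.67, 0.75):
    print(f"RP = {rp:.2f} -> cut score = {bookmark_cut_score(b, rp):+.3f} logits")
```

Because the mapping is monotonic in RP, a higher response probability always shifts the cut score upward, which is why the choice among 0.52 through 0.75 directly changes how many students reach each performance level.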
Conclusion: Standard setting should be treated as a technical issue in any assessment whose consequence is a grade or a pass/fail decision.

