An investigation of raters' ratings in national performance exams (industrial design, music cognition, puppet theatre, architectural design, and architectural sketch) using classical methods and many-facet Rasch models

Ezanloo, Balal; Hajatpour Ghaleh Roudkhani, Sara

doi:10.22034/emes.2023.528161.2244

Article type: Research article

Authors

¹ Assistant Professor, Faculty of Psychology and Educational Sciences, Kharazmi University, Tehran, Iran

² Master of Educational Research, Kharazmi University, Tehran, Iran

Abstract

Objective: This study examined rater severity/leniency and central tendency in the scoring of the national performance tests administered by the National Organization for Educational Testing (Sazman-e Sanjesh).
Method: Secondary data were used from the architectural design test of 1396 (5437 examinees), the architectural sketch test of 1397 (7459 examinees), the national industrial design test of 1396 (1365 examinees), the music test of 1397 (569 examinees), and the puppet theatre test of 1397 (97 examinees). The data were analyzed with classical methods and many-facet Rasch models.
Findings: In both architecture tests, overall consistency (the relative correspondence between raters' rankings) was acceptable, but consistency was low in the industrial design, music cognition, and playwriting tests. Consensus (agreement) was also low in all five tests examined.
Conclusion: The many-facet models showed that rater severity and use of the lower-end scores of the rating scale were present in both architecture tests, whereas, as expected, there was no central tendency effect. The industrial design, music cognition, and playwriting tests could not be analyzed with many-facet models because their data collection designs were unsuitable. For example, across all responses or tasks of a test, each judge rated only 2 separate cases with no overlap between the cases rated by different raters, so the raters could not be properly compared; or each task or question in a test was rated by different judges. Based on the findings, it is recommended that, when national performance tests are scored, first, a suitable rating design be used and, second, raters be trained in scoring performance tests to prevent effects such as severity or leniency and reduced agreement.

Keywords

20.1001.1.24762865.1402.13.42.5.4

Subjects

Measurement and assessment in higher education

Article title [English]

An Investigation of the Evaluators' Ratings of the Performance Exams in the Field of Arts Using Multi-Faceted Rasch Model

Authors [English]

Balal Ezanloo ¹
Sara Hajatpour ²

¹ Assistant Professor, Faculty of Psychology and Education, Kharazmi University, Tehran, Iran

² Master of Educational Research, Kharazmi University, Tehran, Iran

Abstract [English]

Objective: This study examined rater severity/leniency and central tendency in the scoring of performance tests administered by the National Organization for Educational Testing (NOET).
Methods: Secondary data were used from the Sketch Architecture Tests (Solar Hijri years 1396 and 1397, with 5437 and 7459 examinees, respectively), the Industrial Design test (1396, 1365 examinees), the Music Recognition test (1397, 569 examinees), and the Playwriting test (1396, 97 examinees). The data were analyzed with classical methods and many-facet Rasch models.
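For orientation, a many-facet Rasch model with examinee, task, and rater facets is commonly written in the rating-scale form below. The abstract does not state which specification the authors fitted, so the facets and symbols here are illustrative assumptions rather than the study's model.

```latex
% Log-odds that examinee n receives category k rather than k-1
% from rater j on task i (rating-scale form; symbols are assumptions).
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \beta_i - \alpha_j - \tau_k
% \theta_n : proficiency of examinee n
% \beta_i  : difficulty of task i
% \alpha_j : severity of rater j
% \tau_k   : threshold between categories k-1 and k
```

Under this form, a larger rater severity lowers the log-odds of the higher category for every examinee that rater scores, which is how a severity effect is separated from examinee proficiency.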
Results: The classical methods show that rater consistency is generally acceptable in both Sketch Architecture Tests but low in the other tests (industrial design, music cognition, and playwriting). Rater consensus is low in all five tests. The many-facet Rasch models show that, in both Sketch Architecture Tests, rater severity and a tendency to use the lower scores of the rating scale are present, whereas, as expected, no central tendency effect was found. Because of unsuitable data collection designs, the industrial design, music cognition, and playwriting tests could not be analyzed with many-facet Rasch models.
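The distinction between consistency and consensus can be illustrated with a minimal sketch. It assumes two raters scoring the same examinees, uses exact agreement as the consensus index and the Pearson correlation as the consistency index, and runs on made-up scores; the abstract does not say which classical indices the authors computed, so the function names and data are hypothetical.

```python
from math import sqrt


def exact_agreement(r1, r2):
    """Consensus: share of examinees given the identical score by both raters."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def pearson(r1, r2):
    """Consistency: linear correlation between the two raters' scores."""
    if len(r1) != len(r2) or len(r1) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    v1 = sum((a - m1) ** 2 for a in r1)
    v2 = sum((b - m2) ** 2 for b in r2)
    return cov / sqrt(v1 * v2)


# Hypothetical scores: rater B is always exactly one point harsher than rater A.
rater_a = [5, 7, 6, 9, 8, 4]
rater_b = [4, 6, 5, 8, 7, 3]

print("consensus  :", exact_agreement(rater_a, rater_b))
print("consistency:", round(pearson(rater_a, rater_b), 3))
```

In this toy case the raters rank the examinees identically but never award the same score, so consistency is perfect while consensus is zero. A uniformly severe rater is one way to obtain acceptable consistency alongside low consensus.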
Conclusion: Based on the findings, it is recommended that, when NOET raters score national performance tests, first, a proper rating design be selected and used and, second, raters be trained in scoring performance tests to prevent severity or leniency effects and reduced agreement (consensus) between raters.
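The abstract attributes the failed analyses to rating designs in which the cases scored by different raters do not overlap. A many-facet Rasch analysis needs raters to be linked through shared examinees so that their severities can be placed on one scale. The sketch below is one way to check that before data collection, under the assumption that the design is available as (rater, examinee) pairs; the function name and the toy designs are hypothetical.

```python
def rater_components(ratings):
    """Group raters into linked subsets.

    `ratings` is an iterable of (rater_id, examinee_id) pairs. Two raters are
    linked when a chain of shared examinees connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Treat the design as a bipartite graph: each rating joins a rater node
    # to an examinee node.
    for rater, examinee in ratings:
        union(("rater", rater), ("examinee", examinee))

    groups = {}
    for node in list(parent):
        kind, name = node
        if kind == "rater":
            groups.setdefault(find(node), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical design with no overlap: each rater scores 2 separate cases.
disjoint = [("R1", "E1"), ("R1", "E2"), ("R2", "E3"), ("R2", "E4")]
# Same design plus one shared examinee, which links the two raters.
linked = disjoint + [("R2", "E2")]

print(rater_components(disjoint))
print(rater_components(linked))
```

A design that returns more than one group has disconnected subsets of raters, and severity estimates from different subsets are not directly comparable. Adding even a small number of commonly rated examinees is enough to connect them.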

Keywords [English]

Keywords: Multi-faceted Rasch models
severity
leniency
central tendency
performance tests

Volume 13, Issue 42
Tir 1402 (June-July 2023)
Pages 100-123
