Establishing a physics concept inventory using computer marked free-response questions

Mark A. J. Parker¹*, Holly Hedgeland¹,², Sally E. Jordan¹, Nicholas St. J. Braithwaite¹
¹ School of Physical Sciences, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK
² Clare Hall, University of Cambridge, Herschel Road, Cambridge, CB3 9AL, UK
* Corresponding author
EUR J SCI MATH ED, Volume 11, Issue 2, pp. 360-375.
Published Online: 05 December 2022, Published: 01 April 2023


The study covers the development and testing of the alternative mechanics survey (AMS), a modified force concept inventory (FCI) that uses automatically marked free-response questions. Data were collected over three academic years from 611 participants taking physics classes at high school and university level, and a total of 8,091 question responses were gathered to develop and test the AMS. The AMS questions were tested for reliability using classical test theory (CTT), and the AMS computer marking rules were tested for reliability using inter-rater reliability (IRR). Findings from the CTT and IRR studies demonstrated that the AMS questions and marking rules were, overall, reliable. The AMS was therefore established as a physics concept inventory that uses automatically marked free-response questions, and the approach used to develop and test it could inform further attempts to build concept inventories of this kind.
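
The abstract names the reliability machinery without giving formulas. As a rough, hypothetical sketch (not the authors' code; the function names, data shapes, and random scores below are all illustrative assumptions), the following Python computes the standard CTT item statistics (difficulty, corrected item-total discrimination, and KR-20 internal consistency) together with Cohen's kappa, a common chance-corrected agreement statistic for comparing computer marking against a human marker.

```python
import numpy as np

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """CTT difficulty index: proportion of respondents scoring
    each item correct (rows = respondents, columns = items)."""
    return scores.mean(axis=0)

def item_discrimination(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total (point-biserial) correlation: each
    item's score against the total of the remaining items."""
    n_items = scores.shape[1]
    r = np.empty(n_items)
    for i in range(n_items):
        rest = scores.sum(axis=1) - scores[:, i]
        r[i] = np.corrcoef(scores[:, i], rest)[0, 1]
    return r

def kr20(scores: np.ndarray) -> float:
    """Kuder-Richardson 20: internal-consistency reliability
    for dichotomously scored (0/1) items."""
    k = scores.shape[1]
    p = scores.mean(axis=0)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / total_variance)

def cohens_kappa(marks_a, marks_b) -> float:
    """Cohen's kappa: agreement between two markers (e.g. the
    computer marking rules and a human), corrected for chance."""
    a, b = np.asarray(marks_a), np.asarray(marks_b)
    p_observed = (a == b).mean()
    p_chance = sum((a == c).mean() * (b == c).mean()
                   for c in np.union1d(a, b))
    return (p_observed - p_chance) / (1.0 - p_chance)

# Illustrative only: random 0/1 scores for 611 respondents x 30 items.
rng = np.random.default_rng(seed=1)
scores = (rng.random((611, 30)) > 0.4).astype(int)
print("difficulty:", item_difficulty(scores)[:5])
print("discrimination:", item_discrimination(scores)[:5])
print("KR-20:", kr20(scores))
print("kappa:", cohens_kappa(scores[:, 0], scores[:, 1]))
```

Thresholds vary across the CTT and IRR literature, but difficulty values between roughly 0.3 and 0.9, discrimination above about 0.2, KR-20 of at least about 0.7, and kappa above about 0.6 are commonly cited as acceptable; the specific criteria the authors applied are given in the full text.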


Parker, M. A. J., Hedgeland, H., Jordan, S. E., & Braithwaite, N. S. J. (2023). Establishing a physics concept inventory using computer marked free-response questions. European Journal of Science and Mathematics Education, 11(2), 360-375.

