Development of a Test Instrument to Investigate Secondary School Students’ Declarative Knowledge of Quantum Optics

https://doi.org/10.30935/scimath/10946 Abstract: This article reports the development and validation of a test instrument to assess secondary school students’ declarative quantum optics knowledge. In doing so, we respond to recent developments in physics education research: numerous researchers propose quantum optics-based introductory courses in quantum physics, focusing on experiments with heralded photons. Our test instrument’s development is based on test development standards from the literature, and we follow a contemporary conception of validity. We present results from three studies that test various assumptions which, taken together, justify a valid test score interpretation, and we provide a psychometric characterization of the instrument. The instrument is shown to enable a reliable (α = 0.78) and valid survey of declarative knowledge of quantum optics focusing on experiments with heralded photons, with three empirically separable subscales.


INTRODUCTION
The improvement of physics teaching is a central goal of physics education research. One of its most important areas is curriculum development research (Henderson, 2018). Today, there is an ongoing tradition of curriculum research not only on topics of classical physics, such as mechanics (Spatz et al., 2020) or electricity (Burde & Wilhelm, 2020), but also on advanced topics of modern physics, such as quantum physics (Kohnle et al., 2014; Müller & Wiesner, 2002).
Meanwhile, teaching proposals for quantum physics have been developed for more than twenty years to foster a detailed conceptual understanding of quantum physics among learners in schools and universities. Given emerging technical advances in the preparation and detection of single-photon states, diverse experiment-based approaches for teaching quantum physics have been developed (Bronner et al., 2009; Galvez et al., 2005; Pearson & Jackson, 2010; Thorn et al., 2004). Most of these experiment-based teaching sequences focus on quantum optics experiments with heralded photons: The quantum behaviour of single photons at the beam splitter is demonstrated in such experiments, making non-classical effects tangible (Bitzenbauer & Meyn, 2020; Holbrow et al., 2002; Pearson & Jackson, 2010). Such non-classical effects, e.g., antibunching (Kimble et al., 1977), are revealed by measuring intensity correlations of light at the outputs of a beam splitter (Grangier et al., 1986; Hanbury Brown & Twiss, 1956) and allow for the demonstration of light's quantum behaviour.
Quantum optics-based teaching approaches are promising in many ways: for example, Marshman and Singh (2017) argue that quantum optical experiments can "elegantly illustrate the fundamental concepts of quantum mechanics such as the wave-particle duality of a single photon, single-photon interference, and the probabilistic nature of quantum measurement" (p. 1). Building the doctrine of quantum physics on such single-photon experiments leads to a conception of photons consistent with quantum electrodynamics, namely that of quantum objects, as "quanta of various continuous space-filling fields" (Hobson, 2005, p. 61). According to Jones (1991), this "physical picture of the radiation field produced by quantum electrodynamics (QED) is satisfactory" (p. 97). Thus, single-photon experiments "provide the simplest method to date for demonstrating the essential mystery of quantum physics" (Pearson & Jackson, 2011, p. 1) without using historical approaches or mechanistic analogies in quantum physics lessons. These are known to lead to fundamental misunderstandings about quantum physics among students (Henriksen et al., 2018;Olsen, 2002).
Development research requires empirical research: Many studies have investigated typical learners' difficulties in quantum physics (Fischler & Lichtfeldt, 1992; Mashhadi & Woolnough, 1999; Singh & Marshman, 2015; Styer, 1996). Most of these studies referred to university students, particularly concerning quantum measurements and time evolution (Zhu & Singh, 2012a), the quantum mechanical formalism in general (Singh, 2007), or wave functions in one spatial dimension (Zhu & Singh, 2012b). However, there have been few studies, mostly with relatively small samples, on learning and teaching quantum mechanics in the context of experiments with single photons (Marshman & Singh, 2017), especially at the secondary school level (Bitzenbauer, 2021; Bitzenbauer & Meyn, 2020). In particular, there is a lack of psychometrically characterized test instruments to investigate students' learning gains in quantum optics-based teaching sequences on quantum physics using experiments with heralded photons.
In this article, we present the development of a test instrument that can be used to economically elicit learners' declarative knowledge in the context of quantum optical experiments with heralded photons at the secondary school level. First, we give an overview of the test instruments that have emerged from research in the field of quantum physics education. Then, we describe the development as well as the psychometric characterization of the new test instrument. In future studies, this instrument can be used to investigate the learning efficacy of quantum optics-based instructional approaches to quantum physics in schools.

LITERATURE REVIEW
Test instruments on quantum physics. Research on students' conceptions of quantum physics has a long tradition: In their paper, Fischler and Lichtfeldt (1992) called for a departure from analogies to classical physics in the teaching of quantum physics. They justified this by evaluating a teaching course on quantum physics and the students' conceptions found therein. Subsequent works (Ireson, 1999; Mannila et al., 2002) took up some of Fischler's and Lichtfeldt's ideas, but no standard test instrument was developed. Today, there are many test instruments for quantum physics of different formats, with different thematic foci, and mainly with university students as the target group. The overview in Table 1 shows a clear need: The developed test instruments are predominantly unsuitable for evaluating quantum physics teaching concepts at schools. In particular, no instrument exists with an explicit focus on quantum optics and experiments with heralded photons.
Students' conceptions of the nature of light. Studies on the nature of light generally refer to the dualism of waves and particles. An interview study with N = 25 students (Ayene et al., 2011) led to three clusters of conceptions, which the authors titled and described as follows: -Classical description: objects are described either as waves or as particles in the classical sense.
Particles are described as localized, compact objects, visualized "as a billiard ball which carries energy and momentum" (Ayene et al., 2011, p. 6).
-Mixed description: Photons are considered objects with the properties of classical particles and waves.
-Quasiquantum description: Learners' representations are predominantly dualistic but sometimes resort to viewing quantum objects as either waves or particles.  (Ireson, 1999(Ireson, , 2000 Items with rating scale Quantum phenomena and models students (partly also suitable for secondary school students) Quantum measurement test (Singh, 2001) Open-ended questions Measurement process and time evolution advanced undergraduate students QMVI (Cataloglu & Robinett, 2002)  This study's results are consistent with previous findings (Ireson, 1999(Ireson, , 2000. Ireson (1999) conducted a multivariate analysis on student understandings of quantum physics based on a survey of N = 225 learners and grouped subjects into three clusters according to their perceptions: mechanistic thinking, intermediate thinking, and quantum thinking. These three levels in learners' conceptions between classical thinking and quantum thinking have also been reported throughout other studies (Ke et al., 2005).

Development of the Test Instrument
To adapt instruction to learners' needs, learners' level of knowledge must be elicited (cf. Tyson et al., 1997; Özdemir & Clark, 2007), for example, using tests. At the beginning of test development, very central and pragmatic questions must be clarified, which influence both the development process of the instrument itself and the results of the later survey (Mummendey & Grau, 2014). To develop our instrument, we used standards from the literature as a guide (e.g., Adams & Wieman, 2011; Haladyna & Downing, 1989).
Determination of the target group. The primary target group is secondary school students.
Determination of the test objective. Typically, tests are differentiated with respect to their test objective, i.e., according to whether their objective is firstly to measure ability, secondly to classify individuals, or thirdly to record knowledge. The test instrument presented in this article falls into this third category: it is intended to survey declarative knowledge about quantum optics, focusing on experiments with heralded photons. Here, we use the term declarative knowledge to describe knowledge about objects, content, or facts (Anderson, 1996). Hence, most of the items primarily relate to concepts, some to facts.

Description of the knowledge domain.
A specification of the knowledge domain quantum optics focusing on experiments with heralded photons cannot be drawn from the literature because comparable test instruments have not been published yet, and quantum optics is not yet firmly anchored in international school curricula either (Stadermann et al., 2019). Accordingly, a theoretically based operationalization of the construct declarative knowledge on quantum optics is not possible. Therefore, in a field like quantum optics - or other modern physics topics that have not yet been empirically explored - it is even more challenging to determine a substructure of the learners' knowledge than it is in classical subject areas.
There is no standard procedure in physics education research for how to empirically approach such an area. In developing the test instrument presented here, the following approach proved fruitful: First, the newly developed test instrument was based on a model containing the three evident sub-aspects theoretical aspects, experimental aspects and photons. These three sub-aspects should be represented in the test instrument. Thus, basic knowledge (sub-aspect theoretical aspects), general knowledge about quantum objects using the example of the photon (sub-aspect photons), as well as technical-experimental considerations (sub-aspect experimental aspects) are queried (cf. Table 2). One possibility for developing items that fit this structural model of the knowledge domain is preparing a blueprint. A blueprint is a matrix containing, on the one hand, the content to be tested and, on the other hand, the students' performance levels to be achieved in these individual content areas (Krebs, 2008). Such a blueprint was created at the initial stage of test development according to the steps outlined by Flateby (2013).
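A blueprint of this kind can be represented as a simple matrix of content sub-aspects by intended performance levels. The following sketch is purely illustrative: the item counts and the performance-level labels are hypothetical and are not taken from the authors' actual blueprint.

```python
# Minimal sketch of a test blueprint: content sub-aspects (rows) crossed
# with intended performance levels (columns), each cell holding the number
# of planned items. All numbers and level labels are hypothetical.
blueprint = {
    "theoretical aspects":  {"recall": 2, "comprehension": 3},
    "experimental aspects": {"recall": 3, "comprehension": 2},
    "photons":              {"recall": 1, "comprehension": 2},
}

def total_items(bp):
    """Total number of planned items across all blueprint cells."""
    return sum(n for row in bp.values() for n in row.values())

print(total_items(blueprint))  # 13
```

A blueprint written down this explicitly makes it easy to check, at any point during item development, that each sub-aspect keeps its intended weight in the instrument.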
Decision of task format. Usually, test instruments for surveying (declarative or conceptual) knowledge consist of single- or multiple-choice items for reasons of economy. In this test, we used two-tier single-choice items. In the first tier, students choose exactly one out of three response options. In the second tier, they are additionally asked to indicate on a five-point Likert scale (1 = I guessed, …, 4 = sure, 5 = very sure) how certain they were about their answer in tier one. This additional indication of response confidence is primarily intended to minimize the influence of guessing (Brell et al., 2005). For this purpose, a point is assigned only if the correct answer was chosen in tier one and the respondent was at least sure (tier two). This leads to underestimating the participants' test scores but prevents an overestimation of learning gains when evaluating teaching concepts. Taken together, this format allows for objective data evaluation but requires the development of attractive distractors (Theyßen, 2014).
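The two-tier scoring rule described above can be sketched as follows; the option labels are hypothetical, and the confidence threshold of 4 ("sure") follows the description in the text:

```python
def score_item(choice, correct_choice, confidence):
    """Two-tier scoring: award 1 point only if the chosen option is
    correct (tier one) AND the respondent was at least 'sure' (tier two,
    rating >= 4 on the scale 1 = 'I guessed' ... 5 = 'very sure')."""
    return 1 if choice == correct_choice and confidence >= 4 else 0

# A correct answer given with low confidence earns no point:
print(score_item("a", "a", 3))  # 0
print(score_item("a", "a", 4))  # 1
print(score_item("b", "a", 5))  # 0
```

This makes explicit why the rule underestimates rather than overestimates test scores: a lucky guess never scores, while a sure but wrong answer scores nothing either.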
Formulation of appropriate distractors. For the quality of items, the quality of the distractors, i.e., the wrong answer options, is crucial. Moosbrugger and Kelava (2012) define distractors as "answer alternatives that seem plausible but do not apply" (p. 418), which expresses the difficulty in finding such distractors: they have to be recognized as wrong by knowledgeable people and should seem right to others (Glug, 2009). Therefore, distractors are usually based on widespread student conceptions. Because quantum optics-based teaching approaches to quantum physics are primarily intended to promote elaborate conceptions of the (quantum) nature of light among students, we started developing our new test instrument with a literature review of student conceptions in this domain (cf. literature review). Nevertheless, no broad studies on students' conceptions of quantum physics exist that relate more narrowly to the context of quantum optics and experiments with heralded photons. In such cases, a standard method for obtaining appropriate distractors is to use the relevant questions in an open-ended format first, for instance, in a preliminary study. Frequent errors or answers close to the correct solution can then be used as distractors for the test instrument (Krebs, 2008). Therefore, in developing the test instrument presented here, 21 items - distributed across the three sub-aspects theoretical aspects, photons and experimental aspects according to the previously developed blueprint - were first formulated as open-ended questions. These were given to N = 23 pre-service physics teachers. An initial set of test items was obtained from the pre-service physics teachers' answers because partially correct or conspicuously frequent incorrect answers served as distractors for our instrument (cf. Table 3).

Table 3 (excerpt). Example distractors obtained from the pre-service physics teachers' answers:

Theoretical aspects
- Interference I: "Interference is the superposition of exactly two waves."
- Interference II (item 8): "Conducting the double-slit experiment with single electrons leads to two well-defined detection locations on a screen behind the double slit."

Experimental aspects
- Non-linear crystal (item 2): "Laser light incident on a non-linear crystal is split into two partial beams."
- Single-photon detector (item 6): "A single-photon detector counts the number of registered photons within some time interval."
- Interferometer in single-photon experiments (item 9): "Interference of single photons actually shows that they only exist in some experiments."
- Anticorrelation factor (item 11): -
- Coincidence technique (item 12): "For experiments with heralded photons, one always needs exactly two detectors."

Photons
- Eye as photon detector (item 3): "We cannot see photons because they are too small."
- Localizability of photons (item 10): "The photon splits at the beam splitter cube because classical light is also reflected and transmitted at the beam splitter."
- Photons as energy quanta (item 13): "For me, photons are small spherical particles which sometimes show wave-like behaviour."

Test Score Interpretation
Validity. In empirical research, the validation of a test instrument is often referred to in the context of test development. The debate about the test quality criterion validity led to a shift in the conception of validity. Thus, today validity is not seen as a property of a test: "Validity is not a property of the test. Rather, it is a property of the proposed interpretations and uses of the test scores. Interpretations and uses that make sense and are supported by appropriate evidence are considered to have high validity [...]" (Kane, 2013, p. 3). Valid test score interpretation is at the centre of assuring test quality: "Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of the tests" (AERA, 2014, p. 11). The fact that a test procedure allows valid test score interpretation must be derived by arguments: Thus, more contemporary theorists of educational measurement have articulated validity as an evidence-based argument (Haertel, 2004; Kane, 2001, 2013). A prerequisite for this process is to first determine which test score interpretation is intended. This is then referred to as the intended test score interpretation. Besides, it must be determined on which assumptions this intended test score interpretation is based. Methods and procedures must then be used to check the validity of these assumptions (cf. Kane, 2001). The necessary strands of argumentation cannot be standardized.
In the development process of the test instrument presented in this article we used an iterative process of development, pilot studies and refinement of the items. The extent to which a valid test score interpretation is possible is derived argumentatively: Results from a Think-aloud study and an expert survey are combined with results from a quantitative pilot study to yield a valid interpretation of the test score.
Intended test score interpretation. The test score is meant to indicate the extent to which the concepts of quantum optics, with a focus on experiments with heralded photons, are known by secondary school students. It can, therefore, be taken as a measure of declarative knowledge in this area.
This intended test score interpretation is based on the following assumptions adapted from Meinhardt (2018), which must be checked for plausibility as part of the validity argument:
1. The items adequately represent the construct according to the structural model (three empirically separable subscales theoretical aspects, photons and experimental aspects).
2. The items evoke intended cognitive processes in the students. In particular, correct answers are not (exclusively) given due to guessing.
3. The items are understood as intended by the respondents.
4. The items and distractors are authentic for the students.
5. The respective scales adequately represent the intended sub constructs (theoretical aspects, experimental aspects and photons).
6. The construct declarative knowledge on quantum optics is distinguishable from different or similar constructs.
As stated by Meinhardt (2018, p. 198), this list can, in principle, be continued arbitrarily. However, if assumptions 1-6 can be empirically substantiated, they serve as arguments for the justification of "evaluative and generalizing conclusions" based on the test scores according to Meinhardt (2018, p. 198).

Methods and Samples
This study has two central goals: first, to empirically test the assumptions on which the intended test score interpretation is based, in order to ensure a valid interpretation of the test scores; second, to provide a psychometric characterization of the test instrument in the sense of classical test theory. For this purpose, three studies were conducted (cf. Figure 1).
Study I. We interviewed N = 8 secondary school students (grade 12) in one-on-one settings using a Think-aloud protocol to investigate students' cognitive processes when answering the test items (Ericsson & Simon, 1998). Beforehand, the students had participated in an introductory quantum physics course according to Bitzenbauer and Meyn (2020) in the classroom. In order to ensure a standardised implementation of the method, a guideline was developed. The conversations were recorded as audio files and transcribed afterwards. The evaluation of the Think-aloud interviews was based on categories from the literature (Meinhardt, 2018) related to the comprehensibility of the items, the cognitive processes occurring during test processing, and the suitability of the response format, using scaling content analysis (Mayring, 2010, p. 102). A second independent coder coded 12.5% of the data, with a high level of agreement, κ = 0.72 (95% CI [0.58; 0.85]). During the Think-aloud interviews, participants are asked to verbalize their thoughts and reflections, so that the cognitive processes of respondents in a test situation become accessible (van Someren et al., 1994). Evaluating a newly developed test instrument with the Think-aloud method is worthwhile for checking whether the test persons understand the items correctly: particularly in the case of closed questions (here, single-choice), an examination of the test items is useful because the distractors sometimes do not cover all conceivable participants' responses, influence each other, or do not correspond to the participant's natural response (Schnell, 2016); this can negatively influence the validity of the survey results (Rost, 2004). During the Think-aloud interviews, some previously unnoticed difficulties and misconceptions were revealed.
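Inter-rater agreement of the kind reported here is typically quantified with Cohen's κ. A minimal sketch with hypothetical category codes (the categories and data below are illustrative, not the study's coding):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' nominal codes:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the raters' marginals."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical category codes assigned by two coders to eight segments:
r1 = ["A", "A", "B", "B", "C", "A", "B", "C"]
r2 = ["A", "A", "B", "C", "C", "A", "B", "B"]
print(round(cohens_kappa(r1, r2), 2))  # 0.62
```

Because κ corrects the raw agreement for chance, it is more conservative than simple percentage agreement, which is why values around 0.7 are conventionally read as substantial agreement.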
Study II. The test items were revised based on study I. The beta version of the test was then completed by N = 86 undergraduate students of engineering. Before test processing, the students had been introduced to quantum physics using the quantum optics-based introductory course suggested by Bitzenbauer and Meyn (2020). Following classical test theory (Engelhardt, 2009), item difficulty and item discrimination indices were calculated for all items. Items whose statistics fell outside empirically accepted tolerance ranges were excluded from the item set. We refer to the accepted tolerance range of 0.2 to 0.8 for item difficulty (Kline, 2015). Concerning discrimination indices, values above 0.2 are considered good (Jorion et al., 2015), while others suggest a threshold of 0.3 (Fisseni, 1997). In addition, the reliability of the test instrument was calculated using Cronbach's alpha as an estimator of internal consistency (Taber, 2018). To check criterion validity, the participants' physics grades were collected as an external criterion. Multidimensionality of the test is to be assumed because of the model of the knowledge domain on which the test development is based (cf. description of the knowledge domain). We therefore check the fit of the structural model to the data using confirmatory factor analysis: "Confirmatory factor analysis is designed to assess how well a hypothesized factor structure 'fits' the observed data" (Russell, 2002, p. 1638). Confirmatory factor analyses require samples of at least 100, preferably larger (cf. Loehlin, 2004; Schumacker & Lomax, 2004). Jackson (2003) recommends a sample size to parameter ratio of at least 10:1, as "lower ratios are increasingly less trustworthy" (Kyriazos, 2018, p. 2217).

Figure 1. Overview of the development of our test instrument and pilot studies. The grey arrows indicate that such a process is never complete but should be understood as iterative.
However, MacCallum and Widaman (1999) suggest 5:1 as a cutoff ratio. In our study, the ratio is 6.6:1, so our sample can be considered sufficiently large for a confirmatory factor analysis in our pilot study. We refrain from Rasch scaling of the test because different test items refer to the same experimental setups, so the requirement of local stochastic independence is violated (Debelak & Koller, 2020).
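The classical item statistics used in study II (item difficulty, corrected item-total discrimination, Cronbach's alpha) and the Guttman split-half coefficient reported later can all be computed directly from a 0/1-scored response matrix. A minimal sketch with a hypothetical response matrix (not the study's data):

```python
import statistics as st

# Hypothetical 0/1 response matrix: rows = persons, columns = items.
X = [
    [1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1],
    [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1],
]

def difficulty(j):
    """Item difficulty: proportion of correct answers on item j."""
    return st.mean(row[j] for row in X)

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def discrimination(j):
    """Corrected item-total correlation: item j vs. total of the rest."""
    item = [row[j] for row in X]
    rest = [sum(row) - row[j] for row in X]
    return pearson(item, rest)

def cronbach_alpha():
    """alpha = k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(X[0])
    item_vars = [st.pvariance([row[j] for row in X]) for j in range(k)]
    total_var = st.pvariance([sum(row) for row in X])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def guttman_split_half():
    """Guttman split-half: 2 * (1 - (V_a + V_b) / V_total) for an
    odd/even split; no equal-variance assumption for the two halves."""
    a = [sum(row[0::2]) for row in X]
    b = [sum(row[1::2]) for row in X]
    t = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (st.pvariance(a) + st.pvariance(b)) / st.pvariance(t))

print(difficulty(0), round(discrimination(0), 2), round(cronbach_alpha(), 2))
```

With real data, items whose difficulty falls outside 0.2-0.8 or whose discrimination falls below the chosen threshold would be flagged for exclusion, exactly as described for study II.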
Study III. Finally, the test content's relevance and representativeness were verified with an expert survey (N = 8 scientists from physics education research). For each of the 13 items in the test instrument, experts were asked to rate the following three items on a 5-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = undecided, 4 = agree, 5 = agree completely) using a pen-and-paper format: a) The item's distractors are authentic; thus, they could be assumed to be true by someone who is not certain of the answer.
b) This item assesses a crucial aspect of the knowledge domain.
c) This item is of good quality.
In addition to an evaluation of the individual items, the test as a whole was also evaluated in order to check for content validity. For this purpose, a scale consisting of four items - again with a 5-point Likert scale - was developed:
1. The items represent relevant contents of the knowledge domain.
2. The contents are in an appropriate relation to each other, i.e. the weighting of the content areas is reasonable.
3. The test instrument has a high fit to the knowledge domain.

4. The test instrument covers important content aspects of the single-photon experiments.
The internal consistency of the scale was found to be α = 0.86.
To evaluate the expert survey, Diverging Stacked Bar Charts (Robbins & Heiberger, 2011) were created using the software Tableau, version 2019.3. These charts align a bar corresponding to 100% of the expert ratings relative to the scale's centre (0%). Experts' agreement corresponds to a swing of the bar to the right, and disagreement corresponds to a swing to the left. The mean value m and standard deviation SD of the ratings are also reported to quantify the experts' judgements.
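Constructing a diverging stacked bar chart reduces, per statement, to computing the left (disagreement) and right (agreement) shares of the ratings. A minimal sketch, assuming the common convention of splitting the neutral category half to each side (the ratings below are hypothetical, not the experts' actual answers):

```python
def diverging_segments(ratings):
    """Split 5-point Likert ratings into the left (disagree) and right
    (agree) percentage shares of a diverging stacked bar. The neutral
    category (3) is split half to each side, a common convention for
    such charts (cf. Robbins & Heiberger, 2011)."""
    n = len(ratings)
    left = sum(r < 3 for r in ratings) + 0.5 * sum(r == 3 for r in ratings)
    right = sum(r > 3 for r in ratings) + 0.5 * sum(r == 3 for r in ratings)
    return 100 * left / n, 100 * right / n

# Hypothetical ratings by eight experts for one statement:
print(diverging_segments([5, 4, 4, 3, 5, 4, 2, 5]))  # (18.75, 81.25)
```

A bar swinging strongly to the right (here 81.25% agreement) would visually indicate expert endorsement of the statement, which is the reading rule described in the text.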

RESULTS
The three sub-studies I-III conducted as part of the new test instrument's piloting help to verify the assumptions that the intended test score interpretation is based on. The summary of all these individual procedures enables a detailed validity argument. This supports valid test score interpretation. According to this test score interpretation, the participants' test scores represent a measure of the declarative knowledge on quantum optics, focusing on experiments with heralded photons. Besides, study II provides a psychometric characterization of the instrument.
In summary, Table 5 shows which assumption regarding the intended test score interpretation is addressed in which sub-study and with which measure.

Table 5. Assumptions underlying the intended test score interpretation and the corresponding evidence:
1. The items adequately represent the construct according to the structural model (three empirically separable subscales theoretical aspects, photons and experimental aspects). - Quantitative study (study II): confirmatory factor analysis
2. The items evoke intended cognitive processes in the students. In particular, correct answers are not (exclusively) given due to guessing. - Think-aloud study (study I)
3. The items are understood as intended by the respondents. - Think-aloud study (study I)
4. The items and distractors are authentic for the students. - Think-aloud study (study I)
5. The respective scales adequately represent the intended sub constructs (theoretical aspects, experimental aspects and photons). - Expert survey (study III)
6. The construct declarative knowledge on quantum optics is distinguishable from different or similar constructs. - Quantitative study (study II): correlation analysis with external criterion
Results of study I. The participants formulated their thoughts aloud while processing the test and were observed continuously. The results of the Think-aloud study led to a revision of the test items. Here, two items will be used as examples to show the extent to which the Think-aloud study contributed to an optimization of the test items at an early stage of the test's development.
A particular focus of the Think-aloud study was whether the participants perceived the developed distractors as authentic. In this regard, the original item 1 proved to be problematic. This was: "A beam splitter...

a) ...is employed in the Michelson interferometer, because it can be used to split an incident ray of light into two partial beams.
b) ...is a prism.
c) ...separates incident rays of light or superimposes two rays of light."
The test participants were often critical of distractor b) because equating a beam splitter with a prism created a logical contradiction; hence, this distractor was not authentic. In contrast, distractor a) proved highly authentic because, according to some respondents, the term Michelson interferometer is very attractive. Based on these observations, the item was revised to: "A beam splitter...

a) ...is employed in the Michelson interferometer, because it can be used to split an incident ray of light into two partial beams.
b) ...is made out of two merged prisms, where one of them is responsible for the transmitted beam and one for the reflected beam.
c) ...separates incident rays of light or superimposes two rays of light."
Through the utterances in the Think-aloud study, indications of the cognitive processes occurring while answering the items could be recorded. It is noticeable that the decision-making process was perceived as complex for the majority of the items. One item that proved problematic in this regard was item 13 in the final test version, beginning: "Photons are…". In summary, the Think-aloud study led to the confirmation of assumptions 2-4 (cf. Table 5), on which the intended test score interpretation is based, but all of the items were revised based on the Think-aloud protocols. As examples, we discussed two item revisions in this chapter.
Results of study II -Descriptive analysis. The psychometric parameters reported below refer to the 13 items found in the final version of the test instrument (cf. Appendix A). Items that were removed from the item set due to the item analysis are not presented and discussed here. Thus, in this chapter, we provide a psychometric characterization of our instrument based on 86 undergraduate engineering students' data.
In the 13 items, students could score a maximum of 13 points (1 point each). The students reached a mean score of m = 5.94, SD = 3.15, ranging from 0 points (four students) to 12 points (three students), cf. Figure 2. To check criterion validity, the test scores' correlation with the subjects' physics scores was determined, which was found to be r = 0.44 (p < 0.01).

Figure 2. Histogram of the students' test scores
For most of the items, each distractor is selected by at least 5% of the participants, so that none of the items has to be excluded due to a distractor being selected too rarely (cf. Table 6). The analysis of the item difficulties (cf. Table 7) showed that almost all items lie within the tolerance range of 0.20 to 0.80, ranging from 0.15 (item 1) to 0.74 (item 7). Only items 1 and 9 deviate slightly into the problematic range below 0.2. Nevertheless, items 1 and 9 were retained because they address essential aspects of the experiments with heralded photons. For the discrimination indices of the test items, values in a range between 0.31 (items 9, 12) and 0.54 (item 3) are obtained (cf. Table 7). These values are within the tolerance range above 0.3. The value of Cronbach's alpha as an estimator of the internal consistency of the test instrument is found to be 0.78 for the final test version with 13 items. The values in Table 7 also show that the internal consistency could not be raised by excluding additional items. We furthermore calculated split-half reliability using the Guttman formula, as it does not assume homogeneity of the two test halves and does not lead to an overestimation of reliability (Kerlinger & Lee, 2000); for our test instrument, we obtain a value of 0.75.

Results of study II - Confirmatory factor analysis. A confirmatory factor analysis was used to check whether there is sufficient agreement between the empirical data and the theoretical model of our knowledge domain (Moosbrugger & Kelava, 2012, p. 334), and thus to generate evidence for construct validity. The empirical data are the test scores collected in study II. The theoretical model is the structural model for the knowledge domain, which we have outlined in Table 2, consisting of the three factors theoretical aspects, experimental aspects and photons.
The model parameters were estimated using the maximum likelihood method, and the model fit was checked using several goodness-of-fit measures at the model level. We refer to the fit parameters χ²/df, the root-mean-square error of approximation RMSEA (Steiger & Lind, 1980), the comparative fit index CFI (Bentler, 1990), and the standardized root mean square residual SRMR (Hu & Bentler, 1999), together with their cut-off values for good and acceptable model fits according to Schermelleh-Engel et al. (2003). Maximum likelihood estimation requires (approximately) multivariate normally distributed data. To screen the data for normality, we used the skewness and kurtosis of the indicator variables. These values are all smaller than |2| for the reported sample, so that approximate normality can be assumed (Hammer & Landau, 1981, p. 578). For the model used here, the goodness-of-fit measures indicate an acceptable to good fit to the empirical data.

The confirmatory factor analysis results at the indicator and construct level are summarized in Table 9 and Figure 3. The correlations among the factors are statistically significant at the 1% level and range from 0.39 to 0.46. This indicates that all three subscales contribute to the students' declarative knowledge of quantum optics focusing on experiments with heralded photons.

Table 9. Confirmatory factor analysis results on indicator and construct level.

The indicator reliabilities mostly lie above the threshold for good reliability of 0.40 (Bagozzi & Baumgartner, 1994, p. 402). The Fornell-Larcker criterion is also met, because the squared correlation of any two latent variables is smaller than the average extracted variance (AEV) per factor in each case (Fornell & Larcker, 1981, p. 46). For the internal consistencies of the three subscales theoretical aspects (5 items, α = 0.68), experimental aspects (5 items, α = 0.52), and photons (3 items, α = 0.55), values are obtained that are well below the overall reliability of the test instrument of α = 0.78. This aspect will be addressed in more detail in the discussion. Thus, in summary, study II contributes to the confirmation of assumptions 1 and 6 (cf. Table 5), on which the intended test score interpretation is based.
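The normality screening described above (skewness and kurtosis of the indicator variables below |2|) can be sketched as follows; the indicator data here are simulated placeholders, not the study's scores:

```python
# Screening indicator variables for approximate normality before maximum
# likelihood estimation: per-variable skewness and excess kurtosis should
# stay below the |2| threshold used in the article.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
indicators = rng.normal(size=(86, 13))  # hypothetical indicator scores

skews = skew(indicators, axis=0)
kurts = kurtosis(indicators, axis=0)    # Fisher definition: normal -> 0
ok = bool((np.abs(skews) < 2).all() and (np.abs(kurts) < 2).all())
```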
Results of study III. In study III, an expert survey was conducted to, among other things, ensure that the developed items' distractors represent meaningful response options in terms of content. According to Landis and Koch (1977), the experts show substantial agreement on this (Fleiss' κ = 0.62), as can be seen from the descriptive statistics on the experts' ratings (cf. Table 10) and the corresponding diverging stacked bar chart (cf. Figure 4).
Furthermore, the experts show moderate consensus according to Landis and Koch (1977) (Fleiss' κ = 0.59) that all of the test instrument's items ask about content relevant to quantum optics with a focus on experiments with heralded photons. This finding is of particular importance for judging the content validity of the instrument (cf. Table 11, Figure 5). Moreover, the experts were asked to rate the quality of each item. The results (cf. Table 12, Figure 6) show that two items (items 2 and 3) are viewed critically by the experts. However, these two items were retained for didactic reasons in order to maintain the breadth of content. Finally, looking at the scale for the content validity of the test instrument as a whole (4 items, α = 0.86), a consistently positive picture emerges (cf. Table 13). Thus, in summary, study III contributes to the confirmation of assumptions 1 and 5 (cf. Table 5), on which the intended test score interpretation is based.

Figure 6. Diverging stacked bar chart for the experts' ratings on the statement "This item is of good quality" (Fleiss' κ = 0.61).
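Fleiss' kappa, as used for the expert agreement above, can be computed with statsmodels from a raw rater-by-item matrix; the ratings below are illustrative placeholders, not the experts' actual ratings:

```python
# Inter-rater agreement: Fleiss' kappa for ratings of several items by
# several raters. Rows are items, columns are raters, values are rating
# categories (placeholder data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 1, 2],   # hypothetical ratings of item 1 by four experts
    [2, 2, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 2],
    [2, 2, 2, 3],
])

# Convert the raw matrix into an items x categories count table.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method='fleiss')
```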

DISCUSSION
The development of tools to assess a given construct validly and reliably has a long tradition in science education research (Britton & Schneider, 2007; Doran et al., 1994; Tamir, 1998). Our newly developed instrument aims at assessing students' declarative knowledge of quantum optics in the context of experiments with heralded photons. Because this field has so far received little empirical attention, test development cannot draw on a body of published students' conceptions, as is the case in other fields, e.g., for the development of concept inventories such as the FCI in mechanics (Hestenes et al., 1992; Hestenes & Halloun, 1995), the DIRECT in electricity (Engelhardt & Beichner, 2004), or the KTSO-A in optics (Hettmannsperger et al., 2021). Therefore, we presented the development process in detail in this article to demonstrate one way of opening up a field that has barely been researched empirically so far.
It is a consensus across disciplines in empirical research that study results must meet the quality criteria of objectivity, reliability, and validity. Content validity is typically ensured using expert surveys, and construct validity is usually established via factor analysis. Reliability is mostly approached using KR-20 or Cronbach's alpha, as summarised by Liu (2012). In this article, we presented three studies that, taken together, were intended to assure the quality of our newly developed test instrument. A Think-aloud study was conducted at an early stage of development and led to item revisions.
Furthermore, the newly developed instrument was piloted with a sample of N = 86 undergraduate engineering students. The results show that a reliable survey of declarative knowledge of quantum optics is possible with this test instrument. In addition, a correlation analysis, a confirmatory factor analysis, and an expert survey support the assumptions on which the intended test score interpretation is based. While validity cannot be assessed conclusively, the results presented provide solid arguments that the developed quantum optics test allows for a valid test score interpretation, i.e., that the test scores can be interpreted as a measure of declarative knowledge of quantum optics. Jorion et al. (2015) provide a categorical judgement scheme and assignment rules for evaluating concept inventories. The authors use their framework to analyze three concept tests, namely the Concept Assessment Tool for Statics CATS (Steif & Dantzler, 2005), the Statistics Concept Inventory SCI (Stone et al., 2003), and the Dynamics Concept Inventory DCI (Gray et al., 2005). We refer to this scheme to judge the psychometric characteristics of our new test instrument (cf. Table 14).
While the psychometric parameters of our test instrument as a whole correspond to medium to excellent values (cf. Table 14), the reliabilities of the empirically separable subscales theoretical aspects (5 items, α = 0.68), experimental aspects (5 items, α = 0.52), and photons (3 items, α = 0.55) lie outside the recommended ranges. These subscales were confirmed using confirmatory factor analysis (cf. Table 9). Against the background of published test instruments in other domains, however, this is not surprising, because the investigation of factor structure has often led to difficulties, not least for the Force Concept Inventory (Huffman & Heller, 1995; Scott et al., 2012). For many instruments, no factor structure could be extracted at all: one example is the FCI, another the Conceptual Survey of Electricity and Magnetism CSEM (Maloney et al., 2001). For many of the instruments for which a factor structure was successfully extracted, no subscale reliabilities have been reported yet (cf. Engelhardt & Beichner, 2004; Ramlo, 2008; Urban-Woldron & Hopf, 2012). Jorion et al. (2015) examined the subscale reliabilities of the CATS (0.33 ≤ α ≤ 0.72), the SCI (0.27 ≤ α ≤ 0.47), and the DCI (0.06 ≤ α ≤ 0.62); our quantum optics test instrument's subscale reliabilities lie in a similar range. Hettmannsperger et al. (2021) present the development and validation of the concept test KTSO-A for ray optics; for the KTSO-A, subscale reliabilities in a similar or slightly higher range are reported.

LIMITATIONS AND CONCLUSION
This article reports the development and validation of a test instrument to assess secondary school students' declarative quantum optics knowledge. With this, we respond to recent developments in physics education research on learning quantum physics: numerous researchers propose quantum optics-based introductory courses in quantum physics focusing on experiments with heralded photons (cf. Introduction).
Our test instrument's development is based on test development standards from the literature (cf. chapter Development of the test instrument), and we follow a contemporary conception of validity (cf. chapter Test score interpretation). Accordingly, we present an evidence-based argument for our instrument's validation: we report the results of three studies that test various assumptions which, taken together, justify a valid test score interpretation. Future pilot studies are necessary to refine the test instrument and to address its limitations: Although the test instrument was piloted in a Think-aloud study with secondary school students, the quantitative study II was conducted with undergraduate engineering students. While we argue that these students do not differ substantially from our primary target group of secondary school students in 11th/12th grade with respect to their prior knowledge of quantum physics, the use of the instrument in larger samples of secondary school students is necessary. In this way, the psychometric characteristics reported in this article, all of which are in the acceptable to excellent range (cf. chapter Discussion), can be verified.
With the test instrument presented in this article, we want to provide the possibility to economically assess students' declarative knowledge of quantum optics focusing on experiments with heralded photons. We consider developing and validating such a test instrument the first step in the empirical investigation of secondary school students' learning processes on quantum physics in experiment-based settings. With the help of the presented test instrument, it becomes possible to evaluate the learning effectiveness of teaching concepts on modern quantum physics focusing on experiments with heralded photons. In our next studies, we will use the test instrument presented here as part of an evaluation study to investigate the effectiveness of our teaching concept on quantum optics (Bitzenbauer & Meyn, 2020) in 11th and 12th grades at secondary schools. Based on the results, both the teaching concept itself and the test instrument presented in this article will be refined in the sense of an iterative process (cf. Figure 1). In this context, the structure of the test instrument will have to be reviewed with a larger sample of the primary target group of secondary school students, as deviations from the confirmatory factor analysis results reported here due to the different population (school students vs. engineering students in this study) cannot be excluded with certainty.
Based on such large-scale evaluation studies, it can be investigated empirically which conceptions of quantum physics learners develop in such settings. Insights of this kind are necessary to uncover typical learning difficulties in quantum optics-based introductory courses in the future. To this end, we believe qualitative research methods, such as interview studies, are necessary. In the long run, this may lead to a concept inventory that makes the different teaching concepts on modern quantum physics based on experiments with heralded photons comparable, not only in terms of mere learning gains but especially with respect to the extent to which learners acquire a conceptual understanding of quantum physics.
Funding: This study was funded by the Emerging Talents Initiative (University of Erlangen, Germany).

Declaration of interest: The author declares no competing interest.
Data availability: Data generated or analysed during this study are available from the author on request.