MEASURING HIGHER-ORDER THINKING SKILLS IN SCIENCE AMONG PRIMARY SCHOOL STUDENTS USING ITEM RESPONSE THEORY

Abstract
Higher-order thinking skills (HOTS) are a crucial competence in education. They help learners solve problems and make decisions efficiently by anticipating connections between divergent ideas. The present study aims to develop a reliable and valid instrument to assess higher-order thinking skills in science among primary school students. The study followed eight stages of a development model adapted from a previous study. The total sample of this research comprised 428 fifth-grade students from six primary schools located in urban and rural areas of Mongolia. The gathered data were analyzed using SPSS 22.0 and STATA 16.0 to examine the item characteristic curves, test reliability, and item correlations. The study recommends developing creativity skills through exercise-based activities so that item developers can produce reliable and valid instruments to assess HOTS.


Introduction
In the 21st century, technological advancement and changes in the socioeconomic climate and workplace require future citizens to have a wide range of skills to face new challenges (OECD, 2015; Otgonbaatar, 2021a). To address these challenges, educators and international organizations have emphasized specific skills, such as critical thinking, creative thinking, problem-solving, and decision-making, encapsulated under the term "Higher-Order Thinking Skills" (HOTS) (Anderson & Krathwohl, 2001; Scully, 2017). However, these skills have also been described with other terms, such as 21st-century skills (The Partnership for 21st Century Skills, 2009), transversal competencies (UNESCO, 2015), and social and emotional skills (OECD, 2015). The concept of HOTS connects to Bloom's Taxonomy of Educational Objectives and mainly corresponds to the top three levels of the taxonomy: analyzing, evaluating, and creating (Anderson & Krathwohl, 2001; Nitko & Brookhart, 2007; Scully, 2017). Most countries report that these skills are not taught as separate subjects but are incorporated across the curriculum (OECD, 2015; Ontario, 2015). These studies identified the importance of developing skills in relation to specific subjects, rather than as topics for separate teaching. Thus, there is a call for education systems to intentionally emphasize and develop these specific skills through deliberate changes in curriculum design and pedagogical practice (Ontario, 2016; Otgonbaatar, 2021b). Students' HOTS are fostered through a collaborative process across all subjects, which means that a person cannot develop these skills in isolation (Lawson, 1993; Shellens & Valcke, 2005).
Notably, primary and secondary educational reforms primarily referenced the poor results of fourth and eighth graders in the Trends in International Mathematics and Science Study (TIMSS) 2011, in which Mongolian students performed very poorly in mathematics and natural sciences (39.6% for fourth graders and 25.8% for eighth graders). These scores highlight the unacceptable quality of education and the inability of the education sector to meet labor market needs. Research conducted at the national level shows that learning achievement is not progressing and that results remain below 60% at all education levels (Education in Mongolia, a country report, 2019, p. 8). According to Brookhart (2010) and Tanujaya, Mumu & Margono (2017), there is a linear, positive, and robust relationship between HOTS and students' academic achievement, so successfully assessing and teaching higher-order thinking can be expected to increase student achievement.

Material and Methods
The study was conducted using a correlational research design. The researcher drew on existing research to develop the approach, following the Borg & Gall model (1983), which describes ten steps in the test development process. The current study consists of eight stages adapted from the Borg & Gall model: 1) Needs Analysis, 2) Planning, 3) Develop the Preliminary Form of the Product, 4) Field Testing, 5) Product Revision, 6) Operational Field Testing, 7) Final Product Revision, and 8) Dissemination and Implementation. The target population consisted of fifth-grade primary school students in Mongolia. The total sample of this research was 428 fifth-grade students from six primary schools, three located in Ulaanbaatar and three in Orkhon Province, Mongolia. Since 2013, the Mongolian government has been introducing and implementing a new curriculum nationwide, which these schools already implement. The question types selected were multiple-choice and open-ended, which Paul and Nosich (1992) argue is the best approach for assessing HOTS.

Data analysis and results
Data were collected from a pilot test with 58 fifth-grade students. The pilot data were gathered and entered into SPSS 22.0, yielding a reliability coefficient of 0.62. Since this initial Cronbach's alpha of 0.62 was unacceptable, the researcher revised the instrument, after which Cronbach's alpha increased to 0.71 in the large-scale sample. The reliability coefficient of 0.71 is therefore within the acceptable range: Nunnally (1967) states that 0.70-0.80 is a good range for a classroom test.
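The study computed Cronbach's alpha in SPSS. As a minimal sketch of how the statistic is obtained (an illustration, not the study's actual analysis script; the function name is our own), the standard formula α = k/(k−1) · (1 − Σσᵢ²/σₜ²) can be applied to an examinees-by-items score matrix:

```python
import numpy as np

def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for an examinees x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)
```

When every item rises and falls together across examinees, the item variances account for only part of the total-score variance and alpha approaches 1; uncorrelated items drive it toward 0.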
A detailed analysis of descriptive statistics was conducted on the large sample. The minimum, maximum, mean, median, mode, and standard deviation were calculated and are shown in Table 1. The item parameter is a fundamental concept of IRT. Item discrimination shows the ability of an item to differentiate between good and poor students. The characteristic of a better test item is that high-ability students answer it correctly more frequently than lower-ability students. The item discrimination parameter expresses how well an item differentiates among examinees with different ability levels. Satisfactory and good items usually have discrimination values ranging from 0.5 to 2. High discrimination indicates that higher-scoring candidates tend to answer the item correctly, while lower-scoring candidates tend to answer it incorrectly. Item difficulty is one of the essential concepts in psychometrics and one of the most useful statistics in item analysis. The item difficulty, known as the b parameter, is essentially the percentage of examinees who answered the item correctly. The greater the difficulty of an item, the higher an examinee's ability must be to answer that item correctly. Items with greater difficulty are hard items, which low-ability examinees are unlikely to answer correctly; items with low difficulty are easy items, which most examinees will answer correctly (Otgonbaatar, 2016).
According to Table 2, most of the items fell into the medium difficulty category and satisfied the discrimination index criteria. Item 12 (creating skill) has the highest item difficulty (a difficulty index of 0.25), which means it is the hardest item. Item 10 (analyzing skill) has the lowest item difficulty (a difficulty index of 0.76), making it the easiest item. In addition, Items 5 (evaluating skill) and 9 (analyzing skill) have the lowest item discrimination indices, meaning that they are least able to distinguish between examinees who are knowledgeable and those who are not. Items with a difficulty index of p < 0.3 are considered too hard, and those with p > 0.7, too easy. The following formula (Güler, 2014) was used to calculate item difficulty indices for open-ended items:

Item difficulty index = (x − y) / (z − y)

where x is the mean score received on the item, y is the minimum score receivable on the item, and z is the maximum score receivable on the item. The item discrimination index (d-value) falls in the ranges d ≥ 0.40, quite satisfactory; 0.30 ≤ d ≤ 0.39, good; 0.20 ≤ d ≤ 0.29, marginal and needing revision; and d ≤ 0.19, poor.
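The Güler (2014) formula for open-ended items can be sketched directly in Python. The hard/easy cut-offs below (p < 0.3 and p > 0.7) follow the classification used in this passage, and the function names are our own illustration:

```python
def difficulty_index(mean_score: float, min_score: float, max_score: float) -> float:
    """Güler (2014) item difficulty index: (x - y) / (z - y)."""
    return (mean_score - min_score) / (max_score - min_score)

def classify_difficulty(p: float) -> str:
    """Label an item by its difficulty index p (proportion-style: high p = easy)."""
    if p < 0.3:
        return "too hard"
    if p > 0.7:
        return "too easy"
    return "medium"
```

For example, an open-ended item scored 0-3 with a mean score of 0.75 yields an index of (0.75 − 0) / (3 − 0) = 0.25 and would be classified as too hard, consistent with Item 12 in Table 2.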
According to Table 3, five items in the test instrument were intended to measure analyzing skills. Since these items measure the same construct, they should correlate with each other. The intercorrelations of all items were analyzed using the responses of the 428 sampled students to each question. All three types of items had high correlations, so it was determined that the items measure the same construct. The results are shown in Tables 3, 4, and 5. Based on these tables, the correlations among the analyzing-skill items (Q1, Q4, Q10), the evaluating-skill items (Q2, Q5, Q8, Q11), and the creating-skill items (Q3, Q6, Q12) are significant (p-value = 0.01). These questions can be used to measure the same skill and can serve as an instrument to measure higher-order thinking skills. In Figure 1, item characteristic curves (ICCs) are shown for the analyzing-skill items. The left-hand curve, for Q10, represents the easiest item: the probability of a correct answer is relatively high even for low-ability students and is close to 1 for high-ability students. According to Figure 1, the ICC of an analyzing-skill item shows how the item measures ability: the probability of success increases as ability increases, and for Q10 the probability of a correct answer rises quickly with ability, confirming that it is an easier item that even low-ability examinees should answer correctly. The range of HOTS tasks included in the survey assessment allows for describing six levels of problem-solving proficiency (Table 6). The first level is the lowest described level and corresponds to an elementary level of higher-order thinking skills; the top level corresponds to the highest level of higher-order thinking skills. Students with a proficiency score within the range of the first level are expected to complete most elementary-level tasks successfully, but they are unlikely to be able to complete tasks at higher levels. Students with scores in the top level range are likely to be able to complete all the tasks included in the survey assessment of problem-solving.
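The intercorrelation check described above (scores on items measuring the same construct should correlate) can be sketched as a Pearson correlation matrix over the item-score columns. This is an illustration of the computation, not the study's SPSS output; the function name is our own:

```python
import numpy as np

def item_intercorrelations(scores) -> np.ndarray:
    """Pearson correlation matrix of the item columns (examinees x items)."""
    return np.corrcoef(np.asarray(scores, dtype=float), rowvar=False)
```

Items whose score columns rise and fall together across examinees produce off-diagonal entries near 1, which is the pattern the study interprets as the items measuring the same skill.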

Discussion and Conclusion
The procedure described in this study to develop and validate the higher-order thinking skills test items was mainly in line with the checklist suggested for the preparation of multiple-choice and open-ended questions (Paul & Nosich, 1992; Haladyna, 1997; TIMSS 2011 items).
The reliability coefficient of the sample data was 0.71, which is acceptable. Therefore, the test items developed in this study measure higher-order thinking skills among the target population with adequate reliability.
All items were developed to measure the three components of higher-order thinking skills, at levels of Bloom's taxonomy higher than applying. No ambiguous sentences or words were used in any item stem, table, or figure. All items were intercorrelated, and all items converged on the same construct. Therefore, it is believed that the items used in this study have high content and construct validity. Item analysis revealed that the items for analyzing skills have moderate validity coefficients, while those for evaluating and creating skills have higher validity coefficients. It can be judged that the items are valid and tend to measure the same skill.
Results show that students' performance on the HOTS test was below the expected average score. Notably, performance on the creating-skill tasks was lower than on the analyzing and evaluating tasks. The test instruments used to measure higher-order skills are nonetheless reliable and valid for the purpose of this study. The performance of higher-order thinking skills at the national level was low among the target population: the students are less trained in solving HOTS-related tasks. The fifth-grade students' skills are weak in creative problem-solving, intellectual analysis, making assumptions, and executing independent actions. Similar findings were reported in a study that examined creativity among Mongolian students (Otgonbaatar, 2020). This may have multi-faceted reasons. One causal factor is that the students might be unfamiliar with the item formats and how the questions were posed; Mongolian primary school students do not receive much training in solving higher-order thinking items or in demanding higher-order thinking activities.
This study determined that the level of Mongolian fifth-graders' HOTS and the status of implementation of the new curriculum are low, showing that the quality of reform implementation still faces challenges.
Published in the European Journal of Education Studies (Open Access Publishing Group) under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Table 1: Descriptive statistics on student performance

Table 2: Item response theory parameters

Table 3: The Pearson correlations of analyzing skill items

Table 4: The Pearson correlations of evaluating skill items

Table 5: The Pearson correlations of creating skill items

Table 6: Relationship between items and student performance on a higher-order thinking scale (adapted from PISA 2012)