Achievement tests from an item perspective : an exploration of single item data from the PISA and TIMSS studies, and how such data can inform us about students’ knowledge and thinking in science

Olsen, Rolf Vegar

Doctoral thesis

Year

2005

Abstract

Summary of chapter 1
The thesis was introduced in this chapter by presenting the fundamental rationale for why analysis of items, either one by one or through the study of profiles across a few items, is worthwhile. This rationale was based on a model of how items are typically correlated with each other and with the overall score in an achievement test such as those in TIMSS and PISA. It followed from this model that if we represent the total achievement measure by one overall latent factor, only a small fraction of the variance in the scored items is accounted for by a typical cognitive test score. Furthermore, this argument was taken one step further by also considering the categorical information in the codes initially used by the markers.
Before the variables in the data file are scored, they are nominal variables with codes reflecting qualitative aspects of students’ responses. Taken together with the theoretical model of the scored items, it was concluded that further analysis of the single items would be reasonable, and would involve the analysis of information beyond that contained in the overall score. All the empirical papers in the thesis are based on this rationale: to analyse the surplus information in the items. The purpose of the thesis was then formulated as an exploration into the nature of this surplus information, and the potential of using this information to describe qualitative differences at the student or the country level. Furthermore, the underlying motivation for doing this was stated as a desire to inform the science education community about the potential for, and limitations of, using the data from LINCAS in secondary research. This latter issue was elaborated and discussed in the next chapter.
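The one-factor argument summarised above can be sketched with a small worked example. The notation and the loading value of 0.4 are assumptions chosen purely for illustration; they are not figures taken from the thesis.

```latex
% Assumed notation: x_i is the standardised score on item i, \theta the
% overall latent factor, \lambda_i the item's loading, \varepsilon_i the residual.
\begin{align}
x_i &= \lambda_i \theta + \varepsilon_i, \qquad \operatorname{Var}(\theta) = 1, \quad \operatorname{Cov}(\theta, \varepsilon_i) = 0, \\
\operatorname{Var}(x_i) &= \lambda_i^2 + \operatorname{Var}(\varepsilon_i) = 1, \\
\lambda_i = 0.4 \;&\Rightarrow\; \lambda_i^2 = 0.16, \quad \operatorname{Var}(\varepsilon_i) = 0.84 .
\end{align}
```

Under these assumptions, the overall factor would account for 16 per cent of the variance in the scored item, leaving 84 per cent as item-specific variance and error. This remainder is what the text refers to as surplus information.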

Summary of chapter 2
This chapter gave a broad presentation of LINCAS, their policy relevance, and their link, or lack thereof, to the field of science education research. The chapter consisted of several related elements that, taken together, addressed the issue of why and how researchers in science education could or should engage in analyses of LINCAS.
This was done by presenting the historical development of LINCAS, from the first IEA studies at the end of the 1950s to the contemporary studies PISA and TIMSS. I suggested that the development in this period reflects broader societal issues. Moreover, I suggested that the development illustrates a tension or dilemma that LINCAS have been confronted with from the very beginning: LINCAS were initially framed by the idea that international comparisons could be the basis of a powerful design for studying educational issues. Thus, the main idea driving the genesis of LINCAS (which I labelled Purpose I) was an ambition to utilise the international variation in the study of general educational issues. This research base has been maintained throughout the history of LINCAS. What made it possible to conduct the increasingly expensive studies was the fact that policy makers regarded the studies as providers of policy-relevant information. Over the years there has been a shift towards the purpose of finding evidence for effective policy at the system or national level (which I labelled Purpose II), and the discussion in this chapter demonstrates that this vision for LINCAS is very visible in the PISA study. It would be fair to say that my thesis aims to promote Purpose I, and, furthermore, it aims to promote the view that the tension that is often perceived between the two purposes is to some degree based on a lack of communication and interaction between the policy makers and the educational researchers. The chapter then turned to a comparison between PISA and TIMSS. This is an issue that is worthwhile in itself because there are some indications that users of the information may be confused by discrepant results in the two surveys.
However, by examining the differences between the studies, it is evident that the results should not be compared in a simplistic manner: they have different designs targeting different populations and different levels of the school systems, they have defined the achievement measures differently, and even if many countries participate in both studies the composition of the countries in the two studies is clearly not the same. Chapter 2 continued by discussing how science education may be linked to the policy context by engaging in secondary analysis of data and documents from LINCAS. This was not to argue that all, or even most, of the research in science education should be linked to PISA or TIMSS. Nevertheless, a relatively comprehensive review of possibilities for secondary analysis related to LINCAS was presented in the chapter, and the increased potential for such analyses relating to scientific literacy in PISA after the 2006 study was emphasised.

Summary of chapter 3
Chapter 3 gave an overview of some methodological issues that have heavily influenced my work. It began by placing my work in a tradition that could best be labelled as exploratory data analysis. The main idea of this tradition is that when confronted with a data set we should seek to develop a description of the overall structure in the data, the multivariate relationship, which is a challenging task since there is no general procedure to follow for finding such overall patterns in the data. In addition, the general issue of the nature of the information in the cognitive items in TIMSS/PISA was explored in this chapter. An innovation in TIMSS was the double digit codes and the associated marking rubrics used for the constructed response items. With TIMSS it was acknowledged that using only multiple choice items, which before TIMSS was commonplace in most large-scale assessments, would seriously limit the range of competencies activated by a test. By using open-ended questions, giving students the opportunity to construct their own responses, TIMSS had the ambition of developing descriptions of how students represented and made use of concepts in science. The double digit codes were used to preserve that information. This was also the idea in the science assessment of PISA 2000, although the generic system was slightly modified. However, with PISA 2003, and with the items that have undergone field trials before PISA 2006, it is evident that the use of such coding is gradually disappearing. The reason for this change is not entirely clear, but it may be suggested that the codes have been of little use internationally. Nevertheless, constructed response items will still be used since they allow for the testing of competencies other than those tested by the selected response formats.
The paradoxical consequence of this is that from PISA 2006 more information about students’ thinking and knowledge will be available from analysis of the multiple choice items than from students’ own written accounts of their reasoning and thinking, since the former at least include a code reflecting the response selected by the students. The constructed response items that were originally introduced into these assessments as tools for making the students demonstrate their thinking and reasoning are, in the marking guides for PISA 2006, more or less directly reduced to a description of how to score the items. Even if the marking guide includes explicit descriptions of the criteria for scoring, for the great majority of items there are no longer separate codes for students with different types of responses. I will suggest that this development was perhaps inevitable given that these codes were not extensively used or reported on in the international reports. However, I regard this development as a decrease in the potential for communicating how students typically think and interact with the items in tests like PISA. Furthermore, this development can be viewed as unfortunate from the perspective that such data could possibly be an important resource for secondary analysis aimed at studying students’ understanding of very specific scientific concepts or phenomena. Figure 3.3 provided some bipolar characteristics of analyses of information at different levels in item processing from specific written responses, through the coded responses, and finally to the scored items. Information is continuously and consciously peeled off in this process. In the first process of coding, all aspects that are seen as irrelevant for the overall intention of the response are peeled off. This may, for instance, be information regarding errors in spelling, errors in grammar, and other very specific elements in the response. 
However, it may also be information that reflects characteristic features of students’ thinking and knowledge. The marking guide has to be understood similarly by all markers, in all countries, and thus it is a necessary condition that the number of codes is limited, and that they reflect clearly identifiable features of students’ responses. The codes therefore represent classes of typical responses that may be distinguished from each other. In the next process, when the items are scored, all aspects other than the overall quality or correctness of the response are peeled off. The score can therefore be considered as not representing aspects of the responses as such, but rather as representing aspects of the ability that students have used to create their responses. At least this is the idea. However, as demonstrated in Figure 1.1, the score information at the single item level is still highly specific for the item. Furthermore, chapter 3 addressed more specifically the methods used in one of the papers: correspondence and homogeneity analysis. I have so far not seen any other analysis where these, or similar tools, are used to study the relationship between nominally measured cognitive variables. In that sense the work undertaken in this paper represents an innovative approach to the analysis of data from cognitive tests. The aim of this section in chapter 3 was to write about the methods at a level requiring very little mathematics. This was a conscious choice in order to make this part of the text available to a more diverse group of readers. One consequence of this is that interesting aspects of the methods are not commented on. Furthermore, since the language of mathematics is a useful tool that allows for very precise and unequivocal communication, another unfortunate consequence may be that the text is ambiguous, thus allowing misunderstandings to develop.
Nevertheless, writing for a wider audience has forced me to challenge my own understanding of the methods I have applied.
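The step from coded responses to scored items described in this summary can be illustrated with a minimal sketch. The coding convention (first digit 1 or 2 gives the credit, any other first digit gives no credit), the function name `score_from_code` and the list of codes are all hypothetical; they are assumptions made for illustration and are not taken from the marking guides or the data.

```python
from collections import Counter


def score_from_code(code: int) -> int:
    """Collapse a double digit code to a score.

    Assumed convention (hypothetical): a first digit of 1 or 2 is the
    credit awarded; any other first digit (e.g. 7 or 9) scores zero.
    The second digit, which labels the type of response, is discarded.
    """
    first_digit = code // 10
    return first_digit if first_digit in (1, 2) else 0


# Hypothetical codes given by markers to eight student responses on one item.
codes = [10, 11, 11, 70, 71, 72, 72, 99]

# Before scoring: a nominal variable with one category per type of response.
code_counts = Counter(codes)

# After scoring: only the credit survives; the response types are peeled off.
score_counts = Counter(score_from_code(c) for c in codes)

print("distinct codes: ", len(code_counts))
print("distinct scores:", len(score_counts))
```

In this sketch the three different incorrect response types (70, 71, 72) and the non-response (99) all end up with the same score, which is the kind of information loss the summary refers to.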