Theme: Standard setting/psychometric analysis


How does information provided to judges during the modified Angoff process influence their standard setting decisions? Looking into the black box - a pilot qualitative study of standard setting in a UK medical school.

*Fowell SL, Fewtrell R, Coulby C, Jayne J, Jha V.

The Angoff method is the primary method used for standard setting final written examinations in UK medical schools (1) and centres on judges’ estimations of the probability that a borderline candidate will correctly answer questions. Despite its popularity, there is debate regarding the feedback and type of information that should be given to judges to help them make informed decisions during the process.(2)

Many schools implement a two-stage modified Angoff with a discussion phase in which judges are able to reconsider questions and amend their ratings, supported by the provision of various forms of feedback. Meta-analysis of quantitative studies indicates that providing item analysis leads to lower probability estimates.(3) There are, however, concerns that judges may simply adjust their ratings to follow the feedback data.(4) Such studies look at the consequences of providing judges with feedback data, but do not provide any insight into how the data may influence the judges’ thinking.
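
To make the arithmetic behind the method concrete, the sketch below (illustrative only, with invented ratings; not taken from the study) shows how a modified Angoff cut score is typically derived from judges’ per-item probability estimates and simply recomputed after the discussion phase.

```python
# Minimal sketch of a two-round modified Angoff calculation (illustrative only).
# Each judge estimates the probability that a borderline candidate answers each
# item correctly; the cut score is the mean of these estimates across judges and items.

def angoff_cut_score(ratings):
    """ratings: list of per-judge lists, each holding one probability per item."""
    n_judges = len(ratings)
    n_items = len(ratings[0])
    # Item-level expected borderline performance: average across judges.
    item_means = [sum(judge[i] for judge in ratings) / n_judges for i in range(n_items)]
    # Test-level cut score as a percentage of the maximum mark.
    return 100 * sum(item_means) / n_items

# Round 1 ratings for three judges on four items (invented values).
round1 = [
    [0.6, 0.8, 0.4, 0.7],
    [0.5, 0.9, 0.5, 0.6],
    [0.7, 0.7, 0.3, 0.8],
]
print(f"Round 1 cut score: {angoff_cut_score(round1):.1f}%")

# After the discussion phase (with item-analysis feedback), judges may revise
# individual ratings; the cut score is simply recomputed on the revised matrix.
round2 = [
    [0.6, 0.8, 0.5, 0.7],
    [0.5, 0.8, 0.5, 0.6],
    [0.6, 0.7, 0.4, 0.7],
]
print(f"Round 2 cut score: {angoff_cut_score(round2):.1f}%")
```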

Qualitative research methods have been applied to investigate the thought processes of judges in standard setting. However, these are mainly within the US Elementary and High School system (5, 6) with few studies within medical education or higher education.(7, 8) The aim of this study was to utilise a qualitative approach to explore this aspect of standard setting in a UK undergraduate medical school. The discussion phase was observed and audio recorded and the transcript was thematically analysed by the research team.

In this paper, we will present the major emergent themes from the analysis and discuss the types of feedback provided. The implications of these findings, limitations of the approach and further areas of study will be addressed in order to stimulate wider discussion of this topic and the utility of this approach.

*Corresponding author Dr Susan Fowell, University of Liverpool, s.l.fowell@liverpool.ac.uk

 

Borderline Group Post-Test Standard-Setting of Student-Selected Special Study Units

*Curnow A, Hopcroft N, Rice N.

The University of Exeter Medical School provides an extensive range of student-selected options to each of its students throughout its medical curriculum. As a result, our students engage in self-selected inquiry-based learning in small groups, supported by a variety of providers in various locations (including clinical and scientific research environments), on eleven separate occasions during their studies with us. The rationale for, and the advantages and challenges of, having multiple markers assess the large quantity of diverse assignment content produced by these Special Study Units (SSUs) will be discussed, along with the assessment strategies we have successfully implemented to help mitigate the inherent issues encountered. These strategies include the adoption of a limited range of well-defined, categorical, theme-specific marking criteria; post-test standard-setting using the borderline group method to determine the pass mark for each iteration of themed SSUs delivered; and extensive assessor moderation and benchmarking processes.
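
As an illustration of the arithmetic behind the borderline group method (a minimal sketch with invented marks, not the School’s actual procedure), the pass mark is taken from the scores of candidates whom markers judge to be borderline:

```python
# Illustrative sketch of borderline group standard setting (assumed data only).
import statistics

# Each record: (assignment score, global marker judgement).
results = [
    (72, "clear pass"), (58, "borderline"), (81, "clear pass"),
    (55, "borderline"), (49, "fail"), (61, "borderline"),
    (66, "clear pass"), (52, "borderline"), (44, "fail"),
]

# The pass mark is set at the median score of the borderline group.
borderline_scores = [score for score, judgement in results if judgement == "borderline"]
pass_mark = statistics.median(borderline_scores)
print(f"Borderline group scores: {sorted(borderline_scores)}")
print(f"Pass mark for this SSU iteration: {pass_mark}")
```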

*Corresponding Author Dr Alison Curnow, University of Exeter Medical School, a.curnow@exeter.ac.uk

 

Predictive abilities of standard setters using the Angoff method

*Professor Brian Lunn, Newcastle University, School of Medical Education, brian.lunn@newcastle.ac.uk

A retrospective analysis was made of the predicted performance of borderline candidates by standard setters, compared with how a nominal ‘borderline group’ actually performed in SBA papers (including Common Content items). Candidates scoring within 2% of the examination cut-mark were identified for each assessment. Their performance on each question was compared to the prediction of standard setters for these questions.

Looking at ‘whole exam’ prediction, standard setters tended to underestimate the ability of candidates by 1-4%. When the facility of questions was analysed, it was noted that the higher the facility, the more standard setters underestimated candidate performance (range 19-29% difference), and the lower the facility, the more candidate performance was overestimated (range 24-43% difference). The closer to the cut-mark, the closer standard setters were to predicting performance.

These data suggest that standard setters’ assessments tend to cluster around a nominal point where the cut-mark has been historically, rather than predicting borderline candidate performance. There is no ‘gold standard’ for standard setting, but the Angoff model is often cited as the best of the options available. This and similar methodologies depend on the ability of standard setters to gauge candidate performance and are resource and time intensive. If the evidence suggests that the model does not hold to the theory, then perhaps it is time to consider whether this methodology is the best choice. This will be discussed in the context of the move to a National Medical Licensing Assessment and consideration of standard setting such an exam.
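
A minimal sketch of the comparison described above, using invented values rather than the Newcastle data: item facility within the nominal borderline group is set against the mean standard setter prediction for the same item.

```python
# Illustrative comparison of standard setter predictions with borderline group facility.
# All values are invented for demonstration; facility = proportion of the nominal
# borderline group (candidates within 2% of the cut-mark) answering the item correctly.

items = [
    # (item id, mean Angoff prediction, borderline group facility)
    ("Q01", 0.62, 0.85),   # easy item: performance underestimated
    ("Q02", 0.60, 0.58),   # facility near the historic cut-mark: prediction close
    ("Q03", 0.55, 0.31),   # hard item: performance overestimated
]

for item_id, predicted, observed in items:
    gap = observed - predicted
    direction = "underestimated" if gap > 0 else "overestimated"
    print(f"{item_id}: predicted {predicted:.2f}, observed {observed:.2f} "
          f"-> candidate performance {direction} by {abs(gap):.2f}")
```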

 

“How sure are you that I actually failed?” - Quantifying the precision of decisions on individuals in high stakes contexts

*Schauber S.K., Nouns Z.M.

The aim of this contribution is to illustrate how the precision of pass-fail decisions - and of similar classificatory decisions in general - can be determined at the individual level. Traditional reliability coefficients usually inform only on the replicability of the overall results (e.g. the rank order of test-takers); generally speaking, they say nothing about the precision of a specific classificatory decision on a particular student (e.g. mastery, grade, pass/fail). We therefore illustrate how to obtain an estimate of the precision of such classificatory decisions at the individual level within the framework of Item Response Theory. Based on real and simulated exam data, we show how these estimates can be calculated, reported, and used for decision making. Finally, we argue that whenever classificatory decisions are made, the use of traditional estimates of reliability to justify these verdicts should be discouraged.
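
To make the proposed individual-level estimate concrete, the following sketch works under simple assumptions (a normally distributed error around the ability estimate and a fixed pass cut-off on the latent scale; it is not the authors’ actual procedure): the probability that a candidate’s true ability falls below the cut score follows directly from the ability estimate and its conditional standard error.

```python
# Sketch: precision of an individual pass/fail decision under IRT (assumed normal error).
from math import erf, sqrt

def prob_true_ability_below_cut(theta_hat, se_theta, theta_cut):
    """P(true theta < theta_cut) given the ability estimate and its standard error."""
    z = (theta_cut - theta_hat) / se_theta
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF

# A candidate estimated just below a pass cut-off of theta = 0.0 (invented values).
theta_hat, se_theta, theta_cut = -0.15, 0.30, 0.0
p_fail_correct = prob_true_ability_below_cut(theta_hat, se_theta, theta_cut)
print(f"Observed decision: fail (theta_hat = {theta_hat})")
print(f"Probability the fail decision is correct: {p_fail_correct:.2f}")
print(f"Probability the candidate was misclassified: {1 - p_fail_correct:.2f}")
```

Even with a reasonably precise ability estimate, a candidate sitting just below the cut-off can carry a substantial misclassification probability, which is exactly the kind of individual-level information that an overall reliability coefficient cannot convey.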

*Corresponding Author Dr Stefan Schauber, Centre for Educational Measurement at the University of Oslo (CEMO), stefan.schauber@cemo.uio.no

 

The relation between relevance and psychometric properties

René A. Tio, MD, PhD, University Medical Centre Groningen, r.a.tio@umcg.nl

Introduction

In a multiple choice test we often force students to give an answer (number right scoring). It is, however, also possible to give them the opportunity to omit an answer (formula scoring). Students are expected to be more inclined to answer items they perceive as relevant, because authentic items engage their cognitive processes more readily. However, whether item relevance is related to item misfit remains unclear. We therefore investigated whether misfit items were less relevant under the two most common scoring methods, formula scoring and number right scoring.
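
For readers unfamiliar with the two scoring rules, a brief sketch of the standard definitions (illustrative only; the study’s own scoring scripts are not reproduced here): number right scoring counts only correct answers, while classical formula scoring subtracts 1/(k-1) of a mark for each wrong answer on k-option items, so that omitting an item is never worse in expectation than blind guessing.

```python
# Sketch of the two scoring rules discussed above (k-option single best answer items).

def number_right_score(correct, wrong, omitted):
    """Number right scoring: only correct answers count."""
    return correct

def formula_score(correct, wrong, omitted, k=4):
    """Classical formula scoring: each wrong answer costs 1/(k-1) of a mark."""
    return correct - wrong / (k - 1)

# A student answering 120 items correctly, 60 incorrectly, omitting 20 (k = 4 options).
print(number_right_score(120, 60, 20))   # 120
print(formula_score(120, 60, 20, k=4))   # 120 - 60/3 = 100.0
```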

Methods

A total of 295 students were divided into two groups in a 2x2 cross-over design. A sample of 200 previously used progress test questions was selected based on their p values. Response option analysis was used to identify misfit items. Furthermore, the items were classified according to their relevance in five categories: medical knowledge, degree of available knowledge, relation to practice, practical relevance and relation to the medical curriculum. We conducted t-tests to investigate whether fit and misfit items differed with regard to relevance.
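
A minimal sketch of the fit versus misfit comparison (with invented relevance ratings, not the study data), assuming relevance is expressed as a mean rating per item:

```python
# Sketch of the fit vs misfit relevance comparison (invented ratings for illustration).
from scipy import stats

# Mean relevance ratings (e.g. on a 1-5 scale) for items flagged as fitting or
# misfitting by the response option analysis.
fit_item_relevance = [4.1, 3.8, 4.4, 3.9, 4.2, 4.0, 3.7, 4.3]
misfit_item_relevance = [3.2, 3.6, 3.0, 3.4, 3.1, 3.5]

t_stat, p_value = stats.ttest_ind(fit_item_relevance, misfit_item_relevance)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```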

Results

The response option analysis showed that the majority of the dysfunctional items emerged in the formula scoring condition. The t-tests demonstrated that item relevance discriminated misfit items for one of the groups: within this group, the misfit items were less relevant than the fit items under formula scoring, whereas under number right scoring the misfit items were more relevant (t = 2.130, t = -2.368, p > 0.05, respectively). For the other group, there was no significant difference.

Conclusions

It seems that the ability of item relevance to discriminate misfit items is sample dependent. Furthermore, our findings suggest that the scoring method influences how misfit items behave with regard to item relevance.

 

Practical Considerations on the reliability analysis of small numbers assessments: the case of a first cohort assessment in a Family Medicine postgraduate course

*Hadjipapas A, Therapontos E, Kolokotroni O, Howard V.J.

We have recently developed a postgraduate course in Family Medicine. As is typical for newly developed courses, especially in the postgraduate setting, the initial cohorts are inherently small.

The first part of this course involved assessment of developing clinical competences, for which test accuracy and reliability had to be evaluated. However, the accurate estimation of reliability in such small-numbers assessments (for example by Cronbach’s alpha coefficient) is very questionable, a fact acknowledged by the GMC in their supplementary guidance (1). This is because, for small-numbers assessments, many of the parameters on which psychometric statistics are based (means, variances, covariances) can be strongly affected by extreme values.

Perhaps the most widely used measure of the reliability of an assessment is Cronbach’s alpha coefficient. However, alpha does not depend solely on the properties of a given test but also, crucially, on the variation of scores among the test takers who happen to take that particular test (2,3). We estimated very high alpha coefficients for both an applied knowledge written test (200 items) and a Clinical Skills Examination (12 stations). In both cases, the alpha estimates were inflated by the large variation in test taker performance, which in turn was caused by either outlying values or bimodal distributions (a split between better performing and worse performing test takers).

Alongside the alpha coefficient we also estimated the Standard Error of Measurement (SEM). Consistent with previous literature (3), we found SEM estimates to provide a more robust indication of test accuracy against test taker variability and outlying values. We discuss these findings and suggest that, in the case of small-numbers, first-cohort assessments, the SEM may still be valuable in evaluating test accuracy.
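
As a generic illustration of the calculations discussed above (invented dichotomous item data, not the course’s results): Cronbach’s alpha is driven by the ratio of summed item variance to total score variance, so a single outlying candidate can change it substantially, whereas the SEM, computed as SD x sqrt(1 - alpha), gives a direct estimate of score accuracy in marks.

```python
# Sketch: Cronbach's alpha and the SEM for a small cohort (invented dichotomous data).
import statistics

def cronbach_alpha(item_scores):
    """item_scores: list of per-candidate lists of item scores (rows = candidates)."""
    k = len(item_scores[0])                      # number of items
    totals = [sum(row) for row in item_scores]   # total score per candidate
    item_variances = [statistics.pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def sem(item_scores, alpha):
    """Standard error of measurement: SD of total scores times sqrt(1 - alpha)."""
    totals = [sum(row) for row in item_scores]
    return statistics.pstdev(totals) * (1 - alpha) ** 0.5

# Eight candidates, ten dichotomous items; one outlying low scorer inflates the
# spread of total scores and hence the alpha estimate.
cohort = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],   # outlier
]
a = cronbach_alpha(cohort)
print(f"alpha = {a:.2f}, SEM = {sem(cohort, a):.2f} marks")
```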

References

1. General Medical Council. Standards for curricula and assessment systems, supplementary guidance: Reliability issues in the assessment of small cohorts. http://www.gmc-uk.org/Reliability_issues_in_the_assessment_of_small_cohorts_0410.pdf_48904895.pdf


2. Harvill, L. M. (1991), Standard Error of Measurement. Educational Measurement: Issues and Practice, 10: 33–41. doi: 10.1111/j.1745-3992.1991.tb00195.x


3. Tighe, J., McManus, I.C., Dewhurst, N.G., Chis, L., and Mucklow, J. (2010). The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP(UK) examinations. BMC Med. Educ. 10, 40.


*Corresponding Author Dr Avgis Hadjipapas, University of Nicosia, Medical School, hadjipapas.a@unic.ac.cy