Assessment support

toolbox

Assessment analysis

After the test has been taken, it is highly recommended to run an analysis on the results, preferably before the grades are communicated. Based on the analysis you can:

Assess the overall student results

First of all you can have a look at and reflect upon the student results for the whole test. You can look at:
+  How many students passed/failed? Number/percentage?
+  What is the range and distribution of the grades? Frequencies?
+  What is the average grade?
+  Standard deviation/variance (this tells you how spread out the scores are around the mean).
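As a sketch, these overall statistics can be computed with a short script; the grades, scale and pass mark below are hypothetical:

```python
# Sketch: overall test statistics for a hypothetical set of grades.
from statistics import mean, stdev

grades = [4.5, 5.8, 6.1, 6.7, 7.0, 7.2, 7.5, 8.1, 8.6, 9.0]  # assumed 1-10 scale
pass_mark = 5.5                                               # assumed pass mark

passed = sum(1 for g in grades if g >= pass_mark)
print(f"passed: {passed}/{len(grades)} ({100 * passed / len(grades):.0f}%)")
print(f"range: {min(grades)} - {max(grades)}")
print(f"average: {mean(grades):.2f}")
print(f"standard deviation: {stdev(grades):.2f}")
```

For the distribution you could additionally bin the grades (e.g. with `collections.Counter` on rounded grades) to see the frequencies.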

To reflect: are these results as expected? Can you account for and explain the results?

Test analysis on item level

The next step is to look at the values for each of the items (questions). For open questions, this can be done in a qualitative, holistic way and/or in a quantitative way. Based on the conclusions, measures can be taken for scoring and grading and for evaluative purposes.

Qualitative way: during the grading process you get a notion of the common mistakes that were made. It may be necessary to adapt your answering and scoring model (check the already scored tests again!). You can make notes about these common mistakes and anything that stands out. If more assessors are involved, you can ask all assessors to do the same and afterwards compare and discuss the notes.

Quantitative way: calculate the psychometric values for all items, interpret the values, draw conclusions and, if necessary, take action.
The data provide signals; you still have to check what really happened. For instance, if hardly any student chose the right answer to an MC question,
it may have been a very difficult or unclear question, more than one answer may have turned out to be correct, or perhaps the wrong key (the letter indicating the correct answer) was simply entered in the system by mistake.
When digital systems such as Remindo or Contest are used, you get these data automatically.

Quantitative test item analysis

For a psychometric analysis, you can look at the following values:

  • P-values - Item difficulty

    The P-value gives an indication of the item difficulty. It shows the percentage of students who answered an item correctly.
    A low P-value (e.g. below 0.30) indicates that many students found this a difficult item; a high P-value (e.g. above 0.90) indicates that almost all students found this question easy to answer.

    For open questions
    The formula for calculating it is:   P-value = average score of the class / maximum possible score
    Example: the maximum score for question A was 20 points. Students on average got 14 points (sum of all scores / number of participants).
    P-value = 14 / 20 = 0.70 
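    As a sketch of this calculation, with hypothetical individual scores that average to 14:

```python
# Sketch: P-value for an open question = average score of the class / maximum score.
scores = [12, 14, 16, 15, 13]  # hypothetical scores of 5 students on question A
max_score = 20

p_value = (sum(scores) / len(scores)) / max_score
print(round(p_value, 2))  # 0.7
```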

    Optimal P-value for open questions
    Theoretically a question that discriminates optimally, has a P-value of 0.50. It is not too easy and not too difficult. In practice you want to construct a test with questions that are a bit easier (e.g. 0.80 or higher), questions which will be a bit more difficult (e.g. below 0.40), and most questions will be in between and in the range around 0.50.  The exact distribution is difficult to indicate, but could be 20-60-20, for example.

    For closed questions
    The formula for calculating now is:  P-value = total who answered the item correctly / total number of participants
    Example: 30 students gave a correct answer. Total number of students: 50.
    P-value = 30/50 = 0.60 

    Optimal P-value for closed questions
    Students may guess and you want to take that into account. Optimal P-value: slightly higher than midway between chance (1 divided by the number of answer options) and a perfect score (1) for the item. Formula: (1 + (1 / nr. alt.)) / 2 
    Example: take an MC item with 5 alternatives, scored 0 or 1. Guessing chance = 1/5 = 0.20 (20%). Optimal P-value: (1 + 0.20) / 2 = 0.60.
    Optimal values:  2 answer options: 0.75    ||     3 answer options: 0.67     ||   4 answer options: 0.62
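    Both formulas can be sketched in a few lines (the example numbers are taken from the text above):

```python
# Sketch: closed-question P-value and its optimal value, correcting for guessing.
def p_value(correct: int, participants: int) -> float:
    """P-value = number who answered correctly / number of participants."""
    return correct / participants

def optimal_p(n_alternatives: int) -> float:
    """Halfway between guessing chance (1/n) and a perfect score (1)."""
    return (1 + 1 / n_alternatives) / 2

print(p_value(30, 50))         # 0.6
print(optimal_p(5))            # 0.6
print(round(optimal_p(3), 2))  # 0.67
```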


  • Distractor values

    This applies only to closed questions.
    When working with closed questions, besides looking at the P-value it is also interesting to look at the A-values: the values for the alternatives, i.e. the incorrect answers. The alternatives are also known as “distractors”. You can execute a distractor analysis.

    You look at the number and percentage of students who chose each of the distractors.
    A-value =  total who choose a certain distractor / total of participants taking the test 

    The distractors should be plausible but incorrect. How to interpret the values? Look at the outliers.
    Look at very low percentages (e.g. 2%). Such an alternative is not a good distractor: it needs revision or replacement for next time (otherwise it makes the question easier).
    Look at values higher than the P-value. This can indicate that something is wrong with the question. E.g. are the answers mutually exclusive? Was the item miskeyed? Or it points to a misconception shared by many students (evaluative information).
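    A distractor analysis along these lines can be sketched as follows; the option labels, answer counts and the 5% cut-off are hypothetical:

```python
# Sketch: A-value (distractor) analysis for one MC item with hypothetical counts.
answer_counts = {"A": 30, "B": 12, "C": 7, "D": 1}  # choices of 50 students
key = "A"                                           # correct alternative
total = sum(answer_counts.values())
key_value = answer_counts[key] / total              # this is the item's P-value

for option, count in answer_counts.items():
    value = count / total
    flag = ""
    if option != key and value < 0.05:              # assumed cut-off for "implausible"
        flag = "  <- hardly chosen: revise or replace this distractor"
    elif option != key and value > key_value:
        flag = "  <- chosen more often than the key: check the question"
    label = "P-value (key)" if option == key else "A-value"
    print(f"{option}: {label} = {value:.2f}{flag}")
```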

  • Rit / Rir values - Item discrimination

    Item-total correlation (Rit)  [a.k.a. Point Biserial correlation]. 

    The Rit-value provides an indication of the correlation between the item and the total score. This item discrimination index shows the extent to which students with high overall test results also got a certain question correct, which is what you would expect. A high Rit-value indicates this.
    It becomes interesting when this is not the case: when students who did not perform well in general answered a certain question correctly, while those who did do well on the test overall chose an incorrect answer. That is a signal that something may be wrong with this question.

    The Rit ranges from -1.00 to 1.00.
    As a general rule, the value should be higher than 0.20.

    + ) Positive (high) indicates that those scoring high on the total exam answered a test item correctly more frequently than low-scoring students.
    - ) Negative indicates that low-scoring students on the total test did better on a test item than high-scoring students.

    The Rir-value is generally used when a test does not have many questions. It has the same function, but it does not include the item itself in the total score against which the item is correlated (item-rest correlation).

    For those interested, this is the Rit (point-biserial) formula:   Rit = ((Mp - Mq) / Sx) × √(p × q), where Mp is the mean total score of the students who answered the item correctly, Mq the mean total score of those who answered it incorrectly, Sx the standard deviation of the total scores, p the proportion who answered correctly and q = 1 - p.
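    As a sketch, with hypothetical item scores (0/1) and total scores for eight students, the point-biserial (Rit) calculation looks like this:

```python
# Sketch: Rit (point-biserial) for one dichotomous item, hypothetical data.
from statistics import mean, pstdev

item   = [1, 1, 1, 0, 0, 1, 0, 1]           # 0/1 score per student on this item
totals = [34, 30, 28, 18, 20, 31, 22, 27]   # total test score per student

p = mean(item)                               # proportion who answered correctly
q = 1 - p
m_p = mean(t for t, i in zip(totals, item) if i == 1)  # mean total, correct group
m_q = mean(t for t, i in zip(totals, item) if i == 0)  # mean total, incorrect group

rit = (m_p - m_q) / pstdev(totals) * (p * q) ** 0.5
print(round(rit, 2))
```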

    For those who want to learn more, the following sites are informative:  
    Use point-biserial to discriminate high and low performers | GradeHub 
    > To see how it is calculated: Item Statistics for Classroom Assessments Part 2: Computing P-Values and Point-Biserial Correlations - Educational Data Systems (eddata.com) 

  • Cronbach's alpha

    Cronbach's alpha is a measure of internal consistency. It shows the extent to which the questions of a test provide consistent information regarding the students’ mastery of the course content. 
    It takes a lot of calculation, but if you administer the test via Remindo or Contest, you will get the value automatically.

    The range is from 0 to 1.     As a rule of thumb:  < 0.70  bad / moderate    ||     > 0.80  sufficient / good

    The value is influenced by the homogeneity/heterogeneity of the items (topics) and of the group, and by the number of participants and items.
    It is not reliable for small or very heterogeneous groups, for tests with just a few questions, or for content that is not expected to be coherent (for instance two subject areas combined in one test).
    So yes, it can be used to get an impression of the test as a whole, but be careful with the interpretation.
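    For those who want to see the calculation itself, Cronbach's alpha (α = k/(k−1) × (1 − Σ item variances / variance of total scores)) can be sketched with a small, hypothetical score matrix:

```python
# Sketch: Cronbach's alpha from a students-by-items score matrix (hypothetical data).
from statistics import pvariance

# rows = students, columns = items (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

k = len(scores[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*scores)]  # variance of each item
total_var = pvariance([sum(row) for row in scores])   # variance of total scores

alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

Note that with this few students and items the value is only illustrative; as stated above, alpha is not reliable for very small groups or short tests.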

    For those who want to read more about this:
    > In this guide the formula is explained: An Instructor’s Guide to Understanding Test Reliability. Craig S. Wells &  James A. Wollack, 2003. 

Useful tools and resources

When you make use of digital test systems, such as Remindo, you will get the psychometric data automatically, which helps a lot. If this is not the case, you can for instance create your own Excel (SPSS or R) file or make use of existing template files for these data. Below are some useful resources.