At the end of Dan McCarthy’s blog post, “How to Evaluate a Training Program”, in which he explains his pre-post, survey approach to applying the Kirkpatrick four levels of training evaluation, he asks: Has anyone used a system like this, or something better? What do you think, is it worth the bother? Since McCarthy asked, here is my response.
Yes. I’ve used a similar “system” of evaluation many times in my career as a program evaluator. However, I don’t recommend this approach any longer for most situations. While better than nothing, for many training programs (as well as for coaching, mentoring, simulations, self-directed learning, etc.), this approach does not produce the information needed to continuously improve performance and achieve business results. There are at least six reasons for this.
First of all, there is low correlation among the four levels (reaction, learning, behavior, and results). This takes nothing away from the contribution that the Kirkpatrick model has made to the field over the past 50 years. I hate to think where employee training would be today if we hadn’t been guided by Donald Kirkpatrick’s thinking. However, just because learners liked the training doesn’t mean that they learned anything new, or will apply what they learned, or will contribute to achieving business results. Knowing one level does not predict outcomes at the other levels.
Second, self-report surveys produce unreliable performance data. What people say they did or will do is often not what actually happens (I know this is not shocking news). This is not to say that we shouldn’t ask; it just means that we have to interpret those findings very cautiously.
Third, it shouldn’t be a decision to evaluate or not to evaluate. Good training programs measure impact; they hold learners (and their organizations) accountable. That’s part of the learning process. And some form of evaluation should always be done because it reinforces learners’ learning and contributes to performance improvement.
Fourth, the outcomes of leadership training cannot be fully anticipated. To construct a useful survey, we have to be able to anticipate what will be learned, how it will be applied, and what difference it will make so that we know what questions to ask. However, especially with leadership training, that kind of prediction is nearly impossible. The only way to discover all of the outcomes is through direct observation or in-depth interviews.
Fifth, if we apply the Kirkpatrick model strictly, then we’re not asking if the training program is the right thing to be doing in the first place. Shouldn’t we ask that question first, before we ask about reactions, learning, behavior, and results?
Sixth, performance improvement is never the result of training alone. Many organizational factors (e.g., manager support) determine whether impact from that learning will occur. If, as McCarthy suggests, we only compare what people knew before training to what they know and do after training, we still will not know what to do to achieve better results.
In most learning interventions, especially for leadership and management training, Robert O. Brinkerhoff’s Success Case Method is a more useful approach to evaluation. I have written about this method in previous blog posts.
Thank you, Dan McCarthy, for asking.
Kirkpatrick has for decades been the only game in town in the evaluation of training, although he is hardly known in education. In his early Techniques for evaluating training programmes (1959) and Evaluating training programmes: The four levels (1994), he proposed an approach to the evaluation of training that became a de facto standard. It is a simple and sensible schema, but has it stood the test of time?
Level 1 Reaction
At the reaction level, one asks learners, usually through ‘happy sheets’, to comment on the adequacy of the training, the approach and perceived relevance. The goal at this stage is simply to identify glaring problems. It is not to determine whether the training worked.
Level 2 Learning
The learning level is more formal, requiring a pre- and post-test. This allows you to identify those who had existing knowledge, as well as those at the end who missed key learning points. It is designed to determine whether the learners actually acquired the identified knowledge and skills.
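As a rough illustration of the pre- and post-test comparison, here is a minimal sketch. The learner names, the scores and the two thresholds (`prior_knowledge` and `pass_mark`) are hypothetical values chosen for the example; they are not part of Kirkpatrick's model.

```python
from dataclasses import dataclass


@dataclass
class TestScores:
    learner: str
    pre: float   # pre-test score on a 0-100 scale
    post: float  # post-test score on a 0-100 scale


def summarise(scores, prior_knowledge=80.0, pass_mark=70.0):
    """Return per-learner gains and the two groups a pre/post test can reveal.

    prior_knowledge and pass_mark are illustrative thresholds, not standards.
    """
    # Gain is simply post-test minus pre-test for each learner.
    gains = {s.learner: s.post - s.pre for s in scores}
    # Learners who scored high before training already had the knowledge.
    already_knew = [s.learner for s in scores if s.pre >= prior_knowledge]
    # Learners below the pass mark afterwards missed key learning points.
    missed = [s.learner for s in scores if s.post < pass_mark]
    return gains, already_knew, missed


# Hypothetical data for three learners.
scores = [
    TestScores("A", pre=40, post=85),
    TestScores("B", pre=90, post=95),
    TestScores("C", pre=35, post=55),
]
gains, already_knew, missed = summarise(scores)
print(gains)
print(already_knew)
print(missed)
```

A small gain is ambiguous on its own: learner B barely improves but already knew the material, which is why the pre-test matters.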
Level 3 Behaviour
At the behavioural level, you measure the transfer of the learning to the job. This may need a mix of questionnaires and interviews with the learners, their peers and their managers. Observation of the trainee on the job is also often necessary. It can include an immediate evaluation after the training and a follow-up after a couple of months.
Level 4 Results
The results level looks at improvement in the organisation. This can take the form of a return on investment (ROI) evaluation. The costs, benefits and payback period are fully evaluated in relation to the training deliverables.
J. J. Phillips has argued for the addition of a separate, fifth, “Return on Investment (ROI)” level, which is essentially about comparing the fourth level of the standard model to the overall costs of training. However, ROI need not be a separate level, as it can be included in Level 4. Kaufman has argued that it is merely another internal measure and that if there were a fifth level it should be external validation from clients, customers and society.
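To make the ROI comparison concrete, here is a minimal sketch of the usual calculation: net benefits divided by costs, expressed as a percentage, plus a simple payback period. The figures are invented for illustration, and the hard part in practice (putting a credible monetary value on the benefits of training) is assumed away here.

```python
def roi_percent(benefits, costs):
    """ROI as a percentage: net benefits relative to the cost of the training."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) * 100 / costs


def payback_months(costs, monthly_benefit):
    """Months needed for the benefits to cover the cost of the training."""
    if monthly_benefit <= 0:
        raise ValueError("monthly_benefit must be positive")
    return costs / monthly_benefit


# Hypothetical figures: a programme costing 50,000 that is estimated
# to yield 80,000 in benefits over its first year.
costs = 50_000
annual_benefits = 80_000

roi = roi_percent(annual_benefits, costs)
payback = payback_months(costs, annual_benefits / 12)
print(f"ROI: {roi:.0f}%")
print(f"Payback period: {payback:.1f} months")
```

The arithmetic is trivial; the result is only as good as the benefit estimate that goes into it.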
Level 1 - keep 'em happy
Traci Sitzmann’s meta-studies (68,245 trainees, 354 research reports) ask ‘Do satisfied students learn more than dissatisfied students?’ and ‘Are self-assessments of knowledge accurate?’ Self-assessment is only moderately related to learning. Self-assessment captures motivation and satisfaction, not actual knowledge levels. She recommends that self-assessments should NOT be included in course evaluations and should NOT be used as a substitute for objective learning measures.
Favourable reactions on happy sheets do not guarantee that the learners have learnt anything, so one has to be careful with these results. This data merely measures opinion.
Learners can be happy and stupid. One can express satisfaction with a learning experience yet still have failed to learn. For example, learners may have enjoyed the experience just because the trainer told good jokes and kept them amused. Conversely, learning can occur and job performance improve, even though the participants thought the training was a waste of time. Learners often learn under duress, through failure or through experiences which, although difficult at the time, prove to be useful later.
Happy sheet data is often flawed as it is neither sampled nor representative. In fact, it is often a skewed sample from those that have pens, are prompted, liked or disliked the experience. In any case it is too often applied after the damage has been done. The data is gathered but by that time the cost has been incurred. More focus on evaluation prior to delivery, during analysis and design, is more likely to eliminate inefficiencies in learning.
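Whether satisfied learners actually learnt more can be checked against one's own data rather than assumed. The sketch below computes a plain Pearson correlation between happy sheet ratings and post-test scores. The two lists are made-up numbers for six learners, used only to show the mechanics; they are not taken from Sitzmann's studies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: happy sheet rating (1-5) and post-test score (0-100).
satisfaction = [5, 4, 5, 3, 4, 2]
post_test = [60, 75, 55, 80, 65, 70]

r = pearson(satisfaction, post_test)
print(f"Correlation between satisfaction and post-test score: {r:.2f}")
```

With a handful of self-selected respondents the coefficient means little, which is the sampling problem described above.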
Level 2 - Testing, testing
Level 2 recommends measuring the difference between pre- and post-test results, but pre-tests are often ignored. In addition, end-point testing is often crude, usually testing the learner’s short-term memory. With no adequate reinforcement and push into long-term memory, most of the knowledge will be forgotten, even if the learner did pass the post-test.
Tests are often primitive and narrow, testing knowledge and facts, not real understanding and performance. Again, Level 2 is inappropriate for informal learning.
Level 3 - Good behaviour
At this level the transfer of learning to actual performance is measured. Many people can perform tasks without being able to articulate the rules they follow. Conversely, many people can articulate a set of rules well, but perform poorly at putting them into practice. This suggests that ultimately, Level 3 data should take precedence over Level 2 data. However, this is complicated, time-consuming and expensive and often requires the buy-in of line managers with no training background, as well as their time and effort. In practice it is highly relevant but usually ignored.
Level 4 - Does the business
The ultimate justification for spending money on training should be its impact on the business. Measuring training in relation to business outcomes is exceedingly difficult. However, the difficulty of the task should, perhaps, not discourage efforts in this direction. In practice Level 4 is often ignored in favour of counting courses, attendance and pass marks.
Kirkpatrick is the first to admit that there is no research or scientific background to his theory. This is not quite true, as it is clearly steeped in the behaviourism that was current when it was written. It is summative, ignores context and ignores methods of delivery. Some therefore think Kirkpatrick asks all the wrong questions: the task is to create the motivation and context for good learning and knowledge sharing, not to treat learning as an auditable commodity. It is also totally inappropriate for informal learning.
Senior managers rarely want all four levels of data. They want more convincing business arguments. It's the training community that tells senior management that they need Kirkpatrick, not the other way round. In this sense it is over-engineered. The four linear levels are too much. All the evidence shows that Levels 3 and 4 are rarely attempted, as all of the effort and resource focuses on the easier to collect Levels 1 and 2. Some therefore argue that it is not necessary to do all four levels. Given the time and resources needed, and demand from the organisation for relevant data, it is surely better to go straight to Level 4. In practice, Level 4 is rarely reached as fear, disinterest, time, cost, disruption and low skills in statistics militate against this type of analysis.
The Kirkpatrick model can therefore be seen as often irrelevant, costly, long-winded, and statistically weak. It rarely involves sampling, and both the collection and analysis of the data are crude and often not significant. As an over-engineered, 50-year-old theory, it is badly in need of an overhaul (and not just by adding another Level).
Evaluation should be done externally. The rewards to internal evaluators for producing a favourable evaluation report vastly outweigh the rewards for producing an unfavourable report. There are also lots of shorter, sharper and more relevant approaches: Brinkerhoff’s Success Case Method, Daniel Stufflebeam's CIPP Model, Robert Stake's Responsive Evaluation, Kaufman's Five Levels of Evaluation, CIRO (Context, Input, Reaction, Outcome), PERT (Program Evaluation and Review Technique), Alkin's UCLA Model, Provus's Discrepancy Model and Eisner's Connoisseurship Evaluation Model. However, Kirkpatrick is too deeply embedded in the culture of training, a culture that tends to get stuck with theories that are often 50 years, or more, old.
Evaluation is all about decisions. So it makes sense to customise to decisions and decision makers. And if one asks ‘To what problem is evaluation a solution?’ one may find that it may be costs, low productivity, staff retention, customer dissatisfaction and so on. In a sense Kirkpatrick may stop relevant evaluation.
Kirkpatrick’s four levels of evaluation have soldiered on for over 50 years because, like much training theory, the model is the result of strong marketing, now by his son James Kirkpatrick, and has become fossilised in ‘train the trainer’ courses. It has no real researched or empirical background, is over-engineered and linear, and focuses too much on less relevant Level 1 and 2 data, drawing effort away from the more relevant Level 4.
Kirkpatrick, D. (1959). Techniques for evaluating training programmes.
Kirkpatrick, D. (1994). Evaluating training programmes: The four levels.
Kirkpatrick, D. and Kirkpatrick, J. D. (2006). Evaluating Training Programs (3rd ed.). San Francisco, CA: Berrett-Koehler Publishers.
Phillips, J. (1996). How much is the training worth? Training and Development, 50(4), 20-24.
Kaufman, R. (1996). Strategic Thinking: A Guide to Identifying and Solving Problems. Arlington, VA & Washington, D.C.: Jointly published by the American Society for Training & Development and the International Society for Performance Improvement.
Kaufman, R. (2000). Mega Planning: Practical Tools for Organizational Success. Thousand Oaks, CA: Sage Publications.
Sitzmann, T., Brown, K. G., Casper, W. J., Ely, K., & Zimmerman, R. (2008). A review and meta-analysis of the nomological network of trainee reactions. Journal of Applied Psychology, 93, 280-295.
Sitzmann, T., Ely, K., Brown, K. G., & Bauer, K. N. (in press). Self-assessment of knowledge: An affective or cognitive learning measure? Academy of Management Learning and Education.