2001
Unfinished Business: Causes and Values
Ernest R. House
During the past several decades, two issues have strongly influenced much of what has happened in evaluation: the quantitative-qualitative dispute and the fact-value dichotomy. The first is familiar history by now, though the underpinnings of the dispute, such as the shift in our conception of causation, are perhaps not well understood. The second issue, the fact-value dichotomy, concerns the nature of values. We have come to grips with this issue only recently, and it promises to be equally pivotal. Both issues are unfinished business, though in different ways.
The Quantitative-Qualitative Dispute
In the early days of professional evaluation, policy
makers and evaluators put their faith in large-scale quantitative studies, like
Follow Through, Head Start, and the Income Maintenance experiment, to mention a
few. Policymakers and evaluators thought that these large national studies
would yield definitive findings that would demonstrate which programs or
policies worked best. The findings could serve as the basis for mandates by the
central government.
These large studies proved extremely disappointing
for the most part. One problem was their scale. During one data collection, the
Follow Through evaluators collected twelve tons of data. They were overwhelmed
by the logistics to the point where they could not produce timely reports.
Eventually, the government sponsors reduced the study to a fraction of its
original size by limiting the number of variables.
A more serious problem was that the findings of
these studies proved to be equivocal. The studies did not produce clear-cut
results that could be generalized, as had been expected. For example, when the
Follow Through data were analyzed, the variance in test score outcomes across
the dozen early childhood programs being compared was about as great as the
variance within these programs. In other words, if a given early childhood
program had been implemented at six sites, two of the sites might have good
results, two sites might have mediocre results, and two sites might have poor
results.
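To see what such a finding means in statistical terms, consider a minimal sketch with invented numbers (not the actual Follow Through data). It compares the spread among the average scores of three hypothetical programs with the average spread among sites running the same program.

```python
# Minimal sketch with hypothetical site-level test scores (three programs,
# six sites each); these are invented numbers, not Follow Through data.
from statistics import mean, pvariance

sites_by_program = {
    "Program A": [38, 45, 50, 36, 48, 41],
    "Program B": [55, 43, 57, 44, 52, 49],
    "Program C": [60, 52, 63, 50, 58, 59],
}

program_means = {name: mean(scores) for name, scores in sites_by_program.items()}
between = pvariance(list(program_means.values()))  # spread of the program averages
within = mean(pvariance(scores) for scores in sites_by_program.values())  # average spread across sites

print(program_means)
print(f"between-program variance: {between:.1f}")
print(f"within-program variance:  {within:.1f}")
# When the within-program spread is about as large as the between-program
# spread, knowing which program a site adopted is a weak guide to the test
# scores that site will obtain.
```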
Choosing a particular early childhood program was
not effective in predicting the test score outcomes. This was not the kind of
evaluative conclusion the government could base national recommendations on.
Policymakers and evaluators became disenchanted with large-scale studies
because of their cost, time scale, and lack of decisive results.
Meanwhile, evaluators were developing alternative
approaches, including qualitative studies, meta-analysis, and program theory.
Small qualitative studies were practical. For example, if a school district
wanted an evaluation of its early childhood education program, interviewing
administrators, teachers, and students was simple and cheap, and the findings
were easy to understand, even if they could not be published in the journals of
the time. Furthermore, generalizability was not the problem it was for large
national studies. The demand on the local study was that the results be true
for this place at this time, not true for sites all over the country.
However, many evaluators did not consider
qualitative studies to be scientific. Members of the evaluation community
engaged in intense debates over the scientific legitimacy of qualitative
methods. This dispute preoccupied the profession for twenty years, even as
qualitative studies became increasingly popular. After many words and much
rancor, the field finally accepted the idea that evaluation studies could be
conducted in a number of different ways (Reichardt and Rallis, 1994). Evaluation
became methodologically ecumenical, even if personal sensitivities lingered.
The quantitative-qualitative dispute seems to be largely history by now.
Another alternative to large-scale quantitative studies
was meta-analysis (Glass, 1976). Meta-analysis was more readily accepted by
methodologists, though not without controversy. (Eysenck, 1978, called it
“mega-silliness.”) In some ways meta-analysis was a natural successor to
large-scale quantitative studies. Meta-analysis assembles scores of small
experimental studies, studies that have control groups, and combines the
findings of these studies quantitatively by focusing on the differences between
performances of the experimental and control groups. The technique is more
radical than it sounds since the researchers might combine outcomes that are
quite different in kind into the summary scores.
For example, in the first application, Smith and
Glass (1977) compared different approaches to psychotherapy. At the time the
efficacy of psychotherapy itself was being questioned. Smith and Glass
demonstrated with meta-analysis that different approaches to psychotherapy were
effective and about equally effective. In conducting the meta-analysis, the
researchers added together quite different outcomes, combining, for example,
attitude questionnaire responses with counts of patient behaviors.
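The basic computation behind such a synthesis can be sketched in a few lines. The sketch below is a simplified illustration with invented study results, not the Smith and Glass analysis itself: each study's treatment-control difference is standardized by the control group's standard deviation (Glass's effect size), so that outcomes measured on very different scales can be averaged on a common metric.

```python
# Simplified meta-analysis sketch with invented study results.
# Each study contributes an effect size: (treatment mean - control mean) / control SD,
# which puts different outcome measures on a common scale before averaging.
from statistics import mean

# (treatment mean, control mean, control SD) for four hypothetical studies.
studies = [
    (78.0, 70.0, 12.0),  # attitude questionnaire scores
    (6.5, 5.0, 2.0),     # counts of desirable patient behaviors
    (55.0, 50.0, 10.0),  # therapist rating scale
    (31.0, 27.0, 8.0),   # self-report inventory
]

effect_sizes = [(treat - control) / sd for treat, control, sd in studies]

print([round(d, 2) for d in effect_sizes])           # per-study standardized effects
print("mean effect size:", round(mean(effect_sizes), 2))
```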
Meta-analysis became overwhelmingly popular in
social and medical research to the point where it is difficult to pick up a major
research journal without finding a few meta-analytic studies in it. In fact,
medical researchers like the technique so much they sometimes claim credit for
inventing it. Part of this popularity is due to meta-analysis being successful
where single quantitative studies were not, such as in detecting treatment
effects for mild hypertension. The technique combines the results from many
studies, and since these individual studies are conducted in different settings
and circumstances, they contain considerable variation, which seems to give the
findings more generalizability when they are added together (Cook, 1993).
A third alternative to large-scale studies was
program theory (Chen and Rossi, 1987). Program theory takes many forms but
essentially consists of constructing a model of the program on which to base
the evaluation. Earlier, some researchers had advocated basing
evaluations on grand social theories, but such attempts failed. First, there
were no social theories that seemed to have the explanatory power or
credibility of physical theories. Second, even if such theories existed, could
they be used to evaluate social programs? For example, given the task of
evaluating automobiles, would evaluators use theories of physics to do the job?
It seems unlikely.
Evaluators reconsidered and transformed the grand
theory idea into developing theories for individual programs, in other words,
constructing a model of the program. This substitution worked better. The
program formulation is concrete enough to give guidance to the evaluation
study, and it communicates directly with program participants. Program theory
can guide the evaluation by delineating places where the evaluator might seek
data to confirm whether particular components of the program are working. It
enables evaluators to eliminate rival hypotheses and make causal attributions
more easily (Lipsey, 1993; Davidson, 2000).
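As a rough illustration of what such a model can look like, the sketch below lays out a hypothetical tutoring program as a chain of components, each paired with the evidence an evaluator might seek at that point. The component names and data sources are invented for the example; they are not drawn from any particular published program theory.

```python
# Generic sketch of a program theory (logic model) as a data structure.
# The components and data sources are invented for illustration; each link
# names a place where the evaluator might seek confirming or disconfirming data.
from dataclasses import dataclass

@dataclass
class Component:
    name: str             # step in the presumed causal chain
    expected_change: str  # what should be observable if this step is working
    data_source: str      # where the evaluator would look for evidence

tutoring_program_theory = [
    Component("Recruit and train tutors", "Tutors use the intended techniques", "training observations"),
    Component("Deliver weekly tutoring sessions", "Students attend regularly", "attendance logs"),
    Component("Increase reading practice", "More time on reading tasks each week", "classroom observations"),
    Component("Improve reading achievement", "Higher reading scores", "test records"),
]

for link in tutoring_program_theory:
    print(f"{link.name}: look for '{link.expected_change}' via {link.data_source}")
```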
Underlying qualitative studies, meta-analysis, program theory, and other
developments have been changes in our conception of causation. These changes
have been subtle and have perhaps passed unnoticed, but they suggest why the
alternatives have worked better than the large-scale studies that preceded
them.
Changing Conceptions of Causation
The conception of causation that we inherited is
called the regularity or Humean theory of causation, named after David Hume’s
influential analysis of cause (House, 1991). Regularity describes the
conception. Put simply, we know that one event caused another because the first
event regularly preceded the second: regularity of succession. If one event
occurred and another event repeatedly occurred after it, we would have reason
to believe the events would occur together again. So succession of events is what we are
after. In fact, Hume said that is all there is to causation, along with
contiguity of the events. The research task is to determine the succession of
events. Put succinctly, “If p, then q; p, therefore q.”
This notion of cause is the underlying basis for
most of the discussion of experimental design over the past decades. It is
manifest in one of the early evaluation books, written by Edward Suchman: “One
may formulate an evaluation project in terms of a series of hypotheses which
state that ‘Activities A, B, C will produce results X, Y, Z’” (Suchman, 1967, p.
93). In other words, if we have a program A under circumstances B and C, it
will produce results X, Y, and Z. Furthermore, the perfect design for
determining whether the result has occurred is the classic randomized control
group design. No error could result from employing this design, according to
Suchman.
Although this assertion sounds reasonable, it falls
apart on inspection. If we go back to the Follow Through experiment, we had the
same early childhood program at six different sites, but it produced different
outcomes at the sites. Why? Because social causation is more complex than the
regularity conception suggests. Even with the same program, there are different
teachers at different sites who produce different results. We might try to
control for the teachers, but so many variables influence or might influence
the outcomes that the researcher cannot control for all of them. Put
another way, the program is not in and of itself an integrated causal
mechanism. Parts of the program might interact with elements in the environment
to produce quite different effects.
Such considerations led Cronbach to give up on
aptitude-treatment interaction research altogether. He was trying to determine
which characteristics of students affected the outcomes of different
treatments, that is, how student characteristics and treatments interacted. But there were so many possibilities
that could not be controlled that he gave up. Put more technically, the effects
of the secondary interactions of the variables were consistently as strong as
the main effects.
Cronbach (1982) looked into the nature of causation
and devised a more complex formulation. “In S, all (ABC or DEF or JKL) are
followed by P.” In other words, in this particular setting, P, the outcome, may
be determined by ABC or DEF or JKL. The problem for evaluators is that if A is
the program, we only get P if conditions B and C are also present. So we could
have A and not have the outcome P. More confounding, since P is caused by DEF
and JKL combinations as well, we might not have the program A but still get P
anyhow. Neither the presence nor the absence of A, the program, determines P.
Succession of events is not a definitive test of cause and effect. The classic
control group design will not produce definitive conclusions if causation is
this complex.
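The difficulty can be made concrete with a toy truth-functional sketch of the formulation, offered purely as an illustration: A stands for the program, and the remaining letters stand for accompanying conditions in the setting.

```python
# Toy sketch of "In S, all (ABC or DEF or JKL) are followed by P."
# A is the program; the other letters are accompanying causal conditions.
def outcome_p(a, b, c, d, e, f, j, k, l):
    return (a and b and c) or (d and e and f) or (j and k and l)

# Program present, but supporting conditions B and C absent: no outcome P.
print(outcome_p(a=True, b=False, c=False,
                d=False, e=False, f=False,
                j=False, k=False, l=False))   # False

# Program absent, but the DEF combination present: P occurs anyway.
print(outcome_p(a=False, b=False, c=False,
                d=True, e=True, f=True,
                j=False, k=False, l=False))   # True
```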
Even
so, we could devise a determinate research design using Cronbach’s formulation,
albeit a very expensive and complex one. However, social causation is more
complex than even Cronbach’s formulation indicates. Cronbach based his analysis
on Mackie (1974), a seminal work on causation. Mackie’s original formulation
was this: “All F (A…B… or D…H… or …) are P…” The dots represent missing causal
factors we don’t know about. We have huge gaps in our knowledge of social
events, gaps we don’t know about, and gaps we don’t even know we don’t know
about. We can never fill in these gaps, so we can never be certain of all that
is involved.
I won’t extend the causal analysis further; it remains incomplete, unfinished
business for the field. Suffice it to say that we do understand that social
causation is more complex than we thought back in the old days. Why do
qualitative studies, meta-analysis, and program theory seem to
work better than large-scale studies of the past?
Each approach takes account of a more complex social
reality by framing the program and the study more precisely, albeit in
different ways. Qualitative studies show the interaction of people and events
with other causal factors in context, which limits the causal possibilities and
alternatives one must contend with (Maxwell, 1996). Meta-analysis uses
individual studies, each of which occurred in separate circumstances of rich
variation, which makes generalization more possible (Cook, 1993). Program
theory delineates the domain investigated, which makes the questions evaluators
pose more precise, relevant, and testable (Lipsey, 1993).
Recent books by Pawson and Tilley (1997) and Mark,
Henry, and Julnes (2000) deal with causation, mostly by advancing realist
conceptions, somewhat similar to the conception I have employed here. There is
some agreement between the books and also significant disagreement, for
example, over the utility of experimental studies. Developing a more complex
notion of causation remains unfinished business, though we have made a start
(cf. Rogers et al., 2000).
The Fact-Value Dichotomy
A second issue that has shaped development in the field is the fact-value dichotomy. This influence has been subtle and pernicious. The dichotomy is the belief that facts refer to one thing and values refer to something totally different. The fact-value dichotomy is a particularly difficult problem for evaluation since values lie at the heart of evaluation. I doubt anything in the field has caused more trouble than this belief.
The distinction between facts and values has been
around for many decades, but it came down to us in the evaluation community
through the positivists and their influence on social science. The logical
positivists thought that facts could be ascertained and that only facts were
the fit subject of science, along with analytic statements like “one plus one
equals two” that were true by definition. Facts were empirical and could be
based on pristine observations, a position called foundationalism.
On the other hand, values were something else.
Values might be feelings, emotions, or useless metaphysical entities. Whatever
they were, they were not subject to scientific analysis. People simply held
certain values or believed in certain values or did not. Values were chosen.
Rational discussion had little to do with them. The role of the scientist was
to determine facts. Someone else, politicians perhaps, could worry about
values.
Donald Campbell, one of the great founders of the
evaluation field, accepted the fact-value dichotomy (Campbell, 1982). However,
he did not accept foundationalism about facts. Counter to the positivists, he
contended that there were no pristine observations on which factual claims
could be based because all observations were influenced by theories and
preconceptions that people held. Knowledge was still possible because although
you could not compare a fact to a pristine observation to see if the fact was
true, what you could do was to compare a fact to the body of knowledge it
related to. The fact should fit the whole body of beliefs. Occasionally, the
body of knowledge had to be changed to accommodate the fact. In any case, you
were comparing a belief to a body of beliefs, not a belief to pure observation.
This non-foundationalism was counter to the positivist view.
On the other hand, Campbell explicitly accepted the
positivist conception of values. Values could not be determined rationally;
they had to be chosen. He thought it was not the evaluator’s job to choose
values. Once values were determined by politicians, sponsors, or program
developers, evaluators could examine the outcomes of programs and policies with
criteria based on those values. Practically speaking, this meant that
evaluators could not evaluate the goals of programs, since the goals were
closely connected to values. Evaluators had little choice but to accept program
and policy goals.
I believe Campbell had the correct idea about facts
but not about values. We can deal with both facts and values rationally. Facts
and values are not separate kinds of entities altogether, though they sometimes
appear that way. Facts and values (factual claims and value claims) blend
together in the conclusions of evaluation studies and, indeed, blend together
throughout evaluation studies. We might conceive facts and values schematically
as lying on a continuum like this:
Brute Facts_______________________Bare Values
What we call facts and values are fact and value
claims, which are sometimes expressed as fact and value statements. They are
beliefs about the world. Sometimes these beliefs look as if they are strictly
factual without any value aspect built in, such as, “Diamonds are harder than
steel.” This statement may be true or false, and it fits at the left end of the
continuum. There is little individual preference or taste built into it.
A statement like “Cabernet is better than
chardonnay” fits better at the right end of the continuum. It is suffused with
personal taste. But what about a statement like, “Follow Through is a good
educational program”? This statement contains both fact and value aspects. The
evaluative claim is based on criteria from which the conclusion is drawn, and
it must be based on factual claims as well. The statement fits towards the
middle of the continuum, a blend of factual and value claims. Most evaluative
conclusions fall towards the center of the continuum as blends of facts and
values.
Context makes a huge difference in how a statement functions. A statement like, “George Washington was the first president of the United States,” looks like a factual (historical) claim. But if I am engaged in a discussion with a group of feminists who are pointing out the racist and patriarchal origins of the country, this statement becomes evaluative as well in this particular context. The statement can be factual and evaluative simultaneously. Similarly, claims that might seem factual in another context might become evaluative in the context of an evaluation.
Such evaluative claims are subject to rational
analysis in the way we ordinarily understand rational analysis. First, the
claims can be true or false. Follow Through may or may not be a good
educational program. Second, we can collect evidence for and against the truth
or falsity of the claim, as indeed we do in evaluation studies. Third, the
evidence can be biased or unbiased, good or bad. Finally, the procedures for
assessing the evidence, including judgments about which data are likely to be
biased or unbiased, are determined by the discipline.
Of course, some claims are not easy to determine. In
some situations, it may not be possible to determine the truth or falsity of
the claims. Also, we may need new procedures, in addition to traditional
techniques, to help us collect and assess the validity of fact-value claims.
Just as we have developed sophisticated procedures for testing factual claims
over the years, we might develop procedures for collecting and processing
claims that contain strong value aspects, so that our evaluative conclusions
are unbiased regarding these claims as well. In practice, the two kinds of
claims blend together in evaluation studies.
Elsewhere, we have suggested three general
principles we might follow in arriving at unbiased claims (House and Howe, 1999).
The principles are inclusion
of all relevant stakeholder perspectives, values, and interests in the study;
extensive dialogue between the
evaluator and stakeholders, and sometimes among the stakeholders themselves;
and extensive deliberation to reach
valid conclusions in the study. We call this approach deliberative democratic
evaluation.
This analysis of facts and values is quite different
from the fact-value dichotomy. In the old view, to the extent evaluative
conclusions were value based, they were outside the purview of the evaluator.
In the new view, values are subject to rational analysis by the evaluator and
others. Values are evaluations.
References
Campbell, D. (1982).
Experiments as arguments. In E. R. House, S. Mathison, J. A. Pearsol, & H.
Preskill (Eds.). Evaluation Studies
Review Annual, 7, 117-128.
Chen, H. & Rossi, P. H.
(1987). Evaluating with sense: The theory-driven approach to validity. Evaluation Review, 7, 283-302.
Cook, T. D. (1993). A
quasi-sampling theory of the generalization of causal relationships. In L. B.
Sechrest & A. G. Scott (Eds.). Understanding Causes and Generalizing about
them. New Directions in Evaluation,
no. 57, 39-82.
Cronbach, L. J. (1982). Designing evaluations of educational and
social programs. San Francisco: Jossey-Bass.
Davidson, E. J. (2000).
Ascertaining causation in theory-based evaluation. In Rogers, P. J., Hacsi, T.
A., Petrosino, A., & Huebner, T. A. (Eds.), Program theory in evaluation: Challenges and opportunities. New
Directions in Evaluation, no. 87, 17-26.
Eysenck, H. J. (1978). An
exercise in mega-silliness. In T. D. Cook, M. L. Del Rosario, K. M. Hennigan,
M. M. Mark, and W. M. K. Trochim (Eds.).
Evaluation Studies Review Annual,
Vol. 3, 697.
Glass, G. V. (1976). Primary,
secondary, and meta-analysis of research. Educational
Researcher, 5, 3-8.
House, E. R. & Howe, K.
R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.
House, E. R. (1991). Realism in research. Educational
Researcher, 20, 6, 2-9.
Lipsey, M. W. (1993). Theory
as method: Small theories of treatments. In L. B. Sechrest & A. G. Scott
(Eds.). Understanding Causes and
Generalizing about them. New Directions in Evaluation, no. 57, 5-38.
Mackie, J. L. (1974). The cement of the universe. Oxford:
Clarendon Press.
Mark, M. M., Henry, G. T., & Julnes, G. (2000). Evaluation: An integrated
framework. San Francisco: Jossey-Bass.
Maxwell, J. A. (1996). Using
qualitative research to develop causal explanations. Working Papers, Harvard
Project on Schooling and Children. Cambridge, MA.
Pawson, R. & Tilley, N.
(1997). Realistic evaluation. London:
Sage.
Reichardt, C. S. &
Rallis, S. F. (1994). The
qualitative-quantitative debate: New perspectives. New Directions in
Program Evaluation, no. 61, San Francisco: Jossey-Bass.
Rogers, P. J., Hacsi, T. A.,
Petrosino, A., Huebner, T. A. (Eds.). (2000). Program theory in evaluation: Challenges and opportunities. New
Directions in Evaluation, no. 87.
Smith, M. L. & Glass, G.
V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-760.
Suchman, E. A. (1967). Evaluative research. New York: Russell
Sage.