Friday, October 7, 2022

Unfinished Business: Causes and Values

2001

Ernest R. House

 

During the past several decades, two issues have strongly influenced much of what has happened in evaluation: the quantitative-qualitative dispute and the fact-value dichotomy. The first issue is familiar history by now, though its underpinnings, such as the shift in our conception of causation, are perhaps not well understood. The second issue, the fact-value dichotomy, concerns the nature of values. We have come to grips with it only recently, and it promises to be equally pivotal. Both issues are unfinished business, though in different ways.

 

The Quantitative-Qualitative Dispute

 

In the early days of professional evaluation, policymakers and evaluators put their faith in large-scale quantitative studies, such as Follow Through, Head Start, and the Income Maintenance experiment, to mention a few. They thought that these large national studies would yield definitive findings demonstrating which programs or policies worked best, findings that could serve as the basis for mandates by the central government.

These large studies proved extremely disappointing for the most part. One problem was their scale. During one data collection, the Follow Through evaluators collected twelve tons of data. They were overwhelmed by the logistics to the point where they could not produce timely reports. Eventually, the government sponsors reduced the study to a fraction of its original size by limiting the number of variables.

A more serious problem was that the findings of these studies proved to be equivocal. The studies did not produce clear-cut results that could be generalized, as had been expected. For example, when the Follow Through data were analyzed, the variance in test score outcomes across the dozen early childhood programs being compared was about as great as the variance within these programs. In other words, if a given early childhood program had been implemented at six sites, two of the sites might have good results, two sites might have mediocre results, and two sites might have poor results.

Choosing a particular early childhood program, then, was not an effective way of predicting test score outcomes. This was not the kind of evaluative conclusion on which the government could base national recommendations. Policymakers and evaluators became disenchanted with large-scale studies because of their cost, their time scale, and their lack of decisive results.

Meanwhile, evaluators were developing alternative approaches, including qualitative studies, meta-analysis, and program theory. Small qualitative studies were practical. For example, if a school district wanted an evaluation of its early childhood education program, interviewing administrators, teachers, and students was simple and cheap, and the findings were easy to understand, even if they could not be published in the journals of the time. Furthermore, generalizability was not the problem it was for large national studies. The demand on the local study was that the results be true for this place at this time, not for sites all over the country.

However, many evaluators did not consider qualitative studies to be scientific. Members of the evaluation community engaged in intense debates over the scientific legitimacy of qualitative methods. This dispute preoccupied the profession for twenty years, even as qualitative studies became increasingly popular. After many words and much rancor, the field finally accepted the idea that evaluation studies could be conducted in a number of different ways (Reichardt and Rallis, 1994). Evaluation became methodologically ecumenical, even if personal sensitivities lingered. The quantitative-qualitative dispute seems to be largely history by now.

Another alternative to large-scale quantitative studies was meta-analysis (Glass, 1976). Meta-analysis was more readily accepted by methodologists, though not without controversy. (Eysenck, 1978, called it “mega-silliness.”) In some ways meta-analysis was a natural successor to large-scale quantitative studies. Meta-analysis assembles scores of small experimental studies, that is, studies with control groups, and combines their findings quantitatively by focusing on the differences between the performances of the experimental and control groups. The technique is more radical than it sounds, since researchers might combine outcomes that are quite different in kind into the summary scores.

For example, in the first application, Smith and Glass (1977) compared different approaches to psychotherapy. At the time, the efficacy of psychotherapy itself was being questioned. Smith and Glass demonstrated with meta-analysis that different approaches to psychotherapy were effective, and about equally effective. In conducting the meta-analysis, they added together quite different outcomes. For example, they combined attitude questionnaire responses with counts of patient behaviors.
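What makes it possible to combine such disparate outcomes is standardization. As a rough sketch of the arithmetic involved (my rendering of the standard effect-size logic, not a formula taken from the text), each study contributes an effect expressed in standard deviation units, and the effects are then averaged across the k studies:

$$\Delta_i = \frac{\bar{X}_{E,i} - \bar{X}_{C,i}}{s_i}, \qquad \bar{\Delta} = \frac{1}{k}\sum_{i=1}^{k} \Delta_i$$

Here $\bar{X}_{E,i}$ and $\bar{X}_{C,i}$ are the experimental and control group means in study $i$, and $s_i$ is a standard deviation (the control group's, in Glass's original formulation). Because each difference is expressed in standard deviation units, questionnaire scores and behavior counts can be placed on a common scale before being averaged.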

Meta-analysis became overwhelmingly popular in social and medical research to the point where it is difficult to pick up a major research journal without finding a few meta-analytic studies in it. In fact, medical researchers like the technique so much they sometimes claim credit for inventing it. Part of this popularity is due to meta-analysis being successful where single quantitative studies were not, such as in detecting treatment effects for mild hypertension. The technique combines the results from many studies, and since these individual studies are conducted in different settings and circumstances, they contain considerable variation, which seems to give the findings more generalizability when they are added together (Cook, 1993).

A third alternative to large-scale studies was program theory (Chen and Rossi, 1987). Program theory takes many forms but essentially consists of constructing a model of the program on which the evaluation can be based. Earlier, some researchers had advocated basing evaluations on grand social theories, but such attempts failed. First, there were no social theories that seemed to have the explanatory power or credibility of physical theories. Second, even if such theories existed, could they be used to evaluate social programs? For example, given the task of evaluating automobiles, would evaluators use theories of physics to do the job? It seems unlikely.

Evaluators reconsidered and transformed the grand theory idea into developing theories for individual programs, in other words, constructing a model of the program. This substitution worked better. The program formulation is concrete enough to guide the evaluation study, and it speaks directly to program participants. Program theory can guide the evaluation by delineating places where the evaluator might seek data to confirm whether particular components of the program are working. It enables evaluators to eliminate rival hypotheses and make causal attributions more easily (Lipsey, 1993; Davidson, 2000).
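To make the idea concrete, consider a purely hypothetical program theory for an early childhood program (my illustration, not an example from the text). The model might posit a chain of presumed causal links:

$$\text{resources} \rightarrow \text{teacher training} \rightarrow \text{classroom practices} \rightarrow \text{student engagement} \rightarrow \text{test score gains}$$

The evaluator can then collect data at each link. If, say, classroom practices show little change, weak outcomes can be attributed to implementation rather than to the program's underlying theory, which is one way program theory helps eliminate rival hypotheses.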

Underlying qualitative studies, meta-analysis, program theory, and other developments have been changes in our conception of causation. These changes have been subtle and have perhaps passed unnoticed. They suggest why the alternatives have worked better than the large-scale studies that preceded them.

 

Changing Conceptions of Causation

 

The conception of causation that we inherited is called the regularity or Humean theory of causation, named after David Hume’s influential analysis of cause (House, 1991). Regularity describes the conception. Put simply, the reason we know one event caused another is that the first event regularly took place before the second—regularity of succession. If one event occurred and another event repeatedly occurred after it, we would have reason to believe the events would occur together again. So succession of events is what we are after. In fact, Hume said that is all there is to causation, along with contiguity of the events. The research task is to determine the succession of events. Put succinctly, “If p, then q; p, therefore q.”

This notion of cause is the underlying basis for most of the discussion of experimental design over the past decades. It is manifest in one of the early evaluation books, written by Edward Suchman: “One may formulate an evaluation project in terms of a series of hypotheses which state that ‘Activities A, B, C will produce results X, Y, Z’” (Suchman, 1967, p. 93). In other words, if we have a program A under circumstances B and C, it will produce results X, Y, and Z. Furthermore, the perfect design for determining whether the result has occurred is the classic randomized control group design. No error could result from employing this design, according to Suchman.

Although this assertion sounds reasonable, it falls apart on inspection. If we go back to the Follow Through experiment, the same early childhood program was implemented at six different sites, but it produced different outcomes at those sites. Why? Because social causation is more complex than the regularity conception suggests. Even with the same program, there are different teachers at different sites who produce different results. We might try to control for the teachers, but there are so many variables that influence or might influence the outcomes that the researcher cannot control for all of them. Put another way, the program is not in and of itself an integrated causal mechanism. Parts of the program might interact with elements in the environment to produce quite different effects.

Such considerations led Cronbach to give up on aptitude-treatment interaction research altogether. He had been trying to determine which characteristics of students affected outcomes, that is, how student characteristics interacted with treatments to affect outcomes. But there were so many possibilities that could not be controlled that he gave up. Put more technically, the effects of the secondary interactions of the variables were consistently as strong as the main effects.

Cronbach (1982) looked into the nature of causation and devised a more complex formulation: “In S, all (ABC or DEF or JKL) are followed by P.” In other words, in this particular setting S, the outcome P may be produced by ABC or DEF or JKL. The problem for evaluators is that if A is the program, we get P only if conditions B and C are also present. So we could have A and not have the outcome P. More confounding, since P is caused by the DEF and JKL combinations as well, we might not have the program A but still get P anyhow. Neither the presence nor the absence of A, the program, determines P. Succession of events is not a definitive test of cause and effect. The classic control group design will not produce definitive conclusions if causation is this complex.
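Written schematically (my rendering, not Cronbach's own notation), the formulation amounts to:

$$\text{In } S:\quad (A \wedge B \wedge C)\ \vee\ (D \wedge E \wedge F)\ \vee\ (J \wedge K \wedge L)\ \Rightarrow\ P$$

On this reading, A, the program, is neither sufficient for P (conditions B and C must also be present) nor necessary for P (the DEF and JKL combinations produce P as well), which is exactly why observing or failing to observe P tells us little about A by itself.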

Even so, we could devise a determinate research design using Cronbach’s formulation, albeit a very expensive and complex one. However, social causation is more complex than even Cronbach’s formulation indicates. Cronbach based his analysis on Mackie (1974), a seminal work on causation. Mackie’s original formulation was this: “All F (A…B… or D…H… or …) are P…” The dots represent missing causal factors we don’t know about. We have huge gaps in our knowledge of social events, gaps we don’t know about, and gaps we don’t even know we don’t know about. We can never fill these gaps in, so we can never be certain of all that is involved.

I won’t extend the causal analysis further, except to say that we now understand that social causation is more complex than we thought back in the old days. The analysis remains incomplete, unfinished business for the field. Why, then, do qualitative studies, meta-analysis, and program theory seem to work better than the large-scale studies of the past?

Each approach takes account of a more complex social reality by framing the program and the study more precisely, albeit in different ways. Qualitative studies show the interaction of people and events with other causal factors in context, which limits the causal possibilities and alternatives one must contend with (Maxwell, 1996). Meta-analysis uses individual studies, each of which occurred in separate circumstances of rich variation, which makes generalization more possible (Cook, 1993). Program theory delineates the domain investigated, which makes the questions evaluators pose more precise, relevant, and testable (Lipsey, 1993).

Recent books by Pawson and Tilley (1997) and Mark, Henry, and Julnes (2000) deal with causation, mostly by advancing realist conceptions somewhat similar to the one I have employed here. There is some agreement between the books and also significant disagreement, as on the utility of experimental studies. Developing a more complex notion of causation remains unfinished business, though we have made a start (cf. Rogers et al., 2000).

 

The Fact-Value Dichotomy

 

A second issue that has shaped development in the field is the fact-value dichotomy. This influence has been subtle and pernicious. The dichotomy is the belief that facts refer to one thing and values refer to something totally different. The fact-value dichotomy is a particularly difficult problem for evaluation since values lie at the heart of evaluation. I doubt anything in the field has caused more trouble than this belief.

The distinction between facts and values has been around for many decades, but it came down to us in the evaluation community through the positivists and their influence on social science. The logical positivists thought that facts could be ascertained and that only facts were the fit subject of science, along with analytic statements like “one plus one equals two” that were true by definition. Facts were empirical and could be based on pristine observations, a position called foundationalism.

On the other hand, values were something else. Values might be feelings, emotions, or useless metaphysical entities. Whatever they were, they were not subject to scientific analysis. People simply held certain values or believed in certain values or did not. Values were chosen. Rational discussion had little to do with them. The role of the scientist was to determine facts. Someone else, politicians perhaps, could worry about values.

Donald Campbell, one of the great founders of the evaluation field, accepted the fact-value dichotomy (Campbell, 1982). However, he did not accept foundationalism about facts. Counter to the positivists, he contended that there were no pristine observations on which factual claims could be based because all observations were influenced by the theories and preconceptions that people held. Knowledge was still possible: although you could not compare a fact to a pristine observation to see whether the fact was true, you could compare it to the body of knowledge it related to. The fact should fit the whole body of beliefs. Occasionally, the body of knowledge had to be changed to accommodate the fact. In any case, you were comparing a belief to a body of beliefs, not a belief to pure observation. This non-foundationalism was counter to the positivist view.

On the other hand, Campbell explicitly accepted the positivist conception of values. Values could not be determined rationally; they had to be chosen. He thought it was not the evaluator’s job to choose values. Once values were determined by politicians, sponsors, or program developers, evaluators could examine the outcomes of programs and policies with criteria based on those values. Practically speaking, this meant that evaluators could not evaluate the goals of programs, since the goals were closely connected to values. Evaluators had little choice but to accept program and policy goals.

I believe Campbell had the correct idea about facts but not about values. We can deal with both facts and values rationally. Facts and values are not separate kinds of entities altogether, though they sometimes appear that way. Facts and values (factual claims and value claims) blend together in the conclusions of evaluation studies and, indeed, blend together throughout evaluation studies. We might conceive facts and values schematically as lying on a continuum like this:

 

Brute Facts_______________________Bare Values

 

What we call facts and values are fact and value claims, which are sometimes expressed as fact and value statements. They are beliefs about the world. Sometimes these beliefs look as if they are strictly factual without any value aspect built in, such as, “Diamonds are harder than steel.” This statement may be true or false, and it fits at the left end of the continuum. There is little individual preference or taste built into it.

A statement like “Cabernet is better than chardonnay” fits better at the right end of the continuum. It is suffused with personal taste. But what about a statement like, “Follow Through is a good educational program”? This statement contains both fact and value aspects. The evaluative claim is based on criteria from which the conclusion is drawn, and it must be based on factual claims as well. The statement fits towards the middle of the continuum, a blend of factual and value claims. Most evaluative conclusions fall towards the center of the continuum as blends of facts and values.

Context makes a huge difference in how a statement functions. A statement like, “George Washington was the first president of the United States,” looks like a factual (historical) claim. But if I am engaged in a discussion with a group of feminists who are pointing out the racist and patriarchal origins of the country, this statement becomes evaluative as well in this particular context. The statement can be factual and evaluative simultaneously. Similarly, claims that might seem factual in another context might become evaluative in the context of an evaluation.

Such evaluative claims are subject to rational analysis in the way we ordinarily understand rational analysis. First, the claims can be true or false: Follow Through may or may not be a good educational program. Second, we can collect evidence for and against the truth or falsity of the claim, as indeed we do in evaluation studies. Third, the evidence can be biased or unbiased, good or bad. Finally, the procedures for assessing the evidence, for determining which data are likely to be biased or unbiased, are established by the discipline.

Of course, some claims are not easy to determine. In some situations, it may not be possible to determine the truth or falsity of the claims. Also, in addition to traditional techniques, we may need new procedures to help us collect and assess the validity of claims that blend fact and value. Just as we have developed sophisticated procedures for testing factual claims over the years, we might develop procedures for collecting and processing claims that contain strong value aspects, so that our evaluative conclusions are unbiased regarding these claims as well. In practice, such claims blend together in evaluation studies.

Elsewhere, we have suggested three general principles we might follow in arriving at unbiased claims (House and Howe, 1999). The principles are inclusion of all relevant stakeholder perspectives, values, and interests in the study; extensive dialogue between the evaluator and stakeholders, and sometimes among the stakeholders themselves; and extensive deliberation to reach valid conclusions in the study. We call this approach deliberative democratic evaluation.

This analysis of facts and values is quite different from the fact-value dichotomy. In the old view, to the extent evaluative conclusions were value based, they were outside the purview of the evaluator. In the new view, values are subject to rational analysis by the evaluator and others. Values are evaluations.

 

References

 

Campbell, D. (1982). Experiments as arguments. In E. R. House, S. Mathison, J. A. Pearsol, & H. Preskill (Eds.). Evaluation Studies Review Annual, 7, 117-128.

Chen, H. & Rossi, P. H. (1987). The theory-driven approach to validity. Evaluation and Program Planning, 10, 95-103.

Cook, T. D. (1993). A quasi-sampling theory of the generalization of causal relationships. In L. B. Sechrest & A. G. Scott (Eds.), Understanding causes and generalizing about them. New Directions for Program Evaluation, no. 57, 39-82.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.

Davidson, E. J. (2000). Ascertaining causation in theory-based evaluation. In P. J. Rogers, T. A. Hacsi, A. Petrosino, & T. A. Huebner (Eds.), Program theory in evaluation: Challenges and opportunities. New Directions for Evaluation, no. 87, 17-26.

Eysenck, H. J. (1978). An exercise in mega-silliness. In T. D. Cook, M. L. Del Rosario, K. M. Hennigan, M. M. Mark, and W. M. K. Trochim (Eds.).  Evaluation Studies Review Annual, Vol. 3, 697.

Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

House, E. R. & Howe, K. R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.

House, E. R. (1991). Realism in research. Educational Researcher, 20(6), 2-9.

Lipsey, M. W. (1993). Theory as method: Small theories of treatments. In L. B. Sechrest & A. G. Scott (Eds.), Understanding causes and generalizing about them. New Directions for Program Evaluation, no. 57, 5-38.

Mackie, J. L. (1974). The cement of the universe. Oxford: Clarendon Press.

Mark, M. M., Henry, G. T., & Julnes, G. (2000). Evaluation: An integrated framework. San Francisco: Jossey-Bass.

Maxwell, J. A. (1996). Using qualitative research to develop causal explanations. Working Papers, Harvard Project on Schooling and Children. Cambridge, MA.

Pawson, R. & Tilley, N. (1997). Realistic evaluation. London: Sage.

Reichardt, C. S. & Rallis, S. F. (Eds.). (1994). The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, no. 61. San Francisco: Jossey-Bass.

Rogers, P. J., Hacsi, T. A., Petrosino, A., & Huebner, T. A. (Eds.). (2000). Program theory in evaluation: Challenges and opportunities. New Directions for Evaluation, no. 87.

Smith, M. L. & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-760.

Suchman, E. A. (1967). Evaluative research. New York: Russell Sage.

 

 
