Tuesday, October 11, 2022

Validity and the Relevance of Research to Practice

2005

Ernest R. House

How is it possible for research to be relevant to practice if the researcher cannot guarantee that event A will be followed by result B, the bedrock of the standard view? This issue has never been successfully resolved in the “standard” view of research, either in theory or practice. The usual attempt at resolution is to divide research findings into the internally valid and the externally valid. In the standard view the researcher discovers that a program has certain results in particular settings (internal validity) and then attempts to find similar settings in which the program will produce the same results (external validity). Like causes produce like effects. However, the research literature is full of cases in which programs did not have similar effects, and practitioners do not use much educational research.

As Glass has noted, "The internal validity of an experimental finding cannot be judged solely in terms of the properties of the experimental design.... Internal validity of a causal claim is a complex judgment that ultimately rests on a larger number of empirical facts that necessarily extend beyond the context of a single experiment. The external validity of an experimental finding rests similarly on judgment and a network of empirical facts.... the facts are more numerous and are less well established" (Glass, 1982). One cannot depend solely on the formal characteristics of the research design to draw inferences. One must have substantive knowledge. This is true both in drawing inferences from a study and in applying the study elsewhere. To address this problem, Cronbach (1982) has argued that external validity is more important than internal validity, contending that there are so many unexplored and unknown interactions between the program and the setting that participants in the implementing site must themselves extrapolate the research findings into their own settings by adjusting the findings to the circumstances they encounter, and only the practitioners themselves can do this. Researchers can help by conducting studies in ways that help participants extrapolate the research findings, but researchers cannot guarantee universal results for a particular program.

In a realist view, although patterns of events do not repeat themselves exactly, there are transfactual causal structures that influence events and that operate in different settings, even though their interactions with other causal mechanisms may not produce the same events from site to site. The realist would expect programs not to have the same effects in different sites and circumstances. However, transfactual entities can be causally efficacious across sites, though effects might be amplified or cancelled by other factors. Hence, a goal of research is to discover entities that tend to produce effects. For example, teachers are causal agents, and it is a commonplace that particular teachers make a tremendous difference not only in the classroom but also in the implementation of educational programs. Parents even act on this knowledge by putting their children into the classes of the best teachers, if they can. However, one can never be certain that a particular program or a particular teacher will produce good results. The standard view has carried with it the implicit image of program participants as compliant agents who follow administrative directives and whose own views and particularities make little difference, a view refuted strongly by many implementation studies that have been conducted (Fullan, 1982).

In the standard view, programs are conceived as discrete, reproducible activities that are sufficient to produce particular results under similar circumstances. However, we can conceive of programs themselves as more like events than like causal entities. The programs themselves are produced by such entities. We can evaluate programs on a particular site with considerable efficacy, even while not knowing exactly which important causal structures are at work in what interactions, but we cannot expect that same program to have the same effects on other sites with great confidence. We can expect a major causal entity, such as a good teacher, to be effective in other settings with similar students, but even this transfer is not certain. There are ways of dealing with this complex social reality. If all these interactions occur in somewhat unpredictable ways so that causal laws can never be more than statements of tendencies, we might try to average across these situations in some way to discern the tendencies, such as in meta-analysis studies (Glass, McGaw, and Smith, 1981). Meta-analysis does not depend on a single critical experiment but rather summarizes across many different studies in an attempt to discern general tendencies. A program that contains strong causal structures might be expected to produce effects on average. Meta-analysis has indeed yielded results when other methods of investigation have failed, such as in discerning the effects of reducing moderately high blood pressure on heart disease.
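The idea of averaging across studies to discern a tendency can be illustrated with a minimal sketch. It assumes a fixed-effect, inverse-variance weighted mean, which is one common convention and not necessarily the procedure used in the works cited; the function name and the site data are hypothetical and invented purely for illustration.

```python
from math import sqrt

def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance weighted) mean of study effect sizes.

    Returns the pooled effect and its standard error.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect, and at least one study")
    # More precise studies (smaller variance) receive more weight.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total_weight
    std_error = sqrt(1.0 / total_weight)
    return mean, std_error

# Hypothetical standardized effect sizes from five sites (invented data).
site_effects = [0.40, -0.10, 0.25, 0.05, 0.30]
site_variances = [0.04, 0.04, 0.04, 0.04, 0.04]

mean, se = pooled_effect(site_effects, site_variances)
print(f"pooled effect = {mean:.2f}, standard error = {se:.3f}")
```

In this invented data set no single site settles the question, and one site even shows a negative effect, yet the pooled estimate is positive. That is the sense in which a program containing strong causal structures might be expected to produce effects on average.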

There is a further extension of this reasoning. If teachers and other practitioners themselves are strong causal agents, able to affect the production of events dramatically, then their intentions and their knowledge are also important factors in good educational programs. A teacher's knowledge consists not only of subject matter but also of knowledge of concrete interactions of particular students in the classroom. Good teachers possess knowledge of what is likely to happen with particular students when certain activities occur, and in fact teachers may know that each single student may respond in a different way to certain classroom activities. That is, the teacher possesses specific causal knowledge built on inferences constructed from different sources over a period of time. This knowledge is focused on particular students and the concrete conditions of particular classrooms.

Being able to act effectively in such settings entails knowledge of concrete causal entities--what is likely to happen with these students, even with this particular student. So teachers can strongly influence particular students by specific activities based on knowledge of that student. Of course, the teachers may be wrong in the inferences drawn and the activities initiated. Improvement of the teachers’ causal inferences themselves, based on the particulars of the teachers’ students and classroom, would seem to be an important strategy for improving practices like education. Unfortunately, in the search for general laws, not much attention has been paid to directly improving the concrete inferences of particular teachers and practitioners. Given such a strong conception of agency from a realist point of view, the standard distinction between internal validity and external validity is inadequate. Rather we might think of the validity with which researchers draw conclusions from their studies, the validity with which practitioners draw conclusions from these studies to their own situation, and the validity with which teachers and other practitioners draw conclusions for themselves based upon their own experiences (House et al., 1989). A critical test for realism in evaluation and research is that it be realistic in practice as well.

Validity

Few concepts are as fundamental to educational research and evaluation as that of validity. Notions of validity are based on particular views of causation. The regularity theory of causation undergirds traditional conceptions of validity, but such a theory is inadequate for describing and explaining many of the causal inferences required in professional practice and everyday life. Teaching, for example, is a practice better understood in terms of a theory of intentional causation. This has significant implications for what counts as valid teacher knowledge.

An Example of Practical Knowledge: Learning to Teach

Let us imagine a situation in which a person is learning to teach for the first time, say, a woman instructor who is faced with teaching her first class. How does she proceed? A likely scenario is that as a student she has had teachers she thought were particularly effective or ineffective. She tries to remember what they did that worked with classes in which she was a student, as well as what didn't work very well. Based on her own experiences in the classroom as a student, she has notions of cause-and-effect relationships, of what works and what doesn't work. Some of these ideas may well be mistaken, but she holds them nonetheless.

From this repertoire of ideas and techniques, she selects notions around which to organize her class. How many of these considerations will there be? Ten? Fifty? No doubt it depends on the person and situation. There will be many. What they have in common is that most will be based on the new teacher's actual experience as a student participating in former classes. A student, after all, is a participant rather than merely an observer. But even all this is only preparatory to learning to teach--trying out these ideas in the classroom. As the new teacher begins to teach, the general considerations of how to act (should she be highly organized? authoritative? flexible? well-dressed?) give way to more specific considerations of exactly what to do (should she lecture? lead group discussions? show movies?).

The new teacher learns cause-and-effect relationships through direct participation, through participating first as a student and then as the teacher. This direct experience is gained mostly by performing and acting rather than by passively observing, and this direct personal experience is so intense and powerful that it shapes what the teacher will do and try to do throughout her career. After a few years, her learning rate will decline because she will feel that she has mastered her environment. Her teaching repertoire will be largely formed.

This direct firsthand experience of learning to teach is better explained by the intentional theory of causation than by the Humean regularity theory. The teacher has something in mind, tries it out, and judges its success or failure. The determination of whether the lesson works is based largely on firsthand experience, on performing, and through those experiences the teacher develops a personal set of cause-and-effect relationships about teaching. These are singular causal claims and not dependent on universal laws or regularities that assert universal correlations of events. Rather, they are based on personal experience. They well may be mistaken, but mostly they are not. The teacher can develop a reasonable set of cause-and-effect relationships to guide her through the day, just as most of us manage to drive our cars to work, feed ourselves, and conduct our daily affairs. All this is not ordinarily a problem, except perhaps when the car won't start, because causation is not always the problem that the skepticism of the Humean regularity theory suggests. Causal inference guides our lives every day, which is not to say that either our lives or the teacher's performance cannot be improved. It is to say that most of what we do is rational and makes good sense.

The teacher learns to teach not through observing her own actions as a spectator but through performing certain actions. The direct experience of acting is the basis for the cause-and-effect relationships she learns. She does not infer the essential cause-and-effect relationships from repetition or regularity or universal causal laws. One can ride as a passenger in a car and witness the passing scene, yet not be able to retrace the route that one has taken. If one is the driver of the car, however, there is intentionality to one's action, perhaps responsibility, which makes it highly likely that one has learned the route. Similarly, one can sit in hundreds of classes for 20 years and not learn how to teach, but one can learn to teach by performing the teacher role for only a few semesters. For better or worse, many cause-and-effect relationships seem directly discernible and form the basis of professional practice. Many cause-and-effect relationships cannot be discerned through personal performance.

If this is a reasonable account of how teachers learn to teach, then what can we say about how valid their knowledge is? Is the validity of the inferences of the teacher captured by our traditional notions of validity? Or must we look elsewhere for conceptions that more adequately represent the state of their knowledge, and consequently discover new ways of improving the validity of teacher inference? We address this issue by examining two powerful conceptions of validity, those by Cook and Campbell (1979) and Cronbach (1982).

Cook and Campbell's Formulation of Validity

The traditional conception of validity has been explicated by Cook and Campbell (1979) in their revision of Campbell and Stanley (1963) and Campbell (1957). Cook and Campbell pose four research questions with corresponding types of validity: statistical conclusion validity, internal validity, construct validity, and external validity. Cook and Campbell believe that a precondition for inferring a causal relationship between two variables is to establish that the two variables covary. For the most part, Cook and Campbell's discussion of statistical conclusion validity is limited to variability and sampling error in units.

The second question is whether one variable caused the other, whether the treatment really caused the outcomes. The truth of this relationship is called internal validity. At this stage of Cook and Campbell's formulation, neither the treatment variables nor the outcome variables have been given a name that might generalize.

To generalize, one must label the cause and the effect. In other words, the cause and the effect must be related to higher order constructs, and the researcher generalizes to higher order constructs from the research operations. Inferences are based on the fit between the operations and the conceptual definitions (Cook & Campbell, p. 38). In practice, construct validity usually involves specification of the outcome measures and what they represent. As Cook and Campbell note, proper labeling of the treatment is a critical and often overlooked problem of construct validity.

External validity, the fourth type of validity for Cook and Campbell, addresses the question of how the causal relationship generalizes to and across other persons and settings. When researchers want to generalize to a particular population, it is essential that their samples be representative in some way. When all four types of validity are taken together, they permit the final inference. In Cook and Campbell's judgment, internal validity is the most important concern of all. In applied settings, they believe that external validity and construct validity of the effect are also relatively important, but that internal validity is still the most important.

Cronbach's Formulation

By contrast, Cronbach (1982) divided the world of inquiry and action into two domains: the domain of admissible operations and the domain of application, using a notation system of his own devising, similar to the traditional X's and O's of experimental design. The first domain is the area of the study or investigation, and it is further subdivided into four components: units (U), treatments (T), observations (O), and the setting (S), or, collectively, UTOS.

However, UTOS represents only the stated plan for the study. Actual participants must be selected, actual treatment procedures applied, and actual observations made. The actual study is represented by utoS. The small letters indicate that one has sampled participants, treatments, and observations. The setting is ordinarily uncontrollable and does not warrant a small s, Cronbach believed.

What one finds in the utoS is only an imperfect manifestation of the original plans. The domain of admissible operations, UTOS, defines what is admissible to the study, that is, what range of U's, T's, and O's can be included. The reproducibility of the inference about UTOS from the actual utoS is what Cronbach calls internal validity. This inference is internal to the UTOS domain.

In Cronbach's formulation, there is also a second domain, the domain of application. Suppose that the school superintendent in Seattle hears about the evaluation of a Direct Instruction program in East Saint Louis, reads the evaluation report, and decides to do something similar in Seattle. But Direct Instruction has certain features that she doesn't like. The U, the T, and the O are different, not to mention the setting. Yet the superintendent does not disregard the evaluation findings. She makes some mental adjustments and arrives at her own conclusions, which are not exactly the same as those of the original study. Cronbach represents the new domain of application by *UTOS. This is the domain of the Seattle superintendent, and the leap made by her is from the sample of the original study to her own situation (utoS → *UTOS). Cronbach calls this “external” validity because it is external to the original UTOS domain of the study.

As formulated by Cronbach, internal validity (utoS → UTOS) is a matter of judging whether u, t, and o belong to the stated domain, and this is done by the investigator. It is the investigator's task to draw the proper inferences, given the stipulation of the domain. External validity for Cronbach is different, however. It requires an inference from the study to a domain outside the investigation (utoS → *UTOS), and it is not a matter of sampling. The rules of statistical inference do not apply. Cronbach calls this external inference an extrapolation, a projection of the information outside the range of the study. This inference requires substantive modifications in reasoning contrasting the similarity of the two situations or domains.

Note that the external validity of the inference (utoS → *UTOS) is not directly dependent on the internal validity of the inference (utoS → UTOS), as the inference is in Cook and Campbell's formulation. A conclusion may have good internal validity but may not extrapolate to the domain of application. And, more surprisingly, a conclusion may extrapolate externally without being first validated internally, contrary to the familiar idea that the external validity of a conclusion first must be established as internally valid.

It is apparent that there are important differences between these two conceptions of validity. For Cook and Campbell, statistical conclusion validity and internal validity lead to a conclusion that A caused B in a particular instance. Then, through construct validity and external validity, the researcher can generalize beyond the particular research operations to higher level causal relationships in other populations and settings (utoS → UTOS and possibly UTOS → *UTOS). The reasoning procedure is one of relatively strict inference from the study to a larger domain.

By contrast, Cronbach split the world into two domains. In the first, the researcher generalizes from the particular study to a larger domain, like Cook and Campbell (utoS → UTOS), and this is called internal validity. External validity consists of drawing conclusions from the particular study to a different domain altogether (utoS → *UTOS), which may not resemble the original study in important ways. This generalization may be done by someone other than the researcher, and this other person may make substantive adjustments to the conclusions of the study based on experience and knowledge of the domain of application.

These differences between the validity formulation of Cronbach and that of Cook and Campbell reflect deeper differences about how things happen in the world and, in fact, how the world is constituted.

The Regularity Theory of Causation

The orthodox theory of causation in the social sciences is called the regularity theory, as we have seen. The recurrent example used in illustrating this theory is that billiard ball A rolls across the table and strikes billiard ball B. At this point billiard ball A ceases to roll and billiard ball B starts to roll across the table. According to the orthodox view, when we witness this scene, we can never observe any causal connections between the two events. All we can see is the event of A's striking B followed by the event of B's moving. We observe only one event followed by another event and nothing more.

However, by observing repetitions of similar events we can infer eventually that these types of events are related causally to each other. The regularity and repetition of the events make causal inference possible. This is the paradigmatic example of the regularity theory, the deeply skeptical view of causal knowledge attributed to Hume, who believed that we never can observe causation directly: It can be inferred only by observing regular successions of events.

Although there are many versions of the regularity theory, three main principles are common to such theories (Searle, 1983). The first principle is that the causal nexus is not itself observable. We can observe regular sequences of events where one type of event is followed by another type of event, and we may infer from these sequences that the regularity is causal. We observe only temporal sequence, contiguity, and regularity.

Second, when a pair of events is identified as cause and effect, that pair must be a particular instance of some universal regularity. The universal regularity is usually referred to as a causal law. We may not know which particular law is entailed by the causal statement, but we know that such a law exists. We can discover a particular causal relationship without knowing the form of the universal law, but the law is there.

The third principle is that causal regularities are distinct from logical regularities. The aspects under which one event causes another must be logically independent aspects. For example, we would not say that something's being a triangle caused it to be three sided. Causal events must be logically independent of one another, according to the regularity theory.

In conscientiously explicating their conception of causation, Cook and Campbell embrace their own sophisticated version of the regularity theory. They believe that the causal nexus is not directly observable (1979, p. 10). They interpret particular pairs of cause-and-effect events as instances of universal regularities or laws. In their eight concluding statements about causality, they say, "In these, the term molar refers to causal laws stated in terms of large and often complex objects" (p. 32). Cook and Campbell see treatment outcome relationships revealed by experiments as instances of universal regularities or laws. More to the point, "We have a great deal of sympathy with the position that all aspects of research design test propositions of a general and universal nature" (p. 87). Causal laws are mentioned repeatedly in their summary discussion.

In identifying Cook and Campbell's (1979) view with the regularity theory of causation, I have oversimplified somewhat. In their analysis of causation, Cook and Campbell survey dozens of theories of causation, taking bits and pieces from several theorists, many of whose positions are not consistent with each other. Hence, their position is neither fully consistent nor fully representative of any one theory. I believe this characterization of their position is fundamentally accurate although insufficiently detailed to capture Cook and Campbell's eclecticism. Campbell later wrote to me and embraced a view of causation similar to the realist view, which I have quoted earlier.

Ultimately, Cook and Campbell's conception of validity is based on the regularity theory. "Causality may well be such a logical hodgepodge of nonentailing but useful clues in the diagnosis of dependably manipulable correlations on the basis of fragmentary and momentary perceptual evidence" (1979, p. 30, emphasis added). Their theory of causation is reflected in the four major validity questions that they believe any researcher must face. First, is there a relationship between the two variables? In other words, does a regularity exist? "Covariation is a necessary condition for inferring cause, and practicing scientists begin by asking of their data: 'Are the presumed independent and dependent variables related?'" (p. 37).

Second, given the covariation, is there a causal relationship between the two, that is, is the regularity causal? "Is there a causal relationship from variable A to variable B . . . ?" (p. 38). Third, if it is causal, what are the higher-order constructs that the research operations represent? "Researchers would like to be able to give their presumed cause and effect operations names" (p. 38). Fourth, how generalizable across persons, settings, and time is the causal regularity? In other words, given the relationship, where else can we find it repeated? Universal regularities will be repeated, if we can only find and describe where.

In Cook and Campbell's view, the researcher is to discover the underlying causal regularities and their range of application so that the treatment can be reproduced. Presumably, once these causal relationships are known, the treatments can be reproduced or replicated at will, or at least with a reasonable degree of probability.

Cronbach's Version of the Regularity Theory

Cronbach sees causation as more complex and less certain, and hence less useful, than do Cook and Campbell. However, Cronbach's conception of causation is still based on the regularity theory. According to Cronbach, there are so many interactions of treatments with units and observations, and so little is known about how events occur, that speaking in causal terms is not very useful. Social events are too complex to yield simple formulations.

Following Mackie (1974), Cronbach formulates a causal law this way: "In S, all (ABC or DEF or JKL) are followed by P" (Cronbach, 1982, p. 139), where the letters refer to kinds of events or situations or possibly to the absence of some objects or actions. Now ABC is sufficient for the P to occur but not necessary because P may be preceded by DEF or JKL just as well. In other words, P may occur without ABC.

On the other hand, ABC is sufficient for P to occur if all elements (A, B, and C) occur together, but not if only AB or AC or BC occurs alone. Yet the situation is even more complex than this. Mackie's original formulation of causal regularities is, "All F (A . . . B . . . or D . . . H . . . or . . . .) are P" (Mackie, 1974, p. 66), where the ellipses indicate missing events or conditions that affect the outcome, but which are not represented in the stated law and about which we know little. Such "elliptical" or "gappy" propositions represent the true state of our knowledge of social causation better than the statements of simple regularity, according to Mackie and Cronbach. Cronbach cites the occurrence of numerous strong interaction effects in educational research as evidence of these gappy propositions.

Now here is the problem Cronbach's formulation of causation poses for the researcher/evaluator. If event A is the treatment one implements in an educational program, the complexity of the causal relationships becomes apparent. The treatment A is neither necessary nor sufficient for the effect to occur. The treatment is only part of a larger package of events that may be followed by P. Furthermore, we are ignorant of what many of these events are, as represented by the ellipses. Hence, specifying a treatment in an experimental design may actually be misleading because it may lead one to believe that treatment A is either necessary or sufficient for P, the outcome, to occur when it is not. We may draw erroneous conclusions about A, the treatment. In other words, an experiment cannot provide a definitive test for the effectiveness of a program. Cronbach analyzes major field experiments to show that the researchers had to use knowledge gained outside the experiments to draw their conclusions.
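The logical structure of the law "In S, all (ABC or DEF or JKL) are followed by P" can be checked mechanically. The following is a minimal sketch, assuming the closed form of the law with no ellipses; the "gappy" version cannot be enumerated this way precisely because its missing conditions are unknown. All function names are hypothetical and chosen for illustration only.

```python
from itertools import product

FACTORS = "ABCDEFJKL"
# Each set is one conjunction that is jointly sufficient for P in setting S.
SUFFICIENT_SETS = [set("ABC"), set("DEF"), set("JKL")]

def outcome_p(present):
    """Return True if P follows, i.e. at least one full conjunction is present."""
    present = set(present)
    return any(conj <= present for conj in SUFFICIENT_SETS)

def all_situations():
    """Yield every combination of present/absent factors as a set."""
    for flags in product([False, True], repeat=len(FACTORS)):
        yield {f for f, on in zip(FACTORS, flags) if on}

def is_sufficient(factors):
    """True if P occurs in every situation that contains all the given factors."""
    factors = set(factors)
    return all(outcome_p(s) for s in all_situations() if factors <= s)

def is_necessary(factors):
    """True if P never occurs in a situation lacking any of the given factors."""
    factors = set(factors)
    return all(not outcome_p(s) for s in all_situations() if not factors <= s)

print("ABC sufficient:", is_sufficient("ABC"))
print("ABC necessary: ", is_necessary("ABC"))
print("A sufficient:  ", is_sufficient("A"))
print("A necessary:   ", is_necessary("A"))
```

Enumerating all 512 situations shows that the package ABC is sufficient but not necessary, while the single treatment A is neither. Even in this idealized, gap-free version, an experiment that manipulates A alone would observe P only when B and C happen to be present, and could observe P without A whenever DEF or JKL occurs.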

All is not lost, however. The gaps can be partially filled in by experience outside the causal statement. In addition, over a long period of time one might attempt to localize the missing causes through an extensive program of research, thus filling in the gaps, so to speak. However, this long time scale is hardly adequate for applied research, such as evaluation. Thus Cronbach contends that the proper concern of generalization in evaluation is not establishing the internal causal relationships (utoS → UTOS) but extrapolating to the external domain of application (utoS → *UTOS) and that this latter extrapolation must be done in part by the person applying the results of the study. In a sense, the gaps in the causal relationships will have to be supplied by the interpreter. The task can be made easier for the interpreter by the researcher asking the right questions in the study.

Given this more complex notion of social causation, Cronbach insists that the critical inference from an evaluation study is from the actual data of the study to the domain of application. Only through the knowledge and experience of the interpreter can the gaps be filled. Thus, external validity becomes more important than the internal validity of the original study. In summary, Cronbach's and Cook and Campbell's differing conceptions of validity depend significantly on their differing degrees of confidence in being able to discern causes.

Agency and Intentional Causation

The regularity theory does not exhaust the possibilities of how to construe causation. One objection to the regularity theory is that it is contrary to common sense and psychological research. In spite of Hume's analysis, we do not see billiard ball A stopping and billiard ball B continuing. We actually see billiard ball A striking B, causing it to move. In recent years, Hume's analysis of causation has come under increasing attack.

Searle (1983) has advanced another theory of causation based on realist presumptions, which he calls intentional causation. Here is Searle's primary example:
"I now want to call attention to the fact that there are certain sorts of very ordinary causal explanations having to do with human mental states, experiences, and actions that do not sit very comfortably with the orthodox account of causation. For example, suppose I am thirsty and I take a drink of water. If someone asks me why I took a drink of water, I know the answer without any further observation: I was thirsty. Furthermore, in this sort of case it seems that I know the truth of the counterfactual without any further observations or any appeal to general laws. I know that if I hadn't been thirsty then and there I would not have taken that very drink of water. Now when I claim to know the truth of a causal explanation and a causal counterfactual of this sort, is it because I know that there is a universal law correlating 'events' of the first type, my being thirsty, with events of the second type, my drinking, under some description or other? And when I said that my being thirsty caused me to drink the water, was it part of what I meant that there is a universal law? Am I committed to the existence of a law in virtue of the very meaning of the words I utter? Part of my difficulty in giving affirmative answers to these questions is that I am much more confident of the truth of my original causal statement and the corresponding causal counterfactual than I am about the existence of any universal regularities that would cover the case." (pp. 117-118)

Searle's example departs significantly from the regularity theory of causation. First, he knows the answer to the causal question and the truth of the corresponding counterfactual without any observations other than the experience of the event. He might indeed be wrong, but the justification for his claim doesn't depend on further observations. The experience is all he needs.

Second, his causal claim does not commit him to the existence of any causal laws. There might indeed be such laws, but his singular causal claim does not commit him to their existence. His knowledge of the truth of the counterfactual claim, that if he hadn't been thirsty he would not have taken the drink, is not derived from his knowledge of any such laws. Searle contends that the existence of a causal relation in a particular instance does not logically entail a universal correlation in similar instances.

Third, in Searle's example, the cause and effect are logically related to one another. That is, the notion of thirst, no matter how described, is inextricably connected to the notion of drinking, no matter how described. When we say, "My thirst caused me to drink," we are connecting logically related events. His desire to drink is logically related to his taking a drink, even though one caused the other. Searle's account of intentional causation contradicts all three principles of the regularity theory.

According to Searle's theory of intentional causation, in some cases we can actually experience causation directly. Suppose that in the classic billiard ball example, instead of being observers watching ball A, then ball B, we actually take the cue stick in our hands with the firm intention of making ball B go into the corner pocket, by means of hitting ball A with the cue stick and making it strike ball B. What we actually experience is our intention of doing so and our execution of the task. If we are successful, we actually make it happen, and we directly experience the causation of making ball B go into the corner pocket. We do not experience our intention, watch ball A, then ball B, then infer from the events that we have a case of causation. We do not observe two separate events, and we do not need a covering law to explain their correlation. Rather, in Searle's analysis, the causal nexus is internal to the experience itself: Part of the experience of acting is an awareness that the experience itself is causing the action.

In Searle's account, by manipulating things, by making things happen, we gradually discover other "by-means-of" relations and extend our notion of causation to events outside our direct intentional action. For example, a child who breaks a vase with a rock learns not only that he or she can break a vase but also that hard objects can break glass. Eventually, one is able to observe causal relations even when one is not the actor but only an observer. Our primitive notion of causation is extended to other situations; we extend our causal knowledge by accumulating experience. Other theories of causation, such as Searle's, open up new possibilities for validating knowledge.

Implications for Validity

Earlier I analyzed two major formulations of validity and suggested that these two views are based on a particular theory of causation, although differently conceived. I introduced another theory of causation and suggested that validity considerations might be different yet again if we accepted a theory like intentional causation. More specifically, the validity concerns might include the validity of the inferences that practitioners themselves draw from their own experiences because these inferences are primary influences on practice, as in the case of the beginning teacher. Where does this leave us with our overall conception of validity?

We have three distinct situations: (a) the researcher draws inferences from an evaluation study and expects the practitioner to apply them; (b) the practitioner draws inferences from an evaluation study but modifies those inferences based on his or her particular domain of application; and (c) the practitioner draws inferences based on his or her own experience and applies them in context. In each of these three situations the inferences might be wrong or, if you will, invalid. Ways of improving the validity of the inferences in the first situation are covered in the Cook and Campbell (1979) formulation and the traditional research literature. Cronbach has argued that the second situation is more important and that the validity considerations are significantly altered thereby. I think that the third situation is more important than the first two as far as the conduct and improvement of professional practices are concerned and that the validity concerns for practitioner inferences have been very much ignored.

The differences between causal inferences in formal research studies and practitioner inferences may be deep ones. Researchers usually express their findings as propositions, and it is the validity of propositions we test. By contrast, much of the knowledge of practitioners is tacit rather than propositional in form, elicited only when practitioners face a particular problem. In fact, practitioners often cannot state what they know in propositional form. Nonetheless, it is the validity of their causal knowledge that is critical for professional practices like teaching.

Accepting other theories of causation, like the intentional theory, does not mean that experimental design is useless. Rigorous designs may be useful for many purposes, including validating effects far removed from the practitioners' perspective, such as long-term effects of actions, or investigating controversial issues in which the practitioner’s knowledge might be biased. But accepting such a theory of causation means that correct causal inferences can be arrived at in other ways as well. There are means of validation other than experimental designs.

On the other hand, few of us would be willing to accept practitioners' causal inferences at face value. We need ways of checking and validating practitioners' inferences. The choice is not between valid scientific knowledge on the one hand and invalid practitioner superstition on the other, as the issue is often posed. The real problem is to arrive at valid causal inferences, and this can be done in a number of ways, most of which we have yet to invent.

Unfortunately, we don't know much about practitioners' inferences or how they arrive at them. We have few procedures or resources for helping teachers and other practitioners improve their critical inferences. Although the validity considerations in the experimental modes of reasoning have been well explicated, the inferences critical to practice lie hidden, subterranean. We must invent new ways of helping practitioners improve their causal inferences if we want to improve practice. (There is no apparent theoretical problem with discerning how practitioners reason causally. See, for example, Ennis (1973) on causal reasoning and Scriven (1973) on how historians draw causal inferences. Also Weir (1982) has applied the powers theory of causation to naturalistic evaluation.)

Thus, an expanded and pluralistic conception of validity would seem useful. It is neither correct nor useful to think of only one narrow pathway for reaching valid causal inferences. There are many pathways, some more efficacious than others, depending on one's circumstances.

References

Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297-312.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation. Boston: Houghton-Mifflin.

Cronbach, L. J. (1980). Validity on parole: How can we go straight? New Directions for Testing and Measurement, 5, 99-108.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.

Ennis, R. H. (1973). On causality. Educational Researcher, 2(6), 4-11.

Glass, G. V. (1982). Experimental validity. In H. Mitzel (Ed.), Encyclopedia of educational research (5th ed.). New York: Free Press.

House, E. R., Mathison, S., & McTaggert, R. (1989). Validity and teacher inference. Educational Researcher, 18(7), 11-15, 26.

Mackie, J. L. (1974). The cement of the universe: A study of causation. London: Oxford University Press.

Scriven, M. (1973). Causes, connections and conditions in history. In H. S. Broudy, R. H. Ennis, & L. I. Krimerman (Eds.), Philosophy of educational research (pp. 439-458). New York: Wiley.

Searle, J. R. (1983). Intentionality. Cambridge, England: Cambridge University Press.

Weir, E. E. (1982). Causal understanding in naturalistic evaluation of educational programs. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Childhood Influences on My Work

2016

Ernie House

Abstract
House tells stories about his childhood experiences with adults who were making what he felt were poor evaluations. He had similar evaluation experiences as a teacher and slowly developed his own views about how evaluation could be most fairly and appropriately accomplished in everyday life. He began applying these ideas and many others when he was put in charge of a national evaluation in the 1960s, drawing upon perspectives of many colleagues he assembled to give him advice. He retained a healthy skepticism that typifies all his work, based on the childhood convictions he developed through observing and defending himself from poor adult evaluations.



In the first grade our teacher put a chart on the wall with our names on it. She said, in her best grade school teacher voice, “Children, if you do this, you will get a blue star; if you do that, you get a silver star; and if you do this, you get a gold star!” I thought, she doesn’t think we’re going to fall for that, does she? To my astonishment, the other kids began falling all over themselves to win these stars. I felt like yelling, “You idiots, they’re just little paper stars!” (Perhaps a portent of evaluations yet to come.) By that time I was living with my mother, who was working three shifts in a munitions factory while my grandmother took care of my sister and me.

My father had been killed in a car wreck 2 years earlier. My mother had no other means of support and no resources. After a few years she married a man from the factory she didn’t know very well, and we moved along a lonely rural highway miles out of town. Unfortunately, the guy turned out to be psycho.

At night they would get into violent arguments, and sometimes he would bring out a loaded gun and hold it to my head, hammer cocked. It was a way of threatening her. I don’t know if you’ve had the opportunity to have an experience like this, but it’s totally mind focusing. During these episodes my mind was absolutely lucid. I could see that he was deranged, and I sat perfectly still, in complete control of my emotions. No crying, pleading, or moving. I didn’t know what might set him off. I did think that if I survived, I would never allow myself to get into such a helpless situation again. From these and other experiences, I developed a strong resolve and motivation not to be controlled by others.

Another conclusion I had reached by the age of eight was that adults made bad decisions that could prove disastrous for them and for my sister and me. My mother was the best person I ever knew, good through and through, actually too good for the world in which she lived. She was in extremely difficult circumstances, doing the best she could. My father and his four brothers were the toughest people I knew, but hardly models of prudence, as police records show. When they were little, they had been sent to a St. Louis orphanage and farmed out as child laborers after their own father died of silicosis working in the mines in southern Illinois. I reasoned that if I could see through adult motivations and anticipate what adults might do, I could protect my sister and myself. At an early age I began looking beneath the surface of people and events, and I looked suspiciously. This attitude evolved into an intellectual style.

Years later these traits became useful in evaluation. Often, I can see what others do not see, and I will say what others will not say. All people practice willful ignorance to a greater or lesser degree. They choose not to see things—a luxury I felt I could not afford. I pushed willful ignorance back further than most people can tolerate. In books, articles, and high-profile evaluations, I employed these skills.

I was pressured in various ways, as you often are. After all, careers, reputations, and livelihoods are at risk. One of the strangest episodes was a review of environmental education in Austria for the Organisation for Economic Co-operation and Development (OECD). The Austrians were so upset with my report that they sent a formal diplomatic protest to OECD. Not every evaluator can say that. Of course, I was highly resistant to such pressures. What could they do? Hold a gun to my head?

This style had carry-over to other parts of my life, as in financial investing. To my great surprise, when I began managing my retirement funds, I found investing fascinating. In a way it was a pure form of evaluation that culminated in concrete gains and losses, unlike contemplating the inadequacy of Hume’s theory of causation. And I was good at it. Investing requires skills—skills of skepticism similar to those I had developed in evaluation.

In retrospect, I didn’t make the same mistakes as the adults of my childhood. No. I made other mistakes instead. Really, you can’t see through everything all the time. You can’t live without some illusions. You need illusions to motivate and protect you. Decades ago I said that people are able to withstand far less evaluation than they think they can. That includes me. How did these childhood experiences affect my work? When conducting evaluations, I don’t necessarily believe what people tell me. I validate what they say with other data and with what others say. I have a keen sense of looking beyond appearances towards what lies beneath. My motive is to develop a deeper understanding, with the idea of preventing serious mistakes.

I also empathize with the poor and powerless. Evaluators typically come from the same backgrounds as those in charge, whereas those receiving benefits come from the lower social classes or else are children, patients, or victims, helpless to protect themselves. Empathy with the poor and powerless has prompted me to hold strong positions about social justice, which I have tried to incorporate into evaluation.

Used inappropriately, evaluations can be instruments of repression. I’ve tried to advance standards that evaluators should live by, and have conducted meta-evaluations to show what evaluators should do. I’ve been partial to qualitative methods that focus on what people actually experience. Often those in charge do not know what’s happening, and sometimes they do not want to know. Willful ignorance is widespread.

Of course, no perspective encompasses the entire truth, and the evaluator’s task is to make sense of many perspectives using multiple methods. My skepticism applies to people and methods. No method delivers unequivocal truth. I’m skeptical of foundational positions based on methods or ideology. Evaluators need to look at evidence cautiously and holistically.

I have been bold in challenging authorities. If I arrive at a position not favored by those in power, I expect them to pressure me to change my views and to retaliate if I don’t. I am willing to change findings if I have missed something or the issue is unimportant. However, if the issue is critical, that’s another matter. How seldom those in power encounter principled opposition is indicated by how uncomfortable they are with it. They realize that professionals are vulnerable in their concern about their own careers, and since those in power can help or hinder, they expect professionals to play along.

In summary, characteristics that influence how I approach evaluation include skepticism and resolve, which lead me to maintain an autonomous viewpoint, question authority, look beneath the surface, resist pressure, control emotion, and empathize with the poor and powerless. No doubt, other evaluators share some of these traits, which they’ve developed from different backgrounds. And although these traits are well suited to evaluating, being too critical, too suspicious, too cynical, or too provocative can be quite counterproductive. I have been guilty of such missteps on many occasions.

ERNIE HOUSE is an emeritus professor of education at the University of Colorado at Boulder.

2012

WORK MEMOIR
Ideas and Influences

Ernest R. House

In 1967, I was sitting in a university office talking to Gene Glass, my statistics instructor, when Bob Stake walked in. Glass said, “Do you know Ernie House?” Stake said, “He doesn’t know I know him, but I know who he is.” Glass said, “Ernie has a gold mine here, a big evaluation project.” Thus began my evaluation career. I was finishing graduate study at the University of Illinois and had been asked to evaluate the statewide Illinois Gifted Program.

How did I get here? I began by trying to change education, convinced by my experiences that there was something amiss in the education system. My scholarly career has been marked by two themes: educational innovation and evaluation, including the processes, politics, policies, and values of both. The two merge in evaluating education innovations. Gradually, I moved more into evaluation.

The purpose of this chapter is to identify the influences on my ideas. Memory is a biased instrument, not balanced in this case by other accounts of events. To approach the task, I have outlined my career activities and identified what influenced them at that time. It’s one way of putting distance between now and then, to reduce the danger of rewriting history by projecting current ideas backward. The approach hasn’t been as successful as I had hoped. In a multidecades career, you interact with hundreds of people, engage in dozens of projects, and write far too many papers. My first draft wasn’t very readable—too many names, too many places, too many ideas. The early version did prove one thing: The influences were numerous, varied, and complex. To make this chapter more readable, I have omitted many, and there is much that I regret leaving out.

I have identified several types of influences. First, there are those scholars a generation or so ahead of me in the field. Second, there are scholars from other disciplines, especially philosophy, political science, history, and economics. Third, there are a few friends and colleagues who shaped my ideas by reviewing my work in the early drafts. Fourth, there are colleagues I met a few times a year and had productive exchanges with. Fifth, there are the diffuse effects of spending time in other countries.

During my career, the social context—indeed, society itself—changed greatly. Evaluation gained its impetus in 1965 with passage of the Great Society legislation, which mandated evaluation for some education programs for the first time. Through the 1960s and 1970s, the role of evaluation was to legitimate government activities by evaluating them. (“We aren’t sure this will work, but we will evaluate it to see.”) In the 1980s, Reagan reversed 50 years of the New Deal and Great Society by privatizing, deregulating, and discrediting government endeavors. The private sector could do things better, he maintained. In the 1990s, these trends continued, with Clinton trying to convert government’s role to managing rather than producing services. In the new century, Bush embarked on even more radical privatizing and deregulating of policies. Hence, while evaluation began in 1965 by serving the public interest, by 2010, evaluation itself was being privatized to serve private interests in some cases. These are remarkable changes, against which my career played out. To simplify the analysis, I have divided the era into four periods corresponding roughly to the decades, each period typified by a different ethos.

LEARNING THE CRAFT AND CREATING NEW IDEAS (1967–1980)

In 1967, I prepared for the Illinois evaluation by putting all the evaluation papers I could find in a cardboard box and reading them in a month. There wasn’t much. I established an advisory panel, including Stake, Egon Guba, and Dan Stufflebeam. Their advice proved invaluable. I used Stake’s (1967) “countenance” model of evaluation to plan the study. In 4 years, I learned evaluation from the bottom up, aided by a talented team consisting of Steve Lapan, Joe Steele, and Tom Kerins. We worked with program managers, hundreds of school districts, including Chicago, the state education agency, and the Illinois legislature. These interactions convinced me that evaluation was highly political, not a common idea at the time. Following that, Stufflebeam, Wendell Rivers, and I conducted a review of the Michigan Accountability Program, touted as a national model. This reinforced my sense of the ubiquity of politics.

In fact, I asked myself, was it all politics? The possibility disturbed me. Surely, there must be a way to adjudicate what evaluators did. I saw a review of Rawls’s (1971) work on social justice. Maybe this was what I needed. If politics was about who got what, an ethical framework might help evaluators grapple with the politics. I wrote an article on justice in evaluation. It’s difficult to imagine the incredulity of people in the field. What could justice possibly have to do with evaluation? The two terms didn’t belong in the same sentence. Some did see the relevance. Don Campbell sent for copies, and Glass included the article in the first Annual (Glass, 1976).

Rawls’s theory hypothesizes an “original position” in which people decide what principles of justice they should adopt. He arrives at one principle securing basic civil liberties and another stipulating that if inequalities are allowed, these inequalities should benefit the least advantaged. This conception was more egalitarian than the dominant utilitarian view. I discussed how the principles might apply to evaluation. The import was to bring social justice into consideration. Even if evaluators disagreed with Rawls, they needed to think about how what they were doing affected others, particularly the disadvantaged. Evaluation was more than politics.

During the 1970s, the quantitative–qualitative debate heated up. Along with others, I defended the legitimacy of qualitative studies. Again, I looked for a broader perspective and found a work reviving the classical discipline of rhetoric (Perelman & Olbrechts-Tyteca, 1969). I conceived that evaluations were arguments in which evaluators presented evidence for and against and that in making such arguments they might use both quantitative and qualitative data. Evaluation was more than methods. These ideas gained quick acceptance. I received personal messages from Lee Cronbach and Guba that the ideas had changed their thinking. Cronbach recast the validation of standardized tests as arguments, and Guba advanced naturalistic evaluation much further in work with Lincoln.

Meanwhile, our innovative center at Illinois, led by Stake and Tom Hastings, was experimenting with case studies, influenced by Barry MacDonald at East Anglia. In the Illinois evaluation, we had collected 40 different kinds of information on a stratified random sample of local gifted programs. How should we put that together? Bob encouraged us to combine these data into “portrayals.” After writing some cases, I gave a folder of data to a colleague, who said, “From what angle do I write this?” I said you don’t need an angle, just read the material and put it together. The result was incoherent. I realized you must have a point of view to make sense.

The framework this time was to see evaluations as using voice, plot, story, imagery, metaphor, and other literary elements, based on ideas from literary theory, linguistics, and cognitive science. I applied these concepts to case studies and scientific studies. Even scientific studies tell a story. One example was a sociological analysis of research on drunk driving, showing how the studies had changed the image of drunk drivers from those who have one drink too many to that of falling down habitual drunks. This change in image prompted strong legislation. Such elements I called “the vocabulary of action.” They motivate people to act. The deeper idea is one of coherence and meaning, of conveying powerful, shared values through metaphors, images, and nonliteral means. Evaluation is more than literal truth.

In the 1970s, the field expanded rapidly. There were at least 60 evaluation models. Examining them, I saw that they were similar. I posited eight basic approaches and analyzed how these differed in methods and assumptions. From there, I critiqued the approaches with the criteria—meta-evaluation. I included all these papers in my 1980 validity book, generalizing that truth, beauty, and justice were three broad criteria by which evaluations could be judged (House, 1980). Evaluations should be true, coherent, and just. Untrue, incoherent, and unjust evaluations are invalid. You need adequacy in all three areas. In each case, I had encountered a practical problem and looked to other disciplines to provide insights.

What about change in education? The Illinois Gifted Program was a complex, very effective innovation. In 1974, I published a book on the politics of educational innovation, drawing on the Illinois study and on quantitative geography about how innovations spread (House, 1974). Educators do not respond to new ideas as rationalistic research and development models of change anticipate. Teachers blend new ideas with old practices, heavily influenced by colleagues around them. The distinction is between reforms that enhance teacher skills and reforms that replace teacher practices with techniques from authorities: craft versus technology. Adding to the craft perspective, Lapan and I wrote a book of advice for teachers (House & Lapan, 1978). In our view, the key to educational innovation was to influence teacher thinking and, through that, teacher practice. It wasn’t advisable to ignore how teachers conceive their work.

In those early years, I participated in several projects that influenced me in the long term. One was a study of change in a Chicago school by a team that included Dan Lortie and Rochelle Mayer. This study deepened my insights about how complex school social structures are and how that affects reform. Another project was a 4-month visit to East Anglia, where I established connections with MacDonald and his colleagues, who were working on democratic evaluation via case studies. A third was a critique of the Follow Through program evaluation with Glass, Decker Walker, and Les McLean. Our panel concluded that the Follow Through findings depended on how closely programs fit narrowly defined outcome measures rather than broader criteria. Our conclusion: There was no simple answer as to which early childhood program was the best. Our critique dealt a blow to the presumption that government could conduct large evaluations to determine definitive answers for everyone everywhere. Evaluation findings don’t generalize that easily.

META-EVALUATION: CRITIQUING POLICIES AND PRACTICES (1980–1990)

After his election in 1980, Reagan began privatizing and deregulating many government functions. Concern about the public interest began giving way to private interests, backed by claims that the private sector would be more effective. I began the 1980s by conducting two high-profile meta-evaluations. The New York City mayor’s office asked me to “audit” the evaluation of their controversial Promotional Gates Program, in which students were retained at grade level if they did not achieve prescribed scores on standardized tests. Those doubting the program’s efficacy insisted on an outside audit of the school district’s evaluation. Political pressures were intense. As I testified at a city council meeting, “No one in New York City seems to trust anyone else.” Bob Linn, Jim Raths, and I wrote confidential reports for the chancellor and mayor’s offices. The district evaluation had problems like failing to account for regression to the mean, thus claiming test gains when there were none. After a few rocky encounters, the district administrators decided that we were trying to help, and the evaluators corrected the errors. Eventually, the Village Voice obtained our confidential reports and featured them in a front-page story. The second meta-evaluation was a critique of the evaluation of Jesse Jackson’s PUSH/Excel program. Eleanor Farrar and I thought that the evaluators imposed an inappropriate program model on PUSH/Excel that was too rationalistic for a motivational enterprise. PUSH/Excel was like a church or coaching program, featuring loosely connected inspirational activities. Indeed, athletic teams rely heavily on similar motivational activities. (Also, the headlines generated by the evaluation—Jesse Jackson took government money and did not do what he said—did not match the findings.) I later wrote a book about the PUSH/Excel program, emphasizing the central issue of race (House, 1988).

The chief evaluator for PUSH/Excel was Charles Murray, who published Losing Ground (Murray, 1984) a few years later. This work claimed that Great Society programs made their beneficiaries worse off rather than better. Murray estimated the effects of Great Society programs by comparing before and after data in several areas. Unfortunately, Murray’s data analyses were badly flawed. In the education analysis, I discovered that he had used nonstandardized means for his critical measures, an egregious error. Bill Madura and I demonstrated that his analysis of unemployment was incorrect and misleading, accomplished by leaving out key employment data. Murray’s analyses seemed shaped to fit his message rather than the other way around. In spite of severe scholarly shortcomings, the book’s message attracted raves among neoconservatives and a White House eager to discredit the Great Society efforts.

Losing Ground set the tone for the coming decades of ideological studies purporting to be scholarly. Neoconservatives found that they could publish findings that did not meet rigorous standards in political journals and that the media would interpret these findings as social science—especially if the studies had lots of numbers. Journalists did not have the capacity to assess the statistics. Privately funded conservative think tanks became major sources of reform ideas. Education reforms became increasingly punitive, imposed on teachers and students and justified by pseudo–social science.

During the 1980s, I extended the craft perspective by writing papers on teacher thinking, teacher appraisal, and how to improve the insights of teachers as they direct their classrooms, coauthored with Lapan, Sandra Mathison, and Robin McTaggert. Cronbach (1982) influenced how we construed the validity of teacher inferences. Ultimately, I did not extend this work as far as intended, which was to integrate the craft perspective with evaluation thinking. Seeking educational improvement through enhancing teacher skills was being supplanted. Coercing teachers with standardized test scores became the reform focus for the next several presidents.

Looking back on the 1980s, perhaps I spent too much time fighting neoconservative ideas. In retrospect, many of these scholars were not influenced by rational argument. Rather, they were funded to produce certain findings, and they did. Ideological positions are not affected by discordant data. The privatizing, deregulating, and de-professionalizing policies they supported are taking their toll now in financial crises, a deteriorating infrastructure, and increasing social discord and stratification. Sometimes you have to take a stand even when you know your view won’t prevail.

EXPLORING EVALUATION FRONTIERS (1990–2000)

During the Clinton years, privatization and deregulation continued—for example, repealing the Glass–Steagall Act separating investment banking from other banking activities and refusing to regulate the burgeoning derivatives trade—which led directly to a financial crisis in 2008 (Roubini & Mihm, 2010; Stiglitz, 2010). Clinton and Gore also tried “reinventing” government by making it the manager rather than the producer of social services. In such a scheme, evaluators would supply information to managers.

During this decade, I explored the institutional nature of evaluation. I spent 3 months in Spain, a culture different from any I had experienced. Curiosity led me to the Annales historians, particularly Braudel’s (1981, 1982, 1983) history of capitalism as an institution developing over centuries. These ideas provided a map across time and societies in which I could place my own society and evaluation. I portrayed evaluation as a developing social institution in Professional Evaluation (House, 1993). My idea was that at some stage of capitalist development, government activities must be further justified and that professional evaluation emerges to play a legitimating role (which is how mandated evaluation of Great Society programs began).

In 1993, Sharon Rallis and Chip Reichardt attempted to end the quantitative–qualitative dispute and asked me to talk on this at the American Evaluation Association (AEA). I had used scientific realism as a framework for integrating approaches (Bhaskar, 1975; House, 1991). If there is a substantive real world (and not just different perceptions), quantitative and qualitative inquiries must be ways of looking at the same thing and hence compatible at some level. There is no reason to claim the superiority of one method over another. Methods of inquiry depend on which aspect of reality one is investigating. Methods differ depending on the substance explored, but there is one complex reality of which evaluators are a part. Being immersed in that reality affects how people think about it. Indeed, actively participating enables people to think about it.

A new adventure began when Ken Travers at the National Science Foundation asked me to assist his research and evaluation unit. I considered the National Science Foundation the best federal agency and was not disappointed. I served on committees, interviewed staff, and became a participant observer of how evaluation works inside the agency. In addition to its own evaluations, the unit oversaw the first review of science, math, and technology education across all federal departments. Practical problems like finding contractors led to transaction cost economics as an explanation for how evaluation markets function. I developed a framework to appraise prospective innovations by using factors that characterize transaction costs in some markets (bounded rationality, opportunism, and asset specificity), based on Williamson’s (1985) work, for which he won a Nobel Prize in 2009 (House, 1998).

At the end of the decade, I concentrated on values and democratic evaluation. During the 1990s, Ove Karlsson spent considerable time in Colorado discussing evaluation politics. Continued contacts in Sweden and Norway reinforced Scandinavian egalitarian ideas. In 1999, Ken Howe and I published a book on values in evaluation, bringing together ideas on social justice, the Karlsson Scandinavian egalitarianism, the pragmatism of Dewey and Quine, the British ideas of MacDonald, and work on deliberative democracy by political scientists and philosophers (House & Howe, 1999). Among evaluators, Scriven’s (1976) influence was particularly strong regarding the objectivity of value judgments. Many evaluators view value judgments as subjective. In our conception, evaluators can arrive at (relatively) unbiased evaluative conclusions by including the views, perspectives, and interests of relevant stakeholders; conducting a dialogue with them; and deliberating together on the results. Evaluative findings can be “objective” in the sense of being relatively free of biases, including stakeholder biases as well as more traditional biases.

An additional rationale for the approach derives from considering hundreds of years of racism in the United States. Racism has not gone away; it has gone underground. In my experience, in a racist democracy racism takes disguised forms because citizens do not want to admit discrimination even to themselves. Treating minority students as explicitly “different” is no longer acceptable in most places. What happens is that policies and programs are promulgated that purport to help the students but, in fact, disadvantage them further. At some level, they are treated as different in ways that are damaging. In other words, there is considerable self-deception. One remedy is to have minority interests represented in evaluations to guard against such possibilities.

SEMIREFLECTING IN SEMIRETIREMENT (2000–2010)

In the new century, Bush pushed through even more radical privatizing and deregulating policies. Attention to private interests, rather than the public interest, became paramount. In education, privatization, deregulation, and de-professionalization crossed new boundaries. Private foundations and other agents of concentrated wealth promoted and sponsored many of these changes. As income and wealth distribution became increasingly unequal, those with power found it important to differentiate education to match an increasingly stratified social class structure. Even evaluation began to be privatized and controlled by private entities for their own ends (for additional influences, see Glass, 2008).

I began the century at the Center for the Advanced Study in the Behavioral Sciences, introducing evaluation to colleagues there by explaining how changing conceptions of causes, values, and politics had shaped the field. I handled causes and values analytically, but I presented politics in a case study, a storytelling technique I transformed into fiction by writing a novel about evaluation politics (House, 2007). I portrayed the political and ethical challenges evaluators face in what I call an educational novel, fiction deliberately constructed to educate students on substantive issues while entertaining them. My major evaluation project of the decade was monitoring the Denver bilingual program. Denver schools were under federal court order to provide Spanish language services for 15,000 immigrant children who did not speak English. Judge Matsch needed someone to monitor the implementation of the program agreed to by the school district and the plaintiffs—the Congress of Hispanic Educators and U.S. Justice Department. I anticipated an intensely political evaluation that might employ deliberative democratic principles.

I established a committee representing the contending parties. As I collected data from schools, I fed this information to the committee. We discussed progress in implementing the program, and when we had significant disagreements, we collected more data to resolve them. As evaluators, we insisted on standards for data collection and analysis, but the stakeholders shaped the evaluation in part. In my view, the findings should be more accurate since we tapped the knowledge of those in and around the program, as well as traditional data sources.

During the study, acrimony among stakeholders lessened, and implementation proceeded in an orderly manner, albeit slower than planned. At the end, there were still disagreements, but we also had a successful implementation informed by data. For a few years, I had been considering semiretirement so that I could spend more time overseas, do other writing, and focus more on financial investing. I began investing in the early 1990s, when I first thought about retirement (following a long meeting in which faculty members complained about not being appreciated). There are remarkable similarities between evaluation and investing, and I have derived many insights about evaluation from the finance and economics literature. Like evaluation, investing requires controlling emotions and evaluating situations in which there is overwhelming yet incomplete information. (I also wanted to leave something for my descendants other than several filing cabinets of reprints.)

At the same time, the evaluation community was important to me. It had shaped my working and social life. Staying involved with a few articles, speeches, and activities helped me stay in touch. Each year, I spend several months overseas—for example, northern winters in Australia, a place I admire for its egalitarianism, levelheadedness, and laid-back lifestyle. In 2006, Gary Henry and Mel Mark asked me to talk at AEA about the consequences of evaluation. I had been wondering about the frequent renunciations of findings from pharmaceutical drug evaluations. What was wrong with these studies? On investigation, I discovered that drug companies had gained control over many aspects of the evaluations and used their influence to produce findings favorable to their drugs, sometimes producing incorrect findings. Conflict of interest of the evaluators had become a threat—in fact, a serious threat to the field. This was another effect of privatization and unrestrained self-interest.

At the end of the decade, Leslie Cooksy, president of AEA, chose the quality of evaluation as the 2010 conference theme, citing my 1980 validity book on truth, beauty, and justice. The occasion enticed me to look back at work I had done over the years, reflections I have elaborated here. Truth, beauty, and justice are still appropriate as criteria for judging the validity of evaluations, even drug evaluations done 30 years later, though the social context has changed and the meaning of truth, beauty, and justice has shifted.

LOOKING BACK AND LOOKING AHEAD

Looking back, what influenced my ideas? I built directly on the ideas of some scholars, both those within evaluation, like Stake and Scriven, and those outside, like Rawls, Braudel, and Williamson. A few friends and colleagues shaped my ideas by reacting directly to my work, notably Glass, Lapan, and Howe, my Colorado colleague, and on occasion MacDonald and Karlsson overseas. The sociologist Dave Harvey, a hometown friend, provided valuable guidance over the years by reminding me where I came from. The influence of these people is greatly underestimated in this account. My work would have been much worse without questions like “This doesn’t make sense,” even if some of it still does not make sense. There were also useful discussions with certain colleagues, including Marv Alkin, editor of this volume (for a sample, see Alkin, 1990). And there was the influence of spending time overseas, especially in England, Spain, Sweden, and Australia.

As I become older, I find it important to listen to younger scholars. Having a long career means that you have made many mistakes, learned many lessons, and solved many problems. However, as the social context changes, these lessons become less relevant. This effect is noteworthy in finance. Having learned the secrets of investing success in a U.S.-centric world, investing gurus are having a difficult time adjusting to a global economy focused on Asia. Cronbach (1982) said that generalizations decay. The trouble is you’re not sure which ones.

One of my traits has been a strong interest in new ideas, especially new concepts that explain puzzling phenomena, and arranging those concepts into coherent patterns. Seeking coherence in explanations, in the meaning of phenomena, and in the meaning of life has been a driving motive. How do these things fit together? What do they mean? Once found, I tend to lose interest and move on (not a good scholarly trait). These tendencies are matters of personality as much as mind. Introducing me at the Canadian Evaluation Society in 2004, Alan Ryan said, “Throughout his long and distinguished career, Ernest House has continuously stressed the moral responsibility of evaluators. His social activist perspective has time and again alerted us to the dangers of being seduced by the agendas of those in power.” This personality trait comes from my family. My mother was the best person I ever knew. My father and his four brothers were the toughest. Sometimes I see things others don’t see and will say things others are afraid to say.

Of course, as we know, being outspoken comes at a cost. Keynes (1936/1997) wrote, “Worldly wisdom teaches that it is better for reputation to fail conventionally than to succeed unconventionally” (p. 158). Career risk is a major vulnerability of professionals. Professionals fear damaging their careers by taking stands different from their colleagues or contrary to those wielding power. I have been threatened with lawsuits and loss of my job and offered thinly veiled bribes. No doubt I would have won more prizes, had better jobs, and made more money if I had played along. But that’s not who I am.

As I look at those colleagues who shaped evaluation in its early decades, many have been people similarly willing to risk their careers by exploring uncharted ideas and, most important, taking a principled stand against those subverting evaluation’s integrity. Looking ahead to an ethically challenged era in which private interests trump the public interests, the pressures to compromise evaluations will intensify. Defending the integrity of the field will require more than intellect; it will require character.

REFERENCES

Alkin, M. C. (1990). Debates on evaluation. Newbury Park, CA: Sage.

Bhaskar, R. (1975). A realist theory of science. Sussex, England: Harvester Press.

Braudel, F. (1981, 1982, 1983). Civilization and capitalism: 15th-18th century. New York, NY: Harper & Row.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.

Glass, G. V. (Ed.). (1976). Evaluation studies review annual (Vol. 1). Beverly Hills, CA: Sage.

Glass, G. V. (2008). Fertilizers, pills, and magnetic strips. Charlotte, NC: Information Age.

House, E. R. (1974). The politics of educational innovation. Berkeley, CA: McCutchan.

House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage. (In Spanish, Evaluacion, etica y poder, 1994, Madrid, Spain: Morata. Reprinted 2010, Charlotte, NC: Information Age.)

House, E. R. (1988). Jesse Jackson and the politics of charisma: The rise and fall of the PUSH/Excel program. Boulder, CO: Westview Press.

House, E. R. (1991). Realism in research. Educational Researcher, 20(5), 21–26.

House, E. R. (1993). Professional evaluation: Social impact and political consequences. Newbury Park, CA: Sage.

House, E. R. (1998). Schools for sale: Why free market policies won’t improve America’s schools and what will. New York, NY: Teachers College Press.

House, E. R. (2007). Regression to the mean: A novel of evaluation politics. Charlotte, NC: Information Age.

House, E. R., & Howe, K. R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.

House, E. R., & Lapan, S. G. (1978). Survival in the classroom. Boston, MA: Allyn & Bacon.

Keynes, J. M. (1997). The general theory of employment, interest, and money. Amherst, NY: Prometheus Books. (Original work published 1936)

Murray, C. (1984). Losing ground: American social policy 1950–1980. New York, NY: Basic Books.

Perelman, C., & Olbrechts-Tyteca, L. (1969). The new rhetoric: A treatise on argumentation. Notre Dame, IN: University of Notre Dame Press.

Rawls, J. (1971). A theory of justice. Cambridge, MA: Harvard University Press.

Roubini, N., & Mihm, S. (2010). Crisis economics. New York, NY: Penguin Books.

Scriven, M. (1976). Evaluation bias and its control. In G. V. Glass (Ed.), Evaluation studies review annual (pp. 119–139). Beverly Hills, CA: Sage.

Stake, R. E. (1967). The countenance of educational evaluation. Teachers College Record, 68, 523–540.

Stiglitz, J. E. (2010). Freefall. New York, NY: W. W. Norton.

Williamson, O. E. (1985). The economic institutions of capitalism: Firms, markets, and relational contracting. New York, NY: Free Press.

Note: Thanks to Steve Lapan and Gene Glass for helpful comments on this chapter.

Friday, October 7, 2022

Unfinished Business: Causes and Values

2001

Unfinished Business: Causes and Values

Ernest R. House

During the past several decades, two issues have strongly influenced much of what has happened in evaluation. These are the quantitative-qualitative dispute and the fact-value dichotomy. The first issue is familiar history by now, though some underpinnings of the dispute, such as the shift in our conception of causation, are perhaps not well understood. The second issue, the fact-value dichotomy, concerns the nature of values. We have come to grips with this issue only recently, and it promises to be equally pivotal. Both issues are unfinished business, though in different ways.

The Quantitative-Qualitative Dispute

In the early days of professional evaluation, policymakers and evaluators put their faith in large-scale quantitative studies, like Follow Through, Head Start, and the Income Maintenance experiment, to mention a few. Policymakers and evaluators thought that these large national studies would yield definitive findings that would demonstrate which programs or policies worked best. The findings could serve as the basis for mandates by the central government.

These large studies proved extremely disappointing for the most part. One problem was their scale. During one data collection, the Follow Through evaluators collected twelve tons of data. They were overwhelmed by the logistics to the point where they could not produce timely reports. Eventually, the government sponsors reduced the study to a fraction of its original size by limiting the number of variables.

A more serious problem was that the findings of these studies proved to be equivocal. The studies did not produce clear-cut results that could be generalized, as had been expected. For example, when the Follow Through data were analyzed, the variance in test score outcomes across the dozen early childhood programs being compared was about as great as the variance within these programs. In other words, if a given early childhood program had been implemented at six sites, two of the sites might have good results, two sites might have mediocre results, and two sites might have poor results.
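The variance comparison can be made concrete with a small sketch. The scores below are invented for illustration and are not Follow Through data; the program names and the function `variance_components` are likewise hypothetical. The sketch compares how much program averages differ from one another with how much sites differ within a single program.

```python
from statistics import mean, pvariance

# Hypothetical site-level average test scores (six sites per program).
# These numbers are invented for illustration; they are NOT Follow Through data.
site_scores = {
    "program_1": [40, 40, 45, 45, 50, 50],
    "program_2": [45, 45, 50, 50, 55, 55],
    "program_3": [50, 50, 55, 55, 60, 60],
}

def variance_components(scores_by_program):
    """Return (variance between program means, average variance within programs)."""
    program_means = [mean(sites) for sites in scores_by_program.values()]
    between = pvariance(program_means)
    within = mean(pvariance(sites) for sites in scores_by_program.values())
    return between, within

between, within = variance_components(site_scores)
print(f"between programs: {between:.2f}")
print(f"within programs:  {within:.2f}")
```

When the two figures are of similar size, as they are in this invented data, knowing which program a site adopted says little about how that site will score.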

Choosing a particular early childhood program was not effective in predicting the test score outcomes. This was not the kind of evaluative conclusion the government could base national recommendations on. Policymakers and evaluators became disenchanted with large-scale studies because of their cost, time scale, and lack of decisive results.

Meanwhile, evaluators were developing alternative approaches, including qualitative studies, meta-analysis, and program theory. Small qualitative studies were practical. For example, if a school district wanted an evaluation of its early childhood education program, interviewing administrators, teachers, and students was simple and cheap, and the findings were easy to understand, even if they could not be published in the journals of the time. Furthermore, generalizability was not the problem it was for large national studies. The demand on the local study was that the results be true for this place at this time, not true for sites all over the country.

However, many evaluators did not consider qualitative studies to be scientific. Members of the evaluation community engaged in intense debates over the scientific legitimacy of qualitative methods. This dispute preoccupied the profession for twenty years, even as qualitative studies became increasingly popular. After many words and much rancor, the field finally accepted the idea that evaluation studies could be conducted in a number of different ways (Reichardt and Rallis, 1994). Evaluation became methodologically ecumenical, even if personal sensitivities lingered. The quantitative-qualitative dispute seems to be largely history by now.

Another alternative to large-scale quantitative studies was meta-analysis (Glass, 1976). Meta-analysis was more readily accepted by methodologists, though not without controversy. (Eysenck, 1978, called it “mega-silliness.”) In some ways meta-analysis was a natural successor to large-scale quantitative studies. Meta-analysis assembles scores of small experimental studies, studies that have control groups, and combines the findings of these studies quantitatively by focusing on the differences between performances of the experimental and control groups. The technique is more radical than it sounds since the researchers might combine outcomes that are quite different in kind into the summary scores.

For example, in the first application, Smith and Glass (1977) compared different approaches to psychotherapy. At the time the efficacy of psychotherapy itself was being questioned. Smith and Glass demonstrated with meta-analysis that different approaches to psychotherapy were effective and about equally effective. In conducting the meta-analysis, quite different outcomes were added together. For example, the researchers combined attitude questionnaire responses with counts of patient behaviors.
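A minimal sketch of how outcomes of different kinds can be placed on a common scale and combined is given here. It assumes one common form of standardized effect size, the difference between experimental and control group means divided by the control group's standard deviation, and a simple unweighted average across studies; neither choice is spelled out in the text above. The two studies and their numbers are hypothetical.

```python
from statistics import mean, stdev

def effect_size(experimental, control):
    """Standardized mean difference: (experimental mean - control mean) / control SD.

    Assumption: the control-group standard deviation is used as the yardstick.
    """
    return (mean(experimental) - mean(control)) / stdev(control)

# Hypothetical studies measuring outcomes that differ in kind.
studies = {
    "attitude questionnaire": {"experimental": [12, 14, 16], "control": [10, 12, 14]},
    "patient behavior count": {"experimental": [3, 5, 7], "control": [2, 4, 6]},
}

effect_sizes = {
    name: effect_size(data["experimental"], data["control"])
    for name, data in studies.items()
}

# Simple unweighted average across studies (one of several possible pooling rules).
pooled = mean(effect_sizes.values())

for name, es in effect_sizes.items():
    print(f"{name}: {es:.2f}")
print(f"pooled effect size: {pooled:.2f}")
```

Because each difference is expressed in standard deviation units, a questionnaire score and a behavior count end up on the same scale, which is what makes adding them together possible and also what makes the technique more radical than it sounds.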

Meta-analysis became overwhelmingly popular in social and medical research to the point where it is difficult to pick up a major research journal without finding a few meta-analytic studies in it. In fact, medical researchers like the technique so much they sometimes claim credit for inventing it. Part of this popularity is due to meta-analysis being successful where single quantitative studies were not, such as in detecting treatment effects for mild hypertension. The technique combines the results from many studies, and since these individual studies are conducted in different settings and circumstances, they contain considerable variation, which seems to give the findings more generalizability when they are added together (Cook, 1993).

A third alternative to large-scale studies was program theory (Chen and Rossi, 1987). Program theory takes many forms but essentially consists of constructing a model of the program on which the evaluation can be based. Earlier some researchers had advocated basing evaluations on grand social theories, but such attempts failed. First, there were no social theories that seemed to have the explanatory power or credibility of physical theories. Second, even if such theories existed, could they be used to evaluate social programs? For example, given the task of evaluating automobiles, would evaluators use theories of physics to do the job? It seems unlikely.

Evaluators reconsidered and transformed the grand theory idea into developing theories for individual programs, in other words, constructing a model of the program. This substitution worked better. The program formulation is concrete enough to give guidance to the evaluation study, and it communicates directly with program participants. Program theory can guide the evaluation by delineating places where the evaluator might seek data to confirm whether the program is working in particular components. It enables evaluators to eliminate rival hypotheses and make causal attributions more easily (Lipsey, 1993; Davidson, 2000).

Underlying qualitative studies, meta-analysis, program theory, and other developments have been changes in our conception of causation. These changes have been subtle and have perhaps passed unnoticed. They suggest why these alternatives have worked better than the large-scale studies that preceded them.

Changing Conceptions of Causation

The conception of causation that we inherited is called the regularity or Humean theory of causation, named after David Hume’s influential analysis of cause (House, 1991). Regularity describes the conception. Put simply, the reason that we know one event caused another event is that the first event took place before the other event regularly—regularity of succession. If such and such event occurred and another event occurred after it repeatedly, we would have reason to believe the events would occur together again. So succession of events is what we are after. In fact, Hume said that is all there is to causation, along with contiguity of the events. The research task is to determine the succession of events. Put succinctly, “If p, then q; p, therefore q.”

This notion of cause is the underlying basis for most of the discussion of experimental design over the past decades. It is manifest in one of the early evaluation books, written by Edward Suchman: “One may formulate an evaluation project in terms of a series of hypotheses which state that ‘Activities A, B, C will produce results X, Y, Z’” (Suchman, 1967, p. 93). In other words, if we have a program A under circumstances B and C, it will produce results X, Y, and Z. Furthermore, the perfect design for determining whether the result has occurred is the classic randomized control group design. No error could result from employing this design, according to Suchman.

Although this assertion sounds reasonable, it falls apart on inspection. If we go back to the Follow Through experiment, we had the same early childhood program at six different sites, but it produced different outcomes at the sites. Why? Because social causation is more complex than the regularity conception suggests. Even with the same program, there are different teachers at different sites who produce different results. We might try to control for the teachers, but there are so many variables that influence or might influence the outcomes that the researcher can’t control for all of them. Put another way, the program is not in and of itself an integrated causal mechanism. Parts of the program might interact with elements in the environment to produce quite different effects.

Such considerations led Cronbach to give up on treatment-interaction research altogether. He was trying to determine which characteristics of students affected outcomes, that is, how student characteristics and outcomes interacted. But there were so many possibilities that could not be controlled that he gave up. Put more technically, the effects of the secondary interactions of the variables were consistently as strong as the main effects.

Cronbach (1982) looked into the nature of causation and devised a more complex formulation. “In S, all (ABC or DEF or JKL) are followed by P.” In other words, in this particular setting, P, the outcome, may be determined by ABC or DEF or JKL. The problem for evaluators is that if A is the program, we only get P if conditions B and C are also present. So we could have A and not have the outcome P. More confounding, since P is caused by DEF and JKL combinations as well, we might not have the program A but still get P anyhow. Neither the presence nor the absence of A, the program, determines P. Succession of events is not a definitive test of cause and effect. The classic control group design will not produce definitive conclusions if causation is this complex.
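The logic of this formulation can be checked with a short sketch. It treats each letter as a condition that is either present or absent in the setting and encodes the rule as written: P follows whenever ABC, DEF, or JKL is fully present. The function name `outcome_occurs` and the particular test cases are illustrative assumptions.

```python
# Each conjunction is jointly sufficient for the outcome P, following the
# formulation "all (ABC or DEF or JKL) are followed by P".
SUFFICIENT_COMBINATIONS = [set("ABC"), set("DEF"), set("JKL")]

def outcome_occurs(conditions_present):
    """Return True if any complete combination is contained in the conditions present."""
    present = set(conditions_present)
    return any(combo <= present for combo in SUFFICIENT_COMBINATIONS)

# A is the program. Hypothetical settings:
cases = {
    "program with B and C": "ABC",      # program present, outcome occurs
    "program alone": "A",               # program present, no outcome
    "no program, but DEF": "DEF",       # program absent, outcome occurs anyway
    "no program, partial sets": "DJ",   # program absent, no outcome
}

for label, conditions in cases.items():
    has_program = "A" in conditions
    print(f"{label}: program={has_program}, P={outcome_occurs(conditions)}")
```

The four cases cover every pairing of program present or absent with outcome present or absent, which is why observing A together with P, or A without P, settles nothing on its own.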

Even so, we could devise a determinate research design using Cronbach’s formulation, albeit a very expensive and complex one. However, social causation is more complex than even Cronbach’s formulation indicates. Cronbach based his analysis on Mackie (1974), a seminal work on causation. Mackie’s original formulation was this: “All F (A…B… or D…H… or …) are P…” The dots represent missing causal factors we don’t know about. We have huge gaps in our knowledge of social events, gaps we don’t know about, and gaps we don’t even know we don’t know about. We can never fill these gaps in, so we can never be certain of all that is involved.

I won’t extend the causal analysis further, except to say that we now understand that social causation is more complex than we thought back in the old days. The analysis remains incomplete, unfinished business for the field. Why do qualitative studies, meta-analysis, and program theory seem to work better than large-scale studies of the past?

Each approach takes account of a more complex social reality by framing the program and the study more precisely, albeit in different ways. Qualitative studies show the interaction of people and events with other causal factors in context, which limits the causal possibilities and alternatives one must contend with (Maxwell, 1996). Meta-analysis uses individual studies, each of which occurred in separate circumstances of rich variation, which makes generalization more possible (Cook, 1993). Program theory delineates the domain investigated, which makes the questions evaluators pose more precise, relevant, and testable (Lipsey, 1993).

Recent books by Pawson and Tilley (1997) and Mark, Henry, and Julnes (2000) deal with causation, mostly by advancing realist conceptions, somewhat similar to the conception I have employed here. There is some agreement between the books and also significant disagreement, as in the utility of experimental studies. Developing a more complex notion of causation remains unfinished business, though we have made a start (cf. Rogers et al., 2000).

The Fact-Value Dichotomy

A second issue that has shaped development in the field is the fact-value dichotomy. This influence has been subtle and pernicious. The dichotomy is the belief that facts refer to one thing and values refer to something totally different. The fact-value dichotomy is a particularly difficult problem for evaluation since values lie at the heart of evaluation. I doubt anything in the field has caused more trouble than this belief.

The distinction between facts and values has been around for many decades, but it came down to us in the evaluation community through the positivists and their influence on social science. The logical positivists thought that facts could be ascertained and that only facts were the fit subject of science, along with analytic statements like “1 plus 1 equals 2” that were true by definition. Facts were empirical and could be based on pristine observations, a position called foundationalism.

On the other hand, values were something else. Values might be feelings, emotions, or useless metaphysical entities. Whatever they were, they were not subject to scientific analysis. People simply held certain values or believed in certain values or did not. Values were chosen. Rational discussion had little to do with them. The role of the scientist was to determine facts. Someone else, politicians perhaps, could worry about values.

Donald Campbell, one of the great founders of the evaluation field, accepted the fact-value dichotomy (Campbell, 1982). However, he did not accept foundationalism about facts. Counter to the positivists, he contended that there were no pristine observations on which factual claims could be based because all observations were influenced by theories and preconceptions that people held. Knowledge was still possible because although you could not compare a fact to a pristine observation to see if the fact was true, what you could do was to compare a fact to the body of knowledge it related to. The fact should fit the whole body of beliefs. Occasionally, the body of knowledge had to be changed to accommodate the fact. In any case, you were comparing a belief to a body of beliefs, not a belief to pure observation. This non-foundationalism was counter to the positivist view.

On the other hand, Campbell explicitly accepted the positivist conception of values. Values could not be determined rationally; they had to be chosen. He thought it was not the evaluator’s job to choose values. Once values were determined by politicians, sponsors, or program developers, evaluators could examine the outcomes of programs and policies with criteria based on those values. Practically speaking, this meant that evaluators could not evaluate the goals of programs, since the goals were closely connected to values. Evaluators had little choice but to accept program and policy goals.

I believe Campbell had the correct idea about facts but not about values. We can deal with both facts and values rationally. Facts and values are not separate kinds of entities altogether, though they sometimes appear that way. Facts and values (factual claims and value claims) blend together in the conclusions of evaluation studies and, indeed, blend together throughout evaluation studies. We might conceive of facts and values schematically as lying on a continuum like this:

 

Brute Facts_______________________Bare Values

 

What we call facts and values are fact and value claims, which are sometimes expressed as fact and value statements. They are beliefs about the world. Sometimes these beliefs look as if they are strictly factual without any value aspect built in, such as, “Diamonds are harder than steel.” This statement may be true or false, and it fits at the left end of the continuum. There is little individual preference or taste built into it.

A statement like “Cabernet is better than chardonnay” fits better at the right end of the continuum. It is suffused with personal taste. But what about a statement like, “Follow Through is a good educational program”? This statement contains both fact and value aspects. The evaluative claim is based on criteria from which the conclusion is drawn, and it must be based on factual claims as well. The statement fits towards the middle of the continuum, a blend of factual and value claims. Most evaluative conclusions fall towards the center of the continuum as blends of facts and values.

Context makes a huge difference in how a statement functions. A statement like, “George Washington was the first president of the United States,” looks like a factual (historical) claim. But if I am engaged in a discussion with a group of feminists who are pointing out the racist and patriarchal origins of the country, this statement becomes evaluative as well in this particular context. The statement can be factual and evaluative simultaneously. Similarly, claims that might seem factual in another context might become evaluative in the context of an evaluation.

Such evaluative claims are subject to rational analysis in the way we ordinarily understand rational analysis. First, the claims can be true or false. Follow Through may or may not be a good educational program. Second, we can collect evidence for and against the truth or falsity of the claim, as indeed we do in evaluation studies. Third, the evidence can be biased or unbiased, good or bad. Finally, the procedures for assessing which evidence is likely to be biased or unbiased are determined by the discipline.

Of course, some claims are not easy to determine. In some situations, it may not be possible to determine the truth or falsity of the claims. Also, we may need new procedures, in addition to traditional techniques, to help us collect and process fact-value claims and determine their validity. Just as we have developed sophisticated procedures for testing factual claims over the years, we might develop procedures for collecting and processing claims that contain strong value aspects so that our evaluative conclusions are unbiased regarding these claims as well. Actually, the claims blend together in evaluation studies.

Elsewhere, we have suggested three general principles we might follow in arriving at unbiased claims (House and Howe, 1999). The principles are inclusion of all relevant stakeholder perspectives, values, and interests in the study; extensive dialogue between the evaluator and stakeholders, and sometimes among the stakeholders themselves; and extensive deliberation to reach valid conclusions in the study. We call this approach deliberative democratic evaluation.

This analysis of facts and values is quite different from the fact-value dichotomy. In the old view, to the extent evaluative conclusions were value-based, they were outside the purview of the evaluator. In the new view, values are subject to rational analysis by the evaluator and others. Values are evaluations.

 

References

 

Campbell, D. (1982). Experiments as arguments. In E. R. House, S. Mathison, J. A. Pearsol, & H. Preskill (Eds.). Evaluation Studies Review Annual, 7, 117-128.

Chen, H. & Rossi, P. H. (1987). Evaluating with sense: The theory-driven approach to validity. Evaluation Review, 7, 283-302.

Cook, T. D. (1993). A quasi-sampling theory of the generalization of causal relationships. In L. B. Sechrest & A. G. Scott (Eds.). Understanding Causes and Generalizing about them. New Directions in Evaluation, no. 57, 39-82.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.

Davidson, E. J. (2000). Ascertaining causation in theory-based evaluation. In P. J. Rogers, T. A. Hacsi, A. Petrosino, & T. A. Huebner (Eds.). Program theory in evaluation: Challenges and opportunities. New Directions in Evaluation, no. 87, 17-26.

Eysenck, H. J. (1978). An exercise in mega-silliness. In T. D. Cook, M. L. Del Rosario, K. M. Hennigan, M. M. Mark, & W. M. K. Trochim (Eds.). Evaluation Studies Review Annual, Vol. 3, 697.

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

House, E. R. & Howe, K. R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.

House, E. R. (1991). Realism in research. Educational Researcher, 20, 6, 2-9.

Lipsey, M. W. (1993). Theory as method: Small theories of treatments. In L. B. Sechrest & A. G. Scott (Eds.). Understanding Causes and Generalizing about them. New Directions in Evaluation, no. 57, 5-38.

Mackie, J. L. (1974). The cement of the universe. Oxford: Clarendon Press.

Mark, M. M., Henry, G. T., & Julnes, G. (2000). Evaluation: An Integrated Framework. San Francisco: Jossey-Bass.

Maxwell, J. A. (1996). Using qualitative research to develop causal explanations. Working Papers, Harvard Project on Schooling and Children. Cambridge, MA.

Pawson, R. & Tilley, N. (1997). Realistic evaluation. London: Sage.

Reichardt, C. S. & Rallis, S. F. (1994). The qualitative-quantitative debate: New perspectives. New Directions in Program Evaluation, no. 61, San Francisco: Jossey-Bass.

Rogers, P. J., Hacsi, T. A., Petrosino, A., & Huebner, T. A. (Eds.). (2000). Program theory in evaluation: Challenges and opportunities. New Directions in Evaluation, no. 87.

Smith, M. L. & Glass, G. V. (1976). Meta-analysis of psychotherapy outcomes studies. American Psychologist, 32, 752-760.

Suchman, E. A. (1967). Evaluative research. New York: Russell Sage.

 

 

Coherence and Credibility: The Aesthetics of Evaluation

1979 Ernest R. House. (1979). Coherence and Credibility: The Aesthetics of Evaluation, Educational Evaluation and Policy Analy...