Evaluation Designs


Contents

Experimental (randomized)

Quasi-experimental

Non-experimental

Selection bias
                      

This section draws heavily on:

  • The Impact Evaluation Handbook - available in multiple languages
  • Monitoring and Evaluation - from the PRSP Sourcebook

Evaluation designs are determined by the choice of methods used to identify a comparison/control group, or in other words, a group of non-participants in a program or a project. This comparison/control group should be as similar to the target group as possible, but for the fact that its members do not participate in a program or receive the intervention. An estimate of impact can then be derived by comparing the levels of well-being between comparison/control groups and the target group (those who do receive the intervention).

Evaluation designs can be broadly classified into three categories: experimental, quasi-experimental and non-experimental. (The term control group is used when the evaluation employs an experimental design and the term comparison group is associated with a quasi-experimental design. In non-experimental design, program participants are compared to non-participants by controlling statistically for differences between participants and non-participants.)

These three evaluation designs vary in feasibility, cost, the degree of clarity and validity of results, and the degree of selection bias. Please read on to learn more.

Experimental (randomized)
See Key Readings for more information

This design involves gathering a set of individuals (or other unit of analysis) equally eligible and willing to participate in the program and randomly dividing them into two groups: those who receive the intervention (treatment group) and those from whom the intervention is withheld (control group). 

Experimental or randomized designs are generally considered the most robust of the evaluation methodologies. By randomly allocating the intervention among eligible beneficiaries, the assignment process itself creates treatment and control groups that are statistically equivalent to one another, given appropriate sample sizes. This is a very powerful outcome because, in theory, the control groups generated through random assignment serve as a perfect counterfactual, free from the troublesome selection bias issues that affect evaluations without random assignment.

The main benefit of this technique is the simplicity in interpreting results: the program impact on the outcome being evaluated can be measured by the difference between the mean outcome in the treatment group sample and the mean outcome in the control group sample.
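As a minimal sketch of this difference-in-means calculation, the snippet below uses made-up outcome values; the function name, the data and the standard error formula (for two independent samples) are illustrative assumptions, not part of the source material.

```python
from math import sqrt
from statistics import mean, variance


def difference_in_means(treatment, control):
    """Return (impact estimate, standard error) for a randomized design."""
    impact = mean(treatment) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    return impact, se


# Hypothetical outcome data (for example, monthly income), for illustration only.
treatment_outcomes = [52.0, 61.0, 58.0, 49.0, 65.0, 57.0]
control_outcomes = [48.0, 50.0, 55.0, 44.0, 53.0, 50.0]

impact, se = difference_in_means(treatment_outcomes, control_outcomes)
print(f"estimated impact: {impact:.2f} (standard error {se:.2f})")
```

In a real evaluation the samples would be far larger, and the standard error indicates how much of the observed difference could be due to chance.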

While experimental designs tend to be considered the optimum approach to estimating project impact, in practice there are several problems:

  1. Randomization may be unethical owing to the denial of benefits or services to otherwise eligible members of the population for the purposes of the study.
  2. It can be politically difficult to provide an intervention to one group and not another.
  3. The scope of the intervention may rule out the possibility of selecting a control group such as with a nationwide program or policy change.
  4. Individuals in treatment or control groups may change certain identifying characteristics during the experiment that could invalidate or contaminate the results. If, for example, people move in and out of a project area, they may move in and out of the treatment or control group. Alternatively, people who were denied a program benefit may seek it through alternative sources, or those being offered a program may not take up the intervention.
  5. It may be difficult to ensure that assignment is truly random. An example of this might be administrators who exclude high-risk applicants to achieve better results.
  6. Experimental designs can be expensive and time consuming in certain situations, particularly in the collection of new data.

Quasi-experimental
See Key Readings for more information

This design consists of constructing a comparison group using matching or reflexive comparisons.

Matching involves identifying non-participants who are comparable to program participants in essential characteristics. The two groups should be matched on the basis of observed characteristics, whether a few or many, that are known or believed to influence program outcomes. Matched comparison groups can be selected before project implementation (prospective studies) or afterwards (retrospective studies).

The main advantage of evaluations using matching methods is that they can draw on existing data sources and are thus often quicker and cheaper to implement. The principal disadvantages are that the reliability of the results is often reduced, as the methodology may not completely solve the problem of selection bias; and the matching methods can be statistically complex, thus requiring considerable expertise in the design of the evaluation and in analysis and interpretation of the results.

The most widely used type of matching is propensity score matching, in which the comparison group is matched to the treatment group by using the propensity score (predicted probability of participation given observed characteristics). This method allows one to find a comparison group from a sample of non-participants closest in terms of observable characteristics to a sample of program participants.

Propensity score matching is a very useful method when there are many potential characteristics to match between a sample of program participants and a sample of non-participants. Instead of aiming to ensure that the matched comparison unit for each participant has exactly the same value of the control variables X, the same result can be achieved by matching on the predicted probability of program participation, P, given X, which is called the propensity score of X. The range of propensity scores estimated for the treatment group should correspond closely to that for the retained sample of non-participants. The closer the propensity score, the better the match. In addition, a good comparison group comes from the same economic environment and is administered the same questionnaire as the treatment group by similarly trained interviewers.
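The matching step can be sketched as follows, assuming the propensity scores have already been estimated (for example with a logit model of participation on X, which is not shown). The record layout, the nearest-neighbour rule with replacement, and the caliper of 0.05 are illustrative assumptions rather than a prescribed procedure.

```python
def match_on_propensity(participants, non_participants, caliper=0.05):
    """Nearest-neighbour matching (with replacement) on the propensity score.

    Each record is a (unit_id, propensity_score, outcome) tuple. Participants
    with no non-participant within `caliper` are dropped, so the retained
    samples cover a similar range of scores.
    """
    pairs = []
    for pid, p_score, p_outcome in participants:
        # Closest non-participant in terms of propensity score.
        nid, n_score, n_outcome = min(
            non_participants, key=lambda rec: abs(rec[1] - p_score))
        if abs(n_score - p_score) <= caliper:
            pairs.append((pid, nid, p_outcome - n_outcome))
    return pairs


def mean_impact(pairs):
    """Average outcome difference across matched pairs."""
    return sum(diff for _, _, diff in pairs) / len(pairs)


# Hypothetical records: (id, estimated propensity score, outcome).
participants = [("P1", 0.80, 30.0), ("P2", 0.62, 26.0),
                ("P3", 0.45, 21.0), ("P4", 0.95, 35.0)]
non_participants = [("N1", 0.78, 27.0), ("N2", 0.60, 22.0),
                    ("N3", 0.47, 20.0), ("N4", 0.20, 12.0),
                    ("N5", 0.10, 9.0)]

pairs = match_on_propensity(participants, non_participants)
print(pairs)
print(f"mean impact on matched participants: {mean_impact(pairs):.2f}")
```

In this made-up data, participant P4 has no non-participant with a similar score and is left unmatched, which illustrates why the two score ranges need to overlap.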

Reflexive comparison is another type of quasi-experimental design. In a reflexive comparison, the counterfactual is constructed on the basis of the situation of program participants before the program. Thus, program participants are compared to themselves before and after the intervention and function as both treatment and comparison group. This type of design is particularly useful in evaluations of full-coverage interventions such as nationwide policies and programs in which the entire population participates and there is no scope for a control group.

There is, however, a major drawback with reflexive comparisons: the situation of program participants before and after the intervention may change owing to myriad reasons independent of the program. For example, participants in a training program may have improved employment prospects after the program. While this improvement may be due to the program, it may also be due to the fact that the economy is recovering from a past crisis and employment is growing again. Unless they are carefully done, reflexive comparisons may not be able to distinguish between the program and other external effects, thus compromising the reliability of results.
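The arithmetic behind this drawback can be illustrated with a small sketch. All figures are hypothetical, and the size of the economy-wide trend is simply assumed here; in practice the evaluator cannot observe it from a reflexive comparison alone.

```python
from statistics import mean

# Hypothetical employment rates (%) for the same participant groups,
# measured before and after a training program. Illustration only.
before = [40.0, 35.0, 50.0, 45.0]
after = [52.0, 46.0, 61.0, 57.0]

# Reflexive estimate: participants are compared with themselves over time.
reflexive_estimate = mean([a - b for a, b in zip(after, before)])

# Assumption for illustration: the recovering economy alone would have
# raised employment by this much even without the program.
assumed_external_trend = 8.0

effect_net_of_trend = reflexive_estimate - assumed_external_trend

print(f"before/after change: {reflexive_estimate:.1f} points")
print(f"change net of assumed trend: {effect_net_of_trend:.1f} points")
```

The before/after change mixes the program effect with the external trend, so attributing all of it to the program would overstate the impact in this example.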

Non-experimental 
See Key Readings for more information

This evaluation design can be used when it is not possible to randomly select a control group, identify a suitable comparison group through matching methods or use reflexive comparisons. In such situations, program participants can be compared to non-participants using statistical methods to account for differences between the two groups.  

The instrumental variables method is one of the econometric techniques that can be used to compare program participants and non-participants while correcting for selection bias. It consists of using one or more variables (instruments) that affect participation but do not affect outcomes other than through participation. This identifies the exogenous variation in outcomes attributable to the program, recognizing that its placement may not be random but purposive. The instrumental variables are first used to predict program participation; the program impact is then estimated using the predicted values from the first equation.
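The two-step procedure can be sketched on simulated data. Everything in the simulation is an assumption made for illustration: a single binary instrument, an unobserved "motivation" variable that drives both participation and outcomes, a constant true impact of 5, and helper names such as `ols` and `two_stage_impact`.

```python
import random
from statistics import mean


def ols(x, y):
    """Slope and intercept of a one-regressor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx


def two_stage_impact(instrument, participation, outcome):
    # First equation: predict participation from the instrument.
    slope, intercept = ols(instrument, participation)
    predicted = [intercept + slope * z for z in instrument]
    # Second equation: regress the outcome on predicted participation.
    impact, _ = ols(predicted, outcome)
    return impact


random.seed(0)
TRUE_IMPACT = 5.0  # assumed for the simulation
instrument, participation, outcome = [], [], []
for _ in range(20000):
    motivation = random.gauss(0, 1)   # unobserved by the evaluator
    z = random.randint(0, 1)          # instrument, unrelated to motivation
    # Participation depends on the instrument and on motivation.
    d = 1 if 1.5 * z + motivation + random.gauss(0, 1) > 0.75 else 0
    # The outcome depends on participation and motivation, not directly on z.
    y = 10 + TRUE_IMPACT * d + 3 * motivation + random.gauss(0, 1)
    instrument.append(z)
    participation.append(d)
    outcome.append(y)

naive_estimate, _ = ols(participation, outcome)
iv_estimate = two_stage_impact(instrument, participation, outcome)
print(f"naive comparison: {naive_estimate:.2f}")
print(f"two-stage estimate: {iv_estimate:.2f} (true impact {TRUE_IMPACT})")
```

Under these assumptions the naive comparison of participants and non-participants is pushed upward by motivation, while the two-stage estimate should land near the true impact. With real data the result depends entirely on whether the instrument genuinely satisfies both conditions.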

As with quasi-experimental methods, this evaluation design is relatively cheap and easy to implement since it can draw on existing data sources. However, it poses a number of difficulties. First, the reliability of results is often reduced as the methodology is less robust statistically. Second, the methodology has some statistical complexities that may require some expertise in the design of the evaluation and in the analysis and interpretation of results. Third, although it is possible to partially correct for selection bias, full correction remains a challenge.

Selection bias
See Key Readings for more information.

One consideration that may affect the choice of the evaluation design is the problem of bias, that is, the extent to which various subgroups of a target population are likely to participate differently in a program, thus affecting the sample and ultimately the results. There are two types of bias:

  1. Those due to differences in observables (which can be estimated from the data) and
  2. Those due to unobservables (which are either not known by the researcher or are not easily measured), often called selection bias.

The problem of selection bias in impact evaluation is caused by the fact that program participants differ from non-participants in characteristics that cannot be observed by the evaluator and affect both the decision to participate in the program and its outcome (e.g., ability or motivation). For example, program participants may be individuals who have the most to gain from a particular program and are more motivated to commit to program activities. Thus, outcome changes observed among these nonrandom groups of individuals would indicate the program impact on motivated participants, but may not reflect how the program on average would affect the target population.

The selection bias could go in the opposite direction, too. Individuals may choose to participate in a program because of a pessimistic perception of the alternatives available to them outside the program. If their perceptions are based on a realistic assessment of their opportunities, participants' outcomes in the absence of the program would be lower than those of non-participants with identical observable characteristics, and the two groups would therefore not be comparable.

The problem of selection bias arises because of missing data on the common factors affecting both participation and outcomes. In theory, randomized or experimental evaluation is free from the bias problem whereas the problem is practically unavoidable when non-experimental data are employed.

The process of randomization ensures that before the intervention takes place the treatment and control groups are statistically equivalent, on average, with respect to all characteristics, observed and unobserved. Randomized experiments solve the problem of selection bias by generating an experimental control group of people who would have participated in a program but who were randomly denied access to the program or treatment. Random assignment does not remove the selection bias but instead balances the bias between the participant (treatment) and non-participant (control) groups, so that it cancels out when calculating the mean impact estimate. Any differences in the average outcomes of the two groups after the intervention can then be attributed to the intervention.
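A small simulation can illustrate how the bias cancels out. The setup is entirely hypothetical: unobserved motivation raises outcomes, only motivated individuals apply, and the true impact is fixed at 2. The first comparison sets applicants who all receive the program against people who did not apply; the second randomly denies access to half of the applicants.

```python
import random
from statistics import mean

random.seed(1)
TRUE_IMPACT = 2.0  # assumed for the simulation


def outcome(motivation, treated):
    """Outcome depends on unobserved motivation and on treatment."""
    effect = TRUE_IMPACT if treated else 0.0
    return 10 + 3 * motivation + effect + random.gauss(0, 1)


def difference_in_means(units):
    """units is a list of (treated, outcome) pairs."""
    treated = [y for t, y in units if t]
    untreated = [y for t, y in units if not t]
    return mean(treated) - mean(untreated)


# Unobserved motivation for a hypothetical target population.
population = [random.gauss(0, 1) for _ in range(20000)]
applicants = [m for m in population if m > 0]       # would participate
non_applicants = [m for m in population if m <= 0]  # would not

# Non-random comparison: all applicants treated, compared with non-applicants.
self_selected = ([(True, outcome(m, True)) for m in applicants]
                 + [(False, outcome(m, False)) for m in non_applicants])

# Randomized design: applicants are randomly given or denied access.
randomized = []
for m in applicants:
    treated = random.random() < 0.5
    randomized.append((treated, outcome(m, treated)))

self_selected_estimate = difference_in_means(self_selected)
randomized_estimate = difference_in_means(randomized)
print(f"self-selected comparison: {self_selected_estimate:.2f}")
print(f"randomized comparison: {randomized_estimate:.2f}")
```

In the randomized comparison both groups are equally motivated on average, so motivation drops out of the difference; in the self-selected comparison it does not.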

In quasi-experimental and non-experimental designs, econometric techniques are used to model the participation and outcome processes with the aim of arriving at an unbiased estimate of program impact. Propensity score matching and multivariate regression methods control for selection on observables, whereas instrumental variable methods control for selection on unobservables. The general idea is to compare program participants and non-participants holding selection processes constant. The validity of quasi-experimental and non-experimental evaluation results depends on how well the model is specified.

Related Sections:

  • See Data & Data Sources for a collection of data initiatives collected for evaluation purposes and for a guide to qualitative and quantitative impact evaluation instruments.
  • See Training Events and Materials for presentations on how to employ the methods and techniques introduced here.