
Data Quality

An essential component of any reliable research study is credible data. An impact evaluation that attempts to attribute causality to a specific intervention using data that is poorly collected and managed is at great risk of producing incorrect results. Conducting analysis on data full of errors and inaccurate information jeopardizes the value that policy makers look for in well-organized impact evaluations. For this reason, ensuring the reliability of data, whether primary or secondary, is a critical step in the evaluation process.

There are at least three types of threats to data quality: bias, measurement error, and bad questionnaire design.

Bias refers to a systematic skewing of the data collected. A biased estimator is one that systematically over-estimates or under-estimates the quantity being estimated. A biased sample is one in which not all members of the population are equally likely to be represented. Bias may occur because of under-coverage of some groups, large non-response rates among particular groups, or incentives for lying in response to survey questions. For example, if those working longer hours in the sampled population have a higher non-response rate and also tend to earn more, income levels will be underestimated. Proper sampling design, together with particular attention to combating non-response and accounting for incentives to misreport, can help the evaluator ensure that the data collected is sufficient, accurate, and appropriate for making analytical statements about the population of interest.
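The effect of differential non-response can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population, the link between hours and income, and the response probabilities are all invented, and the function name simulate_nonresponse is hypothetical. Because income is assumed to rise with hours worked, and people working long hours are assumed to respond less often, the mean among respondents falls below the true population mean.

```python
import random


def simulate_nonresponse(n=20000, seed=0):
    """Compare the true mean income with the mean among respondents only."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        hours = rng.uniform(20, 60)  # weekly hours worked (invented range)
        # Assumption: income rises with hours worked, plus some variation.
        income = 10 * hours + rng.gauss(0, 50)
        # Assumption: people working more than 45 hours respond less often.
        p_respond = 0.4 if hours > 45 else 0.9
        responded = rng.random() < p_respond
        population.append((income, responded))

    true_mean = sum(inc for inc, _ in population) / n
    respondents = [inc for inc, responded in population if responded]
    observed_mean = sum(respondents) / len(respondents)
    return true_mean, observed_mean


true_mean, observed_mean = simulate_nonresponse()
print(f"true mean: {true_mean:.1f}, respondent mean: {observed_mean:.1f}")
```

The direction of the bias depends entirely on who fails to respond. If the non-responding group had lower incomes instead, the same mechanism would produce an overestimate.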

Measurement error arises from mistakes in data collection and/or data entry that add random error or “noise” to the data collected. Measurement error leads to imprecise estimators and weakens the evaluator’s ability to present meaningful results. There is often a tendency to conflate measurement error with bias. The two, however, are distinct: measurement error is non-systematic and has mean zero, whereas bias refers to a systematic over-estimation or under-estimation.
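The distinction can be made concrete with a short simulation. This is a minimal sketch using invented numbers, not data from any real survey: mean-zero noise leaves the average roughly unchanged but increases the spread, while a systematic shift moves the average itself.

```python
import random
import statistics

rng = random.Random(1)

# Invented "true" values for 10,000 respondents.
true_values = [rng.uniform(100, 200) for _ in range(10000)]

# Measurement error: mean-zero random noise added to each recorded value.
noisy = [v + rng.gauss(0, 20) for v in true_values]

# Bias: every recorded value is shifted in the same direction.
shifted = [v + 15 for v in true_values]

for label, values in [("true", true_values), ("noisy", noisy), ("shifted", shifted)]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    print(f"{label:>8}: mean={mean:.1f}  sd={sd:.1f}")
```

In this sketch the noisy series has nearly the same mean as the true values but a larger standard deviation, whereas the shifted series has the same standard deviation but a mean that is off by the size of the shift.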

Bad questionnaire design is a problem in surveys with ambiguously worded questions. Survey questions may be unclear, may be understood differently from what the evaluator intended, and/or may lend themselves to multiple interpretations. Questions relating to cash transfers, for example, are often misunderstood. The distinction between gifts and loans often differs from one setting to another, and survey questions relating to monetary transfers may have to be adapted to the setting in order to capture the information they are intended to capture. The quality of the questions asked through surveys is of great importance in ensuring that data is reliable. Surveys should be designed to fit the country and local context, pre-tested, and screened for reliability of answers. Questionnaires should minimize the potential for confusion and inaccurate answers.

Following the collection of data, the next critical point for ensuring data quality is data entry. It is often worthwhile to have each questionnaire entered twice, by two different individuals, and then to compare the two versions for inconsistencies, check any discrepancies against the original questionnaire completed in the field, and retain the correct version. Keeping multiple versions of the survey data is also a useful way of guarding against lost or corrupted data files.
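The comparison step in double data entry is simple to automate. The sketch below assumes each entry pass is stored as a dictionary keyed by questionnaire ID; the function name compare_entries, the field names, and the sample values are all hypothetical. It lists every field where the two passes disagree so that those questionnaires can be checked against the paper originals.

```python
def compare_entries(entry_a, entry_b):
    """Return (questionnaire_id, field, value_a, value_b) for every mismatch."""
    discrepancies = []
    # Include IDs and fields present in either pass, so omissions are flagged too.
    for qid in sorted(set(entry_a) | set(entry_b)):
        rec_a = entry_a.get(qid, {})
        rec_b = entry_b.get(qid, {})
        for field in sorted(set(rec_a) | set(rec_b)):
            val_a, val_b = rec_a.get(field), rec_b.get(field)
            if val_a != val_b:
                discrepancies.append((qid, field, val_a, val_b))
    return discrepancies


# Invented example: the second operator transposed the digits of an age.
first_pass = {
    "Q001": {"age": 34, "income": 1200},
    "Q002": {"age": 51, "income": 800},
}
second_pass = {
    "Q001": {"age": 34, "income": 1200},
    "Q002": {"age": 15, "income": 800},
}

for row in compare_entries(first_pass, second_pass):
    print(row)
```

A mismatch only shows that at least one of the two entries is wrong. Deciding which value to retain still requires going back to the questionnaire itself.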

Electronic data collection in the field is one way to minimize errors introduced when data is transferred from collection to entry. When each enumerator is given a handheld or laptop computer, survey answers can be entered directly in digital format and skip patterns in the survey can be programmed to apply automatically. Additionally, internal consistency checks can be automated.
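As a rough illustration of what programmed skips and automated consistency checks might look like, the sketch below validates a single response stored as a dictionary. The question names, the skip rule, and the thresholds are invented for this example and are not taken from any particular survey instrument or data collection software.

```python
def validate_response(resp):
    """Return a list of problems found in one survey response (a dict)."""
    problems = []

    # Skip pattern: hours worked applies only to employed respondents.
    if resp.get("employed") == "no" and resp.get("hours_worked") is not None:
        problems.append("hours_worked should be skipped when employed is 'no'")
    if resp.get("employed") == "yes" and resp.get("hours_worked") is None:
        problems.append("hours_worked is required when employed is 'yes'")

    # Range check on a single answer.
    age = resp.get("age")
    if age is None or not (0 <= age <= 120):
        problems.append("age is missing or out of range")

    # Internal consistency: years of schooling cannot exceed age.
    schooling = resp.get("years_schooling")
    if age is not None and schooling is not None and schooling > age:
        problems.append("years_schooling exceeds age")

    return problems


# Invented response that violates the skip pattern and the consistency rule.
response = {"employed": "no", "hours_worked": 40, "age": 12, "years_schooling": 15}
for problem in validate_response(response):
    print(problem)
```

Running checks like these at the time of the interview lets the enumerator resolve a problem with the respondent on the spot, rather than discovering it during data cleaning.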

The use of secondary data is an excellent way to take advantage of preexisting datasets and to avoid duplicating effort. However, the same concerns about data quality that apply to primary data collection also apply to secondary data usage. When using secondary data, one must become thoroughly familiar with the sampling strategy used in the data collection, the population of interest, the questionnaires, the coding of the data files, and the interpretation of the data. Cleaning the data of errors, contradictions, and irregularities is essential. Obtaining the data in its raw form can sometimes help to identify pervasive quality problems. Verifying the quality of secondary data before analysis is the responsibility of the evaluator.

 




Permanent URL for this page: http://go.worldbank.org/JB5GF8OGJ0