Universidade Federal de Pelotas
Programa de Pós-Graduação em Organizações e Mercados (PPGOM)
Professor André Carraro, Felipe Garcia

Problem Set 1

1. Why is randomizing the treatment in social experiments important for identifying the causal effect?
2. Describe the simple randomization method of treatment assignment. What are its advantages, and what problems can arise?
3. Researchers usually place near the start of a paper a table with the mean of each pre-treatment covariate, separately for treated and control units, and a column with tests of differences in means between the two groups. What does such a table allow one to analyze? What are the limitations of this analysis?
4. How should one decide which covariates to consider when assessing the quality of the randomization? What are the problems with using pre-treatment covariates?
5. Describe the method of randomization with stratification. What are its advantages and disadvantages?
6. Which method is superior, simple randomization or randomization with stratification? Why? In which situations should one be careful with stratification?
7. Describe the randomization method that matches units in pairs. What are its advantages and disadvantages? Why can one not use many covariates for matching in small samples?
8. What do rerandomization methods consist of? What problems can arise?
9. Why does sample size matter? In what sense do larger samples improve the quality of the randomization?
10. Using the dataset experimental.dta, compute the means and standard deviations of the pre-treatment covariates and test for differences in means between treated and control units. Also compute the effect of the treatment on the outcome variable re78.
11. Redo the previous exercise with a random subsample of size 30. Do the results differ?

American Economic Journal: Applied Economics, Vol. 1, No. 4 (October 2009), 200-232
http://www.aeaweb.org/articles.php?doi=10.1257/app.1.4.200

In Pursuit of Balance: Randomization in Practice in Development Field Experiments

By Miriam Bruhn and David McKenzie*

We present new evidence on the randomization methods used in existing experiments, and new simulations comparing these methods. We find that many papers do not describe the randomization in detail, implying that better reporting is needed. Our simulations suggest that in samples of 300 or more, the different methods perform similarly. However, for very persistent outcome variables, and in smaller samples, pairwise matching and stratification perform best and appear to dominate the rerandomization methods commonly used in practice. The simulations also point to specific recommendations for which variables to balance on, and for which controls to include in the ex post analysis. (JEL C83, C93, O12)

* Bruhn: Development Research Group, World Bank, MSN MC3-307, 1818 H Street N.W., Washington, DC 20433 (e-mail: mbruhn@worldbank.org); McKenzie: Development Research Group, World Bank, MSN MC3-307, 1818 H Street N.W., Washington, DC 20433 (e-mail: dmckenzie@worldbank.org). We thank the leading researchers in development field experiments who participated in our short survey, as well as colleagues who have shared their experiences with implementing randomization. We thank Angus Deaton, David Evans, Xavier Gine, Guido Imbens, Ben Olken, and seminar participants at the World Bank for helpful comments. We are also grateful to Radu Ban for sharing his pairwise matching Stata code, Jishnu Das for the LEAPS data, and Kathleen Beegle and Kristen Himelein for providing us with their constructed IFLS data. All views are, of course, our own. To comment on this article in the online discussion forum, or to view additional materials, visit the articles page at http://www.aeaweb.org/articles.php?doi=10.1257/app.1.4.200.

Randomized experiments are increasingly used in development economics. Historically, many randomized experiments were large-scale, government-implemented social experiments, such as Moving To Opportunity in the United States or Progresa/Oportunidades in Mexico. These experiments allowed for little involvement by researchers in the actual randomization. In contrast, in recent years many experiments have been directly implemented by researchers themselves, or in partnership with nongovernmental organizations (NGOs) and the private sector. These small-scale experiments, with sample sizes often comprising 100 to 500 individuals, or 20 to 100 schools or health clinics, have greatly expanded the range of research questions that can be studied using experiments, and have provided important and credible evidence on a range of economic and policy issues. Nevertheless, this move toward smaller sample sizes means researchers increasingly face the question of not just whether to randomize, but how to do so. This paper provides the first comprehensive look at how researchers are carrying out randomizations in development field experiments, and then analyzes some of the consequences of these choices.

Simple randomization ensures the allocation of treatment to individuals or institutions is left purely to chance, and is not systematically biased by deliberate selection of individuals or institutions into the treatment. Randomization thus ensures that the treatment and control samples are, in expectation, similar on average, both in terms of observed and unobserved characteristics. Furthermore, it is often argued that the simplicity of experiments offers considerable advantage in making the results convincing to other social scientists and policymakers, and that, in some instances, random assignment is the fairest and most transparent way of choosing the recipients of a new pilot program (Gary Burtless 1995).

However, it has long been recognized that while pure random assignment guarantees that the treatment and control groups will have identical characteristics on average, in any particular random allocation the two groups will differ along some dimensions, with the probability that such differences are large falling with sample size.[1]
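The link between sample size and chance imbalance can be illustrated with a small simulation. The sketch below is not from the paper: it assumes a binary covariate with 30 percent prevalence, two equal-sized arms, and a 10-percentage-point threshold for a "large" gap. These values, and the function name `imbalance_probability`, are illustrative assumptions only.

```python
import random


def imbalance_probability(n, prevalence=0.3, threshold=0.10, draws=2000, seed=0):
    """Estimate how often simple randomization leaves the treated and control
    shares of a binary covariate more than `threshold` apart.

    All defaults are illustrative assumptions, not values taken from the paper's
    own simulations.
    """
    rng = random.Random(seed)
    half = n // 2
    large_gaps = 0
    for _ in range(draws):
        # Draw the covariate for n units, then shuffle so that the first
        # `half` units form the treatment group (simple randomization).
        x = [1 if rng.random() < prevalence else 0 for _ in range(n)]
        rng.shuffle(x)
        treated, control = x[:half], x[half:]
        gap = abs(sum(treated) / len(treated) - sum(control) / len(control))
        # Small tolerance so a gap of exactly `threshold` is not counted
        # because of floating-point rounding.
        if gap > threshold + 1e-9:
            large_gaps += 1
    return large_gaps / draws


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, round(imbalance_probability(n), 3))
```

Under these assumptions the estimated share of badly balanced allocations should shrink as n grows. The exact figures depend on the seed, the number of draws, and how the threshold is defined.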
Although ex post adjustment can be made for such chance imbalances, this is less efficient than achieving ex ante balance, and cannot be used in cases in which all individuals with a given characteristic are allocated to just the treatment group.

The standard approach to avoiding imbalance on a few key variables is stratification (or blocking), originally proposed by R. A. Fisher (1935). Under this approach, units are randomly assigned to treatment and control within strata defined by usually one or two observed baseline characteristics. However, in practice it is unlikely that one or two variables will explain a large share of the variation in the outcome of interest, leading to attempts to balance on multiple variables. One such method, when baseline data are available, is pairwise matching (Robert Greevy et al. 2004; Kosuke Imai, Gary King, and Clayton Nall forthcoming).

The methods of implementing randomization have historically been poorly reported in medical journals, leading to the formulation of the CONSORT guidelines that set out standards for the reporting of clinical trials (Kenneth F. Schulz 1996). The recent explosion of field experiments in development economics has not yet met these same standards, with many papers omitting key details of the method in which randomization is implemented. For this reason, we conducted a survey of leading researchers carrying out randomized experiments in developing countries. This reveals common use of methods to improve baseline balance, including several rerandomization methods not discussed in print. These are carrying out an allocation to treatment and control, and then using a statistical threshold or ad hoc procedure to decide whether or not to redraw the allocation; and drawing 100 or 1,000 allocations to treatment and control, and choosing the one that shows best balance on a set of observable variables.

This paper discusses the pros and cons of these different methods for striving toward balance on observables. Proponents of methods such as
stratification, matching, and minimization claim that such methods can improve efficiency, increase power, and protect against type I errors (Kernan et al. 1999), and do not seem to have significant disadvantages, except in small samples (Imai, King, and Elizabeth A. Stuart 2008; Greevy et al. 2004; Mikel Aickin 2001).[2] However, it is precisely in small samples that the choice of randomization method becomes important, since in large samples all methods will achieve balance. We simulate different randomization methods in four panel datasets. We then compare balance in outcome variables at baseline and at follow-up. The simulations show that when methods other than pure randomization are used, the degree of balance achieved on baseline variables is much greater than that achieved on the outcome variable (in the absence of treatment) in the follow-up period. The simulations further show that in samples of 300 observations or more, the choice of method is not very important for the degree of balance in many outcomes at follow-up. In small samples, and with very persistent outcomes, however, matching or stratification on relevant baseline variables achieves more balance in follow-up outcomes than does pure randomization.

[1] For example, Walter N. Kernan et al. (1999) consider a binary variable that is present in 30 percent of the sample. They show that the chance that the two treatment group proportions will differ by more than 10 percent is 38 percent in an experiment with 50 individuals, 27 percent in an experiment with 100 individuals, 9 percent for an experiment with 200 individuals, and 2 percent for an experiment with 400 individuals.

We use our simulation results and theory to help answer many of the important practical questions facing researchers engaged in randomized experiments. The results allow us to provide guidance on how to conduct inference after stratification, matching, or rerandomization. In practice, it appears that many researchers
ignore the method of randomization in inference. We show that this leads to hypothesis tests with incorrect size. On average, the standard errors are overly conservative when the method of randomization is not controlled for in the analysis, implying that researchers may not detect treatment effects that they would detect if the inference did take into account the randomization method. However, although this is the case on average, in a nontrivial proportion of draws it will be the case that not controlling for the randomization method will result in larger standard errors than if the randomization method is controlled for. Thus, it is possible that not controlling for the randomization method could lead the researcher to find a significant effect that is no longer significant when stratum or pair dummies are included. Moreover, we show that stratifying, matching, or rerandomizing, and then analyzing the data without controlling for the method of randomization, results in lower power than if a pure random draw was used to allocate treatments, except in cases in which the variables that balance is sought for have no predictive power for the future outcome of interest (in which case there is no need to seek balance on them anyway).

The paper also discusses the use and abuse of tests for baseline differences in means, the impact of balancing observables on achieving balance on unobservables, and the issue of how many (and which) variables to use for stratifying or matching. Finally, based on our simulation results and the previous econometric literature, this paper provides a list of actionable recommendations for researchers performing and reporting on randomized experiments.

This paper draws on a large literature of clinical trials, where many related issues have been under discussion for several decades, drawing out the lessons for development field experiments. It complements several recent papers in development on randomized experiments.[3] The paper builds on the recent handbook chapter by Esther Duflo, Rachel Glennerster, and Michael Kremer (2008), which aims to provide a "how to" of implementing experiments. Our focus differs, considering how the actual randomization is implemented in practice, and considering matching and rerandomization approaches. Finally, we contribute to the existing literature through new simulations that illustrate the performance of the different methods in a variety of situations experienced in practice.

[2] One other argument in favor of ex ante balancing is that if the treatment effect is heterogeneous and varies with observed covariates, ex ante balancing increases the precision of subgroup analysis.

[3] Summaries of recent experiments and advocacy of the policy case are found in Kremer (2003), Duflo and Kremer (2004), Duflo (2005), and Banerjee et al. (2007b).

While our focus is on field experiments in development economics, to date the field with the most active involvement of researchers in randomization, randomized experiments are also increasingly being used to investigate important policy questions in other fields (Steven D. Levitt and John A. List 2008). In common with the development literature, the extant literature in these other fields has often not explained the precise mechanism used for randomizing. However, it does appear that rerandomization methods are also being employed in some of these studies. The ongoing New York public schools project being undertaken by the American Inequality Lab is one such high-profile example. The lessons of this paper will also be important in designing upcoming experiments in other fields of economics.

The remainder of the paper is set out as follows. Section I provides a stocktaking of how randomization is currently being implemented, drawing on a summary of papers and a survey of leading experts in development field experiments. Section II describes the datasets used in our simulations and outlines in more detail the different methods of randomization. Section III
provides simulation evidence on the relative performance of the different methods, and on answers to key questions faced in practice. Section IV concludes with our recommendations.

I. How Is Randomization Being Implemented?

A. Randomization as Described in Papers

We begin by reviewing a selection of research papers containing randomized experiments in development economics. Table 1 summarizes a selection of relatively small-scale randomized experiments with baseline data, often implemented via NGOs or as pilot studies.[4] For each study, we list the unit in which randomization occurs. Typical sample sizes are 100 to 300 units, with the smallest sample size being 10 geographic areas used in Nava Ashraf, Dean Karlan, and Wesley Yin (2006a).

The transparency in allocating a program to participants is likely to be greatest when assignment to treatment is done in public rather than in private.[5] Only 2 out of the 18 papers reviewed (Marianne Bertrand et al. 2007, and Erica Field and Rohini Pande 2008) note whether the randomization was done in public or not (in both cases, public lotteries). The majority of the other randomizations, we believe, are private, or at most semipublic, where perhaps the NGO and/or government officials witness the randomization draw, but not the recipients of the program. However, this is not stated explicitly in the papers.

[4] We do not include experiments undertaken by the authors for objectivity reasons, and because the final write-up of our paper has been influenced by the current paper.

[5] Of course, privately drawn randomizations still have the virtue of being able to tell participants that the reason they were chosen or not chosen is random. However, it is our opinion that carrying out the randomization in a public or semipublic manner can make this more credible in the eyes of participants in many settings. This may particularly be the case when it is the government doing the allocation (see Claudio Ferraz and Federico Finan 2008, 5, who note "to ensure a fair and transparent process, representatives of the press, political parties, and members of civil society are all invited to witness the lottery"). Nevertheless, public randomization may not be feasible or desirable in particular settings. We merely wish to urge researchers to consider whether the randomization can be easily publicly implemented in their setting, and to note in their papers how the randomization was done.

Table 1. Summary of Selected Randomized Experiments in Developing Countries

Paper | Randomization unit | Sample size | Public or private | Stratification used | Matched pairs | Number of strata | Strata or pair dummies used | Table for assessing balance | Number of variables used to check balance
Published/forthcoming papers
Ashraf, Karlan, and Yin (2006b) | Microfinance clients | 1,777 | NA | No | No | | | Yes | 12
Ashraf, Karlan, and Yin (2006a) | Barangay (area) | 10 | NA | No | Yes | | Yes | Yes | 12
Banerjee et al. (2007a) | School | 98 | NA | Yes | No | NA | No | Yes | 4
 | School | 111 | NA | Yes | No | NA | No | Yes | 4
 | School | 67 | NA | Yes | No | NA | No | Yes | 4
Bertrand et al. (2007) | Men wanting a driver's license | 822 | Public | Yes (note 2) | No | 23 | Yes | Yes | 22
Gustavo J. Bobonis, Edward Miguel, and Char Puri-Sharma (2006) | Preschool cluster | 155 | NA | No | No | | | Yes | 24
Field and Pande (2008) | Microfinance group | 100 | Public | No | No | | | No (note 1) |
Paul Glewwe et al. (2004) | School | 178 | NA | Yes | No | NA | No | Yes | 8
Edward Miguel and Kremer (2004) | School | 75 | NA | Yes | No | NA | No | Yes | 21
Benjamin A. Olken (2007a) | Village | 608 | NA | Yes | No | 156 | Yes | Yes | 10
 | Subdistrict | 156 | NA | Yes | No | 50 | Yes | Yes | 10
Working papers
Ashraf, James Berry, and Jesse M. Shapiro (2007) | Household | 1,260 | NA | Yes | No | 5 | Yes | Yes | 14
Martina Björkman and Jakob Svensson (2007) | Community | 50 | NA | Yes | No | NA | | Yes | 39
Duflo, Rema Hanna, and Stephen Ryan (2007) | School | 113 | NA | Yes | No | NA | No (note 4) | Yes | 15
Pascaline Dupas (2006) | School | 328 | NA | Yes | No | NA | No | Yes | 17
Glewwe, Albert Park, and Meng Zhao (2006) | Township | 25 | NA | No | Yes | | | Yes | 4
Fang He, Leigh L. Linden, and Margaret MacLeod (2007) | School division | 194 | NA | Yes | No | NA | No | Yes | 22
Dean Karlan and Martina Valdivia (2006) | Microfinance group | 239 | NA | Yes | No | NA | No (note 3) | Yes | 14
Kremer et al. (2006) | Spring | 200 | NA | Yes | No | NA | No | Yes | 28
Olken (2007b) | Village | 48 | NA | Yes | No | 2 | Yes | Yes | 8

Notes: NA denotes information not available in the paper.
1. Paper says check was done on a number of variables and is available upon request.
2. It appears randomization was done within recruitment session, but the paper was not clear on this.
3. Dummies for location are included, but not for credit officer, which was the other stratifying variable.
4. Dummies for district are included, but not for the number of households in the area, which were also used for stratifying within district.

Next, we examine which methods are being used to reduce the likelihood of imbalance on observable covariates. Thirteen studies use stratification, two use matched pairs, and only three appear to use pure randomization. Ashraf, Berry, and Shapiro (2007) is the only documented example we have found of one of the methods that the next section shows to be in common use in our survey of experts. They note, "at the time of randomization, we verified that observable characteristics were balanced across treatments, and in a few cases rerandomized when this was not the case."

Few papers provide the details of the method used, presumably because there has not been a discussion of the potential importance of these details in the economics literature. For example, stratification is common, but few studies actually give the number of strata used in the study. In practice, there appears to be disagreement as to whether it is necessary to include strata dummies in the analysis after stratification; more than half the studies using stratification do not include strata dummies. Finally, all but one of the papers listed in Table 1 present a table for comparing treatment and control groups and test for imbalance. The number of variables used for checking imbalance ranges from 4 to 39.

B. Randomization in Practice According to a Survey of Experts

The long lag between inception of a randomized
experiment and its appearance in at least working paper form means the results above do not necessarily represent how the most recent randomized evaluations are being implemented. We surveyed leading experts in randomized evaluations on their experience and approach to implementation. A short online survey was sent to 35 selected researchers in December 2007. The list was selected from members of the Abdul Latif Jameel Poverty Action Lab, Bureau of Research and Economic Analysis of Development (BREAD), and the World Bank who were known to have conducted randomized experiments. We had 25 of these experts answer the survey, with 7 out of the 10 individuals who did not respond having worked with those who did respond. The median researcher surveyed had participated in 5 randomized experiments, with a mean of 5.96.[6] Seventy-one percent of the experiments had baseline data (including administrative data) that could be used at the time when randomization to treatment was done.

[6] This is after top-coding the number of experiments at 15.

Preliminary discussions with several leading researchers established that several methods involving multiple random draws were being used in practice to increase the likelihood of balance on observed characteristics. One such approach is to take a random draw of assignment to treatment, examine the difference in means for several key baseline characteristics, and then rerandomize if the difference looks too large. The decision as to what is too large could be done subjectively, or according to some statistical cutoff criteria. For example, one survey respondent noted that they regressed variables like education on assignment to treatment, and then redid the assignment if these coefficients were too big.

The second approach takes many draws of assignment to treatment and then chooses the one that gives best balance on a set of observable characteristics according to some algorithm or rule. For example,
several researchers say they write a program to carry out 100 or 1,000 randomizations, and then for each draw, regress individual variables against treatment. They then choose the draw with the minimum maximum t-statistic.[7] Some impose further criteria, such as requiring the minimum maximum t-statistic for testing balance on observables to be below one. The number of variables used to check balance typically ranges from 5 to 20, and often includes the baseline levels of the main outcomes. The perceived advantage of this approach is to enable balance on more variables than possible with stratification, and to provide balance in means on continuous variables.

Researchers were asked whether they had ever used a particular method, and the method used in their most recent randomized experiment. All of the methods are often combined with some stratification, so we examine that separately. Table 2 reports the results. Most researchers have at some point used simple randomization (probably with some stratification). However, we also see much more use of other methods than is apparent from the existing literature. Fifty-six percent had used pairwise matching. Thirty-two percent of all researchers, and 46 percent of the group with 5 or more experiments, have subjectively decided whether to rerandomize based on an initial test of balance. The multiple draws process described above has also been used by 24 percent of researchers, and by 38 percent of the group with 5 or more experiments.

More detailed questions were asked about the most recent randomization, in an effort to obtain some of the information not provided in Table 1. Twenty-three of the 25 respondents provided information on these. Stratification was used in 14 out of the 15 experiments that were not employing a matched pair design. The number of variables used in forming strata was small: six used only one variable (typically geographic location), four used two variables (e.g., location and gender), and four used four variables. Of particular note is that it
appears rare to stratify on baseline values of the outcome value of interest (e.g., test scores, savings levels, or incomes), with only 2 of these 14 experiments including a baseline outcome as a stratifying factor. While the number of stratifying variables is small, there is much greater variation in the number of strata, ranging from 3 to 200, with a mean (median) of 47 (18). Only one researcher said that stratification was controlled for when calculating standard errors for the treatment effect.

A notable feature of the survey responses was a much greater number of researchers randomizing within matched pairs than is apparent from the existing development literature. The vast majority of these matches was not done using optimal or greedy Mahalanobis matching, but was instead based on only a few variables and commonly done by hand. In most cases, the researchers matched on discrete variables and their interactions only, and thus, in effect, the matching reduced to stratification.

One explanation for the difference in randomization approaches used by different researchers is that they reflect differences in context, with sample size, question of interest, and organization one is working with potentially placing constraints on the method that can be used for randomization. We therefore asked researchers for advice on how to evaluate the same hypothetical intervention designed to raise the income of day laborers.[8] The responses varied greatly across researchers, and include each of the methods given in Table 2. What is clear is that there appears to be no general agreement on how to go about randomizing in practice.

[7] An alternative approach used by another researcher is to regress the treatment on a set of baseline covariates and choose the draw with the lowest R-squared.

II. Data, Simulated Methods, and Variables for Balancing

A. Data

To compare the performance of the different randomization methods in practice, we chose four panel datasets that allow
us to examine a wide range of potential outcomes of interest, including microenterprise profits, labor income, school attendance, household expenditure, test scores, and child anthropometrics.

The first panel dataset covers microenterprises in Sri Lanka and comes from Suresh de Mel, David McKenzie, and Christopher Woodruff (2008). This data was collected as part of an actual randomized experiment, but we keep only data for firms that were in the control group during the first treatment round. The dataset contains information on firms' profits, assets, and many other firm and owner characteristics. The simulations we perform for this dataset are meant to mimic a randomized experiment that administers a treatment aimed at increasing firms' profits, such as a business training program.

The second dataset is a subsample of the Mexican employment survey (ENE). Our subsample includes heads of household between 20 and 65 years of age who were first interviewed in the second quarter of 2002, and who were reinterviewed in the following four quarters. We only keep individuals who were employed during the baseline survey, and imagine a treatment that aims at increasing their income, such as a training program or a nutrition program.

[8] See Web Appendix 1 for the exact question and the responses given.

Table 2. Survey Evidence on Randomization Methods Used by Leading Researchers

Method | Percent who have ever used: unweighted | Percent who have ever used: weighted | Percent who have ever used: ≥5 experiment group | Percent using method in most recent experiment
Single random assignment to treatment (possibly with stratification) | 80 | 84 | 92 | 39.1
Subjectively deciding whether to redraw | 32 | 52 | 46 | 4.3
Using a statistical rule to decide whether to redraw | 12 | 15 | 15 | 0.0
Carrying out many random assignments and choosing best balance | 24 | 45 | 38 | 17.4
Explicitly matching pairs of observations on baseline characteristics | 56 | 52 | 54 | 39.1
Number of researchers | 25 | 25 | 13 | 23

Notes: Methods are described in more detail in the paper. Weighted results weight by the number of experiments the researcher has participated in. ≥5 experiment group refers to researchers who have carried out 5 or more randomized experiments.

The third dataset comes from the Indonesian Family Life Survey (IFLS).[9] We use 1997 data as the baseline and 2000 data as the follow-up, and simulate two different interventions with the IFLS data. First, we keep only children aged 10-16 in 1997 that were in the sixth grade and in school. These children then receive a simulated treatment aimed at keeping them in school (in the actual data, about 26 percent have dropped out 3 years later). Second, we create a sample of households and simulate a treatment that increases household expenditure per capita.

The fourth dataset comprises child and household data from the Learning and Educational Achievement (LEAPS) project in Pakistan (Tahir Andrabi et al. 2008). We focus on children aged 8 to 12 at baseline, and examine two child outcome variables: math test scores and height z-scores.[10] The simulated treatments increase test scores or z-scores of these children. There is a wide range of policy experiments that have targeted these types of outcomes, from providing textbooks or school meals to giving conditional cash transfers or nutritional supplements.

B. Simulated Methods

For each dataset, we draw three subsamples of 30, 100, and 300 observations to investigate how the performance of different methods varies with sample size. All results are based on 10,000 bootstrap iterations that randomly split the sample into a treatment and a control group, according to five different methods.

The first method is a single random draw, which we take as the benchmark for our comparison with the pros and cons of other methods.

Stratification. The second method is stratification. Stratified randomization is the most well-known, and, as we have seen, commonly used method of preventing imbalance between treatment and control groups for the observed variables used in stratification. By eliminating particular
sources of differences between groups, stratification (aka blocking) can increase the sensitivity of the experiment, allowing it to detect smaller treatment differences than would otherwise be possible (George E. P. Box, J. Stuart Hunter, and William G. Hunter 2005). The most often perceived disadvantage of stratification compared to some alternative methods is that only a small number of variables can be used in forming strata.[11]

[9] See http://www.rand.org/labor/FLS/IFLS.

[10] We also have performed all simulations with English test scores and weight z-scores. The results are very close to the results using math test scores and height z-scores, and are available from the authors upon request.

[11] This is particularly true in small samples. For example, considering only binary or dichotomized characteristics, with 5 variables there are 2^5 = 32 strata, while 10 variables would give 2^10 = 1,024 strata. In our samples of 30 observations, we stratify on two variables, forming eight strata. In the samples of 100 and 300 observations, we also stratify on 3 variables (24 strata), and also on 4 variables (48 strata).

In terms of which variables to stratify on, the econometric literature emphasizes variables that are strongly related to the outcome of interest, and variables for which subgroup analysis is desired. Statistical efficiency is greatest when the variables chosen are strongly related to the outcome of interest (Imai, King, and Stuart 2008). Stratification is not able to remove all imbalance for continuous variables. For example, for two normal distributions with different means but the same variance, the means of the two distributions between any two fixed values (i.e., within a stratum) will differ in the same direction as the overall mean (Douglas G. Altman 1985). In the simulations, we always stratify on the baseline values of the outcome of interest and on one or two other variables that either relate to the outcome of interest or constitute relevant subgroups for ex
post analysis.

Pairwise Matching. As a third method, we simulate pairwise matching. As opposed to stratification, matching provides a method to improve covariate balance for many variables at the same time. Greevy et al. (2004) describe the use of optimal multivariate matching. However, we chose to use the less computationally intensive optimal greedy algorithm laid out in King et al. (forthcoming).12 In both cases, pairs are formed so as to minimize the Mahalanobis distance between the values of all the selected covariates within pairs, and then one unit in each pair is randomly assigned to treatment and the other to control.

As with stratification, matching on covariates can increase balance on these covariates, and increase the efficiency and power of hypothesis tests. King et al. (2007) emphasize one additional advantage in the context of social science experiments when the matched pairs occur at the level of a community, village, or school: it provides partial protection against political interference or dropout. If a unit drops out of the study or suffers interference, its pair unit can also be dropped from the study, while the set of remaining pairs will still be as balanced as the original dataset. In contrast, in a pure randomized experiment, if even one unit drops out, it is no longer guaranteed that the treatment and control groups are balanced on average. However, the converse of this is that if units drop out at random, the matched pair design will throw out the corresponding pairs as well, leading to a reduction in power and a smaller sample size than if an unmatched randomization was used.13

Note, however, that simply dropping the paired unit will only yield a consistent estimate of the average treatment effect for the full sample when the reason for attrition is unrelated to the size of the treatment effect. A special case of this occurring is when there is a constant treatment effect. If there are heterogeneous treatment effects, and dropout is related to the size of the
treatment effect, then one can only identify the average treatment effect for the subsample of units that remain in the sample when the treatment is randomly offered. Whether the average treatment effect for the subsample of units remaining in the sample is a quantity of interest will be up to the researcher to argue, and will depend on the level of attrition. It will understate the average treatment effect for the population of interest if those in the control group who had most to gain from the treatment drop out of the survey, either through disappointment or in order to take up an alternative to the treatment. It will overstate the average treatment effect for the population of interest if the individuals in the treatment group who do not benefit much, or perhaps even have a negative effect from the treatment, drop out.

12 The Stata code performing pairwise Mahalanobis matching with an optimal greedy algorithm takes several days to run in the 300 observations sample. If there is little time in the field to perform the randomization, this may not be an option. It is important to have ample time between receiving baseline data and having to perform the randomization, to have the flexibility of using matching techniques if desired. Software packages other than Stata may be more suited for this algorithm and may speed up the process. We provide our Stata code in the Web Appendix.
13 See Greevy et al. (2004) for discussion of methods to retain broken pairs.

Rerandomization Methods. Since our survey revealed that several researchers are using rerandomization methods, we simulate two of these methods. The first, which we dub the big stick method, by analogy with Jose F. Soares and C. F. Jeff Wu (1983), requires a redraw if a draw shows any statistical difference in means between treatment and control group at the 5 percent level or lower. The second method picks the draw with the minimum maximum t-stat out of 1,000 draws. We are not aware of
any papers that formally set out the rerandomization methods used in practice in development, but there are analogs in the sequential allocation methods used in clinical trials (Soares and Wu 1983; D. R. Taves 1974; S. J. Pocock and R. Simon 1975). The use of these related methods remains somewhat controversial in the medical field. Proponents emphasize the ability of such methods to improve balance on up to 10 to 20 covariates, with Tom Treasure and Kenneth D. MacRae (1998) suggesting that if randomization is the gold standard, minimization may be the platinum standard. In contrast, the European Committee for Proprietary Medicinal Products (CPMP 2003) recommends that applicants avoid such methods, and argues that minimization may result in more harm than good, bringing little statistical benefit in moderate sized trials.

Why might researchers wish to use these methods instead of stratification? In small samples, stratification is only possible on one or two variables. There may be many variables that the researcher would like to ensure are not too unbalanced, without requiring exact balance on each. Rerandomization methods may be viewed as a compromise solution by the researchers, preventing extreme imbalance on many variables without forcing close balance on each. Rerandomization may also offer a way of obtaining approximate balance on a set of relevant variables in a situation of multiple treatment groups of unequal sizes.

C. Variables for Balancing

In practice, researchers attempt to balance on variables they think are strongly correlated with the outcome of interest. The baseline level of the outcome variable is a special case of this kind of variable. We always include the baseline outcome among the variables to stratify, match, or balance on.14 In the matching and rerandomization methods, we also use six additional baseline variables that are thought to affect the outcome of interest. Stratification takes a subset of these six additional variables.15

14 Note that this is rather an exception in
practice, where researchers often do not balance on the baseline outcome.
15 A list of the variables used for each dataset is in Web Appendix 2, Table A2.

Among these balancing variables, we tried to pick variables that are likely to be correlated with the outcome based on economic theory and existing data. There is, however, a caveat. Most experiments have impacts measured over periods of six months to two years. While our economic models and existing datasets can provide good information for deciding on a set of variables useful for explaining current levels, they are often much less useful in explaining future levels of the variable of interest. In practice, we often cannot theoretically or empirically explain many short-run changes well with observed variables, and believe that these changes are the result of shocks. As a result, it may be the case that the covariates used to obtain balance are not strong predictors of future values of the outcome of interest.

The set of outcomes we have chosen spans a range of the ability of the baseline variables to predict future outcomes. At one end is microenterprise profits in Sri Lanka, where baseline profits and 6 baseline individual and firm characteristics explain only 12.2 percent of the variation in profits 6 months later. Thus, balancing on these common owner and firm characteristics will not control for very much of the variation in future realizations of the outcome of interest. School enrollment in the IFLS data is another example in which baseline variables explain very little of future outcomes. For a sample of 300 students who were all in school at baseline, 7 baseline variables explain 16.7 percent of the variation in school enrollment for the same students 3 years later. The explanatory power is better for labor income in the Mexican ENE data and household expenditure in the IFLS, with the baseline outcome and 6 baseline variables explaining 28 to 29 percent of the variation in the
future outcome. The math test scores and height z-scores in the LEAPS data have the most variation explained by baseline characteristics, with 43.6 percent of the variation in follow-up test scores explained by the baseline test score and 6 baseline characteristics.

We expect to see greater difference among randomization methods in terms of achieving balance on future outcomes for the variables that are either more persistent or that have a larger share of their changes explained by baseline characteristics. We expect to see the least difference among methods for the Sri Lankan microenterprise profits data and Indonesian school enrollment data, and the most difference for the LEAPS math test score and height z-score data.

More generally, we recommend balancing on the baseline outcome, since our data show that this variable typically has the strongest correlation with the future outcome. In addition, we recommend balancing on dummies for geographic regions, since they also tend to be quite correlated with our future outcomes. This may be because shocks differ across regions. Moreover, in practice, implementation of treatment may vary across regions, which is another reason to balance on region. Finally, if baseline data are available for several periods, one could check which variables are strongly correlated with future outcomes and balance on these. If multiple rounds of the baseline survey are not available, data could come from an outside source, as long as they include the same variables and were collected in a comparable environment.

When choosing balancing variables, researchers will often face a tradeoff between balancing on additional variables and achieving better balance on already chosen variables. For example, with stratification, there is a tradeoff between stratifying on another variable and breaking down an already chosen variable into finer categories. In this case, we recommend adding the new variable only
if there is strong reason to believe that it would be correlated with follow-up outcomes, or if subsample analysis in this dimension is envisioned. In our datasets, it tends to be the case that, except for the baseline outcome and geography, few variables are strongly correlated with the follow-up outcome. Moreover, note that simply adding the new variable and switching the randomization method to matching, which is not bound by the number of strata, does not necessarily solve the problem. The more variables are used for matching, the worse balance tends to be on any given variable.

III. Simulation Results

Web Appendix 3 reports the full set of simulation results for all four datasets for 30, 100, and 300 observations. We summarize the results of these simulations in this section, organizing their discussion around several central questions that a researcher may have when performing a randomized assignment. We start by addressing the following core question.

A. Which Methods Do Better in Terms of Achieving Balance and Avoiding Extremes?

We first compare the relative performance of the different methods in achieving balance between the treatment and control groups in terms of baseline levels of the outcome variable. Table 3 shows the average difference in baseline means, the ninety-fifth percentile of the difference in means (a measure of the degree of imbalance possible at the extremes), and the percentage of simulations in which a t-test for difference in means between the treatment and control has a p-value less than 0.10. We present these results for a sample size of 100, with results for the other sample sizes appearing in Web Appendix 3. Figure 1 graphically summarizes the results for a selection of variables, plotting the densities of the differences in average outcome variables for all three sample sizes (30, 100, and 300 observations).

Table 3 shows that the mean difference in baseline means is very close to zero for all methods: on average, all methods lead to balance. However, Table 3 and Figure 1
also show that stratification, matching, and especially the min-max t-stat method have much less extreme differences in baseline outcomes, while the big stick method only results in narrow improvements in balance over a single random draw. For example, in the Mexican labor income data with a sample of 100, the ninety-fifth percentile of the difference in baseline mean income between the treatment and control groups is 0.384 standard deviations (SD) with a pure random draw, 0.332 SD under the big stick method, 0.304 SD when stratifying on 4 variables, 0.100 SD with pairwise greedy matching, and 0.088 SD under the min-max t-stat method. The size of the difference in balance achieved with different methods shrinks as the sample size increases; asymptotically, all methods will be balanced.

The key question is to what extent achieving greater balance on baseline variables translates into better balance on future values of the outcome of interest in the absence of treatment. The lower graphs in Figures 1A to 1D show the distribution of difference in means between treatment and control at follow-up for each method, while Tables 4A and 4B summarize how the methods perform in obtaining balance in follow-up outcomes.16

Panel A in both Tables 4A and 4B shows that, on average, all randomization methods give balance on the follow-up variable, even with a sample size as small as 30. This is the key virtue of randomization. Figure 1 and panel B in both Table 4A and Table 4B show there are generally fewer differences across methods in terms of avoiding extreme imbalances than with the baseline data. This is particularly true of the Sri Lankan profit data and the Indonesian schooling data, for which baseline variables explained relatively little of future outcomes. With a sample size of 30, stratification and matching reduce extreme differences between treatment and control, but with samples of 100 or 300, there is very little difference between the various methods in terms of
how well they balance the future outcome.

16 The follow-up period is six months for the Sri Lankan microenterprise and Mexican labor income data, one year for the Pakistan test-score and child height data, and three years for the Indonesian schooling and expenditure data.

Table 3. How Do the Different Methods Compare in Terms of Baseline Balance? (Simulation results for 100 observation sample size)

| | Single random draw | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat out of 1,000 draws |
|---|---|---|---|---|---|---|
| Panel A. Average difference in BASELINE between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.001 | 0.000 | 0.001 | 0.003 | 0.001 | 0.000 |
| Household expenditure (Indonesia) | 0.002 | 0.001 | 0.001 | 0.002 | 0.001 | 0.002 |
| Labor income (Mexico) | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 |
| Height z-score (Pakistan) | 0.001 | 0.001 | 0.000 | 0.000 | 0.001 | 0.000 |
| Math test score (Pakistan) | 0.003 | 0.000 | 0.001 | 0.000 | 0.002 | 0.000 |
| Baseline unobservables (Sri Lanka) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 |
| Baseline unobservables (Mexico) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Panel B. Ninety-fifth percentile of difference in BASELINE between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.386 | 0.195 | 0.241 | 0.313 | 0.324 | 0.091 |
| Household expenditure (Indonesia) | 0.390 | 0.145 | 0.191 | 0.268 | 0.328 | 0.107 |
| Labor income (Mexico) | 0.384 | 0.280 | 0.304 | 0.100 | 0.332 | 0.088 |
| Height z-score (Pakistan) | 0.395 | 0.160 | 0.206 | 0.102 | 0.319 | 0.089 |
| Math test score (Pakistan) | 0.392 | 0.164 | 0.237 | 0.074 | 0.328 | 0.106 |
| Baseline unobservables (Sri Lanka) | 0.434 | 0.417 | 0.414 | 0.434 | 0.434 | 0.434 |
| Baseline unobservables (Mexico) | 0.457 | 0.448 | 0.439 | 0.457 | 0.457 | 0.427 |
| Panel C. Proportion of p-values < 0.1 for testing difference in BASELINE means | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.097 | 0.000 | 0.005 | 0.037 | 0.045 | 0.000 |
| Household expenditure (Indonesia) | 0.102 | 0.000 | 0.000 | 0.013 | 0.049 | 0.000 |
| Labor income (Mexico) | 0.100 | 0.015 | 0.029 | 0.000 | 0.053 | 0.000 |
| Height z-score (Pakistan) | 0.100 | 0.000 | 0.001 | 0.000 | 0.038 | 0.000 |
| Math test score (Pakistan) | 0.100 | 0.000 | 0.006 | 0.000 | 0.048 | 0.000 |
| Baseline unobservables (Sri Lanka) | 0.101 | 0.096 | 0.095 | 0.082 | 0.098 | 0.091 |
| Baseline unobservables (Mexico) | 0.108 | 0.095 | 0.093 | 0.103 | 0.102 | 0.090 |

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Table A2 of the Web Appendix.

[Figure 1. Distribution of Differences in Means between the Treatment and Control Groups at Baseline and Follow-up. Panel A: Sri Lanka profits (baseline and 6 months later). Panel B: Mexican ENE labor income (baseline and 6 months later). Panel C: IFLS expenditure data (baseline and 3 years later). Panel D: LEAPS math test score data (baseline and 1 year later). Each panel shows densities for sample sizes 30, 100, and 300; the horizontal axis is the difference in the average outcome, weighted by standard deviation; the plotted methods are 1 Draw, 8 Strata, Matched, Big stick, and Minmax.]

Baseline variables have more predictive power for the realizations at follow-up for the other outcomes we consider. The Mexican labor income and Indonesian expenditure data are in an intermediate range of baseline predictive power, with the baseline outcomes plus 6 other variables explaining about 28 percent of the variation in follow-up outcomes. Panels B and C of Figure 1 show that, in contrast to the Sri Lanka and IFLS schooling data, even with samples
of 100 or 300, we find matching and stratification continue to perform better than a single random draw in reducing extreme imbalances. Table 4B shows that with a sample size of 300, the ninety-fifth percentile of the difference in means between treatment and control groups is 0.23 SD under a pure random draw for both expenditure and labor income. This difference falls to 0.20 SD for expenditure and 0.15 SD for labor income when pairwise matching is used, and to 0.20 SD for both variables when stratifying or using the min-max rerandomization method.

Our other two outcome variables, math test scores and height z-scores, are in the higher end of baseline predictive power, with the baseline outcome and 6 other variables predicting 43.6 percent and 35.3 percent of the variation in follow-up outcomes, respectively. Figure 1, panel D illustrates that the choice of method makes more of a difference for these highly predictable follow-up outcomes than for the less predictable ones. Stratifying, matching, and the min-max t-stat method consistently lead to narrower distributions in the differences at follow-up when test scores or height z-scores are the outcomes. Nevertheless, even with these more persistent variables, the gains from pursuing balance on baseline are relatively modest when the sample size is 300. Using pairwise matching rather than a pure random draw reduces the ninety-fifth percentile of the difference in means from 0.23 to 0.17 in the case of math test scores.

Table 4A. How Do the Different Methods Compare in Terms of Balance on Future Outcomes? (Sample size of 30)

| | Single random draw | Stratified on two variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat |
|---|---|---|---|---|---|
| Panel A. Average difference in FOLLOW-UP between treatment and control means (in SD) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.001 | 0.000 | 0.001 | 0.003 | 0.002 |
| Child schooling (Indonesia) | 0.005 | 0.010 | 0.002 | 0.004 | 0.006 |
| Household expenditure (Indonesia) | 0.000 | 0.002 | 0.002 | 0.000 | 0.006 |
| Labor income (Mexico) | 0.003 | 0.000 | 0.000 | 0.000 | 0.001 |
| Height z-score (Pakistan) | 0.007 | 0.001 | 0.004 | 0.003 | 0.001 |
| Math test score (Pakistan) | 0.001 | 0.002 | 0.001 | 0.003 | 0.005 |
| Panel B. Ninety-fifth percentile of difference in FOLLOW-UP between treatment and control means (in SD) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.713 | 0.627 | 0.556 | 0.705 | 0.708 |
| Child schooling (Indonesia) | 0.834 | 0.745 | 0.556 | 0.556 | 0.556 |
| Household expenditure (Indonesia) | 0.721 | 0.643 | 0.496 | 0.677 | 0.590 |
| Labor income (Mexico) | 0.703 | 0.713 | 0.503 | 0.688 | 0.704 |
| Height z-score (Pakistan) | 0.710 | 0.620 | 0.557 | 0.620 | 0.443 |
| Math test score (Pakistan) | 0.717 | 0.448 | 0.350 | 0.648 | 0.525 |
| Panel C. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference as if pure randomization was used (e.g., no adjustment for strata or match dummies) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.105 | 0.059 | 0.027 | 0.101 | 0.109 |
| Child schooling (Indonesia) | 0.052 | 0.113 | 0.033 | 0.041 | 0.010 |
| Household expenditure (Indonesia) | 0.102 | 0.069 | 0.014 | 0.083 | 0.046 |
| Labor income (Mexico) | 0.101 | 0.106 | 0.007 | 0.093 | 0.103 |
| Height z-score (Pakistan) | 0.097 | 0.056 | 0.030 | 0.059 | 0.007 |
| Math test score (Pakistan) | 0.101 | 0.006 | 0.000 | 0.072 | 0.022 |
| Panel D. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference which takes account of randomization method (i.e., controls for stratum, pair, or rerandomizing variables) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.103 | 0.091 | 0.098 | 0.103 | 0.122 |
| Child schooling (Indonesia) | 0.103 | 0.117 | 0.033 | 0.098 | 0.108 |
| Household expenditure (Indonesia) | 0.102 | 0.098 | 0.097 | 0.101 | 0.094 |
| Labor income (Mexico) | 0.102 | 0.109 | 0.107 | 0.102 | 0.117 |
| Height z-score (Pakistan) | 0.100 | 0.097 | 0.101 | 0.100 | 0.103 |
| Math test score (Pakistan) | 0.099 | 0.102 | 0.103 | 0.098 | 0.098 |

Notes: The coefficients in panels A and B are for specifications without controls for stratum or pair dummies. Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

Table 4B. How Do the Different Methods Compare in Terms of Balance on Future Outcomes? (Sample size of 300)

| | Single random draw | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat |
|---|---|---|---|---|---|---|
| Panel A. Average difference in FOLLOW-UP between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 | 0.000 |
| Child schooling (Indonesia) | 0.002 | 0.003 | 0.001 | 0.000 | 0.002 | 0.002 |
| Household expenditure (Indonesia) | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 |
| Labor income (Mexico) | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 | 0.002 |
| Height z-score (Pakistan) | 0.001 | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 |
| Math test score (Pakistan) | 0.001 | 0.000 | 0.000 | 0.001 | 0.001 | 0.001 |
| Panel B. Ninety-fifth percentile of difference in FOLLOW-UP between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.220 | 0.210 | 0.209 | 0.211 | 0.216 | 0.224 |
| Child schooling (Indonesia) | 0.213 | 0.219 | 0.212 | 0.227 | 0.227 | 0.196 |
| Household expenditure (Indonesia) | 0.226 | 0.194 | 0.196 | 0.200 | 0.219 | 0.198 |
| Labor income (Mexico) | 0.227 | 0.196 | 0.198 | 0.149 | 0.213 | 0.195 |
| Height z-score (Pakistan) | 0.222 | 0.186 | 0.189 | 0.189 | 0.212 | 0.225 |
| Math test score (Pakistan) | 0.227 | 0.180 | 0.184 | 0.167 | 0.209 | 0.175 |
| Panel C. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference as if pure randomization was used (e.g., no adjustment for strata or match dummies) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.100 | 0.080 | 0.080 | 0.085 | 0.092 | 0.103 |
| Child schooling (Indonesia) | 0.121 | 0.087 | 0.082 | 0.098 | 0.111 | 0.096 |
| Household expenditure (Indonesia) | 0.101 | 0.056 | 0.052 | 0.064 | 0.092 | 0.059 |
| Labor income (Mexico) | 0.100 | 0.056 | 0.062 | 0.011 | 0.087 | 0.028 |
| Height z-score (Pakistan) | 0.097 | 0.044 | 0.049 | 0.049 | 0.081 | 0.097 |
| Math test score (Pakistan) | 0.101 | 0.038 | 0.042 | 0.028 | 0.076 | 0.032 |
| Panel D. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference which takes account of randomization method (i.e., controls for stratum, pair, or rerandomizing variables) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.098 | 0.103 | 0.133 | 0.103 | 0.102 | 0.101 |
| Child schooling (Indonesia) | 0.098 | 0.102 | 0.104 | 0.098 | 0.104 | 0.104 |
| Household expenditure (Indonesia) | 0.099 | 0.100 | 0.099 | 0.101 | 0.105 | 0.100 |
| Labor income (Mexico) | 0.100 | 0.095 | 0.101 | 0.104 | 0.100 | 0.112 |
| Height z-score (Pakistan) | 0.094 | 0.097 | 0.097 | 0.098 | 0.095 | 0.102 |
| Math test score (Pakistan) | 0.101 | 0.097 | 0.099 | 0.097 | 0.100 | 0.102 |

Notes: The coefficients in panels A and B are for specifications without controls for stratum or pair dummies. Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

B. What Does Balance on Observables Imply about Balance on Unobservables?

In general, what does balancing on observables do in terms of balancing unobservables? Aickin (2001) notes that methods that balance on observables can do no worse than pure randomization with regard to balancing unobserved variables.17 We illustrate this point empirically in the Sri Lanka and ENE datasets by defining a separate group of variables from the data to be unobservable, in the sense that we do not balance, stratify, or match on them. The idea here is that although we have these variables (such as measures of entrepreneurial ability) in these particular datasets, they may not be available in other datasets. Moreover, these unobservables are meant to capture what balancing does to variables that are thought to have an effect on the outcome variable but are truly unobservable. Table 3 indicates that the balance on these unobservables is pretty much the same across all methods.

Paul R. Rosenbaum (2002, 21) notes that under pure randomization, if we look at a table of observed covariates and see balance, this gives us reason to hope and expect that other variables not measured are similarly balanced. This holds true for pure random draws, but will not be the case with methods that enhance balance on certain observed covariates. Presenting a table that shows only the variables used in matching or for rerandomization checks, and showing balance on these covariates, will overstate the degree of balance attained on other variables that are not closely correlated with those for which balance was pursued. For example, the ninety-fifth percentile of the difference in means in Table 3 gives a similar level of imbalance for the unobservables as the balanced outcome under a pure random draw, whereas under the other methods, the unobservables have higher
imbalance than the outcome variable.18 We therefore recommend that if matching, rerandomization, or stratification on continuous variables is used, researchers clearly separate these from other variables of interest when presenting a table to show balance.

17 To see this, consider balancing on variable X and the consequences of this for balance on an unobserved variable W. W can be written as the sum of the fitted value from regressing W on X and the residual from this regression:
(1) $W = P_X W + (I - P_X) W, \quad P_X = X (X'X)^{-1} X'$.
Balancing on X will therefore also balance the part of W that is correlated with X, $P_X W$. Since the remaining part of W, $(I - P_X) W$, is orthogonal to X, it will tend to balance at the same rate as under pure randomization.
18 Note the imbalance on unobservables is similar to that of a single random draw, which concurs with the point that balancing on observables can do no worse than pure randomization when it comes to balancing unobservables.

C. To Dummy or Not to Dummy?

We have seen that only a fraction of studies using stratification control for strata in the statistical analysis. Kernan et al. (1999) state that results should take into account stratification by including strata as covariates in the analysis. Failure to do so results in overly conservative standard errors, which may lead a researcher to erroneously fail to reject the null hypothesis of no treatment effect. While the omission of balanced covariates will not change the point estimates of the effect in linear models, leaving out a balanced covariate can change the estimate of the treatment effect in nonlinear models (Gillian M. Raab, Simon Day, and Jill Sales 2000), so that analysis of binary outcomes makes this adjustment more important. The CPMP (2003) also recommends that all stratification variables be included as covariates in the primary analysis, in order to reflect the restriction on randomization implied by the stratification. Similarly, for pairwise matching,
dummies for each pair should be included in the treatment regression.

Furthermore, in practice, stratification is unlikely to achieve perfect balance for all of the variables used in stratification. Whenever there is an odd number of units within a stratum, there will be imbalance (Terry M. Therneau 1993). In addition, imbalance may arise from units having a baseline missing value on one of the variables used in forming strata. As a consequence, in practice, the point estimate of the treatment effect will also likely change if strata dummies are included, compared to when they are not included.

To examine whether or not controlling for stratification matters in practice, panels C and D of Tables 4A and 4B compare the size of a hypothesis test for the difference in means of the follow-up outcome when no treatment has been given. Panel C of Tables 4A and 4B shows the proportion of p-values under 0.10 when no stratum or pair dummies are included, and panel D of Tables 4A and 4B shows the proportion of p-values under 0.10 when these dummies are included. Recall that this is a test of a null hypothesis we know to be true. So, to have correct size, 10 percent of the p-values should be below 0.10.

We see that this is the case for the pure random draw, whereas failure to control for the dummies leads the stratification and pairwise matching tests to be too conservative on average.19 For example, with a sample size of 30, less than 5 percent of the p-values are below 0.10 for all 6 outcomes when we do not include pair dummies with pairwise matching. For the math test score, only 0.6 percent of the p-values under stratification, and none of the p-values under pairwise matching, are under 0.10. Even with a sample size of 300, less than 5 percent of the p-values are below 0.10 for the more persistent outcomes when stratification or matching is used but not accounted for by adding stratum or pair dummies. In contrast, panel D shows that when we add stratum dummies or pair dummies, the hypothesis test has the correct size, with
10 percent of the p-values under 0.10, even in sample sizes as small as 30.

Thus, on average, it is overly conservative to not include the controls for stratum or pair in analysis. The resulting conservative standard errors imply that if researchers do not account for the method of randomization in analysis, they may not detect treatment effects that they would otherwise detect. However, although on average the p-values are lower when including these dummies, Table 5 shows that this is not necessarily the case in any particular random allocation to treatment and control. Including stratum dummies only lowers the p-value in 48 to 88 percent of the replications, depending on sample size and outcome variable. Thus, in practice, researchers cannot argue that ignoring stratum dummies will always result in larger standard errors than when these dummies are included. If researchers could commit to always ignoring the stratification during analysis, then this would be, on average, conservative. But since it is difficult to commit if no standard for analysis exists, researchers may be tempted to try their analysis with and without stratum dummies, and report the results that are more significant. We therefore recommend that the standard should be to control for the method of randomization in analysis.20

19 The child schooling in Indonesia is a binary outcome. The difference in means (attending school) can therefore be only a limited number of discrete differences, and this discreteness causes the test to not have the correct size, even under a pure random draw, when the sample is small.

D. How Should Inference Be Done After Rerandomizing?

While including strata or pair dummies in the ex post analysis for the stratification and matching methods is quite straightforward, the methods of inference are not as clear for rerandomization methods. In fact, the correct statistical methods for covariate-dependent randomization schemes, such as minimization, are still a
conundrum in the statistics literature leading some to argue that the only analysis that we can be completely confident in is a permutation test or rerandomization test Randomization inference can be used for analysis of the method of rerandomizing when the first draw exceeds some statistical threshold although it requires addi tional programming work Using the rule which determines when rerandomization will take place the researcher can map out the set of random draws that would be allowed by the threshold rule throwing out those with excessive imbalance and then carry out permutation tests on the remaining draws21 Such a method is not possible when ad hoc criteria are used to decide whether to redraw 20 If authors believe they have a valid reason not to control for stratum dummies they should explain this reasoning in their text and also mention what the results would be if stratum dummies were included 21 When multiple draws are used to select the allocation that gives best balance over a sequence of 100 or 1000 draws there may be a concern that the resulting assignment to treatment is mostly deterministic This will be the case in very small samples under 12 units but is not a concern for all but the smallest trials Table 5Is It Always Conservative to Ignore the Method of Randomization Proportion of replications in which controlling for stratum or pair dummies lowers the pvalue on a test of difference in means between treatment and control groups Stratified Stratified Pairwise Big Draw with on two on four greedy stick minmax variables variables matching rule tstat Panel A sample size 30 Microenterprise profits Sri Lanka 0690 1000 0493 0555 Child schooling Indonesia 0373 0686 0567 0854 Household expenditure Indonesia 0622 1000 0523 0657 Labor income Mexico 0477 1000 0496 0532 Height zscore Pakistan 0579 1000 0537 0825 Math test score Pakistan 0684 1000 0522 0740 Panel B sample size 300 Microenterprise profits Sri Lanka 0668 0731 1000 0526 0689 Child schooling 
Indonesia 0705 0634 1000 0506 0674 Household expenditure Indonesia 0869 0733 1000 0522 0738 Labor income Mexico 0874 0712 1000 0525 0725 Height zscore Pakistan 0860 0655 1000 0522 0754 Math test score Pakistan 0882 0735 1000 0533 0776 Notes Statistics are based on 10000 simulations of each method Details on methods and variables are in Web Appendix Table A2 220 AmEricAN EcONOmic JOurNAL APPLiEd EcONOmics OcTOBEr 2009 Optimal modelbased inference is less clear under rerandomization since allo cation to treatment is datadependent To see this consider the data generating processes 2a Yi α βTreati εi 2b Yi α βTreati γ zi ui where Treati is a dummy variable for treatment status and zi are a set of covariates potentially correlated with the outcome Yi Under pure randomization 2a is used for analysis assignment to treatment is in expectation uncorrelated with εi and the standard error will depend on Var εi Suppose instead that rerandomization methods are used which force the difference in means of the covariates in z to be less than some specified threshold z TrEAT z cONTrOL δ If δ is invariant to sample size eg difference in proportions less than 010 then this condition will occur almost surely as the sample size goes to infinity and thus the conditioning will not affect the asymptotics However in practice δ is usually set by some statistical significance threshold Then if 2a is used for analysis that is the covariates are not controlled for we will only have that εi is independent of Treati conditional on z TrEAT z cONTrOL δ The correct standard error should therefore account for this conditioning using Varεi z TrEAT z cONTrOL δ In practice this will be difficult to do so adapting the minimization inference recommendations of Neil W Scott et al 2002 we recommend researchers include all the variables used to check balance as linear covariates in the regression22 Estimation of the treatment effect in 2b will then be conditional on the variables used for checking balance 
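The contrast between analyzing a rerandomized design with (2a) and with (2b) can be illustrated with a small simulation. The following is a minimal sketch in Python (NumPy only), not the authors' code. It assumes a single standard-normal covariate z, an outcome Y = 2z + u with no true treatment effect, a rerandomization rule that redraws a 50/50 assignment until the absolute t-statistic on the difference in means of z falls below 0.5, and a normal critical value of 1.645 for a 10 percent test. All function names and parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def rerandomize(z, max_t=0.5, rng=None, max_draws=1000):
    """Redraw a 50/50 assignment until |t| for the difference in means of z < max_t.

    Illustrative threshold rule (an assumption, not the paper's exact procedure).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(z)
    base = np.r_[np.ones(n // 2), np.zeros(n - n // 2)]
    for _ in range(max_draws):
        treat = rng.permutation(base)
        z1, z0 = z[treat == 1], z[treat == 0]
        se = np.sqrt(z1.var(ddof=1) / len(z1) + z0.var(ddof=1) / len(z0))
        if abs(z1.mean() - z0.mean()) / se < max_t:
            return treat
    raise RuntimeError("no acceptable draw found")


def ols_treat(y, treat, controls=None):
    """OLS of y on a constant, Treat, and optional controls.

    Returns the Treat coefficient and its homoskedastic standard error.
    """
    cols = [np.ones_like(y), treat]
    if controls is not None:
        cols.append(controls)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])


def simulate_size(n_sims=500, n=30, max_t=0.5, crit=1.645, seed=0):
    """Share of null simulations rejected when using (2a) versus (2b)."""
    rng = np.random.default_rng(seed)
    reject_unadj = 0
    reject_adj = 0
    for _ in range(n_sims):
        z = rng.standard_normal(n)
        treat = rerandomize(z, max_t=max_t, rng=rng)
        y = 2.0 * z + rng.standard_normal(n)  # no treatment effect by construction
        b, se = ols_treat(y, treat)  # regression (2a): no covariates
        reject_unadj += abs(b / se) > crit
        b, se = ols_treat(y, treat, controls=z)  # regression (2b): z as linear control
        reject_adj += abs(b / se) > crit
    return reject_unadj / n_sims, reject_adj / n_sims


if __name__ == "__main__":
    size_2a, size_2b = simulate_size()
    print(f"rejection rate without controls (2a): {size_2a:.3f}")
    print(f"rejection rate with z as control (2b): {size_2b:.3f}")
```

Under these assumptions, the regression without controls should reject well below the nominal 10 percent rate, while the regression that includes z should reject at roughly the nominal rate. The exact figures depend on the seed, the threshold, and the strength of the covariate.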
Conditioning on these variables entails a loss of degrees of freedom compared to not controlling for these covariates, but still requires fewer degrees of freedom than pairwise matching. The simulation results in Tables 4A and 4B suggest that this approach works in practice. Treating the big stick or minmax t-statistic methods as if they were pure random draws results in less than 10 percent of replications having p-values under 0.10 (panel C), whereas including the variables used for checking balance as linear controls results in the correct test size (panel D). This correction is more important for the minmax method than the big stick method, since the minmax method achieves greater baseline balance.

[22] If an interaction or quadratic term is used to check balance, which seems rare in practice, then this same term should also be included as a regressor. Note that in the special case of the rerandomization method being used to seek balance on a set of binary variables X, Y, and their interaction X × Y, for which it is possible to attain exact balance on these variables, rerandomization inference with X, Y, and X × Y as controls would be equivalent to inference after stratification on these same variables with strata dummies used as controls.

E. How Do the Different Methods Compare in Terms of Power for Detecting a Given Treatment Effect?

To compare the power of the different methods, we simulate a treatment effect by adding a constant to the follow-up outcome variable for the treatment group. We simulate constant treatments which add Rs. 1,000 (LKR; 25 percent of average baseline profits) to the Sri Lankan microenterprise profits; add Mex$920 (20 percent of average baseline income) to the Mexican labor income; add 0.4 (0.5 standard deviations) to log expenditure in Indonesia; and add 0.25 standard deviations to the Pakistan math test scores and child height z-scores. For the schooling treatment, we randomly set one in three schooling dropouts to stay in school. These treatments are all relatively small in magnitude for the sample sizes used, so that we can see differences in power across methods, rather than have all methods give power close to one.

Table 6 summarizes the power of a hypothesis test for detecting the treatment effect, using the t-test on the treatment coefficient in a linear regression of the outcome variable on a constant and a dummy variable for treatment status. We report the proportion of replications where this test would reject the null hypothesis of no effect at the 10 percent level. Panels A and C report results when the regression model does not include controls for the method of randomization, while panels B and D report the power when stratum or pair dummies, or the variables used in checking balance for rerandomization methods, are included. The results for the pure random sample in panels B and D include the same set of seven baseline controls, to enable comparison of ex post controls for baseline characteristics to ex ante balancing.

Table 6 shows that if we do not adjust for the method of randomization, the different methods often perform similarly in terms of power. In cases where they differ, the methods that pursue balance tend to have less power than pure randomization. For example, with a sample size of 30, the power for both the height and math test scores is approximately 0.17 under a single random draw, but can be as low as 0.016 for the math test score under pairwise matching, and as low as 0.052 for the height z-score with the minmax method. As we have seen, the size of tests is too low for persistent variables when the method of randomization is not controlled for, which makes it difficult to detect a significant effect. This translates into low power in such cases.

Adding the strata and pair dummies, or baseline variables used for rerandomizing, increases power in almost all cases. Some of the increases in power can be sizeable: the power increases from 0.016 to 0.320 for the math test score with pairwise matching when the pair dummies are added. This increase in power is another reason to take into account the method of randomization when conducting analysis.

Table 6 also allows us to see the gain in power from ex ante balancing compared to ex post balancing. The same set of variables used for forming the match and for the rerandomization methods were added as ex post controls when estimating the treatment effect for the single random draw in panels B and D. When the variables are not very persistent, such as the microenterprise profits and child schooling, the power is very similar whether ex ante or ex post balancing is done. However, we do observe some improvements in power from matching compared to ex post controls for some, but not all, of the more persistent outcome variables. The power increases from 0.584 to 0.761 for the Mexican labor income when ex ante pairwise matching on seven variables is done rather than a pure random draw followed by linear controls for these seven variables ex post. However, there is no discernible change in power from balancing for child height, another persistent outcome variable.

Table 6. How Do the Different Methods Compare in Terms of Power in Detecting a Given Treatment Effect?

Columns: (1) Single random draw; (2) Stratified on two variables (sample size 30) or on four variables (sample size 300); (3) Pairwise greedy matching; (4) Big stick rule; (5) Draw with minmax t-stat; (6) Stratified on i.i.d. noise; (7) Matching on i.i.d. noise; (8) 20 strata (sample size 30) or 200 strata (sample size 300) on i.i.d. noise

Sample size of 30
                                        (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)
Panel A. Proportion of p-values < 0.10 when no adjustment is made for method of randomization
Microenterprise profits (Sri Lanka)     0.144   0.106   0.095   0.139   0.154   0.132   0.086   0.111
Child schooling (Indonesia)             0.123   0.146   0.111   0.115   0.066   0.116   0.144   0.119
Household expenditure (Indonesia)       0.390   0.382   0.342   0.382   0.360   0.388   0.387   0.391
Labor income (Mexico)                   0.181   0.177   0.098   0.178   0.184   0.177   0.203   0.208
Height z-score (Pakistan)               0.174   0.134   0.133   0.134   0.052   0.195   0.194   0.193
Math test score (Pakistan)              0.167   0.051   0.016   0.139   0.087   0.154   0.131   0.193
Panel B. Proportion of p-values < 0.10 when adjustment is made for randomization method (for the single random draw, when controls for the seven baseline variables are added to the regression)
Microenterprise profits (Sri Lanka)     0.130   0.135   0.153   0.131   0.167   0.144   0.164   0.109
Child schooling (Indonesia)             0.109   0.131   0.121   0.112   0.095   0.118   0.144   0.149
Household expenditure (Indonesia)       0.409   0.424   0.580   0.419   0.461   0.387   0.356   0.280
Labor income (Mexico)                   0.164   0.172   0.242   0.167   0.196   0.165   0.173   0.125
Height z-score (Pakistan)               0.246   0.201   0.206   0.251   0.281   0.161   0.157   0.142
Math test score (Pakistan)              0.183   0.313   0.320   0.187   0.217   0.159   0.170   0.129

Sample size of 300
                                        (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)
Panel C. Proportion of p-values < 0.10 when no adjustment is made for method of randomization
Microenterprise profits (Sri Lanka)     0.288   0.278   0.267   0.280   0.280   0.285   0.279   0.278
Child schooling (Indonesia)             0.606   0.562   0.607   0.597   0.600   0.560   0.610   0.555
Household expenditure (Indonesia)       0.999   1.000   1.000   0.999   1.000   0.999   0.999   0.999
Labor income (Mexico)                   0.494   0.480   0.475   0.489   0.474   0.479   0.484   0.489
Height z-score (Pakistan)               0.728   0.756   0.766   0.743   0.767   0.727   0.728   0.739
Math test score (Pakistan)              0.615   0.650   0.655   0.619   0.657   0.631   0.624   0.620
Panel D. Proportion of p-values < 0.10 when adjustment is made for randomization method (for the single random draw, when controls for the seven baseline variables are added to the regression)
Microenterprise profits (Sri Lanka)     0.301   0.343   0.290   0.302   0.309   0.283   0.338   0.295
Child schooling (Indonesia)             0.608   0.589   0.602   0.600   0.595   0.458   0.607   0.403
Household expenditure (Indonesia)       1.000   1.000   1.000   1.000   1.000   0.999   0.998   0.994
Labor income (Mexico)                   0.584   0.541   0.761   0.584   0.582   0.501   0.602   0.408
Height z-score (Pakistan)               0.863   0.854   0.853   0.867   0.866   0.741   0.721   0.642
Math test score (Pakistan)              0.812   0.781   0.829   0.816   0.826   0.630   0.603   0.460

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix Table A2. Stratifications on independently and identically distributed noise are for 8 (48) strata in the sample of 30 (300) observations. Simulated treatment effects are as follows. Microenterprise profits: an Rs. 1,000 (LKR) increase in profits, about 25 percent of average baseline profits. Child schooling: one in three randomly selected children in the treatment group who would have dropped out do not. Household expenditure: an increase of 0.4 in ln household expenditure per capita, which corresponds to about one-half of a standard deviation, or moving a household from the twenty-fifth to the fiftieth percentile. Labor income: a Mex$920 increase in income, about 20 percent of average baseline income. Height z-score: an increase of one quarter of a standard deviation in the z-score, where the z-score is defined as standard deviations from mean US height for age. Math test score: an increase of one quarter of a standard deviation in the test score.

F. Can We Go Too Far in Pursuing Balance?

When using stratification, matching, or rerandomization methods, one question is how many variables to balance on, and whether balancing on too many variables could be counterproductive. The statistical and econometric literature is not very definitive with respect to how many variables to use in stratification.[23] We therefore investigate how changing the number of strata affects balance and power in practice in our samples of 100 and 300 observations by simulating stratification with 2, 3, and 4 stratifying variables, resulting in 8, 24, and 48 strata, respectively.

The results are shown in Table 7. Both the size of extreme imbalances and the power do not vary much with the number of strata for any of the six outcomes. In most cases, there is neither much gain, nor much loss, from including more strata. However, we do note that for a sample size of 100, when strata dummies are included, power is always slightly lower when 4 stratifying variables and 48 strata are included than when 3 stratifying variables and 24 strata are used. For example, with the math test
score, power falls from 0.464 to 0.399 when the number of strata is doubled.

A question related to the choice of how many variables to balance on is what happens when one balances on irrelevant covariates. Guido Imbens et al. (2009) prove that stratification can do no worse than pure randomization in terms of expected squared error, even when there is little or no correlation with the variables being stratified on. However, although there is no cost to stratification in terms of the variance itself, there is a cost in terms of estimation of the variance. The estimator that takes account of the stratification itself has a larger variance, which comes from the degrees of freedom adjustment. Although one could use the estimator for the variance which ignores the stratification, this is overly conservative and, as we have seen, results in tests of low power.

Our personal viewpoint based on this is that there is a possible cost of overstratifying on irrelevant variables, in that the power of the experiment to detect a significant treatment effect can be diminished as a result of the degrees of freedom adjustment. To gauge how important this might be in practice, consider a few examples, using the fact that controlling for $k$ additional variables can at most increase the estimate of the variance by a factor of $(n-2)/(n-k-2)$. For a sample size of 100, even 10 irrelevant covariates could at most increase standard errors by 5.5 percent (since $\sqrt{98/88} \approx 1.055$), equivalent to a reduction in sample size from 100 to 90. With 200 or 400 as the sample size, balancing on 5 or 10 uncorrelated covariates will not increase standard errors by more than 3 percent. However, balancing on irrelevant variables will continue to have repercussions for standard errors if the number of variables balanced on increases at the same rate as the sample size. In pairwise matching, the number of covariates used as controls in the treatment regression is $n/2$. If the variables used to form matches do not have any role in explaining the outcome of interest, we see that the ratio of standard errors will approach $\sqrt{2}$, that is, standard errors can be 41 percent higher under pairwise matching than pure randomization.

[23] For example, Duflo et al. (2008) state that if several binary variables are available for stratification, it is a good idea to use all of them, even if some of them may not end up having large explanatory power for the final outcome. In contrast, Kernan et al. (1999) argue that fewer strata are better, and raise the possibility of unbalanced treatment assignment within strata due to small cell sizes, recommending that an appropriate number of strata is between n/50 and n/100. Finally, Therneau (1993) shows in simulations with sample sizes of 100 that, with a sufficient number of factors used in stratifying so that the number of strata reaches n/2, performance can actually be worse than using unstratified randomization.

In our simulations, we address the issue of balancing on irrelevant variables by stratifying and matching based on independently and identically distributed noise. The last three columns of Table 6 show the power of the stratified and matching estimators when pure noise is used. Once we control for stratum dummies, power is clearly less when irrelevant variables are used for stratifying or matching than when relevant variables are used. For example, the power with a sample size of 30 for household expenditure under pairwise matching is 0.580 when relevant baseline variables are used to form the match, compared to 0.356 when independently and identically distributed noise is used in the matching. Thus, the choice of variables used in stratifying or matching does play an important role in determining power.

However, if we wish to compare the impact of matching or stratifying on irrelevant variables to a pure random draw, we should compare the power for a single random draw in panels A and C to the power for matching and stratifying on independently and identically distributed noise in panels B and D, which contain controls for stratum or pair dummies. The power is very similar for all sample sizes. In practice, any given draw of independently and identically distributed noise is likely to have some small correlation with the outcome of interest, reducing the residual sum of squares when controlled for in a regression. It seems this small correlation is just enough to offset the fall in degrees of freedom, so that the worst-case scenarios discussed above do not come to pass.[24] Hence, in practice, it seems that stratifying on independently and identically distributed noise does not do any worse than a simple random draw in terms of power when sample sizes are not very small.

Table 7. How Does Stratification Vary with the Number of Strata? (Simulation Results)

Columns: stratified on two variables (8 strata), three variables (24 strata), and four variables (48 strata), for sample sizes 100 and 300

                                        Sample size 100             Sample size 300
                                        8       24      48          8       24      48
Panel A. Imbalance: Ninety-fifth percentile of difference in follow-up means
Microenterprise profits (Sri Lanka)     0.322   0.338   0.338       0.210   0.213   0.209
Child schooling (Indonesia)             0.399   0.346   0.369       0.219   0.211   0.212
Household expenditure (Indonesia)       0.337   0.335   0.343       0.194   0.193   0.191
Labor income (Mexico)                   0.335   0.327   0.344       0.196   0.196   0.198
Height z-score (Pakistan)               0.297   0.299   0.310       0.186   0.191   0.189
Math test score (Pakistan)              0.285   0.298   0.316       0.180   0.181   0.184
Panel B. Power: Proportion of p-values < 0.10 when no strata dummies included
Microenterprise profits (Sri Lanka)     0.129   0.138   0.144       0.274   0.281   0.278
Child schooling (Indonesia)             0.303   0.267   0.273       0.585   0.574   0.562
Household expenditure (Indonesia)       0.852   0.850   0.845       0.999   1.000   1.000
Labor income (Mexico)                   0.170   0.161   0.180       0.486   0.486   0.480
Height z-score (Pakistan)               0.286   0.295   0.297       0.757   0.757   0.756
Math test score (Pakistan)              0.236   0.245   0.254       0.654   0.649   0.650
Panel C. Power: Proportion of p-values < 0.10 when strata dummies included
Microenterprise profits (Sri Lanka)     0.186   0.273   0.242       0.305   0.327   0.343
Child schooling (Indonesia)             0.278   0.301   0.255       0.596   0.596   0.589
Household expenditure (Indonesia)       0.904   0.914   0.876       1.000   1.000   1.000
Labor income (Mexico)                   0.204   0.212   0.199       0.561   0.551   0.541
Height z-score (Pakistan)               0.487   0.463   0.457       0.849   0.843   0.854
Math test score (Pakistan)              0.464   0.464   0.399       0.792   0.790   0.781

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix Table A2.

Finally, Table 6 shows that when stratification or matching is done purely on the basis of independently and identically distributed noise, treating the randomization as if it was a pure random draw does not lower power compared to the case where a single random draw is used. This is in contrast to the case when matching or stratification is done on variables with strong predictive power. Intuitively, when pure noise is used for stratification, it is as if a pure random draw was taken.[25]

The simulation results for stratifying on independently and identically distributed noise with 8 (48) strata and 30 (300) observations suggest that overstratification is not a concern in practice when using a reasonable number of strata. In order to check what would happen in an extreme case, we also simulated stratification on independently and identically distributed noise with 20 strata for 30 observations and 200 strata for 300 observations. In each case, one third of the strata include only one observation, reducing the number of observations that contribute to estimating the treatment effect. The results, included in the last column of Table 6, show that power is now quite a bit lower compared to a pure random draw. We thus conclude that although in extreme cases it is possible to lose power due to overstratification, in practice it is unlikely that one would encounter this problem.

G. What Is the Meaning of the Standard Table 1, If Any?

Section I points out that most research papers containing randomized experiments feature a table, usually the first in the paper, that
tests whether there are any statistically significant differences in the baseline means of a number of variables across treatment and control groups. The widespread use of such tests is interesting in light of concern in the clinical trials literature about both the statistical basis for such tests and their potential for abuse.[26] Altman (1985, 26) writes that when treatment allocation was properly randomized, a difference of any sort between the two groups will necessarily be due to chance: performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd.

[24] Note that even our smallest sample size of 30 is larger than the cases Donald C. Martin et al. (1993) study, where a loss of power can occur.

[25] However, this does not mean that ex post one can check whether the variables used for matching or stratification have predictive power for the future outcome and, if not, ignore the method of randomization. Ignoring the matching or stratification is only correct if the baseline variables are truly pure noise; if there is any signal in these stratifying or matching variables, then ignoring the randomization method will result in incorrect size for hypothesis tests.

[26] See also Imai, King, and Stuart (2008) for discussion on this issue in social science field experiments and for their suggestions as to what should constitute a proper check of balance.

Altman (1985, 26) goes on to add that statistical significance is immaterial when considering whether any imbalance between the groups may have affected the results. In particular, it is wrong to infer from the lack of statistical significance that the variable in question did not affect the outcome of the trial, since a small imbalance in a variable highly correlated with the outcome of interest can be far more important than a large and significant imbalance for a variable uncorrelated with the variable of interest.

A particular concern with the use of significance tests is that researchers may decide whether or not to control for a covariate in their treatment regression on the basis of whether it is significant. Thomas Permutt (1990) shows that the resulting test's true significance level is lower than the nominal level, especially for variables which are more strongly correlated with the outcome of interest. He further shows that adjusting on the basis of an initial significance test does worse than randomly choosing a covariate to adjust for. He reasons that the initial significance test tends to suppress covariate adjustment precisely where it would, on average, do some good: the cases where the adjustment would be enough to produce significance of the outcome, but where the difference in means falls short of significance. Instead, greater power is achieved by always adjusting for a covariate that is highly correlated with the outcome of interest, regardless of its distribution between groups.

However, although controlling for covariates which are highly correlated with the outcome of interest will increase power and still yield consistent estimates, recent work by David A. Freedman (2008), discussed further in Angus Deaton (2008), shows that doing so will induce a finite-sample bias if the treatment effect is heterogeneous and correlated with the square of the covariate introduced. It therefore is of use to compare the point estimate with and without such controls. If the point estimate changes a lot when the covariate is added, then one can investigate further, using interaction models, whether the treatment effect varies with the covariate of interest.[27]

[27] Of course, doing this requires a valid estimate of the standard errors. Consistent estimates are easily available, but the finite-sample properties of such estimators are not so clear. See Freedman (2008) and Deaton (2008) for further discussion.

A final concern with the use of significance tests for imbalance is their potential for abuse. For example, Schulz and David A. Grimes (2002) report that, in the clinical trials literature, researchers who use hypothesis tests to compare baseline characteristics report fewer significant results than expected by chance. They suggest one plausible explanation is that some investigators may not report some variables with significant differences, believing that doing so would reduce the credibility of their reports. We have no evidence to suggest this is occurring in the development literature, and hope the profession can use this first table in a manner that does not lead to the temptation for such abuse. In particular, we urge referees and editors to view a lack of balance on one or two variables in a randomized experiment as simply the result of chance, not a reason per se to reject a paper.[28] And the criterion for robustness should be whether these variables are believed to be strongly correlated with the outcome of interest (authors can provide correlations between baseline variables and the outcome as a guide), rather than whether the p-value for a difference in means is below 0.05 or not.

[28] Unless there is a reason to suspect interference in the randomization, in which case a pattern of many variables showing systematic differences in means at high levels of significance may raise red flags. Another case in which Table 1 could raise red flags is if there is attrition, and observations in the same strata or pair are not dropped from the analysis. In this case, Table 1 could reveal whether observables are still balanced after attrition.

So how should we interpret such tables? The first question of interest in practice is: given that such a test shows a statistically significant difference in baseline means, does this make it more likely that there is also a statistically significant difference in follow-up means in the absence of treatment? The answer is yes, provided that the
baseline data have predictive power for the followup outcomes see Web Appendix 4 The second question of interest is If we observe statistical imbalance at baseline but control for baseline variables in our analysis are we more likely to observe imbalance at followup than if we had obtained a random draw that didnt show baseline imbalance To examine this question we take 10000 simulations of a sin gle random draw and divide them into two sets The first set includes all draws that had a statistically significant difference at the 5 percent level in at least 1 of our 7 baseline variables We call this the unbalanced set The second set is the bal anced set and includes all other draws The top panels of Figure 2 panels A and B show the distribution of the differences in means between treatment and control for baseline labor income and baseline math test scores are more tightly concentrated around zero in the balanced set than the unbalanced set29 The middle panels show that these differences are less pronounced but still persist at followup again show ing that imbalance in baseline makes it more likely to have imbalance at followup However once we control for the seven baseline variables the distributions of a test of no treatment effect in the followup outcome when no treatment was given is identical regardless of whether or not there was baseline imbalance Intuitively when randomization is used to allocate units into treatment and con trol groups if we do find unbalanced baseline characteristics once we control for them the remaining unobservables are no more or less likely to be unbalanced than if we did not find unbalanced baseline characteristics However as recommended by Altman 1985 we should choose which baseline characteristics to control for not on the basis of statistical differences but on the strength of their relationship to the outcome of interest IV Conclusions Our surveys of the recent literature and of the most experienced researchers imple menting 
randomized experiments in developing countries find that most researchers are not relying on pure randomization but are doing something to pursue balance on observables In addition to stratification we find pairwise matching and reran domization methods to be used much more than is apparent from the existing devel opment literature The paper draws out implications from the existing statistical clinical and social science literature on the pros and cons of these various meth ods of seeking balance and compares the performance of the different methods in simulations 29 Web Appendix A3 presents the same figures for other outcome variables and sample sizes They all show the same patterns as in Figure 9 228 AmEricAN EcONOmic JOurNAL APPLiEd EcONOmics OcTOBEr 2009 Our simulation results show the method of randomization matters more in small sample sizes such as 30 or 100 observations and matters more for relatively persis tent outcome variables such as health and test scores than for less persistent out come variables such as microenterprise profits or household expenditure Overall we find pairwise matching to perform best in achieving balance in small samples provided that the variables used in forming pairs have good predictive power for future outcomes Stratification and rerandomization using a minmax method also lead to some improvements over a pure random draw but in the majority of our simulations are dominated by pairwise matching With sample sizes of 300 we find that the method of randomization matters much less although matching still leads to some improvement in balance for the persistent outcomes Our analysis of how randomization is being carried out in practice suggests sev eral areas where the practice of randomization can be improved or better reported This leads us to draw out the following recommendations 1 Better reporting of the method of random assignment is needed This should include a description of a Which randomization method was used and why b 
Which variables were used for balancing c For stratification how many strata were used d For rerandomization which cutoff rules were used This is particularly important for experiments with small samples where the randomization method makes more difference 2 clearly describe how the randomization was carried out in practice a Who performed the randomization b How was the randomization done coin toss random number generator etc c Was the randomization carried out in public or private 0 00001 00002 00003 00004 Baseline 0 00001 00002 00003 Followup 0 00001 00002 00003 Followup w controls 4000 2000 0 2000 4000 Difference in average outcome Balanced Unbalanced 0 0005 001 0015 002 0025 0 0005 001 0015 002 0 0005 001 0015 0025 100 50 0 50 100 Difference in average outcome Panel A ENE labor income data 100 observations Panel B LEAPS math test score data 300 observations 002 Figure 2 If We Observe Baseline Imbalance and Control for Baseline Variables Is There Any Difference in Followup Balance VOL 1 NO 4 229 BruHN ANd mckENziE iN PursuiT Of BALANcE 3 rethink the common use of rerandomization Our simulations find pairwise matching to generally perform as well or better than rerandomization in terms of balance and power and like rerandomization matching allows balance to be sought on more variables than possible under stratification Adjusting for the method of randomization is statistically cleaner with matching or strati fication than with rerandomization If rerandomization is used the authors should justify why rerandomization was preferred to the other methods of randomization 4 When deciding which variables to balance on strongly consider the base line outcome variable and geographic region dummies in addition to vari ables desired for subgroup analysis In practice few existing studies stratify on baseline values of the outcome of interest Yet in all of our datasets the baseline outcome variable is the one that is most strongly correlated with the future outcome 
Justification for regional stratification comes from the fact that treatment implementation and shocks are likely to vary by region 5 Be aware that overstratification can lead to a loss of power in extreme cases This is because using a large number of strata involves a downside in terms of loss in degrees of freedom when estimating standard errors possibly more cases of missing observations and odd numbers within strata when stratifi cation is used We find a loss in power only in an extreme case where we stratify on independently and identically distributed noise and have more strata than observations In practice researchers are unlikely to pursue balance to this extreme meaning that overstratification is unlikely to occur in practice However there is still a tradeoff between stratifying or matching on more variables and achieving closer balance on a smaller number of variables 6 As ye randomize so shall ye analyze Stephen Senn 2004 Our simulations show that while on average failure to account for the method of randomization generally results in overly conservative standard errors there are also a sub stantial number of draws in which standard errors that do not account for the method of randomization overstate the significance of the results Moreover failure to control for the method of randomization results in incorrect test size and low power In general we feel that it is important to follow a standard rule here to avoid ex post decision making of whether to control for the method of randomization or not We recommend that the standard should be to control for the method of randomization30 Since the majority of inference in economics is modelbased rather than randomization inference this means adding con trols for all covariates used in seeking balance That is strata dummies should be included when analyzing the results of stratified randomization Similarly pair dummies should be included for matched randomization or linear vari ables used for rerandomizations 30 
If authors believe they have a valid reason not to control for stratum dummies, they should explain this reasoning in their text, and also mention what the results would be if stratum dummies were included.

7. In the ex post analysis, do not automatically control for baseline variables that show a statistically significant difference in means. The previous literature and our simulations suggest that it is a better rule to control for variables that are thought to influence follow-up outcomes, independent of whether their difference in means is statistically significant or not. When there are several such variables and not all of them can be included in the analysis, correlations between baseline variables and follow-up data can be checked explicitly to pick the variables that are most strongly correlated with follow-up outcomes. One should still be cautious in the use of ex post controls, given the potential for finite-sample bias if treatment heterogeneity is correlated with the square of these covariates.

REFERENCES

Aickin, Mikel. 2001. Randomization, Balance, and the Validity and Efficiency of Design-Adaptive Allocation Methods. Journal of Statistical Planning and Inference, 94(1): 97-119.
Altman, Douglas G. 1985. Comparability of Randomized Groups. Statistician, 34: 125-36.
Andrabi, Tahir, Jishnu Das, Asim I. Khwaja, and Tristan Zajonc. 2008. Do Value-Added Estimates Add Value? Accounting for Learning Dynamics. Harvard University Center for International Development Working Paper 158.
Ashraf, Nava, James Berry, and Jesse M. Shapiro. 2007. Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia. National Bureau of Economic Research Working Paper 13247.
Ashraf, Nava, Dean Karlan, and Wesley Yin. 2006a. Deposit Collectors. B.E. Journal of Economic Analysis and Policy: Advances in Economic Analysis and Policy, 6(2): 1-22.
Ashraf, Nava, Dean Karlan, and Wesley Yin. 2006b. Tying Odysseus to the Mast: Evidence from a Commitment Savings Product in the
Philippines. Quarterly Journal of Economics, 121(2): 635-72.
Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. Remedying Education: Evidence from Two Randomized Experiments in India. Quarterly Journal of Economics, 122(3): 1235-64.
Banerjee, Abhijit Vinayak, with Alice H. Amsden, Robert H. Bates, Jagdish Bhagwati, Angus Deaton, and Nicholas Stern. 2007. Making Aid Work. Cambridge, MA: MIT Press.
Bertrand, Marianne, Simeon Djankov, Rema Hanna, and Sendhil Mullainathan. 2007. Obtaining a Driver's License in India: An Experimental Approach to Studying Corruption. Quarterly Journal of Economics, 122(4): 1639-76.
Björkman, Martina, and Jakob Svensson. 2009. Power to the People: Evidence from a Randomized Field Experiment on Community-Based Monitoring in Uganda. Quarterly Journal of Economics, 124(2): 735-69.
Bobonis, Gustavo J., Edward Miguel, and Charu Puri-Sharma. 2006. Anemia and School Participation. Journal of Human Resources, 41(4): 692-721.
Box, George E. P., J. Stuart Hunter, and William G. Hunter. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd ed. Hoboken, NJ: Wiley-Interscience.
Burtless, Gary. 1995. The Case for Randomized Field Trials in Economic and Policy Research. Journal of Economic Perspectives, 9(2): 63-84.
Committee for Proprietary Medicinal Products (CPMP). 2003. Points to Consider on Adjustment for Baseline Covariates. The European Agency for the Evaluation of Medicinal Products, Evaluation of Medicines for Human Use Report CPMP/EWP/2863/99. httpwwwemeaeuropaeupdfs humanewp286399enpdf (accessed February 6, 2008). London: EMEA.
Deaton, Angus. 2008. Instruments of Development: Randomization in the Tropics and the Search for the Elusive Keys to Economic Development. The Keynes Lecture in Economics, British Academy, London, October 9, 2008.
de Mel, Suresh, David McKenzie, and Christopher Woodruff. 2008. Returns to Capital in Microenterprises: Evidence from a Field Experiment. Quarterly Journal of Economics, 123(4): 1329-72.
Duflo, Esther. 2005. Evaluating the Impact of Development Aid Programmes: The Role of Randomised Evaluations. In
Development Aid: Why and How? Toward Strategies for Effectiveness. Proceedings of the AFD-EUDN Conference 2004, 205-47. Paris: Agence Française de Développement. httpwwwafdlibrevilleorgjahiawebdavsiteafdusersadministrateurpublic publicationsnotesetdocumentsND22pdfpage208
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. Using Randomization in Development Economics Research: A Toolkit. In Handbook of Development Economics, Vol. 4, ed. T. Paul Schultz and John Strauss, 3895-3962. Amsterdam, NH: North Holland.
Duflo, Esther, Rema Hanna, and Stephen Ryan. 2007. Monitoring Works: Getting Teachers to Come to School. Bureau of Research and Economic Analysis Development (BREAD) Working Paper 103.
Duflo, Esther, and Michael Kremer. 2004. Use of Randomization in the Evaluation of Development Effectiveness. In Evaluating Development Effectiveness, Vol. 7, World Bank Series on Evaluation and Development, ed. George Keith Pitman, Osvaldo N. Feinstein, and Gregory K. Ingram, 205-32. Piscataway, NJ: Transaction Publishers.
Dupas, Pascaline. 2006. Relative Risks and the Market for Sex: Teenagers, Sugar Daddies, and HIV in Kenya. httpwwwinternationalpolicyumicheduedtspdfsDupasRelativeRiskspdf
Ferraz, Claudio, and Frederico Finan. 2008. Exposing Corrupt Politicians: The Effects of Brazil's Publicly Released Audits on Electoral Outcomes. Quarterly Journal of Economics, 123(2): 703-45.
Field, Erica, and Rohini Pande. 2008. Repayment Frequency and Default in Microfinance: Evidence from India. Journal of the European Economic Association, 6(2-3): 501-09.
Fisher, Ronald A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
Freedman, David A. 2008. On Regression Adjustments in Experiments with Several Treatments. Annals of Applied Statistics, 2(1): 176-96.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz. 2004. Retrospective vs. Prospective Analyses of School Inputs: The Case of Flip Charts in Kenya. Journal of Development Economics, 74(1): 251-68.
Glewwe, Paul, Albert Park, and Meng Zhao. 2006. The Impact of
Eyeglasses on Academic Performance of Primary School Students: Evidence from a Randomized Trial in Rural China. University of Minnesota Center for International Food and Agricultural Products Conference Paper 6644.
Greevy, Robert, Bo Lu, Jeffrey H. Silber, and Paul Rosenbaum. 2004. Optimal Multivariate Matching Before Randomization. Biostatistics, 5(2): 263-75.
He, Fang, Leigh L. Linden, and Margaret MacLeod. 2007. Teaching What Teachers Don't Know: An Assessment of the Pratham English Language Program. httpwwwcolumbiaedumgm2115 PicTalk20Working20Paper2020070326pdf
Imai, Kosuke, Gary King, and Clayton Nall. Forthcoming. The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation. Statistical Science.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. Misunderstandings between Experimentalists and Observationalists About Causal Inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2): 481-502.
Imbens, Guido, Gary King, David McKenzie, and Geert Ridder. 2009. On the Finite Sample Benefits of Stratification in Randomized Experiments. Unpublished.
Karlan, Dean, and Martin Valdivia. 2006. Teaching Entrepreneurship: Impact of Business Training on Microfinance Clients and Institutions. httpaidaeconyaleedukarlanpapersTeaching Entrepeneurshippdf
Kernan, Walter N., Catherine M. Viscoli, Robert W. Makuch, Lawrence M. Brass, and Ralph I. Horwitz. 1999. Stratified Randomization for Clinical Trials. Journal of Clinical Epidemiology, 52(1): 19-26.
King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez-Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. 2007. A "Politically Robust" Experimental Design for Public Policy Evaluation, with Application to the Mexican Universal Health Insurance Program. Journal of Policy Analysis and Management, 26(3): 479-506.
Kremer, Michael. 2003. Randomized Evaluations of Educational Programs in Developing Countries: Some Lessons.
American Economic Review, 93(2): 102-06.
Kremer, Michael, Jessica Leino, Edward Miguel, and Aliz Peterson Zwane. 2006. Spring Cleaning: A Randomized Evaluation of Source Water Quality Improvement. httpwwwsscnetuclaedu polisciwgapepapers11Miguelpdf
Levitt, Steven D., and John A. List. 2008. Field Experiments in Economics: The Past, The Present, and The Future. National Bureau of Economic Research Working Paper 14356.
Martin, Donald C., Paula Diehr, Edward B. Perrin, and Thomas D. Koepsell. 1993. The Effect of Matching on the Power of Randomized Community Intervention Studies. Statistics in Medicine, 12(3-4): 329-38.
Miguel, Edward, and Michael Kremer. 2004. Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities. Econometrica, 72(1): 159-217.
Olken, Benjamin A. 2007a. Monitoring Corruption: Evidence from a Field Experiment in Indonesia. Journal of Political Economy, 115(2): 200-249.
Olken, Benjamin A. 2007b. Political Institutions and Local Public Goods: Evidence from a Field Experiment in Indonesia. httpwwwpovertyactionlaborgpapers51OlkenPoliticalInstitutions pdf
Permutt, Thomas. 1990. Testing for Imbalance of Covariates in Controlled Experiments. Statistics in Medicine, 9(12): 1455-62.
Pocock, S. J., and R. Simon. 1975. Sequential Treatment Assignment with Balancing for Prognostic Factors in the Controlled Clinical Trial. Biometrics, 31: 103-15.
Raab, Gillian M., Simon Day, and Jill Sales. 2000. How to Select Covariates to Include in the Analysis of a Clinical Trial. Controlled Clinical Trials, 21(4): 330-42.
Rosenbaum, Paul R. 2002. Observational Studies. 2nd ed. New York: Springer.
Schulz, Kenneth F. 1996. Randomised Trials, Human Nature, and Reporting Guidelines. Lancet, 348(9027): 596-98.
Schulz, Kenneth F., and David A. Grimes. 2002. Allocation Concealment in Randomised Trials: Defending Against Deciphering. Lancet, 359(9306): 614-18.
Scott, Neil W., Gladys C. McPherson, Craig R. Ramsay, and Marion K. Campbell. 2002. The Method of Minimization for Allocation to Clinical Trials: A Review. Controlled Clinical
Trials, 23(6): 662-74.
Senn, Stephen. 2004. Added Values: Controversies Concerning Randomization and Additivity in Clinical Trials. Statistics in Medicine, 23(24): 3729-53.
Skoufias, Emmanuel. 2005. PROGRESA and Its Impact on the Welfare of Rural Households in Mexico. International Food Policy Research Institute (IFPRI) Research Report 139.
Soares, Jose F., and C. F. Jeff Wu. 1983. Some Restricted Randomization Rules in Sequential Designs. Communications in Statistics: Theory and Methods, 12(17): 2017-34.
Taves, D. R. 1974. Minimization: A New Method of Assigning Patients to Treatment and Control Groups. Clinical Pharmacology and Therapeutics, 15(5): 443-53.
Therneau, Terry M. 1993. How Many Stratification Factors Are "Too Many" to Use in a Randomization Plan? Controlled Clinical Trials, 14(2): 98-108.
Treasure, Tom, and Kenneth D. MacRae. 1998. Minimisation: The Platinum Standard for Trials? British Medical Journal, 317(7155): 362-63.

1. How important is randomization of treatment in social experiments for identifying the causal effect?

Answer: Randomizing treatment in social experiments is essential for identifying the causal effect because it ensures that the allocation of participants to the treatment and control groups is determined by chance alone. This eliminates selection bias and makes the groups comparable, on average, in both observed and unobserved characteristics before the experiment begins. Any differences in outcomes between the groups can therefore be reliably attributed to the treatment rather than to external factors or intrinsic characteristics of the participants.

2. Describe the method of simple randomization of treatment. What are its advantages and its possible problems?

Answer: Simple randomization distributes individuals or institutions at random between the treatment and control groups without taking any specific characteristics of the participants into account. This ensures that the allocation is
entirely random, so that the groups are, on average, similar in both observable and unobservable characteristics. One of its main advantages is simplicity: it is easy to implement and to understand, and it provides a direct way of assigning treatment. Simple randomization is also perceived as a fair and transparent procedure, which eases acceptance by other researchers and by policymakers. Another benefit is credibility: because the method relies on chance alone, the validity of the experiment is hard to contest, since intentional bias in the allocation of the groups is ruled out. However, the method can run into problems, especially in small samples, where a substantial imbalance between the groups on important characteristics is more likely, compromising comparability and the precision of the results. In such cases, ex post adjustments to correct the imbalances can be inefficient, resulting in a loss of precision in the estimated treatment effect. In addition, simple randomization makes the composition of the groups unpredictable, producing potentially large variation in key characteristics, which can affect the interpretation of the results and the robustness of the conclusions.

3. Researchers usually place at the start of a paper a table with the means of each pre-treatment covariate, separately for treated and control units, and a column with tests of differences in means between the two groups. What does such a table allow one to assess? What are the limitations of this analysis?

Answer: The table reporting the means of each pre-treatment covariate for the treatment and control groups, together with tests of differences in means, lets researchers assess whether the groups are similar in observable characteristics before the intervention. This baseline balance check is central to confirming that the randomization worked, ensuring that there are no
systematic imbalances between the groups, which is essential for a reliable causal interpretation of the treatment results. However, this analysis has significant limitations. First, it covers only the observable variables included in the table and cannot guarantee balance on unobservable variables that may also influence the outcomes of the experiment. Second, tests of differences in means may lack the statistical power to detect imbalances in small samples, even when an imbalance exists. The absence of statistically significant differences in the means of pre-treatment covariates therefore does not necessarily imply complete balance. Finally, the use of these tests can be problematic when researchers decide which covariates to include in the analysis, since that selection can influence the interpretation of the results and bias the choice toward the variables that show the greatest imbalance.

4. How should one decide which covariates to consider when assessing the quality of the randomization? What are the problems with using pre-treatment covariates?

Answer: To decide which covariates to consider when assessing the quality of the randomization, it is recommended to choose variables that are strongly related to the outcome of interest. It is also advisable to include covariates that represent factors important for subgroup analysis, since this improves the statistical precision of the experiment and increases the ability to detect the treatment effect. Ideally, one should balance on the baseline value of the outcome variable, since it usually has the strongest correlation with future outcomes. Geographic variables are also recommended, since they can capture regional variation that influences outcomes, whether through differences in local shocks or in the implementation of the treatment. However, the use of
pre-treatment covariates raises some problems. First, these variables may have limited predictive power for the future outcome, especially where outcomes are more susceptible to unexpected changes or to factors the covariates do not capture. This can create a false sense of balance between the groups, since the covariate may not adequately predict future variation in the outcome of interest. In addition, increasing the number of covariates used can result in less balance on each individual variable, complicating the analysis and undermining the effectiveness of the randomization. Finally, an excessive number of covariates increases the complexity of the analysis, making the results harder to interpret and stratification or matching harder to implement, especially in small samples.

5. Describe the method of stratified randomization. What are the advantages and disadvantages of this method?

Answer: Stratified randomization divides participants into homogeneous subgroups (strata) based on one or more observable characteristics before treatment is allocated. Within each stratum, individuals are randomly assigned to the treatment or control group, which ensures balance between the groups on the characteristics used to form the strata. The advantages of the method include better balance between the groups on the stratifying variables, which increases the sensitivity of the experiment by allowing smaller differences in the treatment effect to be detected. Stratification also helps to reduce the variability of the results, increasing precision and statistical efficiency. Another benefit is the possibility of more detailed subgroup analysis, since stratification guarantees a sufficient number of individuals in each group for comparison. However, stratified randomization
has some disadvantages. The main limitation is the restriction on the number of variables that can be used to create strata, especially in small samples: the number of strata grows exponentially with the number of variables, which can produce strata with few participants or even empty strata. This can compromise the efficiency of the experiment and complicate the statistical analysis. Moreover, stratification works best for categorical variables and is less suited to continuous variables, which cannot be perfectly balanced within strata. Finally, if the stratification variables are not controlled for in the subsequent analysis, standard errors tend on average to be overly conservative, which reduces the ability to detect the treatment effect.

6. Which method is superior, simple randomization or stratified randomization? Why? In which situations should one be careful with stratification?

Answer: Stratified randomization is generally considered superior to simple randomization, especially when greater balance on observable covariates is sought. Stratification improves the balance of key variables between the treatment and control groups, increasing statistical precision and the sensitivity of the experiment to treatment effects. By reducing variability within strata, stratification contributes to more precise and reliable estimates, above all in small or moderate samples, where the risk of imbalance between the groups is greater. In addition, by securing balance on specific characteristics in advance, stratification allows a more robust subgroup analysis, strengthening the internal validity of the study. Nevertheless, stratification should be used with caution in certain situations. First, in small samples, using many variables to create strata can result in strata with few or no participants, compromising
the effectiveness of the randomization and the feasibility of the experiment. It is also important to bear in mind that stratification works best with categorical variables and is more challenging for continuous variables, which cannot be perfectly balanced within strata. Another point of attention is that if the strata are not properly controlled for in the subsequent statistical analysis, inference tends to be overly conservative, lowering the ability to detect the treatment effect. Finally, in large samples the advantage of stratification over simple randomization tends to shrink, since all randomization methods achieve balance as the sample size grows.

7. Describe the randomization method based on pair-wise matching. What are the advantages and disadvantages of this method? Why can one not use many covariates for matching in small samples?

Answer: Pair-wise matched randomization forms pairs of units with similar characteristics before treatment is allocated. Within each pair, one unit is randomly assigned to the treatment group and the other to the control group. The goal is to minimize the distance between the characteristics of the members of each pair, ensuring greater balance on observable covariates and increasing the comparability of the groups. The advantages of the method include greater covariate balance between the treatment and control groups, which improves the precision of the estimates and increases the statistical power of the experiment. Matching also offers partial protection against political interference or drop-out in social experiments. For example, if one unit of a pair is lost during the study, the other unit of the pair can be removed, preserving balance among the remaining groups. However, the method has some disadvantages. Forming pairs can be more complex and time-consuming, especially in larger samples or in
settings where field time is limited. In addition, if participants drop out at random, dropping their pair partners as well can shrink the sample considerably, lowering the statistical power of the study. Another disadvantage is that matching cannot guarantee balance on continuous variables as effectively as on categorical variables, and the matching procedure can become infeasible when there are many covariates to balance. In small samples, using many covariates for matching can produce poorly matched or even infeasible pairs, lowering the efficiency of the experiment. This happens because, as the number of covariates grows, it becomes harder to find pairs with sufficiently similar characteristics, which can lead to forced pairs or to a reduced number of possible pairs. Excessive use of covariates can thus compromise balance on the main variables and limit the ability of the study to detect treatment effects.

8. What do rerandomization methods consist of? What problems can arise?

Answer: Rerandomization methods carry out multiple random allocations to treatment and control and then select the allocation that shows the best balance on a set of observable covariates. There are two main approaches: the "big stick" method, in which the randomization is redrawn whenever there is a statistically significant difference in covariate means between the groups, and the approach that picks the allocation with the smallest maximum t-statistic among a large number of iterations (for example, 1,000 draws), thereby securing more uniform balance. Rerandomization methods have the advantage of giving more control over balance across multiple variables, especially in small samples, where a single random draw can produce substantial imbalances. In addition, rerandomization can be
an effective solution when there are several variables on which extreme imbalance should be avoided, without requiring exact balance on each of them as stratification does. However, there are possible problems associated with these methods. First, rerandomization can make the allocation process less transparent and more complex, since it involves additional criteria for choosing the final allocation, which can make the experiment harder to replicate. In addition, these methods can distort the subsequent analysis if the rerandomization is not accounted for, since treatment assignment is no longer the result of a single purely random draw. This can lead to errors in statistical inference, such as overly conservative variance estimates or incorrect test size. Finally, the use of subjective criteria for deciding when to redraw can compromise the validity of the results, especially if those criteria are not clearly defined before the study begins.

9. How important is sample size? In what sense do larger samples improve the quality of the randomization?

Answer: Sample size is fundamental to the quality of the randomization in experiments. Larger samples tend to improve the balance between the treatment and control groups, because the probability of extreme differences between the characteristics of the groups falls as the number of participants rises. With more units, the random distribution of characteristics in the two groups comes closer to the population distribution, resulting in more effective balance on both observable and unobservable variables. In small samples, substantial imbalances between the groups are more common, which can compromise comparability and, consequently, the validity of the experimental results. Ex post adjustments to correct these imbalances are less efficient than balance
achieved ex ante, since adjusting afterwards can yield less precise estimates of the treatment effect. Larger samples therefore not only reduce the variability of the estimates and increase statistical precision, but also make it more likely that any observed difference in outcomes is attributable to the treatment rather than to an imbalance in the baseline characteristics of the groups. In short, increasing the sample size strengthens the internal validity of the experiment and makes the identification of the intended causal effects more reliable.

10. Using the dataset experimentaldta, compute the means and standard deviations of the pre-treatment covariates and test the differences in means between treated and control units. Also compute the effect of the treatment on the outcome variable re78.

Answer:

Means and standard deviations (descriptive statistics; T = 1 treated, T = 0 control)

Statistic  T  age   educ  black  hisp    marr   re74   re75   re78   u74    u75
N          0  260   260   260    260     260    260    260    260    260    260
N          1  185   185   185    185     185    185    185    185    185    185
Missing    0  0     0     0      0       0      0      0      0      0      0
Missing    1  0     0     0      0       0      0      0      0      0      0
Mean       0  25.1  10.1  0.827  0.108   0.154  2107   1267   4555   0.750  0.685
Mean       1  25.8  10.3  0.843  0.0595  0.189  2096   1532   6349   0.708  0.600
Median     0  24.0  10.0  1.00   0.00    0.00   0.00   0.00   3139   1.00   1.00
Median     1  25.0  11.0  1.00   0.00    0.00   0.00   0.00   4232   1.00   1.00
Std. dev.  0  7.06  1.61  0.379  0.311   0.361  5688   3103   5484   0.434  0.466
Std. dev.  1  7.16  2.01  0.365  0.237   0.393  4887   3219   7867   0.456  0.491
Minimum    0  17.0  3.00  0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00
Minimum    1  17.0  4.00  0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00
Maximum    0  55.0  14.0  1.00   1.00    1.00   39571  23032  39484  1.00   1.00
Maximum    1  48.0  16.0  1.00   1.00    1.00   35040  25142  60308  1.00   1.00

Test of differences in means (Student's t-test for independent samples)

Variable  |t|     df   p
age       1.1166  443  0.265
educ      1.4958  443  0.135
black     0.4548  443  0.649
hisp      1.7757  443  0.076
marr      0.9804  443  0.327
re74      0.0222  443  0.982
re75      0.8746  443  0.382
re78      2.8353  443  0.005
u74       0.9829  443  0.326
u75       1.8466  443  0.065

Effect of the treatment on the outcome variable (regression of re78 on T)

Predictor  Estimate  Std. error  t      p
Intercept  4555      408         11.16  < .001
T          1794      633         2.84   0.005

Model  R      R²
1      0.134  0.0178

11. Redo the previous exercise with a random subsample of size 30. Is there any difference in the results?

Answer:

Means and standard deviations

Statistic  T  age   educ  black  hisp    marr   re74   re75  re78   u74    u75
N          1  15    15    15     15      15     3      2     4      15     15
N          0  15    15    15     15      15     15     15    8      15     15
Missing    1  0     0     0      0       0      12     13    11     0      0
Missing    0  0     0     0      0       0      0      0     7      0      0
Mean       1  26.9  10.7  0.733  0.200   0.467  6760   7256  12363  0.133  0.00
Mean       0  25.7  9.47  0.733  0.0667  0.133  0.00   0.00  1475   1.00   1.00
Median     1  26    11    1      0       0      0.00   7256  6402   0      0
Median     0  24    9     1      0       0      0.00   0.00  0.00   1      1
Std. dev.  1  6.03  1.53  0.458  0.414   0.516  11709  1696  17278  0.352  0.00
Std. dev.  0  7.56  2.33  0.458  0.258   0.352  0.00   0.00  4171   0.00   0.00
Minimum    1  17    8     0      0       0      0.00   6057  0.00   0      0
Minimum    0  18    4     0      0       0      0.00   0.00  0.00   1      1
Maximum    1  35    14    1      1       1      20280  8456  36647  1      0
Maximum    0  45    14    1      1       1      0.00   0.00  11797  1      1

Test of differences in means (Student's t-test)

Variable  |t|     df    p
age       0.481   28.0  0.635
educ      1.761   28.0  0.089
black     0.000   28.0  1.000
hisp      1.058   28.0  0.299
marr      2.066   28.0  0.048
re74      2.582   16.0  0.020
re75      22.010  15.0  < .001
re78      1.763   10.0  0.108
u74       9.539   28.0  < .001
u75       NaN

Effect of the treatment on the outcome variable (regression of re78 on T)

Predictor  Estimate  Std. error  t      p
Intercept  1475      3566        0.413  0.688
T          10888     6177        1.763  0.108

Model  R      R²
1      0.487  0.237

Comparing the full dataset with the random subsample of 30 observations reveals notable differences in the results. The means and standard deviations of the pre-treatment covariates vary substantially between the full sample and the subsample. In the full sample the covariates take more stable values, whereas the subsample shows greater variability, especially in variables such as re74 and re75, whose means are higher. The variables u74 and u75, in turn, show more
extreme distributions in the subsample, an expected result given the smaller sample size and the resulting greater statistical variability.

The tests of differences in means between treated and control units give markedly different results in the two analyses. In the full sample, only the difference in re78 is statistically significant at the 5 percent level, with a p-value of 0.005, indicating a consistent difference between the groups. In the subsample, other covariates such as marr and re74 show statistically significant differences, while re78 loses its statistical significance, reflecting the lower statistical precision of the subsample.

As for the effect of the treatment on the outcome variable re78, in the full sample the effect is statistically significant, with a coefficient of 1794 and a p-value of 0.005. In the subsample of 30 observations, by contrast, the treatment effect is not significant: the coefficient is larger (10888) but the p-value is 0.108, which points to a wider confidence interval and less precise estimates because of the smaller sample size. In sum, the comparison shows that reducing the sample size affects the precision of the estimates and the detection of statistically significant differences, resulting in greater variability and less robust results.
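The calculations behind exercises 10 and 11 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the earnings data are synthetic draws (the experimentaldta file is not read), the helper names `diff_in_means` and `ols_on_dummy` are hypothetical, and the subsample takes 15 treated and 15 control units to mirror the group sizes reported above. It shows that the pooled-variance t-test has n1 + n0 - 2 degrees of freedom, and that the coefficient on T in a regression of the outcome on a constant and T equals the difference in group means.

```python
import math
import random
import statistics


def diff_in_means(treated, control):
    """Pooled-variance two-sample t-test (treated minus control).

    Returns (difference, standard error, t statistic, degrees of freedom).
    """
    n1, n0 = len(treated), len(control)
    diff = statistics.fmean(treated) - statistics.fmean(control)
    pooled_var = ((n1 - 1) * statistics.variance(treated)
                  + (n0 - 1) * statistics.variance(control)) / (n1 + n0 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n0))
    return diff, se, diff / se, n1 + n0 - 2


def ols_on_dummy(y, t):
    """OLS of y on a constant and a 0/1 dummy t; returns (intercept, slope)."""
    t_bar, y_bar = statistics.fmean(t), statistics.fmean(y)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


# ASSUMPTION: synthetic earnings, censored at zero, standing in for re78.
# The group sizes (260 controls, 185 treated) follow the tables above;
# the distribution parameters are illustrative only.
rng = random.Random(0)
control = [max(0.0, rng.gauss(4500, 5500)) for _ in range(260)]
treated = [max(0.0, rng.gauss(6300, 7800)) for _ in range(185)]

# Full sample: t-test and the equivalent regression on the treatment dummy.
full = diff_in_means(treated, control)
y = control + treated
t = [0] * len(control) + [1] * len(treated)
intercept, slope = ols_on_dummy(y, t)

# Subsample of 30: 15 treated and 15 controls drawn without replacement.
sub = diff_in_means(rng.sample(treated, 15), rng.sample(control, 15))

print("full sample: diff=%.1f se=%.1f t=%.2f df=%d" % full)
print("regression : intercept=%.1f slope=%.1f" % (intercept, slope))
print("subsample  : diff=%.1f se=%.1f t=%.2f df=%d" % sub)
```

Because the standard error scales with the square root of (1/n1 + 1/n0), cutting each group to 15 units inflates it several times over, which is why a sizeable point estimate can still fail to reach significance in the small subsample.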
Universidade Federal de Pelotas
Programa de Pós-Graduação em Organizações e Mercados (PPGOM)
Professors: André Carraro, Felipe Garcia

Problem Set 1 (Lista 1)

1. What is the importance of randomizing the treatment in social experiments for the identification of the causal effect?
2. Describe the method of simple randomization of the treatment. What are its advantages and what problems may arise?
3. Researchers usually place, at the start of the paper, a table with the means of each pre-treatment covariate for treated and controls separately, and a column with tests of differences in means between the two groups. What does such a table allow one to assess? What are the limitations of this analysis?
4. How should one decide which covariates to consider when assessing the quality of the randomization? What are the problems with using pre-treatment covariates?
5. Describe the method of randomization with stratification. What are its advantages and disadvantages?
6. Which method is superior, simple randomization or randomization with stratification? Why? In which situations should one be careful with stratification?
7. Describe the randomization method that matches units in pairs. What are its advantages and disadvantages? Why can one not use many covariates for the matching in small samples?
8. What do rerandomization methods consist of? What problems may arise?
9. What is the importance of sample size? In what sense do larger samples improve the quality of the randomization?
10. Using the dataset experimental.dta, compute the means and standard deviations of the pre-treatment covariates and test the differences in means between treated and controls. Also compute the effect of the treatment on the outcome of interest, re78.
11. Redo the previous exercise with a random subsample of size 30. Do the results differ?

American Economic Journal: Applied Economics 2009, 1:4, 200-232
http://www.aeaweb.org/articles.php?doi=10.1257/app.1.4.200

In Pursuit of Balance: Randomization in Practice in Development Field Experiments

By Miriam Bruhn and David McKenzie*

We present new evidence on the randomization methods used in existing experiments, and new simulations comparing these methods. We find that many papers do not describe the randomization in detail, implying that better reporting is needed. Our simulations suggest that in samples of 300 or more, the different methods perform similarly. However, for very persistent outcome variables, and in smaller samples, pairwise matching and stratification perform best and appear to dominate the rerandomization methods commonly used in practice. The simulations also point to specific recommendations for which variables to balance on, and for which controls to include in the ex post analysis. (JEL C83, C93, O12)

* Bruhn: Development Research Group, World Bank, MSN MC3-307, 1818 H Street N.W., Washington, DC 20433 (e-mail: mbruhn@worldbank.org); McKenzie: Development Research Group, World Bank, MSN MC3-307, 1818 H Street N.W., Washington, DC 20433 (e-mail: dmckenzie@worldbank.org). We thank the leading researchers in development field experiments who participated in our short survey, as well as colleagues who have shared their experiences with implementing randomization. We thank Angus Deaton, David Evans, Xavier Gine, Guido Imbens, Ben Olken, and seminar participants at the World Bank for helpful comments. We are also grateful to Radu Ban for sharing his pairwise matching Stata code, Jishnu Das for the LEAPS data, and Kathleen Beegle and Kristen Himelein for providing us with their constructed IFLS data. All views are, of course, our own. To comment on this article in the online discussion forum, or to view additional materials, visit the articles page at http://www.aeaweb.org/articles.php?doi=10.1257/app.1.4.200.

Randomized experiments are increasingly used in development economics. Historically, many randomized experiments were large-scale, government-implemented social experiments, such as Moving To Opportunity in the United States or Progresa/Oportunidades in Mexico. These experiments allowed for little involvement by researchers in the actual randomization. In contrast, in recent years, many experiments have been directly implemented by researchers themselves, or in partnership with nongovernmental organizations (NGOs) and the private sector. These small-scale experiments, with sample sizes often comprising 100 to 500 individuals, or 20 to 100 schools or health clinics, have greatly expanded the range of research questions that can be studied using experiments, and have provided important and credible evidence on a range of economic and policy issues. Nevertheless, this move toward smaller sample sizes means researchers increasingly face the question of not just whether to randomize, but how to do so. This paper provides the first comprehensive look at how researchers are carrying out randomizations in development field experiments, and then analyzes some of the consequences of these choices.

Simple randomization ensures the allocation of treatment to individuals or institutions is left purely to chance, and is not systematically biased by deliberate selection of individuals or institutions into the treatment. Randomization thus ensures that the treatment and control samples are, in expectation, similar in average, both in terms of observed and unobserved characteristics. Furthermore, it is often argued that the simplicity of experiments offers considerable advantage in making the results convincing to other social scientists and policymakers and that, in some instances, random assignment is the fairest and most transparent way of choosing the recipients of a new pilot program (Gary Burtless 1995).

However, it has long been recognized that while pure random assignment guarantees that the treatment and control groups will have identical characteristics on average, in any particular random allocation, the two groups will differ along some dimensions, with the probability that such differences are large falling with sample size.[1]
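The way chance imbalance shrinks with sample size can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of any published calculation: it assumes a binary characteristic present with probability 0.30, an even split into treatment and control by a simple shuffle, and a threshold of 10 percentage points. The function name prob_imbalance and its parameters are hypothetical.

```python
import random


def prob_imbalance(n, prevalence=0.30, threshold_pp=10, reps=2000, seed=0):
    """Estimate P(|share treated - share control| > threshold_pp points)."""
    if n % 2:
        raise ValueError("n must be even so the two groups have equal size")
    rng = random.Random(seed)
    half = n // 2
    hits = 0
    for _ in range(reps):
        # Draw the binary characteristic for each unit.
        x = [1 if rng.random() < prevalence else 0 for _ in range(n)]
        # Simple randomization: shuffle, first half treated, second half control.
        rng.shuffle(x)
        treated, control = sum(x[:half]), sum(x[half:])
        # Compare counts with integer arithmetic to avoid float rounding
        # exactly at the threshold.
        if abs(treated - control) * 100 > threshold_pp * half:
            hits += 1
    return hits / reps


estimates = {n: prob_imbalance(n) for n in (50, 100, 200, 400)}
for n, p in estimates.items():
    print(f"n = {n:3d}: estimated probability of imbalance = {p:.3f}")
```

The estimated probability falls as n grows, which is the qualitative point made in the text; the exact figures depend on the assumed prevalence, threshold, number of replications and seed.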
Although ex post adjustment can be made for such chance imbalances, this is less efficient than achieving ex ante balance, and cannot be used in cases in which all individuals with a given characteristic are allocated to just the treatment group.

The standard approach to avoiding imbalance on a few key variables is stratification (or blocking), originally proposed by R. A. Fisher (1935). Under this approach, units are randomly assigned to treatment and control within strata defined by usually one or two observed baseline characteristics. However, in practice, it is unlikely that one or two variables will explain a large share of the variation in the outcome of interest, leading to attempts to balance on multiple variables. One such method, when baseline data are available, is pairwise matching (Robert Greevy et al. 2004; Kosuke Imai, Gary King, and Clayton Nall forthcoming).

The methods of implementing randomization have historically been poorly reported in medical journals, leading to the formulation of the CONSORT guidelines that set out standards for the reporting of clinical trials (Kenneth F. Schulz 1996). The recent explosion of field experiments in development economics has not yet met these same standards, with many papers omitting key details of the method in which randomization is implemented. For this reason, we conducted a survey of leading researchers carrying out randomized experiments in developing countries. This reveals common use of methods to improve baseline balance, including several rerandomization methods not discussed in print. These are (i) carrying out an allocation to treatment and control, and then using a statistical threshold or ad hoc procedure to decide whether or not to redraw the allocation, and (ii) drawing 100 or 1,000 allocations to treatment and control, and choosing the one that shows best balance on a set of observable variables.

This paper discusses the pros and cons of these different methods for striving toward balance on observables. Proponents of methods such as
stratification, matching, and minimization claim that such methods can improve efficiency, increase power, and protect against type I errors (Kernan et al. 1999), and do not seem to have significant disadvantages, except in small samples (Imai, King, and Elizabeth A. Stuart 2008; Greevy et al. 2004; Mikel Aickin 2001).[2] However, it is precisely in small samples that the choice of randomization method becomes important, since in large samples all methods will achieve balance. We simulate different randomization methods in four panel datasets. We then compare balance in outcome variables at baseline and at follow-up. The simulations show that when methods other than pure randomization are used, the degree of balance achieved on baseline variables is much greater than that achieved on the outcome variable (in the absence of treatment) in the follow-up period. The simulations further show that in samples of 300 observations or more, the choice of method is not very important for the degree of balance in many outcomes at follow-up. In small samples, and with very persistent outcomes, however, matching or stratification on relevant baseline variables achieves more balance in follow-up outcomes than does pure randomization.

[1] For example, Walter N. Kernan et al. (1999) consider a binary variable that is present in 30 percent of the sample. They show that the chance that the two treatment group proportions will differ by more than 10 percent is 38 percent in an experiment with 50 individuals, 27 percent in an experiment with 100 individuals, 9 percent for an experiment with 200 individuals, and 2 percent for an experiment with 400 individuals.

We use our simulation results and theory to help answer many of the important practical questions facing researchers engaged in randomized experiments. The results allow us to provide guidance on how to conduct inference after stratification, matching, or rerandomization. In practice, it appears that many researchers
ignore the method of randomization in inference. We show that this leads to hypothesis tests with incorrect size. On average, the standard errors are overly conservative when the method of randomization is not controlled for in the analysis, implying that researchers may not detect treatment effects that they would detect if the inference did take into account the randomization method. However, although this is the case on average, in a nontrivial proportion of draws, not controlling for the randomization method will result in smaller standard errors than if the randomization method is controlled for. Thus, it is possible that not controlling for the randomization method could lead the researcher to find a significant effect that is no longer significant when stratum or pair dummies are included. Moreover, we show that stratifying, matching, or rerandomizing, and then analyzing the data without controlling for the method of randomization, results in lower power than if a pure random draw was used to allocate treatments, except in cases in which the variables that balance is sought for have no predictive power for the future outcome of interest (in which case there is no need to seek balance on them anyway).

The paper also discusses the use and abuse of tests for baseline differences in means, the impact of balancing observables on achieving balance on unobservables, and the issue of how many (and which) variables to use for stratifying or matching. Finally, based on our simulation results and the previous econometric literature, this paper provides a list of actionable recommendations for researchers performing and reporting on randomized experiments.

This paper draws on a large literature of clinical trials, where many related issues have been under discussion for several decades, drawing out the lessons for development field experiments. It complements several recent papers in development on randomized experiments.[3] The paper builds on the recent handbook chapter by Esther Duflo, Rachel Glennerster, and Michael Kremer (2008), which aims to provide a "how to" of implementing experiments. Our focus differs, considering how the actual randomization is implemented in practice, and considering matching and rerandomization approaches. Finally, we contribute to the existing literature through new simulations that illustrate the performance of the different methods in a variety of situations experienced in practice.

[2] One other argument in favor of ex ante balancing is that if the treatment effect is heterogeneous and varies with observed covariates, ex ante balancing increases the precision of subgroup analysis.

[3] Summaries of recent experiments and advocacy of the policy case are found in Kremer (2003), Duflo and Kremer (2004), Duflo (2005), and Banerjee et al. (2007b).

While our focus is on field experiments in development economics, to date the field with the most active involvement of researchers in randomization, randomized experiments are also increasingly being used to investigate important policy questions in other fields (Steven D. Levitt and John A. List 2008). In common with the development literature, the extant literature in these other fields has often not explained the precise mechanism used for randomizing. However, it does appear that rerandomization methods are also being employed in some of these studies. The ongoing New York public schools project being undertaken by the American Inequality Lab is one such high-profile example. The lessons of this paper will also be important in designing upcoming experiments in other fields of economics.

The remainder of the paper is set out as follows. Section I provides a stocktaking of how randomization is currently being implemented, drawing on a summary of papers and a survey of leading experts in development field experiments. Section II describes the datasets used in our simulations, and outlines in more detail the different methods of randomization. Section III
provides simulation evidence on the relative performance of the different methods, and on answers to key questions faced in practice. Section IV concludes with our recommendations.

I. How Is Randomization Being Implemented?

A. Randomization as Described in Papers

We begin by reviewing a selection of research papers containing randomized experiments in development economics. Table 1 summarizes a selection of relatively small-scale randomized experiments, with baseline data, often implemented via NGOs or as pilot studies.[4] For each study, we list the unit in which randomization occurs. Typical sample sizes are 100 to 300 units, with the smallest sample size being 10 geographic areas used in Nava Ashraf, Dean Karlan, and Wesley Yin (2006a).

The transparency in allocating a program to participants is likely to be greatest when assignment to treatment is done in public rather than in private.[5] Only 2 out of the 18 papers reviewed (Marianne Bertrand et al. 2007, and Erica Field and Rohini Pande 2008) note whether the randomization was done in public or not; in both cases, public lotteries. The majority of the other randomizations, we believe, are private, or at most semipublic, where perhaps the NGO and/or government officials witness the randomization draw, but not the recipients of the program. However, this is not stated explicitly in the papers.

[4] We do not include experiments undertaken by the authors for objectivity reasons and because the final write-up of our paper has been influenced by the current paper.

[5] Of course, privately drawn randomizations still have the virtue of being able to tell participants that the reason they were chosen or not chosen is random. However, it is our opinion that carrying out the randomization in a public or semipublic manner can make this more credible in the eyes of participants in many settings. This may particularly be the case when it is the government doing the allocation (see Claudio Ferraz and Federico Finan 2008, 5, who note "to ensure a fair and transparent process, representatives of the press, political parties, and members of civil society are all invited to witness the lottery"). Nevertheless, public randomization may not be feasible or desirable in particular settings. We merely wish to urge researchers to consider whether the randomization can be easily publicly implemented in their setting, and to note in their papers how the randomization was done.

Table 1. Summary of Selected Randomized Experiments in Developing Countries

Paper | Randomization unit | Sample size | Public or private | Stratification used | Matched pairs | # of strata | Strata or pair dummies used | Table for assessing balance | # variables used to check balance
Published/forthcoming papers
Ashraf, Karlan, and Yin (2006b) | Microfinance clients | 1,777 | NA | No | No | | | Yes | 12
Ashraf, Karlan, and Yin (2006a) | Barangay (area) | 10 | NA | No | Yes | | Yes | Yes | 12
Banerjee et al. (2007a) | School | 98 | NA | Yes | No | NA | No | Yes | 4
 | School | 111 | NA | Yes | No | NA | No | Yes | 4
 | School | 67 | NA | Yes | No | NA | No | Yes | 4
Bertrand et al. (2007) | Men wanting a driver's license | 822 | Public | Yes(2) | No | 23 | Yes | Yes | 22
Gustavo J. Bobonis, Edward Miguel, and Char Puri-Sharma (2006) | Preschool cluster | 155 | NA | No | No | | | Yes | 24
Field and Pande (2008) | Microfinance group | 100 | Public | No | No | | | No(1) |
Paul Glewwe et al. (2004) | School | 178 | NA | Yes | No | NA | No | Yes | 8
Edward Miguel and Kremer (2004) | School | 75 | NA | Yes | No | NA | No | Yes | 21
Benjamin A. Olken (2007a) | Village | 608 | NA | Yes | No | 156 | Yes | Yes | 10
 | Subdistrict | 156 | NA | Yes | No | 50 | Yes | Yes | 10
Working papers
Ashraf, James Berry, and Jesse M. Shapiro (2007) | Household | 1,260 | NA | Yes | No | 5 | Yes | Yes | 14
Martina Björkman and Jakob Svensson (2007) | Community | 50 | NA | Yes | No | NA | | Yes | 39
Duflo, Rema Hanna, and Stephen Ryan (2007) | School | 113 | NA | Yes | No | NA | No(4) | Yes | 15
Pascaline Dupas (2006) | School | 328 | NA | Yes | No | NA | No | Yes | 17
Glewwe, Albert Park, and Meng Zhao (2006) | Township | 25 | NA | No | Yes | | | Yes | 4
Fang He, Leigh L. Linden, and Margaret MacLeod (2007) | School division | 194 | NA | Yes | No | NA | No | Yes | 22
Dean Karlan and Martina Valdivia (2006) | Microfinance group | 239 | NA | Yes | No | NA | No(3) | Yes | 14
Kremer et al. (2006) | Spring | 200 | NA | Yes | No | NA | No | Yes | 28
Olken (2007b) | Village | 48 | NA | Yes | No | 2 | Yes | Yes | 8

Notes: NA denotes information not available in the paper.
(1) Paper says check was done on a number of variables and is available upon request.
(2) It appears randomization was done within recruitment session, but the paper was not clear on this.
(3) Dummies for location are included, but not for credit officer, which was the other stratifying variable.
(4) Dummies for district are included, but not for the number of households in the area, which were also used for stratifying within district.

Next, we examine which methods are being used to reduce the likelihood of imbalance on observable covariates. Thirteen studies use stratification, two use matched pairs, and only three appear to use pure randomization. Ashraf, Berry, and Shapiro (2007) is the only documented example we have found of one of the methods that the next section shows to be in common use in our survey of experts. They note: "at the time of randomization, we verified that observable characteristics were balanced across treatments, and, in a few cases, rerandomized when this was not the case."

Few papers provide the details of the method used, presumably because there has not been a discussion of the potential importance of these details in the economics literature. For example, stratification is common, but few studies actually give the number of strata used in the study. In practice, there appears to be disagreement as to whether it is necessary to include strata dummies in the analysis after stratification: more than half the studies using stratification do not include strata dummies. Finally, all but one of the papers listed in Table 1 present a table for comparing treatment and control groups and test for imbalance. The number of variables used for checking imbalance ranges from 4 to 39.

B. Randomization in Practice According to a Survey of Experts

The long lag between inception of a randomized
experiment and its appearance in at least working paper form means the results above do not necessarily represent how the most recent randomized evaluations are being implemented. We surveyed leading experts in randomized evaluations on their experience and approach to implementation. A short online survey was sent to 35 selected researchers in December 2007. The list was selected from members of the Abdul Latif Jameel Poverty Action Lab, Bureau of Research and Economic Analysis of Development (BREAD), and the World Bank who were known to have conducted randomized experiments. We had 25 of these experts answer the survey, with 7 out of the 10 individuals who did not respond having worked with those who did respond. The median researcher surveyed had participated in 5 randomized experiments, with a mean of 5.96.[6] Seventy-one percent of the experiments had baseline data (including administrative data) that could be used at the time when randomization to treatment was done.

[6] This is after top-coding the number of experiments at 15.

Preliminary discussions with several leading researchers established that several methods involving multiple random draws were being used in practice to increase the likelihood of balance on observed characteristics. One such approach is to take a random draw of assignment to treatment, examine the difference in means for several key baseline characteristics, and then rerandomize if the difference looks too large. The decision as to what is too large could be done subjectively, or according to some statistical cutoff criteria. For example, one survey respondent noted that they regressed variables like education on assignment to treatment, and then redid the assignment if these coefficients were too big.

The second approach takes many draws of assignment to treatment, and then chooses the one that gives best balance on a set of observable characteristics according to some algorithm or rule. For example,
several researchers say they write a program to carry out 100 or 1,000 randomizations, and then, for each draw, regress individual variables against treatment. They then choose the draw with the minimum maximum t-statistic.[7] Some impose further criteria, such as requiring the minimum maximum t-statistic for testing balance on observables to be below one. The number of variables used to check balance typically ranges from 5 to 20, and often includes the baseline levels of the main outcomes. The perceived advantage of this approach is to enable balance on more variables than possible with stratification, and to provide balance in means on continuous variables.

Researchers were asked whether they had ever used a particular method, and the method used in their most recent randomized experiment. All of the methods are often combined with some stratification, so we examine that separately. Table 2 reports the results. Most researchers have, at some point, used simple randomization (probably with some stratification). However, we also see much more use of other methods than is apparent from the existing literature. Fifty-six percent had used pairwise matching. Thirty-two percent of all researchers, and 46 percent of the group with 5 or more experiments, have subjectively decided whether to rerandomize based on an initial test of balance. The multiple draws process described above has also been used by 24 percent of researchers, and by 38 percent of the group with 5 or more experiments.

More detailed questions were asked about the most recent randomization, in an effort to obtain some of the information not provided in Table 1. Twenty-three of the 25 respondents provided information on these. Stratification was used in 14 out of the 15 experiments that were not employing a matched pair design. The number of variables used in forming strata was small: six used only one variable, typically geographic location; four used two variables (e.g., location and gender); and four used four variables. Of particular note is that it
appears rare to stratify on baseline values of the outcome value of interest (e.g., test scores, savings levels, or incomes), with only 2 of these 14 experiments including a baseline outcome as a stratifying factor. While the number of stratifying variables is small, there is much greater variation in the number of strata, ranging from 3 to 200, with a mean (median) of 47 (18). Only one researcher said that stratification was controlled for when calculating standard errors for the treatment effect.

A notable feature of the survey responses was a much greater number of researchers randomizing within matched pairs than is apparent from the existing development literature. The vast majority of these matches was not done using optimal or greedy Mahalanobis matching, but was instead based on only a few variables and commonly done by hand. In most cases, the researchers matched on discrete variables and their interactions only, and thus, in effect, the matching reduced to stratification.

[7] An alternative approach used by another researcher is to regress the treatment on a set of baseline covariates and choose the draw with the lowest R^2.

One explanation for the difference in randomization approaches used by different researchers is that they reflect differences in context, with sample size, question of interest, and organization one is working with potentially placing constraints on the method that can be used for randomization. We therefore asked researchers for advice on how to evaluate the same hypothetical intervention designed to raise the income of day laborers.[8] The responses varied greatly across researchers, and include each of the methods given in Table 2. What is clear is that there appears to be no general agreement on how to go about randomizing in practice.

II. Data, Simulated Methods, and Variables for Balancing

A. Data

To compare the performance of the different randomization methods in practice, we chose four panel datasets that allow
us to examine a wide range of potential outcomes of interest, including microenterprise profits, labor income, school attendance, household expenditure, test scores, and child anthropometrics.

The first panel dataset covers microenterprises in Sri Lanka and comes from Suresh de Mel, David McKenzie, and Christopher Woodruff (2008). This data was collected as part of an actual randomized experiment, but we keep only data for firms that were in the control group during the first treatment round. The dataset contains information on firms' profits, assets, and many other firm and owner characteristics. The simulations we perform for this dataset are meant to mimic a randomized experiment that administers a treatment aimed at increasing firms' profits, such as a business training program.

The second dataset is a subsample of the Mexican employment survey (ENE). Our subsample includes heads of household between 20 and 65 years of age who were first interviewed in the second quarter of 2002, and who were reinterviewed in the following four quarters. We only keep individuals who were employed during the baseline survey, and imagine a treatment that aims at increasing their income, such as a training program or a nutrition program.

[8] See Web Appendix 1 for the exact question and the responses given.

Table 2. Survey Evidence on Randomization Methods Used by Leading Researchers

Method | Percent who have ever used: Unweighted | Percent who have ever used: Weighted | Percent who have ever used: ≥ 5 experiment group | Percent using method in most recent experiment
Single random assignment to treatment (possibly with stratification) | 80 | 84 | 92 | 39.1
Subjectively deciding whether to redraw | 32 | 52 | 46 | 4.3
Using a statistical rule to decide whether to redraw | 12 | 15 | 15 | 0.0
Carrying out many random assignments and choosing best balance | 24 | 45 | 38 | 17.4
Explicitly matching pairs of observations on baseline characteristics | 56 | 52 | 54 | 39.1
Number of researchers | 25 | 25 | 13 | 23

Notes: Methods are described in more detail in the paper. Weighted results weight by the number of experiments the researcher has participated in. "≥ 5 experiment group" refers to researchers who have carried out 5 or more randomized experiments.

The third dataset comes from the Indonesian Family Life Survey (IFLS).[9] We use 1997 data as the baseline and 2000 data as the follow-up, and simulate two different interventions with the IFLS data. First, we keep only children aged 10-16 in 1997 that were in the sixth grade and in school. These children then receive a simulated treatment aimed at keeping them in school (in the actual data, about 26 percent have dropped out 3 years later). Second, we create a sample of households and simulate a treatment that increases household expenditure per capita.

The fourth dataset comprises child and household data from the Learning and Educational Achievement (LEAPS) project in Pakistan (Tahir Andrabi et al. 2008). We focus on children aged 8 to 12 at baseline and examine two child outcome variables: math test scores and height z-scores.[10] The simulated treatments increase test scores or z-scores of these children. There is a wide range of policy experiments that have targeted these types of outcomes, from providing textbooks or school meals to giving conditional cash transfers or nutritional supplements.

B. Simulated Methods

For each dataset, we draw three subsamples of 30, 100, and 300 observations to investigate how the performance of different methods varies with sample size. All results are based on 10,000 bootstrap iterations that randomly split the sample into a treatment and a control group, according to five different methods. The first method is a single random draw, which we take as the benchmark for our comparison with the pros and cons of other methods.

Stratification. The second method is stratification. Stratified randomization is the most well-known, and, as we have seen, commonly used method of preventing imbalance between treatment and control groups for the observed variables used in stratification. By eliminating particular
sources of differences between groups, stratification (aka blocking) can increase the sensitivity of the experiment, allowing it to detect smaller treatment differences than would otherwise be possible (George E. P. Box, J. Stuart Hunter, and William G. Hunter 2005). The most often perceived disadvantage of stratification, compared to some alternative methods, is that only a small number of variables can be used in forming strata.[11] In terms of which variables to stratify on, the econometric literature emphasizes variables that are strongly related to the outcome of interest, and variables for which subgroup analysis is desired. Statistical efficiency is greatest when the variables chosen are strongly related to the outcome of interest (Imai, King, and Stuart 2008).

[9] See http://www.rand.org/labor/FLS/IFLS.

[10] We also have performed all simulations with English test scores and weight z-scores. The results are very close to the results using math test scores and height z-scores and are available from the authors upon request.

[11] This is particularly true in small samples. For example, considering only binary or dichotomized characteristics, with 5 variables there are 2^5 = 32 strata, while 10 variables would give 2^10 = 1,024 strata. In our samples of 30 observations, we stratify on two variables, forming eight strata. In the samples of 100 and 300 observations, we also stratify on 3 variables (24 strata), and also on 4 variables (48 strata).

Stratification is not able to remove all imbalance for continuous variables. For example, for two normal distributions with different means but the same variance, the means of the two distributions between any two fixed values (i.e., within a stratum) will differ in the same direction as the overall mean (Douglas G. Altman 1985). In the simulations, we always stratify on the baseline values of the outcome of interest, and on one or two other variables that either relate to the outcome of interest or constitute relevant subgroups for ex
post analysis.

Pairwise Matching. As a third method, we simulate pairwise matching. As opposed to stratification, matching provides a method to improve covariate balance for many variables at the same time. Greevy et al. (2004) describe the use of optimal multivariate matching. However, we chose to use the less computationally intensive optimal greedy algorithm laid out in King et al. (forthcoming).[12] In both cases, pairs are formed so as to minimize the Mahalanobis distance between the values of all the selected covariates within pairs, and then one unit in each pair is randomly assigned to treatment and the other to control.

As with stratification, matching on covariates can increase balance on these covariates, and increase the efficiency and power of hypothesis tests. King et al. (2007) emphasize one additional advantage in the context of social science experiments when the matched pairs occur at the level of a community, village, or school, which is that it provides partial protection against political interference or dropout. If a unit drops out of the study or suffers interference, its pair unit can also be dropped from the study, while the set of remaining pairs will still be as balanced as the original dataset. In contrast, in a pure randomized experiment, if even one unit drops out, it is no longer guaranteed that the treatment and control groups are balanced on average. However, the converse of this is that if units drop out at random, the matched pair design will throw out the corresponding pairs as well, leading to a reduction in power and smaller sample size than if an unmatched randomization was used.[13]

Note, however, that simply dropping the paired unit will only yield a consistent estimate of the average treatment effect for the full sample when the reason for attrition is unrelated to the size of the treatment effect. A special case of this occurring is when there is a constant treatment effect. If there are heterogeneous treatment effects, and dropout is related to the size of the
treatment effect, then one can only identify the average treatment effect for the subsample of units that remain in the sample when the treatment is randomly offered. Whether the average treatment effect for the subsample of units remaining in the sample is a quantity of interest will be up to the researcher to argue, and will depend on the level of attrition. It will understate the average treatment effect for the population of interest if those in the control group who had most to gain from the treatment drop out of the survey, either through disappointment or in order to take up an alternative to the treatment. It will overstate the average treatment effect for the population of interest if the individuals in the treatment group who do not benefit much, or perhaps even have a negative effect from the treatment, drop out.

[12] The Stata code performing pairwise Mahalanobis matching with an optimal greedy algorithm takes several days to run in the 300 observations sample. If there is little time in the field to perform the randomization, this may not be an option. It is important to have ample time between receiving baseline data and having to perform the randomization to have the flexibility of using matching techniques if desired. Software packages other than Stata may be more suited for this algorithm and may speed up the process. We provide our Stata code in the Web Appendix.
[13] See Greevy et al. (2004) for discussion of methods to retain broken pairs.

Rerandomization Methods. Since our survey revealed that several researchers are using rerandomization methods, we simulate two of these methods. The first, which we dub the "big stick" method by analogy with Jose F. Soares and C. F. Jeff Wu (1983), requires a redraw if a draw shows any statistical difference in means between treatment and control group at the 5 percent level or lower. The second method picks the draw with the minimum maximum t-stat out of 1,000 draws. We are not aware of
any papers that formally set out the rerandomization methods used in practice in development, but there are analogs in the sequential allocation methods used in clinical trials (Soares and Wu 1983; D. R. Taves 1974; S. J. Pocock and R. Simon 1975). The use of these related methods remains somewhat controversial in the medical field. Proponents emphasize the ability of such methods to improve balance on up to 10 to 20 covariates, with Tom Treasure and Kenneth D. MacRae (1998) suggesting that if randomization is the gold standard, minimization may be the platinum standard. In contrast, the European Committee for Proprietary Medicinal Products (CPMP 2003) recommends that applicants avoid such methods, and argues that minimization may result in more harm than good, bringing little statistical benefit in moderate-sized trials.

Why might researchers wish to use these methods instead of stratification? In small samples, stratification is only possible on one or two variables. There may be many variables that the researcher would like to ensure are not too unbalanced, without requiring exact balance on each. Rerandomization methods may be viewed as a compromise solution by the researchers, preventing extreme imbalance on many variables without forcing close balance on each. Rerandomization may also offer a way of obtaining approximate balance on a set of relevant variables in a situation of multiple treatment groups of unequal sizes.

C. Variables for Balancing

In practice, researchers attempt to balance on variables they think are strongly correlated with the outcome of interest. The baseline level of the outcome variable is a special case of this kind of variable. We always include the baseline outcome among the variables to stratify, match, or balance on.[14] In the matching and rerandomization methods, we also use six additional baseline variables that are thought to affect the outcome of interest. Stratification takes a subset of these six additional variables.[15]

[14] Note that this is rather an exception in
practice, where researchers often do not balance on the baseline outcome.
[15] A list of the variables used for each dataset is in Web Appendix 2, Table A2.

Among these balancing variables, we tried to pick variables that are likely to be correlated with the outcome based on economic theory and existing data. There is, however, a caveat. Most experiments have impacts measured over periods of six months to two years. While our economic models and existing datasets can provide good information for deciding on a set of variables useful for explaining current levels, they are often much less useful in explaining future levels of the variable of interest. In practice, we often cannot theoretically or empirically explain many short-run changes well with observed variables, and believe that these changes are the result of shocks. As a result, it may be the case that the covariates used to obtain balance are not strong predictors of future values of the outcome of interest.

The set of outcomes we have chosen spans a range of the ability of the baseline variables to predict future outcomes. At one end is microenterprise profits in Sri Lanka, where baseline profits and 6 baseline individual and firm characteristics explain only 12.2 percent of the variation in profits 6 months later. Thus, balancing on these common owner and firm characteristics will not control for very much of the variation in future realizations of the outcome of interest. School enrollment in the IFLS data is another example in which baseline variables explain very little of future outcomes. For a sample of 300 students who were all in school at baseline, 7 baseline variables explain 16.7 percent of the variation in school enrollment for the same students 3 years later. The explanatory power is better for labor income in the Mexican ENE data and household expenditure in the IFLS, with the baseline outcome and 6 baseline variables explaining 28 to 29 percent of the variation in the
future outcome. The math test scores and height z-scores in the LEAPS data have the most variation explained by baseline characteristics, with 43.6 percent of the variation in follow-up test scores explained by the baseline test score and 6 baseline characteristics.

We expect to see greater difference among randomization methods in terms of achieving balance on future outcomes for the variables that are either more persistent, or that have a larger share of their changes explained by baseline characteristics. We expect to see the least difference among methods for the Sri Lankan microenterprise profits data and Indonesian school enrollment data, and the most difference for the LEAPS math test score and height z-score data.

More generally, we recommend balancing on the baseline outcome, since our data show that this variable typically has the strongest correlation with the future outcome. In addition, we recommend balancing on dummies for geographic regions, since they also tend to be quite correlated with our future outcomes. This may be because shocks may differ across regions. Moreover, in practice, implementation of treatment may vary across regions, which is another reason to balance on region. Finally, if baseline data are available for several periods, one could check which variables are strongly correlated with future outcomes and could balance on these. If multiple rounds of the baseline survey are not available, data could come from an outside source, as long as they include the same variables and were collected in a comparable environment.

When choosing balancing variables, researchers will often face a trade-off between balancing on additional variables and achieving better balance on already chosen variables. For example, with stratification, there is a trade-off between stratifying on another variable and breaking down an already chosen variable into finer categories. In this case, we recommend adding the new variable only
if there is strong reason to believe that it would be correlated with follow-up outcomes, or if subsample analysis in this dimension is envisioned. In our datasets, it tends to be the case that, except for the baseline outcome and geography, few variables are strongly correlated with the follow-up outcome. Moreover, note that simply adding the new variable and switching the randomization method to matching, which is not bound by the number of strata, does not necessarily solve the problem. The more variables are used for matching, the worse balance tends to be on any given variable.

III. Simulation Results

Web Appendix 3 reports the full set of simulation results for all four datasets for 30, 100, and 300 observations. We summarize the results of these simulations in this section, organizing their discussion around several central questions that a researcher may have when performing a randomized assignment. We start by addressing the following core question.

A. Which Methods Do Better in Terms of Achieving Balance and Avoiding Extremes?

We first compare the relative performance of the different methods in achieving balance between the treatment and control groups in terms of baseline levels of the outcome variable. Table 3 shows the average difference in baseline means, the ninety-fifth percentile of the difference in means (a measure of the degree of imbalance possible at the extremes), and the percentage of simulations in which a t-test for difference in means between the treatment and control has a p-value less than 0.10. We present these results for a sample size of 100, with results for the other sample sizes appearing in Web Appendix 3. Figure 1 graphically summarizes the results for a selection of variables, plotting the densities of the differences in average outcome variables for all three sample sizes (30, 100, and 300 observations).

Table 3 shows that the mean difference in baseline means is very close to zero for all methods: on average, all methods lead to balance. However, Table 3 and Figure 1
also show that stratification, matching, and especially the min-max t-stat method have much less extreme differences in baseline outcomes, while the big stick method only results in narrow improvements in balance over a single random draw. For example, in the Mexican labor income data with a sample of 100, the ninety-fifth percentile of the difference in baseline mean income between the treatment and control groups is 0.384 standard deviations (SD) with a pure random draw, 0.332 SD under the big stick method, 0.304 SD when stratifying on 4 variables, 0.100 SD with pairwise greedy matching, and 0.088 SD under the min-max t-stat method. The size of the difference in balance achieved with different methods shrinks as the sample size increases; asymptotically, all methods will be balanced.

The key question is to what extent achieving greater balance on baseline variables translates into better balance on future values of the outcome of interest in the absence of treatment. The lower graphs in Figures 1A to 1D show the distribution of difference in means between treatment and control at follow-up for each method, while Tables 4A and 4B summarize how the methods perform in obtaining balance in follow-up outcomes.[16]

Panel A in both Tables 4A and 4B shows that, on average, all randomization methods give balance on the follow-up variable, even with a sample size as small as 30. This is the key virtue of randomization. Figure 1 and panel B in both Table 4A and Table 4B show there are generally fewer differences across methods in terms of avoiding extreme imbalances than with the baseline data. This is particularly true of the Sri Lankan profit data and the Indonesian schooling data, for which baseline variables explained relatively little of future outcomes. With a sample size of 30, stratification and matching reduce extreme differences between treatment and control, but with samples of 100 or 300, there is very little difference between the various methods in terms of
how well they balance the future outcome.

[16] The follow-up period is six months for the Sri Lankan microenterprise and Mexican labor income data, one year for the Pakistan test-score and child height data, and three years for the Indonesian schooling and expenditure data.

Table 3. How Do the Different Methods Compare in Terms of Baseline Balance? (Simulation results for 100 observation sample size)

| | Single random draw | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat out of 1,000 draws |
|---|---|---|---|---|---|---|
| Panel A. Average difference in BASELINE between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.001 | 0.000 | 0.001 | 0.003 | 0.001 | 0.000 |
| Household expenditure (Indonesia) | 0.002 | 0.001 | 0.001 | 0.002 | 0.001 | 0.002 |
| Labor income (Mexico) | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 |
| Height z-score (Pakistan) | 0.001 | 0.001 | 0.000 | 0.000 | 0.001 | 0.000 |
| Math test score (Pakistan) | 0.003 | 0.000 | 0.001 | 0.000 | 0.002 | 0.000 |
| Baseline unobservables (Sri Lanka) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 |
| Baseline unobservables (Mexico) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Panel B. Ninety-fifth percentile of difference in BASELINE between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.386 | 0.195 | 0.241 | 0.313 | 0.324 | 0.091 |
| Household expenditure (Indonesia) | 0.390 | 0.145 | 0.191 | 0.268 | 0.328 | 0.107 |
| Labor income (Mexico) | 0.384 | 0.280 | 0.304 | 0.100 | 0.332 | 0.088 |
| Height z-score (Pakistan) | 0.395 | 0.160 | 0.206 | 0.102 | 0.319 | 0.089 |
| Math test score (Pakistan) | 0.392 | 0.164 | 0.237 | 0.074 | 0.328 | 0.106 |
| Baseline unobservables (Sri Lanka) | 0.434 | 0.417 | 0.414 | 0.434 | 0.434 | 0.434 |
| Baseline unobservables (Mexico) | 0.457 | 0.448 | 0.439 | 0.457 | 0.457 | 0.427 |
| Panel C. Proportion of p-values < 0.1 for testing difference in BASELINE means | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.097 | 0.000 | 0.005 | 0.037 | 0.045 | 0.000 |
| Household expenditure (Indonesia) | 0.102 | 0.000 | 0.000 | 0.013 | 0.049 | 0.000 |
| Labor income (Mexico) | 0.100 | 0.015 | 0.029 | 0.000 | 0.053 | 0.000 |
| Height z-score (Pakistan) | 0.100 | 0.000 | 0.001 | 0.000 | 0.038 | 0.000 |
| Math test score (Pakistan) | 0.100 | 0.000 | 0.006 | 0.000 | 0.048 | 0.000 |
| Baseline unobservables (Sri Lanka) | 0.101 | 0.096 | 0.095 | 0.082 | 0.098 | 0.091 |
| Baseline unobservables (Mexico) | 0.108 | 0.095 | 0.093 | 0.103 | 0.102 | 0.090 |

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Table A2 of the Web Appendix.

[Figure 1. Distribution of Differences in Means between the Treatment and Control Groups at Baseline and Follow-up. Panel A: Sri Lanka profits (baseline and 6 months later). Panel B: Mexican ENE labor income (baseline and 6 months later). Panel C: IFLS expenditure data (baseline and 3 years later). Panel D: LEAPS math test score data (baseline and 1 year later). Each panel plots, for sample sizes 30, 100, and 300, the density of the difference in the average outcome weighted by standard deviation under five methods: 1 draw, 8 strata, matched, big stick, and min-max.]

Baseline variables have more predictive power for the realizations at follow-up for the other outcomes we consider. The Mexican labor income and Indonesian expenditure data are in an intermediate range of baseline predictive power, with the baseline outcomes plus 6 other variables explaining about 28 percent of the variation in follow-up outcomes. Panels B and C of Figure 1 show that, in contrast to the Sri Lanka and IFLS schooling data, even with samples
of 100 or 300, we find matching and stratification continue to perform better than a single random draw in reducing extreme imbalances. Table 4B shows that, with a sample size of 300, the ninety-fifth percentile of the difference in means between treatment and control groups is 0.23 SD under a pure random draw for both expenditure and labor income. This difference falls to 0.20 SD for expenditure and 0.15 SD for labor income when pairwise matching is used, and to 0.20 SD for both variables when stratifying or using the min-max rerandomization method.

Our other two outcome variables, math test scores and height z-scores, are in the higher end of baseline predictive power, with the baseline outcome and 6 other variables predicting 43.6 percent and 35.3 percent of the variation in follow-up outcomes, respectively. Figure 1, panel D illustrates that the choice of method makes more of a difference for these highly predictable follow-up outcomes than for the less predictable ones. Stratifying, matching, and the min-max t-stat method consistently lead to narrower distributions in the differences at follow-up when test scores or height z-scores are the outcomes. Nevertheless, even with these more persistent variables, the gains from pursuing balance on baseline are relatively modest when the sample size is 300. Using pairwise matching rather than a pure random draw reduces the ninety-fifth percentile of the difference in means from 0.23 to 0.17 in the case of math test scores.

Table 4A. How Do the Different Methods Compare in Terms of Balance on Future Outcomes? (Sample size of 30)

| | Single random draw | Stratified on two variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat |
|---|---|---|---|---|---|
| Panel A. Average difference in FOLLOW-UP between treatment and control means (in SD) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.001 | 0.000 | 0.001 | 0.003 | 0.002 |
| Child schooling (Indonesia) | 0.005 | 0.010 | 0.002 | 0.004 | 0.006 |
| Household expenditure (Indonesia) | 0.000 | 0.002 | 0.002 | 0.000 | 0.006 |
| Labor income (Mexico) | 0.003 | 0.000 | 0.000 | 0.000 | 0.001 |
| Height z-score (Pakistan) | 0.007 | 0.001 | 0.004 | 0.003 | 0.001 |
| Math test score (Pakistan) | 0.001 | 0.002 | 0.001 | 0.003 | 0.005 |
| Panel B. Ninety-fifth percentile of difference in FOLLOW-UP between treatment and control means (in SD) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.713 | 0.627 | 0.556 | 0.705 | 0.708 |
| Child schooling (Indonesia) | 0.834 | 0.745 | 0.556 | 0.556 | 0.556 |
| Household expenditure (Indonesia) | 0.721 | 0.643 | 0.496 | 0.677 | 0.590 |
| Labor income (Mexico) | 0.703 | 0.713 | 0.503 | 0.688 | 0.704 |
| Height z-score (Pakistan) | 0.710 | 0.620 | 0.557 | 0.620 | 0.443 |
| Math test score (Pakistan) | 0.717 | 0.448 | 0.350 | 0.648 | 0.525 |
| Panel C. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference as if pure randomization was used (e.g., no adjustment for strata or match dummies) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.105 | 0.059 | 0.027 | 0.101 | 0.109 |
| Child schooling (Indonesia) | 0.052 | 0.113 | 0.033 | 0.041 | 0.010 |
| Household expenditure (Indonesia) | 0.102 | 0.069 | 0.014 | 0.083 | 0.046 |
| Labor income (Mexico) | 0.101 | 0.106 | 0.007 | 0.093 | 0.103 |
| Height z-score (Pakistan) | 0.097 | 0.056 | 0.030 | 0.059 | 0.007 |
| Math test score (Pakistan) | 0.101 | 0.006 | 0.000 | 0.072 | 0.022 |
| Panel D. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference which takes account of randomization method (i.e., controls for stratum, pair, or rerandomizing variables) | | | | | |
| Microenterprise profits (Sri Lanka) | 0.103 | 0.091 | 0.098 | 0.103 | 0.122 |
| Child schooling (Indonesia) | 0.103 | 0.117 | 0.033 | 0.098 | 0.108 |
| Household expenditure (Indonesia) | 0.102 | 0.098 | 0.097 | 0.101 | 0.094 |
| Labor income (Mexico) | 0.102 | 0.109 | 0.107 | 0.102 | 0.117 |
| Height z-score (Pakistan) | 0.100 | 0.097 | 0.101 | 0.100 | 0.103 |
| Math test score (Pakistan) | 0.099 | 0.102 | 0.103 | 0.098 | 0.098 |

Notes: The coefficients in panels A and B are for specifications without controls for stratum or pair dummies. Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

Table 4B. How Do the Different Methods Compare in Terms of Balance on Future Outcomes? (Sample size of 300)

| | Single random draw | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with min-max t-stat |
|---|---|---|---|---|---|---|
| Panel A. Average difference in FOLLOW-UP between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 | 0.000 |
| Child schooling (Indonesia) | 0.002 | 0.003 | 0.001 | 0.000 | 0.002 | 0.002 |
| Household expenditure (Indonesia) | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 |
| Labor income (Mexico) | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 | 0.002 |
| Height z-score (Pakistan) | 0.001 | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 |
| Math test score (Pakistan) | 0.001 | 0.000 | 0.000 | 0.001 | 0.001 | 0.001 |
| Panel B. Ninety-fifth percentile of difference in FOLLOW-UP between treatment and control means (in SD) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.220 | 0.210 | 0.209 | 0.211 | 0.216 | 0.224 |
| Child schooling (Indonesia) | 0.213 | 0.219 | 0.212 | 0.227 | 0.227 | 0.196 |
| Household expenditure (Indonesia) | 0.226 | 0.194 | 0.196 | 0.200 | 0.219 | 0.198 |
| Labor income (Mexico) | 0.227 | 0.196 | 0.198 | 0.149 | 0.213 | 0.195 |
| Height z-score (Pakistan) | 0.222 | 0.186 | 0.189 | 0.189 | 0.212 | 0.225 |
| Math test score (Pakistan) | 0.227 | 0.180 | 0.184 | 0.167 | 0.209 | 0.175 |
| Panel C. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference as if pure randomization was used (e.g., no adjustment for strata or match dummies) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.100 | 0.080 | 0.080 | 0.085 | 0.092 | 0.103 |
| Child schooling (Indonesia) | 0.121 | 0.087 | 0.082 | 0.098 | 0.111 | 0.096 |
| Household expenditure (Indonesia) | 0.101 | 0.056 | 0.052 | 0.064 | 0.092 | 0.059 |
| Labor income (Mexico) | 0.100 | 0.056 | 0.062 | 0.011 | 0.087 | 0.028 |
| Height z-score (Pakistan) | 0.097 | 0.044 | 0.049 | 0.049 | 0.081 | 0.097 |
| Math test score (Pakistan) | 0.101 | 0.038 | 0.042 | 0.028 | 0.076 | 0.032 |
| Panel D. Proportion of p-values < 0.1 for testing difference in FOLLOW-UP means, with inference which takes account of randomization method (i.e., controls for stratum, pair, or rerandomizing variables) | | | | | | |
| Microenterprise profits (Sri Lanka) | 0.098 | 0.103 | 0.133 | 0.103 | 0.102 | 0.101 |
| Child schooling (Indonesia) | 0.098 | 0.102 | 0.104 | 0.098 | 0.104 | 0.104 |
| Household expenditure (Indonesia) | 0.099 | 0.100 | 0.099 | 0.101 | 0.105 | 0.100 |
| Labor income (Mexico) | 0.100 | 0.095 | 0.101 | 0.104 | 0.100 | 0.112 |
| Height z-score (Pakistan) | 0.094 | 0.097 | 0.097 | 0.098 | 0.095 | 0.102 |
| Math test score (Pakistan) | 0.101 | 0.097 | 0.099 | 0.097 | 0.100 | 0.102 |

Notes: The coefficients in panels A and B are for specifications without controls for stratum or pair dummies. Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

B. What Does Balance on Observables Imply about Balance on Unobservables?

In general, what does balancing on observables do in terms of balancing unobservables? Aickin (2001) notes that methods that balance on observables can do no worse than pure randomization with regard to balancing unobserved variables.[17] We illustrate this point empirically in the Sri Lanka and ENE datasets by defining a separate group of variables from the data to be "unobservable," in the sense that we do not balance, stratify, or match on them. The idea here is that although we have these variables in these particular datasets, they may not be available in other datasets (such as measures of entrepreneurial ability). Moreover, these unobservables are meant to capture what balancing does to variables that are thought to have an effect on the outcome variable, but are truly unobservable. Table 3 indicates that the balance on these unobservables is much the same across all methods.

Paul R. Rosenbaum (2002, 21) notes that under pure randomization, if we look at a table of observed covariates and see balance, this gives us reason to hope and expect that other variables not measured are similarly balanced. This holds true for pure random draws, but will not be the case with methods that enhance balance on certain observed covariates. Presenting a table that shows only the variables used in matching or for rerandomization checks, and showing balance on these covariates, will overstate the degree of balance attained on other variables that are not closely correlated with those for which balance was pursued. For example, the ninety-fifth percentile of the difference in means in Table 3 gives a similar level of imbalance for the unobservables as the balanced outcome under a pure random draw, whereas under the other methods, the unobservables have higher
imbalance than the outcome variable.[18] We therefore recommend that if matching, or rerandomization, or stratification on continuous variables is used, researchers clearly separate these from other variables of interest when presenting a table to show balance.

[17] To see this, consider balancing on variable X, and the consequences of this for balance on an unobserved variable W. W can be written as the sum of the fitted value from regressing W on X and the residual from this regression:
(1) $W = P_X W + (I - P_X) W, \quad P_X = X (X'X)^{-1} X'.$
Balancing on X will therefore also balance the part of W that is correlated with X, $P_X W$. Since the remaining part of W, $(I - P_X) W$, is orthogonal to X, it will tend to balance at the same rate as under pure randomization.
[18] Note the imbalance on unobservables is similar to that of a single random draw, which concurs with the point that balancing on observables can do no worse than pure randomization when it comes to balancing unobservables.

C. To Dummy or Not to Dummy?

We have seen that only a fraction of studies using stratification control for strata in the statistical analysis. Kernan et al. (1999) state that results should take into account stratification by including strata as covariates in the analysis. Failure to do so results in overly conservative standard errors, which may lead a researcher to erroneously fail to reject the null hypothesis of no treatment effect. While the omission of balanced covariates will not change the point estimates of the effect in linear models, leaving out a balanced covariate can change the estimate of the treatment effect in nonlinear models (Gillian M. Raab, Simon Day, and Jill Sales 2000), so that analysis of binary outcomes makes this adjustment more important. The CPMP (2003) also recommends that all stratification variables be included as covariates in the primary analysis, in order to reflect the restriction on randomization implied by the stratification. Similarly, for pairwise matching,
dummies for each pair should be included in the treatment regression.

Furthermore, in practice, stratification is unlikely to achieve perfect balance for all of the variables used in stratification. Whenever there is an odd number of units within a stratum, there will be imbalance (Terry M. Therneau 1993). In addition, imbalance may arise from units having a baseline missing value on one of the variables used in forming strata. As a consequence, in practice, the point estimate of the treatment effect will also likely change if strata dummies are included compared to when they are not included.

To examine whether or not controlling for stratification matters in practice, panels C and D of Tables 4A and 4B compare the size of a hypothesis test for the difference in means of the follow-up outcome when no treatment has been given. Panel C of Tables 4A and 4B shows the proportion of p-values under 0.10 when no stratum or pair dummies are included, and panel D of Tables 4A and 4B shows the proportion of p-values under 0.10 when these dummies are included. Recall that this is a test of a null hypothesis we know to be true. So, to have correct size, 10 percent of the p-values should be below 0.10. We see that this is the case for the pure random draw, whereas failure to control for the dummies leads the stratification and pairwise matching tests to be too conservative on average.[19] For example, with a sample size of 30, less than 5 percent of the p-values are below 0.10 for all 6 outcomes when we do not include pair dummies with pairwise matching. For the math test score, only 0.6 percent of the p-values under stratification, and none of the p-values under pairwise matching, are under 0.10. Even with a sample size of 300, less than 5 percent of the p-values are below 0.10 for the more persistent outcomes when stratification or matching is used but not accounted for by adding stratum or pair dummies. In contrast, panel D shows that when we add stratum dummies or pair dummies, the hypothesis test has the correct size, with
10 percent of the p-values under 0.10, even in sample sizes as small as 30.

Thus, on average, it is overly conservative to not include the controls for stratum or pair in analysis. The resulting conservative standard errors imply that if researchers do not account for the method of randomization in analysis, they may not detect treatment effects that they would otherwise detect. However, although on average the p-values are lower when including these dummies, Table 5 shows that this is not necessarily the case in any particular random allocation to treatment and control. Including stratum dummies only lowers the p-value in 48 to 88 percent of the replications, depending on sample size and outcome variable. Thus, in practice, researchers cannot argue that ignoring stratum dummies will always result in larger standard errors than when these dummies are included. If researchers could commit to always ignoring the stratification during analysis, then this would be, on average, conservative. But since it is difficult to commit if no standard for analysis exists, researchers may be tempted to try their analysis with and without stratum dummies, and report the results that are more significant. We therefore recommend that the standard should be to control for the method of randomization in analysis.[20]

[19] The child schooling in Indonesia is a binary outcome. The difference in means (attending school) can therefore be only a limited number of discrete differences, and this discreteness causes the test to not have the correct size, even under a pure random draw, when the sample is small.

D. How Should Inference Be Done after Rerandomizing?

While including strata or pair dummies in the ex post analysis for the stratification and matching methods is quite straightforward, the methods of inference are not as clear for rerandomization methods. In fact, the correct statistical methods for covariate-dependent randomization schemes, such as minimization, are still a
conundrum in the statistics literature, leading some to argue that the only analysis that we can be completely confident in is a permutation test or rerandomization test. Randomization inference can be used for analysis of the method of rerandomizing when the first draw exceeds some statistical threshold, although it requires additional programming work. Using the rule which determines when rerandomization will take place, the researcher can map out the set of random draws that would be allowed by the threshold rule, throwing out those with excessive imbalance, and then carry out permutation tests on the remaining draws.[21] Such a method is not possible when ad hoc criteria are used to decide whether to redraw.

[20] If authors believe they have a valid reason not to control for stratum dummies, they should explain this reasoning in their text and also mention what the results would be if stratum dummies were included.

[21] When multiple draws are used to select the allocation that gives best balance over a sequence of 100 or 1,000 draws, there may be a concern that the resulting assignment to treatment is mostly deterministic. This will be the case in very small samples (under 12 units), but is not a concern for all but the smallest trials.

Table 5. Is It Always Conservative to Ignore the Method of Randomization?
(Proportion of replications in which controlling for stratum or pair dummies lowers the p-value on a test of difference in means between treatment and control groups)

Panel A. Sample size 30

| Outcome | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat |
|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.690 | | 1.000 | 0.493 | 0.555 |
| Child schooling (Indonesia) | 0.373 | | 0.686 | 0.567 | 0.854 |
| Household expenditure (Indonesia) | 0.622 | | 1.000 | 0.523 | 0.657 |
| Labor income (Mexico) | 0.477 | | 1.000 | 0.496 | 0.532 |
| Height z-score (Pakistan) | 0.579 | | 1.000 | 0.537 | 0.825 |
| Math test score (Pakistan) | 0.684 | | 1.000 | 0.522 | 0.740 |

Panel B. Sample size 300

| Outcome | Stratified on two variables | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat |
|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.668 | 0.731 | 1.000 | 0.526 | 0.689 |
| Child schooling (Indonesia) | 0.705 | 0.634 | 1.000 | 0.506 | 0.674 |
| Household expenditure (Indonesia) | 0.869 | 0.733 | 1.000 | 0.522 | 0.738 |
| Labor income (Mexico) | 0.874 | 0.712 | 1.000 | 0.525 | 0.725 |
| Height z-score (Pakistan) | 0.860 | 0.655 | 1.000 | 0.522 | 0.754 |
| Math test score (Pakistan) | 0.882 | 0.735 | 1.000 | 0.533 | 0.776 |

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

Optimal model-based inference is less clear under rerandomization, since allocation to treatment is data-dependent. To see this, consider the data generating processes:

(2a) $Y_i = \alpha + \beta\,\mathit{Treat}_i + \varepsilon_i$

(2b) $Y_i = \alpha + \beta\,\mathit{Treat}_i + \gamma' z_i + u_i$,

where $\mathit{Treat}_i$ is a dummy variable for treatment status, and $z_i$ are a set of covariates potentially correlated with the outcome $Y_i$. Under pure randomization, (2a) is used for analysis, assignment to treatment is in expectation uncorrelated with $\varepsilon_i$, and the standard error will depend on $\mathrm{Var}(\varepsilon_i)$.

Suppose instead that rerandomization methods are used, which force the difference in means of the covariates in $z$ to be less than some specified threshold, $|\bar{z}_{\mathrm{TREAT}} - \bar{z}_{\mathrm{CONTROL}}| < \delta$. If $\delta$ is invariant to sample size (e.g., difference in proportions less than 0.10), then this condition will occur almost surely as the sample size goes to infinity, and thus the conditioning will not affect the asymptotics. However, in practice, $\delta$ is usually set by some statistical significance threshold. Then, if (2a) is used for analysis, that is, the covariates are not controlled for, we will only have that $\varepsilon_i$ is independent of $\mathit{Treat}_i$ conditional on $|\bar{z}_{\mathrm{TREAT}} - \bar{z}_{\mathrm{CONTROL}}| < \delta$. The correct standard error should therefore account for this conditioning, using $\mathrm{Var}(\varepsilon_i \mid |\bar{z}_{\mathrm{TREAT}} - \bar{z}_{\mathrm{CONTROL}}| < \delta)$. In practice this will be difficult to do, so, adapting the minimization inference recommendations of Neil W. Scott et al. (2002), we recommend researchers include all the variables used to check balance as linear covariates in the regression.[22] Estimation of the treatment effect in (2b) will then be conditional on the variables used for checking balance.
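As an illustration of the threshold rule and of the randomization-inference alternative described in this subsection, the following is a minimal sketch, not the authors' code. It assumes a single covariate z, a fixed cutoff delta on the absolute difference in means (rather than a significance-based cutoff), the sharp null hypothesis of no treatment effect for any unit, and the difference in means as the test statistic. All function names and the simulated data are hypothetical.

```python
import random
import statistics


def mean_diff(values, treat):
    """Difference in means between treated (1) and control (0) units."""
    treated = [v for v, d in zip(values, treat) if d == 1]
    control = [v for v, d in zip(values, treat) if d == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def draw_assignment(n, rng):
    """One pure random draw that treats half of the n units."""
    treat = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(treat)
    return treat


def rerandomize(z, delta, rng, max_tries=10_000):
    """Redraw until the covariate imbalance satisfies |mean_T(z) - mean_C(z)| < delta."""
    for _ in range(max_tries):
        treat = draw_assignment(len(z), rng)
        if abs(mean_diff(z, treat)) < delta:
            return treat
    raise RuntimeError("no acceptable draw found; delta may be too strict")


def randomization_pvalue(y, z, treat, delta, rng, draws=500):
    """Two-sided p-value that only uses draws the threshold rule would have allowed."""
    observed = abs(mean_diff(y, treat))
    extreme = 0
    for _ in range(draws):
        placebo = rerandomize(z, delta, rng)  # draws with excess imbalance are thrown out
        if abs(mean_diff(y, placebo)) >= observed:
            extreme += 1
    # Count the realized assignment itself so the p-value is never exactly zero.
    return (extreme + 1) / (draws + 1)


# Simulated example (assumed data): 30 units, one baseline covariate, a large effect.
rng = random.Random(0)
n = 30
z = [rng.gauss(0, 1) for _ in range(n)]
treat = rerandomize(z, delta=0.2, rng=rng)
y = [zi + rng.gauss(0, 0.5) + 5.0 * d for zi, d in zip(z, treat)]
p_value = randomization_pvalue(y, z, treat, delta=0.2, rng=rng, draws=500)
print(f"imbalance in z: {mean_diff(z, treat):.3f}, randomization p-value: {p_value:.4f}")
```

The placebo assignments are generated by the same accept/reject rule as the realized assignment, which is what restricts the permutation test to the allowed set of draws. The model-based route recommended in the text, adding the balance variables as linear controls in (2b), is a separate approach and is not implemented here.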
This entails a loss of degrees of freedom compared to not controlling for these covariates, but still requires fewer degrees of freedom than pairwise matching.

The simulation results in Tables 4A and 4B suggest that this approach works in practice. Treating the big stick or minmax t-statistic methods as if they were pure random draws results in less than 10 percent of replications having p-values under 0.10 (panel C), whereas including the variables used for checking balance as linear controls results in the correct test size (panel D). This correction is more important for the minmax method than the big stick method, since the minmax method achieves greater baseline balance.

[22] If an interaction or quadratic term is used to check balance, which seems rare in practice, then this same term should also be included as a regressor. Note that in the special case of the rerandomization method being used to seek balance on a set of binary variables X, Y, and their interaction X × Y, for which it is possible to attain exact balance on these variables, then rerandomization inference with the X, Y, and X × Y as controls would be equivalent to inference after stratification on these same variables with strata dummies used as controls.

E. How Do the Different Methods Compare in Terms of Power for Detecting a Given Treatment Effect?

To compare the power of the different methods, we simulate a treatment effect by adding a constant to the follow-up outcome variable for the treatment group. We simulate constant treatments, which add Rs. 1,000 LKR (25 percent of average baseline profits) to the Sri Lankan microenterprise profits; add Mex$920 (20 percent of average baseline income) to the Mexican labor income; add 0.4 (0.5 standard deviations) to log expenditure in Indonesia; and add 0.25 standard deviations to the Pakistan math test scores and child height z-scores. For the schooling treatment, we randomly set one in three schooling dropouts to stay in school. These treatments are all relatively small in magnitude for the sample sizes used, so that we can see differences in power across methods, rather than have all methods give power close to one.

Table 6 summarizes the power of a hypothesis test for detecting the treatment effect, using the t-test on the treatment coefficient in a linear regression of the outcome variable on a constant and a dummy variable for treatment status. We report the proportion of replications where this test would reject the null hypothesis of no effect at the 10 percent level. Panels A and C report results when the regression model does not include controls for the method of randomization, while panels B and D report the power when stratum or pair dummies, or the variables used in checking balance for rerandomization methods, are included. The results for the pure random sample in panels B and D include the same set of seven baseline controls, to enable comparison of ex post controls for baseline characteristics to ex ante balancing.

Table 6 shows that, if we do not adjust for the method of randomization, the different methods often perform similarly in terms of power. In cases where they differ, the methods that pursue balance tend to have less power than pure randomization. For example, with a sample size of 30, the power for both the height and math test scores is approximately 0.17 under a single random draw, but can be as low as 0.016 for the math test score under pairwise matching, and as low as 0.052 for the height z-score with the minmax method. As we have seen, the size of tests is too low for persistent variables when the method of randomization is not controlled for, which makes it difficult to detect a significant effect. This translates into low power in such cases.

Adding the strata and pair dummies, or baseline variables used for rerandomizing, increases power in almost all cases. Some of the increases in power can be sizeable: the power increases from 0.016 to 0.320 for the math test score with pairwise matching when the pair
dummies are added. This increase in power is another reason to take into account the method of randomization when conducting analysis.

Table 6 also allows us to see the gain in power from ex ante balancing compared to ex post balancing. The same set of variables used for forming the match, and for the rerandomization methods, were added as ex post controls when estimating the treatment effect for the single random draw in panels B and D. When the variables are not very persistent, such as the microenterprise profits and child schooling, the power is very similar whether ex ante or ex post balancing is done. However, we do observe some improvements in power from matching compared to ex post controls for some, but not all, of the more persistent outcome variables. The power increases from 0.584 to 0.761 for the Mexican labor income when ex ante pairwise matching on seven variables is done, rather than a pure random draw followed by linear controls for these seven variables ex post. However, there is no discernible change in power from balancing for child height, another persistent outcome variable.

Table 6. How Do the Different Methods Compare in Terms of Power in Detecting a Given Treatment Effect?

Sample size of 30

Panel A. Proportion of p-values < 0.10 when no adjustment is made for method of randomization

| Outcome | Single random draw | Stratified on two variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat | Stratified on i.i.d. noise | Matching on i.i.d. noise | 20 strata, i.i.d. noise |
|---|---|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.144 | 0.106 | 0.095 | 0.139 | 0.154 | 0.132 | 0.086 | 0.111 |
| Child schooling (Indonesia) | 0.123 | 0.146 | 0.111 | 0.115 | 0.066 | 0.116 | 0.144 | 0.119 |
| Household expenditure (Indonesia) | 0.390 | 0.382 | 0.342 | 0.382 | 0.360 | 0.388 | 0.387 | 0.391 |
| Labor income (Mexico) | 0.181 | 0.177 | 0.098 | 0.178 | 0.184 | 0.177 | 0.203 | 0.208 |
| Height z-score (Pakistan) | 0.174 | 0.134 | 0.133 | 0.134 | 0.052 | 0.195 | 0.194 | 0.193 |
| Math test score (Pakistan) | 0.167 | 0.051 | 0.016 | 0.139 | 0.087 | 0.154 | 0.131 | 0.193 |

Panel B. Proportion of p-values < 0.10 when adjustment is made for randomization method (and, for the single random draw, when controls for the seven baseline variables are added to the regression)

| Outcome | Single random draw | Stratified on two variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat | Stratified on i.i.d. noise | Matching on i.i.d. noise | 20 strata, i.i.d. noise |
|---|---|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.130 | 0.135 | 0.153 | 0.131 | 0.167 | 0.144 | 0.164 | 0.109 |
| Child schooling (Indonesia) | 0.109 | 0.131 | 0.121 | 0.112 | 0.095 | 0.118 | 0.144 | 0.149 |
| Household expenditure (Indonesia) | 0.409 | 0.424 | 0.580 | 0.419 | 0.461 | 0.387 | 0.356 | 0.280 |
| Labor income (Mexico) | 0.164 | 0.172 | 0.242 | 0.167 | 0.196 | 0.165 | 0.173 | 0.125 |
| Height z-score (Pakistan) | 0.246 | 0.201 | 0.206 | 0.251 | 0.281 | 0.161 | 0.157 | 0.142 |
| Math test score (Pakistan) | 0.183 | 0.313 | 0.320 | 0.187 | 0.217 | 0.159 | 0.170 | 0.129 |

Sample size of 300

Panel C. Proportion of p-values < 0.10 when no adjustment is made for method of randomization

| Outcome | Single random draw | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat | Stratified on i.i.d. noise | Matching on i.i.d. noise | 200 strata, i.i.d. noise |
|---|---|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.288 | 0.278 | 0.267 | 0.280 | 0.280 | 0.285 | 0.279 | 0.278 |
| Child schooling (Indonesia) | 0.606 | 0.562 | 0.607 | 0.597 | 0.600 | 0.560 | 0.610 | 0.555 |
| Household expenditure (Indonesia) | 0.999 | 1.000 | 1.000 | 0.999 | 1.000 | 0.999 | 0.999 | 0.999 |
| Labor income (Mexico) | 0.494 | 0.480 | 0.475 | 0.489 | 0.474 | 0.479 | 0.484 | 0.489 |
| Height z-score (Pakistan) | 0.728 | 0.756 | 0.766 | 0.743 | 0.767 | 0.727 | 0.728 | 0.739 |
| Math test score (Pakistan) | 0.615 | 0.650 | 0.655 | 0.619 | 0.657 | 0.631 | 0.624 | 0.620 |

Panel D. Proportion of p-values < 0.10 when adjustment is made for randomization method (and, for the single random draw, when controls for the seven baseline variables are added to the regression)

| Outcome | Single random draw | Stratified on four variables | Pairwise greedy matching | Big stick rule | Draw with minmax t-stat | Stratified on i.i.d. noise | Matching on i.i.d. noise | 200 strata, i.i.d. noise |
|---|---|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.301 | 0.343 | 0.290 | 0.302 | 0.309 | 0.283 | 0.338 | 0.295 |
| Child schooling (Indonesia) | 0.608 | 0.589 | 0.602 | 0.600 | 0.595 | 0.458 | 0.607 | 0.403 |
| Household expenditure (Indonesia) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.998 | 0.994 |
| Labor income (Mexico) | 0.584 | 0.541 | 0.761 | 0.584 | 0.582 | 0.501 | 0.602 | 0.408 |
| Height z-score (Pakistan) | 0.863 | 0.854 | 0.853 | 0.867 | 0.866 | 0.741 | 0.721 | 0.642 |
| Math test score (Pakistan) | 0.812 | 0.781 | 0.829 | 0.816 | 0.826 | 0.630 | 0.603 | 0.460 |

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2. Stratifications on independently and identically distributed noise are for 8 (48) strata in the sample of 30 (300) observations. Simulated treatment effects are as follows. Microenterprise profits: an Rs. 1,000 (LKR) increase in profits (about 25 percent of average baseline profits); child schooling: one in three randomly selected children in the treatment group who would have dropped out don't; household expenditure: an increase of 0.4 in ln household expenditure per capita, which corresponds to about one-half of a standard deviation (or moving a household from the twenty-fifth to the fiftieth percentile); labor income: a MEX$920 increase in income (about 20 percent of average baseline income); height z-score: an increase of one quarter of a standard deviation in the z-score, where the z-score is defined as standard deviations from mean US height for age; math test score: an increase of one quarter of a standard deviation in the test score.

F. Can We Go Too Far in Pursuing Balance?

When using stratification, matching, or rerandomization methods, one question is how many variables to balance on, and whether balancing on too many variables could be counterproductive. The statistical and econometric literature is not very definitive with respect to how many variables to use in stratification.[23] We therefore investigate how changing the number of strata affects balance and power in practice, in our samples of 100 and 300 observations, by simulating stratification with 2, 3, and 4 stratifying variables (resulting in 8, 24, and 48 strata, respectively). The results are shown in Table 7. Both the size of extreme imbalances and the power do not vary much with the number of strata for any of the six outcomes. In most cases, there is neither much gain, nor much loss, from including more strata. However, we do note that, for a sample size of 100, when strata dummies are included, power is always slightly lower when 4 stratifying variables and 48 strata are included than when 3 stratifying variables and 24 strata are used. For example, with the math test
score, power falls from 0.464 to 0.399 when the number of strata is doubled.

A question related to the choice of how many variables to balance on is what happens when one balances on irrelevant covariates. Guido Imbens et al. (2009) prove that stratification can do no worse than pure randomization in terms of expected squared error, even when there is little or no correlation with the variables being stratified on. However, although there is no cost to stratification in terms of the variance itself, there is a cost in terms of estimation of the variance. The estimator that takes account of the stratification itself has a larger variance, which comes from the degrees of freedom adjustment. Although one could use the estimator for the variance which ignores the stratification, this is overly conservative and, as we have seen, results in tests of low power.

Our personal viewpoint based on this is that there is a possible cost of overstratifying on irrelevant variables, in that the power of the experiment to detect a significant treatment effect can be diminished as a result of the degrees of freedom adjustment. To gauge how important this might be in practice, consider a few examples, using the fact that controlling for k additional variables can at most increase the estimate of the variance by a factor of $(n - 2)/(n - k - 2)$. For a sample size of 100, even 10 irrelevant covariates could at most increase standard errors by 5.5 percent, since $\sqrt{98/88} \approx 1.055$ (equivalent to a reduction in sample size from 100 to 90). With 200 or 400 as the sample size, balancing on 5 or 10 uncorrelated covariates will not increase standard errors by more than 3 percent. However, balancing on irrelevant variables will continue to have repercussions for standard errors if the number of variables balanced on increases at the same rate as the sample size. In pairwise matching, the number of covariates used as controls in the treatment regression is n/2. If the variables used to form matches do not have any role in explaining the outcome of interest, we see that the ratio of standard errors will approach $\sqrt{2}$, that is, can be 41 percent higher under pairwise matching than pure randomization.

[23] For example, Duflo et al. (2008) state that if several binary variables are available for stratification, it is a good idea to use all of them, even if some of them may not end up having large explanatory power for the final outcome. In contrast, Kernan et al. (1999) argue that fewer strata are better, and raise the possibility of unbalanced treatment assignment within strata due to small cell sizes, recommending that an appropriate number of strata is between n/50 and n/100. Finally, Therneau (1993) shows in simulations with sample sizes of 100 that, with a sufficient number of factors used in stratifying so that the number of strata reaches n/2, performance can actually be worse than using unstratified randomization.

In our simulations, we address the issue of balancing on irrelevant variables by stratifying and matching based on independently and identically distributed noise. The last three columns of Table 6 show the power of the stratified and matching estimators when pure noise is used. Once we control for stratum dummies, power is clearly less when irrelevant variables are used for stratifying or matching than when relevant variables are used. For example, the power with a sample size of 30 for household expenditure under pairwise matching is 0.580 when relevant baseline variables are used to form the match, compared to 0.356 when independently and identically distributed noise is used in the matching. Thus, the choice of variables used in stratifying or matching does play an important role in determining power.

However, if we wish to compare the impact of matching or stratifying on irrelevant variables to a pure random draw, we should compare the power for a single random draw in panels A and C to the power for matching and stratifying on independently and identically distributed noise in panels B and D,
which contain controls for stratum or pair dummies. The power is very similar for all sample sizes. In practice, any given draw of independently and identically distributed noise is likely to have some small correlation with the outcome of interest, reducing the residual sum of squares when controlled for in a regression. It seems this small correlation is just enough to offset the fall in degrees of freedom, so that the worst-case scenarios discussed above don't come to pass.[24] Hence, in practice, it seems that stratifying on independently and identically distributed noise does not do any worse than a simple random draw in terms of power when sample sizes are not very small.

Table 7. How Does Stratification Vary with the Number of Strata? (Simulation results)

Panel A. Imbalance: ninety-fifth percentile of difference in follow-up means

| Outcome | n = 100, two variables (8 strata) | n = 100, three variables (24 strata) | n = 100, four variables (48 strata) | n = 300, two variables (8 strata) | n = 300, three variables (24 strata) | n = 300, four variables (48 strata) |
|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.322 | 0.338 | 0.338 | 0.210 | 0.213 | 0.209 |
| Child schooling (Indonesia) | 0.399 | 0.346 | 0.369 | 0.219 | 0.211 | 0.212 |
| Household expenditure (Indonesia) | 0.337 | 0.335 | 0.343 | 0.194 | 0.193 | 0.191 |
| Labor income (Mexico) | 0.335 | 0.327 | 0.344 | 0.196 | 0.196 | 0.198 |
| Height z-score (Pakistan) | 0.297 | 0.299 | 0.310 | 0.186 | 0.191 | 0.189 |
| Math test score (Pakistan) | 0.285 | 0.298 | 0.316 | 0.180 | 0.181 | 0.184 |

Panel B. Power: proportion of p-values < 0.10 when no strata dummies included

| Outcome | n = 100, two variables (8 strata) | n = 100, three variables (24 strata) | n = 100, four variables (48 strata) | n = 300, two variables (8 strata) | n = 300, three variables (24 strata) | n = 300, four variables (48 strata) |
|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.129 | 0.138 | 0.144 | 0.274 | 0.281 | 0.278 |
| Child schooling (Indonesia) | 0.303 | 0.267 | 0.273 | 0.585 | 0.574 | 0.562 |
| Household expenditure (Indonesia) | 0.852 | 0.850 | 0.845 | 0.999 | 1.000 | 1.000 |
| Labor income (Mexico) | 0.170 | 0.161 | 0.180 | 0.486 | 0.486 | 0.480 |
| Height z-score (Pakistan) | 0.286 | 0.295 | 0.297 | 0.757 | 0.757 | 0.756 |
| Math test score (Pakistan) | 0.236 | 0.245 | 0.254 | 0.654 | 0.649 | 0.650 |

Panel C. Power: proportion of p-values < 0.10 when strata dummies included

| Outcome | n = 100, two variables (8 strata) | n = 100, three variables (24 strata) | n = 100, four variables (48 strata) | n = 300, two variables (8 strata) | n = 300, three variables (24 strata) | n = 300, four variables (48 strata) |
|---|---|---|---|---|---|---|
| Microenterprise profits (Sri Lanka) | 0.186 | 0.273 | 0.242 | 0.305 | 0.327 | 0.343 |
| Child schooling (Indonesia) | 0.278 | 0.301 | 0.255 | 0.596 | 0.596 | 0.589 |
| Household expenditure (Indonesia) | 0.904 | 0.914 | 0.876 | 1.000 | 1.000 | 1.000 |
| Labor income (Mexico) | 0.204 | 0.212 | 0.199 | 0.561 | 0.551 | 0.541 |
| Height z-score (Pakistan) | 0.487 | 0.463 | 0.457 | 0.849 | 0.843 | 0.854 |
| Math test score (Pakistan) | 0.464 | 0.464 | 0.399 | 0.792 | 0.790 | 0.781 |

Notes: Statistics are based on 10,000 simulations of each method. Details on methods and variables are in Web Appendix, Table A2.

Finally, Table 6 shows that when stratification or matching is done purely on the basis of independently and identically distributed noise, treating the randomization as if it was a pure random draw does not lower power compared to the case where a single random draw is used. This is in contrast to the case when matching or stratification is done on variables with strong predictive power. Intuitively, when pure noise is used for stratification, it is as if a pure random draw was taken.[25]

The simulation results for stratifying on independently and identically distributed noise with 8 (48) strata and 30 (300) observations suggest that overstratification is not a concern in practice when using a reasonable number of strata. In order to check what would happen in an extreme case, we also simulated stratification on independently and identically distributed noise with 20 strata for 30 observations and 200 strata for 300 observations. In each case, one third of the strata include only one observation, reducing the number of observations that contribute to estimating the treatment effect. The results, included in the last column of Table 6, show that power is now quite a bit lower compared to a pure random draw. We thus conclude that, although in extreme cases it is possible to lose power due to overstratification, in practice it is unlikely that one would encounter this problem.

G. What Is the Meaning of the Standard Table 1, If Any?

Section I points out that most research papers containing randomized experiments feature a table, usually the first in the paper, that
tests whether there are any statistically significant differences in the baseline means of a number of variables across treatment and control groups. The widespread use of such tests is interesting in light of concern in the clinical trials literature about both the statistical basis for such tests, and their potential for abuse.[26] Altman (1985, 26) writes that when treatment allocation was properly randomized, a difference of any sort between the two groups will necessarily be due to chance; performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd.

Altman (1985, 26) goes on to add that statistical significance is immaterial when considering whether any imbalance between the groups may have affected the results. In particular, it is wrong to infer from the lack of statistical significance that the variable in question did not affect the outcome of the trial, since a small imbalance in a variable highly correlated with the outcome of interest can be far more important than a large and significant imbalance for a variable uncorrelated with the variable of interest.

[24] Note that even our smallest sample size of 30 is larger than the cases Donald C. Martin et al. (1993) study, where a loss of power can occur.

[25] However, this does not mean that ex post one can check whether the variables used for matching or stratification have predictive power for the future outcome, and if not, ignore the method of randomization. Ignoring the matching or stratification is only correct if the baseline variables are truly pure noise. If there is any signal in these stratifying or matching variables, then ignoring the randomization method will result in incorrect size for hypothesis tests.

[26] See also Imai, King, and Stuart (2008) for discussion on this issue in social science field experiments and for their suggestions as to what should constitute a proper check of balance.

A particular concern with the use of significance tests is that researchers may decide whether or not to control for a covariate in their treatment regression on the basis of whether it is significant. Thomas Permutt (1990) shows that the resulting test's true significance level is lower than the nominal level, especially for variables which are more strongly correlated with the outcome of interest. He further shows that adjusting on the basis of an initial significance test does worse than randomly choosing a covariate to adjust for. He reasons that the initial significance test tends to suppress covariate adjustment precisely where it would, on average, do some good: the cases where the adjustment would be enough to produce significance of the outcome, but where the difference in means falls short of significance. Instead, greater power is achieved by always adjusting for a covariate that is highly correlated with the outcome of interest, regardless of its distribution between groups.

However, although controlling for covariates which are highly correlated with the outcome of interest will increase power and still yield consistent estimates, recent work by David A. Freedman (2008), discussed further in Angus Deaton (2008), shows that doing so will induce a finite-sample bias if the treatment effect is heterogeneous and correlated with the square of the covariate introduced. It therefore is of use to compare the point estimate with and without such controls. If the point estimate changes a lot when the covariate is added, then one can investigate further, using interaction models, whether the treatment effect varies with the covariate of interest.[27]

A final concern with the use of significance tests for imbalance is their potential for abuse. For example, Schulz and David A. Grimes (2002) report that in the clinical trials literature, researchers who use hypothesis tests to compare baseline characteristics report fewer significant results than expected by chance. They suggest one plausible explanation is that some investigators may not report some variables with significant differences, believing that doing so would reduce the credibility of their reports. We have no evidence to suggest this is occurring in the development literature, and hope the profession can use this first table in a manner that doesn't lead to the temptation for such abuse. In particular, we urge referees and editors to view a lack of balance on one or two variables in a randomized experiment as simply the result of chance, not a reason per se to reject a paper.[28] And the criterion for robustness should be whether these variables are believed to be strongly correlated with the outcome of interest (authors can provide correlations between baseline variables and the outcome as a guide), rather than whether the p-value for a difference in means is below 0.05 or not.

[27] Of course, doing this requires a valid estimate of the standard errors. Consistent estimates are easily available, but the finite-sample properties of such estimators are not so clear. See Freedman (2008) and Deaton (2008) for further discussion.

[28] Unless there is a reason to suspect interference in the randomization, in which case a pattern of many variables showing systematic differences in means at high levels of significance may raise red flags. Another case in which Table 1 could raise red flags is if there is attrition and observations in the same strata or pair are not dropped from the analysis. In this case, Table 1 could reveal whether observables are still balanced after attrition.

So how should we interpret such tables? The first question of interest in practice is: given that such a test shows a statistically significant difference in baseline means, does this make it more likely that there is also a statistically significant difference in follow-up means in the absence of treatment? The answer is yes, provided that the
baseline data have predictive power for the follow-up outcomes (see Web Appendix 4).

The second question of interest is: if we observe statistical imbalance at baseline, but control for baseline variables in our analysis, are we more likely to observe imbalance at follow-up than if we had obtained a random draw that didn't show baseline imbalance? To examine this question, we take 10,000 simulations of a single random draw and divide them into two sets. The first set includes all draws that had a statistically significant difference at the 5 percent level in at least 1 of our 7 baseline variables. We call this the unbalanced set. The second set is the balanced set, and includes all other draws. The top panels of Figure 2 (panels A and B) show that the distributions of the differences in means between treatment and control for baseline labor income and baseline math test scores are more tightly concentrated around zero in the balanced set than in the unbalanced set.[29] The middle panels show that these differences are less pronounced, but still persist at follow-up, again showing that imbalance in baseline makes it more likely to have imbalance at follow-up. However, once we control for the seven baseline variables, the distribution of a test of no treatment effect in the follow-up outcome, when no treatment was given, is identical regardless of whether or not there was baseline imbalance.

Intuitively, when randomization is used to allocate units into treatment and control groups, if we do find unbalanced baseline characteristics, once we control for them, the remaining unobservables are no more or less likely to be unbalanced than if we did not find unbalanced baseline characteristics. However, as recommended by Altman (1985), we should choose which baseline characteristics to control for not on the basis of statistical differences, but on the strength of their relationship to the outcome of interest.

IV. Conclusions

Our surveys of the recent literature and of the most experienced researchers implementing
randomized experiments in developing countries find that most researchers are not relying on pure randomization, but are doing something to pursue balance on observables. In addition to stratification, we find pairwise matching and rerandomization methods to be used much more than is apparent from the existing development literature. The paper draws out implications from the existing statistical, clinical, and social science literature on the pros and cons of these various methods of seeking balance, and compares the performance of the different methods in simulations.

[29] Web Appendix A3 presents the same figures for other outcome variables and sample sizes. They all show the same patterns as in Figure 2.

Our simulation results show the method of randomization matters more in small sample sizes, such as 30 or 100 observations, and matters more for relatively persistent outcome variables, such as health and test scores, than for less persistent outcome variables, such as microenterprise profits or household expenditure. Overall, we find pairwise matching to perform best in achieving balance in small samples, provided that the variables used in forming pairs have good predictive power for future outcomes. Stratification and rerandomization using a minmax method also lead to some improvements over a pure random draw, but in the majority of our simulations are dominated by pairwise matching. With sample sizes of 300, we find that the method of randomization matters much less, although matching still leads to some improvement in balance for the persistent outcomes.

[Figure 2. If We Observe Baseline Imbalance and Control for Baseline Variables, Is There Any Difference in Follow-up Balance? Panel A: ENE labor income data, 100 observations. Panel B: LEAPS math test score data, 300 observations. Rows in each panel: baseline, follow-up, follow-up with controls. Horizontal axis: difference in average outcome. Series: balanced, unbalanced.]

Our analysis of how randomization is being carried out in practice suggests several areas where the practice of randomization can be improved or better reported. This leads us to draw out the following recommendations:

1. Better reporting of the method of random assignment is needed. This should include a description of:
   a. Which randomization method was used and why.
   b. Which variables were used for balancing.
   c. For stratification, how many strata were used.
   d. For rerandomization, which cutoff rules were used.
   This is particularly important for experiments with small samples, where the randomization method makes more difference.

2. Clearly describe how the randomization was carried out in practice:
   a. Who performed the randomization?
   b. How was the randomization done (coin toss, random number generator, etc.)?
   c. Was the randomization carried out in public or private?

3. Rethink the common use of rerandomization. Our simulations find pairwise matching to generally perform as well as, or better than, rerandomization in terms of balance and power, and, like rerandomization, matching allows balance to be sought on more variables than possible under stratification. Adjusting for the method of randomization is statistically cleaner with matching or stratification than with rerandomization. If rerandomization is used, the authors should justify why rerandomization was preferred to the other methods of randomization.

4. When deciding which variables to balance on, strongly consider the baseline outcome variable and geographic region dummies, in addition to variables desired for subgroup analysis. In practice, few existing studies stratify on baseline values of the outcome of interest. Yet, in all of our datasets, the baseline outcome variable is the one that is most strongly correlated with the future outcome.
Justification for regional stratification comes from the fact that treatment implementation and shocks are likely to vary by region 5 Be aware that overstratification can lead to a loss of power in extreme cases This is because using a large number of strata involves a downside in terms of loss in degrees of freedom when estimating standard errors possibly more cases of missing observations and odd numbers within strata when stratifi cation is used We find a loss in power only in an extreme case where we stratify on independently and identically distributed noise and have more strata than observations In practice researchers are unlikely to pursue balance to this extreme meaning that overstratification is unlikely to occur in practice However there is still a tradeoff between stratifying or matching on more variables and achieving closer balance on a smaller number of variables 6 As ye randomize so shall ye analyze Stephen Senn 2004 Our simulations show that while on average failure to account for the method of randomization generally results in overly conservative standard errors there are also a sub stantial number of draws in which standard errors that do not account for the method of randomization overstate the significance of the results Moreover failure to control for the method of randomization results in incorrect test size and low power In general we feel that it is important to follow a standard rule here to avoid ex post decision making of whether to control for the method of randomization or not We recommend that the standard should be to control for the method of randomization30 Since the majority of inference in economics is modelbased rather than randomization inference this means adding con trols for all covariates used in seeking balance That is strata dummies should be included when analyzing the results of stratified randomization Similarly pair dummies should be included for matched randomization or linear vari ables used for rerandomizations 30 
[30] If authors believe they have a valid reason not to control for stratum dummies, they should explain this reasoning in their text and also mention what the results would be if stratum dummies were included.

7. In the ex post analysis, do not automatically control for baseline variables that show a statistically significant difference in means. The previous literature and our simulations suggest that it is a better rule to control for variables that are thought to influence follow-up outcomes, independent of whether their difference in means is statistically significant or not. When there are several such variables and not all of them can be included in the analysis, correlations between baseline variables and follow-up data can be checked explicitly to pick the variables that are most strongly correlated with follow-up outcomes. One should still be cautious in the use of ex post controls, given the potential for finite-sample bias if treatment heterogeneity is correlated with the square of these covariates.

REFERENCES

Aickin, Mikel. 2001. "Randomization, Balance, and the Validity and Efficiency of Design-Adaptive Allocation Methods." Journal of Statistical Planning and Inference, 94(1): 97-119.
Altman, Douglas G. 1985. "Comparability of Randomized Groups." Statistician, 34: 125-36.
Andrabi, Tahir, Jishnu Das, Asim I. Khwaja, and Tristan Zajonc. 2008. "Do Value-Added Estimates Add Value? Accounting for Learning Dynamics." Harvard University Center for International Development Working Paper 158.
Ashraf, Nava, James Berry, and Jesse M. Shapiro. 2007. "Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia." National Bureau of Economic Research Working Paper 13247.
Ashraf, Nava, Dean Karlan, and Wesley Yin. 2006a. "Deposit Collectors." B. E. Journal of Economic Analysis and Policy: Advances in Economic Analysis and Policy, 6(2): 1-22.
Ashraf, Nava, Dean Karlan, and Wesley Yin. 2006b. "Tying Odysseus to the Mast: Evidence from a Commitment Savings Product in the
Philippines." Quarterly Journal of Economics, 121(2): 635-72.
Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics, 122(3): 1235-64.
Banerjee, Abhijit Vinayak, with Alice H. Amsden, Robert H. Bates, Jagdish Bhagwati, Angus Deaton, and Nicholas Stern. 2007. Making Aid Work. Cambridge, MA: MIT Press.
Bertrand, Marianne, Simeon Djankov, Rema Hanna, and Sendhil Mullainathan. 2007. "Obtaining a Driver's License in India: An Experimental Approach to Studying Corruption." Quarterly Journal of Economics, 122(4): 1639-76.
Björkman, Martina, and Jakob Svensson. 2009. "Power to the People: Evidence from a Randomized Field Experiment on Community-Based Monitoring in Uganda." Quarterly Journal of Economics, 124(2): 735-69.
Bobonis, Gustavo J., Edward Miguel, and Charu Puri-Sharma. 2006. "Anemia and School Participation." Journal of Human Resources, 41(4): 692-721.
Box, George E. P., J. Stuart Hunter, and William G. Hunter. 2005. Statistics for Experimenters: Design, Innovation, and Discovery. 2nd ed. Hoboken, NJ: Wiley-Interscience.
Burtless, Gary. 1995. "The Case for Randomized Field Trials in Economic and Policy Research." Journal of Economic Perspectives, 9(2): 63-84.
Committee for Proprietary Medicinal Products (CPMP). 2003. "Points to Consider on Adjustment for Baseline Covariates." The European Agency for the Evaluation of Medicinal Products, Evaluation of Medicines for Human Use Report CPMP/EWP/2863/99. httpwwwemeaeuropaeupdfshumanewp286399enpdf (accessed February 6, 2008). London: EMEA.
Deaton, Angus. 2008. "Instruments of Development: Randomization in the Tropics, and the Search for the Elusive Keys to Economic Development." The Keynes Lecture in Economics, British Academy, London, October 9, 2008.
de Mel, Suresh, David McKenzie, and Christopher Woodruff. 2008. "Returns to Capital in Microenterprises: Evidence from a Field Experiment." Quarterly Journal of Economics, 123(4): 1329-72.
Duflo, Esther. 2005. "Evaluating the Impact of Development Aid Programmes: The Role of Randomised Evaluations." In
Development Aid: Why and How? Toward Strategies for Effectiveness. Proceedings of the AFD-EUDN Conference 2004, 205-47. Paris: Agence Française de Développement. httpwwwafdlibrevilleorgjahiawebdavsiteafdusersadministrateurpublicpublicationsnotesetdocumentsND22pdfpage208
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Vol. 4, ed. T. Paul Schultz and John Strauss, 3895-3962. Amsterdam, NH: North Holland.
Duflo, Esther, Rema Hanna, and Stephen Ryan. 2007. "Monitoring Works: Getting Teachers to Come to School." Bureau of Research and Economic Analysis Development (BREAD) Working Paper 103.
Duflo, Esther, and Michael Kremer. 2004. "Use of Randomization in the Evaluation of Development Effectiveness." In Evaluating Development Effectiveness, Vol. 7, World Bank Series on Evaluation and Development, ed. George Keith Pitman, Osvaldo N. Feinstein, and Gregory K. Ingram, 205-32. Piscataway, NJ: Transaction Publishers.
Dupas, Pascaline. 2006. "Relative Risks and the Market for Sex: Teenagers, Sugar Daddies, and HIV in Kenya." httpwwwinternationalpolicyumicheduedtspdfsDupasRelativeRiskspdf
Ferraz, Claudio, and Frederico Finan. 2008. "Exposing Corrupt Politicians: The Effects of Brazil's Publicly Released Audits on Electoral Outcomes." Quarterly Journal of Economics, 123(2): 703-45.
Field, Erica, and Rohini Pande. 2008. "Repayment Frequency and Default in Microfinance: Evidence from India." Journal of the European Economic Association, 6(2-3): 501-09.
Fisher, Ronald A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
Freedman, David A. 2008. "On Regression Adjustments in Experiments with Several Treatments." Annals of Applied Statistics, 2(1): 176-96.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz. 2004. "Retrospective vs. Prospective Analyses of School Inputs: The Case of Flip Charts in Kenya." Journal of Development Economics, 74(1): 251-68.
Glewwe, Paul, Albert Park, and Meng Zhao. 2006. "The Impact of
Eyeglasses on Academic Performance of Primary School Students: Evidence from a Randomized Trial in Rural China." University of Minnesota Center for International Food and Agricultural Products Conference Paper 6644.
Greevy, Robert, Bo Lu, Jeffrey H. Silber, and Paul Rosenbaum. 2004. "Optimal Multivariate Matching Before Randomization." Biostatistics, 5(2): 263-75.
He, Fang, Leigh L. Linden, and Margaret MacLeod. 2007. "Teaching What Teachers Don't Know: An Assessment of the Pratham English Language Program." httpwwwcolumbiaedumgm2115PicTalk20Working20Paper2020070326pdf
Imai, Kosuke, Gary King, and Clayton Nall. Forthcoming. "The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation." Statistical Science.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. "Misunderstandings between Experimentalists and Observationalists About Causal Inference." Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2): 481-502.
Imbens, Guido, Gary King, David McKenzie, and Geert Ridder. 2009. "On the Finite Sample Benefits of Stratification in Randomized Experiments." Unpublished.
Karlan, Dean, and Martin Valdivia. 2006. "Teaching Entrepreneurship: Impact of Business Training on Microfinance Clients and Institutions." httpaidaeconyaleedukarlanpapersTeachingEntrepeneurshippdf
Kernan, Walter N., Catherine M. Viscoli, Robert W. Makuch, Lawrence M. Brass, and Ralph I. Horwitz. 1999. "Stratified Randomization for Clinical Trials." Journal of Clinical Epidemiology, 52(1): 19-26.
King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez-Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. 2007. "A Politically Robust Experimental Design for Public Policy Evaluation, with Application to the Mexican Universal Health Insurance Program." Journal of Policy Analysis and Management, 26(3): 479-506.
Kremer, Michael. 2003. "Randomized Evaluations of Educational Programs in Developing Countries: Some Lessons."
American Economic Review, 93(2): 102-06.
Kremer, Michael, Jessica Leino, Edward Miguel, and Alix Peterson Zwane. 2006. "Spring Cleaning: A Randomized Evaluation of Source Water Quality Improvement." httpwwwsscnetuclaedupolisciwgapepapers11Miguelpdf
Levitt, Steven D., and John A. List. 2008. "Field Experiments in Economics: The Past, The Present, and The Future." National Bureau of Economic Research Working Paper 14356.
Martin, Donald C., Paula Diehr, Edward B. Perrin, and Thomas D. Koepsell. 1993. "The Effect of Matching on the Power of Randomized Community Intervention Studies." Statistics in Medicine, 12(3-4): 329-38.
Miguel, Edward, and Michael Kremer. 2004. "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities." Econometrica, 72(1): 159-217.
Olken, Benjamin A. 2007a. "Monitoring Corruption: Evidence from a Field Experiment in Indonesia." Journal of Political Economy, 115(2): 200-249.
Olken, Benjamin A. 2007b. "Political Institutions and Local Public Goods: Evidence from a Field Experiment in Indonesia." httpwwwpovertyactionlaborgpapers51OlkenPoliticalInstitutionspdf
Permutt, Thomas. 1990. "Testing for Imbalance of Covariates in Controlled Experiments." Statistics in Medicine, 9(12): 1455-62.
Pocock, S. J., and R. Simon. 1975. "Sequential Treatment Assignment with Balancing for Prognostic Factors in the Controlled Clinical Trial." Biometrics, 31: 103-15.
Raab, Gillian M., Simon Day, and Jill Sales. 2000. "How to Select Covariates to Include in the Analysis of a Clinical Trial." Controlled Clinical Trials, 21(4): 330-42.
Rosenbaum, Paul R. 2002. Observational Studies. 2nd ed. New York: Springer.
Schulz, Kenneth F. 1996. "Randomised Trials, Human Nature, and Reporting Guidelines." Lancet, 348(9027): 596-98.
Schulz, Kenneth F., and David A. Grimes. 2002. "Allocation Concealment in Randomised Trials: Defending Against Deciphering." Lancet, 359(9306): 614-18.
Scott, Neil W., Gladys C. McPherson, Craig R. Ramsay, and Marion K. Campbell. 2002. "The Method of Minimization for Allocation to Clinical Trials: A Review." Controlled Clinical
Trials, 23(6): 662-74.
Senn, Stephen. 2004. "Added Values: Controversies Concerning Randomization and Additivity in Clinical Trials." Statistics in Medicine, 23(24): 3729-53.
Skoufias, Emmanuel. 2005. "PROGRESA and Its Impact on the Welfare of Rural Households in Mexico." International Food Policy Research Institute (IFPRI) Research Report 139.
Soares, Jose F., and C. F. Jeff Wu. 1983. "Some Restricted Randomization Rules in Sequential Designs." Communications in Statistics: Theory and Methods, 12(17): 2017-34.
Taves, D. R. 1974. "Minimization: A New Method of Assigning Patients to Treatment and Control Groups." Clinical Pharmacology and Therapeutics, 15(5): 443-53.
Therneau, Terry M. 1993. "How Many Stratification Factors Are Too Many to Use in a Randomization Plan?" Controlled Clinical Trials, 14(2): 98-108.
Treasure, Tom, and Kenneth D. MacRae. 1998. "Minimisation: The Platinum Standard for Trials?" British Medical Journal, 317(7155): 362-63.

1. What is the importance of randomizing treatment in social experiments with respect to identifying the causal effect?

Answer: Randomizing treatment in social experiments is essential for identifying the causal effect because it ensures that the allocation of participants to the treatment and control groups is determined exclusively by chance. This eliminates selection bias and makes the groups comparable, on average, in both observed and unobserved characteristics before the experiment begins. Any difference in outcomes between the groups can therefore be reliably attributed to the treatment rather than to external factors or intrinsic characteristics of the participants.

2. Describe the method of simple randomization of treatment. What are its advantages and what are the possible problems?

Answer: Simple randomization distributes individuals or institutions at random between the treatment and control groups, without taking participants' specific characteristics into account. This ensures that the allocation is
entirely a matter of chance, so that the groups are, on average, similar in both observable and unobservable characteristics. One of its main advantages is simplicity: it is easy to implement and to understand, and it offers a direct way of assigning treatment. Simple randomization is also perceived as a fair and transparent procedure, which makes it easier for other researchers and policymakers to accept. A further benefit is credibility: because the method relies on chance alone, the validity of the experiment is hard to contest, since intentional bias in allocating units to groups is ruled out.

The method can nevertheless cause problems, especially in small samples, where a substantial imbalance between the groups on important characteristics is more likely and undermines comparability and precision. In such cases, ex post adjustments to correct the imbalance can be inefficient, leading to a loss of precision in the estimated treatment effect. Simple randomization also makes the composition of the groups unpredictable, producing considerable variation in key characteristics, which can affect the interpretation of the results and the robustness of the conclusions.

3. Researchers usually place at the beginning of a paper a table with the means of each pre-treatment covariate, separately for treated and control units, and a column with tests of differences in means between the two groups. What does such a table allow one to analyze? What are the limitations of this analysis?

Answer: A table that reports the mean of each pre-treatment covariate for the treatment and control groups, together with tests of differences in means, lets researchers assess whether the groups are similar in observable characteristics before the intervention. This baseline balance check is fundamental for confirming that the randomization worked, ensuring that there are no
systematic imbalances between the groups, which is essential for a reliable causal interpretation of the results.

The analysis nonetheless has significant limitations. First, it is restricted to the observable variables included in the table and cannot guarantee balance on unobservable variables that may also influence the results of the experiment. Second, tests of differences in means may lack the statistical power to detect imbalances in small samples, even when an imbalance exists. The absence of statistically significant differences in the means of pre-treatment covariates therefore does not necessarily imply complete balance. Finally, these tests are problematic when researchers use them to decide which covariates to include in the analysis, because selecting variables this way can influence the interpretation of the results and biases the choice toward the variables that happen to show the largest imbalance.

4. How should one decide which covariates to consider when assessing the quality of the randomization? What are the problems with using pre-treatment covariates?

Answer: To decide which covariates to consider when assessing the quality of the randomization, it is recommended to choose variables that are strongly related to the outcome of interest. It is also advisable to include covariates that define the subgroups of interest for the analysis, since this improves the statistical precision of the experiment and its ability to detect the treatment effect. Ideally, one should balance on the outcome variable measured at baseline, because it usually has the strongest correlation with future outcomes. Geographic variables are also recommended, because they capture regional variation that affects outcomes, whether through differences in local shocks or in how the treatment is implemented.

However, the use of
pre-treatment covariates raises some problems. First, these variables may have limited predictive power for the future outcome, especially where outcomes are susceptible to unexpected changes or to factors the covariates do not capture. This can create a false sense of balance between the groups, since the covariate may not adequately predict future variation in the outcome of interest. In addition, increasing the number of covariates used for balancing can result in less balance on each individual variable, which complicates the analysis and undermines the effectiveness of the randomization. Finally, using too many covariates increases the complexity of the analysis, making the results harder to interpret and stratification or matching harder to implement, especially in small samples.

5. Describe the method of randomization with stratification. What are the advantages and disadvantages of this method?

Answer: Stratified randomization divides participants into homogeneous subgroups (strata) based on one or more observable characteristics before treatment is allocated. Within each stratum, individuals are randomly assigned to the treatment or control group, ensuring balance between the groups on the characteristics used to form the strata.

The advantages include better balance on the stratifying variables, which increases the sensitivity of the experiment by allowing smaller treatment effects to be detected. Stratification also helps reduce the variability of the results, increasing precision and statistical efficiency. Another benefit is the possibility of more detailed subgroup analyses, since stratification guarantees a sufficient number of individuals in each group for comparison.

However, stratified randomization
has some disadvantages. The main limitation is the restriction on the number of variables that can be used to create the strata, especially in small samples: each additional variable multiplies the number of strata, so the number of strata grows exponentially and some strata may end up with few participants or none at all. This can compromise the efficiency of the experiment and complicate the statistical analysis. Stratification is also more effective for categorical variables and less suitable for continuous variables, which cannot be perfectly balanced within strata. Finally, if the stratifying variables are not properly controlled for in the subsequent analysis, the significance of the treatment effect risks being understated, because the resulting standard errors are, on average, overly conservative.

6. Which method is superior, simple randomization or stratified randomization? Why? In which situations should one be careful with stratification?

Answer: Stratified randomization is generally considered superior to simple randomization, especially when greater balance on observable covariates is the goal. Stratification improves the balance of key variables between the treatment and control groups, increasing statistical precision and the sensitivity of the experiment to detect treatment effects. By reducing variability within strata, it contributes to more precise and reliable estimates, above all in small or moderate samples, where the risk of imbalance between the groups is greater. By securing balance on specific characteristics in advance, stratification also allows more robust subgroup analysis, increasing the internal validity of the study.

Stratification nevertheless calls for caution in certain situations. First, in small samples, using many variables to create the strata can produce strata with few participants or none, compromising
the effectiveness of the randomization and the feasibility of the experiment. It is also important to bear in mind that stratification works best with categorical variables and is more challenging for continuous variables, which cannot be perfectly balanced within strata. Another point of attention is that, if the strata are not properly controlled for in the subsequent statistical analysis, the estimates of uncertainty become more conservative and the ability to detect the treatment effect is reduced. Finally, in large samples the advantage of stratification over simple randomization tends to diminish, since all randomization methods achieve balance as the sample size increases.

7. Describe the randomization method that uses pairwise matching. What are the advantages and disadvantages of this method? Why can one not use many covariates for matching in small samples?

Answer: Randomization with pairwise matching forms pairs of units with similar characteristics before treatment is allocated. Within each pair, one unit is randomly assigned to the treatment group and the other to the control group. The aim is to minimize the distance between the characteristics of the members of each pair, ensuring greater balance on observable covariates and increasing the comparability of the groups.

The advantages include better covariate balance between the treatment and control groups, which improves the precision of the estimates and increases the statistical power of the experiment. Matching also offers partial protection against political interference or dropouts in social experiments. For example, if one unit of a pair is lost during the study, the other unit of the pair can be dropped, preserving balance among the remaining units.

The method does have some disadvantages. Forming pairs can be more complex and time-consuming, especially in larger samples or in
settings where field time is limited. In addition, if participants drop out at random, the method can shrink the sample considerably, reducing the statistical power of the study. Another disadvantage is that matching cannot guarantee balance on continuous variables as effectively as on categorical ones, and the matching procedure can become infeasible when there are many covariates to balance.

In small samples, using many covariates for matching can produce poorly matched or even infeasible pairs, reducing the efficiency of the experiment. This is because, as the number of covariates grows, it becomes harder to find pairs with sufficiently similar characteristics, which leads to forced pairs or to a smaller number of possible pairs. Excessive use of covariates can therefore compromise balance on the main variables and limit the ability of the study to detect treatment effects.

8. What do rerandomization methods consist of? What problems can arise?

Answer: Rerandomization methods carry out multiple random allocations to treatment and control and then select the allocation with the best balance on a set of observable covariates. There are two main approaches: the "big stick" rule, under which the randomization is redone whenever there is a statistically significant difference in covariate means between the groups, and the approach that chooses the allocation with the smallest maximum t-statistic among a large number of draws (for example, 1,000), thereby securing more uniform balance.

Rerandomization methods have the advantage of giving greater control over balance on several covariates at once, especially in small samples, where a single random draw can produce substantial imbalances. Rerandomization can also be
an effective solution when there are several variables on which extreme imbalance should be avoided without requiring exact balance on each of them, as stratification does.

There are, however, potential problems with these methods. First, rerandomization can make the allocation process less transparent and more complex, since it involves additional criteria for choosing the final allocation, which can make the experiment harder to replicate. These methods can also distort the subsequent analysis if they are not properly controlled for, because treatment assignment is no longer completely random. This can lead to errors in statistical inference, such as conservative variance estimates or incorrect test sizes. Finally, using subjective criteria to decide when to redraw can compromise the validity of the results, especially if those criteria are not clearly defined before the study begins.

9. What is the importance of sample size? In what sense do larger samples improve the quality of the randomization?

Answer: Sample size is fundamental to the quality of randomization in experiments. Larger samples tend to improve balance between the treatment and control groups, because the probability of extreme differences between the characteristics of the groups falls as the number of participants rises. With more units, the distribution of characteristics in each group comes closer to the distribution in the population, yielding better balance on both observable and unobservable variables. In small samples, substantial imbalances between the groups are more common, which can compromise comparability and, consequently, the validity of the experimental results. Ex post adjustments to correct these imbalances are less efficient than balance
achieved ex ante, since adjusting afterwards can yield less precise estimates of the treatment effect. Larger samples therefore not only reduce the variability of the estimates and increase statistical precision, but also make it more likely that any observed difference in outcomes is attributable to the treatment rather than to an imbalance in baseline characteristics between the groups. In short, increasing the sample size strengthens the internal validity of the experiment and makes the identification of the intended causal effects more reliable.

10. Using the dataset experimental.dta, compute the means and standard deviations of the pre-treatment covariates and test for differences in means between treated and control units. Also compute the effect of treatment on the outcome of interest, re78.

Answer:

Means and standard deviations (descriptive statistics)

Statistic           T   age     educ    black   hisp    marr    re74    re75    re78    u74     u75
N                   0   260     260     260     260     260     260     260     260     260     260
                    1   185     185     185     185     185     185     185     185     185     185
Missing             0   0       0       0       0       0       0       0       0       0       0
                    1   0       0       0       0       0       0       0       0       0       0
Mean                0   25.1    10.1    0.827   0.108   0.154   2107    1267    4555    0.750   0.685
                    1   25.8    10.3    0.843   0.0595  0.189   2096    1532    6349    0.708   0.600
Median              0   24.0    10.0    1.00    0.00    0.00    0.00    0.00    3139    1.00    1.00
                    1   25.0    11.0    1.00    0.00    0.00    0.00    0.00    4232    1.00    1.00
Standard deviation  0   7.06    1.61    0.379   0.311   0.361   5688    3103    5484    0.434   0.466
                    1   7.16    2.01    0.365   0.237   0.393   4887    3219    7867    0.456   0.491
Minimum             0   17.0    3.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
                    1   17.0    4.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Maximum             0   55.0    14.0    1.00    1.00    1.00    39571   23032   39484   1.00    1.00
                    1   48.0    16.0    1.00    1.00    1.00    35040   25142   60308   1.00    1.00

Test of differences in means (independent-samples t-test)

Variable  Test         Statistic  df    p
age       Student's t  1.1166     443   0.265
educ      Student's t  1.4958     443   0.135
black     Student's t  0.4548     443   0.649
hisp      Student's t  1.7757     443   0.076
marr      Student's t  0.9804     443   0.327
re74      Student's t  0.0222     443   0.982
re75      Student's t  0.8746     443   0.382
re78      Student's t  2.8353     443   0.005
u74       Student's t  0.9829     443   0.326
u75       Student's t  1.8466     443   0.065

Effect of treatment on the outcome of interest

Predictor  Estimate  Std. error  t      p
Intercept  4555      408         11.16  < 0.001
T          1794      633         2.84   0.005

Model 1: R = 0.134, R² = 0.0178

11. Redo the previous exercise with a random subsample of size 30. Is any difference observed in the results?

Answer:

Means and standard deviations

Statistic           T   age     educ    black   hisp    marr    re74    re75    re78    u74     u75
N                   1   15      15      15      15      15      3       2       4       15      15
                    0   15      15      15      15      15      15      15      8       15      15
Missing             1   0       0       0       0       0       12      13      11      0       0
                    0   0       0       0       0       0       0       0       7       0       0
Mean                1   26.9    10.7    0.733   0.200   0.467   6760    7256    12363   0.133   0.00
                    0   25.7    9.47    0.733   0.0667  0.133   0.00    0.00    1475    1.00    1.00
Median              1   26      11      1       0       0       0.00    7256    6402    0       0
                    0   24      9       1       0       0       0.00    0.00    0.00    1       1
Standard deviation  1   6.03    1.53    0.458   0.414   0.516   11709   1696    17278   0.352   0.00
                    0   7.56    2.33    0.458   0.258   0.352   0.00    0.00    4171    0.00    0.00
Minimum             1   17      8       0       0       0       0.00    6057    0.00    0       0
                    0   18      4       0       0       0       0.00    0.00    0.00    1       1
Maximum             1   35      14      1       1       1       20280   8456    36647   1       0
                    0   45      14      1       1       1       0.00    0.00    11797   1       1

Test of differences in means

Variable  Test         Statistic  df    p
age       Student's t  0.481      28.0  0.635
educ      Student's t  1.761      28.0  0.089
black     Student's t  0.000      28.0  1.000
hisp      Student's t  1.058      28.0  0.299
marr      Student's t  2.066      28.0  0.048
re74      Student's t  2.582      16.0  0.020
re75      Student's t  22.010     15.0  < 0.001
re78      Student's t  1.763      10.0  0.108
u74       Student's t  9.539      28.0  < 0.001
u75       Student's t  NaN

Effect of treatment on the outcome of interest

Predictor  Estimate  Std. error  t      p
Intercept  1475      3566        0.413  0.688
T          10888     6177        1.763  0.108

Model 1: R = 0.487, R² = 0.237

Comparing the full dataset with the random subsample of 30 observations reveals notable differences in the results. The means and standard deviations of the pre-treatment covariates vary substantially between the full sample and the subsample. In the full sample the covariates take more stable values, whereas the subsample shows greater variability, especially in variables such as re74 and re75, which have higher means. The variables u74 and u75, in turn, show more
extreme distributions in the subsample, as expected given the smaller sample size and the resulting greater sampling variability.

As for the tests of differences in means between treated and control units, the results differ markedly between the two analyses. In the full sample, only the difference in re78 is statistically significant, with a p-value of 0.005, indicating a consistent difference between the groups. In the subsample, by contrast, other covariates such as marr and re74 show statistically significant differences, while re78 is no longer statistically significant, reflecting the lower precision of the subsample.

As for the effect of treatment on the outcome of interest, re78, the effect in the full sample is statistically significant, with a coefficient of 1794 and a p-value of 0.005. In the subsample of 30 observations the treatment effect is not significant: the coefficient is larger (10888) but the p-value is 0.108, indicating a wider confidence interval and less precise estimates because of the smaller sample size.

In short, the comparison shows that reducing the sample size affects the precision of the estimates and the detection of statistically significant differences, resulting in greater variability and less robust results.
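The reported treatment effects can be cross-checked from the group summaries alone. The sketch below is a minimal illustration, not the procedure used to produce the answers: the function name `pooled_t_from_summary` is hypothetical, and the inputs are the re78 group sizes, means and standard deviations reported in exercises 10 and 11, with decimal placement assumed as read from the tables. It relies on the fact that, with a single binary regressor, the OLS coefficient on T equals the difference in group means and its t-statistic equals the pooled-variance two-sample t-statistic.

```python
import math


def pooled_t_from_summary(n1, mean1, sd1, n0, mean0, sd0):
    """Pooled-variance two-sample t-test computed from group summaries.

    Returns (difference in means, standard error, t-statistic, degrees of freedom).
    """
    df = n1 + n0 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n0 - 1) * sd0 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n0))
    diff = mean1 - mean0
    return diff, se, diff / se, df


# re78 summaries as reported above: treated group first, then controls.
full = pooled_t_from_summary(185, 6349, 7867, 260, 4555, 5484)
sub = pooled_t_from_summary(4, 12363, 17278, 8, 1475, 4171)

for label, (diff, se, t, df) in (("full sample", full), ("subsample", sub)):
    print(f"{label}: diff={diff:.0f} se={se:.0f} t={t:.2f} df={df}")
```

Up to rounding of the inputs, the output should match the coefficient, standard error and t-statistic on T in the two regression tables. The point estimate is much larger in the subsample, but its standard error grows by roughly a factor of ten, which is why significance is lost.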