Overview

In the developing world, data are often collected to be representative of the nation or of large sub-national areas, such as states or provinces. Although this approach reduces administrative costs, it often cannot provide estimates precise enough at granular geographic levels to target social programs.

Background

In response to this need, the World Bank research department has been conducting a process called “Poverty Mapping.” Researchers use a survey-to-census imputation model to provide estimates of the proportion of the population living in poverty at more granular geographic levels, such as counties or districts.

This methodology presents several shortcomings, and it has been the subject of controversy. In 2006, development economists Abhijit Banerjee, Angus Deaton, Nora Lustig, and Ken Rogoff wrote, “The difficult and contentious issue with this work [poverty mapping] is the accuracy of these estimates, and indeed whether they are accurate enough to be useful at all.”

Still, the World Bank continues to use this approach. Foremost among its shortcomings, the methodology fails to take a scientific approach to prediction: it does not retain an out-of-sample “test set” to validate the model’s accuracy, and its calibration is seemingly arbitrary.

Research Questions and Outline

This project aims to improve upon the Poverty Mapping methodology. Specifically, it will answer three questions:

  1. How accurate is the World Bank’s poverty mapping approach?

  2. Are there other data science methods that could more accurately predict poverty status than the current poverty mapping approach?

  3. How does each method’s accuracy impact the calculation of the number of poor households?

To answer these questions, I will focus on Ethiopia. The World Bank classifies Ethiopia as a low-income country, with about 37 percent of its population living on less than $2 per day.

I find that models created by the World Bank methodology make fairly inaccurate predictions of income and, subsequently, of poverty status. In this work, other approaches to prediction, such as decision trees and random forests, do not produce more accurate predictions.

The outline of this report is as follows. First, I will create a baseline accuracy score by replicating the World Bank methodology, while reserving 15 percent of the sample for testing. (The data are semi-public, available with registration here: Ethiopia Socioeconomic Survey 2015-2016.) Second, I will use other supervised learning approaches to create alternative methodologies, comparing their accuracy to the World Bank methodology. Finally, I conclude by comparing the differences in the size of the “poor” population in the eleven regions of Ethiopia.

Replicating the Poverty Mapping Approach

Researchers use a survey-to-census imputation model to provide estimates of the proportion of the population living in poverty at more granular geographic levels, such as counties or districts. To perform this process, researchers typically follow these steps.

  1. Begin with a living standards measurement survey (LSMS), in which annual consumption/income, and subsequently poverty status, are determined.

  2. Build an OLS model on the LSMS data that uses a set of features to predict income. These indicators must exist in both the LSMS and census data sets.

  3. A few rules of thumb guide model construction:
     1. retain predictors that have a statistically significant relationship with income;
     2. focus on getting the R-squared above 0.5.

  4. Apply this model’s coefficients to the observations in the census to predict income.

  5. The census data set then provides an adequate sample size to determine poverty at a more granular administrative level.
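The core imputation step, applying the survey-estimated coefficients to census records to predict income, can be sketched as follows. This is an illustrative Python sketch; the coefficient values and field names are hypothetical, not the fitted model.

```python
import math

# Hypothetical coefficients from a survey (LSMS) model of log annual
# consumption; the names mirror the report's specification, but the
# values here are illustrative only.
coefs = {"(Intercept)": 8.63, "log(hh_size)": 0.575, "radio": 0.198}

def impute_log_consumption(census_row):
    """Apply survey-estimated coefficients to a census record to impute
    log annual consumption."""
    return (coefs["(Intercept)"]
            + coefs["log(hh_size)"] * math.log(census_row["hh_size"])
            + coefs["radio"] * census_row["radio"])
```

With an imputed (log) consumption for every census household, the poverty threshold can then be applied region by region.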

## 
## Call:
## lm(formula = log_total_cons_ann ~ married_hhead + log(hh_size) + 
##     female_hhead + hhead_illiterate + hhead_orthodox + hhead_protestant + 
##     hhead_islam + pipedwater + unprotectedwater + notoilet + 
##     latrine + flushtoilet + modernkitchen + advcookingfuel + 
##     electriclighting + finishedwalls + woodwalls + finishedroof + 
##     finishedfloor + dirtfloor + radio + tv + radio * tv + factor(saq01) + 
##     urban, data = lsms.test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0290 -0.3421  0.0014  0.3505  3.5892 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.6341684  0.1230047  70.194  < 2e-16 ***
## married_hhead     0.1136395  0.0309357   3.673 0.000242 ***
## log(hh_size)      0.5750947  0.0173444  33.157  < 2e-16 ***
## female_hhead     -0.0796969  0.0265401  -3.003 0.002691 ** 
## hhead_illiterate -0.0688195  0.0245494  -2.803 0.005083 ** 
## hhead_orthodox    0.2336226  0.0563164   4.148 3.42e-05 ***
## hhead_protestant  0.0938819  0.0563910   1.665 0.096024 .  
## hhead_islam       0.1996735  0.0575385   3.470 0.000525 ***
## pipedwater        0.0396232  0.0207289   1.911 0.056013 .  
## unprotectedwater -0.1602258  0.0474650  -3.376 0.000743 ***
## notoilet          0.0007166  0.0744786   0.010 0.992324    
## latrine           0.0500769  0.0727078   0.689 0.491027    
## flushtoilet       0.2166690  0.0823270   2.632 0.008526 ** 
## modernkitchen     0.1355771  0.0424866   3.191 0.001429 ** 
## advcookingfuel    0.1132539  0.0396341   2.857 0.004292 ** 
## electriclighting  0.2355023  0.0242463   9.713  < 2e-16 ***
## finishedwalls    -0.0341276  0.0405365  -0.842 0.399896    
## woodwalls        -0.0373845  0.0302975  -1.234 0.217308    
## finishedroof      0.1058068  0.0227856   4.644 3.53e-06 ***
## finishedfloor     0.0148061  0.0731783   0.202 0.839670    
## dirtfloor        -0.1707737  0.0715607  -2.386 0.017060 *  
## radio             0.1978952  0.0240584   8.226 2.60e-16 ***
## tv                0.2754699  0.0372268   7.400 1.66e-13 ***
## factor(saq01)2    0.3182874  0.0609995   5.218 1.90e-07 ***
## factor(saq01)3   -0.1450826  0.0368803  -3.934 8.50e-05 ***
## factor(saq01)4    0.0837644  0.0384917   2.176 0.029602 *  
## factor(saq01)5    0.2537660  0.0531695   4.773 1.88e-06 ***
## factor(saq01)6   -0.2885733  0.0640830  -4.503 6.89e-06 ***
## factor(saq01)7   -0.0713601  0.0414625  -1.721 0.085315 .  
## factor(saq01)12   0.1561401  0.0663845   2.352 0.018718 *  
## factor(saq01)13   0.2219387  0.0596083   3.723 0.000199 ***
## factor(saq01)14  -0.0526360  0.0527669  -0.998 0.318574    
## factor(saq01)15   0.0754475  0.0520768   1.449 0.147479    
## urban            -0.0088972  0.0295727  -0.301 0.763539    
## radio:tv         -0.0617019  0.0424542  -1.453 0.146199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5383 on 3971 degrees of freedom
## Multiple R-squared:  0.5129, Adjusted R-squared:  0.5087 
## F-statistic:   123 on 34 and 3971 DF,  p-value: < 2.2e-16

This model has an R-squared above 0.5 and thus satisfies the first optimization requirement of the World Bank approach. I will now remove any predictors that are not statistically significant. Several of the non-significant predictors belong to batteries of dummy variables; if a battery is jointly significant, its covariates should be retained.
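The joint tests that follow are linear hypothesis (F) tests. As an illustrative Python sketch, the F statistic can be recovered from the restricted and unrestricted residual sums of squares; plugging in the rounded RSS values printed for the toilet-type battery below approximately reproduces the reported F statistic (7.08).

```python
def joint_f_stat(rss_restricted, rss_unrestricted, n_restrictions, df_unrestricted):
    """F statistic for the joint significance of a group of coefficients:
    the per-restriction drop in RSS, scaled by the unrestricted error variance."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / df_unrestricted
    return numerator / denominator

# Rounded values from the toilet-type test: RSS 1156.8 vs. 1150.6,
# 3 restrictions, 3971 residual degrees of freedom.
f = joint_f_stat(1156.8, 1150.6, 3, 3971)
```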

## Linear hypothesis test
## 
## Hypothesis:
## notoilet = 0
## latrine = 0
## flushtoilet = 0
## 
## Model 1: restricted model
## Model 2: log_total_cons_ann ~ married_hhead + log(hh_size) + female_hhead + 
##     hhead_illiterate + hhead_orthodox + hhead_protestant + hhead_islam + 
##     pipedwater + unprotectedwater + notoilet + latrine + flushtoilet + 
##     modernkitchen + advcookingfuel + electriclighting + finishedwalls + 
##     woodwalls + finishedroof + finishedfloor + dirtfloor + radio + 
##     tv + radio * tv + factor(saq01) + urban
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3974 1156.8                                  
## 2   3971 1150.6  3    6.1582 7.0842 9.573e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Linear hypothesis test
## 
## Hypothesis:
## hhead_orthodox = 0
## hhead_protestant = 0
## hhead_islam = 0
## 
## Model 1: restricted model
## Model 2: log_total_cons_ann ~ married_hhead + log(hh_size) + female_hhead + 
##     hhead_illiterate + hhead_orthodox + hhead_protestant + hhead_islam + 
##     pipedwater + unprotectedwater + notoilet + latrine + flushtoilet + 
##     modernkitchen + advcookingfuel + electriclighting + finishedwalls + 
##     woodwalls + finishedroof + finishedfloor + dirtfloor + radio + 
##     tv + radio * tv + factor(saq01) + urban
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3974 1160.3                                  
## 2   3971 1150.6  3    9.7061 11.166 2.703e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Linear hypothesis test
## 
## Hypothesis:
## pipedwater = 0
## unprotectedwater = 0
## 
## Model 1: restricted model
## Model 2: log_total_cons_ann ~ married_hhead + log(hh_size) + female_hhead + 
##     hhead_illiterate + hhead_orthodox + hhead_protestant + hhead_islam + 
##     pipedwater + unprotectedwater + notoilet + latrine + flushtoilet + 
##     modernkitchen + advcookingfuel + electriclighting + finishedwalls + 
##     woodwalls + finishedroof + finishedfloor + dirtfloor + radio + 
##     tv + radio * tv + factor(saq01) + urban
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3973 1155.6                                  
## 2   3971 1150.6  2    4.9853 8.6025 0.0001871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Linear hypothesis test
## 
## Hypothesis:
## finishedwalls = 0
## woodwalls = 0
## 
## Model 1: restricted model
## Model 2: log_total_cons_ann ~ married_hhead + log(hh_size) + female_hhead + 
##     hhead_illiterate + hhead_orthodox + hhead_protestant + hhead_islam + 
##     pipedwater + unprotectedwater + notoilet + latrine + flushtoilet + 
##     modernkitchen + advcookingfuel + electriclighting + finishedwalls + 
##     woodwalls + finishedroof + finishedfloor + dirtfloor + radio + 
##     tv + radio * tv + factor(saq01) + urban
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   3973 1151.1                           
## 2   3971 1150.6  2   0.46501 0.8024 0.4483
## Linear hypothesis test
## 
## Hypothesis:
## finishedfloor = 0
## dirtfloor = 0
## 
## Model 1: restricted model
## Model 2: log_total_cons_ann ~ married_hhead + log(hh_size) + female_hhead + 
##     hhead_illiterate + hhead_orthodox + hhead_protestant + hhead_islam + 
##     pipedwater + unprotectedwater + notoilet + latrine + flushtoilet + 
##     modernkitchen + advcookingfuel + electriclighting + finishedwalls + 
##     woodwalls + finishedroof + finishedfloor + dirtfloor + radio + 
##     tv + radio * tv + factor(saq01) + urban
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   3973 1160.8                                  
## 2   3971 1150.6  2     10.22 17.635 2.371e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I find that the wall-type battery is not jointly statistically significant and remove it from the model. I now run the final model.

## 
## Call:
## lm(formula = log_total_cons_ann ~ married_hhead + adulteq + log(hh_size) + 
##     female_hhead + hhead_illiterate + hhead_orthodox + hhead_protestant + 
##     hhead_islam + pipedwater + unprotectedwater + notoilet + 
##     latrine + flushtoilet + modernkitchen + advcookingfuel + 
##     electriclighting + finishedroof + finishedfloor + dirtfloor + 
##     radio + tv + radio * tv + factor(saq01) + urban + factor(saq01) * 
##     urban, data = lsms.test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1566 -0.3473  0.0018  0.3488  3.4613 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.653864   0.122545  70.618  < 2e-16 ***
## married_hhead          0.129101   0.031432   4.107 4.08e-05 ***
## adulteq                0.007703   0.011452   0.673 0.501234    
## log(hh_size)           0.549072   0.038820  14.144  < 2e-16 ***
## female_hhead          -0.072274   0.026554  -2.722 0.006522 ** 
## hhead_illiterate      -0.078796   0.024391  -3.231 0.001246 ** 
## hhead_orthodox         0.222765   0.055884   3.986 6.84e-05 ***
## hhead_protestant       0.076821   0.055806   1.377 0.168721    
## hhead_islam            0.152758   0.057231   2.669 0.007636 ** 
## pipedwater             0.024642   0.020607   1.196 0.231851    
## unprotectedwater      -0.159509   0.046592  -3.424 0.000624 ***
## notoilet              -0.019154   0.073790  -0.260 0.795200    
## latrine                0.065140   0.071932   0.906 0.365208    
## flushtoilet            0.203539   0.081532   2.496 0.012585 *  
## modernkitchen          0.135184   0.041327   3.271 0.001081 ** 
## advcookingfuel         0.123532   0.038948   3.172 0.001527 ** 
## electriclighting       0.206046   0.024282   8.485  < 2e-16 ***
## finishedroof           0.081013   0.022538   3.595 0.000329 ***
## finishedfloor          0.039330   0.071709   0.548 0.583403    
## dirtfloor             -0.168205   0.070872  -2.373 0.017674 *  
## radio                  0.191390   0.023856   8.023 1.35e-15 ***
## tv                     0.280851   0.037051   7.580 4.27e-14 ***
## factor(saq01)2         0.360944   0.069483   5.195 2.15e-07 ***
## factor(saq01)3        -0.186273   0.041034  -4.539 5.81e-06 ***
## factor(saq01)4         0.124437   0.043692   2.848 0.004422 ** 
## factor(saq01)5         0.363018   0.059248   6.127 9.83e-10 ***
## factor(saq01)6        -0.321953   0.067438  -4.774 1.87e-06 ***
## factor(saq01)7        -0.153168   0.045010  -3.403 0.000673 ***
## factor(saq01)12        0.067181   0.070766   0.949 0.342503    
## factor(saq01)13        0.372467   0.068826   5.412 6.61e-08 ***
## factor(saq01)14       -0.092472   0.053684  -1.723 0.085050 .  
## factor(saq01)15        0.296451   0.068329   4.339 1.47e-05 ***
## urban                  0.024876   0.056379   0.441 0.659072    
## radio:tv              -0.061931   0.042252  -1.466 0.142794    
## factor(saq01)2:urban  -0.118532   0.132532  -0.894 0.371181    
## factor(saq01)3:urban   0.082863   0.065311   1.269 0.204608    
## factor(saq01)4:urban  -0.148389   0.064874  -2.287 0.022229 *  
## factor(saq01)5:urban  -0.397278   0.102845  -3.863 0.000114 ***
## factor(saq01)6:urban   0.233044   0.185474   1.256 0.209017    
## factor(saq01)7:urban   0.181083   0.066838   2.709 0.006771 ** 
## factor(saq01)12:urban  0.425988   0.159436   2.672 0.007574 ** 
## factor(saq01)13:urban -0.456260   0.117251  -3.891 0.000101 ***
## factor(saq01)14:urban        NA         NA      NA       NA    
## factor(saq01)15:urban -0.473984   0.101016  -4.692 2.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5317 on 3963 degrees of freedom
## Multiple R-squared:  0.5257, Adjusted R-squared:  0.5207 
## F-statistic: 104.6 on 42 and 3963 DF,  p-value: < 2.2e-16

This final model has an R-squared above the desired threshold (0.5) and does not include any covariates that are not at least jointly statistically significant.

Now, I will use it to predict the average annual household consumption.

I score the predicted results with the Mean Absolute Percentage Error (MAPE). I will score the in-sample and out-of-sample predictions, as well as the combined sample. These scores will serve as the benchmark against which to compare other approaches.
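The scoring metric can be sketched as follows (an illustrative Python helper, not the report's R code):

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error: the average of
    |actual - predicted| / |actual|, expressed in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)
```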

## [1] 46.27976
## [1] 45.84788
## [1] 48.72

I obtain a MAPE of 46 for the whole sample, 46 for the in-sample, and 49 for the out-of-sample test.

Supervised learning

Decision Tree

Supervised learning is a branch of data science in which computer algorithms learn to classify and/or predict a known outcome, called the target.

First, I use a regression decision tree to predict the annual income of a household. A decision tree separates distinctly different records into more similar groups; these divisions are called branches. The groups are then divided again until fairly homogeneous groups, called leaves, are formed.
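The branching step can be sketched as a search for the split that minimizes the summed squared error of the two resulting groups. This is an illustrative Python sketch on a single feature; the helper names are hypothetical.

```python
def best_split(x, y):
    """Return the threshold on feature x that minimizes the summed squared
    error when each side is predicted by its own mean."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best_err, best_threshold = float("inf"), None
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best_err:
            best_err, best_threshold = total, threshold
    return best_threshold
```

A full regression tree repeats this search recursively, over all features, inside each new group.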

Again, I predict the annual consumption from the model and test its accuracy using the MAPE function.

## [1] 79.36518
## [1] 79.27328
## [1] 79.88441

I find a very high MAPE for this model (79 to 80) across the whole-sample, in-sample, and out-of-sample tests. In short, it severely underperforms the World Bank methodology.

Random Forest

Next, I use a random regression forest. The intuition behind a random forest is that many decision trees together can model the data better than a single tree. I model the data in two ways: first, the level version of annual consumption; second, the log of annual consumption, similar to the outcome variable in the World Bank OLS models. I run the forests with a very large number of trees (2,000) and optimize the parameter selection for the models presented here. The output below shows the parameter (mtry) tuning for each model.
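The averaging intuition behind this can be sketched with a toy bagging ensemble in Python. Each "tree" here is a crude one-split stump fit on a bootstrap resample; this illustrates the averaging idea only and is not the randomForest algorithm.

```python
import random

def bagged_predict(x, y, x_new, n_trees=200, seed=1):
    """Average many weak predictors, each fit on a bootstrap resample of
    the data: the bagging idea underlying random forests."""
    rng = random.Random(seed)
    n = len(x)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        t = sum(xs) / n                              # crude split point
        # stump prediction: mean outcome of the side x_new falls on
        side = [yi for xi, yi in zip(xs, ys) if (xi <= t) == (x_new <= t)]
        preds.append(sum(side) / len(side) if side else sum(ys) / n)
    return sum(preds) / n_trees
```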

## mtry = 1  OOB error = 743782011 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 699210845 
## 0.05992504 1e-05 
## mtry = 4     OOB error = 696445734 
## 0.003954618 1e-05 
## mtry = 8     OOB error = 712862411 
## -0.02357208 1e-05

## mtry = 1  OOB error = 743489561 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 699448685 
## 0.05923537 1e-05 
## mtry = 4     OOB error = 697947326 
## 0.002146489 1e-05 
## mtry = 8     OOB error = 714224500 
## -0.02332149 1e-05

Then, I predict the target variable and check the MAPE, as with the other models.

## [1] 45.27449
## [1] 41.89723
## [1] 64.3567

On the first model, level annual consumption, I find a MAPE of 45 for the whole sample, 42 for the in-sample, and 64 for the out-of-sample. The whole-sample and in-sample errors are similar to the World Bank’s OLS; however, the out-of-sample error is much higher (64 vs. 49). This does not provide a strong alternative to the World Bank OLS methodology.

## [1] 34.79679
## [1] 31.99547
## [1] 50.62483

For the log of annual consumption, I find a MAPE of 35 for the whole sample, 32 for the in-sample, and 51 for the out-of-sample. The whole-sample and in-sample scores are slightly better than the World Bank OLS, and the out-of-sample score is fairly similar (51 vs. 49). This target variable and method appear to have comparable accuracy to the World Bank OLS approach.

In this case, it does not appear that the two machine learning approaches, decision tree and random forests, perform better than the World Bank approach. I note that all of the methods are fairly inaccurate.

Classifying Poverty

The primary objective of poverty mapping is to determine which households are “poor,” that is, below a threshold of annual consumption. The OLS method does not provide a strong classification strategy for determining which households are poor.

In practice, the annual consumption is predicted and then the threshold is applied. In this example, I use $2 per adult equivalent daily consumption in the household. Although this definition attempts to follow the World Bank international definition of “extreme poverty,” the resulting proportion is slightly higher than the most current public figures and can be further refined. Because the primary purpose of this research is to explore classification methods and not establish poverty thresholds, I continue with the classification exposition.
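The thresholding step can be sketched as follows. This is an illustrative Python helper; the 365-day conversion and the names are my assumptions.

```python
POVERTY_LINE = 2.0  # dollars per adult equivalent per day

def classify_poor(predicted_annual_consumption, adult_equivalents):
    """Flag a household as poor when its predicted daily consumption per
    adult equivalent falls below the poverty line."""
    daily_per_aeq = predicted_annual_consumption / (365 * adult_equivalents)
    return int(daily_per_aeq < POVERTY_LINE)
```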

To test the accuracy of the predictions, I use a mean F1 score, a standard statistic for assessing the accuracy of classification algorithms. Specifically, the statistic is two times the product of the precision rate and the recall rate, divided by the sum of the precision and recall rates. In classification, precision is defined as the true positives divided by the total of true positives and false positives, and recall is defined as the true positives divided by the sum of true positives and false negatives (the total number of actual positives). The score ranges from 0 to 1, where 1 indicates a perfect predictor.
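The statistic just described can be sketched directly from the confusion-matrix counts (illustrative Python; the counts in the test are hypothetical):

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1: the harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```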

## [1] 0.7443672
## [1] 0.7485292
## [1] 0.7208434

Here, I find an acceptable, but not exceptional, mean F1 score of 0.74 for the whole sample, and a slightly worse score of 0.72 for the out-of-sample test.

Recognizing that the decision tree was a very poor model in the earlier work, I skip it in this section and instead use a new random forest to model poverty status. Again, the output below shows the model tuning.

## mtry = 1  OOB error = 27.88% 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 26.36% 
## 0.05461056 1e-05 
## mtry = 4     OOB error = 25.59% 
## 0.02935606 1e-05 
## mtry = 8     OOB error = 27.21% 
## -0.06341463 1e-05

I test the random forest prediction again with the mean F1 score.

## [1] 0.7668413
## [1] 0.7891978
## [1] 0.6404888

I find that the random forest performs broadly similarly to the World Bank OLS specification: a whole-sample mean F1 score of 0.77, an in-sample score of 0.79, and a notably lower out-of-sample score of 0.64.

Calculating the Number of Poor Households

As stated earlier, the primary purpose of these models is to impute expenditure data into a census. In the census, the sample size is large enough to estimate the proportion of the population in poverty at smaller administrative levels, such as districts or counties.

The most recent census in Ethiopia poses several challenges to this type of application. First and foremost, it was conducted in 2007, making it about a decade old. Predicting poverty a decade ago is less helpful for policy makers needing to make policy decisions today. In addition, the data I obtained for this census do not provide any geographic identifiers, rendering this final step impossible.

Instead, I use the whole-of-sample predictions to compare how the models would calculate poverty in the regions of Ethiopia. I choose the whole-of-sample because it is the largest sample with which to calculate the regional proportions. The out-of-sample predictions would more closely mirror predictions into a census, but that sample is not large enough to calculate regional proportions.

First, I find the national proportion of the households in poverty according to the OLS model and the random forest models.
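Computing these proportions amounts to a grouped mean of the binary poverty flags. An illustrative Python sketch (field names are hypothetical):

```python
def poverty_rate_by_region(rows):
    """Share of households flagged poor within each region."""
    totals, poor = {}, {}
    for row in rows:
        region = row["region"]
        totals[region] = totals.get(region, 0) + 1
        poor[region] = poor.get(region, 0) + row["poor"]
    return {region: poor[region] / totals[region] for region in totals}
```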

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5018  1.0000  1.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5052  1.0000  1.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4755  1.0000  1.0000
##    rfpredictected_poor_binary olspredfromlog_poor_binary poor_binary          region
## 1                  0.43272727                 0.39636364   0.3872727          Tigray
## 2                  0.35555556                 0.40740741   0.3259259            Afar
## 3                  0.61498439                 0.66077003   0.5983351          Amhara
## 4                  0.49201278                 0.43769968   0.4302449          Oromia
## 5                  0.49011858                 0.46640316   0.4584980         Somalie
## 6                  0.59842520                 0.85039370   0.7559055 Benshagul Gumuz
## 7                  0.63531670                 0.69481766   0.5834933            snnp
## 8                  0.49572650                 0.57264957   0.5213675        Gambelia
## 9                  0.29746835                 0.08227848   0.2721519          Harari
## 10                 0.08786611                 0.01255230   0.1464435     Addis Ababa
## 11                 0.20103093                 0.15463918   0.2422680         Diredwa

According to the survey data, just under half of the households are considered poor (47.5 percent). The OLS and RF models find a very similar national proportion of poor households (about 50 percent).

At the regional level, I find very different proportions. For example, in the capital Addis Ababa, the survey data classify 15 percent of households as poor; the OLS model predicts only 1 percent, while the random forest predicts 9 percent.

Next, I consider the total number of poor households when these proportions are applied to the regional populations. Significantly different population figures could lead policymakers to allocate resources differently.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -116400   -2558   12090   71670   42530  442100
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -50920   -3148   11260   69440   77010  448300
## [1] 96168.44
## [1] 86005.29

When applying the predictions to the number of households in each region, I find very different results. Compared to the survey data, the OLS model’s difference ranges from 116,000 too few poor households to 442,000 too many. The mean absolute difference is about 96,000 households.

For the RF model, the difference from the survey data ranges from 51,000 too few poor households to 448,000 too many. This spread is only slightly smaller than the OLS model’s. The mean absolute difference is 86,000 households, quite similar to the World Bank’s OLS specification.

Conclusion

This project reexamined the World Bank’s approach to measuring poverty. After recreating the methodology on annual consumption data, I found a fairly large mean absolute percentage error of 49 in the out-of-sample test. This is a substantial prediction error, and it somewhat confirms the criticism that the approach may not be accurate enough to be useful. In the future, researchers conducting poverty mapping should hold out part of the sample to rigorously assess how accurately the model predicts poverty.

I then sought to apply supervised learning to create more accurate predictions, namely regression decision trees and regression random forests. Both of these methods showed little improvement over the World Bank’s methodology.

Finally, I sought to classify whether a household was poor or not. The World Bank approach produced a mean F1 score of 0.72 on the out-of-sample test; the random forest produced 0.64. Again, neither is a particularly accurate predictor of household poverty status.

These classification models were then applied to calculate how many poor households were in each of the eleven regions of Ethiopia. On average, the World Bank OLS model was off by 96,000 households, while the random forest was off by 86,000. An error of nearly 100,000 households is quite substantial.

This work shows that the choice of model matters in classifying poverty and predicting total aggregate consumption. None of the models tested in this project is particularly accurate, including the World Bank’s current methodology.

Important policy decisions, such as the allocation of scarce national and international development funds, depend on the sub-national distribution of poverty. Researchers need to recommit themselves to establishing more robust methods to map poverty. Although the supervised learning techniques tried here are not vast improvements over the current methodology, researchers should not abandon them in the search for more accurate ways to map poverty.

Appendix A - Cleaning the Ethiopia 2015-16 Socioeconomic Survey

The data is free to use with registration here: 2015-2016 Socioeconomic Survey.

## Here set your working directory to where the data are downloaded 
setwd("D:/Dropbox/Dropbox/Personal/data_science/assignments/project/")

library(haven)
## merge on the household consumption  
agg_cons <- read_dta("ethiopia_lsms/Consumption Aggregate/cons_agg_w3.dta")
## agg_cons <- agg_cons[,c("household_id", "household_id2", "total_cons_ann", "nom_totcons_aeq", "saq01", "rural")]
agg_cons$urban <- ifelse(agg_cons$rural %in% 2:3, 1, 0)


## this is a listing of all household members, here we'll get data on the household head 
sect1 <- read_dta("ethiopia_lsms/Household/sect1_hh_w3.dta") 
    ## region code - saq01
    ##marital status - hh_s1q08 (have to limit to head)
    ##hh_s1q03 - sex (have to limit to head of household ) 
    ## hh_s1q07 - religion 

sect1_hhead <- sect1[which(sect1$hh_s1q02 == 1),]
sect1_hhead <- sect1_hhead[,c("household_id", "household_id2", "hh_s1q08", "hh_s1q02", "hh_s1q32_b", "hh_s1q03","hh_s1q07", "individual_id", "individual_id2")]
## married household head 
  sect1_hhead$married_hhead <- 0 
  sect1_hhead$married_hhead <- ifelse(sect1_hhead$hh_s1q08 %in% 2:3, 1, 0)
  summary(sect1_hhead$married_hhead) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6904  1.0000  1.0000
## gender of head of household 
  sect1_hhead$female_hhead <- ifelse(sect1_hhead$hh_s1q03 == 2, 1, 0)
  summary(sect1_hhead$female_hhead)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.3065  1.0000  1.0000       1
  ## religion 
  sect1_hhead$hhead_orthodox <- ifelse(sect1_hhead$hh_s1q07 == 1, 1, 0) 
  sect1_hhead$hhead_protestant <- ifelse(sect1_hhead$hh_s1q07 == 3, 1, 0) 
  sect1_hhead$hhead_islam <- ifelse(sect1_hhead$hh_s1q07 == 4, 1, 0) 
  
  
## head of household illiterate  
  sect1_sondaughter <- sect1[which(sect1$hh_s1q02 == 2),]
  sect1_hhhead_gender<- sect1_hhead[,c("household_id", "household_id2", "female_hhead")]
  sect1_sondaughter<- merge(sect1_sondaughter, sect1_hhhead_gender, by=c("household_id","household_id2"))
    sect1_sondaughter$hhead_illiterate <- ifelse(sect1_sondaughter$hh_s1q15 == 98,1,0)
    sect1_sondaughter$hhead_illiterate <- ifelse(sect1_sondaughter$female_hhead == 1 & sect1_sondaughter$hh_s1q19 == 98,1,sect1_sondaughter$hhead_illiterate)
  summary(sect1_sondaughter$hhead_illiterate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   1.000   0.785   1.000   1.000      74
  sect1_sondaughter<- sect1_sondaughter[,c("household_id", "household_id2", "hhead_illiterate")]   
  sect1_hhead <- merge(sect1_hhead,sect1_sondaughter,by=c("household_id","household_id2"), all.x = TRUE)
    ## assume the head is not illiterate if there are no children present to indicate illiteracy
  sect1_hhead$hhead_illiterate[is.na(sect1_hhead$hhead_illiterate)] <- 0
  summary(sect1_hhead$hhead_illiterate) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5074  1.0000  1.0000
  ## merge these household head variables onto the household member roster
  lsms <- merge(agg_cons,sect1_hhead,by=c("household_id","household_id2"))
  

# source of drinking water 
sect9 <- read_dta("ethiopia_lsms/Household/sect9_hh_w3.dta") 
sect9 <- sect9[,c("household_id","household_id2", "hh_s9q13","hh_s9q10","hh_s9q10b","hh_s9q08","hh_s9q21","hh_s9q19_a","hh_s9q05","hh_s9q06","hh_s9q07")]
## hh_s9q13 - source of drinking water
sect9$pipedwater <- ifelse(sect9$hh_s9q13 %in% 1:3, 1, 0) 
summary(sect9$pipedwater)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   0.503   1.000   1.000
sect9$surfacewater <- ifelse(sect9$hh_s9q13 == 14, 1, 0) 
summary(sect9$surfacewater)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.0883  0.0000  1.0000       5
sect9$unprotectedwater <- ifelse(sect9$hh_s9q13 == 6, 1, 0) 
sect9$unprotectedwater <- ifelse(sect9$hh_s9q13 == 9, 1, sect9$unprotectedwater) 
summary(sect9$unprotectedwater)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.00000 0.00000 0.04021 0.00000 1.00000       5
## hh_s9q10 - toilet type 
sect9$notoilet <- ifelse(sect9$hh_s9q10 == 7, 1, 0) 
sect9$flushtoilet <- ifelse(sect9$hh_s9q10 == 1, 1, 0) 
sect9$latrine <- ifelse(sect9$hh_s9q10 %in% 2:4, 1, 0) 

summary(sect9$notoilet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2812  1.0000  1.0000
## hh_s9q10b - shared toilet 
sect9$sharedtoilet <- ifelse(sect9$hh_s9q10b == 2, 0, sect9$hh_s9q10b) 
summary(sect9$sharedtoilet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.3491  1.0000  1.0000    1393
## hh_s9q08 - type of kitchen 
sect9$modernkitchen <- ifelse(sect9$hh_s9q08 %in% 4:5, 1, 0) 
summary(sect9$modernkitchen)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05026 0.00000 1.00000
## hh_s9q21 - cooking fuel
sect9$advcookingfuel <- ifelse(sect9$hh_s9q21 %in% 7:10, 1, 0) 
summary(sect9$advcookingfuel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08195 0.00000 1.00000
## hh_s9q19_a - type of lighting 
sect9$electriclighting <- ifelse(sect9$hh_s9q19_a %in% 1:4, 1, 0) 
summary(sect9$electriclighting)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5151  1.0000  1.0000
## hh_s9q05 - wall 
sect9$finishedwalls <- ifelse(sect9$hh_s9q05 %in% c(6, 7, 11, 14:17), 1, 0) 
summary(sect9$finishedwalls)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.107   0.000   1.000
sect9$woodwalls <- ifelse(sect9$hh_s9q05 %in% 1:3, 1, 0) 
summary(sect9$woodwalls)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.7477  1.0000  1.0000
## hh_s9q06- roof 
sect9$finishedroof <- ifelse(sect9$hh_s9q06 %in% c(1, 2, 7, 8), 1, 0) 
summary(sect9$finishedroof)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6494  1.0000  1.0000
## hh_s9q07 - floor 
sect9$finishedfloor <- ifelse(sect9$hh_s9q07 %in% c(4:9), 1, 0) 
summary(sect9$finishedfloor)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1962  0.0000  1.0000
sect9$dirtfloor <- ifelse(sect9$hh_s9q07 == 1, 1, 0) 
summary(sect9$dirtfloor)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.7893  1.0000  1.0000
sect9 <- sect9[,c("household_id","household_id2", "pipedwater", 
                  "notoilet", "sharedtoilet", "modernkitchen", "advcookingfuel", "electriclighting", "finishedwalls", "finishedroof", "finishedfloor", "dirtfloor", "surfacewater", "unprotectedwater", "latrine", "flushtoilet", "woodwalls")]

lsms <- merge(lsms,sect9,by=c("household_id","household_id2"))

## assets 
  # television 
sect10 <- read_dta("ethiopia_lsms/Household/sect10_hh_w3.dta") 
sect10tv <- sect10[sect10$hh_s10q00 == 10,] 
sect10tv <- sect10tv[,c("household_id","household_id2","hh_s10q00", "hh_s10q0a", "hh_s10q01")]
lsms <- merge(lsms,sect10tv,by=c("household_id","household_id2"))
lsms$tv <- ifelse(lsms$hh_s10q01 > 0, 1, 0)

  #radio 
sect10radio <- sect10[sect10$hh_s10q00 == 9,] 
sect10radio <- sect10radio[,c("household_id","household_id2","hh_s10q00", "hh_s10q0a", "hh_s10q01")]
sect10radio$radio <- ifelse(sect10radio$hh_s10q01 > 0, 1, 0)
lsms <- merge(lsms,sect10radio,by=c("household_id","household_id2"))
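One side effect of the two asset merges above: `sect10tv` and `sect10radio` share the raw column names (`hh_s10q00`, `hh_s10q0a`, `hh_s10q01`), so the second `merge()` appends `.x`/`.y` suffixes to disambiguate the duplicates. A toy sketch (hypothetical frames, not the survey data):

```r
# merge() resolves duplicate non-key column names with .x/.y suffixes.
base  <- data.frame(household_id = c("a", "b"))
tv    <- data.frame(household_id = c("a", "b"), hh_s10q01 = c(1, 0))
radio <- data.frame(household_id = c("a", "b"), hh_s10q01 = c(2, 1))
out <- merge(merge(base, tv, by = "household_id"), radio, by = "household_id")
names(out)  # "household_id" "hh_s10q01.x" "hh_s10q01.y"
```

The derived `tv` and `radio` indicators are unaffected, but any later code referring to a bare `hh_s10q01` on `lsms` would fail, which is why the quantity columns are best dropped or renamed before merging.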

## create log output variable
lsms$log_total_cons_ann <- log(lsms$total_cons_ann)

## creating poverty status 
## define poverty as consumption below $2 (PPP) per adult equivalent per day in 2015
## average 2015 exchange rate: 20.46410 birr per US dollar
## 0.38 PPP conversion factor
summary(lsms$total_cons_ann)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##     680.9   12150.0   19980.0   25890.0   32440.0 1265000.0       237
lsms$total_cons_ann_usdppp <- (lsms$total_cons_ann/20.46410/0.38) 
summary(lsms$total_cons_ann_usdppp)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##     87.57   1562.00   2570.00   3329.00   4171.00 162700.00       237
lsms$total_cons_ann_usdpppdaily <- lsms$total_cons_ann_usdppp / 360 ## daily value, using a 360-day year
lsms$total_cons_ann_usdpppdailyperperson <- lsms$total_cons_ann_usdpppdaily / lsms$adulteq
summary(lsms$total_cons_ann_usdpppdailyperperson)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   0.3278   1.2960   2.0800   2.7290   3.3460 124.1000      237
lsms$poor <- ifelse(lsms$total_cons_ann_usdpppdailyperperson < 2, "poor", "not poor")
table(lsms$poor)
## 
## not poor     poor 
##     2485     2256
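The conversion chain above can be checked by hand on round numbers. The figures below are hypothetical (chosen near the survey median), not actual records: annual consumption in birr, divided by the 2015 exchange rate, the 0.38 PPP factor, 360 days, and the adult-equivalent household size.

```r
# Worked toy check of birr -> daily USD PPP per adult equivalent.
ann_birr  <- 20000   # hypothetical annual consumption, birr
adulteq   <- 4       # hypothetical adult-equivalent household size
daily_ppp <- ann_birr / 20.46410 / 0.38 / 360 / adulteq
round(daily_ppp, 2)                        # ~1.79 PPP dollars per day
ifelse(daily_ppp < 2, "poor", "not poor")  # below the $2 line -> "poor"
```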