Data Science for Social Good in Practice: Finding Donors for a Charity Company

Breno Silva
13 min read · Jun 18, 2020


Photo by Larm Rmah on Unsplash

As a developer, I believe that the best way to learn a new programming language, framework, or technology is by getting my hands dirty. I've always watched videos, read the documentation, and built projects to grow my GitHub portfolio.

When I started my path in data science in early 2018, I didn't know how to apply what I was learning. Unlike before, I couldn't build interesting projects because I needed data, and real-world data is hard to find.

One day I found out about Udacity and decided to start the Data Scientist Nanodegree program. It was love at first sight. The Nanodegree aims to be a complete course, with classes, exercises, an online instructor, real-world projects evaluated by professionals, and a career coach.

Udacity has nanodegrees and free courses on several different topics. Check out the catalog here.

Here, we're gonna take a look at the first project of the program, where the purpose was to develop a machine learning model that helps a fictitious charity organization identify possible donors, using data collected from the U.S. Census.

The main point here is to give a non-technical audience a quick overview of how machine learning can be applied to social good. To keep things easy to follow, I won't explain my entire solution, and I'll apply only a few simple preprocessing steps to the data.

If you want to see my entire solution, check out my GitHub. Udacity also created a private Kaggle competition, and my solution is currently placed in the top 9%.

Starting from the beginning: the charity company problem.

Business Understanding

CharityML is a fictitious charity organization and, like most of these organizations, it survives on donations. They send letters to US residents asking for donations, and after sending more than 30,000 letters, they determined that every donation received came from someone making more than $50,000 annually. So our goal is to build a machine learning model that best identifies potential donors, expanding their base while reducing costs by sending letters only to those who are most likely to donate.

Data Understanding

The census data set provided by Udacity has 45,222 records with 14 columns. The columns can be divided into categorical variables, continuous variables, and the target, which is what we'll try to predict.

First rows of the data set.

This data set is a modified version of the one published in the paper “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid” by Ron Kohavi.

Let’s take a look at these columns.

Categorical Variables

  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Continuous Variables

  • age: Years of age.
  • education-num: Number of educational years completed.
  • capital-gain: Monetary capital gains.
  • capital-loss: Monetary capital losses.
  • hours-per-week: Average hours per week worked.
Summary statistics of the continuous variables.

We can see that capital-gain and capital-loss are highly skewed. We're gonna fix these features later.

Target

  • income: ≤50K, >50K.

Looking at the distribution of classes (those who make at most $50,000, and those who make more), it's clear most individuals do not make more than $50,000. This can greatly affect accuracy, since we could simply say "this person does not make more than $50,000" and generally be right, without ever looking at the data! Making such a statement would be called naive since we have not considered any information to substantiate the claim.

Individuals making more than $50,000: 11208
Individuals making at most $50,000: 34014
Percentage of individuals making more than $50,000: 24.78%

Machine learning algorithms expect the input to be numeric, and I'll talk more about this later, but for now, to simplify our analysis, we need to convert the target income to numerical values. Since there are only two possible categories for the target (<=50K and >50K), we can simply encode these two categories as 0 and 1, respectively.
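As a minimal sketch of this step (the file name census.csv and the exact label strings are assumptions based on the project description):

```python
import pandas as pd

# Load the census data provided by Udacity (file name is illustrative)
data = pd.read_csv("census.csv")

# Encode the target: '<=50K' -> 0, '>50K' -> 1
income = data["income"].map({"<=50K": 0, ">50K": 1})

print(income.value_counts())
```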

Early Data Analysis

Now it's time to take a closer look at the distributions and correlations with the target for each feature. This is a good place to ask questions, form hypotheses, and answer them with visualizations.

Are people with higher education more likely to earn more than $50,000 annually?

The education_level is categorical but has a notion of rank associated with it. For example, we can expect that bachelors, masters, and doctors are more likely to have a higher annual income than people who didn't make it through college. Is this hypothesis true?

Another variable we can look at for answers is education-num, which represents the total years of education someone has completed. If we group by education_level or education-num and calculate the average of each income class, we will see that every value in education_level has an equivalent value in education-num.

Income average for education levels and completed years of education.
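For reference, the averages above can be computed with a simple pandas group-by (a sketch that reuses the data DataFrame loaded earlier and assumes the positive label is the string ">50K"):

```python
# Share of >50K records per education level and per years of education
by_level = data.groupby("education_level")["income"].apply(lambda s: (s == ">50K").mean())
by_years = data.groupby("education-num")["income"].apply(lambda s: (s == ">50K").mean())

print(by_level.sort_values())
print(by_years)
```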

The education-num works like a numerically ranked transformation applied to the values of education_level. Putting this in a bar chart, we obtain the following:

Our hypothesis is almost entirely confirmed; only a few education levels are ranked out of order, perhaps because of some noise in the data. Notice that the rank order of the elementary school years is a little messy and that Prof-school has a higher average of the >50K income class even though it is a lower level than Doctorate.

Are people who work more hours per week more likely to earn more than $50,000 annually?

In our data set, we have the hours-per-week feature, which represents how many hours each Census participant works per week, and we can use this variable to answer our question.

At first sight, looking at the mean hours worked per week of each income class, it appears that people from the >50K group do in fact work more hours per week than the other group. Diving a little deeper, the image below shows the average of the income class distributed by hours worked per week.

The orange line shows the mean hours worked across the entire data set. The average increases after forty hours worked per week (red bar), and there are some gaps because we don't have records for every value in the range.

How is experience related to earning more than $50,000 annually?

We don’t know how many years each person has worked, but the older a person is, the more years of work and life experience they usually have. Let’s see if age tells us something.

Looking at the mean age of each income class, it appears that people in the >50K group are on average older than the other group. Let’s look at the age distribution of each income class.

Here we are using a KDE plot instead of histograms because it provides a smooth estimate of the overall distribution of data. The total area of each curve is 1 and the probability of an outcome falling between two values is found by computing the area under the curve that falls between those values.
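For reference, a plot like this can be produced with seaborn (a sketch, reusing the data DataFrame loaded earlier; column and label names follow the data description above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One KDE curve per income class, drawn on the same axes
for label, group in data.groupby("income"):
    sns.kdeplot(group["age"], label=label)

plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Income class")
plt.show()
```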

  • The peaks are different and the group >50K is on average older than the other group.
  • The >50K group distribution is approximately normal and most of the data falls between 37 and 50 years.
  • The distribution of the group <=50K is right-skewed.

With this, we can conclude that age and experience seem to correlate with the income class. Younger people are usually starting a career and therefore tend to earn less than $50,000 annually. As the person gets older and more experienced, the salary starts to increase.

It is also important to notice that after 60 years of age the distributions meet again.

Preparing the Data

Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured — this is typically known as preprocessing.

Let’s see if our data has any missing values:

age               0
workclass         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0

Fortunately, for this data set, there are no invalid or missing entries we must deal with; however, there are some qualities about certain features that must be adjusted. The preprocessing step can help tremendously with the outcome and predictive power of nearly all learning algorithms.

Transforming Skewed Continuous Features

A data set may sometimes contain at least one feature whose values tend to lie near a single number, but will also have a non-trivial number of vastly larger or smaller values than that single number. Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. With the census data set two features fit this description: capital-gain and capital-loss.

For highly-skewed feature distributions like these, it is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation however: The logarithm of 0 is undefined, so we must translate the values by a small amount above 0 to apply the logarithm successfully.
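A common way to do this with pandas and NumPy (a sketch; the feature names come from the data set, the variable names are mine):

```python
import numpy as np

skewed = ["capital-gain", "capital-loss"]

# Drop the target and apply log(x + 1): adding 1 keeps zero values at zero
# and avoids taking the logarithm of 0, which is undefined
features_log = data.drop("income", axis=1)
features_log[skewed] = features_log[skewed].apply(lambda x: np.log(x + 1))
```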

One Hot Encoding

Typically, learning algorithms expect the input to be numeric, which requires that non-numeric features (categorical variables) be converted. One popular way to convert these variables is the one-hot encoding scheme. One-hot encoding creates a “dummy” variable for each possible category of each categorical variable. For example, assume someFeature has three possible entries: A, B, or C. We then encode this feature into someFeature_A, someFeature_B, and someFeature_C.

Example of one-hot encoding a feature.
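With pandas, this is a single call to get_dummies (a sketch, continuing from the log-transformed features above):

```python
# One-hot encode every categorical column; numeric columns pass through unchanged
features_final = pd.get_dummies(features_log)

print(f"{len(features_final.columns)} total features after one-hot encoding.")
```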

Splitting Data into Training and Validation Sets

Now that all categorical variables have been converted into numerical features, we're going to split our data into two sets: training and validation. We can't use the same data to train and validate our model because it would learn the noise in the data and lose the ability to generalize well. That's why this step is so important.
You can learn more about bias and variance here.
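A sketch of the split with scikit-learn (the 70/30 proportion matches the sample counts below; the fixed random state and stratification are my own choices):

```python
from sklearn.model_selection import train_test_split

# 70/30 split; stratifying keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    features_final, income, test_size=0.3, random_state=42, stratify=income
)

print(f"Training set has {X_train.shape[0]} samples.")
print(f"Testing set has {X_test.shape[0]} samples.")
```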

Training set has 31655 samples.
Testing set has 13567 samples.

Normalizing Numerical Features

In addition to performing transformations on highly skewed features, it is often good practice to perform some type of scaling on numerical features.

Applying a scaling to the data does not change the shape of each feature's distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners, and it changes the meaning of the raw values, as shown below.

This step comes after splitting our data because we want to avoid data leakage.

We will use sklearn.preprocessing.MinMaxScaler for this.
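A sketch of that scaling, fitted on the training data only and then reused on the validation data so nothing leaks between the sets:

```python
from sklearn.preprocessing import MinMaxScaler

numerical = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

scaler = MinMaxScaler()  # rescales each feature to the [0, 1] range

# Fit the scaler on the training set, then apply the same transformation to both sets
X_train[numerical] = scaler.fit_transform(X_train[numerical])
X_test[numerical] = scaler.transform(X_test[numerical])
```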

Summary statistics of the normalized continuous variables.

Data Modeling

In this section, we will investigate a couple of different algorithms and determine which is best at modeling the data. For now, we’ll use accuracy and cross-validation with k_fold = 5. You can learn more about cross-validation here.
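Each candidate model was scored roughly like this (a sketch using scikit-learn's cross_val_score with 5 folds; Logistic Regression stands in here for any of the models listed below):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training data, scored by accuracy
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```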

Models Exploration

Applying eight different machine learning models to the data, we get the following results:

Results of model exploration.
  • Random Forest has the best score in the training set, but it’s clearly over-fitting.
  • XGBoost has the best score in the validation set and the lowest standard deviation across folds.
  • LogisticRegression, Ridge, and Linear Discriminant Analysis are the fastest models to train and have good overall performances.
  • GaussianNB has the lowest performance, but it’s always important to consider a naive prediction as a benchmark for whether a model is performing well.

With that in mind, we’ll choose the XGBoost as the best model for this problem and perform a few further investigations.

If our data set were bigger and training time were important, we could use Logistic Regression, Ridge, or Linear Discriminant Analysis.

XGBoost

XGBoost (eXtreme Gradient Boosting) builds several decision trees in sequence, where the objective of each new tree is to reduce the errors of the previous one.

We won't explain exactly how the model works, but there are great articles here on Medium for that. For instance, if you want to learn more about XGBoost, check out this article here.
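Training it looks like training any other scikit-learn estimator, thanks to the wrapper provided by the xgboost package (a sketch with default-ish hyperparameters, not the tuned model):

```python
from xgboost import XGBClassifier

# Gradient-boosted trees as a starting point, before any tuning
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)

print(f"Validation accuracy: {xgb.score(X_test, y_test):.4f}")
```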

Feature Importances

Like the weights in a regression, machine learning models based on decision trees offer a nice way to see how important each feature is in predicting the results.

Here we are plotting the top features with their importance. The orange line shows the cumulative importance across features. With that, we can see that approximately 10 features are responsible for more than 90% of the entire model's importance.

Reading the XGBoost documentation, there are three main ways the importance can be calculated; here we used the weight type, which is the number of times a feature appears in a tree. You can read about the other ones here.
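A sketch of how the weight importances can be pulled out of the fitted model and plotted:

```python
import matplotlib.pyplot as plt

# 'weight' = number of times a feature appears in a tree
importances = pd.Series(
    xgb.get_booster().get_score(importance_type="weight")
).sort_values(ascending=False)

importances.head(10).plot(kind="bar")
plt.ylabel("Importance (weight)")
plt.show()

# Share of the total importance covered by the top 10 features
print(importances.head(10).sum() / importances.sum())
```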

Evaluating the Results

CharityML is interested in predicting who makes more than $50,000 accurately. It would seem that using accuracy as a metric for evaluating a particular model's performance would be appropriate.

Additionally, identifying someone that does not make more than $50,000 as someone who does would be detrimental to CharityML, since they are looking to find individuals willing to donate.

Therefore, a model’s ability to precisely predict those that make more than $50,000 is more important than the model's ability to recall those individuals. Several metrics could be used here, but we'll use ROC AUC because it's the metric chosen as an evaluation method by Udacity in the Kaggle competition.
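Computing it only takes the predicted probabilities of the positive class (a sketch, reusing the model fitted above):

```python
from sklearn.metrics import roc_auc_score

# ROC AUC needs probabilities for the positive class, not hard 0/1 predictions
probs = xgb.predict_proba(X_test)[:, 1]
print(f"ROC AUC on the validation data: {roc_auc_score(y_test, probs):.4f}")
```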

If you want to learn more about classification metrics, take a look at this post: The ultimate guide to binary classification metrics.

Optimizing the Model

Now that we know that the competition will evaluate our model based on the ROC AUC metric, we can use this as a guide to optimizing the model.

Each machine learning model has a set of parameters that we can set to change the way that the models work. For example, in the XGBoost algorithm, we can set the depth of each decision tree, the regularization factor in the loss function, the learning rate, and several others. You can look at the full list here.

With all these parameters available, how can we know the best value for each one? We can use an approach called grid search. Given a metric and a dictionary of parameters with possible values, the grid search trains a model for each combination of values and returns the best model based on that metric.
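A sketch of a grid search over a few XGBoost hyperparameters, scored by ROC AUC (the parameter grid below is illustrative, not the grid actually used):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

grid = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # the metric used by the Kaggle competition
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_
```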

After running the Grid Search, here are the results for the unoptimized and the optimized model:

Unoptimized model
Accuracy on the testing data: 0.8680
F-score on the testing data: 0.7440
ROC/AUC on the testing data: 0.9150
Optimized Model
Final accuracy on the testing data: 0.8722
Final F-score on the testing data: 0.7534
Final ROC/AUC on the testing data: 0.9291

You might think the improvement is minimal, but gaining about 1% on every metric without spending a lot of time on the data preparation step is a great result.

Deploying

One approach to deploying this model could be making it available through an API endpoint. Then we would just have to send a new person's information, and the model would predict which income class they belong to. CharityML can use this prediction to decide whether or not to send a letter.
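As an illustration only (not the actual deployment), a minimal Flask endpoint could look like the sketch below, assuming the optimized model was serialized with joblib and the request already contains the preprocessed feature vector:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("xgb_donors.joblib")  # hypothetical file name

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object with the already-preprocessed feature values
    features = pd.DataFrame([request.get_json()])
    prob = model.predict_proba(features)[0, 1]
    return jsonify({"donor_probability": float(prob), "send_letter": bool(prob > 0.5)})

if __name__ == "__main__":
    app.run()
```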

As I said before, Udacity created an ongoing private Kaggle competition. Students can submit their models and see how well they perform against the models of other students worldwide.

Kaggle is an awesome place to learn and practice your data science skills. If you want to learn more about Kaggle, check out The Beginner’s Guide to Kaggle.

Udacity private competition leaderboard.

My current solution scored 0.94937 for the ROC AUC metric and is currently placed in the top 9%.

Final Words

Thank you for reading, and thanks to Amanda Ferraboli, an amazing data scientist, for reviewing this post.

I hope you have found this a good example of how to apply data science for social good. Feel free to leave any questions or feedback in the comments below. You can also find me on Kaggle, GitHub, or LinkedIn.

Last but not least, I'd like to say #stayhome and, if you can, take a little time to sponsor any of these Brazilian charity organizations:

ChildFund Brasil
“Since 1966 in the country, ChildFund Brazil, the International Agency for Child Development, has benefited thousands of people, including children, adolescents, young people, and their families.”

Donate here.

Todos Pela Saúde
“We are an initiative to collaborate in the fight against the coronavirus and the aim is to contribute to fighting the pandemic in different social classes and to support public health initiatives.”

Donate here.

SOS Mata Atlântica
“It acts in the promotion of public policies for the conservation of the Atlantic Forest through the monitoring of the biome, the production of studies, demonstration projects, dialogue with public and private sectors, the improvement of environmental legislation, communication, and the engagement of society.”

Donate here.

