Yvonne McGillicuddy's Blog

Predicting negative attrition with Machine Learning

Yvonne McGillicuddy (UK) — Thu, 04 Jan 2024 16:43:03 GMT

Attrition (employees leaving) can be a really expensive issue for businesses. There are costs associated generally with hiring, training and accounting for loss of productivity with a reduced workforce... or a workforce with team members who haven’t yet hit the ground running.

Whilst a business might be happy to pay these costs if it means onboarding higher-performing employees, they definitely do not want to pay these costs as a result of losing high-performing employees that are tough to replace.

Many companies now invest in employee survey solutions designed to spot indicators of employees being satisfied in their roles, and performance data is also recorded, but even with that data joined we are missing a crucial data point: how likely - based on that data - is it that a person leaves the firm?

Problem Statement

Could a business use machine learning to predict attrition and intervene before an employee leaves, reducing the rate of attrition for high performers over time?

Strategy

To explore this problem, we will...

locate an appropriate dataset that includes data of past and present employees
analyse the data - getting a clear sense of what we have to work with
cleanse the data to remove any data issues that could impede our use of a machine learning model
visualise the data to validate data quality and highlight correlations between features in our dataset.
fit a predictive model - in this case a Random Forest Classifier, as our target variable (attrition) is binary - and assess its performance
refine our model by selecting different features and modifying the model's parameters
compare results and examine future steps based on conclusions drawn

Our data set...

This HR analytics dataset from Kaggle incorporates a variety of numerical and categorical data points for each employee that give a sense of their demographic, role and salary at the company, satisfaction against a variety of criteria, as well as performance indicators and, critically, whether or not that employee left the company. These factors combined make this data set ideal for exploring this problem, though the categorical values do pose some challenges, as we won't be able to utilise them in their current form in a machine learning model. I will come back to this point later...

import pandas as pd

df = pd.read_csv('data/HR_Analytics.csv')
pd.set_option('display.max_columns', None) # show all columns rather than truncating
df.head()

	EmpID	Age	AgeGroup	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	SalarySlab	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	RM297	18	18-25	Yes	Travel_Rarely	230	Research & Development	3	3	Life Sciences	1	405	3	Male	54	3	1	Laboratory Technician	3	Single	1420	Upto 5k	25233	1	Y	No	13	3	3	80	0	0	2	3	0	0	0	0.0
1	RM302	18	18-25	No	Travel_Rarely	812	Sales	10	3	Medical	1	411	4	Female	69	2	1	Sales Representative	3	Single	1200	Upto 5k	9724	1	Y	No	12	3	1	80	0	0	2	3	0	0	0	0.0
2	RM458	18	18-25	Yes	Travel_Frequently	1306	Sales	5	3	Marketing	1	614	2	Male	69	3	1	Sales Representative	2	Single	1878	Upto 5k	8059	1	Y	Yes	14	3	4	80	0	0	3	3	0	0	0	0.0
3	RM728	18	18-25	No	Non-Travel	287	Research & Development	5	2	Life Sciences	1	1012	2	Male	73	3	1	Research Scientist	4	Single	1051	Upto 5k	13493	1	Y	No	15	3	4	80	0	0	2	3	0	0	0	0.0
4	RM829	18	18-25	Yes	Non-Travel	247	Research & Development	8	1	Medical	1	1156	3	Male	80	3	1	Laboratory Technician	3	Single	1904	Upto 5k	13556	1	Y	No	12	3	4	80	0	0	0	3	0	0	0	0.0

In terms of the volume of data, the full set is 1480 rows and 38 columns, some of which I will remove for reasons I will explain later on. Overall this seems a reasonable size from which to draw some initial conclusions, though we will see later on that we might ideally want a larger data set for further testing.

pd.set_option('display.max_columns', None) # show all columns rather than truncating
df.describe()

	Age	DailyRate	DistanceFromHome	Education	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	HourlyRate	JobInvolvement	JobLevel	JobSatisfaction	MonthlyIncome	MonthlyRate	NumCompaniesWorked	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
count	1480.000000	1480.000000	1480.000000	1480.000000	1480.0	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.0	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1480.000000	1423.000000
mean	36.917568	801.384459	9.220270	2.910811	1.0	1031.860811	2.724324	65.845270	2.729730	2.064865	2.725000	6504.985811	14298.460811	2.687162	15.210135	3.153378	2.708784	80.0	0.791892	11.281757	2.797973	2.760811	7.009459	4.228378	2.182432	4.118060
std	9.128559	403.126988	8.131201	1.023796	0.0	605.955046	1.092579	20.328266	0.713007	1.105574	1.104137	4700.261400	7112.056802	2.494098	3.655338	0.360474	1.081995	0.0	0.850527	7.770870	1.288791	0.707024	6.117945	3.616020	3.219357	3.555484
min	18.000000	102.000000	1.000000	1.000000	1.0	1.000000	1.000000	30.000000	1.000000	1.000000	1.000000	1009.000000	2094.000000	0.000000	11.000000	3.000000	1.000000	80.0	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000
25%	30.000000	465.000000	2.000000	2.000000	1.0	493.750000	2.000000	48.000000	2.000000	1.000000	2.000000	2922.250000	8051.000000	1.000000	12.000000	3.000000	2.000000	80.0	0.000000	6.000000	2.000000	2.000000	3.000000	2.000000	0.000000	2.000000
50%	36.000000	800.000000	7.000000	3.000000	1.0	1027.500000	3.000000	66.000000	3.000000	2.000000	3.000000	4933.000000	14220.000000	2.000000	14.000000	3.000000	3.000000	80.0	1.000000	10.000000	3.000000	3.000000	5.000000	3.000000	1.000000	3.000000
75%	43.000000	1157.000000	14.000000	4.000000	1.0	1568.250000	4.000000	83.000000	3.000000	3.000000	4.000000	8383.750000	20460.500000	4.000000	18.000000	3.000000	4.000000	80.0	1.000000	15.000000	3.000000	3.000000	9.000000	7.000000	3.000000	7.000000
max	60.000000	1499.000000	29.000000	5.000000	1.0	2068.000000	4.000000	100.000000	4.000000	5.000000	4.000000	19999.000000	26999.000000	9.000000	25.000000	4.000000	4.000000	80.0	3.000000	40.000000	6.000000	4.000000	40.000000	18.000000	15.000000	17.000000

The Jupyter Notebook that I used to work on this problem can be accessed at this GitHub Repository, though key takeaways are included below.

Exploring/Cleansing the data

Exploring a data set is important, and starting out with the pandas describe() method, as I have above, is a good way to quickly pick out potential issues. In this case the output shows (after a little scrolling) that one of my columns (YearsWithCurrManager) contained 57 nulls. Nulls are not very helpful for an ML model, so I dropped this column along with two others that were shown to have no variance and so would not be useful to me.

df = df.drop(['YearsWithCurrManager', 'StandardHours', 'Over18'], axis=1) # dropping columns
df.head()
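The same checks can also be run directly rather than read off the describe() output. This is a minimal sketch using a small made-up frame; the `toy` data and its values are assumptions for illustration, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are invented)
toy = pd.DataFrame({
    "Age": [18, 25, 31, 40],
    "StandardHours": [80, 80, 80, 80],                # constant, so no variance
    "Over18": ["Y", "Y", "Y", "Y"],                   # constant text column
    "YearsWithCurrManager": [0.0, np.nan, 2.0, 7.0],  # contains a null
})

null_counts = toy.isna().sum()   # nulls per column
unique_counts = toy.nunique()    # distinct non-null values per column

cols_with_nulls = null_counts[null_counts > 0].index.tolist()
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print("Columns with nulls:", cols_with_nulls)
print("Constant columns:", constant_cols)
```

One reason to check nunique() as well: describe() summarises only numeric columns by default, so a constant text column would not show up in its output.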

Next I explored the granularity of the data by checking the value_counts() of two different ID fields. This revealed 10 instances where the same IDs were appearing, so I used duplicated() to check whether these rows were duplications or whether they might represent different states in the same employee's career, and then removed any exact duplicates with drop_duplicates(inplace=True).

# counting the occurrence of values in EmpID and EmployeeNumber
EmpIDs = df['EmpID'].value_counts()
EmployeeNumbers = df['EmployeeNumber'].value_counts()

print('The data set contains %s unique employee IDs and %s unique employee numbers' % (len(EmpIDs), len(EmployeeNumbers)))

# checking to see if any of the 10 rows could be duplicates
df.duplicated().value_counts()

df.drop_duplicates(inplace=True)

Visualising the data

One of the downsides of using a public dataset is the unknowns, and this one features a lot of 1–5 scales where the positive and negative ends of the scales are anyone’s guess. I tried setting up a simple heatmap (using the numerical features in the data and the seaborn and matplotlib packages) to see if I could get a handle on the directions of these scales, but there weren't many strong correlations, so I decided to first take a look at the categorical data and then circle back.

This visualisation revealed some more data quality issues, as well as showing the spread of the different possible values and pointing to two fields - Attrition and OverTime - that were Yes/No and could be converted to binary values and combined with the numerical features. BusinessTravel could equally be converted to a numerical scale, as there was a sense of order from zero travel through to frequent travel.

...and with three new numerical fields available, the heatmap was a little more insightful - though not as much as I would have liked!
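The conversions themselves aren't shown in this post. Here is a minimal sketch of one way to do them with Series.map, using a made-up three-row frame. The 0/1/2 ordering for BusinessTravel is an assumption based on the "zero travel through to frequent travel" description, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up rows using the category labels seen in the data preview
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["No", "Yes", "No"],
    "BusinessTravel": ["Non-Travel", "Travel_Rarely", "Travel_Frequently"],
})

yes_no = {"No": 0, "Yes": 1}
# Assumed ordering: more travel maps to a higher number
travel_scale = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}

toy["Attrition"] = toy["Attrition"].map(yes_no)
toy["OverTime"] = toy["OverTime"].map(yes_no)
toy["BusinessTravel"] = toy["BusinessTravel"].map(travel_scale)

print(toy)
```

Any label missing from a mapping becomes NaN, so running isna().sum() afterwards is a quick way to flag unexpected spellings.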

The strongest correlation with Attrition was OverTime, which does make sense and is good to be aware of: people working longer hours are more likely to become unhappy with their roles. We're going to need more factors than just overtime in order to make predictions, though.
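When a heatmap is hard to read, ranking the correlations against the target can help. This is a minimal sketch on invented numbers; the `toy` values are assumptions chosen for illustration and do not reproduce the real correlations.

```python
import pandas as pd

# Made-up values standing in for the numerical frame
toy = pd.DataFrame({
    "Attrition": [1, 0, 1, 0, 1, 0],
    "OverTime":  [1, 0, 1, 0, 0, 0],
    "JobLevel":  [2, 3, 2, 1, 2, 3],
})

corr = toy.corr()  # Pearson correlation matrix, the input a heatmap would plot

# Correlation of every other column with Attrition, strongest first
ranked = corr["Attrition"].drop("Attrition").sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top rather than burying them at the bottom.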

In terms of those 1-5 scales, the strongest correlation between one of those features and another numerical feature where we could make an inference on the direction of that scale was JobLevel against MonthlyIncome. Plotting these two against each other did suggest that 1 is low and 5 is high, but these features individually didn't correlate strongly with Attrition.

PercentSalaryHike and PerformanceRating looked to have a relationship too, so I investigated that further and found that the data set actually only contains two performance ratings - 3 and 4.

Given that there are only two values available for performance rating, and that these are both ratings that acquired salary increases, this data set appears to be really well set up to answer questions specifically around retaining high performers.

Given the shortage of obvious correlations, though, our model is going to need to make predictions based on patterns between all features in the data set to be most effective.

Building a predictive model

I chose to use a Random Forest Classifier from scikit-learn for this initial phase of testing. This is because a classifier in general will do exactly what we need in terms of predicting either true or false, as opposed to predicting a value in a range. Furthermore, given the outcome of my heatmap, chaining up yes/no predictions against different combinations of features appealed to me, so I selected Random Forest as my classifier.

Given that I had already prepped a number of categorical fields for a numerical dataframe, I decided to see how well the model would perform "out of the box" on that numerical data. The only parameter used at this point was to specify a random state to ensure that my outputs remained repeatable while testing.

The model will output f1, precision and recall scores. This decision is also a result of the binary nature of the target variable - we are looking to weigh against each other true positives, false positives, true negatives and false negatives... and each of these scores does so in a different way.
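To make the scores concrete, precision and recall can be written out in terms of true positives (tp), false positives (fp) and false negatives (fn). The counts below are hypothetical, chosen only to keep the arithmetic easy to follow, and `precision_recall` is a helper name introduced for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everyone predicted to leave, the share who did
    recall = tp / (tp + fn)     # of everyone who left, the share we predicted
    return precision, recall

# Hypothetical counts for illustration only (not results from the notebook)
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print("Precision: %.3f" % precision)
print("Recall: %.3f" % recall)
```

True negatives appear in neither formula, which is why these scores stay informative even when most employees in the data did not leave.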

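from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

y = df['Attrition'] # defining target values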

# creating a list of numerical column names and dropping Attrition, as that is my target value
X_num_subset = numerical_df.drop(['Attrition'], axis=1).columns.to_list()
X = df[X_num_subset] # defining features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=19) # splitting to train and test data
X_y_shape = print_shape(X_train,y_train,X_test,y_test) # checking shape of train and test data

def model_fit_score(X_train,X_test,y_train,y_test):
    '''
    Uses train/test data to fit and predict using a RandomForestClassifier
    INPUTS:
    train_test_split variables
    OUTPUTS:
    prints f1, precision and recall scores; returns the precision score
    '''
    model = RandomForestClassifier(random_state=17)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    f1 = f1_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    print('F1 Score: %.3f' % f1)
    print('Precision Score: %.3f' % precision)
    print('Recall Score: %.3f' % recall)
    return precision

rfc_num = model_fit_score(X_train,X_test,y_train,y_test) # passing the train/test split to the model fit function defined above

This resulted in the following scores:

F1: 0.225
Precision: 0.727
Recall: 0.133

Here we can see that the f1 and recall scores are quite low. The f1 score is the harmonic mean of precision and recall, so it sits close to the lower of the two, and the low recall value is dragging it down.
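As a quick check, the f1 score can be recomputed from the precision and recall reported above. Because the inputs here are the rounded three-decimal values, this is an approximation of what scikit-learn calculates from the raw predictions.

```python
precision, recall = 0.727, 0.133  # rounded scores reported above

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
simple_mean = (precision + recall) / 2              # arithmetic mean, for contrast

print("F1: %.3f" % f1)
print("Arithmetic mean: %.3f" % simple_mean)
```

The harmonic mean lands far closer to the recall value than a simple average would, which is how a single weak score pulls f1 down.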

In terms of the context of our problem, this isn't particularly alarming. A low recall score implies that we have a higher number of false negatives. In our case that means employees that we predicted would stay, but who left anyway.

We cannot ignore external factors behind a person leaving their job. There are a wide array of personal reasons that would not be linked to their experience in their previous role.

On the other hand, the precision score suggests that there were actually a pretty low number of false positives - people we predicted to leave but didn't. So this score is promising. Can we improve upon our precision score by selecting a better number of features or incorporating categorical data?

Optimising the number of features used by the model

A SelectKBest feature selector enables us to check for an optimal number of features (k) that we can then use to transform our test and train data and refit our model. In the absence of clear individual correlations (see the earlier heatmap), this is a really quick way to home in on specific features to include in or exclude from the test/train data that might be negatively impacting the performance of my model.

By default, SelectKBest uses score_func=f_classif (an ANOVA F-test), which is designed for classification tasks, so I won't be adjusting this parameter.

def plot_k_best(X_train,y_train,X_test):
   '''
   Plots a chart of scores for all dataset features using SelectKBest
   INPUTS:
   train/test split data
   OUTPUTS:
   transformed train/test split data
   Bar chart of scores for all features
   '''
   fs = SelectKBest(k="all")# select all features
   fs.fit(X_train,y_train)
   X_train_fs = fs.transform(X_train)
   X_test_fs = fs.transform(X_test) # transform test input data

   # loop through features and print scores   
   for i in range(len(fs.scores_)):
      print('Feature %d: %f' % (i, fs.scores_[i])) 
   # plot the scores
   plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
   plt.xlabel("Feature")
   plt.ylabel("Score")
   plt.show()
   return X_train_fs, X_test_fs

plot_num = plot_k_best(X_train,y_train,X_test) # plotting scores of all features in the train/test data

This chart shows that there are clearly some features that have very little impact on the model, but the most useful number is still a little unclear. After trying values in the range 10-20 I found 20 to be the optimal number of features. Refitting the model with this number of features improved it fairly significantly...

The f1 score increased from 0.225 to 0.324
The precision score increased from 0.727 to 0.857
The recall score increased from 0.133 to 0.200
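The refit with a fixed k isn't shown above. Here is a minimal sketch of how it might look, assuming synthetic data from make_classification in place of the HR features (24 columns, roughly 16% positives) and reusing the random_state values from earlier. The scores it prints will not match the ones reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 24 numerical features (not the HR data)
X, y = make_classification(n_samples=400, n_features=24, n_informative=6,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=19)

selector = SelectKBest(score_func=f_classif, k=20)
X_train_fs = selector.fit_transform(X_train, y_train)  # fit on training data only
X_test_fs = selector.transform(X_test)                 # keep the same 20 columns

model = RandomForestClassifier(random_state=17)
model.fit(X_train_fs, y_train)
pred = model.predict(X_test_fs)
print("Precision: %.3f" % precision_score(y_test, pred, zero_division=0))
```

Fitting the selector on the training split alone keeps information from the test rows out of the feature choice.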

Including categorical features in the model

In order to use my Random Forest Classifier with categorical values, I would need to encode them to numerical values. Given that there isn't any implied ordinality (natural rank) with my remaining categorical features, I opted to use OneHotEncoder for this task - this encoder pivots each of my categorical values to a binary column, so there is no implied rank - only a 1 for true, and a zero for false for each additional detail.

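from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False) # scikit-learn 1.2+; older versions use sparse=False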
new_df = pd.DataFrame()

for cat in categorical_df.columns: # loop through each column in categorical_df
    values = asarray(df[cat].to_list()) # list of column values as an array
    values = values.reshape(len(values),1) # reshape array
    result = encoder.fit_transform(values) # fit encoder on values
    cols = pd.DataFrame(result, columns=encoder.categories_) # utilise categorical values as column names
    new_df=pd.concat([new_df,cols],axis=1) # add new columns to new_df

X_cat = pd.concat([new_df.reset_index(drop=True),X.reset_index(drop=True)],axis=1) # concatenating new_df to previous X dataframe and resetting index to prevent index from memory persisting
X_cat.columns = X_cat.columns.astype(str) # ensuring that column titles are strings
X_cat.head()
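One thing to watch with one-hot encoding is that two categorical columns can share a value, which would produce two identically named columns. A hedged alternative, swapped in here in place of OneHotEncoder, is pandas' get_dummies, which prefixes each new column with its source column name. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows in which two columns share the value "Human Resources"
toy = pd.DataFrame({
    "Department": ["Sales", "Human Resources"],
    "JobRole": ["Sales Executive", "Human Resources"],
})

encoded = pd.get_dummies(toy, dtype=int)  # one 0/1 column per column/value pair
print(encoded.columns.tolist())
```

Because every new column name carries its source column as a prefix, the shared value no longer causes a clash.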

Pivoting this categorical data and appending it to the numerical data frame resulted in 56 features, and given the trial-and-error involved in isolating the best number of features with only 24 previously, I decided to iterate through each potential model-fitting and visualise which would yield the best results using plotly.

def dict_to_df(title1,list1,title2,list2):
    '''
    creates a dictionary from two lists and converts that into a dataframe that can be used to plot values
    INPUTS:
    Title1, Title2 - strings that will form column titles
    list1, list2 - lists of values for each column
    OUTPUT:
    to_df - a dataframe
    '''
    data = {title1: list1, title2: list2} # dictionary of two lists (named data to avoid shadowing the built-in dict)
    to_df = pd.DataFrame(data) # dictionary to dataframe
    return to_df

fs_scores = [] # empty list for scores
fs_list = [] # empty list for features numbers
model = RandomForestClassifier(random_state=17)
for i in range(56): # iterate through features
    #transform train/test with the number of features
    fs = SelectKBest(score_func=f_classif, k=(i+1)) # f_classif matches the classification target
    fs.fit(X_cat_train, y_cat_train)
    X_cat_train_fs = fs.transform(X_cat_train)
    X_cat_test_fs = fs.transform(X_cat_test)
    model.fit(X_cat_train_fs, y_cat_train) # fit to model
    # append accuracy (what model.score returns for a classifier) and feature number to lists
    fs_scores.append(model.score(X_cat_test_fs, y_cat_test))
    fs_list.append(i+1)

df_fs = dict_to_df('Num_Features',fs_list,'Score',fs_scores) # create a dataframe from a dictionary of the two lists

fig = px.line(df_fs, x="Num_Features", y="Score") # plotting number of features against score
fig.show()

Refitting my model with train/test data that included categorical features pivoted using OneHotEncoder, and transformed according to the k best features suggested in the chart above, yielded the following results:

For k=36 features, the model performed slightly worse than previously:

The f1 score decreased from 0.324 to 0.301
The precision score decreased from 0.857 to 0.846
The recall score decreased from 0.200 to 0.183

At k=38 the precision score actually reached 1.000, which suggests that the model could be overfitting.
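One way to probe a suspiciously perfect score is to see how much it moves across several train/test splits. This is a minimal sketch using cross-validation on synthetic data (56 features, roughly 16% positives); it is not something the notebook is known to include, and the dataset and fold settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 56 encoded features (not the HR data)
X, y = make_classification(n_samples=400, n_features=56, n_informative=8,
                           weights=[0.84, 0.16], random_state=0)

# Selection sits inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=38),
    RandomForestClassifier(random_state=17),
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=folds, scoring="precision")

print("Precision per fold:", scores.round(3))
print("Mean %.3f, std %.3f" % (scores.mean(), scores.std()))
```

A precision that swings widely between folds would support the suspicion that a single 1.000 came from one lucky split.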

Overall we can conclude that including the categorical variables does not improve the model. Can a little trial and error with our original model and some additional parameters make a difference, though?

Refining the model parameters

So, going back to our initial model, but using 20 features, could we simply adjust the parameters for the Random Forest Classifier to improve the model at all?

This process was more trial-and-error again, and I found that most adjustments made the model worse. There was one that improved it very slightly, though: setting max_depth to 10 rather than leaving it at the default of None (no limit). This parameter caps how deep each decision tree can grow, and therefore how many successive splits it can make, so it does make sense that restricting this could refine our model.
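The effect of max_depth can be seen by inspecting the fitted trees. This is a minimal sketch on synthetic data, assuming the same random_state as earlier; the dataset is a stand-in and the depths it prints say nothing about the HR model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 20 selected features (not the HR data)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

unlimited = RandomForestClassifier(random_state=17).fit(X, y)            # max_depth=None
capped = RandomForestClassifier(max_depth=10, random_state=17).fit(X, y)

# Depth of the deepest tree in each forest
deepest_unlimited = max(tree.get_depth() for tree in unlimited.estimators_)
deepest_capped = max(tree.get_depth() for tree in capped.estimators_)

print("Deepest tree, no limit:", deepest_unlimited)
print("Deepest tree, max_depth=10:", deepest_capped)
```

Shallower trees have less room to memorise individual rows of the training data, which is the usual argument for capping depth.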

The chart below shows the differing performance of each model.

Conclusion

What worked well?

Investing time cleansing and visualising the data upfront left me better equipped to understand what approaches might work best, and where the data could potentially impact performance.
Researching the best model to use and why saved me a lot of time.
Selecting the best number of features to use improved my model's performance, and helped me to understand the impact of introducing categorical data.
Adjusting parameters also improved performance.

What could have gone better?

If the data set had shown clearer correlations, that might have made the number of features to use more immediately apparent.
I'd have expected the introduction of the categorical data to have more of an impact than it did.

Next steps

If I were to continue with/repeat this experiment, here are a few considerations that I might make:

Increasing the size/balance of my data set - only 16% of rows in the data set related to attrition, and whilst I accounted for that by checking the values in my y_test and y_train data once split, having more leavers in the data set might help to improve predictions.

I also thought initially that only having Performance Ratings of 3 and 4 in my data set suited the problem that I was looking to solve (retaining high performers), but perhaps including the lower performers might have strengthened my predictive model by including more leavers by default.

Finally, I identified a potential risk of overfitting, so having more data from which to sample would help to check on that.

Stacking Classifiers is another enhancement that I could make. Whilst the Random Forest Classifier performed well, perhaps stacking it with another model would boost it.
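For reference, stacking in scikit-learn might look something like the sketch below. The choice of LogisticRegression as the second base model and as the final estimator is an assumption made purely for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the HR data)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=19)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=17)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # combines the base models' predictions
    cv=5,  # base-model predictions for the final estimator come from 5-fold CV
)
stack.fit(X_train, y_train)
pred = stack.predict(X_test)
print("Predicted leavers in test set:", int(pred.sum()))
```

Whether stacking helps would need testing against the same f1, precision and recall scores used throughout this post.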

Scouting for Talent in Women’s Football

Yvonne McGillicuddy (UK) — Thu, 06 Jul 2023 23:00:00 GMT

Any fantasy football enthusiast or Football Manager player will be aware of the astonishing volume of statistics available for men’s football. Databases for the men’s game are, in fact, so extensive that data sets used by computer games such as Football Manager have even been used as preliminary research by professional football scouts — why go to the trouble of flying across the world to see players of interest in person when you can first check them out from the comfort of your home office?

Unfortunately, the women’s game is a number of decades behind — in part due to a 50-year-long ban on women playing stadium games in the UK, and similar restrictions in other countries around the world. Data are no exception to this deficit. The data set that I’m working through for this blog may represent 1328 players, but each has only been tracked for a maximum of 11 matches… and it was the most comprehensive dataset that I could find!

So — to put ourselves in the shoes of a hypothetical football scout — how can we work with the data available to produce positive outcomes for our club?

First things first: we need a team!

Visualised as a bar chart, we can see a huge difference between the leading goal scorers (Barcelona) and Leicester City WFC. In reality, this works out to 42 goals for Barcelona vs. just one goal for Leicester City WFC.

Clearly, “The Foxes” need our help!

In an ideal world, as scout for Leicester City WFC we might use the analysis of goal scorers and their clubs above to target talent at Barcelona, but in reality we probably don’t have the budget to lure their top goalscorers away, so how — instead — might we make use of observable features in the game to spot goalscoring potential?

The obvious thing to look out for would be chances on goal, and the visual below breaks down top shooters in the data set, rather than scorers. Already we’re seeing that Barcelona don’t have the monopoly on opportunity!

If I were on the lookout for a more viable target than Barcelona’s top goalscorer, I might look to younger players; players that may still have a thing or two to learn, and room left in their career to develop significantly.

Benedetta Glionna of AS Roma looks interesting in this regard, with 0 goals to show for her 13 shots on target, but only 23 years of age.

We can pick out other features, though, the presence of which would seem to suggest that goals are not far away…

Touches_AttPen: Touches in the attacking penalty area
Touches_Att3rd: Touches in the attacking third.
Dist: Distance

Note: The heatmap shows us that touches in the attacking third are actually more strongly correlated with shots on target (and with assists, which we will see later).

Interestingly, distance appears to be slightly negatively correlated, which would all go to suggest that our scout should be looking out for players…

situated in the attacking third, but taking touches in the penalty area
frequently shooting at goal
restricting their movement to key areas — running the length of the pitch too much seems to be detrimental to goalscoring.

With our ideal player having maximum impact within the penalty area, we need to get the ball to them, so let’s take a closer look at assists…

This heat map shows the strongest relationships to assists in our dataset:

KP: Key Passes
SCA: Shot creation actions
GCA: Goal creation actions

…but these features are far more difficult to evaluate in real-time scouting than shots on target, and touches in the penalty area! If we could get access to a player’s recent performance statistics, though, could we use the full dataset to predict potential for assists, even if that player hasn’t actually contributed any yet?

The short answer is: yes!

We can indeed predict potential for assists with a reasonable degree of accuracy.

Using the values for KP, SCA, and GCA and their relationship to Ast, we can train a linear regression model, and then check that model against the remainder of the data set to see how it fared.

In this instance I trained on 75% of the cleansed dataset, which yielded a prediction accuracy of 67% across the remaining 25% of data.
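As a rough illustration of that workflow, here is a minimal sketch on synthetic data. The article doesn't say how the 67% figure was measured; this sketch assumes it is the R² value that LinearRegression.score returns. The three feature columns stand in for KP, SCA and GCA, and every number below is invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic KP, SCA and GCA columns (not the real player data)
X = rng.poisson(lam=[4.0, 6.0, 1.0], size=(200, 3)).astype(float)
# Synthetic assists, loosely driven by the three features plus noise
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + 0.6 * X[:, 2] + rng.normal(0, 0.3, 200)

# Train on 75% of the rows and hold back 25% for checking
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)
print("R^2 on held-out data: %.2f" % r2)
```

An R² of 0.67 would mean the model explains about two thirds of the variation in assists across the held-out players.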

Now, 67% is by no means a spectacular accuracy rate and — as this article has explored — there were some notable shortcomings in the data. It didn’t span nearly enough matches for my liking, and there were comparatively few fields relating to defensive attributes (absence of clean sheet statistics was noticeable as a football fan).

All of which highlights the need for more data in the growing Women’s game. At this stage we should be looking to collect statistics comparable with the men’s game in order to be able to achieve more with data analysis and help those behind the scenes to spot trends and drive meaningful change.

If this article has piqued your interest in Women’s football, why not look up your local team and ways you can support them?

Please also feel free to take a look at the companion code to this article on GitHub.