<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Yvonne McGillicuddy's Blog]]></title><description><![CDATA[Yvonne McGillicuddy's Blog]]></description><link>https://yvonnemcg.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Wed, 17 Jun 2026 13:26:22 GMT</lastBuildDate><atom:link href="https://yvonnemcg.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Predicting negative attrition with Machine Learning]]></title><description><![CDATA[Attrition (employees leaving) can be a really expensive issue for businesses. There are costs associated generally with hiring, training and accounting for loss of productivity with a reduced workforce... or a workforce with team members who haven’t ...]]></description><link>https://yvonnemcg.hashnode.dev/predicting-negative-attrition-with-machine-learning</link><guid isPermaLink="true">https://yvonnemcg.hashnode.dev/predicting-negative-attrition-with-machine-learning</guid><category><![CDATA[Python]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[scikit learn]]></category><category><![CDATA[predictive analysis]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Yvonne McGillicuddy (UK)]]></dc:creator><pubDate>Thu, 04 Jan 2024 16:43:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/RNsKphkdBTk/upload/d665a326f9d037c8dabf4ba94fc4e777.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Attrition (employees leaving) can be a really expensive issue for businesses.</strong> There are costs associated generally with hiring, training and accounting for loss of productivity with a reduced workforce... 
or a workforce with team members who haven’t yet hit the ground running.</p>
<p>Whilst a business might be happy to pay these costs if it means onboarding higher-performing employees, they definitely do not want to pay these costs as a result of losing high-performing employees that are tough to replace.</p>
<p>Many companies now invest in employee survey solutions designed to spot indicators of employees being satisfied in their roles, and performance data is also recorded, but even with that data joined we are missing a crucial data point: <strong><em>how likely - based on that data - is it that a person leaves the firm?</em></strong></p>
<h2 id="heading-problem-statement">Problem Statement</h2>
<p><strong>Could a business use machine learning to predict attrition and intervene before an employee leaves, reducing the rate of attrition for high performers over time?</strong></p>
<h2 id="heading-strategy">Strategy</h2>
<p>To explore this problem, we will...</p>
<ul>
<li><p>locate an appropriate dataset that includes data of past and present employees</p>
</li>
<li><p><strong><em>analyse</em></strong> the data - getting a clear sense of what we have to work with</p>
</li>
<li><p><strong><em>cleanse</em></strong> the data to remove any data issues that could impede our use of a machine learning model</p>
</li>
<li><p><strong><em>visualise</em></strong> the data to validate data quality and highlight correlations between features in our dataset.</p>
</li>
<li><p><strong><em>fit a predictive model</em></strong> - in this case a Random Forest Classifier, as our target variable (attrition) is binary - and assess its performance</p>
</li>
<li><p><strong><em>refine our model</em></strong> by selecting different features and modifying the model's parameters</p>
</li>
<li><p><strong><em>compare results</em></strong> and examine future steps based on conclusions drawn</p>
</li>
</ul>
<h2 id="heading-our-data-set">Our data set...</h2>
<p><a target="_blank" href="https://www.kaggle.com/datasets/anshika2301/hr-analytics-dataset/discussion/456134">This HR analytics dataset from Kaggle</a> incorporates a variety of numerical and categorical data points for each employee that give a sense of their demographic, role and salary at the company, satisfaction against a variety of criteria, as well as performance indicators and, critically, whether or not that employee left the company. These factors combined make this data set ideal for exploring this problem, though the categorical values do pose some challenges, as we won't be able to utilise them in their current form in a machine learning model. I will come back to this point later...</p>
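<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.read_csv(<span class="hljs-string">'data/HR_Analytics.csv'</span>)
pd.set_option(<span class="hljs-string">'display.max_columns'</span>, <span class="hljs-literal">None</span>) <span class="hljs-comment"># show all columns rather than truncating</span>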
df.head()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>EmpID</td><td>Age</td><td>AgeGroup</td><td>Attrition</td><td>BusinessTravel</td><td>DailyRate</td><td>Department</td><td>DistanceFromHome</td><td>Education</td><td>EducationField</td><td>EmployeeCount</td><td>EmployeeNumber</td><td>EnvironmentSatisfaction</td><td>Gender</td><td>HourlyRate</td><td>JobInvolvement</td><td>JobLevel</td><td>JobRole</td><td>JobSatisfaction</td><td>MaritalStatus</td><td>MonthlyIncome</td><td>SalarySlab</td><td>MonthlyRate</td><td>NumCompaniesWorked</td><td>Over18</td><td>OverTime</td><td>PercentSalaryHike</td><td>PerformanceRating</td><td>RelationshipSatisfaction</td><td>StandardHours</td><td>StockOptionLevel</td><td>TotalWorkingYears</td><td>TrainingTimesLastYear</td><td>WorkLifeBalance</td><td>YearsAtCompany</td><td>YearsInCurrentRole</td><td>YearsSinceLastPromotion</td><td>YearsWithCurrManager</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>RM297</td><td>18</td><td>18-25</td><td>Yes</td><td>Travel_Rarely</td><td>230</td><td>Research &amp; Development</td><td>3</td><td>3</td><td>Life Sciences</td><td>1</td><td>405</td><td>3</td><td>Male</td><td>54</td><td>3</td><td>1</td><td>Laboratory Technician</td><td>3</td><td>Single</td><td>1420</td><td>Upto 5k</td><td>25233</td><td>1</td><td>Y</td><td>No</td><td>13</td><td>3</td><td>3</td><td>80</td><td>0</td><td>0</td><td>2</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0.0</td></tr>
<tr>
<td>1</td><td>RM302</td><td>18</td><td>18-25</td><td>No</td><td>Travel_Rarely</td><td>812</td><td>Sales</td><td>10</td><td>3</td><td>Medical</td><td>1</td><td>411</td><td>4</td><td>Female</td><td>69</td><td>2</td><td>1</td><td>Sales Representative</td><td>3</td><td>Single</td><td>1200</td><td>Upto 5k</td><td>9724</td><td>1</td><td>Y</td><td>No</td><td>12</td><td>3</td><td>1</td><td>80</td><td>0</td><td>0</td><td>2</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0.0</td></tr>
<tr>
<td>2</td><td>RM458</td><td>18</td><td>18-25</td><td>Yes</td><td>Travel_Frequently</td><td>1306</td><td>Sales</td><td>5</td><td>3</td><td>Marketing</td><td>1</td><td>614</td><td>2</td><td>Male</td><td>69</td><td>3</td><td>1</td><td>Sales Representative</td><td>2</td><td>Single</td><td>1878</td><td>Upto 5k</td><td>8059</td><td>1</td><td>Y</td><td>Yes</td><td>14</td><td>3</td><td>4</td><td>80</td><td>0</td><td>0</td><td>3</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0.0</td></tr>
<tr>
<td>3</td><td>RM728</td><td>18</td><td>18-25</td><td>No</td><td>Non-Travel</td><td>287</td><td>Research &amp; Development</td><td>5</td><td>2</td><td>Life Sciences</td><td>1</td><td>1012</td><td>2</td><td>Male</td><td>73</td><td>3</td><td>1</td><td>Research Scientist</td><td>4</td><td>Single</td><td>1051</td><td>Upto 5k</td><td>13493</td><td>1</td><td>Y</td><td>No</td><td>15</td><td>3</td><td>4</td><td>80</td><td>0</td><td>0</td><td>2</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0.0</td></tr>
<tr>
<td>4</td><td>RM829</td><td>18</td><td>18-25</td><td>Yes</td><td>Non-Travel</td><td>247</td><td>Research &amp; Development</td><td>8</td><td>1</td><td>Medical</td><td>1</td><td>1156</td><td>3</td><td>Male</td><td>80</td><td>3</td><td>1</td><td>Laboratory Technician</td><td>3</td><td>Single</td><td>1904</td><td>Upto 5k</td><td>13556</td><td>1</td><td>Y</td><td>No</td><td>12</td><td>3</td><td>4</td><td>80</td><td>0</td><td>0</td><td>0</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0.0</td></tr>
</tbody>
</table>
</div><p>In terms of the volume of data, the full set is 1480 rows and 38 columns, some of which I will remove for reasons explained later on. Overall this seems a reasonable size to draw some initial conclusions, though we will see later on that we might ideally want a larger data set for further testing.</p>
<pre><code class="lang-python">pd.set_option(<span class="hljs-string">'display.max_columns'</span>, <span class="hljs-literal">None</span>) <span class="hljs-comment"># show all columns rather than truncating</span>
df.describe()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Age</td><td>DailyRate</td><td>DistanceFromHome</td><td>Education</td><td>EmployeeCount</td><td>EmployeeNumber</td><td>EnvironmentSatisfaction</td><td>HourlyRate</td><td>JobInvolvement</td><td>JobLevel</td><td>JobSatisfaction</td><td>MonthlyIncome</td><td>MonthlyRate</td><td>NumCompaniesWorked</td><td>PercentSalaryHike</td><td>PerformanceRating</td><td>RelationshipSatisfaction</td><td>StandardHours</td><td>StockOptionLevel</td><td>TotalWorkingYears</td><td>TrainingTimesLastYear</td><td>WorkLifeBalance</td><td>YearsAtCompany</td><td>YearsInCurrentRole</td><td>YearsSinceLastPromotion</td><td>YearsWithCurrManager</td></tr>
</thead>
<tbody>
<tr>
<td>count</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.0</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.0</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1480.000000</td><td>1423.000000</td></tr>
<tr>
<td>mean</td><td>36.917568</td><td>801.384459</td><td>9.220270</td><td>2.910811</td><td>1.0</td><td>1031.860811</td><td>2.724324</td><td>65.845270</td><td>2.729730</td><td>2.064865</td><td>2.725000</td><td>6504.985811</td><td>14298.460811</td><td>2.687162</td><td>15.210135</td><td>3.153378</td><td>2.708784</td><td>80.0</td><td>0.791892</td><td>11.281757</td><td>2.797973</td><td>2.760811</td><td>7.009459</td><td>4.228378</td><td>2.182432</td><td>4.118060</td></tr>
<tr>
<td>std</td><td>9.128559</td><td>403.126988</td><td>8.131201</td><td>1.023796</td><td>0.0</td><td>605.955046</td><td>1.092579</td><td>20.328266</td><td>0.713007</td><td>1.105574</td><td>1.104137</td><td>4700.261400</td><td>7112.056802</td><td>2.494098</td><td>3.655338</td><td>0.360474</td><td>1.081995</td><td>0.0</td><td>0.850527</td><td>7.770870</td><td>1.288791</td><td>0.707024</td><td>6.117945</td><td>3.616020</td><td>3.219357</td><td>3.555484</td></tr>
<tr>
<td>min</td><td>18.000000</td><td>102.000000</td><td>1.000000</td><td>1.000000</td><td>1.0</td><td>1.000000</td><td>1.000000</td><td>30.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1009.000000</td><td>2094.000000</td><td>0.000000</td><td>11.000000</td><td>3.000000</td><td>1.000000</td><td>80.0</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>1.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>
<tr>
<td>25%</td><td>30.000000</td><td>465.000000</td><td>2.000000</td><td>2.000000</td><td>1.0</td><td>493.750000</td><td>2.000000</td><td>48.000000</td><td>2.000000</td><td>1.000000</td><td>2.000000</td><td>2922.250000</td><td>8051.000000</td><td>1.000000</td><td>12.000000</td><td>3.000000</td><td>2.000000</td><td>80.0</td><td>0.000000</td><td>6.000000</td><td>2.000000</td><td>2.000000</td><td>3.000000</td><td>2.000000</td><td>0.000000</td><td>2.000000</td></tr>
<tr>
<td>50%</td><td>36.000000</td><td>800.000000</td><td>7.000000</td><td>3.000000</td><td>1.0</td><td>1027.500000</td><td>3.000000</td><td>66.000000</td><td>3.000000</td><td>2.000000</td><td>3.000000</td><td>4933.000000</td><td>14220.000000</td><td>2.000000</td><td>14.000000</td><td>3.000000</td><td>3.000000</td><td>80.0</td><td>1.000000</td><td>10.000000</td><td>3.000000</td><td>3.000000</td><td>5.000000</td><td>3.000000</td><td>1.000000</td><td>3.000000</td></tr>
<tr>
<td>75%</td><td>43.000000</td><td>1157.000000</td><td>14.000000</td><td>4.000000</td><td>1.0</td><td>1568.250000</td><td>4.000000</td><td>83.000000</td><td>3.000000</td><td>3.000000</td><td>4.000000</td><td>8383.750000</td><td>20460.500000</td><td>4.000000</td><td>18.000000</td><td>3.000000</td><td>4.000000</td><td>80.0</td><td>1.000000</td><td>15.000000</td><td>3.000000</td><td>3.000000</td><td>9.000000</td><td>7.000000</td><td>3.000000</td><td>7.000000</td></tr>
<tr>
<td>max</td><td>60.000000</td><td>1499.000000</td><td>29.000000</td><td>5.000000</td><td>1.0</td><td>2068.000000</td><td>4.000000</td><td>100.000000</td><td>4.000000</td><td>5.000000</td><td>4.000000</td><td>19999.000000</td><td>26999.000000</td><td>9.000000</td><td>25.000000</td><td>4.000000</td><td>4.000000</td><td>80.0</td><td>3.000000</td><td>40.000000</td><td>6.000000</td><td>4.000000</td><td>40.000000</td><td>18.000000</td><td>15.000000</td><td>17.000000</td></tr>
</tbody>
</table>
</div><p>The Jupyter Notebook that I used to work on this problem can be accessed at this <a target="_blank" href="https://github.com/ymcgillicuddy/HR-Analytics/tree/master"><strong>GitHub Repository</strong></a>, though key takeaways are included below.</p>
<h2 id="heading-exploringcleansing-the-data">Exploring/Cleansing the data</h2>
<p>Exploring a data set is important, and starting out with the <a target="_blank" href="https://pandas.pydata.org/docs/">pandas</a> <code>describe()</code> method, as I have above, is a good way to quickly pick out potential issues. In this case my output shows (after a little scrolling) that one of my columns (<code>YearsWithCurrManager</code>) contained 57 nulls: its count is 1423 against 1480 rows. This is not very helpful for an ML model, so I dropped it along with two other columns (<code>StandardHours</code> and <code>Over18</code>) that were shown to have no variance and so would not be useful to me.</p>
<pre><code class="lang-python">df = df.drop([<span class="hljs-string">'YearsWithCurrManager'</span>, <span class="hljs-string">'StandardHours'</span>, <span class="hljs-string">'Over18'</span>], axis=<span class="hljs-number">1</span>) <span class="hljs-comment"># dropping columns</span>
df.head()
</code></pre>
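<p>As a rough illustration of how columns like these could be flagged programmatically rather than by scrolling through <code>describe()</code>, here is a minimal sketch. The <code>toy</code> frame and its values are made up for the example and are not taken from the HR data; only the column names are borrowed from it.</p>
<pre><code class="lang-python">import pandas as pd

# Made-up miniature frame standing in for the HR data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [18, 25, 31, 40],
    "StandardHours": [80, 80, 80, 80],
    "Over18": ["Y", "Y", "Y", "Y"],
    "YearsWithCurrManager": [0.0, None, 2.0, None],
})

# Count nulls per column
null_counts = toy.isna().sum()

# A column holding a single distinct value has no variance
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(null_counts[null_counts != 0])  # columns that contain nulls
print(constant_cols)                  # columns with a single value
</code></pre>
<p>Note that <code>describe()</code> only summarises numerical columns by default, so a text column such as <code>Over18</code> would need a check like this (or <code>value_counts()</code>) to show that it never varies.</p>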
<p>Next I explored the granularity of the data by checking the <code>value_counts()</code> of two different ID fields. This revealed 10 instances where the same IDs were appearing, so I tried <code>drop_duplicates(inplace=True)</code> to check if these rows were duplications, or if they might represent different states in the same Employee’s career.</p>
<pre><code class="lang-python"><span class="hljs-comment"># counting the occurrence of values in EmpID and EmployeeNumber</span>
EmpIDs = df[<span class="hljs-string">'EmpID'</span>].value_counts()
EmployeeNumbers = df[<span class="hljs-string">'EmployeeNumber'</span>].value_counts()

print(<span class="hljs-string">'The data set contains %s unique employee IDs and %s unique employee numbers'</span> % (len(EmpIDs), len(EmployeeNumbers)))
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># checking to see if any of the 10 rows could be duplicates</span>
df.duplicated().value_counts()
</code></pre>
<pre><code class="lang-python">df.drop_duplicates(inplace=<span class="hljs-literal">True</span>)
</code></pre>
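<p>To make the distinction between exact duplicate rows and repeated IDs with differing details concrete, here is a small sketch on made-up data. The IDs and values are hypothetical; only the column names echo the data set.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical rows: RM002 is an exact duplicate, RM003 appears twice with different details
toy = pd.DataFrame({
    "EmpID": ["RM001", "RM002", "RM002", "RM003", "RM003"],
    "JobLevel": [1, 2, 2, 1, 2],
})

exact_dupes = int(toy.duplicated().sum())  # rows identical across every column
deduped = toy.drop_duplicates()            # returns a new frame, leaving toy unchanged

# IDs that still appear more than once after removing exact duplicates
id_counts = deduped["EmpID"].value_counts()
still_repeated = id_counts[id_counts != 1].index.tolist()

print(exact_dupes, still_repeated)
</code></pre>
<p>Any ID left in <code>still_repeated</code> would be a candidate for representing different states of the same employee rather than a straightforward duplication.</p>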
<h2 id="heading-visualising-the-data">Visualising the data</h2>
<p>One of the downsides of using a public dataset is the unknowns, and this one features a lot of 1–5 scales where the positive and negative ends of the scales are anyone’s guess. I tried setting up a simple heatmap (using the numerical features in the data and the <a target="_blank" href="https://seaborn.pydata.org/">seaborn</a> and <a target="_blank" href="https://matplotlib.org/">matplotlib</a> packages) to see if I could get a handle on the directions of these scales, but there weren't many strong correlations, so I decided to first take a look at the categorical data and then circle back.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704378337545/65cbdb6a-0a66-40b6-a258-c5cce0f8c1c7.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704378353492/ebeafe84-be5a-4fd3-8a5d-cdf4bd54e074.png" alt class="image--center mx-auto" /></p>
<p>This visualisation revealed some more data quality issues, as well as showing the spread of the different possible values and pointing to two fields, <code>Attrition</code> and <code>OverTime</code>, that were Yes/No and could be converted to binary values and combined with the numerical features. <code>BusinessTravel</code> could equally be converted to a numerical field, as there was a sense of scale from zero travel through to frequent travel.</p>
<p>...and with three new numerical fields available, the heatmap was a little more insightful - though not as much as I would have liked!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704378752303/772926a9-6f72-4393-b20b-e5e82a63d98b.png" alt class="image--center mx-auto" /></p>
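<p>The conversion of those three fields could look something like the sketch below. The mapping dictionaries are my own assumption about a sensible encoding (in particular the 0/1/2 ordering for travel), and the three rows are illustrative rather than taken from the full data set.</p>
<pre><code class="lang-python">import pandas as pd

# Illustrative rows using category labels that appear in the data set
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "OverTime": ["No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Frequently"],
})

yes_no = {"No": 0, "Yes": 1}
# Assumed ordinal scale: no travel, then rare travel, then frequent travel
travel_scale = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}

encoded = toy.assign(
    Attrition=toy["Attrition"].map(yes_no),
    OverTime=toy["OverTime"].map(yes_no),
    BusinessTravel=toy["BusinessTravel"].map(travel_scale),
)
print(encoded)
</code></pre>
<p>One thing to watch with <code>map()</code> is that any label missing from the dictionary becomes <code>NaN</code>, so inconsistent spellings need tidying up before the conversion.</p>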
<p>The strongest correlation was <code>OverTime</code>, which does make sense and is good to be aware of: people working longer hours are more likely to become unhappy with their roles. We're going to need more factors than just overtime in order to make predictions, though.</p>
<p>In terms of those 1-5 scales, the strongest correlation between one of those features and another numerical feature where we could make an inference on the direction of that scale was <code>JobLevel</code> against <code>MonthlyIncome</code>. Plotting these two against each other did suggest that 1 is low and 5 is high, but these features individually didn't correlate strongly with Attrition.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704379443057/27f78e5a-d9b8-4e05-8916-d3cb9452645e.png" alt class="image--center mx-auto" /></p>
<p><code>PercentSalaryHike</code> and <code>PerformanceRating</code> looked to have a relationship too, so I investigated that further and found that the data set actually only contains two performance ratings - 3 and 4.</p>
<p>Given that there are only two values available for performance rating, and that these are both ratings that acquired salary increases, this data set appears to be really well set up to answer questions specifically around retaining high performers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704379506994/509d17d9-c083-48ea-ade7-9fb464e64a7d.png" alt class="image--center mx-auto" /></p>
<p>Given the shortage of obvious correlations, though, our model is going to need to make predictions based on patterns between all features in the data set to be most effective.</p>
<h2 id="heading-building-a-predictive-model">Building a predictive model</h2>
<p>I chose to use a <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest Classifier</a> from <strong><em>scikit-learn</em></strong> for this initial phase of testing. This is because a classifier in general will do exactly what we need in terms of predicting either true or false, as opposed to predicting a value in a range. Furthermore, given the outcome of my heatmap, chaining up yes/no predictions against different combinations of features appealed to me, so I selected Random Forest as my classifier.</p>
<p>Given that I had already prepped a number of categorical fields for a numerical dataframe, I decided to see how well the model would perform "out of the box" on that numerical data. The only parameter set at this point was <code>random_state</code>, to ensure that my outputs remained repeatable while testing.</p>
<p>The model will output f1, precision and recall scores. This decision is also a result of the binary nature of the target variable - we are looking to weigh against each other true positives, false positives, true negatives and false negatives... and each of these scores does so in a different way.</p>
<pre><code class="lang-python">y = df[<span class="hljs-string">'Attrition'</span>] <span class="hljs-comment"># defining target values</span>

<span class="hljs-comment"># creating a list of numerical column names and dropping Attrition, as that is my target value</span>
X_num_subset = numerical_df.drop([<span class="hljs-string">'Attrition'</span>], axis=<span class="hljs-number">1</span>).columns.to_list()
X = df[X_num_subset] <span class="hljs-comment"># defining features</span>

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.25</span>, random_state=<span class="hljs-number">19</span>) <span class="hljs-comment"># splitting to train and test data</span>
X_y_shape = print_shape(X_train,y_train,X_test,y_test) <span class="hljs-comment"># checking shape of train and test data</span>
</code></pre>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">model_fit_score</span>(<span class="hljs-params">X_train,X_test,y_train,y_test</span>):</span>
    <span class="hljs-string">'''
    Uses train/test data to fit and predict using a RandomForestClassifier
    INPUTS:
    train_test_split variables
    OUTPUTS:
    f1, precision and recall scores
    '''</span>
    model = RandomForestClassifier(random_state=<span class="hljs-number">17</span>)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    f1 = f1_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    print(<span class="hljs-string">'F1 Score: %.3f'</span> % f1)
    print(<span class="hljs-string">'Precision Score: %.3f'</span> % precision)
    print(<span class="hljs-string">'Recall Score: %.3f'</span> % recall)
    <span class="hljs-keyword">return</span> precision
</code></pre>
<pre><code class="lang-python">rfc_num = model_fit_score(X_train,X_test,y_train,y_test) <span class="hljs-comment"># passing test_train_split to model fit function created earlier</span>
</code></pre>
<p>This resulted in the following scores:</p>
<ul>
<li><p><strong>F1:</strong> 0.225</p>
</li>
<li><p><strong>Precision:</strong> 0.727</p>
</li>
<li><p><strong>Recall:</strong> 0.133</p>
</li>
</ul>
<p>Here we can see that the f1 and recall scores are quite low. This is because the f1 score is a calculation incorporating both precision <em>and</em> recall, and the recall value is dragging down that f1 score.</p>
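<p>To see how the recall drags the f1 score down, it helps to remember that f1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two rounded scores above; <code>f1_from</code> is a helper name I have made up for the illustration.</p>
<pre><code class="lang-python">def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall reported for the first model
baseline_f1 = f1_from(0.727, 0.133)
print(round(baseline_f1, 3))
</code></pre>
<p>Because a harmonic mean sits much closer to the smaller of its two inputs, even a precision of 0.727 cannot lift the f1 score far above the recall.</p>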
<p>In terms of the context of our problem, this isn't particularly alarming. A low recall score implies that we have a higher number of false negatives. In our case that means employees that we predicted would stay, but who left anyway.</p>
<p>We cannot ignore external factors behind a person leaving their job. There are a wide array of personal reasons that would not be linked to their experience in their previous role.</p>
<p>On the other hand, the precision score suggests that there was actually a pretty low number of false positives: people we predicted would leave but who didn't. So this score is promising. Can we improve upon our precision score by selecting a better number of features or incorporating categorical data?</p>
<h2 id="heading-optimising-the-number-of-features-used-by-the-model">Optimising the number of features used by the model</h2>
<p>A <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html">SelectKBest</a> feature selector scores every feature against the target and keeps the k highest-scoring ones, which we can then use to transform our test and train data and refit our model. In the absence of clear individual correlations (see the earlier heatmap), this is a really quick way to home in on specific features to include in or exclude from the test/train data that might be negatively impacting the performance of my model.</p>
<p>By default, <em>SelectKBest</em> uses <code>score_func=f_classif</code>, which computes the ANOVA F-value between each feature and the target and is intended for classification tasks, so I won't be adjusting this parameter.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_k_best</span>(<span class="hljs-params">X_train,y_train,X_test</span>):</span>
   <span class="hljs-string">'''
   Plots a chart of scores for all dataset features using SelectKBest
   INPUTS:
   train/test split data
   OUTPUTS:
   transformed train/test split data
   Bar chart of scores for all features
   '''</span>
   fs = SelectKBest(k=<span class="hljs-string">"all"</span>)<span class="hljs-comment"># select all features</span>
   fs.fit(X_train,y_train)
   X_train_fs = fs.transform(X_train)
   X_test_fs = fs.transform(X_test) <span class="hljs-comment"># transform test input data</span>

   <span class="hljs-comment"># loop through features and print scores   </span>
   <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(fs.scores_)):
      print(<span class="hljs-string">'Feature %d: %f'</span> % (i, fs.scores_[i])) 
   <span class="hljs-comment"># plot the scores</span>
   plt.bar([i <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(fs.scores_))], fs.scores_)
   plt.xlabel(<span class="hljs-string">"Feature"</span>)
   plt.ylabel(<span class="hljs-string">"Score"</span>)
   plt.show()
   <span class="hljs-keyword">return</span> X_train_fs, X_test_fs
</code></pre>
<pre><code class="lang-python">plot_num = plot_k_best(X_train,y_train,X_test) <span class="hljs-comment"># plotting scores of all features in the train/test data</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704382869772/e48bbc16-da9f-464d-bb18-52a1817cf21e.png" alt class="image--center mx-auto" /></p>
<p>This chart shows that there are clearly some features that have <em>very</em> little impact on the model, but the most useful number is still a little unclear. After trying values in the range 10-20 I found 20 to be the optimal number of features. Refitting the model with this number of features improved it fairly significantly...</p>
<ul>
<li><p>The f1 score <strong><em>increased</em></strong> from 0.225 to <strong>0.324</strong></p>
</li>
<li><p>The precision score <strong><em>increased</em></strong> from 0.727 to <strong>0.857</strong></p>
</li>
<li><p>The recall score <strong><em>increased</em></strong> from 0.133 to <strong>0.200</strong></p>
</li>
</ul>
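<p>The refit step itself isn't shown above, so here is a minimal sketch of what it might look like. It runs on synthetic data from <code>make_classification</code> (24 features, to mirror the numerical frame) rather than the HR data, so the score it produces says nothing about the results quoted in this post; all of the variable names are my own.</p>
<pre><code class="lang-python">from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 24 numerical features; not the HR data
X_demo, y_demo = make_classification(n_samples=400, n_features=24, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=19)

# Keep the 20 highest-scoring features, fitting the selector on training data only
selector = SelectKBest(score_func=f_classif, k=20)
X_tr_fs = selector.fit_transform(X_tr, y_tr)
X_te_fs = selector.transform(X_te)

# Refit the classifier on the reduced feature set
model = RandomForestClassifier(random_state=17)
model.fit(X_tr_fs, y_tr)
demo_precision = precision_score(y_te, model.predict(X_te_fs))

kept = selector.get_support(indices=True)  # column positions of the retained features
print(X_tr_fs.shape, kept)
</code></pre>
<p>Fitting the selector on the training split only, and then applying the same transform to the test split, keeps information from the test data out of the feature selection.</p>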
<h2 id="heading-including-categorical-features-in-the-model">Including categorical features in the model</h2>
<p>In order to use my Random Forest Classifier with categorical values, I would need to encode them to numerical values. Given that there isn't any implied ordinality (natural rank) with my remaining categorical features, I opted to use <a target="_blank" href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">OneHotEncoder</a> for this task - this encoder pivots each of my categorical values to a binary column, so there is no implied rank - only a 1 for true, and a zero for false for each additional detail.</p>
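<pre><code class="lang-python">encoder = OneHotEncoder(sparse=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dense output; from scikit-learn 1.2 this parameter is named sparse_output</span>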
new_df = pd.DataFrame()

<span class="hljs-keyword">for</span> cat <span class="hljs-keyword">in</span> categorical_df.columns: <span class="hljs-comment"># loop through each column in categorical_df</span>
    values = asarray(df[cat].to_list()) <span class="hljs-comment"># list of column values as an array</span>
    values = values.reshape(len(values),<span class="hljs-number">1</span>) <span class="hljs-comment"># reshape array</span>
    result = encoder.fit_transform(values) <span class="hljs-comment"># fit encoder on values</span>
    cols = pd.DataFrame(result, columns=encoder.categories_[<span class="hljs-number">0</span>]) <span class="hljs-comment"># categories_ holds one array per encoded column, so take the first for column names</span>
    new_df=pd.concat([new_df,cols],axis=<span class="hljs-number">1</span>) <span class="hljs-comment"># add new columns to new_df</span>

X_cat = pd.concat([new_df.reset_index(drop=<span class="hljs-literal">True</span>),X.reset_index(drop=<span class="hljs-literal">True</span>)],axis=<span class="hljs-number">1</span>) <span class="hljs-comment"># concatenating new_df to previous X dataframe and resetting index to prevent index from memory persisting</span>
X_cat.columns = X_cat.columns.astype(str) <span class="hljs-comment"># ensuring that column titles are strings</span>
X_cat.head()
</code></pre>
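<p>As an aside, <em>OneHotEncoder</em> can also encode several columns in a single call, which avoids the loop. The sketch below is an alternative to the approach above, not what I ran for this post, and the rows are made up. It converts the default sparse output with <code>toarray()</code> so that it doesn't depend on the name of the sparse parameter.</p>
<pre><code class="lang-python">import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only the column names echo the data set
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Department": ["Sales", "Human Resources", "Sales"],
})

encoder = OneHotEncoder()                       # default output is a sparse matrix
encoded = encoder.fit_transform(toy).toarray()  # convert to a dense array

# get_feature_names_out() builds names of the form column_category
onehot = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(onehot)
</code></pre>
<p>Each row ends up with exactly one 1 per original column, which is a handy sanity check on the encoded frame.</p>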
<p>Pivoting this categorical data and appending it to the numerical data frame resulted in 56 features, and given the trial-and-error involved in isolating the best number of features with only 24 previously, I decided to iterate through each potential model-fitting and visualise which would yield the best results using <a target="_blank" href="https://plotly.com/">plotly</a>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dict_to_df</span>(<span class="hljs-params">title1,list1,title2,list2</span>):</span>
    <span class="hljs-string">'''
    creates a dictionary from two lists and converts that into a dataframe that can be used to plot values
    INPUTS:
    Title1, Title2 - strings that will form column titles
    list1, list2 - lists of values for each column
    OUTPUT:
    to_df - a dataframe
    '''</span>
    data = {title1: list1, title2: list2} <span class="hljs-comment"># dictionary of two lists (named to avoid shadowing the built-in dict)</span>
    to_df = pd.DataFrame(data) <span class="hljs-comment"># dictionary to dataframe</span>
    <span class="hljs-keyword">return</span> to_df
</code></pre>
<pre><code class="lang-python">fs_scores = [] <span class="hljs-comment"># empty list for scores</span>
fs_list = [] <span class="hljs-comment"># empty list for features numbers</span>
model = RandomForestClassifier(random_state=<span class="hljs-number">17</span>)
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">56</span>): <span class="hljs-comment"># iterate through features</span>
    <span class="hljs-comment">#transform train/test with the number of features</span>
    fs = SelectKBest(score_func=f_regression, k=(i+<span class="hljs-number">1</span>))
    fs.fit(X_cat_train, y_cat_train)
    X_cat_train_fs = fs.transform(X_cat_train)
    X_cat_test_fs = fs.transform(X_cat_test)
    model.fit(X_cat_train_fs, y_cat_train) <span class="hljs-comment"># fit to model</span>
    pred = model.predict(X_cat_test_fs)
    <span class="hljs-comment">#append score and feature number to lists</span>
    fs_scores.append(model.score(X_cat_test_fs, y_cat_test)) 
    fs_list.append(i+<span class="hljs-number">1</span>)

df_fs = dict_to_df(<span class="hljs-string">'Num_Features'</span>,fs_list,<span class="hljs-string">'Score'</span>,fs_scores) <span class="hljs-comment"># create a dataframe from a dictionary of the two lists</span>

fig = px.line(df_fs, x=<span class="hljs-string">"Num_Features"</span>, y=<span class="hljs-string">"Score"</span>) <span class="hljs-comment"># plotting number of features against score</span>
fig.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704384204878/75af7c1d-bdc1-4a1c-aad7-a3e714fa34d4.png" alt class="image--center mx-auto" /></p>
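<p>One thing worth keeping in mind when reading this chart: for a classifier, <code>model.score()</code> returns mean accuracy, which is a different measure from the f1, precision and recall scores quoted elsewhere in this post. The sketch below uses ten made-up labels to show how far apart these metrics can sit when most of the labels are negative.</p>
<pre><code class="lang-python">from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 8 stayers (0) and 2 leavers (1); the model catches only one leaver
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that were right
precision = precision_score(y_true, y_pred)  # share of predicted leavers who really left
recall = recall_score(y_true, y_pred)        # share of real leavers that were caught

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
</code></pre>
<p>Swapping <code>model.score()</code> for one of these metric functions inside the loop would be one way to plot the metric of most interest directly.</p>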
<p>Refitting my model with train/test data that included categorical features pivoted using <em>OneHotEncoder</em>, and transformed according to the k best features suggested in the chart above, yielded the following results:</p>
<p>For k=36 features, the model performed slightly worse than previously:</p>
<ul>
<li><p>The f1 score <strong><em>decreased</em></strong> from 0.324 to <strong>0.301</strong></p>
</li>
<li><p>The precision score <strong><em>decreased</em></strong> from 0.857 to <strong>0.846</strong></p>
</li>
<li><p>The recall score <strong><em>decreased</em></strong> from 0.200 to <strong>0.183</strong></p>
</li>
</ul>
<p>At k=38 the precision score actually reached 1.000, which suggests that the model could be overfitting.</p>
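<p>For reference, the three metrics quoted above can be computed with scikit-learn's <code>metrics</code> module. This is a minimal sketch using made-up labels (not the attrition data) to show how a model can pair high precision with low recall: it rarely predicts a leaver, but is right when it does.</p>
<pre><code class="lang-python">from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = leaver, 0 = stayer (illustrative values only)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, round(f1, 3))
</code></pre>
<p>Here there is one true positive, no false positives and four missed leavers, giving a precision of 1.0, a recall of 0.2 and an f1 score of roughly 0.333.</p>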
<p>Overall we can conclude that including the categorical variables does not improve the model. Can a little trial and error with our original model and some additional parameters make a difference, though?</p>
<h2 id="heading-refining-the-model-parameters">Refining the model parameters</h2>
<p>So, going back to our initial model, but using 20 features, could we simply adjust the parameters for the Random Forest Classifier to improve the model at all?</p>
<p>This process was more trial and error again, and I found that most adjustments made the model worse. One change improved it very slightly, though: setting <code>max_depth</code> to 10 rather than leaving it at the default of None, which lets each tree grow without limit. This parameter caps how many levels deep each decision tree can grow, so it does make sense that restricting it could refine our model.</p>
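<p>To illustrate what <code>max_depth</code> does, here is a minimal sketch that fits two forests on synthetic data (an assumption, standing in for the attrition data) and reports the depth of the deepest tree in each. The sample size and number of trees are arbitrary.</p>
<pre><code class="lang-python">from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 20 features, mirroring the feature count used in the article
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=17)

# Default max_depth=None: trees keep splitting until their leaves are pure
unlimited = RandomForestClassifier(n_estimators=20, random_state=17).fit(X, y)
# max_depth=10: no tree may grow beyond ten levels
capped = RandomForestClassifier(n_estimators=20, max_depth=10,
                                random_state=17).fit(X, y)

# Depth actually reached by the deepest tree in each forest
deepest_unlimited = max(tree.get_depth() for tree in unlimited.estimators_)
deepest_capped = max(tree.get_depth() for tree in capped.estimators_)
print(deepest_unlimited, deepest_capped)
</code></pre>
<p>Shallower trees memorise less of the training data, which is one reason a cap can help when overfitting is a concern.</p>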
<p>The chart below shows the differing performance of each model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704467864618/49b8b36a-3dec-4e70-bb9e-67e0e4d7f995.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<h3 id="heading-what-worked-well">What worked well?</h3>
<ul>
<li><p>Investing time in cleansing and visualising the data upfront left me better equipped to understand what approaches might work best, and where the data could potentially impact performance.</p>
</li>
<li><p>Researching the best model to use and why saved me a lot of time.</p>
</li>
<li><p>Selecting the best number of features to use improved my model's performance, and helped me to understand the impact of introducing categorical data.</p>
</li>
<li><p>Adjusting parameters also improved performance.</p>
</li>
</ul>
<h3 id="heading-what-could-have-gone-better">What could have gone better?</h3>
<ul>
<li><p>If the data set had shown clearer correlations, that might have made the number of features to use more immediately apparent.</p>
</li>
<li><p>I'd have expected the introduction of the categorical data to have more of an impact than it did.</p>
</li>
</ul>
<h3 id="heading-next-steps">Next steps</h3>
<p>If I were to continue with/repeat this experiment, here are a few considerations that I might make:</p>
<ul>
<li><p><strong>Increasing the size/balance of my data set</strong> - only 16% of rows in the data set related to attrition, and whilst I accounted for that by checking the values in my <code>y_test</code> and <code>y_train</code> data once split, having more leavers in the data set might help to improve predictions.  </p>
<p>  I also thought initially that only having Performance Ratings of 3 and 4 in my data set suited the problem that I was looking to solve (retaining high performers), but perhaps including the lower performers might have strengthened my predictive model by including more leavers by default.</p>
</li>
</ul>
<p>Finally, I identified a potential risk of overfitting, so having more data from which to sample would help to check on that.</p>
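<p>On the point about balance, one way to guarantee that the train and test sets carry the same share of leavers is a stratified split, rather than checking the values afterwards. This is a minimal sketch, assuming a made-up target with 16% leavers; the <code>stratify</code> argument is a suggestion here, not what the original experiment used.</p>
<pre><code class="lang-python">import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target: 16 leavers (1) and 84 stayers (0)
y = np.array([1] * 16 + [0] * 84)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=17)

print(y_train.mean(), y_test.mean())
</code></pre>
<p>Both splits end up with 16% leavers: 12 of the 75 training rows and 4 of the 25 test rows.</p>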
<ul>
<li><strong>Stacking Classifiers</strong> is another enhancement that I could make. Whilst the Random Forest Classifier performed well, perhaps stacking it with another model would boost it.</li>
</ul>
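<p>As a rough idea of what stacking could look like, here is a minimal sketch using scikit-learn's <code>StackingClassifier</code> on synthetic data. The choice of logistic regression as both the second base model and the final estimator is purely an assumption for illustration; I haven't tested this against the attrition data.</p>
<pre><code class="lang-python">from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=17)

# Base models make predictions; the final estimator learns how to combine them
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=25, max_depth=10,
                                          random_state=17)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # out-of-fold predictions are used to train the final estimator
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(round(accuracy, 3))
</code></pre>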
]]></content:encoded></item><item><title><![CDATA[Scouting for Talent in Women’s Football]]></title><description><![CDATA[Any fantasy football enthusiast or Football Manager player will be aware of the astonishing volume of statistics available for men’s football. Databases for the men’s game are, in fact, so extensive that data sets used by computer games such as Footb...]]></description><link>https://yvonnemcg.hashnode.dev/scouting-for-talent-in-womens-football</link><guid isPermaLink="true">https://yvonnemcg.hashnode.dev/scouting-for-talent-in-womens-football</guid><category><![CDATA[Womens Football]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Linear Regression]]></category><category><![CDATA[football]]></category><category><![CDATA[analytics]]></category><dc:creator><![CDATA[Yvonne McGillicuddy (UK)]]></dc:creator><pubDate>Thu, 06 Jul 2023 23:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/rOYAYXHgEh4/upload/8025525ea0453f1103f549706f69e21b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Any fantasy football enthusiast or Football Manager player will be aware of the astonishing volume of statistics available for men’s football. Databases for the men’s game are, in fact, so extensive that data sets used by computer games such as Football Manager have even been used as preliminary research by professional football scouts — why go to the trouble of flying across the world to see players of interest in person when you can first check them out from the comfort of your home office?</p>
<p>Unfortunately, the women’s game is a number of decades behind — in part due to a 50-year-long ban on women playing stadium games in the UK, and similar restrictions in other countries around the world. Data are no exception to this deficit. The data set that I’m working through for this blog may represent 1328 players, but each has only been tracked for a maximum of 11 matches… and it was the most comprehensive dataset that I could find!</p>
<p>So — to put ourselves in the shoes of a hypothetical football scout — how can we work with the data available to produce positive outcomes for our club?</p>
<p>First things first: <strong>we need a team!</strong></p>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/0*Aig6OeN_858os8kI" alt /></p>
<p>Visualised as a bar chart, we can see a huge gap between the leading goal scorers (Barcelona) and Leicester City WFC. In reality, this works out to 42 goals for Barcelona vs. <strong><em>just one goal</em></strong> for Leicester City WFC.</p>
<p>Clearly, “The Foxes” need our help!</p>
<hr />
<p>In an ideal world, as scout for Leicester City WFC we might use the analysis of goal scorers and their clubs above to target talent at Barcelona, but in reality we probably don’t have the budget to lure their top goalscorers away, so <strong>how — instead — might we make use of observable features in the game to spot goalscoring potential?</strong></p>
<p>The obvious thing to look out for would be chances on goal, and the visual below breaks down top <em>shooters</em> in the data set, rather than scorers. Already we’re seeing that Barcelona don’t have the monopoly on opportunity!</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/0*SPQMA-4D2KsJA-bV" alt /></p>
<p>If I were on the lookout for a more viable target than Barcelona’s top goalscorer, I might look to younger players; players that may still have a thing or two to learn, and room left in their career to develop significantly.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1050/0*cBRVXoKdfLPoBBeO" alt /></p>
<p><em>Benedetta Glionna</em> of <em>AC Roma</em> looks interesting in this regard — with 0 goals to show for her 13 shots on target, but only 23 years of age.</p>
<p>We can pick out other features, though, the presence of which would seem to suggest that goals are not far away…</p>
<ul>
<li><p><code>Touches_AttPen</code>: Touches in the attacking penalty area</p>
</li>
<li><p><code>Touches_Att3rd</code>: Touches in the attacking third</p>
</li>
<li><p><code>Dist</code>: Distance</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:930/0*y9iK1EVRx3cpMkWE" alt /></p>
<p><em>Note: The heatmap shows us that touches in the attacking third actually correlate more strongly with shots on target (and assists, which we will see later).</em></p>
<p>Interestingly, distance appears to be slightly negatively correlated, all of which suggests that our scout should be looking out for players…</p>
<ul>
<li><p><strong>situated in the attacking third, but taking touches in the penalty area</strong></p>
</li>
<li><p><strong>frequently shooting at goal</strong></p>
</li>
<li><p><strong>restricting their movement to key areas — running the length of the pitch too much seems to be detrimental to goalscoring.</strong></p>
</li>
</ul>
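<p>The correlations behind a heatmap like the one above can be produced with pandas. This is a minimal sketch with made-up numbers: the feature names follow the article, but the <code>Gls</code> column name and every value are assumptions for illustration.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical per-player rows; values are invented to show the mechanics
players = pd.DataFrame({
    "Gls": [0, 1, 2, 3, 5],
    "Touches_AttPen": [2, 5, 9, 12, 20],
    "Touches_Att3rd": [10, 18, 25, 30, 41],
    "Dist": [19.0, 17.5, 16.0, 15.5, 12.0],
})

# Pearson correlation of every feature with goals, strongest first
corr_with_goals = (
    players.corr()["Gls"].drop("Gls").sort_values(ascending=False)
)
print(corr_with_goals)
</code></pre>
<p>In this toy data the two touch counts rise with goals and <code>Dist</code> falls, so the first two correlations come out positive and the last negative.</p>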
<hr />
<p>With our ideal player having maximum impact within the penalty area, we need to get the ball to them, so <strong>let’s take a closer look at assists…</strong></p>
<p><img src="https://miro.medium.com/v2/resize:fit:786/0*rPWSuhN1PjyGZZA3" alt /></p>
<p>This heat map shows the strongest relationships to assists in our dataset:</p>
<ul>
<li><p><code>KP</code>: Key Passes</p>
</li>
<li><p><code>SCA</code>: Shot creation actions</p>
</li>
<li><p><code>GCA</code>: Goal creation actions</p>
</li>
</ul>
<p>…but these features are far more difficult to evaluate in real-time scouting than shots on target, and touches in the penalty area! If we could get access to a player’s recent performance statistics, though, <strong>could we use the full dataset to predict potential for assists, even if that player hasn’t <em>actually</em> contributed any yet?</strong></p>
<p><strong>The short answer is: yes!</strong></p>
<p>We can indeed predict potential for assists with a reasonable degree of accuracy.</p>
<p>Using the values for <code>KP</code>, <code>SCA</code>, and <code>GCA</code> and their relationship to <code>Ast</code>, we can train a linear regression model, and then check that model against the remainder of the data set to see how it fared.</p>
<p>In this instance I trained on 75% of the cleansed dataset, which yielded a prediction accuracy of 67% across the remaining 25% of data.</p>
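<p>The train/test workflow described above can be sketched as follows. The data here are randomly generated (an assumption, not the football data set), and the relationship between the columns is invented so that the model has something to learn. Note that <code>LinearRegression.score</code> returns the coefficient of determination (R²); whether that is the 67% figure quoted above is an assumption on my part.</p>
<pre><code class="lang-python">import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(17)
n = 200

# Hypothetical player statistics using the column names from the article
df = pd.DataFrame({
    "KP": rng.poisson(3, n),
    "SCA": rng.poisson(5, n),
    "GCA": rng.poisson(1, n),
})
# Made-up relationship plus noise, purely for illustration
df["Ast"] = (0.1 * df["KP"] + 0.05 * df["SCA"] + 0.6 * df["GCA"]
             + rng.normal(0, 0.3, n))

# Train on 75% of the rows and hold back the remaining 25%
X_train, X_test, y_train, y_test = train_test_split(
    df[["KP", "SCA", "GCA"]], df["Ast"], train_size=0.75, random_state=17)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R squared on the held-out rows
print(round(r2, 3))
</code></pre>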
<p>Now, 67% is by no means a spectacular accuracy rate and, as this article has explored, there were some notable shortcomings in the data. It didn’t span nearly enough matches for my liking, and there were comparatively few fields relating to defensive attributes (as a football fan, I particularly noticed the absence of clean sheet statistics).</p>
<p>All of which highlights the need for more data in the growing Women’s game. At this stage we should be looking to collect statistics comparable with the men’s game in order to be able to achieve more with data analysis and help those behind the scenes to spot trends and drive meaningful change.</p>
<p>If this article has piqued your interest in Women’s football, why not look up your local team and ways you can support them?</p>
<p>Please also feel free to <a target="_blank" href="https://github.com/ymcgillicuddy/female-football-scout">take a look at the companion code to this article on GitHub</a>.</p>
]]></content:encoded></item></channel></rss>