Day 11 of 100 Days of AI

Logistic regression continued.

In the lab portion of the intro to ML course today, I went through an exercise of running a logistic regression analysis on fictional customer data. I’ve put the code on GitHub here.

The model is structured as follows:

logit(p) = -0.2675 + (-0.1526 * tenure) + (-0.0791 * age) + (-0.0721 * address) + (-0.0196 * income) + (0.0519 * ed) + (-0.0950 * employ) + (0.1601 * equip)
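To make the formula concrete, here’s a quick sketch of how the logit converts into a churn probability via the sigmoid function. The customer values below are made up, and I’m assuming they’re on the normalized scale the model was trained on.

```python
import math

# Coefficients from the fitted model above (intercept first).
intercept = -0.2675
coefs = {
    "tenure": -0.1526, "age": -0.0791, "address": -0.0721,
    "income": -0.0196, "ed": 0.0519, "employ": -0.0950, "equip": 0.1601,
}

# Hypothetical customer, expressed in *normalized* feature values,
# since the model was trained on a scaled feature set.
customer = {
    "tenure": -1.0, "age": 0.2, "address": -0.5,
    "income": 0.1, "ed": 0.8, "employ": -1.2, "equip": 1.0,
}

# logit(p) = intercept + sum(coefficient * feature value)
logit = intercept + sum(coefs[f] * customer[f] for f in coefs)

# The sigmoid converts the logit back into a churn probability.
p_churn = 1 / (1 + math.exp(-logit))
print(f"logit = {logit:.4f}, churn probability = {p_churn:.2%}")
```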

A visual representation of the impact of the coefficients on churn is summarized in this chart.

And here’s the performance of the model, illustrated with a confusion matrix.

The basic steps to produce the model were as follows:

  1. Load the dataset from a CSV file.
  2. Select the features we want to use for predictions. These were: tenure, age, address, income, education, employment status, equipment, and churn status.
  3. Preprocess the data. We did just two bits of preprocessing here: (a) make sure the churn column has just integers and (b) normalize the feature set.
  4. Split the dataset into training and testing sets.
  5. Train a logistic regression model using the training data.
  6. Make predictions on the test data.
  7. Evaluate the performance of the model using a confusion matrix, classification report, and log loss.
  8. I also added a bar graph that charts the coefficients so we can see which features have the greatest impact on churn.

I still find it incredible that, if you can write some code, you can build a simple machine learning model in just a few lines, per the example below.
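Here’s a minimal sketch of those eight steps, assuming a hypothetical churn_data.csv with the column names used above (the actual lab code is in the GitHub repo):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, log_loss

# 1-2. Load the data and select the features plus the churn label.
df = pd.read_csv("churn_data.csv")   # hypothetical file name
features = ["tenure", "age", "address", "income", "ed", "employ", "equip"]
X = df[features]
y = df["churn"].astype(int)          # 3a. make sure churn is an integer

# 3b. Normalize the feature set.
X = StandardScaler().fit_transform(X)

# 4. Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# 5. Train a logistic regression model on the training data.
model = LogisticRegression().fit(X_train, y_train)

# 6. Make predictions on the test data.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)

# 7. Evaluate with a confusion matrix, classification report, and log loss.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("log loss:", log_loss(y_test, y_prob))

# 8. Bar chart of the coefficients to see which features drive churn most.
plt.bar(features, model.coef_[0])
plt.title("Logistic regression coefficients")
plt.show()
```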

Day 10 of 100 Days of AI

Logistic Regression. On day 1 I worked with linear regression — a statistical technique to predict a continuous value based on another variable or set of variables. In the last few days I’ve been wrestling with logistic regression. This is a statistical technique that classifies data into a category. It also provides a probability for that classification. Examples include:

  • Customer retention (will a customer churn or retain?)
  • Relevant prescription (should we issue Drug A, or Drug B, or Drug C to a patient?)
  • Machine performance (given certain conditions will a machine fail or continue to operate?)

I got the intuition for how this works with the help of this YouTube video. The images below highlight the process well without getting into detailed maths.

The key bit: instead of fitting a straight line (as is the case with linear regression), in logistic regression we fit a “sigmoid curve” to the data. This curve has an S-shape, as you can see below.
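A few lines of Python are enough to draw the curve itself, so you can see the S-shape for yourself:

```python
import numpy as np
import matplotlib.pyplot as plt

# The sigmoid (logistic) function that gives the curve its S-shape.
x = np.linspace(-6, 6, 200)
sigmoid = 1 / (1 + np.exp(-x))

plt.plot(x, sigmoid)
plt.axhline(0.5, linestyle="--", linewidth=0.8)  # 50% probability threshold
plt.xlabel("x")
plt.ylabel("probability")
plt.title("Sigmoid curve")
plt.show()
```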

The next image shows that if we look at cases with an x-value in the range of 1-2, there are 4 healthy people and 1 unhealthy person, so the chance of being ill is 20% (1 in 5). In the x-value range of 4-5, there is 1 healthy person and 6 unhealthy people, so the chance of being ill is around 86% (6 in 7).

However, chunking the data into ranges marked off by vertical lines isn’t precise enough. So instead we fit an “S” curve. On the right chart below, you can see how a sigmoid curve would allow you to take any x-value and estimate the probability of illness. (The data here is fictitious.)

There’s a bunch of maths that turns the above into a machine learning model that can take multiple independent variables and generate a classification prediction. This includes “gradient descent”, a term you’ll hear about a bunch when exploring machine learning. But I won’t be digging into this. I just want to build some models and learn the basics. That’s what I do next.

Day 9 of 100 Days of AI

Python is the most popular machine learning programming language, and across all general domains it is the second most popular, behind JavaScript. I’m probably at an intermediate skill level (but certainly on the lower end of that bracket) and I’ve decided to brush up on it regularly.

To keep practising Python, I got a copy of Python Workouts and started working through it today. I’ll use this to supplement the intro to ML course I’m doing.

Day 8 of 100 Days of AI

No coding today, but I did enjoy watching this video with computer scientists explaining machine learning in 5 levels of difficulty. One line in particular struck a chord with me:

“We can now do in 5 lines of code, something that would have taken 500 lines of very mathematical messy gnarly code,” says computer scientist Hilary Mason.

I’ve experienced this first-hand with some of the labs I’ve been doing and I’m surprised at how accessible this is. What a great time to be learning about ML!

Day 7 of 100 Days of AI

Decision Trees. I went through the intro to ML material on decision trees this morning.

I created some data in Excel and built a simple decision tree machine learning model that predicts whether someone works in tech (value 1) or not (value 0). The data is synthetic, and I shaped it so that the ‘salary’ feature was the key predictor rather than age, sex, or region. Here’s the output tree.

My model had an accuracy of 77%. The confusion matrix below provides more on performance. For example, the prediction ‘precision’ of whether someone works in tech roles was quite good, at 85.7% (12 true positives and 2 false positives out of 14 positive predictions).

However, the model was less good at identifying people who don’t work in tech: only 68.8% of its ‘non-tech’ predictions were correct (11 true negatives out of 16 negative predictions, with 5 people who actually do work in tech mistakenly identified as non-tech).
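As a quick sanity check, here are those figures recomputed from the confusion-matrix counts. Strictly speaking, the 68.8% figure is the negative predictive value, i.e. the share of ‘non-tech’ predictions that turned out to be correct.

```python
# Recomputing the metrics quoted above from the confusion-matrix counts.
tp, fp, tn, fn = 12, 2, 11, 5

accuracy  = (tp + tn) / (tp + fp + tn + fn)   # 23/30 ≈ 0.77
precision = tp / (tp + fp)                    # 12/14 ≈ 0.857
npv       = tn / (tn + fn)                    # 11/16 ≈ 0.688

print(f"accuracy:  {accuracy:.1%}")
print(f"precision: {precision:.1%}")
print(f"negative predictive value: {npv:.1%}")
```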

Key takeaways:

  • Decision trees are a supervised machine learning technique for classifying data (via classification trees) or predicting numeric values (via regression trees).
  • As you can see from the first chart, decision trees are appealing because they are easy to interpret: you can follow the steps to see how a classification was made. Cases where this is useful include:
    • Finance situations, e.g. Loan application decisions, investment decisions
    • Healthcare situations, e.g. Diagnosis by going through symptoms and other features
    • Marketing, e.g. Customer segmentation via some attributes, churn prediction.
  • How do you create decision trees?
    • The easiest thing to do is to use a Python library that does this for you. Here are some simple examples I did in Python, and there’s a minimal sketch after this list.
    • Otherwise, the general process revolves mainly around feature selection. This is as follows:
      • Start at the root node.
      • Find the best feature that splits the data according to a metric, such as ‘information gain’ or ‘gini impurity’. You do this by going through all the features and splitting your data according to each one in isolation to see how well the data is split. Once you find the best feature, you build a branch with that feature.
      • You then repeat the above process, but below the previous branch and with a subset of data.
      • You keep going until you’re happy with the depth (note that if you go too deep, you might have issues with overfitting).
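Here’s a minimal sketch of that in scikit-learn, assuming a hypothetical tech_jobs.csv with the features described above (and with sex and region already numerically encoded):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical file and column names, standing in for my Excel data.
df = pd.read_csv("tech_jobs.csv")
X = df[["age", "sex", "region", "salary"]]   # assumed already numeric/encoded
y = df["works_in_tech"]                      # 1 = tech, 0 = non-tech

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# The library handles finding the best split at each node for us, using
# Gini impurity by default (criterion="entropy" switches to information gain).
# max_depth limits how deep the tree grows, which helps avoid overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Visualize the fitted tree so each split can be followed by eye.
plot_tree(tree, feature_names=list(X.columns),
          class_names=["non-tech", "tech"], filled=True)
plt.show()
```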

Day 6 of 100 Days of AI

Yesterday, I skimmed through k-fold cross-validation. I wanted to return to it today because I still wasn’t sure about what exactly we are assessing with it, and why.

Here’s the clearest YouTube video I found on the topic. The screenshot below from that video provides an easy summary.

Key takeaways:

  • K-fold cross-validation is a method of evaluating the performance of a specific model configuration.
    • Model configuration includes the choice of algorithm (e.g. linear regression, decision trees, KNNs, neural networks), feature selection (e.g. the independent variables used to predict the dependent variable), and hyperparameters (e.g. the number of neurons in a neural network).
  • To perform k-fold cross-validation, you (see the code sketch after this list):
    • (1) split the data into ‘k’ equal parts (the “folds”);
    • (2) you train k-number of models using your chosen ‘model configuration’ across the k-number of folds; for example in the image above, model 1 is trained on folds 2-5 and tested on fold 1; model 2 is trained on folds 1 and 3-5, and tested on fold 2 and so forth;
    • (3) you then assess the performance of each model across the k-folds and aggregate this performance measure, usually using an average (or median, depending on the data type);
    • (4) the final accuracy tells you how good your model configuration is.
  • If your aggregated accuracy is strong, you can then train your model on the whole data set.
  • K-fold cross validation is useful in situations where you have a small amount of data and a small test set. In those instances, the opportunity to test your model is limited, and k-fold can boost that, allowing you to test a model configuration across a larger number of scenarios.
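Here’s a minimal code sketch of those four steps in scikit-learn, using a built-in dataset and logistic regression purely as a stand-in model configuration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The 'model configuration' being evaluated: preprocessing + algorithm + hyperparameters.
model = make_pipeline(StandardScaler(), LogisticRegression())

# (1) Split the data into k = 5 folds; (2)-(3) train and test 5 models,
# one per fold, collecting each one's accuracy.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

# (4) Aggregate the fold scores to judge the configuration as a whole.
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))

# If the aggregated score looks good, train on the whole dataset.
final_model = model.fit(X, y)
```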

Day 5 of 100 Days of AI

Today I learnt about cross validation (scikit-learn has Python helper functions for this.) This is where you split your training and testing data into different sets, and then you iteratively train and test against different combinations to assess how well a model performs on unseen data.

Two common methods of cross validation are k-fold validation and leave-one-out cross validation.

K-fold involves splitting the data into equal parts and rotating which part is used for testing while the rest are used for training.

With leave-one-out cross validation, you select one data point for testing, and use the rest for training. You then move to another data point for testing, and then use the remainder for training, and so forth, until every data point has been used for testing.
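For completeness, here’s a minimal sketch of leave-one-out cross-validation in scikit-learn, again using a small built-in dataset as a stand-in, so every data point gets a turn as the test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# One model per data point: train on n-1 points, test on the one left out.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))          # equals the number of data points
print("mean accuracy:", scores.mean().round(3))
```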

I’ll return to these concepts when I write some more code next week.

PS: One thing I’m realising as I dig into the basics of machine learning is that choosing the techniques that may produce the best models is a mix of art and science. There’s a bunch of trial and error, even though it’s a deeply rigorous and mathematical field.

Day 4 of 100 Days of AI

Today, I went through a classifier lab on the intro ML course. There are several bits I didn’t quite understand, but GPT helped get me over the basics. For example, I will need to review my notes on the Jaccard Index and F1-score (evaluation metrics for classifier models), and the concept of normalisation, where you rescale your data without changing the shape of its distribution. This puts features on a comparable scale and makes it easier to calculate meaningful distances between points, a critical bit when trying to make classification predictions.

On the latter point, I’ve included some charting code in the GitHub repo here (see image below), which helped me understand the normalisation concept. The charting code was written by GPT, with some minor tweaks from me.

Key takeaways:

  • Classification is a supervised machine learning approach.
  • It makes a prediction about what discrete class some item should fall into.
  • Classifiers can be used for spam detection, document classification, speech recognition, or even to predict if a certain customer will churn, based on a variety of characteristics.
  • Classification algorithms include k-nearest neighbour (which I’ve put on github here), decision trees, and logistic regression (which instead of putting an item into a class, gives you a probability that it will fit a particular bucket.)
  • The K-nearest neighbours algorithm was fun to learn about, and the intuition for it is simpler than I expected. The basic notion is as follows: for a given item to predict on, look at a select number of neighbours (the k-number), and predict the outcome based on the most popular category those neighbours fall into (or, for numeric predictions, the mean or median of the neighbours’ values for the thing you’re predicting, e.g. house price based on location, square footage, etc.). There’s a minimal sketch of this after the list.
  • Classification algorithms can be evaluated with a number of accuracy measures, such as the Jaccard Index, an F1-score, or Log Loss. I didn’t cover these in detail but I did enough to get the very basics.
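Here’s a minimal sketch of the KNN plus normalisation ideas above, using scikit-learn’s built-in iris dataset as a stand-in for the course data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)

# Normalise so that no single feature dominates the distance calculation.
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# k = 5: each prediction is the most popular class among the 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average="weighted"))
```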

Day 3 of 100 Days of AI

Today I continued with the IBM intro ML course, and got to build a multilinear regression model. This time I tried to do it with data outside of the class. I found some data on IMDB movies that also had a Metacritic score and a number of other features. Below is a scatter graph of just the IMDB and metascore ratings.

I then went through the usual steps of preparing the data and running a regression, but this time with an extra independent variable. You can see it in the code below, line 29.

This model had a variance score of just 0.58 so it’s not of much use. But it was fun to test something I’d just learnt and with a different data set. Below, I run the model with Dune’s IMDB rating from 2021.

The model predicted a Metacritic score of 71. The actual rating today is 74. But you don’t need a fancy model to know that IMDB ratings are usually in line with Metacritic scores!
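For reference, here’s a minimal sketch of the kind of multilinear regression described above. The file name, the second feature (runtime), and the example film values are just placeholders; my actual dataset had different columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical file and column names.
df = pd.read_csv("movies.csv").dropna(subset=["IMDB_Rating", "Runtime", "Meta_score"])

X = df[["IMDB_Rating", "Runtime"]]   # independent variables (predictors)
y = df["Meta_score"]                 # dependent variable we want to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

model = LinearRegression().fit(X_train, y_train)
print("variance score (R^2):", r2_score(y_test, model.predict(X_test)))

# Predict a Metascore for a hypothetical film with an 8.0 IMDB rating
# and a 155-minute runtime.
new_film = pd.DataFrame([[8.0, 155]], columns=["IMDB_Rating", "Runtime"])
print("predicted Metascore:", model.predict(new_film)[0])
```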

Day 2 of 100 Days of AI

I spent bits of today watching YouTube videos about linear regression. One of the best ones I found was this StatQuest video on the topic. I enjoyed his style so much, I ordered his introductory book on Machine Learning. I’m planning a digital fast over a long weekend in the second quarter of the year, and I’ll take the book with me to continue learning offline.

Today I learnt about a new concept: Overfitting.

Key takeaways today:

  • Overfitting: This is when a machine learning model fits too closely to the data it was trained on. So much so that the model doesn’t generalize well to data outside of the training set (i.e. new data). There’s a small illustration in code after this list.
  • Why does overfitting happen? A few drivers:
    • Small training data set — limited training examples might throw up patterns that don’t account for the true distribution of the data.
    • High model complexity — having too many parameters might throw up noise or patterns that are not useful.
    • Biased data — this could throw up patterns that are not generalizable across a wider population of data.
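To see overfitting in action, here’s a small illustration: a very flexible model fitted to a handful of noisy points typically nails the training data but does much worse on new data, while a simpler model generalizes better.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    # A simple true pattern (a sine wave) plus some noise.
    x = rng.uniform(0, 1, n).reshape(-1, 1)
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(12)   # small training set
x_test, y_test = make_data(200)    # "new" data the model hasn't seen

for degree in (1, 3, 15):
    # Higher degree = more parameters = more flexibility (and more risk of overfitting).
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```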