Day 15 of 100 Days of AI

K-Means Clustering.

Today I learnt about k-means clustering. This is another one of those fancy-sounding machine learning techniques that seems intimidating to the uninitiated. But actually, the intuition behind it is simple. Thanks to YouTube, you can understand the rudimentary elements of this unsupervised machine learning algorithm fairly quickly. Here’s a video that got me over that hurdle.

Tomorrow, I’ll train a model using this technique.

Key takeaways:

  • Machine learning has so many techniques that it seems to me part of solving any problem with ML is first figuring out what data you need, and then researching and running experiments to figure out what the best algorithm to use is.
  • Some of these algorithms are very old. Even when it comes to the latest breakthroughs with large language models — Geoffrey Hinton was using similar methods as far back as 1985 (albeit at a fraction of the scale today).

Day 14 of 100 Days of AI

More on data.

I read a bit more about the data problem in AI after writing about it yesterday. People smarter than I am believe there are diminishing returns to training AI models on ever-growing datasets. In this research publication, AI researchers share the following summary:

In layman’s terms (thanks to a ChatGPT translation), the above says the following:

“Our research looked into how well advanced models, trained on vast internet data, can understand and respond to new tasks without prior specific training. We discovered that these models, contrary to expectations, require significantly more data to slightly improve at these new tasks, showing a slow and inefficient learning process. This inefficiency persists even under varied testing conditions, including completely new or closely related data to their training. Moreover, when faced with a broad range of uncommon tasks, the models performed poorly. Our findings challenge the notion of “zero-shot” learning—where models instantly adapt to new tasks without extra training—highlighting a gap in their supposed ability to learn efficiently from large datasets.”

ChatGPT summary of the abstract from this paper.

Key takeaway:

  • Looks like just throwing more data at models will eventually reach some limit in terms of performance. At that point (or before it) we’ll need alternative techniques and new breakthroughs. That’s what AI expert Gary Marcus believes. As he notes in his latest newsletter, “There will never be enough data; there will always be outliers. This is why driverless cars are still just demos, and why LLMs will never be reliable.”

Day 13 of 100 Days of AI

On Data.

One thing that’s clear as I work through this 100-day challenge is how important data is for training AI. We need lots of it, and apparently all the world’s data that’s available on the internet might not be enough for building more advanced models. That’s the lead of this article from the Wall Street Journal.

According to the article and some of the researchers interviewed, if GPT-4 was trained on 12 trillion tokens (the fragments of words that large language models learn from), GPT-5 might need 60-100 trillion tokens on the current trajectory. Even if we used all the text and image data on the internet, we’d still be 10-20 trillion tokens (or more) short of what’s required, according to AI researcher Pablo Villalobos.

It’s possible that we might eventually run out of high quality data to train more advanced AI models on. However — and I say this as a non-technical expert — I believe we’ll figure out ways of capturing more data and/or find a way to do more with less. This is something the WSJ article also considers.

Autonomous vehicles can generate an estimated 19 terabytes of data per day. A video game platform can do 50 terabytes daily. On a larger scale, weather organisations capture 287 terabytes of data a day.

There’s a ton of data out there. We just have to figure out how to capture more of it and make sense of it.

Day 12 of 100 Days of AI

Support Vector Machine. This is another intimidating machine learning term (along with terms like gradient descent!). However, you can use this concept in practice without digging into the crazy maths best left to the academics.

Today, I completed a lab that runs you through a simple implementation of a support vector machine (SVM) model. This technique involves a supervised machine learning algorithm that classifies data by thrusting it into a higher dimensional space, and then finding a hyperplane that can easily group the data into separate classes.

The IBM intro to ML course I’m doing made the simple illustration below.

In this first image, we have data with just one dimension. Our data has 1 feature along the x-axis, running from -10 to 10. This dataset is “not linearly separable”. There’s no clear way of classifying the blue dots from the red dots.

However, if we can go from one dimension to a higher dimension (go from 1-D to 2-D) by finding and selecting additional features of our data, we might be able to separate the data with a line. Here’s an example of what can happen.

In the above, we have 2 features for our data that can help us predict whether a dot is red or blue. We have values -10 to 10 on the horizontal axis and we have values 0 to 100 on the vertical axis (2 dimensions). Notice that in this higher dimension, a pattern emerges that allows us to draw a straight line (of the form y = mx + c), which can help us make predictions about whether a dot is red or blue.
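To make the jump from 1-D to 2-D concrete, here is a minimal sketch. It assumes the second feature is simply the square of the first (which matches axes running from -10 to 10 and 0 to 100), and the data points and the separating line y = 16 are made up for illustration. No single cut-off on x alone separates these points, but a straight line in 2-D does.

```python
# 1-D points: blue dots sit near zero, red dots sit further out on both sides.
# These values are made up for illustration.
xs = [-9, -7, -6, -2, -1, 0, 1, 2, 6, 8, 9]
labels = ["red" if abs(x) > 4 else "blue" for x in xs]

# Add a second feature (assumed here to be x squared), moving from 1-D to 2-D.
points_2d = [(x, x ** 2) for x in xs]


def predict(point, m=0.0, c=16.0):
    """Classify a 2-D point by which side of the line y = m*x + c it falls on."""
    x, y = point
    return "red" if y > m * x + c else "blue"


predictions = [predict(p) for p in points_2d]
print(predictions == labels)
```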

This thrusting of data into higher dimensions (known as ‘kernelling’, or the kernel trick) is key to SVM. The ‘support vectors’ are the data points that are closest to the hyperplane (a line in the example above and below).

SVM can work even in 3 dimensions or higher. Below is a 3D example.

Note that it’s trickier to visualise 4 dimensions and beyond.

The code for the lab that I completed is here. Below is a preview of the data I used. This shows just two features of a cell that’s either benign or malignant.

Key takeaway:

  • Once again, I’m amazed that there are tools you can use to train a machine learning model with just two key lines of code.
This code initializes a Support Vector Machine classifier. It uses a radial basis function (RBF) to move the data into higher dimensions. It then fits a model to the training data (X_train) with corresponding labels (y_train).
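As a rough sketch of what those two lines can look like, assuming scikit-learn's SVC class: the synthetic dataset, the split settings and the variable names other than X_train and y_train are my own stand-ins, not the lab's actual cell data.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cell data (two classes, e.g. benign vs malignant).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# The two key lines: create an SVM classifier with an RBF kernel, then fit it.
clf = svm.SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Use the fitted model to classify the held-out rows.
y_pred = clf.predict(X_test)
print(y_pred[:5])
```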

Day 11 of 100 Days of AI

Logistic regression continued.

In the lab portion of the intro to ML course today, I went through an exercise of running a logistic regression analysis on fictional customer data. I’ve put the code on Github here.

The model is structured as follows:

logit(p) = -0.2675 + (-0.1526 * tenure) + (-0.0791 * age) + (-0.0721 * address) + (-0.0196 * income) + (0.0519 * ed) + (-0.0950 * employ) + (0.1601 * equip)
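To turn the logit into a churn probability, you pass it through the sigmoid function. The sketch below uses the intercept and coefficients from the equation above; the function name and the example customer are hypothetical, and it assumes the features were normalized so that a value of zero means "average".

```python
import math

# Intercept and coefficients copied from the model equation above.
intercept = -0.2675
coefficients = {
    "tenure": -0.1526, "age": -0.0791, "address": -0.0721, "income": -0.0196,
    "ed": 0.0519, "employ": -0.0950, "equip": 0.1601,
}


def churn_probability(features):
    """Return p, where logit(p) = intercept + sum(coefficient * feature value)."""
    logit = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1 / (1 + math.exp(-logit))  # sigmoid turns the logit into a probability


# Hypothetical customer whose normalized features are all zero (assumed average).
average_customer = {name: 0.0 for name in coefficients}
print(round(churn_probability(average_customer), 3))

# Same customer, but with a tenure two units above average (negative coefficient).
long_tenure_customer = dict(average_customer, tenure=2.0)
print(round(churn_probability(long_tenure_customer), 3))
```

Because the tenure coefficient is negative, the second customer should come out with a lower churn probability than the first.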

A visual representation of the impact of the coefficients on churn is summarized in this chart.

And here’s the performance of the model, illustrated with a confusion matrix.

The basic steps to produce the model were as follows:

  1. Load the dataset from a CSV file.
  2. Select the columns we want to use. The features were tenure, age, address, income, education, employment status, and equipment; churn status is the label we want to predict.
  3. Preprocess the data. We did just two bits of preprocessing here: (a) make sure the churn column has just integers and (b) normalize the feature set.
  4. Split the dataset into training and testing sets.
  5. Train a logistic regression model using the training data.
  6. Make predictions on the test data.
  7. Evaluate the performance of the model using a confusion matrix, classification report, and log loss.
  8. I also added a bar graph that charts the coefficients so we can see which features have the greatest impact on churn.

I still find it incredible that if you can write some code, you can build a simple machine learning model with a few lines of code per the example below.
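Here is a minimal sketch of those steps, assuming scikit-learn. It is not the lab's code: the data is randomly generated in place of the customer CSV, and the split settings are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer CSV: 7 features and a 0/1 churn label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) < 0).astype(int)

# Normalize the features, then split (same order as the steps listed above).
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

model = LogisticRegression().fit(X_train, y_train)  # train on the training set
y_pred = model.predict(X_test)                      # predicted classes
y_prob = model.predict_proba(X_test)                # predicted probabilities

print(confusion_matrix(y_test, y_pred))
print(log_loss(y_test, y_prob))
print(model.intercept_, model.coef_)                # numbers behind a logit equation
```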

Day 10 of 100 Days of AI

Logistic Regression. On day 1 I worked with linear regression, a statistical technique to predict a continuous value based on another variable or set of variables. In the last few days I’ve been wrestling with logistic regression. This is a statistical technique that classifies data into a category. It also provides a probability for that classification. Examples include:

  • Customer retention (will a customer churn or retain?)
  • Relevant prescription (should we issue Drug A, or Drug B, or Drug C to a patient?)
  • Machine performance (given certain conditions will a machine fail or continue to operate?)

I found the intuition of how this works with the help of this YouTube video. The images below highlight the process well without getting into detailed maths.

The key bit: instead of a straight line (as is the case with linear regression), in logistic regression we use a “sigmoid curve” to fit to the data. This curve has an S-shape as you can see below.

The next image shows that for x-values in the range of 1-2, there are 4 healthy people and 1 unhealthy person, so the chance of being ill is 20% (1 in 5). In the x-value range of 4-5, there are 1 healthy person and 6 unhealthy people, so the chance of being ill is around 86% (6 in 7).

However, the method of ranges and chunking the data with vertical lines for the ranges isn’t precise enough. So we have to plot an “S” curve. On the right chart below, you can see how a sigmoid curve would allow you to take any x-value and estimate the probability of illness. (The data here is fictitious.)
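The difference between the chunked estimates and the curve can be sketched in a few lines. The slope and midpoint below are hypothetical values I picked for illustration; in a real model they would be fitted to the data.

```python
import math


def sigmoid(x, slope=1.0, midpoint=3.0):
    """S-shaped curve mapping any x-value to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-slope * (x - midpoint)))


# Chunked estimates from the ranges described above.
print(1 / 5)  # x in 1-2: 1 unhealthy person out of 5
print(6 / 7)  # x in 4-5: 6 unhealthy people out of 7

# The curve gives an estimate for any x-value, not just one per range.
for x in [1.5, 3.0, 4.5]:
    print(x, round(sigmoid(x), 2))
```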

There’s a bunch of maths that turns the above into a machine learning model that can take multiple independent variables and generate a classification prediction. This includes “gradient descent”, a term you’ll hear about a bunch when exploring machine learning. But I won’t be digging into this. I just want to build some models and learn the basics. That’s what I do next.

Day 9 of 100 Days of AI

Python is the most popular machine learning programming language and across all general domains, it is the second most popular, behind JavaScript. I’m probably at an intermediate skill level (but certainly on the lower end of that bracket) and I’ve decided to brush up on it regularly.

To keep practising Python, I got a copy of Python Workouts and started working through it today. I’ll use this to supplement the intro to ML course I’m doing.

Day 8 of 100 Days of AI

No coding today, but I did enjoy watching this video with computer scientists explaining machine learning in 5 levels of difficulty. One line in particular struck a chord with me:

“We can now do in 5 lines of code, something that would have taken 500 lines of very mathematical messy gnarly code,” says computer scientist Hilary Mason.

I’ve experienced this first-hand with some of the labs I’ve been doing and I’m surprised at how accessible this is. What a great time to be learning about ML!

Day 7 of 100 Days of AI

Decision Trees. I went through the intro to ML material on decision trees this morning.

I created some data in Excel and a simple decision tree machine learning model that predicts whether someone works in tech (value 1) or not (value 0). The data is synthetic, and I shaped it so that the ‘salary’ feature was the key predictor instead of age, sex, and region. Here’s the output tree.

My model had an accuracy of 77%. The confusion matrix below provides more on performance. For example, the prediction ‘precision’ of whether someone works in tech roles was quite good, at 85.7% (12 true positives and 2 false positives out of 14 positive predictions).

However, the model was less reliable when it predicted that someone doesn’t work in tech, with a negative predictive value of 68.8% (11 true negatives out of 16 negative predictions; the other 5 were tech workers mistakenly identified as non-tech).
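These figures can be re-derived from the four confusion-matrix counts. In this sketch the count of 5 false negatives is inferred (16 negative predictions minus 11 true negatives) rather than read directly off the chart.

```python
# Counts from the confusion matrix described above.
# fn = 5 is inferred: 16 negative predictions minus 11 true negatives.
tp, fp, tn, fn = 12, 2, 11, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that were right
precision = tp / (tp + fp)                  # share of "tech" predictions that were right
npv = tn / (tn + fn)                        # share of "non-tech" predictions that were right

print(f"accuracy:  {accuracy:.1%}")
print(f"precision: {precision:.1%}")
print(f"negative predictive value: {npv:.1%}")
```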

Key takeaways:

  • Decision trees are a supervised machine learning technique for classifying data (via classification trees) or predicting numeric values (via regression trees).
  • As you can see from the first chart, decision trees are good because they are more easily interpretable, and you can follow the steps to know how a classification was made. Cases where this is useful include:
    • Finance situations, e.g. Loan application decisions, investment decisions
    • Healthcare situations, e.g. Diagnosis by going through symptoms and other features
    • Marketing, e.g. Customer segmentation via some attributes, churn prediction.
  • How do you create decision trees?
    • The easiest thing to do is to use a Python library that does this for you. Here are some simple examples I did in Python.
    • Otherwise, the general process revolves mainly around feature selection. This is as follows:
      • Start at the root node.
      • Find the best feature that splits the data according to a metric, such as ‘information gain’ or ‘gini impurity’. You do this by going through all the features and splitting your data according to each one in isolation to see how well the data is split. Once you find the best feature, you build a branch with that feature.
      • You then repeat the above process, but below the previous branch and with a subset of data.
      • You keep going until you’re happy with the depth (note that if you go too deep, you might have issues with overfitting).
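The "find the best feature" step can be sketched with Gini impurity. This is a toy version under my own assumptions: the rows, the thresholds and the helper names are made up, and a real library would also search over many thresholds per feature.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2


def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[feature] <= threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up rows: salary (in thousands) and age; label 1 = works in tech.
rows = [
    {"salary": 40, "age": 25}, {"salary": 45, "age": 52},
    {"salary": 90, "age": 31}, {"salary": 95, "age": 48},
]
labels = [0, 0, 1, 1]

# Try each feature in isolation with a hypothetical threshold and keep the best,
# i.e. the one that leaves the least impurity after the split.
candidates = {"salary": 60, "age": 40}
best = min(candidates, key=lambda f: split_impurity(rows, labels, f, candidates[f]))
print(best)
```

The winning feature becomes the branch at the current node, and the same search is then repeated on each subset of rows below it.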

Day 6 of 100 Days of AI

Yesterday, I skimmed through k-fold cross-validation. I wanted to return to it today because I still wasn’t sure about what exactly we are assessing with it, and why.

Here’s the clearest YouTube video I found on the topic. The screenshot below from that video provides an easy summary.

Key takeaways:

  • K-fold cross validation is a method of evaluating the performance of a specific model configuration.
    • Model configuration includes the choice of an algorithm (e.g. linear regressions, decision trees, KNNs, neural networks), feature selection (e.g. the independent variables used to predict the dependent variable), and hyperparameters (e.g. the number of neurons in a neural network).
  • To perform k-fold cross-validation, you:
    • (1) split the data into ‘k’ equal parts (the “folds”);
    • (2) you train k-number of models using your chosen ‘model configuration’ across the k-number of folds; for example in the image above, model 1 is trained on folds 2-5 and tested on fold 1; model 2 is trained on folds 1 and 3-5, and tested on fold 2 and so forth;
    • (3) you then assess the performance of each model across the k-folds and aggregate this performance measure, usually using an average (or median, depending on the data type);
    • (4) the final accuracy tells you how good your model configuration is.
  • If your aggregated accuracy is strong, you can then train your model on the whole data set.
  • K-fold cross validation is useful in situations where you have a small amount of data and a small test set. In those instances, the opportunity to test your model is limited, and k-fold can boost that, allowing you to test a model configuration across a larger number of scenarios.
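The four steps above can be sketched in plain Python. Everything here is an illustrative assumption: the data is made up, the folds are taken in order without shuffling, and a toy nearest-neighbour rule stands in for the "model configuration" being assessed.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_neighbour_predict(train_x, train_y, x):
    """Toy model configuration: predict the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


# Made-up 1-D data: the label is 1 when the value is 5 or more.
xs = [1, 8, 3, 6, 2, 9, 4, 7, 0, 5]
ys = [1 if x >= 5 else 0 for x in xs]

k = 5
scores = []
for test_idx in k_fold_indices(len(xs), k):
    # Train on every fold except the held-out one.
    train_idx = [i for i in range(len(xs)) if i not in test_idx]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    correct = sum(
        nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i] for i in test_idx
    )
    scores.append(correct / len(test_idx))  # accuracy on the held-out fold

mean_accuracy = sum(scores) / k  # aggregate across the k folds
print(scores, mean_accuracy)
```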