Day 19 of 100 Days of AI

Today I completed the KNN lab. The work is ungraded and I haven’t yet checked whether my results are in line with other students’, but I’ll dig into this once I have all the models. Here are the results for the KNN model I built today. The code is on Github, and a rough sketch of what fitting a KNN model looks like follows the checklist below.

  1. Linear Regression – completed.
  2. KNN – completed
  3. Decision Trees – to do
  4. Logistic Regression – to do
  5. SVM – to do
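
For reference, here’s a minimal sketch of fitting a KNN classifier in scikit-learn. The synthetic data and the choice of k = 4 below are placeholders for illustration, not the lab’s actual weather dataset or a tuned value of k:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic placeholder data standing in for the lab's weather dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Fit a KNN classifier with k = 4 neighbours and score it on the held-out set.
knn = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))
```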

Day 18 of 100 Days of AI

Today I started the bonus material of the intro to ML course I’ve been doing. I’ll be working with weather data and will practise all the key topics I covered in the course. Key areas include:

  1. Linear Regression – completed.
  2. KNN – to do
  3. Decision Trees – to do
  4. Logistic Regression – to do
  5. SVM – to do

I’ll then assess all the above techniques using the following evaluation metrics:

  1. Accuracy Score
  2. Jaccard Index
  3. F1-Score
  4. LogLoss
  5. Mean Absolute Error
  6. Mean Squared Error
  7. R2-Score

I’ve started with linear regression. The code for this is here. I’ll work through the remainder in the coming days.
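
All of the above metrics are available in scikit-learn. Here’s a minimal sketch of how to compute them, using made-up predictions purely for illustration (the real values will come from each model I train):

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions, purely to illustrate the metric calls.
y_true_class = np.array([1, 0, 1, 1, 0, 1])               # true class labels
y_pred_class = np.array([1, 0, 0, 1, 0, 1])               # predicted class labels
y_pred_proba = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7])   # predicted probabilities

# Classification metrics.
print(metrics.accuracy_score(y_true_class, y_pred_class))
print(metrics.jaccard_score(y_true_class, y_pred_class))
print(metrics.f1_score(y_true_class, y_pred_class))
print(metrics.log_loss(y_true_class, y_pred_proba))

# Regression metrics (for the linear regression model).
y_true_reg = np.array([3.2, 4.1, 5.0, 6.3])
y_pred_reg = np.array([3.0, 4.5, 4.8, 6.0])
print(metrics.mean_absolute_error(y_true_reg, y_pred_reg))
print(metrics.mean_squared_error(y_true_reg, y_pred_reg))
print(metrics.r2_score(y_true_reg, y_pred_reg))
```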

Day 17 of 100 Days of AI

I completed the Machine Learning with Python introductory course today and received a fun digital certificate for it here.

The course is designed as a six-week, part-time program, but I completed it in less than half that time by dedicating a little time every day. I’m also keen to move on to more practical, hands-on building with LLMs. Before that, however, I’ll revise what I’ve learnt over the last three weeks and apply the concepts to a few novel challenges.

83 more days of AI learning to go!

Day 16 of 100 Days of AI

K-Means Clustering Continued…

I went through a lab exercise today on k-means. I’ve put the code on my Github page here.

I remain in awe that you can fit a model with a few lines of code. A snippet is provided below. In this example, a k-means algorithm is run on data assigned to the variable ‘X’. How cool is that?
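
Roughly, the scikit-learn version looks like this (a minimal sketch: the synthetic blobs below stand in for the lab’s feature matrix X, and k = 4 is just an illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic placeholder data standing in for the lab's feature matrix X.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.9, random_state=0)

# Fit k-means with k = 4 clusters (k-means++ initialisation, 12 random restarts).
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
k_means.fit(X)

labels = k_means.labels_                 # cluster assignment for each data point
centroids = k_means.cluster_centers_     # coordinates of the 4 cluster centres
```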

Here are some cool charts from the code on Github.

Key takeaways:

  • K-means is an unsupervised machine learning technique to cluster data into a ‘k’ number of groups.
  • Example use cases include customer segmentation, fraud detection, and grouping biological markers.
  • It’s useful where the data isn’t labelled, and you want to explore and find patterns that might not be immediately obvious.

Day 15 of 100 Days of AI

K-Means Clustering.

Today I learnt about k-means clustering. This is another one of those fancy-sounding machine learning techniques that seem intimidating to the uninitiated. But actually, the intuition behind it is simple. Thanks to YouTube, you can understand the rudimentary elements of this unsupervised machine learning algorithm fairly quickly. Here’s a video that got me over that hurdle.

Tomorrow, I’ll train a model using this technique.

Key takeaways:

  • Machine learning has so many techniques that, it seems to me, part of solving any problem with ML is first figuring out what data you need, and then researching and running experiments to work out which algorithm is best suited to the problem.
  • Some of these algorithms are very old. Even the latest breakthroughs with large language models build on methods that Geoffrey Hinton was using as far back as 1985 (albeit at a fraction of today’s scale).

Day 14 of 100 Days of AI

More on data.

I read a bit more about the data problem in AI after writing about it yesterday. People smarter than I am believe there are diminishing returns to training AI models on ever-growing datasets. In this research publication, AI researchers share the following summary:

In layman’s terms (thanks to a ChatGPT translation), the abstract says the following:

“Our research looked into how well advanced models, trained on vast internet data, can understand and respond to new tasks without prior specific training. We discovered that these models, contrary to expectations, require significantly more data to slightly improve at these new tasks, showing a slow and inefficient learning process. This inefficiency persists even under varied testing conditions, including completely new or closely related data to their training. Moreover, when faced with a broad range of uncommon tasks, the models performed poorly. Our findings challenge the notion of “zero-shot” learning—where models instantly adapt to new tasks without extra training—highlighting a gap in their supposed ability to learn efficiently from large datasets.”

ChatGPT summary of the abstract from this paper.

Key takeaway:

  • It looks like simply throwing more data at models will eventually hit a limit in terms of performance. At that point (or before it) we’ll need alternative techniques and new breakthroughs. That’s what AI expert Gary Marcus believes. As he notes in his latest newsletter, “There will never be enough data; there will always be outliers. This is why driverless cars are still just demos, and why LLMs will never be reliable.”

Day 13 of 100 Days of AI

On Data.

One thing that’s clear as I work through this 100 day challenge is how important data is for training AI. We need lots of it, and apparently all the world’s data that’s available on the internet might not be enough for building more advanced models. That’s the lead of this article from the Wall Street Journal.

According to the article and some of the researchers interviewed, if GPT-4 was trained on 12 trillion tokens (the fragments of words that large language models learn from), GPT-5 might need 60-100 trillion tokens on the current trajectory. Even if we used all the text and image data on the internet, we’d still be 10-20 trillion tokens (or more) short of what’s required, according to AI researcher Pablo Villalobos.

It’s possible that we might eventually run out of high-quality data to train more advanced AI models on. However, speaking as someone who isn’t a technical expert, I believe we’ll figure out ways of capturing more data and/or find ways to do more with less. This is something the WSJ article also considers.

Autonomous vehicles can generate an estimated 19 terabytes of data per day. A video game platform can do 50 terabytes daily. On a larger scale, weather organisations capture 287 terabytes of data a day.

There’s a ton of data out there. We just have to figure out how to capture more of it and make sense of it.

Day 12 of 100 Days of AI

Support Vector Machine. This is another intimidating machine learning term (along with terms like gradient descent!). However, you can use this concept in practice without digging into the crazy maths best left to the academics.

Today, I completed a lab that runs you through a simple implementation of a support vector machine (SVM) model. SVM is a supervised machine learning algorithm that classifies data by mapping it into a higher-dimensional space and then finding a hyperplane that separates the data into distinct classes.

The IBM intro to ML course I’m doing includes the simple illustration below.

In this first image, we have data with just one dimension. Our data has one feature, plotted along the x-axis from -10 to 10. This dataset is “not linearly separable”: there’s no clear way of separating the blue dots from the red dots.

However, if we can go from one dimension to a higher one (from 1-D to 2-D) by finding or engineering an additional feature of our data, we might be able to separate the classes with a line. Here’s an example of what can happen.

In the above, we have two features that can help us predict whether a dot is red or blue: values from -10 to 10 on the horizontal axis and values from 0 to 100 on the vertical axis (two dimensions). Notice that in this higher dimension a pattern emerges that allows us to draw a straight line (of the form y = mx + c) separating the two colours, which we can then use to make predictions.
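
To make that concrete, here’s a tiny sketch of the idea in code. It assumes, purely for illustration, that the blue dots are the points near zero and that the engineered second feature is x squared (which is what would give the 0 to 100 range on the vertical axis):

```python
import numpy as np

# 1-D data along the x-axis from -10 to 10: assume the points near zero are "blue"
# and the points far from zero are "red". No single threshold on x separates them.
x = np.linspace(-10, 10, 41)
labels = np.where(np.abs(x) < 5, "blue", "red")

# Add a second, engineered feature (x squared, which runs from 0 to 100).
# In this 2-D space the horizontal line y = 25 now cleanly separates blue from red.
X_2d = np.column_stack((x, x ** 2))
```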

This mapping of data into higher dimensions (done via a kernel function, a step often called ‘kernelling’) is key to SVM. The ‘support vectors’ are the data points that are closest to the hyperplane (a line in the examples above and below).

SVM can work even in 3 dimensions or higher. Below is a 3D example.

Note that it’s trickier to visualise four dimensions and beyond.

The code for the lab that I completed is here. Below is a preview of the data I used. This shows just two features of a cell that’s either benign or malignant.

Key takeaway:

  • Once again, I’m amazed that there are tools you can use to train a machine learning model with just two key lines of code.
This code initializes a support vector machine classifier with a radial basis function (RBF) kernel, which maps the data into higher dimensions, and then fits the model to the training data (X_train) and its corresponding labels (y_train).
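
A minimal sketch of what those two key lines look like in scikit-learn (the built-in breast cancer dataset below is just a stand-in for the lab’s cell data):

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Placeholder benign/malignant data standing in for the lab's cell dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# The two key lines: initialise an SVM with the RBF kernel, then fit it to the training data.
clf = svm.SVC(kernel="rbf")
clf.fit(X_train, y_train)

yhat = clf.predict(X_test)   # predicted classes for the held-out cells
```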

Day 11 of 100 Days of AI

Logistic regression continued.

In the lab portion of the intro to ML course today, I went through an exercise of running a logistic regression analysis on fictional customer data. I’ve put the code on Github here.

The model is structured as follows:

logit(p) = -0.2675 + (-0.1526 * tenure) + (-0.0791 * age) + (-0.0721 * address) + (-0.0196 * income) + (0.0519 * ed) + (-0.0950 * employ) + (0.1601 * equip)
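
To see how the equation gets used, here’s a small worked example. It plugs hypothetical, already-normalised feature values (not real customer data) into the fitted coefficients and converts the resulting log-odds into a churn probability with the sigmoid function:

```python
import numpy as np

# Hypothetical, already-normalised feature values for one customer (made up for illustration).
tenure, age, address, income, ed, employ, equip = 0.5, -0.2, 0.1, -0.3, 0.4, -0.1, 1.0

# Log-odds of churn from the fitted equation above.
logit_p = (-0.2675 - 0.1526 * tenure - 0.0791 * age - 0.0721 * address
           - 0.0196 * income + 0.0519 * ed - 0.0950 * employ + 0.1601 * equip)

# Convert the log-odds to a probability with the sigmoid (inverse logit) function.
p_churn = 1 / (1 + np.exp(-logit_p))
print(p_churn)   # about 0.47 for these made-up inputs
```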

A visual representation of the impact of the coefficients on churn is summarized in this chart.

And here’s the performance of the model, illustrated with a confusion matrix.

The basic steps to produce the model were as follows:

  1. Load the dataset from a CSV file.
  2. Select the features we want to use for predictions. These were: tenure, age, address, income, education, employment status, equipment, and churn status.
  3. Preprocess the data. We did just two bits of preprocessing here: (a) make sure the churn column contains only integers and (b) normalize the feature set.
  4. Split the dataset into training and testing sets.
  5. Train a logistic regression model using the training data.
  6. Make predictions on the test data.
  7. Evaluate the performance of the model using a confusion matrix, classification report, and log loss.
  8. I also added a bar graph that charts the coefficients so we can see which features have the greatest impact on churn.

I still find it incredible that, if you can write a little code, you can build a simple machine learning model in just a few lines, per the example below.
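
Here’s a minimal sketch of those steps in scikit-learn. The CSV filename is hypothetical, and the column names are simply the ones from the logit equation above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, log_loss

# 1-2. Load the data (hypothetical filename) and select the features plus the churn label.
churn_df = pd.read_csv("churn_data.csv")
churn_df = churn_df[["tenure", "age", "address", "income", "ed", "employ", "equip", "churn"]]

# 3. Preprocess: make sure churn contains only integers, then normalise the feature set.
churn_df["churn"] = churn_df["churn"].astype(int)
X = StandardScaler().fit_transform(churn_df.drop("churn", axis=1).values)
y = churn_df["churn"].values

# 4. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# 5-6. Train a logistic regression model and make predictions on the test data.
LR = LogisticRegression(C=0.01, solver="liblinear").fit(X_train, y_train)
yhat = LR.predict(X_test)
yhat_prob = LR.predict_proba(X_test)

# 7. Evaluate with a confusion matrix, classification report, and log loss.
print(confusion_matrix(y_test, yhat))
print(classification_report(y_test, yhat))
print(log_loss(y_test, yhat_prob))
```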

Day 10 of 100 Days of AI

Logistic Regression. On day 1 I worked with linear regression — a statistical technique to predict a continuous value based on another variable or set of variables. In the last few days I’ve been wrestling with logistic regression. This is a statistical technique that classifies data into a category, and it also provides a probability for that classification. Examples include:

  • Customer retention (will a customer churn or retain?)
  • Relevant prescription (should we issue Drug A, or Drug B, or Drug C to a patient?)
  • Machine performance (given certain conditions, will a machine fail or continue to operate?)

I got the intuition for how this works with the help of this YouTube video. The images below illustrate the process well without getting into detailed maths.

The key bit: instead of fitting a straight line (as in linear regression), in logistic regression we fit a “sigmoid curve” to the data. This curve has an S-shape, as you can see below.

The next image shows that if we look at the cases with an x-value in the range of 1-2, we have 4 healthy people and 1 unhealthy person, so the chance of being ill is 20% (1 in 5). In the x-value range of 4-5, we have 1 healthy person and 6 unhealthy people, so the chance of being ill is roughly 86% (6 in 7).

However, chunking the data into ranges with vertical lines isn’t precise enough, so we fit an “S” curve instead. On the right-hand chart below, you can see how a sigmoid curve lets you take any x-value and estimate the probability of illness. (The data here is fictitious.)
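
The sigmoid function itself is just a one-liner. Here’s a tiny sketch with made-up coefficients (not fitted to any real data) to show how it turns a number into a probability:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the (0, 1) range, i.e. a probability.
    return 1.0 / (1.0 + np.exp(-z))

# Made-up intercept (-3.0) and slope (1.0), purely for illustration.
# For an x-value of 4.5 this gives a probability of illness of about 0.82.
print(sigmoid(-3.0 + 1.0 * 4.5))
```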

There’s a bunch of maths that turns the above into a machine learning model that can take multiple independent variables and generate a classification prediction. This includes “gradient descent”, a term you’ll hear a lot when exploring machine learning. But I won’t be digging into that here. I just want to build some models and learn the basics. That’s what I’ll do next.