Day 51 of 100 Days of AI

Today, I returned to decision trees. Even though I was building decision tree models by day 7 of this challenge, I could do with going back to the theory and understanding it better. And that’s what I started doing today.

Before starting, we should know that there are two types of decision trees in machine learning.

  1. Classification trees — which classify an item into a category based on a series of yes/no answers to questions (the nodes in the tree). For example:
    • Will a customer churn based on certain events?
  2. Regression trees — which predict a continuous value based on the same kind of yes/no questions. For example:
    • What is the price of a house given its location, number of rooms, and size?
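Here’s a minimal sketch of both tree types in scikit-learn, with synthetic data standing in for real churn or housing records:

```python
# A minimal sketch of both tree types. The data is synthetic
# (make_classification / make_regression), purely for illustration.
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: e.g. will a customer churn? (1 = churn, 0 = stay)
X_cls, y_cls = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_cls, y_cls)
print(clf.predict(X_cls[:5]))   # predicted class labels

# Regression tree: e.g. what is the price of a house?
X_reg, y_reg = make_regression(n_samples=200, n_features=3, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:5]))   # predicted continuous values
```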

Here’s an example decision tree from Wikipedia.

Decision trees have the benefit of simplicity and interpretability. It’s easier to follow a path on a decision tree than to scrutinise millions of neurons in an artificial neural network!

That said, decision trees have their limits. They are prone to overfitting and don’t generalize well to new data.

There’s more about the pros and cons of decision trees on the scikit-learn website here.

Day 50 of 100 Days of AI

I’m about 60% through this workbook, and I have to admit, I’m really skipping most of the more mathsy bits. I read the Ridge (L2) and Lasso (L1) Regularization sections today and though I didn’t follow the maths fully, I at least came away with an appreciation of a key idea.

To prevent machine learning models from being overtrained (or rather, “overfitted”) to the training data, it helps to increase the bias of the model somewhat. Ridge and Lasso Regularization are two techniques for doing this.
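Here’s a minimal sketch of what that looks like in scikit-learn (synthetic data; alpha is the knob that controls regularization strength):

```python
# Ridge (L2) and Lasso (L1) regression on synthetic data.
# alpha controls how strongly the coefficients are penalised.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients towards zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: can shrink coefficients to exactly zero

print(plain.coef_.round(1))
print(ridge.coef_.round(1))
print(lasso.coef_.round(1))
```

The Lasso output is where the L1 penalty shows its character: some coefficients can be pushed all the way to zero, effectively dropping those features from the model.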

If I were doing ML professionally, I’d have to nail these concepts. Thankfully, I’m a hobbyist, and I’m more interested in the applied side of the field than the depth of all the underlying maths.

Day 48 of 100 Days of AI

This morning I made a start on learning about “regularization” in machine learning.

When we train sophisticated machine learning models, sometimes they “overfit” the training data. In the example below from Andrew Ng, you can see the third chart on the far right has a blue line (the model) that fits the training data perfectly. However, it has “overfit” the data—which is to say, it matches the training data perfectly but at the expense of generalizing to new data points.

Meanwhile, the first chart on the far left has a model that underfits our training data. It won’t be much good at making predictions about new data points either.

Regularization is a collection of techniques that can help us get a model that’s “just right”, like the middle chart above. It simplifies models that are overly complex and helps us deal with the overfitting problem. More on this in the next few days.
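Here’s a rough sketch of the same idea in code (my own synthetic example, not the one from the charts): fitting one noisy curve with polynomials of increasing degree.

```python
# Underfitting vs overfitting: fit the same noisy data with polynomials
# of degree 1 (too simple), 4 (about right), and 15 (too flexible).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy sine curve

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # Training R^2 keeps climbing with degree, even once the model is
    # just memorising noise: the classic overfitting signature.
    print(degree, round(model.score(X, y), 3))
```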

Day 47 of 100 Days of AI

I continued working through some NumPy (Numerical Python) exercises today to better understand why the library is so useful. Turns out, it’s great at dealing with arrays and a variety of numerical computations.

Running linear algebra, statistical operations, and other calculations with NumPy is faster and more memory-efficient than writing all the code for these operations yourself, particularly when working with big data and machine learning. It’s why the library is so popular in AI circles.
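As a rough illustration of the difference (a quick sketch, not a rigorous benchmark):

```python
# Summing 10 million squared numbers: a pure-Python loop vs one
# vectorised NumPy call that runs in compiled C code.
import time
import numpy as np

data = np.arange(10_000_000, dtype=np.float64)

start = time.perf_counter()
total_loop = sum(x * x for x in data)  # Python-level iteration, element by element
loop_secs = time.perf_counter() - start

start = time.perf_counter()
total_np = np.dot(data, data)  # one vectorised call
np_secs = time.perf_counter() - start

print(f"loop: {loop_secs:.2f}s, numpy: {np_secs:.4f}s")
```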

I don’t expect to master NumPy, but I hope knowing the basics will get me far enough.

Day 46 of 100 Days of AI

I worked through a fast primer on NumPy this morning. NumPy is an open-source Python library for numerical computing. It’s used often in data science and machine learning, so it’s worth knowing the basics.
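A few of the basics from the primer, sketched out:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(a.shape)                     # (2, 3)
print(a * 2)                       # elementwise maths: [[ 2  4  6] [ 8 10 12]]
print(a.mean(axis=0))              # column means: [2.5 3.5 4.5]
print(a + np.array([10, 20, 30]))  # broadcasting a row across the array
```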

I also enjoyed reading this article, which takes a measured view of what large language models can do and how long it might take for use cases to emerge.

On my end, LLMs have been transformative as I’ve been co-writing code with GPT and automating lots of personal tasks. I’ll share a few more of these on my blog, as I’ve done in the past here.

Day 45 of 100 Days of AI

What is AI, really?

Lots of new tech startups have “AI” in their pitches these days. The media also use the term “AI” very broadly, and I’ve been guilty of this, too.

“AI” is a catch-all term for lots of technologies and techniques, and it isn’t always clear what someone is specifically talking about. Today, I went through a quick refresher on the highest-level concepts of what we really mean by AI.

In a broad sense, artificial intelligence (“AI”) is any form of intelligence that we put into a machine. A simple spam filter that rejects any email from a predefined set of addresses is AI. So is the Face ID system of my iPhone, which is far more sophisticated than a simple email spam filter.

Given how many forms of AI there are, it’s useful to be discerning about what’s what. The spam filter I describe above is a rules-based system of AI. The Face ID example is a machine learning system of AI. The breakthroughs we’ve had in recent years are in this latter category of AI systems.
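To make that distinction concrete, the rules-based spam filter above could be as little as a few lines of explicit, human-written logic (a toy sketch; the addresses are made up):

```python
# A rules-based "AI": every rule is written by a human, nothing is learned.
BLOCKED_SENDERS = {"spam@example.com", "offers@example.net"}

def is_spam(sender: str) -> bool:
    return sender.lower() in BLOCKED_SENDERS

print(is_spam("spam@example.com"))    # True
print(is_spam("friend@example.org"))  # False
```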

This image from Wikipedia is useful in visualising the basic taxonomy of AI.

Machine learning (ML) — systems that learn from data without humans providing the explicit rules of intelligence — is where all the action is today. (The alternative — rules-based systems that are explicitly programmed by humans — is less effective.)

Within machine learning you have:

  • Supervised learning — where a system learns from “labelled data”, i.e. data that comes with the correct answers (e.g. when trying to predict the value of a house based on known features);
  • Unsupervised learning — where a system discerns patterns and groupings from “unlabelled data” (e.g. when trying to segment different customer types);
  • Reinforcement learning — where a system learns through rewards and penalties, depending on what actions it takes (e.g. in robotics or when refining large language models like ChatGPT).

There are lots of algorithms and techniques used across the above groups. For example, in supervised learning you could use a linear regression model to predict a house price. In unsupervised learning you might use k-means clustering to segment customers. And when it comes to generative AI, computer scientists use a blend of approaches: supervised learning (more specifically, the self-supervised form), neural networks (aka deep learning), and reinforcement learning from human feedback (RLHF) to “fine-tune” models so that they are more effective at answering questions.
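A minimal sketch of the first two examples in scikit-learn, with synthetic data standing in for real houses and customers:

```python
# Supervised: linear regression on labelled data (features X, known answers y).
# Unsupervised: k-means clustering on data with no labels at all.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)
model = LinearRegression().fit(X, y)
print(model.predict(X[:3]))  # predicted continuous values, e.g. house prices

X_cust, _ = make_blobs(n_samples=100, centers=3, random_state=0)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_cust)
print(segments[:10])  # a cluster label per data point, e.g. customer segments
```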

AI is a hugely technical field and can take a lifetime to master. Thankfully, software engineers are starting to abstract away the complexity of the field so that even less technical people like me can learn how to build and deploy ML models. That said, it’s best to leave the bulk of the underlying technical know-how to the PhDs!

Day 45 of 100 Days of AI

Yesterday I learnt about ROC curves. There’s an extension to this concept: the Area Under the Curve (AUC). If you have ROC curves from two different models, comparing the area under each curve reveals which model performs better. Here’s an extract from the StatQuest book that illustrates this concept well.
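In code, the comparison might look something like this (a sketch with synthetic data and two arbitrary models; scikit-learn’s roc_auc_score does the area calculation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
for name, model in models.items():
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(name, round(roc_auc_score(y_te, probs), 3))  # higher AUC wins
```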

In the coming days, I’ll move on to the chapter about Regularization — a technique that helps prevent overfitting.

Day 44 of 100 Days of AI

Today, I went over ROC graphs. The acronym stands for receiver operating characteristic, a concept that originates from signal detection theory in World War II. Radar operators wanted to ascertain how good they were at detecting enemy aircraft (“True Positives”) rather than mistaking birds or other noise for aircraft (“False Positives”). So they designed a chart that illustrates the true positive and false positive rates under different thresholds (I blogged about thresholds yesterday).

Here’s an example ROC curve from the YouTube channel StatQuest.

How do we interpret these curves? Each point represents the result of a different threshold in, say, a logistic regression. In the above chart, if we wanted the threshold with the fewest false positives, we would go for the point with a false positive rate of 0 and the highest true positive rate (i.e. the highest point on the y-axis where the x-value is 0).

Similarly, if we wanted the highest true positive rate overall and could tolerate some false positives, we would take the point highlighted in red below.
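Here’s a small sketch of how those points come about in scikit-learn: roc_curve returns one (false positive rate, true positive rate) pair per candidate threshold. Synthetic data again.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Each row is one candidate threshold and the trade-off it produces.
fpr, tpr, thresholds = roc_curve(y, probs)
for f, t, thr in list(zip(fpr, tpr, thresholds))[:10]:
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```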

Side note: I’m still working through this book and I’m eager to get back to code. However, it’s important to at least run through the fundamentals again before getting back to online courses and other fun applications of ML.

Day 43 of 100 Days of AI

Logistic regression thresholds.

A few weeks back I built a number of logistic regression models without quite appreciating the impact of the threshold you set. A few pages of a chapter on assessing ML model performance helped me close that gap today.

It turns out that if you lower the classification threshold, you increase the chances of correctly identifying positive cases, the True Positives. This comes at the cost of more False Positives. However, there are some cases where this trade-off is the best approach.

The book shares the example of Ebola screening. In that situation you would rather set a lower classification threshold and increase your True Positive Rate (i.e. recall) at the expense of more False Positives, because it ensures you catch the maximum number of Ebola cases. A less morbid example is venture capital: if you had a model predicting the success of early-stage startups, you would be better off with a lower threshold, since that limits the chances of missing out on a potential outlier success.
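A minimal sketch of what lowering the threshold means in practice (synthetic, imbalanced data; we use predict_proba and pick our own cut-off instead of the default 0.5):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced data: roughly 10% positive cases.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.5, 0.3, 0.1):
    preds = (probs >= threshold).astype(int)
    true_pos = int(((preds == 1) & (y == 1)).sum())
    false_pos = int(((preds == 1) & (y == 0)).sum())
    print(f"threshold={threshold}: TP={true_pos}, FP={false_pos}")
```

As the threshold drops, the True Positive count rises, and so does the False Positive count: exactly the trade-off described above.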

I’ll continue with this thought and more reading tomorrow.