Day 7 of 100 Days of AI

Decision Trees. I went through the intro to ML material on decision trees this morning.

I created some synthetic data in Excel and built a simple decision tree model that predicts whether someone works in tech (value 1) or not (value 0). I shaped the data so that the ‘salary’ feature was the key predictor, rather than age, sex, or region. Here’s the output tree.

My model had an accuracy of 77%. The confusion matrix below gives more detail on performance. For example, precision for the tech class was quite good, at 85.7% (12 true positives out of 14 positive predictions, with 2 false positives).

However, the model was less good at identifying people who don’t work in tech: of the 16 cases it predicted as non-tech, only 11 were true negatives (68.8%), while the other 5 were tech workers mistakenly classified as non-tech.

Key takeaways:

  • Decision trees are a supervised machine learning technique for classifying data (via classification trees) or predicting numeric values (via regression trees).
  • As you can see from the first chart, decision trees are attractive because they are easily interpretable: you can follow the splits to see how a classification was made. Cases where this is useful include:
    • Finance, e.g. loan application decisions, investment decisions
    • Healthcare, e.g. diagnosis by stepping through symptoms and other features
    • Marketing, e.g. customer segmentation by attributes, churn prediction.
  • How do you create decision trees?
    • The easiest approach is to use a Python library that does this for you. Here are some simple examples I did in Python (and see the sketch after this list).
    • Otherwise, the general process revolves mainly around feature selection. This is as follows:
      • Start at the root node.
      • Find the feature that best splits the data according to a metric, such as ‘information gain’ or ‘Gini impurity’. You do this by going through each feature in turn, splitting the data on it in isolation, and measuring how well the split separates the classes. Once you find the best feature, you build a branch with that feature.
      • You then repeat the above process, but below the previous branch and with a subset of data.
      • You keep going until you’re happy with the depth (note that if you go too deep, you might have issues with overfitting).
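Something like the following scikit-learn sketch captures that workflow. The data here is made up on the spot (not my actual Excel file), so treat it as an illustration of the steps rather than the real model.

```python
# A minimal sketch of the decision tree workflow described above,
# using a tiny made-up dataset rather than my actual Excel data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical synthetic data: salary (in £k) is shaped to be the key predictor.
df = pd.DataFrame({
    "salary": [30, 85, 40, 95, 55, 120, 35, 70, 110, 45],
    "age":    [25, 34, 41, 29, 52, 38, 60, 33, 45, 27],
    "works_in_tech": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

X = df[["salary", "age"]]
y = df["works_in_tech"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# max_depth keeps the tree shallow, which helps avoid overfitting.
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(export_text(model, feature_names=["salary", "age"]))  # the tree as text
print("Accuracy:", accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
```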

Day 6 of 100 Days of AI

Yesterday, I skimmed through k-fold cross-validation. I wanted to return to it today because I still wasn’t sure about what exactly we are assessing with it, and why.

Here’s the clearest YouTube video I found on the topic. The screenshot below from that video provides an easy summary.

Key takeaways:

  • K-fold cross validation is a method of evaluating the performance of a specific model configuration.
    • Model configuration includes the choice of algorithm (e.g. linear regression, decision trees, KNNs, neural networks), feature selection (e.g. the independent variables used to predict the dependent variable), and hyperparameters (e.g. the number of neurons in a neural network).
  • To perform k-fold cross-validation, you:
    • (1) split the data into ‘k’ equal parts (the “folds”);
    • (2) train k models using your chosen ‘model configuration’, holding out a different fold each time; for example, in the image above, model 1 is trained on folds 2-5 and tested on fold 1; model 2 is trained on folds 1 and 3-5 and tested on fold 2; and so forth;
    • (3) assess the performance of each model on its held-out fold and aggregate the scores, usually using an average (or median, depending on the metric);
    • (4) the final accuracy tells you how good your model configuration is.
  • If your aggregated accuracy is strong, you can then train your model on the whole data set.
  • K-fold cross validation is useful when you have a small amount of data and therefore a small test set. In those instances, the opportunity to test your model is limited, and k-fold stretches the data further by letting you test a model configuration across several different train/test splits. (A quick scikit-learn sketch follows below.)
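On a toy built-in dataset, it looks roughly like this:

```python
# A minimal k-fold cross-validation sketch using scikit-learn on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The 'model configuration' being evaluated: algorithm plus hyperparameters.
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Five folds: each fold takes one turn as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```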

Day 5 of 100 Days of AI

Today I learnt about cross validation (scikit-learn has Python helper functions for this). This is where you split your data into different training and testing sets, and then iteratively train and test across different combinations to assess how well a model performs on unseen data.

Two common methods of cross validation are k-fold validation and leave-one-out cross validation.

K-fold involves splitting the data into equal parts and rotating which parts are used for training and which for testing.

With leave-one-out cross validation, you select one data point for testing, and use the rest for training. You then move to another data point for testing, and then use the remainder for training, and so forth, until every data point has been used for testing.
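To make leave-one-out concrete, here’s a rough scikit-learn sketch on a toy built-in dataset (not something I’ve run against real data yet):

```python
# A rough leave-one-out cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# One model is trained per data point: each point takes a turn as the test set.
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

print("Number of models trained:", len(scores))
print("Mean accuracy:", scores.mean())
```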

I’ll return to these concepts when I write some more code next week.

PS. One thing I’m realising as I dig into the basics of machine learning is that choosing the techniques that will produce the best models is a mix of art and science. There’s a lot of trial and error, even though it’s a deeply rigorous and mathematical field.

Day 4 of 100 Days of AI

Today, I went through a classifier lab on the intro ML course. There were several bits I didn’t quite understand, but GPT helped get me over the basics. For example, I will need to review my notes on the Jaccard Index and F1-score (evaluation metrics for classifier models), and on the concept of normalisation, where you rescale your data to a common range without changing the shape of its distribution. This makes it easier to compare distances between points, which is critical when making classification predictions.

On the latter point, I’ve included some charting code in the github repo here (see image below), which helped me understand the normalisation concept. The charting code was written by GPT, with some minor tweaks from me.
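As a rough illustration of the normalisation idea (separate from the charting code above), min-max scaling with scikit-learn looks something like this:

```python
# A rough sketch of min-max normalisation: values are rescaled to [0, 1]
# without changing the shape of their distribution.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales (e.g. age vs salary).
X = np.array([
    [25, 30_000],
    [40, 85_000],
    [60, 45_000],
    [33, 120_000],
])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Both columns now sit between 0 and 1, so neither dominates a
# distance calculation purely because of its units.
```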

Key takeaways:

  • Classification is a supervised machine learning approach.
  • It makes a prediction about what discrete class some item should fall into.
  • Classifiers can be used for spam detection, document classification, speech recognition, or even to predict if a certain customer will churn, based on a variety of characteristics.
  • Classification algorithms include k-nearest neighbours (which I’ve put on github here), decision trees, and logistic regression (which, instead of assigning an item to a class outright, gives you a probability that it belongs to a particular class).
  • The k-nearest neighbours algorithm was fun to learn about, and the intuition for it is simpler than I expected. The basic notion is as follows: for a given item you want to predict on, look at a select number of neighbours (the k), and predict the outcome based on the most common category among those neighbours (or, for numeric targets such as house prices based on location, square footage, etc., the neighbours’ mean or median value). A quick sketch follows after this list.
  • Classification algorithms can be evaluated with a number of accuracy measures, such as the Jaccard Index, the F1-score, or Log Loss. I didn’t cover these in detail, but I did enough to get the very basics.
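Here’s that KNN intuition in code, using a toy built-in dataset rather than the lab notebook:

```python
# A minimal k-nearest neighbours sketch: classify a point by the most
# common class among its k closest (scaled) neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling first matters because KNN relies on distances between points.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
```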

Day 3 of 100 Days of AI

Today I continued with the IBM intro ML course and got to build a multiple linear regression model. This time I tried to do it with data outside of the class. I found some data on IMDB movies that also had a Metacritic score and a number of other features. Below is a scatter graph of just the IMDB and Metascore ratings.

I then went through the usual steps of preparing the data and running a regression, but this time with an extra independent variable. You can see it in the code below, line 29.

This model had a variance score of just 0.58, so it’s not of much use. But it was fun to test something I’d just learnt on a different data set. Below, I run the model with Dune’s IMDB rating from 2021.

The model predicted a Metacritic score of 71. The actual rating today is 74. But you don’t need a fancy model to know that IMDB ratings are usually in line with Metacritic scores!
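For reference, the shape of a multiple linear regression fit like this looks roughly as follows. The file and column names here are hypothetical placeholders, not the actual fields from my IMDB dataset.

```python
# A rough sketch of a multiple linear regression fit.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("imdb_movies.csv")  # hypothetical file name

# Two independent variables predicting one dependent variable.
X = df[["imdb_rating", "votes"]]
y = df["metascore"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("Variance (R^2) score:", r2_score(y_test, model.predict(X_test)))

# Predict a Metacritic score for a new film from its features.
print(model.predict(pd.DataFrame({"imdb_rating": [8.0], "votes": [600_000]})))
```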

Day 2 of 100 Days of AI

I spent bits of today watching YouTube videos about linear regression. One of the best I found was this StatQuest video on the topic. I enjoyed his style so much that I ordered his introductory book on machine learning. I’m planning a digital fast over a long weekend in the second quarter of the year, and I’ll take the book with me to continue learning offline.

Today I learnt about a new concept: Overfitting.

Key takeaways today:

  • Overfitting: This is when a machine learning model fits the training data too closely, so much so that it doesn’t generalize well to data outside the training set (i.e. new data).
  • Why does overfitting happen? A few drivers:
    • Small training data set — limited training examples might throw up patterns that don’t reflect the true distribution of the data;
    • High model complexity — a model with too many parameters can latch onto noise or patterns that are not useful;
    • Biased data — this could throw up patterns that don’t generalize across a wider population of data. (A quick sketch of overfitting in action follows.)
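For instance, an unconstrained decision tree on a deliberately noisy, made-up dataset will score near-perfectly on its training data but noticeably worse on held-out data:

```python
# Illustration of overfitting: an unconstrained decision tree memorises
# the (noisy, synthetic) training data but generalises less well.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [2, None]:  # a shallow tree vs. unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train accuracy={model.score(X_train, y_train):.2f}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```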

Day 1 of 100 Days of AI

Today I’ve started a “100 days of AI” challenge. I’ve used LLMs across a few personal projects here, here, and here. But I’d like to understand the basics of AI and machine learning a bit better.

My motive isn’t to retrain as an ML engineer or data scientist. Instead, I want to challenge myself to go beyond a superficial understanding of one of the greatest technology advancements of our time, particularly given my work in technology investing.

Every day, I’ll aim to do anything from 10 to 90 mins of learning, experiments, and review. Here we go!

Day 1: Overview & Learnings

I’m working my way through an introductory machine learning course by IBM, which provides an AI Engineering Professional Certificate at the end.

I’m about 25% of the way through it, and I’ve already built a simple linear regression model that predicts CO2 emissions based on engine size. I’ve put the GitHub code for this here. Most of the action comes from the code snippet below.
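For anyone following along, the fit itself looks roughly like this. The file and column names are placeholders, not necessarily those used in the course dataset.

```python
# A rough sketch of a simple linear regression predicting CO2 emissions
# from engine size. File and column names are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("fuel_consumption.csv")  # hypothetical file name

X = df[["engine_size"]]   # independent variable
y = df["co2_emissions"]   # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Coefficient:", model.coef_[0], "Intercept:", model.intercept_)
print("R^2 on test data:", model.score(X_test, y_test))
```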

Key takeaways today:

  • Artificial intelligence (AI) is a broad term covering computer systems that perform intelligent, human-like functions.
    • Machine learning is a subset of AI that uses statistical techniques to learn from data, infer patterns, and make predictions.
    • Deep learning is a subset of machine learning and it uses neural networks (inspired by the brain structure) to learn from data.
    • There’s a good explainer of all this here.
  • I’m going to be looking at mostly machine learning for now. This can be split across supervised learning (where you provide data that’s ‘labelled’) and unsupervised learning (where the data is ‘unlabelled’).
  • Popular techniques for ML models include regression and classification (in supervised learning), and clustering and association (in unsupervised learning).
  • There are a number of open source libraries that make it easier to prepare and build machine learning models. The key ones I’ll probably use a lot are NumPy, SciPy, Pandas, and scikit-learn.

Many concepts feel foreign to me at the moment, but as I spin up a couple of projects and work through tutorials, I should start to get the basics down!

Turning Podcasts into AI Insights: The Making of a 20VC-GPT Bot

A DALL·E 3 generated image.

After the OpenAI developer event yesterday, investor Harry Stebbings tweeted, “Holy shit. Can you imagine a GPT where you could ask any question and it uses advice from 3,000 20VC episodes to answer your questions from the best VCs in the world.” That possibility is in fact already a reality. Here’s how I spun up a working prototype rapidly. Demo videos below.

Some background: I built something similar six months ago with a different dataset. However, that process took several hours over a few evenings. Today, you can make custom bots in minutes. Let’s walk through the broad steps I took for the 20VC bot.

First, I downloaded a sample of 60 episodes from the 20VC podcast. I then used AssemblyAI’s API to transcribe the MP3s in a big batch. You could also use Whisper, which is cheaper and perhaps even faster. I went for AssemblyAI instead because of familiarity and a need to prototype quickly.
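To give a flavour, the batch transcription step can be done in a short script with AssemblyAI’s Python SDK. This is a simplified sketch rather than my exact code, and the API key and folder paths are placeholders.

```python
# A simplified sketch of batch transcription with AssemblyAI's Python SDK.
# The API key and file paths are placeholders.
import pathlib
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder
transcriber = aai.Transcriber()

# Loop over the downloaded episodes and save each transcript as text.
for mp3 in pathlib.Path("episodes").glob("*.mp3"):
    transcript = transcriber.transcribe(str(mp3))
    mp3.with_suffix(".txt").write_text(transcript.text)
    print("Transcribed", mp3.name)
```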

The next step was to convert these transcripts into a database that GPT-4 could use. For that, I used Retool, a platform that lets you drag and drop files into a database of embeddings that language models understand. Retool also provides chatbot interfaces you can use right off the shelf. And voilà! I had a bot that could query 60 episodes of 20VC for knowledge and advice.

To create a full version of the 20VC bot you would need all 3,000 episodes and a reasonable budget for large language model services. This process will rack up a bill in the hundreds (maybe thousands) of dollars, but it’s small change for a media or investment business.

To Harry and his team, I hope this demo shows what’s possible. Even without the upcoming OpenAI feature that lets anyone create their own GPT, you can already build custom GPTs with impressive speed.

Happy building all!

Coffee, Code, and ChatGPT: Lessons in Automation

(Photo by Dall-E)

I’ve been doing a lot of coffee meetings recently, and I often need to find venues (mostly cafes) that are convenient for both parties. Using Google Maps for this can be cumbersome, so I thought: why not ask ChatGPT-4 to write a simple program to solve this problem? The app could take two locations and automatically identify a list of coffee shops located roughly halfway between the two addresses.
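The core logic is simple enough to sketch with the googlemaps Python client. This is an illustrative simplification rather than my actual code: the API key is a placeholder, and taking a plain average of the two coordinates is a crude way to find the midpoint.

```python
# A simplified sketch: geocode two addresses, take a rough midpoint,
# and search for cafes near it. The API key is a placeholder and the
# midpoint is a crude average of coordinates.
import googlemaps

gmaps = googlemaps.Client(key="YOUR_GOOGLE_MAPS_KEY")  # placeholder

def cafes_between(address_a, address_b, radius_m=800):
    loc_a = gmaps.geocode(address_a)[0]["geometry"]["location"]
    loc_b = gmaps.geocode(address_b)[0]["geometry"]["location"]

    midpoint = (
        (loc_a["lat"] + loc_b["lat"]) / 2,
        (loc_a["lng"] + loc_b["lng"]) / 2,
    )

    results = gmaps.places_nearby(location=midpoint, radius=radius_m, type="cafe")
    return [place["name"] for place in results.get("results", [])]

print(cafes_between("King's Cross, London", "London Bridge, London"))
```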

To my surprise, and in less than 30 minutes of working with GPT and the Python programming language, I had working code. Here’s the output of that initial process.

While this code was enough for me, it wasn’t user-friendly for non-technical people. So I went back and forth with ChatGPT-4 through a bunch of queries until I cobbled together a web app that anyone could use. You can see how that turned out in this demo video.

Feel free to try out the app here, but only for a short while before my Google Maps API budget is depleted.

What did I learn from this?

Going through this process of iteration and collaboration with AI was fun, but it also drove home a point that most tech-savvy people are already familiar with: AI can write code that works, but it’s not a full-on substitute for a good software developer. (This means now is still a good time to learn to code!)

Deploying even the simplest of apps involves a maze of tools and systems. It’s not just about the code. In my case, I had to set up a Google developers’ account to be able to use their maps technology. (This involved going through their documentation when the GPT-written code turned out to be out of date!) I also had to research and debate the merits of various hosting providers for the app before deciding which one to use. Additionally, I had to buy a domain name and link it to my servers. And then of course, I couldn’t forget the basics, like setting up analytics and regularly backing up the app code on Github, among other steps.

Of course, people who do this work daily find it trivial in a technical sense. However, even seasoned software developers grumble about how time-consuming it is to get all these tools and platforms working together for a public-facing app.

Simply put, you can’t fully automate the process of building things that will be used in the real world by real people. We’re not there yet. But what tools like GPT can do is speed up your prototyping process. Furthermore, if you have a touch of technical know-how, you can quickly automate a variety of personal tasks that don’t need to be public or require a full-fledged app. To me, that’s enough reason to be optimistic about how generative AI will meaningfully impact global productivity in the years to come.

The Personal Brand Delusion

(Photo by Jonas Stolle on Unsplash)

Excessive focus on a personal brand is terrible. 

Personal branding — assuming brand can even be applied to a complex human being in the same way it can to a commodity — is mostly a by-product of something else: making valuable and meaningful contributions in your areas of interest.

People who tirelessly and directly work on personal brand often do so at the expense of other activities that matter. And at the extreme end, there are some who go as far as fraud just to build a name for themselves. (See this list of fraudulent Forbes 30 under 30 candidates for example).

Paradoxically, and as this paper about the ‘Best-New-Artist Grammy Nomination Curse’ puts it, if you seek recognition directly, you probably won’t do your best work. That’s because people-pleasing antics and unrealistic versions of success are a distraction from what really matters. In contrast, if you care less about public opinion and awards, you may find yourself producing better and more original work. 

For these reasons I’m not sold on the idea of spending lots of time on a “personal brand”, especially if it precedes any meaningful contributions from an individual. Attempts to establish a personal brand without genuine achievement are, at best, fruitless busy work and, at worst, delusional. 

Only a select few can pull off a personal brand. It emerges in its strongest form after incredible achievement. Beyoncé, Michael Jordan, Steve Jobs: each of these individuals has a great personal brand. But in all cases, their mastery of craft preceded mastery of image.

Chances are, you won’t tread the same path as Bey, Mike, or Steve. That’s okay. It means you can skip the personal branding frenzy. Instead, focus on doing exceptional work and contribute meaningfully in your areas of interest. Your reputation will naturally grow and you won’t have to rely on a contrived personal image to open doors. Let the superstars have their personal branding. For the rest of us, there are more impactful ways to invest our time.