Day 10 of 100 Days of AI

Logistic Regression. On day 1 I worked with linear regression — a statistical technique for predicting a continuous value from another variable or set of variables. Over the last few days I’ve been wrestling with logistic regression. This is a statistical technique that classifies data into a category, and it also provides a probability for that classification. Examples include:

  • Customer retention (will a customer churn or retain?)
  • Relevant prescription (should we issue Drug A, or Drug B, or Drug C to a patient?)
  • Machine performance (given certain conditions will a machine fail or continue to operate?)

I built up an intuition for how this works with the help of this YouTube video. The images below highlight the process well without getting into detailed maths.

The key bit: instead of a straight line (as is the case with linear regression), in logistic regression we fit a “sigmoid curve” to the data. This curve has an S-shape, as you can see below.
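To make the shape concrete, here is a minimal sketch of the sigmoid function in Python (my own illustration, not something from the video):

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1): the S-shaped curve."""
    return 1 / (1 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and the midpoint of the "S" sits at z = 0.
print(sigmoid(-6))  # close to 0
print(sigmoid(0))   # exactly 0.5
print(sigmoid(6))   # close to 1
```

Whatever number you feed in, the output lands between 0 and 1, which is exactly what you want for a probability.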

The next image shows that if we look at x-values in the range of 1–2, we have 4 healthy people and 1 unhealthy person, so the chance of being ill is 20% (1 in 5). In the x-value range of 4–5, we have 1 healthy person and 6 unhealthy people, so the chance of being ill is roughly 86% (6 in 7).
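The arithmetic behind those two ranges can be written out directly (the counts are the fictitious ones from the video's example):

```python
# Counts of healthy vs ill people per x-value range (fictitious data).
bins = {
    "1-2": {"healthy": 4, "ill": 1},
    "4-5": {"healthy": 1, "ill": 6},
}

for rng, counts in bins.items():
    total = counts["healthy"] + counts["ill"]
    p_ill = counts["ill"] / total  # fraction of people in this range who are ill
    print(f"x in {rng}: P(ill) = {p_ill:.0%}")
```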

However, chunking the data into ranges with vertical lines isn’t precise enough, so we fit an “S” curve instead. On the right chart below, you can see how a sigmoid curve lets you take any x-value and estimate the probability of illness. (The data here is fictitious.)

There’s a bunch of maths that turns the above into a machine learning model that can take multiple independent variables and generate a classification prediction. This includes “gradient descent”, a term you’ll hear a lot when exploring machine learning. But I won’t be digging into it here. I just want to build some models and learn the basics. That’s what I do next.
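For completeness, here is a toy sketch of what gradient descent does for logistic regression: it repeatedly nudges the curve’s slope and offset to better fit the data. The data points and learning settings below are made up for illustration; this is not a production approach.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Fictitious 1-D data: (x-value, label), where 0 = healthy and 1 = ill.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]

# Gradient descent: nudge the weight w and bias b downhill on the loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y  # prediction error for this point
        grad_w += err * x
        grad_b += err
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

# The fitted sigmoid now estimates P(ill) for any x-value.
print(f"P(ill | x=1.0) = {sigmoid(w * 1.0 + b):.2f}")  # low probability
print(f"P(ill | x=5.0) = {sigmoid(w * 5.0 + b):.2f}")  # high probability
```

Libraries like scikit-learn hide all of this behind a single `fit()` call, which is what I plan to lean on next.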