MIT Statistical Learning Lecture 3 Features and Kernels

Ben Ben Follow Dec 06, 2019 · 2 mins read
Share this

All apologies for the long delay here. Lots of stuff going on including, but not limited to:

But I still have materials from a few more sessions of MIT’s Statistical Learning class. So I’m planning on going through those and then moving on to some exciting NLP-related material.

This session focuses on the idea of kernels and how they can be used to open up opportunities for a variety of models.


Note: From this point on, the slides link may have slightly out of date materials (from 2018), but most of the information matches up.

As a refresher, here is the Expected Risk Minimization (ERM) for regularized least squares:

Regularized ERM for LS

Again, the minimizer for \(w\):

Minimizer for W

This notation is a bit different from what we had in the last lecture. I did a bit of searching and I’m not sure why these \(n\)’s are here, but they continue into the next formulas, so I am keeping them.

Anyway, this can be rewritten using the representer theorem

Rewriting the minimizer

Why is this rewrite of f(x) important? Because it shows that the solution really only depends on the inner product of x. And why does that matter? Because you could use many types of transformations of x and you’d just have to solve in terms of those transformed vectors rather than the original vectors. This includes nonlinear transformations:

The nonlinear formulation of f(x) can be written as:

Nonlinear formulations

The formulation on the left is a mapping of each x value to another value in a reprojected space. Typically, that space is higher-dimensional as it enables you to use a greater variety of functions to fit the data (rather than just a line).

The inner product of these transformed vectors is referred to the kernel. So we don’t actually need to use the individual points themselves, but the inner product of the points. So long as we know that value, we don’t really need to worry about the data points themselves or even, really, the transformation. This is referred to as the “kernel trick” and I’m drastically oversimplifying it here. There’s a clearer description of the intuition here and a more math-y explanation here.

The upshot here is basically by using this “kernel trick”, we are able to transform data our data in ways that open up opportunities for better models. What sorts of better models, you ask? Well, stay tuned.

Written by Ben Follow
I am the type of person who appreciates the telling of a great story. As a data scientist, I am interested in using AI to understand, explore and communicate across borders and boundaries. My focus is on Natural Language Processing and the amazing ways it can help us understand each other and our world. The material on this blog is meant to share my experiences and understandings of complex technologies in a simple way, focused on application.