
Tokens to Transformers: An update to Bagging to BERT

Ben · Oct 04, 2023

I swear, I’m still here! The past few weeks have been a bit of madness with job transition, travel, etc.

One of the things I was working on was putting together an update to my Bagging to BERT tutorial for a presentation to the Census’ Center for Optimization and Data Science. Because every talk needs to have a catchy name, I called it:

Tokens to Transformers

I’m pretty happy with how it came out. I ended up spending a lot more time talking about tokenization strategies and how, in my opinion, they constitute an example of “transfer learning”.

This is how I visualized it:

[Image: Tokenization as transfer learning]

At “training” time, you create a vocabulary containing all the tokens you expect your system to encounter. There’s no practical way to capture every possible token, so the assumption is that the text you’ve fit your vocabulary to is a good representation of the text you’ll be processing at inference time.

That sounds a lot like transfer learning to me.
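Here’s a minimal sketch of that idea. The corpus, vocabulary size, and `[UNK]` handling are illustrative assumptions, not code from the tutorial itself:

```python
# Sketch: a vocabulary "trained" on one corpus, then reused at inference time.
from collections import Counter

def build_vocab(corpus, max_size=10):
    """'Training' time: fit a vocabulary to the text you expect to see."""
    counts = Counter(token for text in corpus for token in text.lower().split())
    vocab = {"[UNK]": 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Inference time: anything the vocabulary never saw collapses to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]

train_corpus = ["the model reads the text", "the text is tokenized"]
vocab = build_vocab(train_corpus)

print(encode("the model reads the text", vocab))     # every token is known
print(encode("the transformer reads emoji", vocab))  # unseen words -> [UNK]
```

How well the second sentence gets represented depends entirely on how similar the inference-time text is to the text the vocabulary was fit on, which is exactly the bet transfer learning makes.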

I like this way of thinking about it. Tokenization, from what I’ve seen, is often hand-waved away as a little pre-processing step. But, in the words of Bush, it’s the little things that kill. One theorized reason the big bad GPTs sometimes struggle with math is tokenization: because GPT is not specifically trained to tokenize the components of mathematical expressions, it struggles to answer questions about them.

From my presentation:

[Image: GPT struggles with math]
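You can see the tokenization side of this for yourself. The snippet below is a quick illustration, assuming the Hugging Face `transformers` package (and access to the public "gpt2" tokenizer files) is available; the specific numbers are just examples:

```python
# Sketch: how a GPT-style BPE tokenizer chops up numbers.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

for text in ["1000 + 1000", "998 + 1002", "3.14159"]:
    # Similar quantities can be split into very different pieces, so the model
    # never sees digits in a consistent, place-value-aligned way.
    print(text, "->", tokenizer.tokenize(text))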

As usual, Jay Alammar does an excellent job breaking this down.

More to come! I’m hoping to have more time to write now that I’m kind of a free agent :)

Ben
Written by Ben
I am the type of person who appreciates the telling of a great story. As a data scientist, I am interested in using AI to understand, explore and communicate across borders and boundaries. My focus is on Natural Language Processing and the amazing ways it can help us understand each other and our world. The material on this blog is meant to share my experiences and understandings of complex technologies in a simple way, focused on application.