As I mentioned in a previous post, I’m going to be teaching a course at Harvard Extension this summer. I figured I’d begin posting some of the material here on the blog in case people are interested in following along.
The focus is on tokenization. Tokens are intended to be some kind of useful semantic unit, but what counts as “useful” and what counts as a “unit” is up to the scientist. Here, I think, things fall heavily into the “art” of data. In this notebook, I try to outline some of the important considerations when tokenizing your text data.
Namely, I focus on the following, with a short code sketch after the list:
- Casing
- Lemmatization/Stemming
- Stop words
- Non-standard tokens (e.g. named entities)
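
To make those choices concrete, here is a minimal sketch of what each one looks like in code. It assumes spaCy with its small English model (`en_core_web_sm`) installed; the example text is made up, and NLTK or any other NLP library would work just as well.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Dr. Smith moved to Boston and started running experiments at Harvard."
doc = nlp(text)

# Casing: lowercase tokens so "Boston" and "boston" count as the same type
lowercased = [tok.text.lower() for tok in doc]

# Lemmatization: reduce inflected forms ("started" -> "start", "running" -> "run")
lemmas = [tok.lemma_ for tok in doc]

# Stop words: drop high-frequency function words ("to", "and", "at", ...)
content_tokens = [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]

# Non-standard tokens: named entities treated as single multi-word units
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(lowercased)
print(lemmas)
print(content_tokens)
print(entities)
```

None of these steps is mandatory; which ones you apply depends on what you want the tokens to do downstream.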
I also touch on creating word counts and setting up some of the more advanced technologies that will be explored later in the course.
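
As a preview of the word-count piece, here is a simple sketch using only Python's standard library; it is an illustration of the idea rather than the exact code from the course notebook.

```python
from collections import Counter

docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Naive whitespace tokenization with lowercasing and punctuation stripped;
# the considerations above (lemmas, stop words, etc.) would slot in here.
def tokenize(text):
    return [tok.strip(".,!?").lower() for tok in text.split()]

counts = Counter(tok for doc in docs for tok in tokenize(doc))
print(counts.most_common(5))
```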
I'd definitely value any thoughts/feedback folks have.
More to come!