
NLP course Week 1 - Tokenization

Ben · Mar 10, 2020 · 1 min read

As I mentioned in a previous post, I’m going to be teaching a course at Harvard Extension this summer. I figured I’d begin posting some of the material here on the blog in case people are interested in following along.

Week 1 notebook.

The focus is on tokenization. Tokens are intended to be some kind of useful semantic unit, but what counts as "useful" and what constitutes a "unit" is up to the scientist. Here, I think, things fall heavily into the "art" of data. In this notebook, I try to outline some of the important considerations when tokenizing your text data.
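To make the "up to the scientist" point concrete, here is a minimal sketch (not from the notebook itself) of a regex-based tokenizer. The pattern is one choice among many; a different pattern yields different "units":

```python
import re

def tokenize(text):
    # Keep runs of letters, allowing an internal apostrophe so
    # contractions like "don't" stay as one token. Punctuation and
    # digits are simply dropped -- a deliberate (and debatable) choice.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

tokens = tokenize("Don't count the chickens; they aren't hatched.")
# → ["Don't", 'count', 'the', 'chickens', 'they', "aren't", 'hatched']
```

Swapping in a different pattern (say, one that keeps digits or splits on apostrophes) changes every downstream count, which is exactly why tokenization deserves a whole week.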

Namely, I focus on:

  • Casing
  • Lemmatization/Stemming
  • Stop words
  • Non-standard tokens (e.g., named entities)
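As a toy illustration of the first three considerations in that list, here is a sketch of a preprocessing pipeline. The stop-word set and suffix rules are illustrative stand-ins I made up for this example, not a real stemmer:

```python
# Illustrative stop-word list -- real lists (e.g. NLTK's) are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}

def crude_stem(token):
    # Naive suffix stripping; a real stemmer (e.g. Porter) is far subtler,
    # as the output "runn" below demonstrates.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tokens):
    lowered = [t.lower() for t in tokens]               # casing
    kept = [t for t in lowered if t not in STOP_WORDS]  # stop words
    return [crude_stem(t) for t in kept]                # stemming

print(preprocess(["The", "dogs", "are", "running", "to", "the", "park"]))
# → ['dog', 'are', 'runn', 'park']
```

Each step throws away information (case, frequent words, inflection) in the hope that what remains is a better "unit" for the task at hand, and each is a judgment call.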

I also touch on creating word counts and setting up some of the more advanced technologies that will be explored later in the course.
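The word-count part is straightforward once you have tokens; a minimal version using the standard library might look like this (the example sentence is my own, not from the notebook):

```python
from collections import Counter

# Count token frequencies -- the starting point for bag-of-words
# representations explored later in the course.
tokens = ["to", "be", "or", "not", "to", "be"]
counts = Counter(tokens)

print(counts["to"])  # → 2
print(counts["be"])  # → 2
```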

I'd definitely value any thoughts or feedback folks have.

More to come!

Written by Ben
I am the type of person who appreciates the telling of a great story. As a data scientist, I am interested in using AI to understand, explore and communicate across borders and boundaries. My focus is on Natural Language Processing and the amazing ways it can help us understand each other and our world. The material on this blog is meant to share my experiences and understandings of complex technologies in a simple way, focused on application.