As I mentioned in a previous post, I’m going to be teaching a course at Harvard Extension this summer. I figured I’d begin posting some of the material here on the blog in case people are interested in following along.
The focus is on tokenization. Tokens are intended to be some kind of useful semantic unit, but what counts as “useful” and what counts as a “unit” is up to the scientist. Here, I think, things fall heavily into the “art” of data. In this notebook, I try to outline some of the important considerations when tokenizing your text data.
Namely, I focus on the following, with a short code sketch after the list:
- Casing
- Lemmatization/Stemming
- Stop words
- Non-standard tokens (e.g. named entities)
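
To make those choices concrete, here is a minimal sketch of what each one looks like in code. It assumes spaCy with its small English model (`en_core_web_sm`) installed; the example text is made up, and NLTK or any other NLP library would work just as well.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Dr. Smith moved to Boston and started running experiments at Harvard."
doc = nlp(text)

# Casing: lowercase tokens so "Boston" and "boston" count as the same type
lowercased = [tok.text.lower() for tok in doc]

# Lemmatization: reduce inflected forms ("started" -> "start", "running" -> "run")
lemmas = [tok.lemma_ for tok in doc]

# Stop words: drop high-frequency function words ("to", "and", "at", ...)
content_tokens = [tok.text for tok in doc if not tok.is_stop and not tok.is_punct]

# Non-standard tokens: named entities treated as single multi-word units
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(lowercased)
print(lemmas)
print(content_tokens)
print(entities)
```

None of these steps is mandatory; which ones you apply depends on what you want the tokens to do downstream.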
I also touch on creating word counts and setting up some of the more advanced technologies that will be explored later in the course.
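
As a preview of the word-count piece, here is a simple sketch using only Python's standard library; it is an illustration of the idea rather than the exact code from the course notebook.

```python
from collections import Counter

docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Naive whitespace tokenization with lowercasing and punctuation stripped;
# the considerations above (lemmas, stop words, etc.) would slot in here.
def tokenize(text):
    return [tok.strip(".,!?").lower() for tok in text.split()]

counts = Counter(tok for doc in docs for tok in tokenize(doc))
print(counts.most_common(5))
```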
I'd definitely value any thoughts/feedback folks have.
More to come!