
NLP course Week 2 - Vectorization

Ben · Mar 31, 2020 · 1 min read

Continuing with the release of some of the notebooks from my upcoming course at Harvard Extension, here is week 2, focusing on going from tokens to vectors.

Week 2 notebook.

The idea here is that we’re moving from tokenization to vectorization: turning text into vectors of information based on the tokenization design we’ve already done.

The topics covered here are:

  • Count vectorization
  • Term Frequency-Inverse Document Frequency
  • Cosine similarity
  • Non-negative Matrix Factorization
  • Latent Dirichlet Allocation
  • Using all these vectors in a simple supervised learning problem
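The notebook’s own code isn’t reproduced here, but the first three topics can be sketched in a few lines with scikit-learn (the toy documents below are my own, not the course’s):

```python
# A minimal sketch of count vectorization, TF-IDF, and cosine similarity
# using scikit-learn; the notebook may use different data and settings.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "natural language processing turns text into vectors",
]

# Count vectorization: each document becomes a vector of raw token counts.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF: counts reweighted so tokens common across the corpus count for less.
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity: compare documents by the angle between their vectors.
sims = cosine_similarity(tfidf)
print(sims.shape)  # (3, 3) pairwise similarity matrix
```

Here the two “sat on the” sentences end up far more similar to each other than either is to the third document, which shares no vocabulary with them.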

This introduces a lot of material and touches on a number of very different areas (e.g. topic models, supervised learning, information extraction). But, in my view, word count vectors and topic model vectors are just different ways of turning raw text into vectors of information. They yield really important information on their own, and the comparison between them can be a pretty solid analysis on its own.

Each of these informative vectors is useful for supervised learning efforts (e.g. classification) and unsupervised learning efforts (e.g. clustering). And it makes sense to have all of these tools in one’s toolkit to see how the different representations yield different results on these next-stage efforts.
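As a sketch of the supervised side, any of these vectorizers can be dropped in front of a classifier; here TF-IDF feeds a logistic regression (the labels and texts are toy examples I made up, not from the notebook):

```python
# A minimal sketch of using text vectors in a supervised learning problem:
# TF-IDF features piped into a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great movie, loved it",
    "wonderful acting",
    "terrible plot",
    "awful and boring",
]
labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes and classifies in one step; swapping the
# vectorizer (counts, TF-IDF, topic vectors) changes the representation
# without changing the downstream code.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
prediction = clf.predict(["loved the acting"])[0]
```

With real data you would of course hold out a test set; the point here is only how the vector representation plugs into the next-stage effort.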

Again, I value any thoughts/feedback.

See (read?) you next time!

Written by Ben
I am the type of person who appreciates the telling of a great story. As a data scientist, I am interested in using AI to understand, explore and communicate across borders and boundaries. My focus is on Natural Language Processing and the amazing ways it can help us understand each other and our world. The material on this blog is meant to share my experiences and understandings of complex technologies in a simple way, focused on application.