Continuing with the release of some of the notebooks from my upcoming course at Harvard Extension, here is week 2, focusing on going from tokens to vectors.
The idea here is we’re moving from tokenization to vectorization: turning text into vectors of information based on the tokenization design we’ve already done.
The topics covered here (with a few quick code sketches after the list) are:
- Count vectorization
- Term Frequency-Inverse Document Frequency
- Cosine similarity
- Non-negative Matrix Factorization
- Latent Dirichlet Allocation
- Using all these vectors in a simple supervised learning problem
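To make the first few of these concrete, here is a minimal sketch. It assumes scikit-learn and uses a tiny made-up corpus purely for illustration; the notebooks themselves may use different data and settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny toy corpus, just for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Count vectorization: each document becomes a vector of raw token counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)     # sparse (n_docs x n_vocab) matrix

# TF-IDF: counts reweighted to down-weight terms that appear in many documents
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

# Cosine similarity: how close are two document vectors, ignoring length?
print(cosine_similarity(X_tfidf))            # (n_docs x n_docs) similarity matrix
```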
This introduces a lot of material and touches on a number of very different areas (e.g. topic models, supervised learning, information extraction). But, in my view, word count vectors and topic model vectors are just different ways of turning raw text into vectors of information. Each yields really important information on its own, and the comparison between them can be a pretty solid analysis in itself.
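The topic-model side of that comparison can be sketched the same way. This is a hedged illustration rather than the notebook code, following the common convention of fitting NMF on TF-IDF vectors and LDA on raw counts; both produce a lower-dimensional document-by-topic vector for each document.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Assumes X_counts and X_tfidf from the sketch above
n_topics = 2                                 # arbitrary choice for the toy corpus

# NMF typically works well on TF-IDF weighted vectors
nmf = NMF(n_components=n_topics, random_state=0)
doc_topics_nmf = nmf.fit_transform(X_tfidf)  # (n_docs x n_topics)

# LDA is a probabilistic model over raw counts
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics_lda = lda.fit_transform(X_counts) # rows are topic distributions

print(doc_topics_nmf.shape, doc_topics_lda.shape)
```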
Each of these informative vectors is useful for supervised learning efforts (e.g. classification) and unsupervised learning efforts (e.g. clustering). And it makes sense to have all of these tools in one’s toolkit to see how the different representations yield different results on these next-stage efforts.
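As a rough sketch of that last point (with hypothetical labels invented for the toy corpus), you can drop any of these representations into the same downstream model and compare:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical labels for the toy corpus above, purely illustrative
y = [0, 0, 1]

# Supervised: the same classifier, swapping in different text representations
for name, X in [("counts", X_counts), ("tfidf", X_tfidf), ("nmf", doc_topics_nmf)]:
    clf = LogisticRegression().fit(X, y)
    print(name, clf.score(X, y))             # in-sample fit, just to compare representations

# Unsupervised: cluster documents using one of the representations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_topics_lda)
print(km.labels_)
```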
Again, I value any thoughts/feedback.
See (read?) you next time!