More notebooks from my upcoming course at Harvard Extension: here is week 2, which focuses on going from tokens to vectors.
In this notebook, we build on the previous two weeks by considering one of the main sources of information in natural language: sequence. Our previous methods essentially turn text into bags of words, but we know, intuitively, that the order of words changes the meaning.
Think of this example:
“Dog bit man”
“Man bit dog”
The word count vectors for these would be exactly the same. But we know, on reading them, that these aren’t the same. For the first, we’d call animal control. For the second, well, maybe PETA?
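To make that concrete, here is a minimal sketch showing that the count vectors for the two sentences are identical. Using scikit-learn’s `CountVectorizer` here is my assumption for illustration, not necessarily what the notebook uses.

```python
# Minimal sketch: bag-of-words counts cannot tell these sentences apart.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bit man", "man bit dog"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man']
print(X)                                   # [[1 1 1]
                                           #  [1 1 1]]  <- identical rows
```

Both rows come out as `[1 1 1]`, so any model that sees only these counts has no way to recover who bit whom.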
The topics covered here are:
- Recurrent Neural Networks
  - With one-hot word encoding
  - With embeddings
- Word Embeddings
  - Extracted from topic models (NMF and LDA)
  - Pre-trained (GloVe; see the sketch below)
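As a taste of the pre-trained embedding piece, here is a rough sketch of loading GloVe vectors into a PyTorch `nn.Embedding` layer. The file name (`glove.6B.100d.txt`), the 100-dimensional size, and the toy vocabulary are assumptions for illustration; the notebook’s actual loading code may differ.

```python
# Sketch: load pre-trained GloVe vectors into an nn.Embedding layer.
# The GloVe file path and the toy vocab are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, vocab, dim=100):
    """Build a |vocab| x dim weight matrix, filling rows from the GloVe file."""
    weights = np.random.normal(scale=0.1, size=(len(vocab), dim))  # random init for OOV words
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                weights[vocab[word]] = np.asarray(values, dtype=np.float32)
    return torch.tensor(weights, dtype=torch.float32)

vocab = {"<pad>": 0, "dog": 1, "bit": 2, "man": 3}
emb_matrix = load_glove("glove.6B.100d.txt", vocab, dim=100)
embedding = nn.Embedding.from_pretrained(emb_matrix, freeze=False)
```

Setting `freeze=False` lets the pre-trained vectors keep training along with the rest of the model; freezing them is the other common choice.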
This is a pretty code-heavy notebook, much of it adapted from a really great example implementation of an LSTM in PyTorch. I tried to document it extensively, and the plan is to walk through it in more detail during the course. The implementation is pretty simple, but I think it covers, at a high level, the main considerations in designing one of these models.
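For a sense of the model’s shape before opening the notebook, here is a stripped-down sketch of an embedding-plus-LSTM classifier in PyTorch. The names, layer sizes, and single-output sigmoid head are my illustrative assumptions, not the notebook’s exact architecture.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Toy embedding -> LSTM -> linear classifier (illustrative sketch only)."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                    # x: (batch, seq_len) of token ids
        embedded = self.embedding(x)         # (batch, seq_len, embed_dim)
        output, (hidden, cell) = self.lstm(embedded)
        last_hidden = hidden[-1]             # last layer's final hidden state: (batch, hidden_dim)
        return torch.sigmoid(self.fc(self.dropout(last_hidden))).squeeze(1)

model = SentimentLSTM(vocab_size=5000)
dummy_batch = torch.randint(0, 5000, (8, 20))   # 8 sequences of 20 token ids
print(model(dummy_batch).shape)                 # torch.Size([8])
```

The main design considerations show up even in this toy version: the embedding size, the hidden size and number of LSTM layers, where to apply dropout, and which hidden state to feed the classifier.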
One thing I found is that this notebook runs really easily on Google Colaboratory with a GPU runtime. For comparison, on my local machine it took ~30 minutes to run five epochs; on the Colab GPU, it took less than a minute. I’ll definitely be recommending that the class run it on Colab.
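The GPU speedup comes almost for free in PyTorch; the only change is moving the model and each batch to the CUDA device when one is available. A minimal sketch of that pattern (assuming the `model` from the sketch above):

```python
import torch

# Use the Colab GPU when it's available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Inside the training loop, each batch moves to the same device:
# inputs, labels = inputs.to(device), labels.to(device)
```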
Let me know what you think!
More to come.