nlp course, nlp, tutorials,

NLP course Week 5 - Scoping an NLP project

Ben Ben Follow Jun 08, 2020 · 1 min read
Share this

The second half of the my course at Harvard Extension will focus on applications of the theory we cover in the first half. That means the notebooks are mainly going to be walkthroughs of actual use-cases in NLP. I’m excited to be doing this, I find I learn a lot from implementation and that it gives me a more complete picture of what it means to use these methods.

Week 5 notebook.

In this example use-case we make use of the movie spoilers dataset from IMDB+Kaggle. The purpose is to try a set of methods for predicting whether a review is a spoiler or not. We start by just using genre-adjusted probabilities and move on to models based on some of the feature sets we’ve explored.

The interesting thing here is that just with a bag of words + SVM we get pretty reasonable performance (~78% accuracy). I end up spending a bunch of time getting BERT document-level representations for this same SVM model and only marginally increasing performance.

In the lecture I talk about balancing the cost and the return of a particular method. Cost is mainly in terms of the resources required and the technical debt incurred. Returns is a bit more fuzzy, since it’s dependent on one’s specific use-case.

The conclusion I’d come to here from just this initial exploration is that it probably doesn’t make sense to go state-of-the-art here. Unless we have a bunch of time to dedicate to really making the best use out of transformer models, word counts + SVM seem to do pretty well for us.

I don’t include it here, but my goal is to spend some time showing studets how to implement this very simple technique as a “model service”. This is definitely made simpler by using just word counts + SVM vs. having to rely on pretrained BERT models.

I welcome any thoughts and comments you have!

Written by Ben Follow
I am the type of person who appreciates the telling of a great story. As a data scientist, I am interested in using AI to understand, explore and communicate across borders and boundaries. My focus is on Natural Language Processing and the amazing ways it can help us understand each other and our world. The material on this blog is meant to share my experiences and understandings of complex technologies in a simple way, focused on application.