Reddit Event Clustering

Jumping on the bandwagon in an interesting way

As my final project for Advanced Data Science, I chose to analyze a then-recent event from a data science perspective: the January-February 2021 short squeeze of GameStop, AMC, and BlackBerry, driven by the community of the r/WallStreetBets subreddit. The gist of the project was to scrape the posts from r/WallStreetBets for the previous several months, perform a number of neat data science tricks on the text of each post to turn it into numerical data, then use that data to try to group the posts into clusters of similar content. My vague hypothesis was that events like the squeeze of GME would naturally show up as clusters of similar posts concentrated in time, and that with enough monitoring and enough clusters, one could understand what was happening in the community and detect new events as they emerged.

For a lot of reasons, this was a moonshot. The way I ended up representing the text of the posts made the data very high-dimensional, and the classic weakness of clustering algorithms is that they become less effective as the number of dimensions grows, because distances between points become less informative. Surprisingly, though, I did find some of the temporal patterns I was looking for, and although I didn't dig much deeper (sentiment analysis of the posts, etc.), I still think the project provided some interesting insights into a very chaotic environment.

Those who are interested can read the full writeup here. It intentionally glosses over the explanation of the shingling/minhashing technique I used, but it gives a good picture of the rest of the process and the results.
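
For readers who haven't seen shingling and minhashing before, here is a minimal sketch of the general idea. It is not the project's code. The shingle size (3 words), the number of hash functions (128), the use of hashlib's BLAKE2b as the hash family, and all function names are assumptions chosen for illustration. Each post is broken into overlapping word shingles, and the minhash signature keeps the smallest hash value per seeded hash function, so that the fraction of matching signature positions approximates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in text (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min((seeded_hash(seed, s) for s in shingle_set), default=2 ** 64)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Made-up example posts, not taken from the scraped data.
    post_a = "gme to the moon hold the line diamond hands forever"
    post_b = "gme to the moon hold the line paper hands never"
    sa, sb = shingles(post_a), shingles(post_b)
    print("exact:    ", exact_jaccard(sa, sb))
    print("estimated:", estimated_jaccard(minhash_signature(sa), minhash_signature(sb)))
```

The fixed-length signatures are what a clustering step would then consume in place of the raw text. More hash functions give a tighter estimate at the cost of longer signatures.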