Archiving tweets about Kingsborough

Last year, I heard about a Python program called Twarc, which was developed by Ed Summers, a software developer at the University of Maryland, to capture and archive tweets. Back in August, Ed demonstrated the effectiveness of this tool by capturing over 13 million tweets about events in Ferguson, MO as they unfolded over the course of 17 days. He blogged about the process here. Twarc brought an archivist’s collecting impulse to events that were happening very quickly. That was a brilliant idea, and it captured valuable data as well.

The value I saw in Ed’s tool, however, did not have much to do with current events, but rather with my own college. Kingsborough has thousands of students, who are tweeting all the time about their classes, their commutes and the food in the cafeteria, among many, many other things. The immediately evident value of Twarc for me is that it can systematically and continuously archive tweets about Kingsborough. There are obvious benefits to having this kind of archive. Twarc can listen to all of Twitter, all of the time, in a way that no single librarian could, no matter how enthusiastically they use Twitter’s advanced search.

Twarc is a Python command-line tool that requires Twitter API keys. Registering for the Twitter API is fairly easy. Moreover, Ed has documented Twarc quite well, so not much more than basic knowledge of the command line is needed to use it. After spending some time trying out different Twarc searches, I settled on:

twarc.py --stream 'kbcc kingsborough',\#kbcc,\#kingsborough >> results.json

This listens to Twitter continuously for tweets that contain both of the words “kingsborough” and “kbcc”, or that contain either of the hashtags “#kbcc” or “#kingsborough”. The results are continuously appended to the end of the results.json file. The JSON output is not particularly human-readable, but when run through a JSON visualizer, or through some of the utility scripts that are provided with Twarc, it turns out to be quite detailed and interesting.
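To get a quick look at what has been collected without a visualizer, something like the following Python sketch might help. It assumes that results.json holds one JSON object per line, and that each tweet carries the "created_at", "text" and "user" / "screen_name" fields used by the Twitter API at the time. The function names iter_tweets and summarize are made up for this illustration and are not part of Twarc.

```python
import json


def iter_tweets(path):
    """Yield one tweet (a dict) per non-empty line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # A line can be cut off if the collector stopped mid-write; skip it.
                continue


def summarize(tweet):
    """Return a one-line description of a tweet dict (field names are assumptions)."""
    user = tweet.get("user", {}).get("screen_name", "unknown")
    text = tweet.get("text", "").replace("\n", " ")
    return f"{tweet.get('created_at', '?')} @{user}: {text}"
```

Looping over iter_tweets("results.json") and printing summarize(tweet) for each one gives a readable line per tweet, and missing fields fall back to placeholders rather than raising errors.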

For our library’s purposes, I want Twarc to be running all of the time, so I run it in tmux on a server where I have a shell account. I also wrote a shell script (periodically triggered by cron) that restarts Twarc if it has stopped for some reason, such as the server being rebooted. As a result, since I got everything working in February, I have basically been able to leave Twarc running unattended while it continues to archive tweets about my college.
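My actual watchdog is a shell script, but the idea can be sketched in Python as well. The version below is only an illustration under a few assumptions: it tracks the collector through a PID file (a hypothetical twarc.pid) rather than through tmux, it runs on a Unix-like system, and the names is_alive and ensure_running are invented for the example.

```python
import os
import subprocess


def is_alive(pid_file):
    """Return True if pid_file names a process that currently exists."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ensure_running(pid_file, command, output_path):
    """Start command, appending its output to output_path, unless it is alive.

    Returns True if a new process was started, False otherwise.
    """
    if is_alive(pid_file):
        return False
    with open(output_path, "ab") as out:
        # start_new_session detaches the child so it outlives the cron job.
        proc = subprocess.Popen(command, stdout=out, start_new_session=True)
    with open(pid_file, "w") as handle:
        handle.write(str(proc.pid))
    return True
```

A cron job could then call ensure_running("twarc.pid", ["twarc.py", "--stream", "kbcc kingsborough,#kbcc,#kingsborough"], "results.json") every few minutes. One caveat of the PID-file approach is that a stale PID can be reused by an unrelated process after a reboot, so a more careful version would also check the process name.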

The tweets archived by Twarc will hopefully end up in the Kingsborough archive, but I think they have other value as well. I hope to post soon about some of the other things we’ve done with the Twitter data that we’ve gathered.
