Our library started archiving tweets about our college in 2015. At first, the archive ran Twarc, an archiving tool, on SDF. SDF is a hobbyist programming community; it’s a dynamic place, full of enthusiastic tinkerers, with notoriously unreliable infrastructure.
The archive trundled along that way until 2017, when I switched it over to TCAT, the Twitter Capture and Analysis Toolset. This switch was possible because the Amazon Web Services Educate program provided me with ongoing funds to run a server in the AWS cloud. Moreover, the helpful people at the Digital Methods Initiative offer a version of TCAT that is easy to spin up on an AWS server (I used an EC2 nano instance). This setup was much more reliable than SDF.
There were still some bumps along the way. Twitter’s switch from 140-character tweets to 280-character tweets in 2017 required an overhaul of the archive’s database. The end of long-term support for Ubuntu 14.04 also resulted in some disruption as we switched over to Ubuntu 18.04. Nonetheless, the overall trend was toward increased uptime.
But recently we’ve come back around to Twarc. The release of the Twitter v2 API means that our archiving strategy needs to be revisited once again. Twarc has already made great strides in adapting to the v2 API. It is now a pretty impressive piece of software.
So in its latest incarnation, our archive is running Twarc on PythonAnywhere. PA is a cloud service for running Python applications. It allows “always-on tasks”, without the need to run a dedicated server. This is perfect for Twarc. Given that I also quite happily run other unrelated projects on PA, this platform was an obvious choice.
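For readers wondering what an always-on task of this kind might look like, here is a minimal sketch. It is not the script our archive actually runs. The function names, the `BEARER_TOKEN` environment variable, the output file, and the polling interval are all hypothetical, and the use of twarc's `Twarc2` client and its `search_recent` method reflects my understanding of that library rather than anything specific to our setup.

```python
import json
import os
import time


def append_pages(pages, out_path):
    """Append every tweet in an iterable of API response pages to a JSONL file.

    Each page is assumed to be a dict whose optional "data" key holds a list
    of tweet objects. Returns the number of tweets written.
    """
    count = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        for page in pages:
            for tweet in page.get("data", []):
                fh.write(json.dumps(tweet) + "\n")
                count += 1
    return count


def fetch_recent(query):
    """Yield pages of recent tweets matching `query`.

    Needs twarc installed and a bearer token in the (hypothetical)
    BEARER_TOKEN environment variable. The import is done lazily so that
    append_pages can be used and tested without twarc.
    """
    from twarc import Twarc2

    client = Twarc2(bearer_token=os.environ["BEARER_TOKEN"])
    yield from client.search_recent(query)


def run_forever(query, out_path, interval_seconds=900):
    """Poll for matching tweets, append them to a file, sleep, and repeat.

    This sketch does no de-duplication: consecutive polls can return
    overlapping results, so a real archive would need to track the newest
    tweet ID it has already stored.
    """
    while True:
        written = append_pages(fetch_recent(query), out_path)
        print(f"wrote {written} tweets to {out_path}")
        time.sleep(interval_seconds)
```

A task like this would be started with something along the lines of `run_forever("our college name", "tweets.jsonl")`, with the hosting platform responsible for restarting it if it stops.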
One thing I’ve learned since 2015 is that having a Twitter archive means that you’re going to have to regularly overhaul your archiving technologies. Maybe every two years, at best. Things definitely do not stand still.