Making an annotation tool at CodexHack

I spent last weekend at CODEX Hackathon working on a tool called LitRen. LitRen is meant to make ebooks editable and annotatable. The idea behind this project was that editable ebooks would help people who write fan fiction: fanfic authors could insert their ideas and stories into ebooks, or even modify the existing text as they like. With public domain books, entirely reimagined versions of the story could be created by modifying and expanding on the original text.

Making this annotation tool required some heavy lifting in JavaScript. We adapted some existing JavaScript annotation tools to work with ebooks. As much of the JavaScript was over my head, my small contribution was to watch others code and to build a responsive landing page with Bootstrap.

By the end of the weekend, the basic annotation/editing tool was working, though a lot of supporting functionality remained unbuilt. That was fine; our goal was to have a proof of concept. Anyhow, it was a fun project and a great weekend!

Posted in ebooks, hackathon | Comments closed

Archiving with TCAT

For quite some time now, our library has been archiving tweets about our college using twarc. This has been fine, so I hadn’t really dug any deeper into the world of archiving bots until earlier this week when my colleague Shawna Brandle approached me about using TCAT, the Twitter Capture and Analysis Toolset.

TCAT has a lot more packaging than twarc: from installation scripts, to a GUI, to extensive reports. Once it’s set up, it is a lot more user friendly than twarc. If you want a web-based Twitter archiving tool that doesn’t require any command line knowledge of its users, TCAT is a good choice. The biggest technical hurdle is getting TCAT running on a Linux (virtual) server. I set it up on AWS, with help from the good installation documentation.

TCAT offers a great opportunity to give your colleagues a tool to archive tweets. Beyond that, it provides a lot of ways to analyze and export collections of tweets. It’s got a lot more overhead — as well as more user-focused functionality — than twarc.

Posted in archives, tcat, twarc, twitter | Comments closed

Integrating open source projects in our library

Recently, our library was considering adopting Augur, a CUNY-made open source reference desk transaction tracking program. It’s a nice program that fills a very specific niche function. We tested Augur at our library for a couple of weeks. Yet despite its niftiness, we didn’t implement it at Kingsborough. This was mainly because it added an unnecessary layer of complication to what is currently a very simple, manual process for keeping reference desk statistics.

I wasn’t too disappointed that the project didn’t go ahead. Manual reference desk tracking has been working fine at Kingsborough, and in some regards there is no reason to interfere with that workflow.

Yet evaluating Augur got me thinking about the value of open source projects for libraries. I found myself revisiting some of the old tropes of open source advocacy: projects like Augur can provide opportunities for us to expand our technical skill sets. They allow us to build collaboratively and to contribute actively to open projects. And so on.

Even though these issues are not new, they are still important in our libraries. Building our librarians’ skills is an important long term goal, as is creating software that benefits libraries. So I hope that we can find opportunities to integrate open source tools that meet the needs of our librarians and our communities.

Posted in open source | Comments closed

I made my own altmetric

I’m waiting for one of my colleagues to lend me some books on bibliometrics. However, in the meantime, in my naïveté, I have created a metric[1].

My metric is not a terribly good one, though perhaps it is no worse than some other well-established ones. While it somewhat defensibly measures reach and productivity, my metric also fails on other fronts.

I’ve called my metric gh-index. It works on the same math as h-index, which is fairly widely known. I’ve translated the logic of h-index to evaluate GitHub stars. The (questionable) assumption is that GitHub stars are the open source software equivalent to academic citations.

It’s a comparison that is kind of interesting. To be clear, I’m not trying to make OSS contributions equivalent to academic citations. But my rhetorical point is that GitHub stars and scholarly citations are both hard-earned recognition, even though they represent very different types of labor.

So, you can calculate the gh-index of any GitHub user here. This web tool queries the GitHub API and parses the resulting data to calculate the gh-index. Also, if you’re interested, you can see the code here.
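
For anyone curious about the math, here is a minimal sketch of the h-index logic applied to star counts. It is not the code behind the web tool; the function name gh_index and the sample numbers are made up for illustration, and it assumes you already have one star count per repository (for example, from the stargazers_count field in the GitHub API's repository listings).

```python
def gh_index(star_counts):
    """Largest h such that at least h repositories have h or more stars each."""
    # Rank repositories from most-starred to least-starred.
    ranked = sorted(star_counts, reverse=True)
    h = 0
    for rank, stars in enumerate(ranked, start=1):
        if stars >= rank:
            h = rank  # the top `rank` repos all have at least `rank` stars
        else:
            break  # counts only get smaller from here, so stop
    return h


# Hypothetical star counts for one user's repositories.
example_counts = [10, 4, 3, 1, 0]
print(gh_index(example_counts))
```

With these made-up counts, three repositories have at least three stars each, but there are not four with at least four, so the gh-index is 3.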

[1] Here is an example of a much more well thought out analysis.

Posted in git, metrics | Comments closed

DIY Twitter analytics

Our library uses Twitter (@kbcclibrary) to communicate with our students and faculty. Along with our tweeting, we rely on metrics to keep tabs on our Twitter presence. We get these metrics exclusively from free tools: the native Twitter Analytics page, but also third party analytics sites like Tweetstats and the free version of (the unfortunately named) SocialBro.

I like Tweetstats, because it is clearly a passion-project of one developer. For some time, I regularly used Tweetstats’ “tweets by month” chart to make sure we were on track to meet self-imposed targets of tweets for the month.

But problems came up in December and January when Tweetstats stopped working reliably. It became so inconsistent that it was unusable. When I learned that the Tweetstats creator was trying to sell the site, I basically gave up on the service. I reluctantly looked at other free tools that might offer a similar display of tweets per month, and was quickly reminded that the world of third party Twitter analytics sites is pretty unappealing.

An obviously better solution was to build the functionality I needed myself. It would be a good programming challenge, and we’d end up with a home-grown analytics tool. Our library has built tools on the Twitter API previously, so I didn’t need to start from scratch.

Creating a list of tweet dates from the API was not too difficult; what proved more challenging was producing a visual representation of this data. I imported pandas, numpy and matplotlib, all of which were unfamiliar to me. I spent a lot of time messing with pandas dataframes. In the end, the result was a visualization that looks like this:


It’s not pretty, but it is exactly what I needed.
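
The counting step behind a chart like this can be sketched in a few lines. This is not the script I used: it swaps pandas for the standard library to stay self-contained, the function name tweets_per_month is made up, and it assumes the dates arrive as created_at strings in the format the Twitter REST API used at the time (for example "Wed Dec 03 13:08:45 +0000 2014").

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format of the created_at field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_month(created_at_values):
    """Count tweets per calendar month, keyed as 'YYYY-MM' in date order."""
    counts = Counter()
    for value in created_at_values:
        when = datetime.strptime(value, CREATED_AT_FORMAT)
        counts[when.strftime("%Y-%m")] += 1
    # 'YYYY-MM' keys sort chronologically when sorted as plain strings.
    return dict(sorted(counts.items()))


# Hypothetical sample data.
sample = [
    "Wed Dec 03 13:08:45 +0000 2014",
    "Fri Dec 05 09:00:00 +0000 2014",
    "Mon Jan 12 18:30:00 +0000 2015",
]
print(tweets_per_month(sample))
```

The resulting month-to-count mapping is the kind of data that can then be handed to a plotting library to draw a bar per month.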

Posted in twitter, visualization | Comments closed

On trying (and failing) to learn shell scripting

I tried to learn shell scripting in the summer of 2000. It seemed doable: isn’t it basically executing a bunch of shell commands in a row? I got some books, which I read half-heartedly, tried a few things, and then gave up. Shell scripting, which I had hoped would be the easiest entrée into programming, was too hard.

I think part of the problem was the undeveloped state of learn-to-code resources in the early 2000s. I wanted the hand-holding of resources like Codecademy and Treehouse to make those first steps, but if something like that existed in the summer of 2000, I never found it.

Also, I think my approach was conceptually unhelpful. Stringing a bunch of shell commands together does not make a programmer. There were a lot of core programming concepts that I was ignoring entirely. In hindsight, I think it makes more sense to learn core ideas – like variables, loops, functions, booleans, and so on – and bring those concepts back to shell scripting.

With a bit of perspective from time spent learning things like python and javascript in the past year, shell scripting recently began to make much more sense. I now have some ksh scripts automating library processes: like restarting certain programs when needed, or clearing out logs periodically. A shell script, triggered by cron, is much more reliable at doing this than I am. Our library projects benefit from this reliability. Unfortunately, I just took the very long road to finally being able to write those scripts.

Posted in ksh, shell | Comments closed

Visualizing library data

Using Twarc-Report, a tool made by Peter Binkley at the University of Alberta Libraries, I made some visualizations of our library’s archive of twitter data. Here’s one of them:


This shows how the hashtags in various tweets about Kingsborough are related. You can see the full interactive version of that visualization here.

Neat, right? Twarc-Report builds visualizations based on data captured by Twarc. It uses d3.js, a javascript library that provides tools for data-based manipulations of the DOM. Twarc-Report does this nicely, and it prompted me to try something similar with some other library data.

The APIs for Primo, CUNY’s discovery layer, provide interesting data and metadata about searches. Using d3.js and Flask, a Python framework, I made a web tool to visualize some of this information. This tool takes the user’s search terms and parameters, makes an API call to Primo, and passes the resulting data to a d3.js script (adapted from here) to make the visualization. The whole thing produces something that looks like this:


This is a visual rendering of where Kingsborough books with the keyword “president” appear in the Library of Congress classification. You can try the tool yourself here; the code is also on GitHub.
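
The tallying that sits between the API response and the visualization can be sketched roughly as follows. This is an illustration rather than the tool's actual code: the function name lc_class_counts and the sample call numbers are made up, and it assumes the call numbers have already been pulled out of the Primo response as plain strings.

```python
import re
from collections import Counter


def lc_class_counts(call_numbers):
    """Tally call numbers by their leading Library of Congress class letters."""
    counts = Counter()
    for call_number in call_numbers:
        # LC call numbers start with one to three letters, e.g. 'E' or 'JK'.
        match = re.match(r"\s*([A-Z]{1,3})", call_number.upper())
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


# Hypothetical call numbers standing in for search results.
sample = ["E176.1 .K46", "JK516 .P7", "E840.8 .O24", "PS3545"]
print(lc_class_counts(sample))
```

A mapping like this, serialized as JSON, is the sort of structure a d3.js script can turn into a picture of where results fall in the classification.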

Posted in d3, twarc, visualization | Comments closed

The many uses of Git

Git is version control and collaboration software. It’s initially unintuitive and takes some time to learn (command line!), but it’s also powerful, broadly useful and generally awesome. I wish more librarians used Git because of the benefits it could bring to our collaborations.

Git is closely related to GitHub, which makes it possible to share Git repositories much more broadly. Git and GitHub are mostly used for coding projects, but librarians have used them to share lesson plans and to write peer-reviewed articles. (Stephen Zweibel helpfully pointed out to me that academics can get free private GitHub repositories.)

This past semester, I used Git to keep track of my lesson plans. This was useful because Git can divide projects into distinct “branches”, which allow you to work on different variations of the project separately from the “master” branch. I created a “master” lesson plan for my library instruction sessions at the start of this semester, and divided and sub-divided it into individual branches for each class that I taught. Git kept track of all of the changes and variations.

There are a number of places to learn Git. Here in New York City, METRO and the LACUNY Emerging Technologies Committee have recently had workshops on Git. Sometimes local groups have sessions devoted to Git. The Atlassian tutorials are really useful for figuring out the nitty-gritty. And of course you can learn Git on GitHub itself, with step by step tutorials here and here.

Posted in git | Comments closed

Archiving tweets about Kingsborough

Last year, I heard about a python program called Twarc, which was developed by Ed Summers, a software developer at the University of Maryland, to capture and archive tweets. Back in August, Ed demonstrated the effectiveness of this tool by capturing over 13 million tweets about events in Ferguson, MO as they unfolded over the course of 17 days. He blogged about the process here. Twarc brought an archivist’s collecting impulse to events that were happening very quickly, which was not only a brilliant idea, but captured valuable data as well.

The value I saw in Ed’s tool, however, did not have much to do with current events, but rather with my own college. Kingsborough has thousands of students, who are tweeting all the time about their classes, their commutes and the food in the cafeteria, among many, many other things. The immediately evident value of Twarc for me is that it can systematically and continuously archive tweets about Kingsborough. There are obvious benefits to having this kind of archive. Twarc can listen to all of Twitter, all of the time, in a way that is not possible for one librarian, no matter how enthusiastically they use the Twitter advanced search.

Twarc is a python command line tool that requires Twitter API keys. Registering for the Twitter API is fairly easy. Moreover, Ed has documented Twarc quite well, so that not much more is needed to use it than basic knowledge of the command line. After spending some time trying out different Twarc searches, I settled on: --stream 'kbcc kingsborough',\#kbcc,\#kingsborough, >> results.json

Basically this will listen to Twitter continuously for mentions of both the words “kingsborough” and “kbcc” in the same tweet, or tweets containing the hashtags “#kbcc” or “#kingsborough”. The results are continuously added to the end of the results.json file. The JSON output is not particularly human-readable, but when run through a JSON visualizer, or through some of the utility scripts that are provided with Twarc, it turns out to be quite detailed and interesting.
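
To make the matching rule concrete, here is a rough sketch of the logic: commas separate alternatives, and the words within one alternative must all appear. This only approximates what Twitter does on its side and is not part of Twarc; the function name matches_track is made up, and the handling of hashtags (a plain word also matches its hashtag form, while a "#" term matches only the hashtag) is a simplifying assumption.

```python
import re


def matches_track(text, track):
    """Return True if the tweet text satisfies a comma-separated track string."""
    # Keep a leading '#' on tokens so hashtags can be told apart from words.
    tokens = set(re.findall(r"#?\w+", text.lower()))
    # Assumption: a plain term also matches the hashtag form of the same word.
    bare = {token.lstrip("#") for token in tokens}
    for phrase in track.lower().split(","):
        terms = phrase.split()
        if not terms:
            continue  # ignore empty alternatives, e.g. after a trailing comma
        # Every term in this alternative must be present in the tweet.
        if all((t in tokens) if t.startswith("#") else (t in bare) for t in terms):
            return True
    return False


TRACK = "kbcc kingsborough,#kbcc,#kingsborough,"

print(matches_track("Heading to Kingsborough, go KBCC!", TRACK))
print(matches_track("Kingsborough is sunny today", TRACK))
print(matches_track("Finals week #kbcc", TRACK))
```

The first and third sample tweets match; the second does not, because it mentions "Kingsborough" as a plain word without "kbcc" and without either hashtag.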

For our library’s purposes, I want Twarc to be running all of the time, so I run it in tmux on a server where I have a shell account. I also wrote a shell script (periodically triggered by cron) that re-starts Twarc if it is stopped for some reason, such as the server being rebooted. As a result, since I got everything working in February, I have basically been able to leave Twarc running unattended, while it continues to archive tweets about my college.
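
The restart check is simple enough to sketch. My actual script is a shell script; the version below expresses the same idea in Python, and every name in it (is_running, ensure_running, the process pattern and the start command) is a placeholder for illustration. It assumes a system where the pgrep command is available.

```python
import subprocess


def is_running(pattern):
    """True if any process command line matches pattern (relies on pgrep -f)."""
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(check, start):
    """Call start() only if check() reports the process is down.

    Returns True if a restart was attempted, False if nothing was needed.
    """
    if check():
        return False
    start()
    return True


if __name__ == "__main__":
    # Placeholder pattern and start command; substitute your own.
    restarted = ensure_running(
        check=lambda: is_running("twarc"),
        start=lambda: subprocess.Popen(["/path/to/start-archiver.sh"]),
    )
    print("restarted" if restarted else "already running")
```

Run from cron every few minutes, a check like this does nothing while the archiver is up and starts it again after a crash or a reboot. Passing the check and start steps in as functions keeps the decision logic easy to test without touching real processes.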

The tweets archived by Twarc will hopefully end up in the Kingsborough archive, but I think they have other value as well. I hope to post soon about some of the other things we’ve done with the twitter data that we’ve gathered.

Posted in archives, twarc, twitter | Comments closed