Fuzzy string matching

A few months ago, I wrote about a tool I made called the Fictograph, which graphs the awesomeness of authors’ works over time. It leans heavily on data from the Goodreads API. I expected the Goodreads API to be reliable, but it turns out it has some design problems. For example, if you query an author name with a minor spelling mistake, you sometimes get back data on a random author who is totally unrelated to your search.

This behavior is irritating to users, who get results for a different author than they intended. Plus, Fictograph users aren’t going to have much sympathy for my whining and blaming the underlying API for the problem. So I needed to find a programmatic way to compensate for this unhelpful API behavior.

I was stuck on this problem until I saw a presentation at PyGotham that touched on fuzzy string matching. This was a plausible solution, as fuzzy string matching can evaluate whether the name entered by the user is more or less the same as the name returned from the API. If they’re pretty much the same, great! If they’re not, it means the API is probably returning an unexpected result, so the Fictograph should probably return an “author not found” error.

The best part is that I didn’t have to write any string matching code myself; Python has libraries like fuzzywuzzy that will take care of fuzzy string matching for you.
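For a rough idea of what that check looks like, here is a minimal sketch. It is not the Fictograph's actual code: it swaps in `difflib.SequenceMatcher` from Python's standard library in place of fuzzywuzzy so it runs without installing anything. The function names and the 0.8 threshold are assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(requested: str, returned: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two author names."""
    # Lowercase and collapse whitespace so trivial differences don't count.
    a = " ".join(requested.lower().split())
    b = " ".join(returned.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def is_same_author(requested: str, returned: str, threshold: float = 0.8) -> bool:
    """True if the name from the API is close enough to what the user typed."""
    # The 0.8 threshold is an assumed starting point, not a tuned value.
    return name_similarity(requested, returned) >= threshold


if __name__ == "__main__":
    # A minor spelling difference should pass; an unrelated author should not.
    print(is_same_author("Ursula K. Le Guin", "Ursula K LeGuin"))
    print(is_same_author("Ursula Le Guin", "Stephen King"))
```

When `is_same_author` comes back false, the application can show an "author not found" message instead of graphing the wrong person. The threshold would need some trial and error against real searches.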

Posted in api, goodreads | Comments closed

Keeping librarians up to date on electronic products

Teaching librarians usually want to stay atop the latest changes to their institution’s electronic products to be able to teach research skills effectively. As an instructor, it’s important to be comfortable using the latest features of the various services.

However, keeping up can be a challenge. Vendors regularly roll out updates, but these aren’t always communicated in a timely way to front-line librarians. Compounding this problem, most libraries have dozens of electronic products to keep track of. This is a hard problem to solve, in part because librarians often have unique, personal workflows, and most one-size-fits-all communication approaches will not work for everyone.

So, our challenge was to get the canonical, vendor-produced training materials into the hands of our librarians in a timely and convenient way. Our solution was to build a page that draws on vendors’ YouTube training videos. This relies on YouTube’s RSS feeds. RSS used to be popular with people who read blogs, but it isn’t usually user-facing anymore; nonetheless, it lives on as internet infrastructure. In our case, we drew from the RSS feeds of vendors’ YouTube channels to create an auto-updating documentation page. Thanks RSS! Librarians can now review the latest training videos at their own pace, here.
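Our page does this with WordPress, but the underlying idea can be sketched in a few lines of Python. This is only an illustration, assuming the channel feed is an Atom document. The sample feed, the example.com URLs and the function name are all made up, and a real version would fetch the feed over HTTP rather than read it from a string.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by Atom feeds.
ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample feed for illustration; a real feed would come from the vendor's channel.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Vendor Training</title>
  <entry>
    <title>Searching basics</title>
    <link rel="alternate" href="https://www.example.com/watch?v=abc123"/>
    <published>2018-05-01T12:00:00+00:00</published>
  </entry>
  <entry>
    <title>Advanced filters</title>
    <link rel="alternate" href="https://www.example.com/watch?v=def456"/>
    <published>2018-06-15T09:30:00+00:00</published>
  </entry>
</feed>
"""


def latest_videos(feed_xml: str, limit: int = 5) -> list:
    """Return the newest entries in an Atom feed as a list of dicts."""
    root = ET.fromstring(feed_xml)
    videos = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        videos.append({
            "title": entry.findtext(ATOM + "title", default=""),
            "url": link.get("href", "") if link is not None else "",
            "published": entry.findtext(ATOM + "published", default=""),
        })
    # ISO 8601 timestamps with the same UTC offset sort correctly as plain strings.
    videos.sort(key=lambda video: video["published"], reverse=True)
    return videos[:limit]


if __name__ == "__main__":
    for video in latest_videos(SAMPLE_FEED):
        print(video["published"][:10], video["title"], video["url"])
```

Run on a schedule against each vendor's feed, something like this could regenerate the list of videos without anyone having to check the channels by hand.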

While I made this with WordPress, I think a similar solution may have also been possible in LibGuides. I might move this content to LibGuides in the future.

Posted in learning, rss, youtube | Comments closed

Build small

Software can sometimes be big and unwieldy. But it doesn’t have to be. Software can also be small, unimportant and ephemeral. Software can have small goals and limited use cases. It can be fun to build and deploy. There is a lot of value in building small applications for libraries. Here are some benefits:

  • Building a small application requires very little time commitment. Make something over the weekend!
  • Tools we build don’t have to be that complex. Do you really need a database to make that interesting project? Probably not!
  • It’s easy to iterate with small applications. Not happy with your code? Rewrite the whole thing if you like.
  • Small applications are a great way to show off library projects or resources. Highlight an amazing aspect of your library.

Posted in software | Comments closed

On developer conferences

Going to a developer conference can be pretty intimidating when you’re not a professional programmer. The imposter syndrome of being the non-developer at the table can be substantial. But I think it can be useful for librarians who write code to attend these events.

First of all, it is reassuring to see the issues that professional developers are dealing with. Their programming problems are not really that different from those faced by programming librarians. It turns out that they are grappling with human-scale problems like the rest of us.

Second, it is empowering to see what can be done with code. Going to a developer conference can inspire ideas for actually realizable projects that can benefit our libraries. Having a sense of the possibilities can motivate us to push forward interesting projects at our workplaces.

Finally, they’re usually pretty fun. Programming is a great way to build upon one’s interests, however idiosyncratic those may be. The developer conferences that I’ve been to reflect that, with lots of oddball presentations that are usually quite entertaining. It’s usually a good time.

This post was inspired by PyGotham, which wrapped up on Saturday.

Posted in conference, imposter syndrome | Comments closed

Burn it all down

This week I rewrote SeeCollections, a data visualization application that I had originally built in 2015. The rewrite was sorely needed, for a few reasons:

  • The original code was really bad. Which is to be expected; I was a beginner when I wrote it. The newer code is better. It’s clearer. It went from over 400 lines of code to under 200, while maintaining the same functionality. It will now be much easier to debug.
  • SeeCollections was originally based on the Primo X-Services API, which is now deprecated. For my application to keep working, I needed to move it to the newer Primo Search API. An added bonus was that the new API works with key-based authentication, which allows for more deployment options than with IP-based authentication.
  • I wanted to move this project off sdf.org. SDF is a wonderful hobbyist community, but the infrastructure is sometimes hard to work with. SDF is a few thousand programmers sharing a handful of servers, which results in all kinds of strange and unexpected technical roadblocks. It’s fun to tinker with, but not to rely on. Key-based authentication allowed me to move this to PythonAnywhere, which uses dynamic IPs, and is more reliable than SDF.

I had been putting off the rewrite, because it seemed like a lot of work, but now I’m glad I did it. I don’t have to look at the previous code anymore, and the whole project is now a lot more reliable and maintainable.

Posted in api, visualization | Comments closed

What I learned from 52,080 tweets

A few weeks ago, over the course of 15 days, I gathered 52,080 tweets about learning to code. I did this with TCAT, the open source tool that our library uses for Twitter archiving. I gathered all tweets that matched any of three popular learn to code hashtags: #codenewbie, #learntocode and #100daysofcode.

What I learned is that libraries and librarians aren’t very involved in the learn to code conversation on Twitter. Searching the results for the word “library” turned up 138 mentions (or 0.265% of tweets). This sounds like a modest contribution, but in fact only 17 of these were about libraries as institutions (0.033% of tweets), while the rest were about software libraries. “Librarian” only turned up in two results (0.004% of tweets).
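The arithmetic behind those percentages is easy to reproduce. Here is a small sketch using the counts from this post. The helper names are made up, and the keyword counter is just a plain substring match, not how TCAT itself searches.

```python
def share_of_tweets(matches: int, total: int) -> float:
    """Return matches as a percentage of total, rounded to three decimals."""
    return round(100 * matches / total, 3)


def count_mentions(tweets: list, word: str) -> int:
    """Count tweets whose text contains the word, ignoring case."""
    needle = word.lower()
    return sum(1 for text in tweets if needle in text.lower())


if __name__ == "__main__":
    total = 52080  # tweets gathered over 15 days
    counts = [
        ("library", 138),
        ("libraries as institutions", 17),
        ("librarian", 2),
    ]
    for label, matches in counts:
        print(f"{label}: {share_of_tweets(matches, total)}% of tweets")
```

A substring match like `count_mentions` can't tell a public library from a software library, which is why the 138 hits still had to be read by hand.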

What does this mean? First of all, I feel that 52K tweets is a decent sample size. Second, libraries and librarians are not well represented in this conversation. While launching learn to code initiatives is often good publicity for libraries, I think this data suggests that our profession is not really following through.

Posted in libraries, tcat, twitter | Comments closed

Making bots on Mastodon

I made a Mastodon bot this past weekend. It’s called Why, and it tries to answer the perennial question “Why?” with responses from public domain texts from Project Gutenberg.

I built this for Mastodon, rather than for Twitter, for a few reasons:

  1. I was curious about the Mastodon API and the tools that are available to work with it.
  2. I was also a bit discouraged with Twitter. Twitter has implemented a bot approval process intended to exclude certain bots. While I understand that they want to improve the quality of users’ timelines, the new sign-up process has driven away some bot makers.
  3. Mastodon has instances that are specifically bot-friendly, like botsin.space.
  4. Perhaps most importantly for me, Mastodon is generally very welcoming. This openness makes it fun to build projects there.

Going forward, I’ll be building bots on Mastodon rather than Twitter.

Posted in bots, mastodon, twitter | Comments closed

Why Python is a good choice for academics

I’ve been thinking about the role of Python in higher education. There’s a lot going on in that space, and the TL;DR version of this post is that I think Python is a good language choice for academics. If you’d like to hear my reasons, I have three:

  1. Python has a wide range of possible use cases, from web development to data analysis, to machine learning, to general scripting, etc. This broad range of uses is great for academics, who often do many varied types of work. On the other hand, there are some things that Python does less well, but overall it is very multi-purpose.
  2. It is also very widely used, which makes solving Python problems easier. There is a huge amount of helpful material online, from blog posts, to online courses, to StackOverflow posts. There are also many IRL Python communities; there is probably one that meets somewhere near you. The community is generally helpful.
  3. Python code is relatively easy to read. This makes it useful for effectively communicating ideas, which is especially important to academics.

None of these reasons will be news to people who already work with Python; they are features that are well known in the community. Nor will this likely be convincing to committed enthusiasts of other languages. That’s fine with me. Those are just my $0.02 as to why Python is well suited to academic work.

Posted in language, python | Comments closed

Highlighting new books for faculty

This post is co-written with Julia Furay.

Thanks to the dedicated work of our acquisitions librarian, Prof. Julia Furay, the Kingsborough library buys a lot of interesting books throughout the academic year. Typically, these are displayed on the New Books shelf for about a week before they find their permanent homes upstairs in the library stacks.

The problem is that if you don’t catch a new book during its brief stay on the New Books shelf, the odds are that you won’t find it upstairs unless you’re specifically looking for it. This means that some of our acquisitions pass unnoticed into the relative obscurity of the stacks.

Obviously, it would be great if we could better publicize these new arrivals. Julia recently came up with the great idea to send an email to faculty with each month’s new books. But the problem is that such a list would probably be too long to sift through easily.

Her next iteration of this idea was to use a LibGuide to allow for easy access from any user, as well as a browseable archive. But what do you include on this kind of list? Title and author, certainly; and our vendor also provides a preliminary LC call number. This could all be easily exported to a CSV file. But wouldn’t library users find it frustrating to have to jump between tabs, and to search for the books themselves in the catalog? A far better solution would be to include a catalog permalink to each title along with the citation information.

Getting all these links one by one would be an extremely labor-intensive process, however, with over 2000 new titles added each academic year. We wanted a solution that would be easy to maintain moving forward. As an alternative, Julia used Excel’s formula function to set up a canned search for each title using our discovery layer, Primo (branded as OneSearch at the CUNY libraries). After figuring out the format for a catalog search, we copied the formula down to the bottom of the spreadsheet, a process which took only a few minutes. As a result, each title listed has a live link to search the catalog for that specific title.
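We did this step in Excel, but the same idea in Python might look like the sketch below. The base URL and the `query` parameter are placeholders, not the real OneSearch address. The point is only that each title gets encoded into a canned search link.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real Primo search link has its own path and parameters.
SEARCH_BASE = "https://library.example.edu/search"


def title_search_url(title: str) -> str:
    """Build a canned catalog search link for one book title."""
    # urlencode escapes spaces, ampersands and other characters that would break a URL.
    return SEARCH_BASE + "?" + urlencode({"query": title})


if __name__ == "__main__":
    for title in ["The Dispossessed", "Pride & Prejudice"]:
        print(title_search_url(title))
```

Because the link is a search rather than a permalink, it keeps working even if the record changes after cataloging.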

Another question: How to sort the list? Some users may find it simplest if we listed the new books alphabetically by title or author. But this allows for little serendipity in discovery. What about all the titles our users didn’t know about yet? We decided instead to sort the new books by their call number. This led to perhaps the biggest problem in the finished project: Excel does not easily sort by correct LC call number. On the CSV file, you might find titles in this order: LB1, LB1000, LB2, LB2000. Solutions for this problem exist through a series of formulas in Excel, though none of the hacks on the web could be implemented easily in our case. Since we didn’t want to dig too deeply into the issue, we merely advertised the titles as sorted by “LC Class” as opposed to call number. Most months did not feature an inordinate number of arrivals in any one category, so we hoped these variations would not prove too confusing for our users. This gives faculty direct access to their subject area, as well as the ability to look through past months.
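For what it's worth, the sorting problem is easier to handle outside of Excel. The sketch below is not something we used in the project. It is a simplified Python sort key that splits a call number into its class letters and class number so that LB2 lands before LB1000. It treats anything after the class number (such as the Cutter) as plain text, which is an assumption that won't hold for every call number.

```python
import re

# Class letters, then an optional class number (with optional decimal), then the rest.
_LC_PATTERN = re.compile(r"^([A-Z]+)\s*(\d+(?:\.\d+)?)?(.*)$")


def lc_sort_key(call_number: str) -> tuple:
    """Return a sort key so the class number is compared numerically, not as text."""
    match = _LC_PATTERN.match(call_number.strip().upper())
    if match is None:
        # Doesn't look like an LC call number; sort it ahead of everything else.
        return ("", 0.0, call_number)
    letters, number, rest = match.groups()
    return (letters, float(number) if number else 0.0, rest.strip())


if __name__ == "__main__":
    shelf = ["LB1", "LB1000", "LB2", "LB2000"]
    print(sorted(shelf))                   # plain text sort, the order described above
    print(sorted(shelf, key=lc_sort_key))  # numeric sort within the LB class
```

Passing `lc_sort_key` to `sorted` is all it takes, so a check like this could sit in front of whatever generates the list.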

And creating the list of titles was just the beginning. Assembling this LibGuide required manually converting large spreadsheets of new books into HTML. Doing this would be very tedious and time-consuming – definitely not something you’d want to do every month.

The solution was to write a Python script that would convert a CSV file of new books into HTML. Then we could feed our new books CSV file to the script every month and it would generate the needed HTML for our LibGuide. This was a fun coding challenge for Mark, and you can see the results of our handiwork on LibGuides and on GitHub.
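The real script is on GitHub, so the following is not it. It is a stripped-down sketch of the general approach, and the column names (`title`, `author`, `call_number`) are assumptions about what a new books CSV might contain.

```python
import csv
import html
import io


def books_to_html(csv_text: str) -> str:
    """Convert CSV text with title, author and call_number columns to an HTML list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for row in reader:
        # Escape each field so characters like & and < don't break the page.
        title = html.escape(row["title"].strip())
        author = html.escape(row["author"].strip())
        call_number = html.escape(row["call_number"].strip())
        items.append(f"  <li><em>{title}</em>, {author} ({call_number})</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    sample = 'title,author,call_number\nCats & Dogs,"Smith, Jane",SF1\n'
    print(books_to_html(sample))
```

The output can be pasted straight into a LibGuides rich text box in source view, which is the tedious part that the script takes over.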

There is room to make this script better, such as improving the capitalization of titles. Additionally, there are problems with the call numbers provided by our vendors. These numbers are preliminary and may be slightly different (or occasionally, very different) from the final number after the book is cataloged. Also, there is no Cutter number. A title will be listed as BF637.M4 when its actual call number is BF637.M4 A385 2018, for example. Again, this is not ideal, but we ultimately needed a solution that would suit our workflow, so we decided to leave it. To make sure users are aware of this, we posted a warning, highlighted in yellow, on each page of the site.

Even with these issues, the page more or less does what we need. Julia recently emailed a link to the resulting LibGuide to all Kingsborough faculty, and the feedback was overwhelmingly positive. Our takeaway from this experience is that a bit of time spent automating some boring tasks allowed our library to deliver a monthly notification service that was too labor-intensive to provide otherwise.

Posted in acquisitions, books, excel, python | Comments closed

Teaching librarians to build Twitter bots

Robin Davis (@robincamille) and I are running a Twitter bot-making workshop next week at ALA Annual in New Orleans. We’ve run this workshop a couple of times before, and it’s always been a positive experience. It’s a great way to introduce people to Python while building something fun.

Right now, I’m in the midst of re-checking all of the bots and activities, to make sure they still work. Technical workshops, especially ones like this that rely so heavily on networked tools, are sometimes a real gamble. One broken service and the whole workshop quickly comes to a halt. Of course, we have a plan B, and even a plan C; but still.

With a workshop like this, the elephant in the room is that bots have earned a bit of a reputation lately. Our workshop emphasizes making bots that bring some whimsy to Twitter. We’re aiming for increased bot literacy, and a bit of practical Python knowledge. We want our participants to use their newfound skills to make something constructive.

If you can’t make it to ALA, you can still play along here.

Posted in bots, twitter, workshop | Comments closed