HathiTrust

You may have access to the HathiTrust library through your university. If so, you can download books from it with a bit of work. HathiTrust provides access to lots of books unavailable on libgen and other places, so it’s worth having this trick in your back pocket, just in case.1

You’ll need to install Firefox with the DownThemAll extension, img2pdf, and ocrmypdf. I’m assuming you’re on macOS, but this should work on Linux and, mutatis mutandis, on Windows, too.

1. Download raw pages from HathiTrust.

In Firefox, access HathiTrust using your institutional login.

Search for your book and check it out.

Right-click on the page image and click “Copy Image Location.”

Now click on the DownThemAll button, and select “Manager.”

Paste into the “Download” box. You’ll have something that looks like this.

  https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.b4362598;seq=107;size=150;rotation=0

This link corresponds to only one page. To download the entire book, we’ll need to make DownThemAll download the whole sequence using its bracket notation, where [1:464] stands for every number from 1 to 464. We can also request an arbitrary image size using “width” or “size.”

  https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.b4362598;seq=[1:464];width=2000
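If it helps to see what that range amounts to, here is a minimal shell sketch that prints the same URLs one per line. The function name hathi_page_urls is my own invention for illustration, and the sketch only assumes the URL pattern shown above; DownThemAll does this expansion for you, so you don't need to run it.

```shell
# Print one image URL per page, mirroring what the [first:last] range expands to.
# Arguments: volume id, first seq number, last seq number, image width in pixels.
# hathi_page_urls is a hypothetical helper, not part of any tool.
hathi_page_urls() {
  volume_id=$1
  seq_num=$2
  last=$3
  width=$4
  while [ "$seq_num" -le "$last" ]; do
    printf 'https://babel.hathitrust.org/cgi/imgsrv/image?id=%s;seq=%s;width=%s\n' \
      "$volume_id" "$seq_num" "$width"
    seq_num=$((seq_num + 1))
  done
}

# The first three pages of the example volume.
hathi_page_urls uc1.b4362598 1 3 2000
# https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.b4362598;seq=1;width=2000
# https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.b4362598;seq=2;width=2000
# https://babel.hathitrust.org/cgi/imgsrv/image?id=uc1.b4362598;seq=3;width=2000
```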

The defaults for DownThemAll work fine. I like to set a custom subfolder for the project. Click “Download,” then “Batch Download.”

DownThemAll should begin working its magic. HathiTrust limits how many images you can view within a certain time period; once you hit the limit, it returns a server error instead of the image. So if you’re downloading a bunch of pages, you’ll want to modify the network preferences. Click on DownThemAll, “Preferences,” then “Network.” Here’s what I use.

  Concurrent downloads: 1
  Number of retries of downloads on temporary errors: 99
  Retry every (in minutes): 2
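For the curious, the policy these settings describe (one download at a time, up to 99 retries, two minutes apart) can be sketched as a shell function. This is only an illustration of the logic, not how DownThemAll is implemented; retry and fetch_page are hypothetical names.

```shell
# Run a command; on failure, wait and try again, up to a fixed number of retries.
# Arguments: max retries, delay between tries in seconds, then the command itself.
# retry is a hypothetical helper written for this sketch.
retry() {
  max_retries=$1
  delay_seconds=$2
  shift 2
  failures=0
  until "$@"; do
    failures=$((failures + 1))
    if [ "$failures" -gt "$max_retries" ]; then
      return 1   # give up: the first try and every retry failed
    fi
    sleep "$delay_seconds"
  done
}

# Usage, with the settings above (99 retries, 120 seconds apart).
# fetch_page stands in for whatever command downloads a single page:
#   retry 99 120 fetch_page 107
```

Waiting between tries is the important part: the server error clears only after some time has passed, so retrying immediately would just burn through the retries.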

You may need to step away for a couple of hours while DownThemAll gets through all the pages.

2. Merge raw pages, OCR them.

You should now have all of your images in a single directory on your computer. Using your command line, navigate to that directory.

  cd ~/Downloads/example_book/
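Before merging, it can be worth checking that every page actually arrived. The following is a rough sketch, assuming the pages are the only jpg and png files in the directory; count_pages and empty_pages are names I made up, and 464 is just the last seq number from the example above.

```shell
# Count the downloaded page images (jpg or png) in a directory.
count_pages() {
  find "$1" -maxdepth 1 -type f \( -name '*.jpg' -o -name '*.png' \) | wc -l | tr -d ' '
}

# List zero-byte images, which may be downloads that failed partway.
empty_pages() {
  find "$1" -maxdepth 1 -type f -empty \( -name '*.jpg' -o -name '*.png' \)
}

expected=464   # assumption: the last seq number you asked DownThemAll for
found=$(count_pages .)
if [ "$found" -ne "$expected" ]; then
  echo "warning: found $found page images, expected $expected" >&2
fi
empty_pages .
```

If the count is short or any empty files are listed, re-run DownThemAll for the missing pages before moving on.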

Use img2pdf, since it’s always lossless. The images need to be sorted into page order, and in my experience they come in a mix of jpg and png, so this one-liner does the trick.

  img2pdf --fit shrink --output out.pdf $(ls *.{jpg,png}|sort -V)
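The sort -V in that one-liner matters because a plain sort compares names character by character, which puts page 10 ahead of page 2. Here is a small sketch of the difference, using made-up file names (the names DownThemAll gives your files may look different):

```shell
# Three hypothetical page files, listed out of order.
pages='page-10.jpg page-2.png page-1.jpg'

# Plain byte-wise sort: "10" lands before "2".
plain_order=$(printf '%s\n' $pages | LC_ALL=C sort | xargs)

# Version sort: runs of digits are compared as numbers.
version_order=$(printf '%s\n' $pages | LC_ALL=C sort -V | xargs)

echo "sort:    $plain_order"
echo "sort -V: $version_order"
# sort:    page-1.jpg page-10.jpg page-2.png
# sort -V: page-1.jpg page-2.png page-10.jpg
```

Note that the one-liner relies on the file names containing no spaces, since the shell splits the output of ls on whitespace.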

Now OCR this pdf with your preferred method. The simplest one is likely ocrmypdf. A good alternative is FineReader. This may take a while.

  ocrmypdf out.pdf out_ocr.pdf

You’ve now got a nice pdf of your book, DRM-free.

Footnotes:

1

This method is mostly taken from Mukkakukaku’s guide. But see this ticket for HathiDownloadHelper and this issue for HathiTrust-downloader.



Matthew J. Delhey (matt.delhey@mail.utoronto.ca)