History

This page documents the release history of the Book Data Tools. Each numbered, released version has a corresponding Git tag (e.g. v2.0).

If you use the Book Data Tools in published research, we ask that you do the following:

Cite the UMUAI paper, regardless of which version of the data set you use.
Cite the papers corresponding to the individual ratings, review, or consumption data sets you are using.
Clearly state the version of the data tools you are using in your paper.
Let us know about your work so we can add you to the list.

Book Data 3.0 (in progress)

Make the pipeline configurable so individual rating datasets can be disabled.
Only support the full JSON GoodReads interaction data, because it is now publicly available.
Use jsonnet to generate DVC pipelines, taking configuration settings into account.
Update to newer VIAF and OpenLibrary dumps.
Extract GoodReads author information into goodreads/gr-author-info.parquet.
Support full-text reviews from the GoodReads and Amazon 2018 data sets (enabled by default).
Disable the BookCrossing data by default since the source website is offline.
Extract 5-cores of interaction files.
Update to OpenLibrary and VIAF dumps from the beginning of 2024 (OpenLibrary 2023-12-31, VIAF 2024-01-01).

Bugs Fixed

🪲 GoodReads cluster & work rating timestamps were on incorrect scale

Book Data 2.1

Version 2.1 has a few updates but does not change existing data schemas when run with the full GoodReads interaction files. It does have improved book/author linking that increases coverage due to a revised and corrected name parsing & normalization flow.

The tools now support the GoodReads interaction CSV file, which is available without registration, and uses this by default. See the GoodReads data docs for the details. This means that, in their default configuration, the book data integration uses only data that is publicly available without special request.

Data Updates

Updated VIAF to May 1, 2022 dump
Updated OpenLibrary to March 29, 2022 dump
Added 2018 version of the Amazon ratings
Added code to extract edition and work subjects
Updated docs for current extraction layout
Added openlibrary/work-clusters.parquet to simplify OpenLibrary integration

Logic Updates

Switched from DataFusion to Polars, to reduce volatility and improve maintainability. This also involved a switch from Arrow to Arrow2, which seems to have cleaner code (and less custom logic needed for IO).
Rewrote logic that was previously in DataFusion + custom TCL in Rust, so all integration code is in Rust for consistency (and to avoid redundancy in things like logging configuration between Rust and Python). The code is now in 2 languages: Rust integration and Python notebooks to report on integration statistics.
Improved name parsing
- Replaced nom-based name parser for name_variants with a new one written in peg, that is both easier to read/maintain and more efficient.
- Corrected errors in name parser that emitted empty-string names for some authors.
- Added clean_name function, used across all name formatting, to normalize whitespace and punctuation in name records from any source.
- Added more tests for name parsing and normalization.
Fixed a bug in GoodReads integration, where we were not extracting ASINs.
Extract book genres and series from GoodReads.
Updated various Rust dependencies, and upgraded from StructOpt to clap’s derive macros.
Better progress reporting for data scans.

Book Data 2.0

This is the updated release of the Book Data Tools, using the same source data as 1.0 but with DataFusion and Rust-based import logic, instead of PostgreSQL. It is significantly easier to install and use.

Book Data 1.0

The original release that used PostgreSQL. There were a couple of versions of this for the RecSys and UMUAI papers; the tagged 1.0 release corresponds to the data used for the UMUAI paper.