GoodReads (UCSD Book Graph)
We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:
Full interaction data
We do not yet support reviews.
If you use this data, cite the paper(s) documented on the data set web site.
Imported data lives in the
Data Model Diagram
The import is controlled by the following DVC steps:
gr-schema.sql to set up the base schema.
Import raw GoodReads data from files under
gr-index-books.sql to index the book data and extract identifiers.
gr-book-info.sql to extract additional book and work metadata.
gr-index-ratings.sql to index the rating and interaction data.
The raw rating data, with invalid characters cleaned up, is in the various
Each table has the following columns:
Numeric record identifier generated at import time. Throughout this page, we will refer to these as record identifiers; they are distinct from the identifiers GoodReads uses for books and works, as those are not known until the JSON is unpacked.
JSONB column containing the imported data.
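As a rough illustration of this two-column layout, the following Python sketch assigns a sequential record identifier to each line of JSON and keeps the unpacked payload alongside it. The helper `load_raw_records`, the `RawRecord` type, and the sample field names are hypothetical and exist only for this example; the actual import stores the payload in a JSONB column rather than in Python objects.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class RawRecord(NamedTuple):
    rec_id: int  # record identifier generated at import time
    data: dict   # the unpacked JSON payload


def load_raw_records(lines: Iterable[str]) -> Iterator[RawRecord]:
    """Assign sequential record identifiers to JSON lines (hypothetical helper)."""
    rec_id = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than numbering them
        rec_id += 1
        yield RawRecord(rec_id, json.loads(line))


# Illustrative input; the field names are assumptions made for this sketch.
sample = [
    '{"book_id": "1234", "work_id": "77"}',
    '{"book_id": "5678", "work_id": "77"}',
]
records = list(load_raw_records(sample))
```

The sketch shows why record identifiers are distinct from GoodReads identifiers: the record identifier is assigned before the JSON is inspected, while the GoodReads identifiers are only available from the unpacked `data`.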
Extracted Book Tables
We extract the following tables for book and work data:
GoodReads work identifiers.
GoodReads book identifiers. This maps each GoodReads book record identifier to the following identifiers:
ISBN 10 (
ISBN 13 (
This table extracts the textual versions of ISBNs and ASINs directly from the raw_book table. It does not resolve them to ISBN IDs.
Map GoodReads books to ISBN IDs and book codes. This uses only ISBN-10s and ISBN-13s, not ASINs.
Genre membership (and scores) for each book. This is a direct extract of the book genres file from UCSD.
The title of each work.
The publication date of each book. This extracts the year, month, and day from the publication_* fields in the book JSON data; if all three are present, then pub_date contains the date as an SQL date.
The original publication date of each work. Extracted like
book_pub_date, but from a work’s
The book cluster each book is a member of.
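The publication date rule described above (a full date only when the year, month, and day are all present) can be sketched in Python as follows. This is a minimal illustration, not the SQL the import uses: the helpers `parse_part` and `make_pub_date` are hypothetical, the three field names in the sample record are assumed from the publication_* pattern, and returning no date for impossible combinations such as February 30 is an assumption of this sketch.

```python
from datetime import date
from typing import Optional


def parse_part(value) -> Optional[int]:
    """Convert a raw JSON field to an int; missing or blank values become None."""
    if value is None:
        return None
    text = str(value).strip()
    return int(text) if text.isdecimal() else None


def make_pub_date(year, month, day) -> Optional[date]:
    """Build a date only when year, month, and day are all present (hypothetical helper)."""
    y, m, d = parse_part(year), parse_part(month), parse_part(day)
    if y is None or m is None or d is None:
        return None
    try:
        return date(y, m, d)
    except ValueError:
        # Assumption: combinations that do not form a real date yield no date.
        return None


# Illustrative record; the field names are assumed from the publication_* pattern.
book = {"publication_year": "2003", "publication_month": "6", "publication_day": "21"}
full = make_pub_date(
    book["publication_year"], book["publication_month"], book["publication_day"]
)
partial = make_pub_date("2003", "", "")  # month and day missing, so no date
```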
Extracted Interaction Tables
We extract the following tables for book ratings and interactions (add-to-shelf actions):
Mapping between user record IDs and GoodReads user IDs.
Extract of basic information about each entry in the Interactions file. These interactions represent an add-to-shelf action, optionally with a rating. We extract the following:
The interaction record identifier (PK)
GoodReads book ID
User record identifier (we use record IDs instead of user IDs to keep them numeric)
The 5-star rating value (if provided)
isRead flag from the original JSON data.
The date the book was added to the shelf.
The update date for this interaction.
Rating table suitable for use in LensKit. This is aggregated by book cluster, and contains both the median rating and the last rating, along with the median update date as the timestamp.
Add-action table suitable for use in LensKit. This is also aggregated by book cluster, with the first and last update dates as the timestamps and the number of interactions with the book.
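The two cluster-level aggregations above can be sketched in plain Python. This is an illustration under stated assumptions rather than the import's own SQL: the sample rows and output key names are made up, grouping is done per user and cluster, unrated interactions are represented by `None`, and "last rating" is taken to mean the rating with the latest update time.

```python
from collections import defaultdict
from statistics import median

# (user, cluster, rating or None, update timestamp); values invented for illustration
interactions = [
    (1, 100, 4, 1000),
    (1, 100, 2, 3000),
    (1, 100, 5, 2000),
    (2, 100, None, 1500),
    (2, 200, 3, 1700),
]

# Collect each user's interactions with every book in the same cluster.
groups = defaultdict(list)
for user, cluster, rating, updated in interactions:
    groups[(user, cluster)].append((updated, rating))

ratings = {}  # median rating, last rating, median update time
actions = {}  # first and last update time, interaction count
for key, rows in groups.items():
    rows.sort(key=lambda r: r[0])  # order by update time only
    actions[key] = {
        "first_time": rows[0][0],
        "last_time": rows[-1][0],
        "nactions": len(rows),
    }
    rated = [(t, r) for t, r in rows if r is not None]
    if rated:  # only rated interactions contribute to the rating table
        ratings[key] = {
            "rating": median(r for _, r in rated),
            "last_rating": rated[-1][1],
            "timestamp": median(t for t, _ in rated),
        }
```

In this sample, user 1 rated three editions in cluster 100, so the rating table gets one row for that pair, while user 2's unrated shelving of cluster 100 appears only in the add-action table.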