GoodReads

We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:

Books
Book works
Authors
Book genres
Book series
Interaction data (the full interactions JSON file, not the summary CSV)

Download the files and save them in the data/goodreads directory. Each one has a corresponding .dvc file already in the repository.
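Before running the import, it can help to confirm that every tracked source file is actually present. The sketch below relies only on the DVC convention that a tracking file named X.dvc sits next to its data file X. The function name missing_sources is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_sources(directory="data/goodreads"):
    """Return names of data files that have a .dvc file but are absent.

    Assumes the DVC naming convention: `name.ext.dvc` tracks `name.ext`
    in the same directory.
    """
    root = Path(directory)
    missing = []
    for dvc_file in sorted(root.glob("*.dvc")):
        target = dvc_file.with_suffix("")  # strip the trailing ".dvc"
        if not target.exists():
            missing.append(target.name)
    return missing
```

Calling missing_sources() from the repository root returns an empty list once all of the downloads are in place.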

Important

If you use this data, cite the paper(s) documented on the data set web site.

Imported data lives in the goodreads directory.

Configuration

The config.yaml file allows you to disable the GoodReads data entirely, as well as control whether reviews are processed:

goodreads:
    enabled: true
    reviews: true

If you change the configuration, you need to update the pipeline before running.
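As an illustration of how the two flags interact, the sketch below takes the configuration after it has been parsed into a dictionary and reports what would be processed. The helper name goodreads_options and the defaults used for missing keys are assumptions for illustration, not the pipeline's actual behavior.

```python
def goodreads_options(config):
    """Summarize the GoodReads flags from a parsed config.yaml dictionary.

    Assumption: a missing key is treated as False here; the real pipeline
    may apply different defaults.
    """
    section = config.get("goodreads") or {}
    enabled = bool(section.get("enabled", False))
    # Disabling GoodReads entirely also rules out review processing.
    reviews = enabled and bool(section.get("reviews", False))
    return {"enabled": enabled, "reviews": reviews}


# Mirrors the YAML example above, already parsed into a dict.
config = {"goodreads": {"enabled": True, "reviews": True}}
print(goodreads_options(config))
```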

Import Steps

The import is controlled by several DVC steps:

scan-*: The various scan-* steps each scan a JSON file into corresponding Parquet files. They must run in a specific order, because scanning interactions requires book information.
book-isbn-ids: Match GoodReads ISBNs with ISBN IDs.
book-links: Creates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and its cluster ID.
cluster-actions: Extracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.
cluster-ratings: Extracts cluster-level explicit feedback data: the ratings each user assigned to books in each cluster.
work-actions, work-ratings: The same as the cluster-* stages, except that they group by GoodReads work instead of by integrated cluster. If you are only working with the GoodReads data, and not trying to connect across data sets, these tables are the better choice.
work-gender: The author gender for each GoodReads work, as near as we can tell.
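The grouping performed by the cluster-actions step can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's implementation: the function name cluster_actions, the tuple layout of the input, the choice to skip unlinked books, and the rule that last_rating is the rating on the most recent action are all assumptions.

```python
from collections import namedtuple

# Field names follow the gr-cluster-actions.parquet schema.
Action = namedtuple(
    "Action", "user_id item_id first_time last_time nactions last_rating"
)


def cluster_actions(interactions, book_cluster):
    """Collapse book-level interactions into one record per (user, cluster).

    interactions: iterable of (user_id, book_id, timestamp, rating) tuples,
        where rating is None when the user did not rate the book.
    book_cluster: dict mapping book_id -> cluster ID (as in gr-book-link).
    """
    grouped = {}
    for user_id, book_id, timestamp, rating in interactions:
        cluster = book_cluster.get(book_id)
        if cluster is None:
            continue  # assumption: skip books that have no cluster link
        grouped.setdefault((user_id, cluster), []).append((timestamp, rating))

    records = []
    for (user_id, cluster), events in sorted(grouped.items()):
        events.sort(key=lambda event: event[0])  # order by timestamp
        records.append(Action(
            user_id=user_id,
            item_id=cluster,
            first_time=events[0][0],
            last_time=events[-1][0],
            nactions=len(events),
            last_rating=events[-1][1],  # assumption: rating of latest action
        ))
    return records


# Hypothetical data: books 10 and 11 belong to the same cluster.
book_cluster = {10: 100, 11: 100, 12: 200}
interactions = [
    (1, 10, 1500, None),
    (1, 11, 1700, 4.0),
    (1, 12, 1600, 5.0),
    (2, 10, 1800, 3.0),
]
for record in cluster_actions(interactions, book_cluster):
    print(record)
```

Passing a book-to-work mapping instead of a book-to-cluster mapping gives the corresponding work-level grouping.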

Scanned and Linking Data

goodreads/gr-book-ids.parquet

Identifiers extracted from each GoodReads book record.

File details

Schema for `goodreads/gr-book-ids.parquet`.
Field	Type
book_id	Int32
work_id	Int32
item_id	Int32
isbn10	LargeUtf8
isbn13	LargeUtf8
asin	LargeUtf8

goodreads/gr-book-info.parquet

Metadata extracted from GoodReads book records.

File details

Schema for `goodreads/gr-book-info.parquet`.
Field	Type
book_id	Int32
title	LargeUtf8
pub_year	UInt16
pub_month	UInt8

goodreads/gr-book-genres.parquet

GoodReads book-genre associations.

File details

Schema for `goodreads/gr-book-genres.parquet`.
Field	Type
book_id	Int32
genre_id	Int32
count	Int32

goodreads/gr-book-series.parquet

GoodReads book series associations.

File details

Schema for `goodreads/gr-book-series.parquet`.
Field	Type
book_id	Int32
series	LargeUtf8

goodreads/gr-genres.parquet

The genre labels to go with goodreads/gr-book-genres.parquet.

File details

Schema for `goodreads/gr-genres.parquet`.
Field	Type
genre_id	Int32
genre	LargeUtf8

goodreads/gr-book-link.parquet

Linking identifiers (work and cluster) for GoodReads books.

File details

Schema for `goodreads/gr-book-link.parquet`.
Field	Type
book_id	Int32
work_id	Int32
cluster	Int32

goodreads/gr-work-info.parquet

Metadata extracted from GoodReads work records.

File details

Schema for `goodreads/gr-work-info.parquet`.
Field	Type
work_id	Int32
title	LargeUtf8
pub_year	Int16
pub_month	UInt8

goodreads/gr-interactions.parquet

GoodReads interaction records (from JSON).

File details

Schema for `goodreads/gr-interactions.parquet`.
Field	Type
rec_id	UInt32
review_id	Int64
user_id	Int32
book_id	Int32
is_read	UInt8
rating	Float32
added	Float32
updated	Float32
read_started	Float32
read_finished	Float32

goodreads/gr-author-info.parquet

GoodReads author information.

File details

Schema for `goodreads/gr-author-info.parquet`.
Field	Type
author_id	Int32
name	LargeUtf8

Cluster-Level Tables

goodreads/gr-cluster-actions.parquet

Cluster-level implicit-feedback records, suitable for use in LensKit. The item_id column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

File details

Schema for `goodreads/gr-cluster-actions.parquet`.
Field	Type
user_id	Int32
item_id	Int32
first_time	Int64
last_time	Int64
nactions	UInt32
last_rating	Float32

goodreads/gr-cluster-ratings.parquet

Cluster-level explicit-feedback records, suitable for use in LensKit. The item_id column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

File details

Schema for `goodreads/gr-cluster-ratings.parquet`.
Field	Type
user_id	Int32
item_id	Int32
rating	Float32
last_rating	Float32
first_time	Int64
last_time	Int64
nratings	UInt32

Work-Level Tables

goodreads/gr-work-actions.parquet

Work-level implicit-feedback records, suitable for use in LensKit. The item_id column contains work IDs.

File details

Schema for `goodreads/gr-work-actions.parquet`.
Field	Type
user_id	Int32
item_id	Int32
first_time	Int64
last_time	Int64
nactions	UInt32
last_rating	Float32

goodreads/gr-work-ratings.parquet

Work-level explicit-feedback records, suitable for use in LensKit. The item_id column contains work IDs.

File details

Schema for `goodreads/gr-work-ratings.parquet`.
Field	Type
user_id	Int32
item_id	Int32
rating	Float32
last_rating	Float32
first_time	Int64
last_time	Int64
nratings	UInt32

goodreads/gr-work-item-gender.parquet

Author gender for GoodReads work-level items. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.

File details

Schema for `goodreads/gr-work-item-gender.parquet`.
Field	Type
item_id	Int32
gender	LargeUtf8
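The work-to-cluster-to-gender lookup described for this table can be sketched as a pair of dictionary joins. The function name work_gender, the row layout, and the gender labels in the example are hypothetical, and the sketch assumes that all books of a work resolve to the same cluster.

```python
def work_gender(book_links, cluster_gender):
    """Map each work ID to the gender label of its cluster.

    book_links: iterable of (book_id, work_id, cluster) rows, as in
        gr-book-link.parquet, with work_id None for books without a work.
    cluster_gender: dict mapping cluster ID -> gender label.
    """
    result = {}
    for _book_id, work_id, cluster in book_links:
        if work_id is None:
            continue  # only work-level items are labeled here
        label = cluster_gender.get(cluster)
        if label is not None:
            # Assumption: every book of a work shares one cluster, so the
            # first label seen for a work is kept.
            result.setdefault(work_id, label)
    return result


# Hypothetical rows and labels, for illustration only.
links = [(1, 50, 100), (2, 50, 100), (3, None, 200), (4, 60, 300)]
genders = {100: "female", 200: "male"}
print(work_gender(links, genders))
```

Works whose cluster has no gender record (work 60 here) are simply left out of the result in this sketch.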

goodreads/gr-book-gender.parquet

Author gender for GoodReads books (not just works). This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.

File details

Schema for `goodreads/gr-book-gender.parquet`.
Field	Type
cluster	Int32
gender	LargeUtf8
book_id	Int32
work_id	Int32
item_id	Int32