GoodReads

We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:

  • Books
  • Book works
  • Authors
  • Book genres
  • Book series
  • Interaction data (the full interactions JSON file, not the summary CSV)

Download the files and save them in the data/goodreads directory. Each one has a corresponding .dvc file already in the repository.
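
One way to confirm the downloads are in place is to compare the .dvc files against the data files beside them. The sketch below assumes each .dvc file is named after its data file with .dvc appended (the usual DVC convention); the helper name missing_dvc_outputs is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_dvc_outputs(data_dir):
    """List data files that have a .dvc file but are not present on disk.

    Assumption: `name.ext.dvc` tracks `name.ext` in the same directory.
    """
    data_dir = Path(data_dir)
    missing = []
    for dvc_file in sorted(data_dir.glob("*.dvc")):
        target = dvc_file.with_suffix("")  # strip the trailing .dvc
        if not target.exists():
            missing.append(target.name)
    return missing


if __name__ == "__main__":
    # Report every tracked file that still needs to be downloaded.
    for name in missing_dvc_outputs("data/goodreads"):
        print(f"missing: {name}")
```

An empty result means every tracked file has been saved; it does not verify that the contents match the hashes recorded in the .dvc files.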

Important

If you use this data, cite the paper(s) documented on the data set web site.

Imported data lives in the goodreads directory.

Configuration

The config.yaml file allows you to disable the GoodReads data entirely, as well as control whether reviews are processed:

goodreads:
    enabled: true
    reviews: true

If you change the configuration, you need to update the pipeline before running.

Import Steps

The import is controlled by several DVC steps:

scan-*
The scan-* steps each scan a JSON file into corresponding Parquet files. They must run in a specific order, because scanning interactions requires the book information.
book-isbn-ids
Matches GoodReads ISBNs with ISBN IDs.
book-links
Creates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and its cluster ID.
cluster-actions
Extracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.
cluster-ratings
Extracts cluster-level explicit feedback data. These are the ratings each user assigned to books in each cluster.
work-actions, work-ratings
The same as the cluster-* stages, except that they group by GoodReads work instead of by integrated cluster. If you are working only with the GoodReads data, and not trying to connect across data sets, this data is better to work with.
work-gender
The author gender for each GoodReads work, as near as we can tell.
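
The cluster-actions aggregation described above can be sketched in plain Python. This is an illustration of the grouping logic only, not the pipeline's implementation: the function name cluster_actions and the tuple layout of the input are hypothetical, the output keys follow the gr-cluster-actions.parquet schema, and the last_rating column is left out. Skipping books with no cluster link is an assumption.

```python
def cluster_actions(interactions, book_clusters):
    """Aggregate book-level actions into one record per (user, cluster).

    interactions: iterable of (user_id, book_id, timestamp) tuples.
    book_clusters: dict mapping book_id -> cluster ID.
    """
    stats = {}
    for user, book, ts in interactions:
        cluster = book_clusters.get(book)
        if cluster is None:
            continue  # assumption: ignore books without a cluster link
        key = (user, cluster)
        rec = stats.get(key)
        if rec is None:
            # First action this user took on any book in the cluster.
            stats[key] = {
                "user": user,
                "item": cluster,
                "first_time": ts,
                "last_time": ts,
                "nactions": 1,
            }
        else:
            rec["first_time"] = min(rec["first_time"], ts)
            rec["last_time"] = max(rec["last_time"], ts)
            rec["nactions"] += 1
    return list(stats.values())
```

Replacing the cluster map with a book-to-work map gives the corresponding work-level grouping.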

Scanned and Linking Data

goodreads/gr-book-ids.parquet

Identifiers extracted from each GoodReads book record.

File details
Schema for goodreads/gr-book-ids.parquet.
Field      Type
book_id    Int32
work_id    Int32
gr_item    Int32
isbn10     Utf8
isbn13     Utf8
asin       Utf8

goodreads/gr-book-info.parquet

Metadata extracted from GoodReads book records.

File details
Schema for goodreads/gr-book-info.parquet.
Field      Type
book_id    Int32
title      Utf8
pub_year   UInt16
pub_month  UInt8

goodreads/gr-book-genres.parquet

GoodReads book-genre associations.

File details
Schema for goodreads/gr-book-genres.parquet.
Field      Type
book_id    Int32
genre_id   Int32
count      Int32

goodreads/gr-book-series.parquet

GoodReads book series associations.

File details
Schema for goodreads/gr-book-series.parquet.
Field      Type
book_id    Int32
series     Utf8

goodreads/gr-genres.parquet

The genre labels to go with goodreads/gr-book-genres.parquet.

File details
Schema for goodreads/gr-genres.parquet.
Field      Type
genre_id   Int32
genre      Utf8

goodreads/gr-book-link.parquet

Linking identifiers (work and cluster) for GoodReads books.

File details
Schema for goodreads/gr-book-link.parquet.
Field      Type
book_id    Int32
work_id    Int32
cluster    Int32

goodreads/gr-work-info.parquet

Metadata extracted from GoodReads work records.

File details
Schema for goodreads/gr-work-info.parquet.
Field      Type
work_id    Int32
title      Utf8
pub_year   Int16
pub_month  UInt8

goodreads/gr-interactions.parquet

GoodReads interaction records (from JSON).

File details
Schema for goodreads/gr-interactions.parquet.
Field          Type
rec_id         UInt32
review_id      Int64
user_id        Int32
book_id        Int32
is_read        UInt8
rating         Float32
added          Float32
updated        Float32
read_started   Float32
read_finished  Float32

goodreads/gr-author-info.parquet

GoodReads author information.

File details
Schema for goodreads/gr-author-info.parquet.
Field      Type
author_id  Int32
name       Utf8

Cluster-Level Tables

goodreads/gr-cluster-actions.parquet

Cluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

File details
Schema for goodreads/gr-cluster-actions.parquet.
Field        Type
user         Int32
item         Int32
first_time   Int64
last_time    Int64
nactions     UInt32
last_rating  Float32

goodreads/gr-cluster-ratings.parquet

Cluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

File details
Schema for goodreads/gr-cluster-ratings.parquet.
Field        Type
user         Int32
item         Int32
rating       Float32
last_rating  Float32
first_time   Int64
last_time    Int64
nratings     UInt32

Work-Level Tables

goodreads/gr-work-actions.parquet

Work-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

File details
Schema for goodreads/gr-work-actions.parquet.
Field        Type
user         Int32
item         Int32
first_time   Int64
last_time    Int64
nactions     UInt32
last_rating  Float32

goodreads/gr-work-ratings.parquet

Work-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

File details
Schema for goodreads/gr-work-ratings.parquet.
Field        Type
user         Int32
item         Int32
rating       Float32
last_rating  Float32
first_time   Int64
last_time    Int64
nratings     UInt32

goodreads/gr-work-gender.parquet

Author gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.

File details
Schema for goodreads/gr-work-gender.parquet.
Field      Type
cluster    Int32
gender     Utf8
book_id    Int32
work_id    Int32
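
The work-to-cluster-to-gender connection described for this table can be sketched as a lookup join. This is a rough illustration under stated assumptions, not the pipeline's code: the function name work_genders is hypothetical, the inputs are plain Python stand-ins for goodreads/gr-book-link.parquet and book-links/cluster-genders.parquet, and dropping books that lack a work ID or a cluster gender is an assumption.

```python
def work_genders(book_links, cluster_genders):
    """Attach cluster-level author gender to each linked GoodReads work.

    book_links: iterable of dicts with book_id, work_id, and cluster keys.
    cluster_genders: dict mapping cluster ID -> gender label.
    """
    rows = []
    for link in book_links:
        if link["work_id"] is None:
            continue  # assumption: books without a work are left out
        gender = cluster_genders.get(link["cluster"])
        if gender is None:
            continue  # assumption: clusters without gender data are left out
        rows.append({
            "cluster": link["cluster"],
            "gender": gender,
            "book_id": link["book_id"],
            "work_id": link["work_id"],
        })
    return rows
```

The output keys match the columns of the schema above; the gender labels are whatever the cluster gender table supplies.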