GoodReads (UCSD Book Graph)#

We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:

  • Books

  • Book works

  • Authors

  • Book genres

  • Book series

  • Interaction data (the CSV summary file, along with its book and user ID files, is used by default; the full JSON file is also supported)

We do not yet support reviews.

If you use this data, cite the paper(s) documented on the data set web site.

Imported data lives in the goodreads directory.

Configuration#

The config.tcl file determines which source of GoodReads interaction data is used:

set gr_interactions simple

The default, simple, uses the CSV summary data, which you can download directly from the web site as three files:

  • goodreads_interactions.csv

  • user_id_map.csv

  • book_id_map.csv

Download these three files and save them in data/goodreads, along with the other metadata files.
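Because the source files are not downloaded automatically, it can help to confirm they are in place before running the import. The following is a minimal sketch, not part of the tools themselves; the helper name missing_simple_files is hypothetical, while the file names and directory come from the text above.

```python
from pathlib import Path

# File names from the list above; data/goodreads is the directory named in the text.
SIMPLE_FILES = [
    "goodreads_interactions.csv",
    "user_id_map.csv",
    "book_id_map.csv",
]


def missing_simple_files(data_dir="data/goodreads"):
    """Return the names of the CSV interaction files not yet present in data_dir."""
    root = Path(data_dir)
    return [name for name in SIMPLE_FILES if not (root / name).is_file()]


# Report what still needs to be downloaded (hypothetical usage).
missing = missing_simple_files()
if missing:
    print("Still need to download:", ", ".join(missing))
else:
    print("All three interaction files are in place.")
```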

The tools also support the detailed version of the interaction data, delivered in JSON format; to use it, change gr_interactions to full. To obtain this version, you need to contact Mengting Wan as noted on the web site.
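To illustrate how the setting relates to the output files, here is a small Python sketch that reads the gr_interactions value from the text of config.tcl and derives the matching interaction table path. It is an illustration only: the tools themselves evaluate the file as Tcl, and the function names and the regular expression here are assumptions.

```python
import re

# Matches a Tcl line such as `set gr_interactions simple` (simplified; not a Tcl parser).
_SETTING = re.compile(r"^\s*set\s+gr_interactions\s+(\S+)\s*$", re.MULTILINE)


def interaction_source(config_text):
    """Return 'simple' or 'full' from config.tcl text, defaulting to 'simple'."""
    found = _SETTING.findall(config_text)
    value = found[-1] if found else "simple"  # the last assignment wins
    if value not in ("simple", "full"):
        raise ValueError(f"unknown gr_interactions value: {value!r}")
    return value


def interactions_path(config_text):
    """Path of the scanned interaction table for the configured source."""
    return f"goodreads/{interaction_source(config_text)}/gr-interactions.parquet"


print(interactions_path("set gr_interactions simple\n"))
```

With the default setting this yields the table under goodreads/simple; with full it yields the one under goodreads/full.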

Import Steps#

The import is controlled by several DVC steps:

scan-*

The various scan-* steps each scan a JSON file into corresponding Parquet files. They must run in a specific order, because scanning interactions requires book information.

book-isbn-ids

Matches GoodReads ISBNs with ISBN IDs.

book-links

Creates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and its cluster ID.

cluster-actions

Extracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.

cluster-ratings

Extracts cluster-level explicit feedback data. These are the ratings each user assigned to books in each cluster.

work-actions, work-ratings

These do the same thing as the cluster-* stages, except that they group by GoodReads work instead of by integrated cluster. If you are only working with the GoodReads data, and not trying to connect across data sets, this data is better to work with.

work-gender

The author gender for each GoodReads work, as near as we can tell.
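The grouping performed by the cluster-actions and work-actions steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the tools' implementation: the tuple layout of the interaction log, the output field names (nactions, first_time, last_time), and the choice to skip books that have no work are all hypothetical.

```python
from collections import defaultdict


def aggregate_actions(interactions, book_links, key="cluster"):
    """Group shelf-add interactions into one record per (user, item) pair.

    interactions: iterable of (user, book_id, timestamp) tuples (hypothetical layout)
    book_links:   dict of book_id -> {"work_id": int or None, "cluster": int}
    key:          "cluster" for cluster-level records, "work_id" for work-level ones
    """
    groups = defaultdict(list)
    for user, book_id, timestamp in interactions:
        item = book_links[book_id][key]
        if item is None:
            continue  # assumption: books with no work are left out of work-level data
        groups[(user, item)].append(timestamp)

    return [
        {
            "user": user,
            "item": item,
            "nactions": len(stamps),    # number of shelf additions in the group
            "first_time": min(stamps),
            "last_time": max(stamps),
        }
        for (user, item), stamps in sorted(groups.items())
    ]


# Two editions (books 10 and 11) of one work share a cluster; book 12 has no work.
links = {
    10: {"work_id": 100, "cluster": 7},
    11: {"work_id": 100, "cluster": 7},
    12: {"work_id": None, "cluster": 9},
}
log = [(1, 10, 50), (1, 11, 80), (1, 12, 60), (2, 10, 20)]

by_cluster = aggregate_actions(log, links, key="cluster")
by_work = aggregate_actions(log, links, key="work_id")
```

In this toy log, user 1 added two editions of the same work, so both groupings collapse them into a single record with two actions.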

Scanned and Linking Data#

goodreads/gr-book-ids.parquet#

Identifiers extracted from each GoodReads book record.

| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32? |
| isbn10 | Utf8? |
| isbn13 | Utf8? |
| asin | Utf8? |

goodreads/gr-book-info.parquet#

Metadata extracted from GoodReads book records.

| Field | Type |
|---|---|
| book_id | Int32 |
| title | Utf8? |
| pub_year | UInt16? |
| pub_month | UInt8? |
| pub_date | Date32? |
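As an aid to reading the schema tables, the sketch below checks a record against the gr-book-info fields. It assumes that a trailing "?" on a type marks the column as nullable, which is an interpretation rather than something stated here; the helper check_row is hypothetical, and pub_date is left out to keep the example short.

```python
# Types copied from the gr-book-info table; "?" is read as "nullable" (an assumption).
BOOK_INFO_SCHEMA = {
    "book_id": "Int32",
    "title": "Utf8?",
    "pub_year": "UInt16?",
    "pub_month": "UInt8?",
}

# Inclusive value ranges for the integer types used above.
INT_BOUNDS = {
    "Int32": (-2**31, 2**31 - 1),
    "UInt16": (0, 2**16 - 1),
    "UInt8": (0, 2**8 - 1),
}


def check_row(row, schema=BOOK_INFO_SCHEMA):
    """Return a list of problems found in one record (empty if it fits the schema)."""
    problems = []
    for field, spec in schema.items():
        nullable = spec.endswith("?")
        base = spec.rstrip("?")
        value = row.get(field)
        if value is None:
            if not nullable:
                problems.append(f"{field}: missing but not nullable")
        elif base == "Utf8":
            if not isinstance(value, str):
                problems.append(f"{field}: expected text")
        else:
            low, high = INT_BOUNDS[base]
            if not isinstance(value, int) or not low <= value <= high:
                problems.append(f"{field}: {value!r} does not fit {base}")
    return problems


print(check_row({"book_id": 1, "title": None, "pub_year": 1999}))
```

A record may omit title, pub_year, or pub_month under this reading, but not book_id.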

goodreads/gr-book-genres.parquet#

GoodReads book-genre associations.

| Field | Type |
|---|---|
| book_id | Int32 |
| genre_id | Int32 |
| count | Int32 |

goodreads/gr-book-series.parquet#

GoodReads book series associations.

| Field | Type |
|---|---|
| book_id | Int32 |
| series | Utf8 |

goodreads/gr-genres.parquet#

The genre labels to go with goodreads/gr-book-genres.parquet.

| Field | Type |
|---|---|
| genre_id | Int32? |
| genre | Utf8? |

goodreads/gr-book-link.parquet#

Linking identifiers (work and cluster) for GoodReads books.

| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32? |
| cluster | Int32 |

goodreads/gr-work-info.parquet#

Metadata extracted from GoodReads work records.

| Field | Type |
|---|---|
| work_id | Int32 |
| title | Utf8? |
| pub_year | Int16? |
| pub_month | UInt8? |
| pub_date | Date32? |

goodreads/simple/gr-interactions.parquet#

GoodReads interaction records (from CSV).

goodreads/full/gr-interactions.parquet#

GoodReads interaction records (from JSON).

goodreads/gr-author-info.parquet#

GoodReads author information.

| Field | Type |
|---|---|
| author_id | Int32 |
| name | Utf8? |

Cluster-Level Tables#

goodreads/full/gr-cluster-actions.parquet#

Cluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

goodreads/full/gr-cluster-ratings.parquet#

Cluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.

goodreads/simple/gr-cluster-actions.parquet#

Cluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the CSV data.

goodreads/simple/gr-cluster-ratings.parquet#

Cluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the CSV data.

Work-Level Tables#

goodreads/full/gr-work-actions.parquet#

Work-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

goodreads/full/gr-work-ratings.parquet#

Work-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

goodreads/simple/gr-work-actions.parquet#

Work-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

goodreads/simple/gr-work-ratings.parquet#

Work-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.

goodreads/gr-work-gender.parquet#

Author gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.

| Field | Type |
|---|---|
| cluster | Int32? |
| gender | Utf8? |
| book_id | Int32? |
| work_id | Int32? |
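The work-to-cluster connection described for gr-work-gender can be sketched as a simple lookup join. This is an illustration under assumptions, not the actual implementation: the function name, the in-memory row layout, and the gender labels in the example are hypothetical, and the real step may join the tables differently.

```python
def work_genders(book_links, cluster_genders):
    """Attach the cluster-level author gender to each linked GoodReads book.

    book_links:      rows with book_id, work_id (may be None), and cluster
    cluster_genders: dict of cluster -> gender label
    """
    rows = []
    for link in book_links:
        rows.append({
            "cluster": link["cluster"],
            "gender": cluster_genders.get(link["cluster"]),  # None when no record
            "book_id": link["book_id"],
            "work_id": link["work_id"],
        })
    return rows


# Hypothetical link rows and gender labels, for illustration only.
links = [
    {"book_id": 10, "work_id": 100, "cluster": 7},
    {"book_id": 12, "work_id": None, "cluster": 9},
]
genders = {7: "female"}

result = work_genders(links, genders)
```

Books whose cluster has no gender record keep a null gender in this sketch, which is one way the nullable columns in the table above could arise.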