GoodReads (UCSD Book Graph)#
We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:
- Books
- Book works
- Authors
- Book genres
- Book series
- Interaction data (the CSV summary file, along with its book and user ID files, is used by default; the full JSON file is also supported)
We do not yet support reviews.
If you use this data, cite the paper(s) documented on the data set web site.
Imported data lives in the goodreads directory.
Configuration#
The config.tcl file defines which source of GoodReads interaction data is used:
set gr_interactions simple
The default, simple, uses the CSV summary data, which you can download directly from the web site as three files:
goodreads_interactions.csv
user_id_map.csv
book_id_map.csv
Download and save these three files in data/goodreads, along with the other metadata files.
The tools also support the detailed version of the interaction data, delivered in JSON format (change gr_interactions to full). If you want this version, you need to contact Mengting Wan, as noted on the web site.
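As a quick pre-flight check before running the import, the following minimal sketch reads the gr_interactions setting and reports which of the three CSV files are missing. The function names are hypothetical and not part of the tools, and the regular expression only handles the literal `set gr_interactions <value>` form shown above; it is not a Tcl parser.

```python
import re
from pathlib import Path

# The three files used by the default "simple" interaction source.
SIMPLE_FILES = [
    "goodreads_interactions.csv",
    "user_id_map.csv",
    "book_id_map.csv",
]


def read_gr_interactions(config_text, default="simple"):
    """Return the value of the last `set gr_interactions <value>` line.

    Hypothetical helper: only matches the plain literal form, and falls
    back to `default` when no such line is present.
    """
    mode = default
    for line in config_text.splitlines():
        m = re.match(r"\s*set\s+gr_interactions\s+(\S+)\s*$", line)
        if m:
            mode = m.group(1)
    return mode


def missing_simple_files(data_dir):
    """List the CSV summary files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in SIMPLE_FILES if not (data_dir / name).is_file()]
```

For example, `missing_simple_files("data/goodreads")` returns an empty list once all three files are in place.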
Import Steps#
The import is controlled by several DVC steps:
scan-*
The various scan-* steps each scan a JSON file into corresponding Parquet files. They have a specific order, as scanning interactions needs book information.
book-isbn-ids
Match GoodReads ISBNs with ISBN IDs.
book-links
Creates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and its cluster ID.
cluster-actions
Extracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.
cluster-ratings
Extracts cluster-level explicit feedback data: the ratings each user assigned to books in each cluster.
work-actions, work-ratings
The same as the cluster-* stages, except that they group by GoodReads work instead of by integrated cluster. If you are only working with the GoodReads data, and not trying to connect across data sets, this data is better to work with.
work-gender
The author gender for each GoodReads work, as near as we can tell.
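The grouping performed by cluster-actions can be illustrated with a minimal in-memory sketch. It is not the tools' implementation: the input shapes, the function name, and the output field names other than item (user, nactions, first_time, last_time) are assumptions made for illustration, as is the choice to skip books that have no link record.

```python
def cluster_actions(interactions, book_clusters):
    """Aggregate book-level actions into one record per (user, cluster).

    interactions: iterable of (user, book_id, timestamp) tuples.
    book_clusters: dict mapping book_id -> cluster ID, as could be built
        from the book_id and cluster columns of gr-book-link.parquet.
    """
    stats = {}
    for user, book_id, ts in interactions:
        cluster = book_clusters.get(book_id)
        if cluster is None:
            continue  # assumption: drop actions on books with no link record
        key = (user, cluster)
        rec = stats.get(key)
        if rec is None:
            stats[key] = {
                "user": user,
                "item": cluster,
                "nactions": 1,
                "first_time": ts,
                "last_time": ts,
            }
        else:
            rec["nactions"] += 1
            rec["first_time"] = min(rec["first_time"], ts)
            rec["last_time"] = max(rec["last_time"], ts)
    return list(stats.values())
```

Passing a book_id to work_id mapping instead of the cluster mapping gives the corresponding work-level grouping.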
Scanned and Linking Data#
- goodreads/gr-book-ids.parquet#
Identifiers extracted from each GoodReads book record.
| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32? |
| isbn10 | Utf8? |
| isbn13 | Utf8? |
| asin | Utf8? |
- goodreads/gr-book-info.parquet#
Metadata extracted from GoodReads book records.
| Field | Type |
|---|---|
| book_id | Int32 |
| title | Utf8? |
| pub_year | UInt16? |
| pub_month | UInt8? |
| pub_date | Date32? |
- goodreads/gr-book-genres.parquet#
GoodReads book-genre associations.
| Field | Type |
|---|---|
| book_id | Int32 |
| genre_id | Int32 |
| count | Int32 |
- goodreads/gr-book-series.parquet#
GoodReads book series associations.
| Field | Type |
|---|---|
| book_id | Int32 |
| series | Utf8 |
- goodreads/gr-genres.parquet#
The genre labels to go with goodreads/gr-book-genres.parquet.

| Field | Type |
|---|---|
| genre_id | Int32? |
| genre | Utf8? |
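The two genre tables are meant to be joined on genre_id. The sketch below shows that join on in-memory rows and picks one label per book. The function name is hypothetical, and treating a larger count as a stronger association is an assumption; this page does not define count.

```python
def top_genre(book_genres, genres):
    """Return book_id -> (genre label, count) for each book's largest count.

    book_genres: iterable of (book_id, genre_id, count) rows, following
        the columns of gr-book-genres.parquet.
    genres: iterable of (genre_id, genre) rows, following the columns of
        gr-genres.parquet.
    """
    names = {genre_id: genre for genre_id, genre in genres}
    best = {}
    for book_id, genre_id, count in book_genres:
        if genre_id not in names:
            continue  # no label available for this genre ID
        # Keep the first row seen when two genres tie on count.
        if book_id not in best or count > best[book_id][1]:
            best[book_id] = (names[genre_id], count)
    return best
```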
- goodreads/gr-book-link.parquet#
Linking identifiers (work and cluster) for GoodReads books.
| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32? |
| cluster | Int32 |
- goodreads/gr-work-info.parquet#
Metadata extracted from GoodReads work records.
| Field | Type |
|---|---|
| work_id | Int32 |
| title | Utf8? |
| pub_year | Int16? |
| pub_month | UInt8? |
| pub_date | Date32? |
- goodreads/simple/gr-interactions.parquet#
GoodReads interaction records (from CSV).
- goodreads/full/gr-interactions.parquet#
GoodReads interaction records (from JSON).
- goodreads/gr-author-info.parquet#
GoodReads author information.
| Field | Type |
|---|---|
| author_id | Int32 |
| name | Utf8? |
Cluster-Level Tables#
- goodreads/full/gr-cluster-actions.parquet#
Cluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.
- goodreads/full/gr-cluster-ratings.parquet#
Cluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.
- goodreads/simple/gr-cluster-actions.parquet#
Cluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the CSV data.
- goodreads/simple/gr-cluster-ratings.parquet#
Cluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the CSV data.
Work-Level Tables#
- goodreads/full/gr-work-actions.parquet#
Work-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.
- goodreads/full/gr-work-ratings.parquet#
Work-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.
- goodreads/simple/gr-work-actions.parquet#
Work-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.
- goodreads/simple/gr-work-ratings.parquet#
Work-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.
- goodreads/gr-work-gender.parquet#
Author gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.

| Field | Type |
|---|---|
| cluster | Int32? |
| gender | Utf8? |
| book_id | Int32? |
| work_id | Int32? |
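Because this table also carries book_id, a work may appear in more than one row. A minimal sketch of collapsing rows to one entry per work follows. The function name is hypothetical, rows are assumed to be dictionaries keyed by the column names above, and the gender values used in the usage line are made up for illustration.

```python
def work_genders(rows):
    """Return work_id -> set of gender labels seen for that work.

    rows: iterable of dicts with at least `work_id` and `gender` keys.
    Rows missing either value are skipped, since both columns are
    listed as optional in the table above.
    """
    result = {}
    for row in rows:
        work_id = row.get("work_id")
        gender = row.get("gender")
        if work_id is None or gender is None:
            continue
        result.setdefault(work_id, set()).add(gender)
    return result
```

For example, `work_genders([{"work_id": 7, "gender": "female"}])` maps work 7 to a one-element set. A work mapped to more than one label would deserve a closer look before use in an analysis.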