GoodReads
We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:
- Books
- Book works
- Authors
- Book genres
- Book series
- Interaction data (the full interactions JSON file, not the summary CSV)
Download the files and save them in the `data/goodreads` directory. Each one has a corresponding `.dvc` file already in the repository.
If you use this data, cite the paper(s) documented on the data set web site.
Imported data lives in the `goodreads` directory.
Configuration
The `config.yaml` file allows you to disable the GoodReads data entirely, as well as control whether reviews are processed:
goodreads:
  enabled: true
  reviews: true
If you change the configuration, you need to update the pipeline before running.
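As a rough illustration of how these two switches interact, the sketch below reads them from an already-parsed configuration mapping. The function name `goodreads_settings`, the default of `True` for missing keys, and the rule that reviews are skipped whenever the import is disabled are all assumptions made for this example; they are not taken from the pipeline's own code.

```python
from typing import Any, Mapping


def goodreads_settings(config: Mapping[str, Any]) -> dict[str, bool]:
    """Return the GoodReads switches from a parsed config.yaml mapping.

    Hypothetical helper: the defaults below are assumptions for illustration.
    """
    section = config.get("goodreads") or {}
    enabled = bool(section.get("enabled", True))
    # Assumption: reviews are never processed when the import is disabled.
    reviews = enabled and bool(section.get("reviews", True))
    return {"enabled": enabled, "reviews": reviews}


settings = goodreads_settings({"goodreads": {"enabled": True, "reviews": False}})
print(settings)
```

The mapping passed in would normally come from loading `config.yaml` with a YAML parser; that step is left out so the sketch has no third-party dependencies.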
Import Steps
The import is controlled by several DVC steps:
- `scan-*`: Each `scan-*` step scans a JSON file into corresponding Parquet files. The steps must run in a specific order, because scanning interactions needs book information.
- `book-isbn-ids`: Matches GoodReads ISBNs with ISBN IDs.
- `book-links`: Creates `goodreads/gr-book-link.parquet`, which links each GoodReads book with its work (if applicable) and its cluster ID.
- `cluster-actions`: Extracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.
- `cluster-ratings`: Extracts cluster-level explicit feedback data: the ratings each user assigned to books in each cluster.
- `work-actions`, `work-ratings`: The same as the `cluster-*` stages, except that they group by GoodReads work instead of by integrated cluster. If you are only working with the GoodReads data, and not trying to connect across data sets, this data is better to work with.
- `work-gender`: The author gender for each GoodReads work, as near as we can tell.
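The grouping performed by the `cluster-actions` and `work-actions` steps can be pictured with a small pure-Python sketch. It is not the pipeline's implementation: the function `summarize_actions`, the tuple layout of the input, and the choice to skip books that have no link are assumptions for illustration. Only the output field names follow the action tables documented on this page.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional, Tuple

# (user_id, book_id, timestamp, rating or None): a hypothetical input layout.
Interaction = Tuple[int, int, int, Optional[float]]


@dataclass
class ActionSummary:
    user: int
    item: int
    first_time: int
    last_time: int
    nactions: int
    last_rating: Optional[float]


def summarize_actions(
    interactions: Iterable[Interaction],
    book_to_item: Mapping[int, int],
) -> list[ActionSummary]:
    """Collapse book-level interactions into one record per (user, item)."""
    summaries: dict[tuple[int, int], ActionSummary] = {}
    for user, book, time, rating in interactions:
        item = book_to_item.get(book)
        if item is None:
            continue  # assumption: books without a link are skipped
        current = summaries.get((user, item))
        if current is None:
            summaries[(user, item)] = ActionSummary(user, item, time, time, 1, rating)
            continue
        current.nactions += 1
        current.first_time = min(current.first_time, time)
        if time >= current.last_time:
            # The most recent action determines last_time and last_rating.
            current.last_time = time
            current.last_rating = rating
    return list(summaries.values())


book_to_cluster = {10: 500, 11: 500, 12: 501}
log = [(1, 10, 100, None), (1, 11, 200, 4.0), (2, 10, 150, 5.0), (1, 12, 300, 3.0)]
actions = summarize_actions(log, book_to_cluster)
for record in actions:
    print(record)
```

Passing a book-to-work mapping instead of a book-to-cluster mapping gives the work-level variant of the same grouping.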
Scanned and Linking Data
goodreads/gr-book-ids.parquet
Identifiers extracted from each GoodReads book record.
File details
| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32 |
| gr_item | Int32 |
| isbn10 | LargeUtf8 |
| isbn13 | LargeUtf8 |
| asin | LargeUtf8 |
goodreads/gr-book-info.parquet
Metadata extracted from GoodReads book records.
File details
| Field | Type |
|---|---|
| book_id | Int32 |
| title | LargeUtf8 |
| pub_year | UInt16 |
| pub_month | UInt8 |
goodreads/gr-book-genres.parquet
GoodReads book-genre associations.
File details
| Field | Type |
|---|---|
| book_id | Int32 |
| genre_id | Int32 |
| count | Int32 |
goodreads/gr-book-series.parquet
GoodReads book series associations.
File details
| Field | Type |
|---|---|
| book_id | Int32 |
| series | LargeUtf8 |
goodreads/gr-genres.parquet
The genre labels to go with `goodreads/gr-book-genres.parquet`.
File details
| Field | Type |
|---|---|
| genre_id | Int32 |
| genre | LargeUtf8 |
goodreads/gr-book-link.parquet
Linking identifiers (work and cluster) for GoodReads books.
File details
| Field | Type |
|---|---|
| book_id | Int32 |
| work_id | Int32 |
| cluster | Int32 |
goodreads/gr-work-info.parquet
Metadata extracted from GoodReads work records.
File details
| Field | Type |
|---|---|
| work_id | Int32 |
| title | LargeUtf8 |
| pub_year | Int16 |
| pub_month | UInt8 |
goodreads/gr-interactions.parquet
GoodReads interaction records (from JSON).
File details
| Field | Type |
|---|---|
| rec_id | UInt32 |
| review_id | Int64 |
| user_id | Int32 |
| book_id | Int32 |
| is_read | UInt8 |
| rating | Float32 |
| added | Float32 |
| updated | Float32 |
| read_started | Float32 |
| read_finished | Float32 |
Cluster-Level Tables
goodreads/gr-cluster-actions.parquet
Cluster-level implicit-feedback records, suitable for use in LensKit. The `item` column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.
File details
| Field | Type |
|---|---|
| user | Int32 |
| item | Int32 |
| first_time | Int64 |
| last_time | Int64 |
| nactions | UInt32 |
| last_rating | Float32 |
goodreads/gr-cluster-ratings.parquet
Cluster-level explicit-feedback records, suitable for use in LensKit. The `item` column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.
File details
| Field | Type |
|---|---|
| user | Int32 |
| item | Int32 |
| rating | Float32 |
| last_rating | Float32 |
| first_time | Int64 |
| last_time | Int64 |
| nratings | UInt32 |
Work-Level Tables
goodreads/gr-work-actions.parquet
Work-level implicit-feedback records, suitable for use in LensKit. The `item` column contains work IDs.
File details
| Field | Type |
|---|---|
| user | Int32 |
| item | Int32 |
| first_time | Int64 |
| last_time | Int64 |
| nactions | UInt32 |
| last_rating | Float32 |
goodreads/gr-work-ratings.parquet
Work-level explicit-feedback records, suitable for use in LensKit. The `item` column contains work IDs.
File details
| Field | Type |
|---|---|
| user | Int32 |
| item | Int32 |
| rating | Float32 |
| last_rating | Float32 |
| first_time | Int64 |
| last_time | Int64 |
| nratings | UInt32 |
goodreads/gr-work-gender.parquet
Author gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from `book-links/cluster-genders.parquet`.
File details
| Field | Type |
|---|---|
| cluster | Int32 |
| gender | LargeUtf8 |
| book_id | Int32 |
| work_id | Int32 |
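The work-to-cluster connection described for this table can be sketched as a simple lookup. This is a minimal illustration rather than the pipeline's actual code: the function `work_genders`, the dictionary-based inputs, the decision to drop books with no work, and the gender label used in the example are all assumptions. The output keys mirror the fields listed above.

```python
from typing import Iterable, Mapping, Optional


def work_genders(
    book_links: Iterable[Mapping[str, Optional[int]]],
    cluster_genders: Mapping[int, str],
) -> list[dict]:
    """Attach each cluster's gender label to the books and works it contains."""
    rows = []
    for link in book_links:
        if link["work_id"] is None:
            continue  # assumption: books without a work are left out
        rows.append(
            {
                "cluster": link["cluster"],
                # None when the cluster has no gender record (assumption).
                "gender": cluster_genders.get(link["cluster"]),
                "book_id": link["book_id"],
                "work_id": link["work_id"],
            }
        )
    return rows


# Rows shaped like gr-book-link.parquet; values are made up for the example.
links = [
    {"book_id": 1, "work_id": 7, "cluster": 500},
    {"book_id": 2, "work_id": None, "cluster": 501},
    {"book_id": 3, "work_id": 7, "cluster": 500},
]
genders = {500: "female"}  # placeholder label, not taken from the data
gender_rows = work_genders(links, genders)
for row in gender_rows:
    print(row)
```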