The BookCrossing data set consists of user-provided ratings — both implicit and explicit — of books.
If you use this data, cite:
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. 2005. Improving Recommendation Lists Through Topic Diversification. Proceedings of the 14th International World Wide Web Conference (WWW ‘05), May 10-14, 2005, Chiba, Japan.
Imported data lives in the bx schema. The source data files are automatically downloaded and unpacked by the provided scripts and DVC stages.
Data Model Diagram
The import is controlled by the following DVC steps:
Download the BookCrossing zip file.
Unpack the BookCrossing zip file.
Run bx-schema.sql to set up the base schema.
Import raw BookCrossing ratings from
Run bx-index.sql to index the rating data and integrate it with the book data.
The raw rating data, with invalid characters cleaned up, is in the bx.raw_ratings table, with the following columns:
The user identifier (numeric).
The book ISBN (text).
The book rating. The ratings are on a 1-10 scale, with 0 indicating an implicit-feedback record.
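The rating convention above (0 for implicit feedback, 1-10 for explicit ratings) can be illustrated with a minimal Python sketch. It is not the project's import code: the RawRating type, its user_id and isbn field names, the split_feedback helper, and the sample rows are all assumptions made up for illustration; only the rating column name and its value range come from this page.

```python
from typing import Iterable, List, NamedTuple, Tuple


class RawRating(NamedTuple):
    # Field names are assumptions; only "rating" is named on this page.
    user_id: int
    isbn: str
    rating: int  # 0 = implicit feedback, 1-10 = explicit rating


def split_feedback(
    rows: Iterable[RawRating],
) -> Tuple[List[RawRating], List[RawRating]]:
    """Partition raw rows into (explicit, implicit) lists."""
    explicit: List[RawRating] = []
    implicit: List[RawRating] = []
    for row in rows:
        # Reject values outside the documented 0-10 range.
        if not 0 <= row.rating <= 10:
            raise ValueError(f"rating out of range: {row.rating}")
        if row.rating > 0:
            explicit.append(row)
        else:
            implicit.append(row)
    return explicit, implicit


# Made-up sample rows, for illustration only.
sample = [
    RawRating(1, "0000000001", 8),
    RawRating(1, "0000000002", 0),
    RawRating(2, "0000000001", 10),
]
explicit, implicit = split_feedback(sample)
print(len(explicit), len(implicit))
```

Treating 0 as a marker rather than a score matters for any downstream statistic: including those rows in an average would drag it toward zero.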
Extracted Rating Tables
We extract the following tables for BookCrossing ratings:
The explicit ratings (rating > 0) from the raw ratings table.
Records of users adding books, either by rating or through an implicit feedback action, without rating values.
Both of these tables are pre-clustered, so the book IDs refer to book clusters and not individual ISBNs or editions. They have the following columns:
The user ID.
The book code for this book; the cluster identifier if available, or the ISBN-based book code if this book is not in a cluster.
The rating value; if the user has rated multiple books in a cluster, the median value is reported. This field is present only on the explicit ratings table.
The number of book actions this user performed on this book. Equivalent to the number of books in the cluster that the user has added or rated.
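The clustering and aggregation described by these columns can be sketched in Python as follows. This is an illustration under stated assumptions, not the SQL that the import actually runs: the resolve_book and aggregate helpers, the clusters mapping, the "isbn:" fallback prefix, and the sample rows are hypothetical, and the real pipeline may compute the median differently (here Python's statistics.median is used, which averages the two middle values for an even count).

```python
from collections import defaultdict
from statistics import median


def resolve_book(isbn, clusters):
    """Return the cluster ID for an ISBN, or a stand-in ISBN-based code."""
    # The "isbn:" prefix is a placeholder; the real book-code scheme is
    # not described in this section.
    return clusters.get(isbn, f"isbn:{isbn}")


def aggregate(rows, clusters):
    """Collapse (user, isbn, rating) rows to one record per (user, book)."""
    grouped = defaultdict(list)
    for user, isbn, rating in rows:
        grouped[(user, resolve_book(isbn, clusters))].append(rating)

    ratings = {}  # (user, book) -> (median rating, number of rated books)
    actions = {}  # (user, book) -> number of books added or rated
    for key, values in grouped.items():
        actions[key] = len(values)
        rated = [v for v in values if v > 0]  # drop implicit (0) records
        if rated:
            ratings[key] = (median(rated), len(rated))
    return ratings, actions


# Made-up data: ISBNs A, B, and C belong to hypothetical cluster 100.
clusters = {"A": 100, "B": 100, "C": 100}
rows = [(1, "A", 8), (1, "B", 6), (1, "C", 0), (2, "D", 0)]
ratings, actions = aggregate(rows, clusters)
print(ratings)
print(actions)
```

In this sample, user 1 touched three editions in one cluster and rated two of them, so the add-action count and the rating count differ for the same user and book pair.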