Amazon Ratings

This processes two data sets from Julian McAuley’s group at UCSD:

Each consists of user-provided reviews and ratings for a variety of products.

Currently we import the ratings-only data from the Books segment of the 2014 data set, and the books reviews from the 2018 data set.

Important

If you use this data, cite the paper(s) documented on the data set web site.

For 2014 data, the citations are:

R. He and J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW 2016. DOI:10.1145/2872427.2883037.

J. McAuley, C. Targett, J. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proc. SIGIR 2016. DOI:10.1145/2766462.2767755.

For 2018 data:

J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fined-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), 2019.

Imported data lives in the az2014 and az2018 directories. The source files are not automatically downloaded — you will need to download the ratings-only data for the Books category from each data site and save them in the data/az2014 and data/az2018 directories.

Configuration

config.yaml allows you to specify whether the review data is used:

az2014:
  enabled: true

az2018:
  enabled: true
  source: reviews

Import Steps

The import is controlled by the following DVC steps:

scan-ratings
Scan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.
cluster-ratings
Link ratings with book clusters and aggregate by cluster, to produce user ratings for book clsuters. Produces az2014/az-cluster-ratings.parquet.

Raw Data

az2014/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

File details
Schema for az2014/ratings.parquet.
Field
Type
user
Int32
asin
Utf8
rating
Float32
timestamp
Int64
az2018/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

File details
Schema for az2018/ratings.parquet.
Field
Type
user
Int32
asin
Utf8
rating
Float32
timestamp
Int64

Extracted Rating Tables

az2014/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

File details
Schema for az2014/az-cluster-ratings.parquet.
Field
Type
user
Int32
item
Int32
rating
Float32
last_rating
Float32
first_time
Int64
last_time
Int64
nratings
UInt32
az2018/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

File details
Schema for az2018/az-cluster-ratings.parquet.
Field
Type
user
Int32
item
Int32
rating
Float32
last_rating
Float32
first_time
Int64
last_time
Int64
nratings
UInt32