Amazon Ratings

This processes two data sets from Julian McAuley’s group at UCSD:

The 2014 Amazon reviews data set
The 2018 Amazon reviews data set

Each consists of user-provided reviews and ratings for a variety of products.

Currently we import the ratings-only data from the Books segment of the 2014 data set, and the books reviews from the 2018 data set.

Important

If you use this data, cite the paper(s) documented on the data set web site.

For 2014 data, the citations are:

R. He and J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW 2016. DOI:10.1145/2872427.2883037.

J. McAuley, C. Targett, J. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proc. SIGIR 2016. DOI:10.1145/2766462.2767755.

For 2018 data:

J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fined-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), 2019.

Imported data lives in the az2014 and az2018 directories. The source files are not automatically downloaded — you will need to download the ratings-only data for the Books category from each data site and save them in the data/az2014 and data/az2018 directories.

Configuration

config.yaml allows you to specify whether the review data is used:

az2014:
  enabled: true

az2018:
  enabled: true
  source: reviews

Import Steps

The import is controlled by the following DVC steps:

scan-ratings: Scan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.
cluster-ratings: Link ratings with book clusters and aggregate by cluster, to produce user ratings for book clsuters. Produces az2014/az-cluster-ratings.parquet.

Raw Data

az2014/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

File details

Schema for `az2014/ratings.parquet`.
Field	Type
user_id	Int32
asin	LargeUtf8
rating	Float32
timestamp	Int64

az2018/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

File details

Schema for `az2018/ratings.parquet`.
Field	Type
user_id	Int32
asin	LargeUtf8
rating	Float32
timestamp	Int64

Extracted Rating Tables

az2014/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

File details

Schema for `az2014/az-cluster-ratings.parquet`.
Field	Type
user_id	Int32
item_id	Int32
rating	Float32
last_rating	Float32
first_time	Int64
last_time	Int64
nratings	UInt32

az2018/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

File details

Schema for `az2018/az-cluster-ratings.parquet`.
Field	Type
user_id	Int32
item_id	Int32
rating	Float32
last_rating	Float32
first_time	Int64
last_time	Int64
nratings	UInt32