Amazon Ratings

This processes two data sets from Julian McAuley’s group at UCSD: the 2014 and the 2018 Amazon review data. Each consists of user-provided reviews and ratings for a variety of products.

Currently we import the ratings-only data from the Books segment of the 2014 and 2018 data sets. Future versions of the data tools will support reviews.

If you use this data, cite the paper(s) documented on the data set web site. For 2014 data:

R. He and J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW 2016. DOI:10.1145/2872427.2883037.

J. McAuley, C. Targett, J. Shi, and A. van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proc. SIGIR 2015. DOI:10.1145/2766462.2767755.

For 2018 data:

J. Ni, J. Li, and J. McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proc. EMNLP 2019.

Imported data lives in the az2014 and az2018 directories. The source files are not downloaded automatically; you need to download the ratings-only data for the Books category from each data site and save the files in the data/az2014 and data/az2018 directories, respectively.

Import Steps

The import is controlled by the following DVC steps:

scan-ratings

Scan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.

cluster-ratings

Link ratings with book clusters and aggregate by cluster, to produce user ratings for book clusters. Produces az2014/az-cluster-ratings.parquet.

Raw Data

az2014/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

| Field | Type |
|-----------|---------|
| user | Int32 |
| asin | Utf8 |
| rating | Float32 |
| timestamp | Int64 |
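
As a rough illustration of how rows with this schema could be derived from a ratings-only CSV file, the following Python sketch maps user strings to numeric IDs. It is not the data tools' actual implementation. The column order (user, ASIN, rating, timestamp), the absence of a header row, the 1-based ID numbering, and the function name `scan_ratings` are all assumptions made for the example.

```python
import csv
import io


def scan_ratings(lines):
    """Convert raw rating rows into records with integer user IDs.

    Assumes each row is (user string, ASIN, rating, timestamp) with no header.
    """
    user_ids = {}  # user string -> numeric ID, assigned in order of first appearance
    rows = []
    for user, asin, rating, timestamp in csv.reader(lines):
        # Reuse the ID if the user was seen before; otherwise assign the next one.
        uid = user_ids.setdefault(user, len(user_ids) + 1)
        rows.append(
            {
                "user": uid,
                "asin": asin,
                "rating": float(rating),
                "timestamp": int(timestamp),
            }
        )
    return rows, user_ids


# Tiny made-up sample standing in for the real ratings file.
sample = io.StringIO(
    "A1,0001,5.0,1355616000\n"
    "B2,0001,3.0,1355702400\n"
    "A1,0002,4.0,1355788800\n"
)
rows, user_ids = scan_ratings(sample)
```

With real data, `lines` would be an open file handle for the downloaded CSV rather than an in-memory string.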

az2018/ratings.parquet

The raw rating data, with user strings converted to numeric IDs, is in this file.

| Field | Type |
|-----------|---------|
| user | Int32 |
| asin | Utf8 |
| rating | Float32 |
| timestamp | Int64 |

Extracted Rating Tables

az2014/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

| Field | Type |
|-------------|----------|
| user | Int32? |
| item | Int32? |
| rating | Float32? |
| last_rating | Float32? |
| first_time | Int64? |
| last_time | Int64? |
| nratings | UInt32? |
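
To make the meaning of these columns concrete, here is a minimal Python sketch of a per-user, per-cluster aggregation that yields the same fields. It is an illustration under stated assumptions rather than the data tools' actual logic: the ASIN-to-cluster mapping `asin_cluster`, the function name `cluster_ratings`, the choice of the median for `rating`, and the decision to skip ratings whose ASIN has no cluster are all hypothetical.

```python
from statistics import median


def cluster_ratings(ratings, asin_cluster):
    """Aggregate individual ratings into one record per (user, cluster)."""
    groups = {}
    for r in ratings:
        cluster = asin_cluster.get(r["asin"])
        if cluster is None:
            continue  # assumption: ratings without a linked cluster are dropped
        groups.setdefault((r["user"], cluster), []).append(r)

    out = []
    for (user, item), rs in sorted(groups.items()):
        rs.sort(key=lambda r: r["timestamp"])  # oldest first
        out.append(
            {
                "user": user,
                "item": item,  # cluster ID
                "rating": median(r["rating"] for r in rs),  # assumed aggregate
                "last_rating": rs[-1]["rating"],
                "first_time": rs[0]["timestamp"],
                "last_time": rs[-1]["timestamp"],
                "nratings": len(rs),
            }
        )
    return out


# Made-up sample: two editions ("A", "B") belong to the same cluster 10.
sample_ratings = [
    {"user": 1, "asin": "A", "rating": 5.0, "timestamp": 100},
    {"user": 1, "asin": "B", "rating": 3.0, "timestamp": 200},
    {"user": 2, "asin": "A", "rating": 4.0, "timestamp": 150},
    {"user": 1, "asin": "Z", "rating": 1.0, "timestamp": 300},
]
clustered = cluster_ratings(sample_ratings, {"A": 10, "B": 10})
```

In this sample, user 1's two ratings for different editions collapse into a single row for cluster 10 with `nratings` equal to 2.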

az2018/az-cluster-ratings.parquet

This file contains the integrated Amazon ratings, with cluster IDs in the item column.

| Field | Type |
|-------------|----------|
| user | Int32? |
| item | Int32? |
| rating | Float32? |
| last_rating | Float32? |
| first_time | Int64? |
| last_time | Int64? |
| nratings | UInt32? |