Amazon Ratings
This processes two data sets from Julian McAuley’s group at UCSD, the 2014 and the 2018 Amazon review data sets.
Each consists of user-provided reviews and ratings for a variety of products.
Currently we import the ratings-only data from the Books segment of the 2014 data set, and the books reviews from the 2018 data set.
If you use this data, cite the paper(s) documented on the data set web site.
For 2014 data, the citations are:
R. He and J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW 2016. DOI:10.1145/2872427.2883037.
J. McAuley, C. Targett, J. Shi, and A. van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proc. SIGIR 2015. DOI:10.1145/2766462.2767755.
For 2018 data:
J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
Imported data lives in the az2014 and az2018 directories. The source files are not automatically downloaded; you will need to download the ratings-only data for the Books category from each data site and save them in the data/az2014 and data/az2018 directories.
Configuration
config.yaml allows you to specify whether the review data is used:

    az2014:
      enabled: true
    az2018:
      enabled: true
      source: reviews
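
As a rough illustration of how these settings might be consumed, the sketch below takes the configuration as an already-parsed dictionary (the structure a YAML loader would return for the example above) and reports which data sets are enabled and which source each one uses. The function name `enabled_sources` and the fallback value `ratings` for a missing `source` key are assumptions made for this example; they are not taken from the import tools.

```python
def enabled_sources(config):
    """Map each enabled data set name to its source kind.

    `config` is the parsed content of config.yaml. The fallback value
    "ratings" for a missing `source` key is an assumption of this sketch.
    """
    result = {}
    for name in ("az2014", "az2018"):
        section = config.get(name) or {}
        if section.get("enabled", False):
            result[name] = section.get("source", "ratings")
    return result


# The same structure as the config.yaml example, written as a Python dict.
config = {
    "az2014": {"enabled": True},
    "az2018": {"enabled": True, "source": "reviews"},
}
selected = enabled_sources(config)
print(selected)
```

A data set whose section is missing or has `enabled: false` is simply left out of the result.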
Import Steps
The import is controlled by the following DVC steps:
- scan-ratings: Scan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.
- cluster-ratings: Link ratings with book clusters and aggregate by cluster, to produce user ratings for book clusters. Produces az2014/az-cluster-ratings.parquet.
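
The ID conversion in the scan step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual import code: it assumes each CSV row is `user,asin,rating,timestamp` with no header (the real files may order their columns differently), it assigns IDs in order of first appearance, and it skips writing the Parquet output. The name `scan_ratings` and the sample rows are hypothetical.

```python
import csv
import io


def scan_ratings(lines):
    """Parse rating rows and replace user strings with numeric IDs.

    Assumes each row is `user,asin,rating,timestamp` with no header;
    the real files may order their columns differently.
    """
    user_ids = {}  # user string -> numeric ID, assigned on first appearance
    rows = []
    for user, asin, rating, timestamp in csv.reader(lines):
        uid = user_ids.setdefault(user, len(user_ids) + 1)
        rows.append({
            "user": uid,
            "asin": asin,  # kept as a string, matching the Utf8 column
            "rating": float(rating),
            "timestamp": int(timestamp),
        })
    return rows, user_ids


# Made-up sample rows for illustration only.
sample = io.StringIO(
    "UA,0001,5.0,1000\n"
    "UB,0001,3.0,1010\n"
    "UA,0002,4.0,1020\n"
)
rows, user_ids = scan_ratings(sample)
```

Keeping the string-to-ID mapping alongside the rows makes it possible to trace a numeric user ID back to the original identifier if needed.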
Raw Data
az2014/ratings.parquet
The raw rating data, with user strings converted to numeric IDs, is in this file.
File details
Field | Type |
---|---|
user | Int32 |
asin | Utf8 |
rating | Float32 |
timestamp | Int64 |
az2018/ratings.parquet
The raw rating data, with user strings converted to numeric IDs, is in this file.
File details
Field | Type |
---|---|
user | Int32 |
asin | Utf8 |
rating | Float32 |
timestamp | Int64 |
Extracted Rating Tables
az2014/az-cluster-ratings.parquet
This file contains the integrated Amazon ratings, with cluster IDs in the item column.
File details
Field | Type |
---|---|
user | Int32 |
item | Int32 |
rating | Float32 |
last_rating | Float32 |
first_time | Int64 |
last_time | Int64 |
nratings | UInt32 |
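
To make the columns of this table concrete, here is a minimal sketch of how per-ASIN ratings could be collapsed into one row per user and cluster. It is a guess at the shape of the aggregation, not the actual implementation: using the median for `rating` and dropping ASINs that have no cluster are both assumptions, and the function name `cluster_ratings`, the `asin_cluster` mapping, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def cluster_ratings(ratings, asin_cluster):
    """Aggregate per-ASIN ratings into one row per (user, cluster).

    `ratings` is an iterable of (user, asin, rating, timestamp) tuples and
    `asin_cluster` maps ASINs to cluster IDs; ASINs without a cluster are
    dropped. Using the median for `rating` is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for user, asin, rating, timestamp in ratings:
        cluster = asin_cluster.get(asin)
        if cluster is not None:
            groups[(user, cluster)].append((timestamp, rating))

    out = []
    for (user, cluster), events in sorted(groups.items()):
        events.sort()  # chronological order
        out.append({
            "user": user,
            "item": cluster,
            "rating": median(r for _, r in events),
            "last_rating": events[-1][1],
            "first_time": events[0][0],
            "last_time": events[-1][0],
            "nratings": len(events),
        })
    return out


# Made-up data: ASINs A1 and A2 belong to the same book cluster.
ratings = [
    (1, "A1", 5.0, 100),
    (1, "A2", 3.0, 200),
    (2, "A1", 4.0, 150),
    (1, "A3", 2.0, 300),
]
asin_cluster = {"A1": 10, "A2": 10, "A3": 20}
aggregated = cluster_ratings(ratings, asin_cluster)
```

In this example user 1 rated two editions in cluster 10, so those two ratings become a single row with `nratings` of 2, while `first_time` and `last_time` bracket the period over which the ratings were given.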
az2018/az-cluster-ratings.parquet
This file contains the integrated Amazon ratings, with cluster IDs in the item column.
File details
Field | Type |
---|---|
user | Int32 |
item | Int32 |
rating | Float32 |
last_rating | Float32 |
first_time | Int64 |
last_time | Int64 |
nratings | UInt32 |