Source Data

These import tools will integrate several data sets. Some of them are auto-downloaded, but others you will need to download yourself and save in the data directory. The data sources are:

Library of Congress MDSConnect Open MARC Records (auto-downloaded).
LoC MDSConnect Name Authorities (auto-downloaded).
Virtual International Authority File (VIAF) MARC 21 XML data (auto-downloaded, but usually needs configuration to access the current data file; see the documentation for details).
OpenLibrary Dump (auto-downloaded).
Amazon Ratings (2014) ‘ratings only’ data for Books (not auto-downloaded; save the CSV file in data/az2014). If you use this data, cite the paper on that site.
Amazon Ratings (2018) ‘ratings only’ data for Books (not auto-downloaded; save the CSV file in data/az2018). If you use this data, cite the paper on that site.
BookCrossing (auto-downloaded). If you use this data, cite the paper on that site.
GoodReads data from UCSD Book Graph: the GoodReads books, works, authors, series, and interaction files (not auto-downloaded; save the GZip’d JSON files in data/goodreads). If you use this data, cite the paper on that site. More information on the options is in the docs.

If all files are properly downloaded, dvc status -R data will show that all files are up to date (it may also display warnings about locked files).
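
Before running dvc status, it can help to confirm that the manually downloaded files are in place. The following is a minimal sketch, not part of the import tools: the function name missing_manual_downloads and the glob patterns are assumptions, since this document names only the directories and file types, not the exact file names. It covers two of the manual directories listed above and can be extended for the others.

```python
from pathlib import Path

# Directories named in the source list for manually downloaded data.
# The glob patterns are assumptions: the README says only "CSV file"
# and "GZip'd JSON files", not the exact file names.
MANUAL_SOURCES = {
    "data/az2014": "*.csv",
    "data/goodreads": "*.json.gz",
}


def missing_manual_downloads(repo_root="."):
    """Return the directories that hold no file matching their pattern."""
    root = Path(repo_root)
    missing = []
    for rel_dir, pattern in MANUAL_SOURCES.items():
        directory = root / rel_dir
        # A directory that does not exist is reported the same way as
        # one that exists but has no matching files.
        if not (directory.is_dir() and any(directory.glob(pattern))):
            missing.append(rel_dir)
    return missing


if __name__ == "__main__":
    for rel_dir in missing_manual_downloads():
        print(f"no matching files found in {rel_dir}")
```

This only checks that files are present; dvc status -R data remains the check that they are the expected, up-to-date versions.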

See Data Model for details on how each data source appears in the final data.

Configuration

The pipeline is reconfigurable to use subsets of this data. To change the pipeline options:

Edit config.yaml to specify the options you want, such as using full GoodReads interaction files.
Re-render the pipeline with cargo run --release pipeline render.
Commit the updated pipeline to git (optional, but recommended prior to running).

Running dvc repro will then use the reconfigured pipeline.
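
Once config.yaml has been edited by hand, the remaining steps can be scripted. The sketch below is one possible wrapper, not part of the import tools: the cargo and dvc commands are the ones named in this section, while the reconfigure function, the dry_run flag, the use of git commit -a, and the commit message are illustrative assumptions.

```python
import subprocess

# The cargo and dvc commands come from the steps in this section.
# The git invocation is an assumption: "commit -a" records changes to
# tracked files only, and the message is a placeholder.
RECONFIGURE_STEPS = [
    ["cargo", "run", "--release", "pipeline", "render"],
    ["git", "commit", "-a", "-m", "Re-render pipeline"],
    ["dvc", "repro"],
]


def reconfigure(steps=RECONFIGURE_STEPS, dry_run=True):
    """Print each step in order and, unless dry_run is set, execute it.

    Execution stops at the first failing command because check=True
    raises CalledProcessError. Returns the commands as strings.
    """
    shown = []
    for cmd in steps:
        line = " ".join(cmd)
        print("$", line)
        shown.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return shown


if __name__ == "__main__":
    # Dry run by default: shows what would be executed without running it.
    reconfigure(dry_run=True)
```

The dry-run default is deliberate, since re-rendering and reproducing the pipeline can be slow; pass dry_run=False from the repository root to actually run the commands.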