Source Data
These import tools integrate several data sets. Some are downloaded automatically; others you must download yourself and save in the data directory. The data sources are:
- Library of Congress MDSConnect Open MARC Records (auto-downloaded).
- LoC MDSConnect Name Authorities (auto-downloaded).
- Virtual International Authority File (VIAF) MARC 21 XML data (auto-downloaded, but usually needs configuration to access the current data file; see the documentation for details).
- OpenLibrary Dump (auto-downloaded).
- Amazon Ratings (2014) ‘ratings only’ data for Books (not auto-downloaded; save the CSV file in data/az2014). If you use this data, cite the paper on that site.
- Amazon Ratings (2018) ‘ratings only’ data for Books (not auto-downloaded; save the CSV file in data/az2014). If you use this data, cite the paper on that site.
- BookCrossing (auto-downloaded). If you use this data, cite the paper on that site.
- GoodReads data from UCSD Book Graph: the GoodReads books, works, authors, series, and interaction files (not auto-downloaded; save the GZip’d JSON files in data/goodreads). If you use this data, cite the paper on that site. More information on the options is in the docs.
If all files are properly downloaded, dvc status -R data will show that all files are up to date (it may also display warnings about locked files).
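Before running the status check, it can help to confirm that the manually downloaded files are in place. The following is a minimal sketch, not part of these tools: the function name check_manual_data is hypothetical, and it only checks that the two manually populated directories named in the list above (az2014 and goodreads) each contain at least one file. It is only relevant for the data sets you actually intend to use.

```shell
# Hypothetical helper: report whether each manually populated data
# directory contains at least one regular file.
check_manual_data() {
    # $1: root data directory (assumed to be "data" in the repository)
    root="${1:-data}"
    missing=0
    for dir in az2014 goodreads; do
        # The directory must exist and hold at least one file at its top level
        if [ -d "$root/$dir" ] && [ -n "$(find "$root/$dir" -maxdepth 1 -type f | head -n 1)" ]; then
            echo "ok: $root/$dir"
        else
            echo "missing: $root/$dir"
            missing=1
        fi
    done
    # Exit status 0 if every directory has files, 1 otherwise
    return "$missing"
}

# Example use (assumes dvc is installed):
#   check_manual_data data && dvc status -R data
```

The helper does not verify file names or contents; dvc status remains the authoritative check.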
See Data Model for details on how each data source appears in the final data.
Configuration
The pipeline is reconfigurable to use subsets of this data. To change the pipeline options:
- Edit config.yaml to specify the options you want, such as using full GoodReads interaction files.
- Re-render the pipeline with cargo run --release pipeline render
- Commit the updated pipeline to git (optional, but recommended prior to running)
A dvc repro will now use the reconfigured pipeline.
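The steps after editing config.yaml can be sketched as a small shell function. This is an illustration under stated assumptions rather than a script shipped with the tools: the names run and reconfigure_pipeline, the DRY_RUN switch, the commit message, and the use of git add -u to stage the re-rendered files are all hypothetical choices. Only the cargo and dvc commands come from the steps above.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Sketch of the reconfiguration workflow. Editing config.yaml is a manual
# step and is assumed to have been done before calling this function.
reconfigure_pipeline() {
    # Re-render the pipeline from the updated configuration
    run cargo run --release pipeline render || return 1
    # Optional: stage and commit changes to tracked files (assumed approach)
    run git add -u || return 1
    # git commit exits non-zero when nothing is staged; treat that as harmless
    run git commit -m "Re-render pipeline" || echo "no pipeline changes committed"
    # Run the reconfigured pipeline
    run dvc repro
}
```

Setting DRY_RUN=1 before calling reconfigure_pipeline prints the commands in order, which is a cheap way to review them before a long pipeline run.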