We organize the data and pipelines in directories as follows:
Contains the raw import data as downloaded from its original source. Manually-downloaded files and files that DVC can download natively are tracked with .dvc files, and the
dvc.yaml pipeline contains stages to automatically download additional files. The only processing in this directory is downloading.
Data sets consisting of multiple files generally get a subdirectory under this directory.
Contains the results of processing data from the Library of Congress MDSConnect Open MARC service. See LOC for details.
Contains the results of processing the OpenLibrary data.
Contains the results of integrating BookCrossing.
Contains the results of integrating the Amazon 2014 ratings data set.
Contains the GoodReads processing and integration.
Each directory has a DVC pipeline for managing that directory’s outputs. Post-clustering integrations are stored
in the data source directory; for example, the goodreads directory contains both the direct tabular GoodReads data
and the conversion of its ratings into ratings for book clusters based on book-links. As a result, the flow from
directory to directory is not one-directional.
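
To illustrate why the directory-level flow is not one-directional even though the stages themselves can still run in a fixed order, here is a minimal Python sketch. The stage names (scan-interactions, cluster, cluster-ratings) and their dependencies are hypothetical, chosen only for illustration. They are not taken from the actual pipelines. Only the goodreads and book-links directory names come from the description above.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, written as "directory/stage". Each key maps to the
# set of stages it depends on. These names are assumptions for illustration.
STAGE_DEPS = {
    "goodreads/scan-interactions": set(),
    "book-links/cluster": {"goodreads/scan-interactions"},
    "goodreads/cluster-ratings": {
        "goodreads/scan-interactions",
        "book-links/cluster",
    },
}


def directory_of(stage):
    """Return the directory part of a "directory/stage" name."""
    return stage.split("/", 1)[0]


def directory_edges(deps):
    """Collapse stage dependencies into (source_dir, dependent_dir) pairs."""
    edges = set()
    for stage, parents in deps.items():
        for parent in parents:
            src, dst = directory_of(parent), directory_of(stage)
            if src != dst:
                edges.add((src, dst))
    return edges


# At the stage level the graph is acyclic, so a valid run order exists.
stage_order = list(TopologicalSorter(STAGE_DEPS).static_order())

# At the directory level, data flows in both directions.
dir_edges = directory_edges(STAGE_DEPS)
two_way = {(a, b) for (a, b) in dir_edges if (b, a) in dir_edges}

print(stage_order)
print(sorted(two_way))
```

In this sketch the goodreads directory both feeds book-links and consumes its output, so sorting whole directories would fail while sorting individual stages still works.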