The book tools are built around Data Version Control (DVC), a tool for managing data processing pipelines. We use the software in a somewhat unusual way because PostgreSQL is our primary storage.
Import is organized into stages that are also tracked in the PostgreSQL database, so that we can get current status information from the DB about the data that has been loaded into it.
Stages and Status
Because our primary data storage is in a database, and DVC likes to track files, our design is a little unusual for a DVC project. Each stage that produces output in the database (import, index, etc.) is implemented as two stages:
A primary stage, which produces a .transcript file as output; it works just like any normal DVC stage.
A status stage, which depends on the .transcript file and produces a .status file. This stage is marked as always_changed, so DVC re-runs it even if the transcript is unchanged; that ensures the .status file reflects the current state of the database.
Stages that take database state as input depend on the corresponding .status file, not the .transcript file, so that their need to update is triggered based on the current database state.
This wires together all of the dependencies, and uses the current state in the database instead of
files that might become out-of-sync with the database to track import status.
There are a couple of holes in this design, but it’s the best we can do and it works.
The stage name matches the name of the
The reason for this somewhat bizarre layout is that if we just wrote the output files, and the database was reloaded or corrupted, the DVC status-checking logic would not be able to keep track of it. This double-file design allows us to make subsequent steps depend on the actual results of the import, not our memory of the import in the Git repository.
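The role of the status stage can be illustrated with a minimal Python sketch. It is not the project's actual implementation: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the column names (stage_name, stage_key), the stage name import-books, and the helper functions are all hypothetical. The point is only that the .status file is regenerated from the database each time, so its content changes when the database changes, and DVC (which tracks outputs by content) would then see downstream stages as out of date.

```python
import sqlite3
import tempfile
from pathlib import Path


def status_text(conn, stage):
    """Render the database's current record for a stage as text.

    The stage_status columns used here are assumptions for illustration.
    """
    row = conn.execute(
        "SELECT stage_key FROM stage_status WHERE stage_name = ?", (stage,)
    ).fetchone()
    if row is None:
        return f"{stage}: missing\n"
    return f"{stage}: complete, key={row[0]}\n"


def write_status(conn, stage, out_dir):
    """Write <stage>.status from the live database state and return its path."""
    path = Path(out_dir) / f"{stage}.status"
    path.write_text(status_text(conn, stage))
    return path


# SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_status (stage_name TEXT PRIMARY KEY, stage_key TEXT)"
)
conn.execute("INSERT INTO stage_status VALUES ('import-books', 'abc123')")

with tempfile.TemporaryDirectory() as tmp:
    before = write_status(conn, "import-books", tmp).read_text()
    # Simulate the database being wiped and reloaded without the import.
    conn.execute("DELETE FROM stage_status")
    after = write_status(conn, "import-books", tmp).read_text()

print(before, after, sep="")
```

A transcript file written once by the primary stage would be identical before and after the simulated reload; only the regenerated status output differs, which is why downstream stages depend on it.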
In-Database Status Tracking
Import steps are tracked in the
stage_status table in the database. For completed stages, this can
include a key (checksum, UUID, or other identifier) to identify a ‘version’ of the stage. Stages
can also have dependencies, which are solely used for computing the status of a stage (all actual
dependency relationships are handled by DVC):
stage_deps tracks stage-to-stage dependencies, to say that one stage used another as input.
stage_file tracks stage-to-file dependencies, to say that a stage used a file as input.
The source_file table tracks input file checksums.
Projects using the book database can also use
stage_status to obtain data version information, to
see if they are up-to-date.
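As a rough illustration of how these tables could fit together, the following Python sketch builds toy versions of them and checks whether a downstream project's recorded version is still current. Only the table names come from the text above. Everything else is an assumption made for the example: the column names, the stage and file names, the stage_summary helper, and the use of in-memory SQLite in place of PostgreSQL.

```python
import sqlite3

# Hypothetical, simplified schema; the real tables may differ.
SCHEMA = """
CREATE TABLE stage_status (stage_name TEXT PRIMARY KEY, stage_key TEXT);
CREATE TABLE stage_deps (stage_name TEXT, dep_name TEXT);
CREATE TABLE source_file (filename TEXT PRIMARY KEY, checksum TEXT);
CREATE TABLE stage_file (stage_name TEXT, filename TEXT);
"""


def stage_summary(conn, stage):
    """Collect a stage's key, its dependency stages' keys, and its file checksums."""
    own = conn.execute(
        "SELECT stage_key FROM stage_status WHERE stage_name = ?", (stage,)
    ).fetchone()
    deps = conn.execute(
        "SELECT d.dep_name, s.stage_key FROM stage_deps d "
        "LEFT JOIN stage_status s ON s.stage_name = d.dep_name "
        "WHERE d.stage_name = ? ORDER BY d.dep_name",
        (stage,),
    ).fetchall()
    files = conn.execute(
        "SELECT f.filename, sf.checksum FROM stage_file f "
        "LEFT JOIN source_file sf ON sf.filename = f.filename "
        "WHERE f.stage_name = ? ORDER BY f.filename",
        (stage,),
    ).fetchall()
    return {"key": own[0] if own else None, "deps": deps, "files": files}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO stage_status VALUES (?, ?)",
    [("import-authors", "a1"), ("index-books", "v1")],
)
conn.execute("INSERT INTO stage_deps VALUES ('index-books', 'import-authors')")
conn.execute("INSERT INTO source_file VALUES ('books.csv', 'sha-1')")
conn.execute("INSERT INTO stage_file VALUES ('index-books', 'books.csv')")

summary = stage_summary(conn, "index-books")

# A downstream project records the key it last built against...
recorded_key = summary["key"]

# ...and later compares it with the database after the stage has been re-run.
conn.execute(
    "UPDATE stage_status SET stage_key = 'v2' WHERE stage_name = 'index-books'"
)
up_to_date = stage_summary(conn, "index-books")["key"] == recorded_key
print(summary, up_to_date)
```

Comparing a single stage key is the simplest check; a project that wants to be stricter could compare the whole summary, including dependency keys and file checksums.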