Setting Up the Environment¶
These tools require PostgreSQL and an Anaconda installation.
You will need:
A PostgreSQL server with at least 500GB of disk space available for the database and ample RAM.
An environment to save the raw files and run the import code. This can be the same machine as the PostgreSQL server, but needs:
Anaconda or Miniconda
100GB of disk space for input files
A few GB of memory
Linux or macOS (since graph-tool isn’t built for Windows right now)
The scripts don’t have substantial memory requirements, but do need a good deal of disk space. The most memory-intensive operation is the connected components computation for book clustering.
The book data tools require PostgreSQL (at least version 10), with the following extensions installed:
PostgreSQL Contrib (specifically
The database will take approximately 500GB.
Create a database for the book data, e.g.
bookdata, owned by the database user you will be using to run the data integration tools. The tools will create various tables and schemas.
Once you have created the database, run the following as the database superuser to enable the PostgreSQL extensions:
CREATE EXTENSION pg_prewarm; CREATE EXTENSION orafce; CREATE EXTENSION "uuid-ossp";
Import Tool Dependencies¶
The import tools are written in Python and Rust. The provided
environment.yml file defines an Anaconda environment (named
bookdata by default) that contains all required runtimes and libraries:
conda env create -f environment.yml conda activate bookdata
If you don’t want to use Anaconda, see the following for more details on dependencies.
This needs the following Python dependencies:
Python 3.6 or later
dvc (0.90 or later)
All scripts read database configuration from the
DB_URL environment variable, or alternately
a config file
db.cfg. This file should look like:
[DEFAULT] host = localhost database = bookdata user = user password = password
This file additionally supports branch-specfic configuration sections that will apply to work on different Git branches, e.g.:
[DEFAULT] host = localhost database = bookdata [master] database = bdorig
This setup will use
bookdata for most branches, but will connect to
bdorig when working
master branch in the git repository.
This file should not be committed to Git. It is ignored in