OpenLibrary

We also source book data from OpenLibrary, as downloaded from their developer dumps.

The DVC control files automatically download the appropriate version. The version can be updated by modifying the data/ol_dump_*.txt.gz.dvc files.

Imported data lives under the openlibrary directory.

erDiagram
    editions ||--o{ edition-isbn-ids : ""
    edition-isbn-ids }o--|| all-isbns : ""
    editions {
        Int32 id PK
        Utf8 key
        Utf8 title
    }
    editions }o--o{ works : "edition-works"
    editions |o--o{ edition-subjects : ""
    edition-subjects {
        Int32 id
        Utf8 subject
    }
    works {
        Int32 id PK
        Utf8 key
        Utf8 title
    }
    works |o--o{ work-subjects : ""
    work-subjects {
        Int32 id
        Utf8 subject
    }
    authors {
        Int32 id PK
        Utf8 key
        Utf8 name
    }
    authors ||--o{ author-names : ""
    editions }o--o{ authors : "edition-authors"
    works }o--o{ authors : "work-authors"

Import Steps

The import is controlled by the following DVC steps:

scan-*: The various scan-* stages (e.g. scan-authors) scan an OpenLibrary JSON file into the resulting Parquet files. There are dependencies, to resolve OpenLibrary keys to numeric identifiers for cross-referencing. These scan stages do not currently extract all available data from the OpenLibrary JSON; they only extract the fields we currently use, and need to be extended to extract and save additional fields.
edition-isbn-ids: Convert edition ISBNs into ISBN IDs, producing {{ERR unknown file edition-isbn-ids.parquet}}.

Raw Data

The raw data lives in the data/openlib directory, as compressed JSON files. Right now we do not extract very many fields from OpenLibrary; additional fields can be extracted by extending the import scripts.

Extracted Edition Tables

We extract the following tables from OpenLibrary editions:

openlibrary/editions.parquet

This file contains a primary record for each edition, with the numeric edition ID, OpenLibrary key, and edition data.

File details

Schema for `openlibrary/editions.parquet`.
Field	Type
id	UInt32
key	LargeUtf8
title	LargeUtf8

openlibrary/edition-authors.parquet

This file contains mappings between editions and their authors.

File details

Schema for `openlibrary/edition-authors.parquet`.
Field	Type
edition	UInt32
pos	Int16
author	UInt32

openlibrary/edition-works.parquet

This maps editions to their works.

File details

Schema for `openlibrary/edition-works.parquet`.
Field	Type
edition	UInt32
work	UInt32

openlibrary/edition-isbns.parquet

This contains the ISBN fields extracted from each OpenLibrary edition. This is primarily for internal purposes and most people won’t need to use it. ISBNs are cleaned (with clean_isbn_chars or clean_asin_chars prior to being stored in this file.

File details

Schema for `openlibrary/edition-isbns.parquet`.
Field	Type
edition	UInt32
isbn	LargeUtf8

openlibrary/edition-subjects.parquet

This table contains the subjects for OpenLibrary editions. Each row contains an edition ID and one subject. Its schema is in EditionSubjectRec.

File details

Schema for `openlibrary/edition-subjects.parquet`.
Field	Type
id	UInt32
subj_type	UInt8
subject	LargeUtf8

openlibrary/edition-isbn-ids.parquet

This file maps editions to numeric ISBN identifiers. It is derived from openlibrary/edition-isbns.parquet.

File details

Schema for `openlibrary/edition-isbn-ids.parquet`.
Field	Type
edition	UInt32
isbn_id	Int32

Extracted Work Tables

We extract the following tables from OpenLibrary works:

openlibrary/works.parquet

This file contains the primary record for each work, mapping a numeric ID to its OpenLibrary key and containing other per-work fields.

File details

Schema for `openlibrary/works.parquet`.
Field	Type
id	UInt32
key	LargeUtf8
title	LargeUtf8

openlibrary/work-authors.parquet

This file links work records to the work’s author list (works may have separate author lists from their editions).

File details

Schema for `openlibrary/work-authors.parquet`.
Field	Type
id	UInt32
pos	Int16
author	UInt32

openlibrary/work-subjects.parquet

This table contains the subjects for OpenLibrary editions. Each row contains an edition ID and one subject. Its schema is in WorkSubjectRec.

File details

Schema for `openlibrary/work-subjects.parquet`.
Field	Type
id	UInt32
subj_type	UInt8
subject	LargeUtf8

Extracted Author Tables

openlibrary/authors.parquet

This file contains basic information about OpenLibrary authors.

File details

Schema for `openlibrary/authors.parquet`.
Field	Type
id	UInt32
key	LargeUtf8
name	LargeUtf8

openlibrary/author-names.parquet

This file contains the names associated with each author in openlibrary/authors.parquet.

File details

Schema for `openlibrary/author-names.parquet`.
Field	Type
id	UInt32
source	UInt8
name	LargeUtf8

Utility Tables

openlibrary/work-clusters.parquet

This file is a helper table to make it easier to connect OpenLibrary data to clusters by mapping OpenLibrary work IDs to book data cluster IDs.

File details

Schema for `openlibrary/work-clusters.parquet`.
Field	Type
work_id	Int32
cluster	Int32