Library of Congress

One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.

We download and import the XML versions of these files.

Imported data lives under the loc-mds directory.

erDiagram
    book-ids |o--|{ book-fields : contains
    book-ids ||--o{ book-isbns : ""
    book-ids ||--o{ book-isbn-ids : ""
    book-ids ||--o{ book-authors : ""

Import Steps

The import is controlled by the following DVC steps:

scan-books: Scan the book MARC data from data/loc-books into Parquet files (described in book data).
book-isbn-ids: Resolve ISBNs from LOC books into ISBN IDs, producing loc-mds/book-isbn-ids.parquet.
book-authors: Extract (and clean up) author names for LOC books.

Raw MARC data

When importing MARC data, we create a “fields” file that contains the data exactly as recorded in MARC. We then process this data to produce additional files. Each of these MARC field files contains the following columns (defined by FieldRecord):

rec_id: The record identifier (generated at import)
fld_no: The field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a fld_no with their containing field.
tag: The MARC tag; either a three-digit number, or -1 for the MARC leader.
ind1, ind2: MARC indicators. Their meanings are defined in the MARC specification.
sf_code: MARC subfield code.
contents: The raw textual content of the MARC field or subfield.
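
To illustrate this layout, the following Python sketch flattens one parsed MARC field into rows with these columns. It is not the importer's code: the `FieldRow` class and the `flatten_field` and `leader_row` helpers are hypothetical, and storing indicators and subfield codes as ASCII byte values (with 0 where a column does not apply) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class FieldRow:
    """Hypothetical in-memory mirror of one row of a MARC fields file."""
    rec_id: int
    fld_no: int
    tag: int
    ind1: int
    ind2: int
    sf_code: int
    contents: str


def flatten_field(rec_id, fld_no, tag, ind1, ind2, subfields):
    """Emit one row per subfield; every row shares the field's fld_no."""
    return [
        FieldRow(rec_id, fld_no, tag, ord(ind1), ord(ind2), ord(code), text)
        for code, text in subfields
    ]


def leader_row(rec_id, fld_no, leader):
    """Store the leader with tag -1; it has no indicators or subfields."""
    # Using 0 for the unused columns is an assumption of this sketch.
    return FieldRow(rec_id, fld_no, -1, 0, 0, 0, leader)


# Numbering the leader as field 0 is also an assumption.
rows = [leader_row(1, 0, "00000nam a2200000 a 4500")]
rows += flatten_field(
    1, 1, 245, "1", "0",
    [("a", "Example title :"), ("b", "a subtitle.")],
)
for row in rows:
    print(row)
```

The two rows for tag 245 carry the same `fld_no`, which is what allows the subfields of one field to be regrouped later.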

Extracted Book Tables

We extract a number of tables from the LOC MDS book data. These tables only contain information about actual “books” in the collection, as opposed to other types of materials. We consider a record to be a book if it has MARC record type ‘a’ (language material) or ‘t’ (manuscript language material) and is not also classified as a government publication in MARC field 008.
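
The book filter can be sketched roughly as follows. This is an approximation for illustration, not the importer's logic. The record type is read from leader position 06, and the government publication code from position 28 of field 008, as the MARC 21 format defines them for book records. The function name `is_book` is hypothetical, and treating a blank or `|` code as "not a government record" is an assumption.

```python
def is_book(leader: str, field_008: str) -> bool:
    """Rough version of the book filter described above."""
    # Leader position 06 holds the MARC record type.
    rec_type = leader[6] if len(leader) > 6 else ""
    if rec_type not in ("a", "t"):
        return False
    # For book records, 008 position 28 is the government publication code.
    # Accepting only blank and '|' (no attempt to code) is an assumption.
    gov_code = field_008[28] if len(field_008) > 28 else " "
    return gov_code in (" ", "|")


leader = "00000nam a2200000 a 4500"
plain_008 = " " * 40
federal_008 = " " * 28 + "f" + " " * 11

print(is_book(leader, plain_008))    # True
print(is_book(leader, federal_008))  # False
```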

loc-mds/book-fields.parquet (struct FieldRecord)

The book-fields table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.

File details

Schema for `loc-mds/book-fields.parquet`.
Field	Type
rec_id	UInt32
fld_no	UInt32
tag	Int16
ind1	UInt8
ind2	UInt8
sf_code	UInt8
contents	LargeUtf8

loc-mds/book-ids.parquet (struct BookIds)

This table includes identifiers and codes for each book record:

Record ID
MARC Control Number
Library of Congress Control Number (LCCN)
Record status
Record type
Bibliographic level

More information about the last three is in the leader specification.
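
As a small illustration, the last three values sit at fixed positions in the 24-character MARC leader: position 05 (record status), 06 (type of record), and 07 (bibliographic level). The sketch below reads them out. The function name `leader_codes` is hypothetical, and returning each code as the byte value of its ASCII character is an assumption suggested by the UInt8 column types in the schema.

```python
def leader_codes(leader: str) -> dict:
    """Pull the three leader-derived codes out of a MARC leader string."""
    return {
        "status": ord(leader[5]),     # leader/05: record status
        "rec_type": ord(leader[6]),   # leader/06: type of record
        "bib_level": ord(leader[7]),  # leader/07: bibliographic level
    }


codes = leader_codes("00000cam a2200000 a 4500")
# Convert back to characters for display.
print({name: chr(value) for name, value in codes.items()})
```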

File details

Schema for `loc-mds/book-ids.parquet`.
Field	Type
rec_id	UInt32
marc_cn	LargeUtf8
lccn	LargeUtf8
status	UInt8
rec_type	UInt8
bib_level	UInt8

loc-mds/book-isbns.parquet (struct ISBNrec)

Textual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the parser in bookdata::cleaning::isbns parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. This table contains the results of that process.
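
To give a feel for this kind of best-effort parsing, here is a minimal Python sketch. It is not the parser in bookdata::cleaning::isbns and does not reproduce its heuristics. The regular expressions, the `parse_isbn_field` name, and the rule of taking the first parenthesised note as the tag are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Optional 978/979 prefix, nine digits, then a check digit (or X),
# allowing hyphens or spaces between characters.
ISBN_RE = re.compile(r"(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]")
TAG_RE = re.compile(r"\(([^)]*)\)")


def parse_isbn_field(raw: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (isbn, tag) for the first ISBN-like token, or None."""
    match = ISBN_RE.search(raw)
    if match is None:
        return None
    # Normalise by dropping separators and upper-casing a trailing 'x'.
    isbn = re.sub(r"[-\s]", "", match.group(0)).upper()
    # Treat the first parenthesised note after the ISBN as its descriptor.
    note = TAG_RE.search(raw, match.end())
    tag = note.group(1).strip() if note else None
    return isbn, tag or None


print(parse_isbn_field("0306406152 (pbk.)"))
print(parse_isbn_field("978-0-306-40615-7"))
print(parse_isbn_field("no isbn here"))
```

This sketch does not validate check digits and keeps only the first ISBN found in a string; real records can be messier than either simplification allows for.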

File details

Schema for `loc-mds/book-isbns.parquet`.
Field	Type
rec_id	UInt32
isbn	LargeUtf8
tag	LargeUtf8

loc-mds/book-isbn-ids.parquet

This table maps book records (LOC book rec_id values) to ISBN IDs. It is produced by converting the ISBNs in loc-mds/book-isbns.parquet into ISBN IDs.
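
Conceptually, this conversion is a join between the extracted ISBN strings and a lookup from ISBN to ISBN ID. The following sketch shows that idea with plain Python data structures. The `map_isbn_ids` function and the `isbn_ids` dictionary are hypothetical stand-ins, and skipping ISBNs that are missing from the lookup is an assumption, not documented behavior.

```python
def map_isbn_ids(book_isbns, isbn_ids):
    """Join (rec_id, isbn) pairs against an isbn -> isbn_id lookup."""
    pairs = set()
    for rec_id, isbn in book_isbns:
        isbn_id = isbn_ids.get(isbn)
        if isbn_id is not None:  # skipping unknown ISBNs is an assumption
            pairs.add((rec_id, isbn_id))
    # Sort so the output is deterministic.
    return sorted(pairs)


book_isbns = [(1, "0306406152"), (1, "9780306406157"), (2, "0306406152")]
isbn_ids = {"0306406152": 10, "9780306406157": 11}
print(map_isbn_ids(book_isbns, isbn_ids))
```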

File details

Schema for `loc-mds/book-isbn-ids.parquet`.
Field	Type
rec_id	UInt32
isbn_id	Int32

loc-mds/book-authors.parquet

Author names for book records. Only the primary author name (MARC field 100 subfield ‘a’) is extracted.
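
The extraction amounts to selecting the rows for tag 100, subfield ‘a’ from the fields data and tidying the text. A minimal sketch follows, using dictionaries shaped like rows of the fields table. The `primary_authors` and `clean_name` helpers are hypothetical, the cleanup rule (collapse whitespace, drop a trailing comma) is an assumption and not the project's actual cleaning logic, and comparing `sf_code` as an ASCII byte value is likewise assumed.

```python
def clean_name(raw: str) -> str:
    """Collapse runs of whitespace and drop a trailing comma."""
    name = " ".join(raw.split())
    return name.rstrip(",").strip()


def primary_authors(field_rows):
    """Yield (rec_id, author_name) for MARC field 100 subfield 'a'."""
    for row in field_rows:
        if row["tag"] == 100 and row["sf_code"] == ord("a"):
            name = clean_name(row["contents"])
            if name:  # skip rows that are empty after cleaning
                yield row["rec_id"], name


field_rows = [
    {"rec_id": 1, "tag": 100, "sf_code": ord("a"), "contents": "Austen,  Jane,"},
    {"rec_id": 1, "tag": 100, "sf_code": ord("d"), "contents": "1775-1817."},
    {"rec_id": 1, "tag": 245, "sf_code": ord("a"), "contents": "Emma /"},
]
print(list(primary_authors(field_rows)))
```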

File details

Schema for `loc-mds/book-authors.parquet`.
Field	Type
rec_id	UInt32
author_name	LargeUtf8