Library of Congress

One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.

We download and import the XML versions of these files.

Imported data lives under the loc-mds directory.

erDiagram
    book-ids |o--|{ book-fields : contains
    book-ids ||--o{ book-isbns : ""
    book-ids ||--o{ book-isbn-ids : ""
    book-ids ||--o{ book-authors : ""

Import Steps

The import is controlled by the following DVC steps:

scan-books
Scan the book MARC data from data/loc-books into Parquet files (described in book data).
book-isbn-ids
Resolve ISBNs from LOC books into ISBN IDs, producing loc-mds/book-isbn-ids.parquet.
book-authors
Extract (and clean up) author names for LOC books.

Raw MARC data

When importing MARC data, we create a “fields” file that contains the data exactly as recorded in MARC. We then process this data to produce additional files. Each MARC field file contains the following columns (defined by FieldRecord):

rec_id
The record identifier (generated at import)
fld_no
The field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a fld_no with their containing field.
tag
The MARC tag; either a three-digit number, or -1 for the MARC leader.
ind1, ind2
MARC indicators. Their meanings are defined in the MARC specification.
sf_code
MARC subfield code.
contents
The raw textual content of the MARC field or subfield.
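To illustrate the row layout described above, here is a minimal Python sketch of how one MARC field with two subfields could be flattened into rows that share a fld_no. The FieldRow class and flatten_field helper are hypothetical names for illustration, not part of the import code, and storing indicators and subfield codes as ASCII byte values is an assumption suggested by the UInt8 column types.

```python
from dataclasses import dataclass


@dataclass
class FieldRow:
    """Hypothetical Python mirror of the columns listed above."""
    rec_id: int
    fld_no: int
    tag: int
    ind1: int
    ind2: int
    sf_code: int
    contents: str


def flatten_field(rec_id, fld_no, tag, ind1, ind2, subfields):
    """Produce one row per subfield; all rows share the field's fld_no."""
    # Assumption: single-character codes are stored as their byte values.
    return [
        FieldRow(rec_id, fld_no, tag, ord(ind1), ord(ind2), ord(code), text)
        for code, text in subfields
    ]


# One 020 (ISBN) field with subfields 'a' and 'c'.
rows = flatten_field(1, 12, 20, " ", " ", [("a", "0306406152 (pbk.)"), ("c", "$9.95")])
for row in rows:
    print(row)
```

Both rows carry the same rec_id and fld_no, so grouping on those two columns reassembles the original field.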

Extracted Book Tables

We extract a number of tables from the LOC MDS book data. These tables contain information only about actual “books” in the collection, as opposed to other types of materials. We consider a book to be any record that has MARC record type ‘a’ or ‘t’ (language material) and is not also classified as a government record in MARC field 008.
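The book filter described above can be sketched roughly as follows. This is an illustration, not the project's implementation: the is_book function is a hypothetical name, the record type is read from leader position 06 as defined by MARC 21, and the government check assumes position 28 of field 008 (the government publication code for books), with a blank or fill character (‘|’) treated as "not government".

```python
def is_book(leader: str, field_008: str) -> bool:
    """Return True for records that look like non-government books."""
    # Leader position 06 holds the MARC record type.
    if leader[6] not in ("a", "t"):
        return False
    # Assumption: 008 position 28 is the government publication code;
    # blank or '|' means the record is not flagged as government material.
    gov_code = field_008[28] if len(field_008) > 28 else " "
    return gov_code in (" ", "|")


leader = "00000nam a2200000 a 4500"
plain_008 = " " * 40
gov_008 = plain_008[:28] + "f" + plain_008[29:]

print(is_book(leader, plain_008))
print(is_book(leader, gov_008))
```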

loc-mds/book-fields.parquet (struct FieldRecord)

The book-fields table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.

File details
Schema for loc-mds/book-fields.parquet.

Field       Type
rec_id      UInt32
fld_no      UInt32
tag         Int16
ind1        UInt8
ind2        UInt8
sf_code     UInt8
contents    LargeUtf8

loc-mds/book-ids.parquet (struct BookIds)

This table contains identifiers and status codes for each book record:

  • Record ID
  • MARC Control Number
  • Library of Congress Control Number (LCCN)
  • Record status
  • Record type
  • Bibliographic level

More information about the last three is in the leader specification.
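As a rough illustration of where the last three values come from, the sketch below reads them from a MARC leader string. The positions (05 for record status, 06 for record type, 07 for bibliographic level) follow the MARC 21 leader layout; the leader_codes helper is a hypothetical name, and storing each code as its byte value is an assumption based on the UInt8 column types.

```python
def leader_codes(leader: str) -> dict:
    """Extract status, record type, and bibliographic level from a leader."""
    # Assumption: each one-character code is stored as its byte value.
    return {
        "status": ord(leader[5]),
        "rec_type": ord(leader[6]),
        "bib_level": ord(leader[7]),
    }


codes = leader_codes("00000nam a2200000 a 4500")
print({name: chr(value) for name, value in codes.items()})
```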

File details
Schema for loc-mds/book-ids.parquet.

Field        Type
rec_id       UInt32
marc_cn      LargeUtf8
lccn         LargeUtf8
status       UInt8
rec_type     UInt8
bib_level    UInt8

loc-mds/book-isbns.parquet (struct ISBNrec)

Textual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the parser in bookdata::cleaning::isbns parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. This table contains the results of that process.
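To give a sense of what such a heuristic looks like, here is a deliberately simplified Python sketch. It is not the bookdata::cleaning::isbns parser and covers far fewer cases; the parse_isbn function and its regular expression are hypothetical. It pulls a leading ISBN-like token out of the string and treats whatever follows as a descriptor tag.

```python
import re

# A leading run of digits and hyphens ending in a digit or X, then the rest.
ISBN_RE = re.compile(r"^\s*([0-9][0-9-]{8,15}[0-9Xx])\b\s*(.*)$")


def parse_isbn(raw: str):
    """Return (isbn, tag) for an ISBN-like string, or None if none is found."""
    match = ISBN_RE.match(raw)
    if match is None:
        return None
    isbn = match.group(1).replace("-", "").upper()
    # Keep only 10- and 13-character candidates; no checksum validation here.
    if len(isbn) not in (10, 13):
        return None
    tag = match.group(2).strip(" ():;.,")
    return isbn, (tag or None)


for raw in ["0306406152 (pbk.)", "978-0-306-40615-7", "080442957X :", "not an isbn"]:
    print(repr(raw), "->", parse_isbn(raw))
```

A production parser would also need to handle multiple ISBNs in one string, stray prefixes, and invalid lengths, which this sketch ignores.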

File details
Schema for loc-mds/book-isbns.parquet.

Field     Type
rec_id    UInt32
isbn      LargeUtf8
tag       LargeUtf8

loc-mds/book-isbn-ids.parquet

This table maps book records (LOC book rec_id values) to ISBN IDs. It is produced by converting the ISBNs in loc-mds/book-isbns.parquet into ISBN IDs.
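Conceptually, this conversion is a join between the extracted ISBNs and a lookup from ISBN string to ISBN ID. The sketch below shows the idea on in-memory data; the link_isbn_ids helper, the sample IDs, and the choices to drop unknown ISBNs and deduplicate pairs are all assumptions for illustration rather than a description of the actual step.

```python
def link_isbn_ids(book_isbns, isbn_ids):
    """Map (rec_id, isbn) pairs to sorted, distinct (rec_id, isbn_id) pairs."""
    pairs = {
        (rec_id, isbn_ids[isbn])
        for rec_id, isbn in book_isbns
        if isbn in isbn_ids  # assumption: ISBNs without an ID are skipped
    }
    return sorted(pairs)


# Hypothetical sample data: rows from book-isbns and an ISBN -> ID lookup.
book_isbns = [(1, "0306406152"), (1, "9780306406157"), (2, "080442957X"), (3, "0000000000")]
isbn_ids = {"0306406152": 10, "9780306406157": 11, "080442957X": 12}

links = link_isbn_ids(book_isbns, isbn_ids)
print(links)
```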

File details
Schema for loc-mds/book-isbn-ids.parquet.

Field      Type
rec_id     UInt32
isbn_id    Int32

loc-mds/book-authors.parquet

Author names for book records. This table contains only the primary author name (MARC field 100 subfield ‘a’).
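The selection can be pictured as a filter over the field rows: keep rows with tag 100 and subfield code ‘a’. The sketch below works on simplified (rec_id, tag, sf_code, contents) tuples; the primary_authors helper is a hypothetical name, the subfield code is assumed to be stored as a byte value, and the cleanup shown (collapsing whitespace and trimming a trailing comma) is only a stand-in for whatever cleaning the real step performs.

```python
def primary_authors(rows):
    """Return (rec_id, author_name) for each field 100 subfield 'a' row."""
    authors = []
    for rec_id, tag, sf_code, contents in rows:
        if tag == 100 and sf_code == ord("a"):
            # Stand-in cleanup: normalize whitespace, drop a trailing comma.
            name = " ".join(contents.split()).rstrip(",").strip()
            if name:
                authors.append((rec_id, name))
    return authors


# Hypothetical sample rows: (rec_id, tag, sf_code, contents).
rows = [
    (1, 100, ord("a"), "Twain,  Mark,"),
    (1, 100, ord("d"), "1835-1910."),
    (1, 245, ord("a"), "Adventures of Huckleberry Finn"),
    (2, 100, ord("a"), "Austen, Jane,"),
]

names = primary_authors(rows)
print(names)
```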

File details
Schema for loc-mds/book-authors.parquet.

Field          Type
rec_id         UInt32
author_name    LargeUtf8