Library of Congress
One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.
We download and import the XML versions of these files.
Imported data lives under the
Data Model Diagram
The import is controlled by the following DVC steps:
loc-mds-schema.sql to set up the base schema.
Import raw MARC data from
Parse ISBNs from LOC ISBN records.
loc-mds-index-books.sql to index the book data and extract tables.
loc-mds-book-info.sql to extract additional book data into tables.
Raw Book Data
The locmds.book_marc_fields table contains the raw data imported from the MARC files, stored as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not very useful on its own, but it is the source from which the other tables are derived.
It has the following columns:
The record identifier (generated at import)
The field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a fld_no with their containing field.
The MARC tag; either a three-digit number, or LDR for the MARC leader.
MARC indicators. Their meanings are defined in the MARC specification.
MARC subfield code.
The raw textual content of the MARC field or subfield.
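The row layout described above can be illustrated with a minimal sketch that flattens one MARC record into rows. Only fld_no is a column name taken from this page; every other name (rec_id, tag, ind1, ind2, sf_code, contents), the flatten_record helper, and the choice to give the leader field number 0 are assumptions made for illustration, not the importer's actual code.

```python
from typing import List, NamedTuple, Optional, Tuple, Union

# Payload of one field: plain text for control fields, or a list of
# (subfield code, text) pairs for data fields.
Payload = Union[str, List[Tuple[str, str]]]


class MarcRow(NamedTuple):
    rec_id: int                # record identifier (hypothetical name)
    fld_no: int                # field number, shared by all subfields of a field
    tag: str                   # three-digit tag or "LDR" (hypothetical name)
    ind1: Optional[str]        # first indicator (hypothetical name)
    ind2: Optional[str]        # second indicator (hypothetical name)
    sf_code: Optional[str]     # subfield code (hypothetical name)
    contents: str              # raw text (hypothetical name)


def flatten_record(
    rec_id: int,
    leader: str,
    fields: List[Tuple[str, Optional[str], Optional[str], Payload]],
) -> List[MarcRow]:
    """Turn one record into one row per leader, control field, or subfield."""
    # Assumption: the leader is stored as field number 0 with tag "LDR".
    rows = [MarcRow(rec_id, 0, "LDR", None, None, None, leader)]
    for fld_no, (tag, ind1, ind2, payload) in enumerate(fields, start=1):
        if isinstance(payload, str):
            # Control field: a single row with no indicators or subfield code.
            rows.append(MarcRow(rec_id, fld_no, tag, None, None, None, payload))
        else:
            # Data field: one row per subfield, all sharing the same fld_no.
            for code, text in payload:
                rows.append(MarcRow(rec_id, fld_no, tag, ind1, ind2, code, text))
    return rows


rows = flatten_record(
    1,
    "00000nam a2200000 a 4500",
    [
        ("001", None, None, "   00000002 "),
        ("245", "1", "0", [("a", "Example title /"), ("c", "A. Author.")]),
    ],
)
```

In this example the two subfields of field 245 become two rows that share one fld_no, which is what allows a field to be reassembled from the table.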
Extracted Book Tables
We then extract a number of tables and views from this MARC data. These tables include:
Code information for each book record.
MARC Control Number
Library of Congress Control Number (LCCN)
More information about the last three is in the leader specification.
A subset of book_record_info intended to capture the actual books in the collection, as opposed to other types of materials. We consider a book to be anything that has MARC record type ‘a’ or ‘t’ (language material), and is not also classified as a government record in MARC field 008.
Textual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the Rust program parse-isbns parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. This table contains the results of that process.
Map book records to their ISBNs.
Author names for book records. This only extracts the primary author name (MARC field 100 subfield ‘a’).
Book publication year (MARC field 260 subfield ‘c’).
Book title (MARC field 245 subfield ‘a’).
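Two of the extraction rules above, the book filter and the ISBN cleanup, can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's SQL or the logic of parse-isbns: the function names are hypothetical, the record type is read from leader position 6, the government publication code is assumed to sit at position 28 of field 008 (as MARC 21 defines it for book records), a blank or ‘|’ there is assumed to mean "not a government publication", and the ISBN heuristic is a single simplified regular expression.

```python
import re
from typing import Optional, Tuple


def is_book(leader: str, field_008: str) -> bool:
    """Apply the book rule: record type 'a' or 't', and not a government record."""
    if len(leader) <= 6 or leader[6] not in ("a", "t"):
        return False
    # Assumption: 008 position 28 holds the government publication code,
    # and a blank or '|' (no attempt to code) means "not government".
    gov_code = field_008[28] if len(field_008) > 28 else " "
    return gov_code in (" ", "|")


# A digit followed by 8 to 16 more digits, hyphens, or X characters.
ISBN_TOKEN = re.compile(r"[0-9][0-9Xx-]{8,16}")


def parse_isbn(text: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (isbn, descriptor) from a raw 020$a string, or None if no ISBN is found."""
    match = ISBN_TOKEN.search(text)
    if match is None:
        return None
    isbn = match.group(0).replace("-", "").upper()
    if len(isbn) not in (10, 13):
        return None  # not ISBN-10 or ISBN-13 length; this sketch gives up here
    # Whatever follows the ISBN is kept as a free-text descriptor, e.g. "pbk."
    descriptor = text[match.end():].strip(" ()") or None
    return isbn, descriptor


blank_008 = " " * 40
federal_008 = blank_008[:28] + "f" + blank_008[29:]

book = is_book("00000nam a2200000 a 4500", blank_008)
gov = is_book("00000nam a2200000 a 4500", federal_008)
paperback = parse_isbn("0394800028 (pbk.)")
hyphenated = parse_isbn("978-0-306-40615-7 (hardcover)")
```

A real parser has to cope with far messier input (multiple ISBNs per string, stray punctuation, invalid check digits), which is why the pipeline relies on a set of best-effort heuristics rather than a single pattern like this one.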