erDiagram
    book-ids |o--|{ book-fields : contains
    book-ids ||--o{ book-isbns : ""
    book-ids ||--o{ book-isbn-ids : ""
    book-ids ||--o{ book-authors : ""
Library of Congress
One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.
We download and import the XML versions of these files.
Imported data lives under the `loc-mds` directory.
Import Steps
The import is controlled by the following DVC steps:
- `scan-books` - Scan the book MARC data from `data/loc-books` into Parquet files (described in book data).
- `book-isbn-ids` - Resolve ISBNs from LOC books into ISBN IDs, producing `loc-mds/book-isbn-ids.parquet`.
- `book-authors` - Extract (and clean up) author names for LOC books.
Raw MARC data
When importing MARC data, we create a “fields” file that contains the data exactly as recorded in MARC. We then process this data to produce additional files. Each such fields file contains the following columns (defined by `FieldRecord`):
- `rec_id` - The record identifier (generated at import).
- `fld_no` - The field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a `fld_no` with their containing field.
- `tag` - The MARC tag; either a three-digit number, or -1 for the MARC leader.
- `ind1`, `ind2` - MARC indicators. Their meanings are defined in the MARC specification.
- `sf_code` - MARC subfield code.
- `contents` - The raw textual content of the MARC field or subfield.
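To illustrate the row layout, here is a minimal Python sketch of how one MARC data field with two subfields might be flattened into rows that share a `fld_no`. The `FieldRow` class and `flatten_field` helper are hypothetical names used only for illustration, and the sample values are made up. Storing indicators and subfield codes as character codes is an assumption suggested by the `UInt8` column types, not confirmed behavior of the importer.

```python
from dataclasses import dataclass


@dataclass
class FieldRow:
    """One row of a fields file (hypothetical mirror of FieldRecord)."""
    rec_id: int
    fld_no: int
    tag: int
    ind1: int
    ind2: int
    sf_code: int
    contents: str


def flatten_field(rec_id, fld_no, tag, ind1, ind2, subfields):
    """Turn one MARC data field into one row per subfield.

    Assumption: indicators and subfield codes are stored as character codes.
    """
    return [
        FieldRow(rec_id, fld_no, tag, ord(ind1), ord(ind2), ord(code), text)
        for code, text in subfields
    ]


# Illustrative values only: tag 020 with subfields 'a' and 'c'.
rows = flatten_field(
    rec_id=1, fld_no=7, tag=20, ind1=" ", ind2=" ",
    subfields=[("a", "0387943937 (alk. paper)"), ("c", "$49.95")],
)
for row in rows:
    print(row)
```

Both rows carry `fld_no` 7, so the subfields can be regrouped into their containing field by grouping on `rec_id` and `fld_no`.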
Extracted Book Tables
We extract a number of tables from the LOC MDS book data. These tables only contain information about actual “books” in the collection, as opposed to other types of materials. We consider a book to be anything that has MARC record type ‘a’ or ‘t’ (language material), and is not also classified as a government record in MARC field 008.
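The book test described above can be sketched as a small predicate. This is an illustrative approximation, not the project's implementation: `is_book` is a hypothetical name, and the sketch assumes the government publication code sits at character position 28 of field 008 (as in the MARC 21 layout for books) and that a blank or `|` there means the record is not coded as a government publication.

```python
# Assumption: position 28 of field 008 holds the government publication code
# for books; blank or "|" is treated here as "not a government record".
GOV_PUB_POS = 28


def is_book(rec_type: str, field_008: str) -> bool:
    """Return True for language material that is not a government record."""
    if rec_type not in ("a", "t"):
        return False
    if len(field_008) > GOV_PUB_POS:
        gov_code = field_008[GOV_PUB_POS]
    else:
        gov_code = " "  # field too short to carry a code; treat as uncoded
    return gov_code in (" ", "|")


# Made-up 40-character 008 values for illustration.
plain_008 = " " * 40
gov_008 = " " * 28 + "f" + " " * 11

print(is_book("a", plain_008))
print(is_book("a", gov_008))
print(is_book("g", plain_008))
```

How codes such as `u` (unknown) should be treated is a policy choice; this sketch counts any non-blank, non-`|` code as a government record.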
`loc-mds/book-fields.parquet` (struct `FieldRecord`)
The `book-fields` table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.
File details
Field | Type |
---|---|
rec_id | UInt32 |
fld_no | UInt32 |
tag | Int16 |
ind1 | UInt8 |
ind2 | UInt8 |
sf_code | UInt8 |
contents | Utf8 |
`loc-mds/book-ids.parquet` (struct `BookIds`)
This table includes identifier and code information for each book record:
- Record ID
- MARC Control Number
- Library of Congress Control Number (LCCN)
- Record status
- Record type
- Bibliographic level
More information about the last three is in the leader specification.
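As a rough illustration of where the last three values come from, the sketch below reads them out of a MARC leader string. In MARC 21 the record status, record type, and bibliographic level are leader positions 5, 6, and 7. The `leader_codes` helper is a hypothetical name, the leader value is a made-up example, and returning character codes is an assumption based on the `UInt8` column types in this table.

```python
def leader_codes(leader: str) -> dict:
    """Extract status, record type, and bibliographic level from a leader.

    Assumption: the values are stored as the character codes of the
    single-character MARC codes.
    """
    return {
        "status": ord(leader[5]),
        "rec_type": ord(leader[6]),
        "bib_level": ord(leader[7]),
    }


# Illustrative 24-character leader.
codes = leader_codes("01142cam a2200301 a 4500")
print({name: chr(value) for name, value in codes.items()})
```

Converting the stored numbers back with `chr` recovers the familiar one-letter MARC codes when inspecting the table.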
File details
Field | Type |
---|---|
rec_id | UInt32 |
marc_cn | Utf8 |
lccn | Utf8 |
status | UInt8 |
rec_type | UInt8 |
bib_level | UInt8 |
`loc-mds/book-isbns.parquet` (struct `ISBNrec`)
Textual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the parser in `bookdata::cleaning::isbns` parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. This table contains the results of that process.
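To give a feel for this kind of best-effort parsing, here is a deliberately simple Python sketch. It is not the parser in `bookdata::cleaning::isbns` and does not reproduce its heuristics: `parse_isbn_string` and both regular expressions are hypothetical, it treats the first parenthesized phrase as the tag, and it does not validate check digits.

```python
import re

# ISBN-like token: digits with optional hyphens and an optional trailing X.
ISBN_RE = re.compile(r"\b(\d[\d-]{8,15}[\dXx])\b")
# Descriptor in parentheses, e.g. "(alk. paper)".
TAG_RE = re.compile(r"\(([^)]*)\)")


def parse_isbn_string(text):
    """Return (isbn, tag) pairs found in a raw 020$a string (rough heuristic)."""
    tag_match = TAG_RE.search(text)
    tag = tag_match.group(1).strip() if tag_match else None
    results = []
    for match in ISBN_RE.finditer(text):
        isbn = match.group(1).replace("-", "").upper()
        # Keep only candidates with a plausible ISBN-10 or ISBN-13 length.
        if len(isbn) in (10, 13):
            results.append((isbn, tag))
    return results


for raw in ["0387943937 (alk. paper)", "978-0-387-94393-7", "no isbn here"]:
    print(raw, "->", parse_isbn_string(raw))
```

Each returned pair corresponds to the `isbn` and `tag` columns of this table; a string with no recognizable ISBN yields no rows.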
File details
Field | Type |
---|---|
rec_id | UInt32 |
isbn | Utf8 |
tag | Utf8 |
loc-mds/book-isbn-ids.parquet
This table maps book records (LOC book `rec_id` values) to ISBN IDs. It is produced by converting the ISBNs in `loc-mds/book-isbns.parquet` into ISBN IDs.
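Conceptually, this conversion is a lookup join. The sketch below shows the idea on in-memory data. The `resolve_isbn_ids` helper, the ISBN-to-ID lookup dictionary, and the ID values are all hypothetical, and skipping ISBNs that are missing from the lookup and de-duplicating the output are assumptions of this sketch; the actual pipeline step may behave differently.

```python
def resolve_isbn_ids(book_isbns, isbn_ids):
    """Map (rec_id, isbn) rows to sorted, de-duplicated (rec_id, isbn_id) rows.

    Assumption: ISBNs absent from the lookup are skipped.
    """
    pairs = {
        (rec_id, isbn_ids[isbn])
        for rec_id, isbn in book_isbns
        if isbn in isbn_ids
    }
    return sorted(pairs)


# Illustrative rows shaped like book-isbns (rec_id, isbn) and a made-up lookup.
book_isbns = [
    (1, "0387943937"),
    (1, "9780387943937"),
    (2, "080442957X"),
    (3, "0000000000"),
]
isbn_ids = {"0387943937": 10, "9780387943937": 11, "080442957X": 12}

print(resolve_isbn_ids(book_isbns, isbn_ids))
```

A record with several ISBNs produces several rows, which is why `rec_id` is not unique in this table.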
File details
Field | Type |
---|---|
rec_id | UInt32 |
isbn_id | Int32 |