VIAF

We source author data from the Virtual International Authority File (VIAF), as downloaded from their data dumps. The download is slow because the VIAF server is slow.

Note

VIAF does not keep old copies of the dump file. You may need to edit data/params.yaml so that the VIAF URL points to the currently available dump before you can import this data.

Imported data lives under the viaf directory.

Import Steps

The import is controlled by the following DVC steps:

scan-authors: Import the VIAF MARC data into viaf/viaf.parquet.
author-genders: Extract author genders from the VIAF MARC data, producing viaf/author-genders.parquet.
index-names: Normalize and expand author names and map them to VIAF record IDs, producing viaf/author-name-index.parquet.

Raw Data

The VIAF data is in MARC 21 Authority Record format. The initial scan stage extracts this into a table using the MARC schema.

viaf/viaf.parquet

The table storing raw MARC fields from VIAF.

File details

Schema for `viaf/viaf.parquet`.
Field	Type
rec_id	UInt32
fld_no	UInt32
tag	Int16
ind1	UInt8
ind2	UInt8
sf_code	UInt8
contents	LargeUtf8
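
As a rough illustration of how this long-format table can be read, the sketch below pulls one MARC tag and subfield out of rows shaped like the schema above. The sample rows are invented, and it assumes that sf_code, ind1, and ind2 hold the byte value of the corresponding MARC character (for example 97 for subfield a); check the actual data before relying on that. The helper name subfield_values is hypothetical and not part of the import code.

```python
from collections import defaultdict

# Invented sample rows in the column order of viaf/viaf.parquet:
# (rec_id, fld_no, tag, ind1, ind2, sf_code, contents)
ROWS = [
    (1, 0, 100, ord("1"), ord(" "), ord("a"), "Austen, Jane"),
    (1, 0, 100, ord("1"), ord(" "), ord("d"), "1775-1817"),
    (1, 1, 375, ord(" "), ord(" "), ord("a"), "female"),
    (2, 0, 100, ord("1"), ord(" "), ord("a"), "Dickens, Charles"),
]


def subfield_values(rows, tag, code):
    """Collect the contents of one MARC tag/subfield, keyed by record ID."""
    # Assumption: sf_code stores the subfield letter as its byte value.
    wanted = ord(code)
    out = defaultdict(list)
    for rec_id, _fld_no, row_tag, _ind1, _ind2, sf_code, contents in rows:
        if row_tag == tag and sf_code == wanted:
            out[rec_id].append(contents)
    return dict(out)


genders = subfield_values(ROWS, 375, "a")
names = subfield_values(ROWS, 100, "a")
```

With real data, the rows would come from a Parquet reader rather than an in-memory list; the filtering logic stays the same.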

Extracted Author Tables

We process the MARC records to produce several derived tables.

viaf/author-name-index.parquet

The author-name index file maps record IDs to author names, as defined in field 700a. For each record, it stores each of the names extracted by bookdata::cleaning::names. This file is also available in csv.gz format.

File details

Schema for `viaf/author-name-index.parquet`.
Field	Type
rec_id	UInt32
name	LargeUtf8
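
To make "normalize and expand" more concrete, here is a minimal sketch of the general idea: each raw name yields one or more normalized variants, and each variant becomes a (rec_id, name) row. This is not the logic of bookdata::cleaning::names, whose exact rules are not described here; the functions normalize, name_variants, and index_rows, and the specific rules they apply, are assumptions for illustration only.

```python
import re

PUNCT = " ,.;"


def normalize(name):
    """Collapse whitespace, trim surrounding punctuation, and lowercase."""
    name = re.sub(r"\s+", " ", name).strip(PUNCT)
    return name.lower()


def name_variants(name):
    """Return normalized forms, adding 'first last' for 'last, first' names."""
    variants = [normalize(name)]
    if "," in name:
        # Hypothetical expansion rule: flip "Last, First" into "First Last".
        last, first = (part.strip(PUNCT) for part in name.split(",", 1))
        flipped = normalize(f"{first} {last}")
        if flipped and flipped not in variants:
            variants.append(flipped)
    return variants


def index_rows(names_by_record):
    """Build (rec_id, name) rows in the shape of the name index table."""
    return [
        (rec_id, variant)
        for rec_id, names in names_by_record.items()
        for name in names
        for variant in name_variants(name)
    ]


rows = index_rows({7: ["Austen,  Jane"], 8: ["Homer"]})
```

Storing every variant as its own row keeps lookups simple: matching an external author string only requires normalizing it the same way and joining on name.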

viaf/author-genders.parquet

This file contains the extracted gender information for each author record (field 375a). If a record has multiple gender fields, they are all recorded. Merging gender records happens later in the integration.

File details

Schema for `viaf/author-genders.parquet`.
Field	Type
rec_id	UInt32
gender	LargeUtf8
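
Because a record can carry several gender values, any consumer of this table has to decide how to combine them. The sketch below shows one possible rule, purely as an assumption: it is not the merge performed later in the integration. The labels "unknown" and "ambiguous" and the function names are hypothetical.

```python
from collections import defaultdict


def collect_genders(rows):
    """Group (rec_id, gender) rows into a set of distinct values per record."""
    by_record = defaultdict(set)
    for rec_id, gender in rows:
        by_record[rec_id].add(gender.strip().lower())
    return dict(by_record)


def resolve(genders):
    """Reduce a set of gender values to one label (hypothetical rule)."""
    known = genders - {"unknown", ""}
    if not known:
        return "unknown"
    if len(known) == 1:
        return next(iter(known))
    # Conflicting values are flagged rather than silently picking one.
    return "ambiguous"


# Invented sample rows in the shape of viaf/author-genders.parquet.
SAMPLE = [(1, "female"), (1, "Female"), (2, "male"), (2, "female"), (3, "unknown")]
resolved = {rec: resolve(vals) for rec, vals in collect_genders(SAMPLE).items()}
```

Flagging conflicts explicitly, instead of taking the first or most frequent value, keeps the uncertainty in the data visible to downstream analysis.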

VIAF Gender Vocabulary

The MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.

The Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume that all VIAF contributors follow them.

Further, as near as we can tell, VIAF removes all non-binary gender identities or converts them to ‘unknown’.

This data should only be used with great care. We discuss these limitations in the extended paper.