VIAF

We source author data from the Virtual International Authority File (VIAF), as downloaded from its data dumps. The dump file takes a long time to download because the VIAF server is rather slow.

Note

VIAF does not keep old copies of the dump file. To import this data, you may need to edit data/params.yaml so that the VIAF URL points to a dump that is currently available.

Imported data lives under the viaf directory.

Import Steps

The import is controlled by the following DVC steps:

scan-authors
Import the VIAF MARC data into viaf/viaf.parquet.
author-genders
Extract author genders from the VIAF MARC data, producing viaf/author-genders.parquet.
index-names
Normalize and expand author names and map to VIAF record IDs, producing viaf/author-name-index.parquet.

Raw Data

The VIAF data is in MARC 21 Authority Record format. The initial scan stage extracts this into a table using the MARC schema.

viaf/viaf.parquet

The table storing raw MARC fields from VIAF.

File details
Schema for viaf/viaf.parquet.

Field      Type
rec_id     UInt32
fld_no     UInt32
tag        Int16
ind1       UInt8
ind2       UInt8
sf_code    UInt8
contents   Utf8
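
To make the shape of this table concrete, the following is a minimal sketch of how one MARC record could flatten into rows with the columns listed above. It is not the actual scan code. The in-memory record format, the function name flatten_record, the numbering of fld_no from 1, and the storage of indicators and subfield codes as byte values are all assumptions made for illustration, and the example record is invented.

```python
from typing import Iterator, NamedTuple


class MarcRow(NamedTuple):
    rec_id: int
    fld_no: int
    tag: int
    ind1: int
    ind2: int
    sf_code: int
    contents: str


def flatten_record(rec_id, fields) -> Iterator[MarcRow]:
    """Yield one row per subfield of each data field.

    `fields` is a hypothetical in-memory form: a list of
    (tag, ind1, ind2, [(subfield_code, text), ...]) tuples.
    """
    # Assumption: fld_no is the position of the field within its record,
    # counted from 1.
    for fld_no, (tag, ind1, ind2, subfields) in enumerate(fields, start=1):
        for code, text in subfields:
            # Assumption: one-character indicators and subfield codes are
            # stored as their byte values, so they fit in UInt8 columns.
            yield MarcRow(rec_id, fld_no, int(tag),
                          ord(ind1), ord(ind2), ord(code), text)


# Invented example record, for illustration only.
example_fields = [
    ("700", "1", " ", [("a", "Doe, Jane"), ("d", "1970-")]),
    ("375", " ", " ", [("a", "female")]),
]
rows = list(flatten_record(42, example_fields))
for row in rows:
    print(row)
```

Under these assumptions, each subfield becomes its own row, and subfields of the same field share a fld_no.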

Extracted Author Tables

We process the MARC records to produce several derived tables.

viaf/author-name-index.parquet

The author-name index file maps record IDs to author names, as defined in field 700a. For each record, it stores each of the names extracted by bookdata::cleaning::names. This file is also available in csv.gz format.
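
As a rough illustration of what "normalize and expand" can mean, the sketch below builds (rec_id, name) pairs from raw name strings. It does not reproduce bookdata::cleaning::names. The functions clean_name, name_variants, and build_index, and the specific rules (Unicode NFKC normalization, whitespace collapsing, and adding a "First Last" form for "Last, First" names), are hypothetical stand-ins.

```python
import re
import unicodedata


def clean_name(name: str) -> str:
    """Illustrative normalization: NFKC, collapse whitespace, trim commas."""
    name = unicodedata.normalize("NFKC", name)
    name = re.sub(r"\s+", " ", name)
    return name.strip(" ,")


def name_variants(name: str) -> list[str]:
    """Return the cleaned name plus a 'First Last' form when it applies."""
    cleaned = clean_name(name)
    variants = [cleaned]
    if "," in cleaned:
        last, first = (part.strip() for part in cleaned.split(",", 1))
        if last and first:
            variants.append(f"{first} {last}")
    return variants


def build_index(names_by_record):
    """Map {rec_id: [raw names]} to a list of unique (rec_id, name) rows."""
    rows = []
    for rec_id, names in names_by_record.items():
        seen = set()
        for raw in names:
            for variant in name_variants(raw):
                if variant not in seen:
                    seen.add(variant)
                    rows.append((rec_id, variant))
    return rows


# Invented example input, for illustration only.
index = build_index({7: ["  Doe,   Jane ", "Jane Doe"]})
print(index)
```

The point of the expansion is that a record can then be found under any of the forms in which its name might appear in another data source.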

File details
Schema for viaf/author-name-index.parquet.

Field    Type
rec_id   UInt32
name     Utf8

viaf/author-genders.parquet

This file contains the extracted gender information for each author record (field 375a). If a record has multiple gender fields, they are all recorded. Merging gender records happens later in the integration.

File details
Schema for viaf/author-genders.parquet.

Field    Type
rec_id   UInt32
gender   Utf8
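
The extraction described above amounts to filtering the raw MARC rows for field 375, subfield a. The sketch below shows that filter on a few invented rows. The dictionary row format and the function name extract_genders are hypothetical, and we assume subfield codes are stored as byte values. It keeps every matching row, so a record with several gender fields yields several output rows.

```python
GENDER_TAG = 375
SUBFIELD_A = ord("a")  # assumption: sf_code holds the code's byte value


def extract_genders(rows):
    """Return (rec_id, gender) pairs for every 375a row, without merging."""
    return [
        (row["rec_id"], row["contents"])
        for row in rows
        if row["tag"] == GENDER_TAG and row["sf_code"] == SUBFIELD_A
    ]


# Invented example rows, for illustration only.
example_rows = [
    {"rec_id": 1, "tag": 700, "sf_code": ord("a"), "contents": "Doe, Jane"},
    {"rec_id": 1, "tag": 375, "sf_code": ord("a"), "contents": "female"},
    {"rec_id": 1, "tag": 375, "sf_code": ord("a"), "contents": "male"},
    {"rec_id": 2, "tag": 375, "sf_code": ord("s"), "contents": "1990"},
    {"rec_id": 3, "tag": 700, "sf_code": ord("a"), "contents": "Roe, Sam"},
]
genders = extract_genders(example_rows)
print(genders)
```

Record 1 contributes two rows here because it carries two gender fields, while records 2 and 3 contribute none.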

VIAF Gender Vocabulary

The MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.

The Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume they are universally followed by all VIAF contributors.

Further, as near as we can tell, VIAF removes all non-binary gender identities or converts them to ‘unknown’.

This data should only be used with great care. We discuss these limitations in the extended paper.