VIAF
We source author data from the Virtual International Authority File (VIAF), as downloaded from their data dumps. The dump is slow to download because the VIAF server is slow.
VIAF also does not keep old copies of the dump file, so you may need to edit data/params.yaml to point the VIAF URL at the current dump before you can import this data.
Imported data lives under the viaf directory.
Import Steps
The import is controlled by the following DVC steps:
scan-authors - Import the VIAF MARC data into viaf.parquet.
author-genders - Extract author genders from the VIAF MARC data, producing author-genders.parquet.
index-names - Normalize and expand author names and map to VIAF record IDs, producing author-name-index.parquet.
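To make the idea of normalizing and expanding names concrete, here is a minimal sketch. It is not the project's actual algorithm: the functions `normalize_name`, `expand_name`, and `build_index`, the normalization rules, and the sample records are all hypothetical, chosen only to show how several spellings of a name can map to the same set of record IDs.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, drop punctuation except commas, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s,]", " ", no_marks.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def expand_name(name):
    """Return the normalized name plus a 'first last' variant of 'Last, First'."""
    norm = normalize_name(name)
    variants = {norm}
    if "," in norm:
        last, _, first = norm.partition(",")
        last, first = last.strip(), first.strip()
        if last and first:
            variants.add(f"{first} {last}")
    return variants


def build_index(records):
    """Map each name variant to the set of record IDs that carry it."""
    index = {}
    for rec_id, name in records:
        for variant in expand_name(name):
            index.setdefault(variant, set()).add(rec_id)
    return index


# Hypothetical (rec_id, name) pairs, not real VIAF record IDs.
index = build_index([(1, "Brontë, Charlotte"), (2, "Bronte, Charlotte")])
```

In this sketch both records end up under the same keys, so a lookup by either "bronte, charlotte" or "charlotte bronte" returns both record IDs.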
Raw Data
The VIAF data is in MARC 21 Authority Record format. The initial scan stage extracts this into a table using the MARC schema.
viaf/viaf.parquet
The table storing raw MARC fields from VIAF.
File details
Field | Type |
---|---|
rec_id | UInt32 |
fld_no | UInt32 |
tag | Int16 |
ind1 | UInt8 |
ind2 | UInt8 |
sf_code | UInt8 |
contents | Utf8 |
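The schema stores MARC data in a long format, so rebuilding a field means grouping rows that share a record and field number. The following is a minimal sketch under two assumptions that the schema suggests but this page does not state: each row holds one subfield, and sf_code stores the subfield letter as a character code. The sample rows and the `group_fields` helper are hypothetical; in practice the rows would come from reading viaf/viaf.parquet with a Parquet library.

```python
from collections import defaultdict

# Hypothetical sample rows in the column order of the schema:
# (rec_id, fld_no, tag, ind1, ind2, sf_code, contents)
rows = [
    (1, 0, 100, ord("1"), ord(" "), ord("a"), "Doe, Jane"),
    (1, 0, 100, ord("1"), ord(" "), ord("d"), "1900-1980"),
    (1, 1, 375, ord(" "), ord(" "), ord("a"), "female"),
]


def group_fields(rows):
    """Collect subfield rows into one entry per (rec_id, fld_no) field."""
    fields = defaultdict(lambda: {"tag": None, "subfields": []})
    for rec_id, fld_no, tag, ind1, ind2, sf_code, contents in rows:
        field = fields[(rec_id, fld_no)]
        field["tag"] = tag
        # Assumption: sf_code is the character code of the subfield letter.
        field["subfields"].append((chr(sf_code), contents))
    return dict(fields)


fields = group_fields(rows)
```

Keying on the pair (rec_id, fld_no) rather than on the tag keeps repeated fields with the same tag separate within a record.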
VIAF Gender Vocabulary
The MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.
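In MARC 21 Authority records, gender is recorded in field 375, with subfield a for the gender term, s for the start period, and t for the end period. The sketch below shows how dated gender statements could be pulled out of rows shaped like the raw MARC table. It is an illustration only, not the project's author-genders step: the `gender_statements` helper, the reduced column set, and the sample rows are hypothetical, and it assumes sf_code stores the subfield letter as a character code.

```python
from collections import defaultdict

GENDER_TAG = 375  # MARC 21 Authority field for gender

# Hypothetical sample rows: (rec_id, fld_no, tag, sf_code, contents)
rows = [
    (7, 3, 375, ord("a"), "male"),
    (7, 3, 375, ord("s"), "1950"),
    (7, 3, 375, ord("t"), "1990"),
    (7, 4, 375, ord("a"), "female"),
    (7, 4, 375, ord("s"), "1990"),
    (7, 5, 100, ord("a"), "Example, Author"),
]


def gender_statements(rows):
    """Return one dict per gender field with its term and validity dates."""
    by_field = defaultdict(dict)
    for rec_id, fld_no, tag, sf_code, contents in rows:
        if tag != GENDER_TAG:
            continue
        by_field[(rec_id, fld_no)][chr(sf_code)] = contents
    return [
        {
            "rec_id": rec_id,
            "gender": sf.get("a"),
            "start": sf.get("s"),  # None when no start period is recorded
            "end": sf.get("t"),    # None when no end period is recorded
        }
        for (rec_id, _fld_no), sf in sorted(by_field.items())
    ]


statements = gender_statements(rows)
```

A single record can yield several statements, which is why any downstream use has to decide how to handle multiple or time-bounded identities rather than assume one value per author.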
The Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume they are universally followed by all VIAF contributors.
Further, as near as we can tell, VIAF removes all non-binary gender identities or converts them to ‘unknown’.
This data should only be used with great care. We discuss these limitations in the extended paper.