Virtual International Authority File

We source author data from the Virtual International Authority File (VIAF), obtained from the VIAF data dumps. This file is slow and error-prone to download, and is not auto-downloaded.

Imported data lives in the viaf schema.

Import Steps

The import is controlled by the following DVC steps:


1. Run viaf-schema.sql to set up the base schema.
2. Import raw VIAF MARC data from data/viaf-clusters-marc21.xml.gz.
3. Run viaf-index.sql to index the MARC data and extract tables.

Raw Data

VIAF data is in MARC 21 Authority Record format. The raw MARC data is imported into the marc_field table, which uses the same format as the corresponding Library of Congress (LOC) table.
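As a rough illustration of what flattening MARC data into field rows can look like, the sketch below parses one MARCXML record and emits one row per control field or subfield. The row layout (record id, field number, tag, indicators, subfield code, contents), the function name flatten_record, and the sample record are all assumptions made for illustration; the actual marc_field columns and the import code are not described here.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; that the VIAF dump uses it is an assumption.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def flatten_record(xml_text, rec_id):
    """Return one row per control field or subfield of a MARCXML record.

    Hypothetical row layout:
    (rec_id, fld_no, tag, ind1, ind2, sf_code, contents)
    """
    root = ET.fromstring(xml_text)
    rows = []
    fld_no = 0
    for field in root:
        tag = field.get("tag")
        if field.tag == MARC_NS + "controlfield":
            # Control fields have no indicators or subfields.
            rows.append((rec_id, fld_no, tag, None, None, None, field.text))
        elif field.tag == MARC_NS + "datafield":
            ind1, ind2 = field.get("ind1"), field.get("ind2")
            for sf in field.findall(MARC_NS + "subfield"):
                rows.append(
                    (rec_id, fld_no, tag, ind1, ind2, sf.get("code"), sf.text)
                )
        else:
            continue  # skip the leader and anything unrecognized
        fld_no += 1
    return rows


# Made-up record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">viaf0000001</controlfield>
  <datafield tag="700" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="375" ind1=" " ind2=" ">
    <subfield code="a">female</subfield>
  </datafield>
</record>"""

for row in flatten_record(SAMPLE, 1):
    print(row)
```

Storing one row per subfield keeps the raw data queryable by tag and subfield code, which is what the later extraction of names and gender assertions relies on.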

Extracted Author Tables

We extract the following tables for VIAF authors:


The author’s name(s). We insert an author name for each field with tag 700 and subfield code ‘a’. For all author names of the form ‘Family, Given’, we insert an additional record with the form ‘Given Family’ and indicator ‘S’. This helps maximize links.


The author’s gender, from field 375 subfield ‘a’. This is a raw extract of all gender identity assertions in the record; we resolve multiple assertions later in the data integration process.
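The name expansion described above can be sketched as follows. This is a minimal illustration, not the project's extraction code (which is SQL): the function name name_variants is hypothetical, the original form is given an indicator of None because the indicator for unmodified names is not stated here, and treating everything before the first comma as the family name is a simplifying assumption.

```python
def name_variants(name):
    """Return (name, indicator) pairs for one 700 'a' value.

    The original name is always kept. If it looks like 'Family, Given',
    a swapped 'Given Family' form is added with indicator 'S'.
    """
    name = name.strip()
    variants = [(name, None)]  # indicator for the original form is an assumption
    family, comma, given = name.partition(",")
    family, given = family.strip(), given.strip()
    if comma and family and given:
        variants.append((f"{given} {family}", "S"))
    return variants


print(name_variants("Doe, Jane"))
print(name_variants("Homer"))
```

Names without a comma, or with nothing on one side of it, produce only the original form.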

VIAF Gender Vocabulary

The MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.

The Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume they are universally followed by all VIAF contributors.

Further, as near as we can tell, VIAF either removes all non-binary gender identities or converts them to ‘unknown’.

This data should only be used with great care. We discuss these limitations in the extended preprint.