Cluster Gender Changes

This notebook audits for significant changes in cluster gender annotations, to allow us to detect the significance of shifts over time. It depends on the aligned cluster identities in isbn-version-clusters.parquet.

from pathlib import Path
from functools import reduce
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load Data

Define the versions we care about:

versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', '2023-07', 'current']

Load the aligned ISBNs:

isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43027360 entries, 0 to 43027359
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   isbn         object 
 1   isbn_id      int32  
 2   current      float64
 3   2023-07      float64
 4   2022-11-2.1  float64
 5   2022-10      float64
 6   2022-07      float64
 7   2022-03-2.0  float64
 8   pgsql        float64
dtypes: float64(7), int32(1), object(1)
memory usage: 2.7+ GB

Different Genders

How many clusters changed gender?

To get started, we need a list of genders in order.

genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

Let’s make a function to read gender info:

def read_gender(path, map_file=None):
    cg = pl.scan_parquet(path)
    cg = cg.select([
        pl.col('cluster').cast(pl.Int32),
        pl.when(pl.col('gender') == 'no-loc-author')
            .then('no-book-author')
            .when(pl.col('gender') == 'no-viaf-author')
            .then('no-author-rec')
            .otherwise(pl.col('gender'))
            .cast(pl.Categorical)
            .alias('gender')
    ])
    if map_file is not None:
        map = pl.scan_parquet(map_file)
        cg = cg.join(map, on='cluster', how='left')
        cg = cg.select([
            pl.col('common').alias('cluster'),
            pl.col('gender')
        ])
    return cg

Read each data source’s gender info and map to common cluster IDs:

gender_cc = {
    v: read_gender(f'{v}/cluster-genders.parquet', f'{v}/cluster-map.parquet')
    for v in versions if v != 'current'
}
gender_cc['current'] = read_gender('../book-links/cluster-genders.parquet')
/tmp/ipykernel_69125/183506089.py:6: DeprecationWarning: in a future version, string input will be parsed as a column name rather than a string literal. To silence this warning, pass the input as an expression instead: `pl.lit('no-book-author')`
  .then('no-book-author')
/tmp/ipykernel_69125/183506089.py:8: DeprecationWarning: in a future version, string input will be parsed as a column name rather than a string literal. To silence this warning, pass the input as an expression instead: `pl.lit('no-author-rec')`
  .then('no-author-rec')

Set up a sequence of frames for merging:

to_merge = [
    gender_cc[v].select([
        pl.col('cluster'),
        pl.col('gender').alias(v)
    ]).unique()
    for v in versions
]

Merge and collect results:

cluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer'), to_merge)
cluster_genders = cluster_genders.collect()

For unclear reasons, a few versions have a null cluster. Drop that.

cluster_genders = cluster_genders.filter(cluster_genders['cluster'].is_not_null())

Now we will convert to Pandas and fix missing values:

cluster_genders = cluster_genders.to_pandas().set_index('cluster')

Now we’ll unify the categories and their orders:

cluster_genders = cluster_genders.apply(lambda vdf: vdf.cat.set_categories(genders, ordered=True))
cluster_genders.fillna('absent', inplace=True)
cluster_genders.head()
pgsql 2022-03-2.0 2022-07 2022-10 2022-11-2.1 2023-07 current
cluster
416243397 absent absent absent absent absent absent no-book-author
410767599 absent absent absent absent absent absent no-book-author
421374693 absent absent absent absent absent absent no-book-author
449455849 absent absent absent absent absent absent no-book-author
415350734 absent absent absent absent absent absent no-book-author

Let’s save this file for further analysis:

cluster_genders.to_parquet('cluster-version-genders.parquet', compression='zstd')

PostgreSQL to Current

Now we are ready to actually compare cluster genders across categories. Let’s start by comparing original data (PostgreSQL) to current:

ct = cluster_genders[['pgsql', 'current']].value_counts().unstack()
ct = ct.reindex(labels=genders, columns=genders)
ct
current ambiguous female male unknown no-author-rec no-book-author no-book absent
pgsql
ambiguous 98387.0 4331.0 11287.0 1856.0 996.0 2773.0 4.0 4574.0
female 16360.0 1120232.0 978.0 12941.0 9439.0 19.0 29.0 34113.0
male 28527.0 2690.0 3493131.0 18109.0 31095.0 688.0 152.0 70824.0
unknown 3004.0 102929.0 215359.0 1545706.0 19026.0 15.0 12.0 14324.0
no-author-rec 10533.0 58486.0 330352.0 226923.0 1395181.0 436.0 125.0 13658.0
no-book-author 8356.0 114884.0 219279.0 125984.0 211482.0 2457210.0 903525.0 273353.0
no-book NaN NaN NaN NaN NaN NaN NaN NaN
absent 121068.0 1022710.0 2439415.0 1026095.0 4539059.0 17046837.0 734647.0 65883.0
ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
def style_row(row):
    styles = []
    for col, val in zip(row.index, row.values):
        if col == row.name:
            styles.append('font-weight: bold')
        elif val > 0.1:
            styles.append('color: red')
        else:
            styles.append(None)
    return styles
ctf.style.apply(style_row, 'columns')
current ambiguous female male unknown no-author-rec no-book-author no-book absent
pgsql                
ambiguous 0.792115 0.034869 0.090872 0.014943 0.008019 0.022325 0.000032 0.036825
female 0.013701 0.938131 0.000819 0.010837 0.007905 0.000016 0.000024 0.028568
male 0.007826 0.000738 0.958278 0.004968 0.008530 0.000189 0.000042 0.019429
unknown 0.001581 0.054162 0.113324 0.813369 0.010012 0.000008 0.000006 0.007537
no-author-rec 0.005174 0.028730 0.162280 0.111472 0.685359 0.000214 0.000061 0.006709
no-book-author 0.001937 0.026630 0.050829 0.029203 0.049021 0.569580 0.209437 0.063363
no-book nan nan nan nan nan nan nan nan
absent 0.004485 0.037884 0.090363 0.038010 0.168140 0.631465 0.027213 0.002440

Most of the change is coming from clusters absent in the original but present in the new.

There are also quite a few that had no book author in PGSQL, but no book in the current data - not sure what’s up with that. Let’s look at more crosstabs.

def gender_crosstab(old, new, fractional=True):
    ct = cluster_genders[[old, new]].value_counts().unstack()
    ct = ct.reindex(labels=genders, columns=genders)

    if fractional:
        ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
        return ctf
    else:
        return ct
def plot_gender(set):
    cluster_genders[set].value_counts().sort_index().plot.barh()
    plt.title(f'Gender Distribution in {set}')

PostgreSQL to March 2022 (2.0 release)

This marks the change from PostgreSQL to pure-Rust.

ct = gender_crosstab('pgsql', '2022-03-2.0')
ct.style.apply(style_row, 'columns')
2022-03-2.0 ambiguous female male unknown no-author-rec no-book-author no-book absent
pgsql                
ambiguous 0.977924 0.002963 0.013928 0.000636 0.000878 nan nan 0.003671
female 0.002192 0.993942 0.000001 0.000301 0.000430 0.000003 0.000005 0.003126
male 0.000591 0.000000 0.995942 0.000528 0.000796 0.000002 0.000014 0.002127
unknown 0.000043 0.002760 0.005303 0.988899 0.001953 0.000001 0.000003 0.001038
no-author-rec 0.000104 0.007922 0.049966 0.031481 0.908594 nan 0.000008 0.001925
no-book-author 0.000002 0.000051 0.000198 0.000107 0.000053 0.649173 0.335017 0.015398
no-book nan nan nan nan nan nan nan nan
absent 0.000006 0.000049 0.000254 0.000120 0.000178 0.002007 0.000070 0.997316

This is where we change from no-book-author to no-book for a bunch of books; otherwise things are pretty consistent. This major change is likely a result of changes that count more books and book clusters - we had some inner joins in the PostgreSQL version that were questionable, and in particular we didn’t really cluster solo ISBNs but now we do. But now, if we have a solo ISBN from rating data, it gets a cluster with no book record instead of being excluded from the clustering.

Let’s look at the distribution of statuses for each, starting with PostgreSQL:

plot_gender('pgsql')

And the Rust version:

plot_gender('2022-03-2.0')

March to July 2022

We updated a lot of data files and changed the name and ISBN parsing logic.

ct = gender_crosstab('2022-03-2.0', '2022-07')
ct.style.apply(style_row, 'columns')
2022-07 ambiguous female male unknown no-author-rec no-book-author no-book absent
2022-03-2.0                
ambiguous 0.836189 0.035578 0.083571 0.014798 0.004862 0.000087 0.000016 0.024900
female 0.010076 0.963180 0.000488 0.007911 0.001253 0.000002 0.000012 0.017079
male 0.006706 0.000646 0.974629 0.003702 0.001364 0.000079 0.000014 0.012859
unknown 0.001899 0.040307 0.092948 0.856037 0.003412 0.000200 nan 0.005197
no-author-rec 0.003532 0.020634 0.108700 0.101631 0.762109 0.000009 0.000037 0.003349
no-book-author 0.002058 0.030443 0.057348 0.035320 0.056237 0.809983 0.000008 0.008604
no-book 0.000156 0.002342 0.005269 0.002737 0.004768 0.000634 0.980664 0.003431
absent 0.002246 0.020150 0.043122 0.015901 0.062986 0.003484 0.000001 0.852108

Mostly fine; some more are resolved, existing resolutions are pretty consistent.

plot_gender('2022-07')

July 2022 to Oct. 2022

We changed from DataFusion to Polars and made further ISBN and name parsing changes.

ct = gender_crosstab('2022-07', '2022-10')
ct.style.apply(style_row, 'columns')
2022-10 ambiguous female male unknown no-author-rec no-book-author no-book absent
2022-07                
ambiguous 0.989408 0.004969 0.003626 0.000336 0.001647 nan nan 0.000014
female nan 0.995091 nan 0.000361 0.004471 nan nan 0.000078
male 0.000001 nan 0.994582 0.000431 0.004975 0.000000 nan 0.000011
unknown nan nan nan 0.995469 0.004492 nan nan 0.000040
no-author-rec nan 0.000001 0.000003 0.000005 0.999824 0.000131 nan 0.000037
no-book-author nan 0.000000 0.000001 0.000000 0.000000 0.996620 nan 0.003378
no-book 0.000001 0.000100 0.000029 0.000066 0.000091 0.198053 0.670216 0.131445
absent nan nan nan nan nan nan nan 1.000000
plot_gender('2022-10')

Oct. 2022 to release 2.1 (Nov. 2022)

We added support for GoodReads CSV data and the Amazon 2018 rating CSV files.

ct = gender_crosstab('2022-10', '2022-11-2.1')
ct.style.apply(style_row, 'columns')
2022-11-2.1 ambiguous female male unknown no-author-rec no-book-author no-book absent
2022-10                
ambiguous 0.999995 nan nan nan nan nan nan 0.000005
female nan 0.999999 nan nan nan nan nan 0.000001
male nan nan 0.999999 nan nan nan nan 0.000001
unknown nan nan nan 0.999998 nan nan nan 0.000002
no-author-rec nan nan nan nan 0.999999 nan nan 0.000001
no-book-author nan nan nan 0.000000 nan 0.999982 nan 0.000017
no-book nan nan nan nan nan nan 1.000000 nan
absent nan 0.000000 0.000000 0.000000 0.000000 0.000003 0.033872 0.966125
plot_gender('2022-11-2.1')

Release 2.1 to Jul. 2023

We updated OpenLibrary and VIAF, and made some technical changes.

ct = gender_crosstab('2022-11-2.1', '2023-07')
ct.style.apply(style_row, 'columns')
2023-07 ambiguous female male unknown no-author-rec no-book-author no-book absent
2022-11-2.1                
ambiguous 0.879463 0.006926 0.018911 0.001573 0.000239 0.074744 0.000009 0.018136
female 0.004890 0.975552 0.000307 0.004275 0.000396 0.000482 0.000002 0.014097
male 0.002247 0.000044 0.986206 0.001428 0.000497 0.001463 0.000005 0.008110
unknown 0.000267 0.019108 0.032233 0.945152 0.001086 0.000425 0.000000 0.001728
no-author-rec 0.000278 0.002898 0.007174 0.009780 0.975848 0.000360 0.000004 0.003657
no-book-author 0.000411 0.006228 0.011051 0.004080 0.006719 0.967640 0.000001 0.003869
no-book 0.000360 0.005640 0.011799 0.007429 0.024741 0.003807 0.940827 0.005397
absent 0.003096 0.021000 0.056276 0.026941 0.127455 0.014997 0.000001 0.750233
plot_gender('2023-07')

Jul. 2023 to Current

Mostly technical code updates.

ct = gender_crosstab('2023-07', 'current')
ct.style.apply(style_row, 'columns')
current ambiguous female male unknown no-author-rec no-book-author no-book absent
2023-07                
ambiguous 1.000000 nan nan nan nan nan nan nan
female nan 1.000000 nan nan nan nan nan nan
male nan nan 1.000000 nan nan nan nan nan
unknown nan nan nan 1.000000 nan nan nan nan
no-author-rec nan nan nan nan 1.000000 nan nan nan
no-book-author nan nan 0.000000 nan 0.000000 0.999999 nan nan
no-book nan nan nan nan nan nan 1.000000 nan
absent nan nan nan nan nan 0.971987 nan 0.028013
plot_gender('current')