ISBN Cluster Changes

This notebook audits the clustering results in the book data for significant changes, so that we can detect shifts from version to version. It depends on the aligned cluster identities in isbn-version-clusters.parquet.

Data versions are indexed by month; versions corresponding to tagged releases also include the version number in their name.
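
For example, a name like '2022-03-2.0' combines the month (2022-03) with the release tag (2.0), while 'pgsql' and 'current' are not month-indexed. A minimal sketch of splitting these apart (the parse_version helper is hypothetical, not part of this notebook):

import re

def parse_version(name):
    # split e.g. '2022-03-2.0' into month '2022-03' and tag '2.0'
    m = re.match(r'(\d{4}-\d{2})(?:-(.+))?$', name)
    if m is None:
        return None, name  # non-monthly names like 'pgsql' or 'current'
    return m.group(1), m.group(2)

parse_version('2022-03-2.0')  # ('2022-03', '2.0')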

We are particularly interested in shifts in the number of clusters, and in which cluster each ISBN is associated with (cluster IDs are not stable across versions, so this notebook works on an aligned version of the cluster-ISBN associations).

import pandas as pd
import matplotlib.pyplot as plt

Load Data

Define the versions we care about:

versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', 'current']

Load the aligned ISBNs:

isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()
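
As a light sanity check (an added assertion, not in the original notebook), we can confirm the frame has the columns the rest of the analysis relies on: one per version, plus the isbn_id identifier:

missing = [c for c in ['isbn_id'] + versions if c not in isbn_clusters.columns]
assert not missing, f'missing expected columns: {missing}'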

Cluster Counts

Let’s look at the number of ISBNs and clusters in each version:

metrics = isbn_clusters[versions].agg(['count', 'nunique']).T.rename(columns={
    'count': 'n_isbns',
    'nunique': 'n_clusters',
})
metrics
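
Since we have both counts, the mean cluster size per version is just their ratio (a derived figure added here for context):

metrics['n_isbns'] / metrics['n_clusters']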

Cluster Size Distributions

Now we’re going to look at the sizes of the clusters, and how the distribution of cluster sizes changes across versions.

sizes = dict((v, isbn_clusters[v].value_counts()) for v in versions)
sizes = pd.concat(sizes, names=['version', 'cluster'])
sizes.name = 'size'
sizes
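
Since sizes is indexed by (version, cluster), one version’s largest clusters can be pulled out directly, e.g. (an illustrative query):

sizes.loc['current'].sort_values(ascending=False).head()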

Compute the histogram:

size_hist = sizes.groupby('version').value_counts()
size_hist.name = 'count'
size_hist
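
For example, the number of singleton clusters (size 1) in each version can be read off with a cross-section (an illustrative query):

# level 1 of the index is the cluster size
size_hist.xs(1, level=1)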

And plot the cumulative distributions:

for v in versions:
    vss = size_hist.loc[v].sort_index()
    vsc = vss.cumsum() / vss.sum()
    plt.plot(vsc.index, vsc.values, label=v)

plt.title('Distribution of Cluster Sizes')
plt.ylabel('Cum. Frac. of Clusters')
plt.xlabel('Cluster Size')
plt.xscale('symlog')
plt.legend()
plt.show()

Record the maximum cluster size in the metrics table:

metrics['max_size'] = pd.Series({
    v: sizes[v].max()
    for v in versions
})
metrics

Different Clusters

ISBN Changes

How many ISBNs changed cluster between consecutive versions?

statuses = ['same', 'added', 'changed', 'dropped']
changed = isbn_clusters[['isbn_id']].copy(deep=False)
for (v1, v2) in zip(versions, versions[1:]):
    v1c = isbn_clusters[v1]
    v2c = isbn_clusters[v2]
    # start by assuming every ISBN kept its cluster
    cc = pd.Series('same', index=changed.index)
    cc = cc.astype('category').cat.set_categories(statuses, ordered=True)
    # classify each ISBN's transition from v1 to v2
    cc[v1c.isnull() & v2c.notnull()] = 'added'
    cc[v1c.notnull() & v2c.isnull()] = 'dropped'
    cc[v1c.notnull() & v2c.notnull() & (v1c != v2c)] = 'changed'
    changed[v2] = cc
    del cc
changed.set_index('isbn_id', inplace=True)
changed.head()
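
As a quick consistency check (an added assertion, not part of the original notebook), every ISBN receives exactly one status per transition, so the per-version status counts should sum to the number of rows:

status_counts = changed.apply(lambda c: c.value_counts())
assert (status_counts.sum() == len(changed)).all()
status_counts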

Count the number of ISBNs in each trajectory:

trajectories = changed.value_counts()
trajectories = trajectories.to_frame('count')
trajectories['fraction'] = trajectories['count'] / len(changed)
trajectories['cum_frac'] = trajectories['fraction'].cumsum()
trajectories
metrics['new_isbns'] = (changed[versions[1:]] == 'added').sum().reindex(metrics.index)
metrics['dropped_isbns'] = (changed[versions[1:]] == 'dropped').sum().reindex(metrics.index)
metrics['changed_isbns'] = (changed[versions[1:]] == 'changed').sum().reindex(metrics.index)
metrics
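
These counts should reconcile with the ISBN counts computed earlier: between consecutive versions, the change in n_isbns equals the ISBNs added minus those dropped. A consistency check (added here, not in the original notebook):

net = (metrics['new_isbns'] - metrics['dropped_isbns']).dropna()
assert (metrics['n_isbns'].diff().dropna() == net).all()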

The biggest change is that the July 2022 update introduced a large number (8.2M) of new ISBNs. That update incorporated more current book data and changed the ISBN parsing logic, so a large influx of new ISBNs is not surprising.

Let’s save these book changes to a file for future re-analysis:

changed.to_parquet('isbn-cluster-changes.parquet', compression='zstd')

Final Saved Metrics

Now we’ll save the metrics table to a CSV file.

metrics.index.name = 'version'
metrics
metrics.to_csv('audit-metrics.csv')