import pandas as pd
import matplotlib.pyplot as plt
ISBN Cluster Changes
This notebook audits the clustering results in the book data for significant changes, so that we can assess the impact of shifts from version to version. It depends on the aligned cluster identities in isbn-version-clusters.parquet.
Data versions are indexed by month; versions corresponding to tagged versions also have the version in their name.
We are particularly interested in shifts in the number of clusters, and in which cluster each ISBN is associated with (while cluster IDs are not stable across versions, this notebook works on an aligned version of the cluster-ISBN associations).
Load Data
Define the versions we care about:
versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', 'current']
Load the aligned ISBNs:
isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()
Cluster Counts
Let’s look at the # of ISBNs and clusters in each dataset:
metrics = isbn_clusters[versions].agg(['count', 'nunique']).T.rename(columns={
    'count': 'n_isbns',
    'nunique': 'n_clusters',
})
metrics
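To see why this aggregation yields one row per version, here is a minimal sketch on a hypothetical two-version frame (the column names and values are made up for illustration): `count` counts non-null ISBN assignments and `nunique` counts distinct cluster IDs, and transposing puts versions on the rows.

```python
import pandas as pd

# Toy stand-in for isbn_clusters: NaN marks an ISBN absent from a version.
df = pd.DataFrame({
    'v1': [1.0, 1.0, 2.0, None],
    'v2': [1.0, 2.0, 2.0, 3.0],
})

# count = ISBNs present per version; nunique = distinct cluster IDs.
m = df.agg(['count', 'nunique']).T.rename(columns={
    'count': 'n_isbns',
    'nunique': 'n_clusters',
})
print(m)
```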
Cluster Size Distributions
Now we’re going to look at the sizes of clusters, and at how the distribution of cluster sizes changes across versions.
sizes = dict((v, isbn_clusters[v].value_counts()) for v in versions)
sizes = pd.concat(sizes, names=['version', 'cluster'])
sizes.name = 'size'
sizes
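The `value_counts` per version followed by `pd.concat` on a dict produces one series indexed by (version, cluster), where each entry is the number of ISBNs in that cluster. A small sketch with invented data:

```python
import pandas as pd

# Hypothetical cluster assignments for three ISBNs in two versions.
df = pd.DataFrame({
    'v1': [1, 1, 2],
    'v2': [1, 2, 2],
})

# value_counts per version gives cluster sizes; concat on the dict
# builds a MultiIndex with the dict keys as the outer 'version' level.
sizes = dict((v, df[v].value_counts()) for v in ['v1', 'v2'])
sizes = pd.concat(sizes, names=['version', 'cluster'])
sizes.name = 'size'
print(sizes)
```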
Compute the histogram:
size_hist = sizes.groupby('version').value_counts()
size_hist.name = 'count'
size_hist
And plot the cumulative distributions:
for v in versions:
    vss = size_hist.loc[v].sort_index()
    vsc = vss.cumsum() / vss.sum()
    plt.plot(vsc.index, vsc.values, label=v)
plt.title('Distribution of Cluster Sizes')
plt.ylabel('Cum. Frac. of Clusters')
plt.xlabel('Cluster Size')
plt.xscale('symlog')
plt.legend()
plt.show()
Save more metrics:
metrics['max_size'] = pd.Series({
    v: sizes[v].max()
    for v in versions
})
metrics
Different Clusters
ISBN Changes
How many ISBNs changed cluster across each version?
statuses = ['same', 'added', 'changed', 'dropped']
changed = isbn_clusters[['isbn_id']].copy(deep=False)
for (v1, v2) in zip(versions, versions[1:]):
    v1c = isbn_clusters[v1]
    v2c = isbn_clusters[v2]
    cc = pd.Series('same', index=changed.index)
    cc = cc.astype('category').cat.set_categories(statuses, ordered=True)
    cc[v1c.isnull() & v2c.notnull()] = 'added'
    cc[v1c.notnull() & v2c.isnull()] = 'dropped'
    cc[v1c.notnull() & v2c.notnull() & (v1c != v2c)] = 'changed'
    changed[v2] = cc
    del cc
changed.set_index('isbn_id', inplace=True)
changed.head()
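The status logic above can be checked in isolation on a pair of hypothetical assignment series (the values are made up): an ISBN is 'added' if it is null in the old version and present in the new, 'dropped' in the reverse case, and 'changed' only when it is present in both with different cluster IDs.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster IDs for four ISBNs; NaN = ISBN absent in that version.
v1c = pd.Series([1.0, 2.0, np.nan, 3.0])
v2c = pd.Series([1.0, 5.0, 4.0, np.nan])

cc = pd.Series('same', index=v1c.index)
cc[v1c.isnull() & v2c.notnull()] = 'added'    # appeared in the new version
cc[v1c.notnull() & v2c.isnull()] = 'dropped'  # vanished from the new version
cc[v1c.notnull() & v2c.notnull() & (v1c != v2c)] = 'changed'
print(cc.tolist())
```

Note that the `!=` comparison alone would also flag null-vs-null pairs, which is why the 'changed' mask requires both sides to be non-null.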
Count the number of ISBNs in each trajectory:
trajectories = changed.value_counts()
trajectories = trajectories.to_frame('count')
trajectories['fraction'] = trajectories['count'] / len(changed)
trajectories['cum_frac'] = trajectories['fraction'].cumsum()
trajectories
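`DataFrame.value_counts` counts distinct rows, so each row of the result is a full status trajectory across versions. A minimal sketch with invented statuses showing the count/fraction/cum_frac pattern:

```python
import pandas as pd

# Hypothetical per-version statuses for three ISBNs.
traj = pd.DataFrame({
    'v2': ['same', 'same', 'added'],
    'v3': ['same', 'same', 'changed'],
})

# Count identical (v2, v3) trajectories, then derive fractions.
t = traj.value_counts().to_frame('count')
t['fraction'] = t['count'] / len(traj)
t['cum_frac'] = t['fraction'].cumsum()
print(t)
```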
metrics['new_isbns'] = (changed[versions[1:]] == 'added').sum().reindex(metrics.index)
metrics['dropped_isbns'] = (changed[versions[1:]] == 'dropped').sum().reindex(metrics.index)
metrics['changed_isbns'] = (changed[versions[1:]] == 'changed').sum().reindex(metrics.index)
metrics
The biggest change is that the July 2022 update introduced a large number (8.2M) of new ISBNs. This update incorporated more current book data, and changed the ISBN parsing logic, so it is not surprising.
Let’s save these book changes to a file for future re-analysis:
changed.to_parquet('isbn-cluster-changes.parquet', compression='zstd')
Final Saved Metrics
Now we’re going to save the metrics to a CSV file.
metrics.index.name = 'version'
metrics
metrics.to_csv('audit-metrics.csv')