Self Organizing Maps
Here I would like to share my experience from a real production pipeline with a very interesting data-reduction algorithm: Self-Organizing Maps (SOM), introduced by Professor Teuvo Kohonen. It is a simple but powerful example of unsupervised machine learning. Several implementations are available; one of the most prominent is SOMPY.
14 million UK building geometries indexed with SOM
Motivation
A major design requirement was to provide a common geographic reference for record linkage across different data sets related to building portfolios. You can think of this as a data-driven postcode. Real postcodes vary greatly in population density and aren’t well suited for scalable data-matching pipelines. Why not just use a nearest-neighbor approach? The advantage of a many-to-many indexed comparison, performed at the level of artificial postcodes, is that it provides qualitative confidence measures, as in the Python Record Linkage Toolkit. In contrast, a simple haversine distance is difficult to interpret, because an absolute tolerance of 10 meters might be “good enough” in rural areas but “poor” in densely populated urban areas like Soho.
Data source
The data comes from Ordnance Survey (OS) and contains all building geometries in the UK.
Training the model
A SOM is organized as a 2-D grid (more dimensions are theoretically possible) of fully connected neurons; the neuron closest to a given input is called its Best Matching Unit (BMU). After initialization and training, similar data points get assigned to the same BMU, and similar BMUs are topologically close on the grid, which sets SOM apart from other approaches like KNN. When the data runs into the millions, a representative sample suffices for training.
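The pipeline itself uses SOMPY, but the mechanics fit in a few lines. Below is a minimal from-scratch sketch in NumPy, not the production code: each training step picks a sample, finds its BMU, and pulls the BMU and its grid neighbors toward the sample with a decaying learning rate and neighborhood radius (function names and decay schedule are illustrative).

```python
import numpy as np

def train_som(data, grid_w, grid_h, n_iter=1000, lr0=0.5, sigma0=None, seed=0):
    """Train a toy 2-D SOM on `data` of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    sigma0 = sigma0 or max(grid_w, grid_h) / 2
    # one weight vector per grid cell, initialized inside the data range
    weights = rng.uniform(data.min(0), data.max(0),
                          size=(grid_w, grid_h, data.shape[1]))
    # grid coordinates, used by the neighborhood function
    gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        # best matching unit: the neuron closest to the sample
        d = np.linalg.norm(weights - x, axis=2)
        bi, bj = np.unravel_index(d.argmin(), d.shape)
        # exponentially decaying learning rate and neighborhood radius
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Gaussian neighborhood around the BMU on the grid
        h = np.exp(-((gx - bi) ** 2 + (gy - bj) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, :, None] * (x - weights)
    return weights

def bmu(weights, x):
    """Grid index of the best matching unit for a single point."""
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(d.argmin(), d.shape)
```

After training, `bmu` is all that is needed to project a new coordinate onto the grid.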
How big?
An important decision is the size of the model, i.e. the number of neurons M. Our goal is to achieve maximum utilization of the provided grid. Manual tuning of this hyperparameter can be time-consuming and lead to poor results or slow training. For this kind of simple model there exist some not widely known heuristics that answer the question satisfactorily: M ≈ 5·√N, where N is the total size of the projected data. The final grid should then be roughly √M × √M.
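This rule of thumb (popularized by the SOM Toolbox, which SOMPY follows) is easy to turn into a grid-sizing helper; the function name is illustrative:

```python
import math

def som_grid_size(n_points):
    """Square SOM grid side length from the M ≈ 5·sqrt(N) heuristic."""
    m = 5 * math.sqrt(n_points)          # target neuron count M
    side = max(1, round(math.sqrt(m)))   # square grid: side x side ≈ M
    return side, side

# For the ~14 million UK building geometries:
print(som_grid_size(14_000_000))  # → (137, 137)
```

So roughly 137 × 137 ≈ 18.7k artificial postcodes cover the whole country.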
Result
So finally we have a trained SOM that effectively acts as a density-aware smart grid of artificial postcodes. When coordinates from different sources are projected, each gets assigned a best matching unit, and full indexing, with a certain confidence, is performed locally within each unit in a scalable fashion, enabling record linkage at planetary scale.
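In record-linkage terms this is blocking: only records that land in the same BMU cell are compared. A hedged sketch of that step, assuming a trained `weights` grid as above (helper names are illustrative, not SOMPY API):

```python
import numpy as np
from collections import defaultdict

def assign_bmus(points, weights):
    """Map each point to the flat index of its closest neuron."""
    flat = weights.reshape(-1, weights.shape[-1])   # (n_neurons, n_features)
    d = np.linalg.norm(points[:, None, :] - flat[None, :, :], axis=2)
    return d.argmin(axis=1)                         # one cell id per point

def candidate_pairs(points_a, points_b, weights):
    """Blocking: yield only index pairs whose points share a BMU cell."""
    cells = defaultdict(lambda: ([], []))
    for i, c in enumerate(assign_bmus(points_a, weights)):
        cells[c][0].append(i)
    for j, c in enumerate(assign_bmus(points_b, weights)):
        cells[c][1].append(j)
    for ia, ib in cells.values():
        for i in ia:
            for j in ib:
                yield i, j
```

Instead of N·M distance comparisons across whole datasets, each cell only compares its own members, which is what makes the linkage scale.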
Hitmap of neural activations of the best matching units (BMU)
Caution! SOM is extremely sensitive to duplicates, a problem commonly referred to as imbalanced data. Especially when the dimensionality is low, this can be very detrimental to the model, so make sure to drop any duplicates before training.
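In NumPy this is a one-liner before training (the sample coordinates below are made up for illustration):

```python
import numpy as np

coords = np.array([[51.50, -0.13],
                   [51.50, -0.13],   # exact duplicate of the row above
                   [53.48, -2.24]])
# Repeated rows over-weight one location and skew the density-aware grid,
# so drop exact duplicates before training the SOM.
unique_coords = np.unique(coords, axis=0)
```

Note that `np.unique(..., axis=0)` also sorts the rows, which is harmless here since SOM training samples them randomly anyway.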