At Muserk, we spend a great deal of time and effort claiming royalties on behalf of composition copyright holders. Part of this process involves matching metadata we receive from clients to metadata generated by streaming services, a task known as record linkage. Sometimes all we have to go on is title and writer information, and in that case we might first think to do some kind of text lookup. However, because streaming services accept submissions (and metadata) from a global user base, this method becomes increasingly unreliable at correctly identifying compositions. One of the techniques we use to solve this problem is called fuzzy matching.
Fuzzy matching can be used to draw conclusions about related pieces of information when the rules that link them are fuzzy. As humans, we do this all the time, and the goal is to replicate that with a computer. With this process, we can take text records that aren’t exact matches and determine the likelihood that they are related. Suppose we want to look at YouTube metadata for all videos containing a particular song. What we will find is a highly varied set of metadata. To demonstrate, let’s take an example:
Song Title: All My Loving
Writers: John Lennon | Paul McCartney
Here are two example records for our song:
| Record | Title | Writers |
| --- | --- | --- |
| 1 | All My Loving | John Lennon/Paul McCartney |
| 2 | All My Loving | John Lennon, Paul McCartney |
It’s obvious that these two records describe the same underlying work, but how would we make this determination at scale? If we used a direct text comparison, like a text lookup, the “/” and “, ” characters in the writers’ field would yield a mismatch. So how can we programmatically identify these as the same work?
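To make the problem concrete, here is a minimal Python sketch (illustrative only) of why a literal string comparison falls short:

```python
# Direct text comparison: the "/" and ", " separators make otherwise
# identical writer credits look like completely different strings.
writers_1 = "John Lennon/Paul McCartney"
writers_2 = "John Lennon, Paul McCartney"

print(writers_1 == writers_2)  # False -- a naive lookup would report a mismatch
```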
One approach would be to first turn the text into a “token set” before analyzing it. With this method, we create a list of words from each entity by splitting on spaces and punctuation, then run those lists through a fuzzy algorithm to determine their similarity. Let’s try it with the writers metadata. If we make a list of words from each record and alphabetize those lists, we get the following:
List 1 = [John, Lennon, McCartney, Paul]
List 2 = [John, Lennon, McCartney, Paul]
Great! Now we compare the lists word by word with our fuzzy algorithm and find that we have an exact match! With this basic fuzzy matching algorithm, we are able to properly link these two records.
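Here is a minimal sketch of that idea in Python, using only the standard library; the `tokenize` and `token_similarity` helpers are illustrative names, not our production code:

```python
import re
from difflib import SequenceMatcher

def tokenize(text: str) -> list[str]:
    """Split on spaces and punctuation, then alphabetize the resulting words."""
    return sorted(word for word in re.split(r"[^\w]+", text) if word)

def token_similarity(a: str, b: str) -> float:
    """Score the alphabetized token lists on a 0-1 scale."""
    return SequenceMatcher(None, " ".join(tokenize(a)), " ".join(tokenize(b))).ratio()

print(tokenize("John Lennon/Paul McCartney"))   # ['John', 'Lennon', 'McCartney', 'Paul']
print(tokenize("John Lennon, Paul McCartney"))  # ['John', 'Lennon', 'McCartney', 'Paul']
print(token_similarity("John Lennon/Paul McCartney",
                       "John Lennon, Paul McCartney"))  # 1.0, an exact match
```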
Now let’s try a more difficult set of records:
| Record | Title | Writers |
| --- | --- | --- |
| 1 | All My Loving | Lennon/McCartney |
| 2 | All My Loving | Paul James McCartney, John Winston Lennon |
Here we have the same work with different metadata. If we use the same approach, we get the following lists for writers:
List 1 = [Lennon, McCartney]
List 2 = [James, John, Lennon, McCartney, Paul, Winston]
Notice that List 2 contains all of List 1’s words. Another fuzzy matching approach involves first removing the words the two lists have in common and then running the same fuzzy analysis as before. If we do this, we get the following remaining words:
List 1 = []
List 2 = [James, John, Paul, Winston]
We can see that List 1 is now empty. Running this through our fuzzy algorithm returns a 100% match because there is nothing left to compare. In other words, List 2 contains everything in List 1, so we treat the two as related records. Though this approach isn’t perfect, it does handle the vast majority of cases.
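A minimal sketch of this containment-style check (again illustrative, reusing the same tokenizer as above rather than our actual pipeline) might look like this:

```python
import re

def tokenize(text: str) -> list[str]:
    """Same helper as before: split on spaces and punctuation, then alphabetize."""
    return sorted(word for word in re.split(r"[^\w]+", text) if word)

def remaining_tokens(a: str, b: str) -> tuple[list[str], list[str]]:
    """Drop the words the two records share, keeping only what is unique to each."""
    tokens_a, tokens_b = set(tokenize(a)), set(tokenize(b))
    common = tokens_a & tokens_b
    return sorted(tokens_a - common), sorted(tokens_b - common)

rest_1, rest_2 = remaining_tokens(
    "Lennon/McCartney",
    "Paul James McCartney, John Winston Lennon",
)
print(rest_1)  # [] -- every word in List 1 also appears in List 2
print(rest_2)  # ['James', 'John', 'Paul', 'Winston']

# An empty remainder means one record is fully contained in the other,
# so we treat the pair as a 100% match.
print("match" if not rest_1 or not rest_2 else "no match")
```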
There are all kinds of fuzzy matching algorithms available to handle varying use cases, and fuzzy matching is just one step in our data pipeline. Most software packages that can perform text comparisons offer some kind of fuzzy matching capability (see the sketch below for one example). Give it a try next time you have a similar situation. It may just be the tool you’ve been missing!
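For instance, the open-source RapidFuzz library for Python bundles these token-based comparisons into single calls (a quick sketch of one such package, not a description of our own pipeline):

```python
from rapidfuzz import fuzz, utils  # pip install rapidfuzz

# token_sort_ratio alphabetizes the tokens before comparing; token_set_ratio
# also discounts the tokens the two strings share. default_process lowercases
# and strips punctuation, so "/" and ", " no longer cause mismatches.
print(fuzz.token_sort_ratio("John Lennon/Paul McCartney",
                            "John Lennon, Paul McCartney",
                            processor=utils.default_process))  # 100.0

print(fuzz.token_set_ratio("Lennon/McCartney",
                           "Paul James McCartney, John Winston Lennon",
                           processor=utils.default_process))   # 100.0
```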