I’m currently working on Kaggle’s “What’s Cooking” challenge. Basically, using data provided by the recipe site Yummly, the task is to predict a dish’s cuisine based on its ingredients. An example entry from the dataset looks like this:

```json
{
  "id": 24717,
  "cuisine": "indian",
  "ingredients": [
    "tumeric",
    "vegetable stock",
    "tomatoes",
    "garam masala",
    "naan",
    "red lentils",
    "red chili peppers",
    "onions",
    "spinach",
    "sweet potatoes"
  ]
}
```

The 39774 recipes in the training set contain a total of 6716 unique ingredients. Or rather, unique only in the sense that their strings are not equal. In fact, there are quite a few ingredients that are identical, or at least very similar, under different names. In the remainder of this post I’m going to describe how I use the `stringdist` R package to tackle the problem of cleaning the dataset by grouping such ingredients together.

### The challenge: identical ingredients with different names

Since the dataset covers recipes from 20 different cuisines, there is a large number of truly unique ingredients. However, it appears that the input on Yummly is free text, which leads to duplicates like these (`ingredients_unique` is a character vector):

```r
grep("mozzarella", ingredients_unique, value = TRUE)
 [1] "mozzarella cheese" "part-skim mozzarella cheese" "shredded mozzarella cheese"
 [4] "mozzarella balls" "fresh mozzarella" "chees fresh mozzarella"
 [7] "part-skim mozzarella" "fresh mozzarella balls" "low fat mozzarella"
[10] "low-fat mozzarella cheese" "chees mozzarella stick" "preshred low fat mozzarella chees"
[13] "low moisture mozzarella" "reduced fat mozzarella" "buffalo mozzarella"
[16] "smoked mozzarella" "shredded low-fat mozzarella cheese" "nonfat mozzarella cheese"
[19] "mozzarella string cheese" "2% milk shredded mozzarella cheese"
```

Now, arguably, these are not all the same from a culinary perspective. However, they are at least closely related and belong primarily to the Italian cuisine. In a cleaned dataset, we would like to group these together so as to reduce the number of similar features and make the remaining ones more meaningful. At the same time, it is important to avoid false groupings, as the distortions they would introduce are likely more harmful than the benefits of correct grouping.

### The stringdist package

stringdist is an R package for measuring the distance between strings (i.e. how much they differ). It implements several metrics for this, including the well-known Levenshtein distance, cosine similarity, and even the phonetic Soundex algorithm, which however turned out to be inadequate for the task at hand. While `stringdist` measures the distance between two strings, `stringdistmatrix` takes two vectors of strings and produces a matrix by comparing each string of vector A with each string of vector B.
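To make the two functions concrete, here is a minimal sketch; the ingredient strings are just examples, nothing here is specific to the competition code:

```r
library(stringdist)

# Distance between two individual strings (Levenshtein = number of edits):
stringdist("lemongrass", "lemon grass", method = "lv")   # 1 (the inserted space)

# stringdistmatrix() compares every element of one vector with every
# element of another and returns a matrix of distances:
a <- c("lemongrass", "sambal olek")
b <- c("lemon grass", "sambal ulek", "salt")
stringdistmatrix(a, b, method = "lv")
# entry [i, j] is the distance between a[i] and b[j]
```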

### Applying stringdist to the ingredients

The idea is to use stringdist to find ingredients that are so close together by the chosen metric that it is safe to treat them as identical without manual checking. Of course, for this number of ingredients a manual approach is still feasible, but for similar use cases of much higher dimension it probably is not. To start with, we compute the Levenshtein distance and inspect those pairs with the lowest possible difference of 1 (0 means the strings are identical):

```r
lv <- stringdistmatrix(ingredients_unique, ingredients_unique, method = "lv")
lv_match <- which(lv == 1, arr.ind = TRUE)
for (i in seq_len(nrow(lv_match))) {
  print(ingredients_unique[c(lv_match[i, 1], lv_match[i, 2])])
}
```

The resulting list contains many true positives, e.g:

“lemongrass” “lemon grass”

“sambal olek” “sambal ulek”

“thyme sprig” “thyme sprigs”

“crema mexican” “crema mexicana”

“self raising flour” “self rising flour”

But, unfortunately, also some glaring false positives:

“ice” “rice”

“jam” “ham”

“malt” “salt”

“peas” “pears”

“beer” “beef”
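Part of the problem is that an absolute edit distance of 1 ignores string length: one edit turns “ice” into “rice”, but barely changes “thyme sprigs”. A quick check makes this visible (the `rel_dist` helper is not part of the pipeline above, just an illustration):

```r
library(stringdist)

# One edit means very different things for short and long strings:
stringdist("ice", "rice", method = "lv")                  # 1, but unrelated
stringdist("thyme sprig", "thyme sprigs", method = "lv")  # 1, a true duplicate

# Normalising by the longer string's length separates the two cases:
rel_dist <- function(a, b) {
  stringdist(a, b, method = "lv") / pmax(nchar(a), nchar(b))
}
rel_dist("ice", "rice")                  # 0.25
rel_dist("thyme sprig", "thyme sprigs")  # ~0.083
```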

Clearly, the list as it is can’t be used to perform the grouping. Let’s do the same with a continuous measure, the cosine distance, and look at those pairs with a value smaller than or equal to 0.03 but greater than 0:

```r
cd <- stringdistmatrix(ingredients_unique, ingredients_unique, method = "cosine")
cd_match <- which(cd <= 0.03 & cd > 0, arr.ind = TRUE)
for (i in seq_len(nrow(cd_match))) {
  print(ingredients_unique[c(cd_match[i, 1], cd_match[i, 2])])
}
```

This finds other true positives, such as:

“whole wheat spaghettini” “whole wheat thin spaghetti”

“skinless and boneless chicken breast fillet” “skinless chicken breast fillets”

“garlic chili sauce” “Thai chili garlic sauce”

But also other false positives (omitted here). One characteristic is that it finds duplicates with reordered or additional adjectives/words. These often describe the same base ingredient, as above, but not always (I wouldn’t want to use “unsalted peanut butter” instead of “unsalted butter”). Again, the list can’t be used unfiltered.

Now, both lists contain true and false positives. Is there a way to get just the true positives, or at least a subset of them? Fortunately, there is, although it comes at the cost of also losing some true positives: combine different measures. What if, for example, we took all ingredients with a Levenshtein distance of 1 and a cosine distance smaller than 0.03 but greater than 0?

```r
# lv is integer-valued and cd lies in [0, 1], so a sum strictly between
# 1 and 1.03 implies lv == 1 and 0 < cd < 0.03:
combined <- cd + lv
matches <- which(combined > 1 & combined < 1.03, arr.ind = TRUE)
for (i in seq_len(nrow(matches))) {
  print(ingredients_unique[c(matches[i, 1], matches[i, 2])])
}
```

This results in a list of 41 distinct pairs which (I’m not going to paste it here, I hope you believe me) are indeed all true duplicates. Compared with the 198 cosine and 161 Levenshtein matches this is not much, but following the hypothesis that false negatives hurt less than false positives, the combination of the metrics is an improvement. I’m still in the process of tweaking the values, and there is likely a better combination. For example, accepting a combined score of up to 1.1 yields 91 pairs of which, by visual inspection, almost all are true positives.
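Once a trusted list of pairs exists, the actual grouping is a simple lookup. A minimal sketch with stand-in pairs (in the real pipeline they would come from `matches` above; picking the shorter spelling as canonical is an arbitrary choice, and chains of duplicates would additionally need something like connected components):

```r
# Stand-in pairs for illustration:
pairs <- rbind(c("lemongrass",    "lemon grass"),
               c("crema mexican", "crema mexicana"),
               c("thyme sprig",   "thyme sprigs"))

# Map the longer variant of each pair onto the shorter one:
shorter <- ifelse(nchar(pairs[, 1]) <= nchar(pairs[, 2]), pairs[, 1], pairs[, 2])
longer  <- ifelse(nchar(pairs[, 1]) <= nchar(pairs[, 2]), pairs[, 2], pairs[, 1])
canonical <- setNames(shorter, longer)

# Replace known duplicates, leave everything else untouched:
normalize <- function(x) ifelse(x %in% names(canonical), canonical[x], x)
normalize(c("lemon grass", "thyme sprigs", "onions"))
# "lemongrass" "thyme sprig" "onions"
```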

### What about the mozzarellas?

I chose the different variants of mozzarella as an example of duplicates. Unfortunately, none of them can be found using the combined filter described above, because their edit distances are too large. The average Levenshtein distance between them is as follows:

```r
mozzarellas <- grep("mozzarella", ingredients_unique)
distances_mozzarellas <- stringdistmatrix(ingredients_unique[mozzarellas],
                                          ingredients_unique[mozzarellas],
                                          method = "lv")
mean(distances_mozzarellas)
[1] 13.425
```

This is still a slight underestimate, because the diagonal of the distance matrix is 0, but 13 is already far too high for a filter criterion. For brevity I won’t include the averages of the other distance metrics here, but they are not suitable either. A more promising approach would be to match the ingredients against a second, duplicate-free ingredient database, but the competition’s rules forbid the use of external databases.

To summarize, the stringdist package (or, more generally, the underlying metrics) is a good starting point for problems like this one. The weighting of false positives against false negatives depends on the particular task. While in this case avoiding false positives appears more important than catching all true positives, it can be different in other cases.