Theta Joins

[Diagram: common join types]

If you're like me, you're looking at that diagram wondering what the heck is a 'theta' join?! Well, that's the formal name for what many of us call a 'fuzzy' join: a join whose condition can be any comparison (<, <=, >, >=, <>), not just equality. In fact, you can read about all the different join types in this StackOverflow answer.

An equi join is the most common kind of join because it leverages keys in your data source which are essential in a SQL environment. In SQL it would be something like:

select a.*, b.*
from dbo.table1 as a
left join dbo.table2 as b
on a.key = b.key

Strictly speaking, this could also be written as a 'natural' join, since the key has the same name in both sources and I'm using equality to establish the join. Either way, because I'm joining on two keys being equal, it fits into the 'equi' join category.

How can we do ‘equi’ joins in R? I’ll present three methods below using three different packages, data.table, dplyr, and sqldf.

Getting ready

We'll need to load some packages. Please note the versions I'm using. Also, the 'live' code I'm running is on my 64-bit Windows 10 laptop (8 GB RAM, i7-6600U CPU) in R 3.5.0.

Package | Version
--------|--------
data.table | 1.11.4
sqldf | 0.4-11
dplyr | 0.8.3
library(data.table)
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.5.3
## Warning: package 'gsubfn' was built under R version 3.5.3
## Warning: package 'proto' was built under R version 3.5.1
## Warning: package 'RSQLite' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3

Equijoin data

Two quick tables to demonstrate equijoin behavior in each ecosystem. Here we have some customer IDs in two tables: one with each customer's purchases (df1) and another with their location (df2). I'll wrap each approach in a function to enable easy benchmarking and profiling.

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Laptop", 2), rep("Tablet", 2), rep("Smartphone", 2)))
df2 = data.frame(CustomerId = c(2,3,4,6), State = c(rep("Tennessee", 3), rep("North Carolina", 1)))

Different Approaches for Equijoins

Base R

baseR_left = function(){
  merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
}
baseR_left()
##   CustomerId    Product          State
## 1          1     Laptop           <NA>
## 2          2     Laptop      Tennessee
## 3          3     Tablet      Tennessee
## 4          4     Tablet      Tennessee
## 5          5 Smartphone           <NA>
## 6          6 Smartphone North Carolina
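As a small aside of my own (not from the original post): merge covers the other join flavors too via its all arguments, which is part of why it gets described as flexible later on.

merge(x = df1, y = df2, by = "CustomerId")              # inner join: only matching CustomerIds
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)  # full outer join: keep everything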

data.table

Are you not familiar with data.table? I’ll be honest, I don’t use it a lot but it’s described as follows on CRAN:

Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, a fast friendly file reader and parallel file writer. Offers a natural and flexible syntax, for faster development.

The syntax is different from a lot of what I'm used to, which is why I don't use it much, but that's a bad reason not to consider it.

dt1 = data.table(df1)
dt2 = data.table(df2)
#left join
dt_left = function(){
  dt2[dt1,on=.(CustomerId = CustomerId)]
}
dt_left()
##    CustomerId          State    Product
## 1:          1           <NA>     Laptop
## 2:          2      Tennessee     Laptop
## 3:          3      Tennessee     Tablet
## 4:          4      Tennessee     Tablet
## 5:          5           <NA> Smartphone
## 6:          6 North Carolina Smartphone

I think this syntax is a bit backwards since Y[X] produces a left join of Y onto X, meaning you put the left-hand table on the right… anyway.
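If it helps, here is the same left join written two other ways (my own sketch, not from the original post): the string form of on=, and data.table's merge() method, which keeps the more familiar left/right argument order.

dt2[dt1, on = "CustomerId"]                       # same left join, string form of on=
merge(dt1, dt2, by = "CustomerId", all.x = TRUE)  # merge.data.table keeps dt1 on the left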

sqldf

Are you a lover of SQL and maybe new to R? I have a coworker who lived and breathed SQL for decades before learning R and they swear by this package to manage data. It literally sets up a database in the background to support the SQL language. That’s dedication.

The sqldf() function is typically passed a single argument which is an SQL select statement where the table names are ordinary R data frame names. sqldf() transparently sets up a database, imports the data frames into that database, performs the SQL select or other statement and returns the result using a heuristic to determine which class to assign to each column of the returned data frame. The sqldf() or read.csv.sql() functions can also be used to read filtered files into R even if the original files are larger than R itself can handle. ‘RSQLite’, ‘RH2’, ‘RMySQL’ and ‘RPostgreSQL’ backends are supported.

sqldf_left = function(){
  sqldf("SELECT a.CustomerId, Product, State 
         FROM df1 a
         LEFT JOIN df2 b 
           on a.CustomerID = b.CustomerID")
}
sqldf_left()
##   CustomerId    Product          State
## 1          1     Laptop           <NA>
## 2          2     Laptop      Tennessee
## 3          3     Tablet      Tennessee
## 4          4     Tablet      Tennessee
## 5          5 Smartphone           <NA>
## 6          6 Smartphone North Carolina

dplyr

If I described myself as having a man crush on Hadley Wickham it would be accurate. I have a signed copy of his book 'Advanced R' on my bookshelf which I treasure. I didn't even get it signed in person; I sent it with a friend who went to an R conference where I knew Hadley would be present. ANYWAY… dplyr is part of the tidyverse, which is a collection of packages to support 'tidy' data analysis. It includes support for many operations, including joins.

dplyr_left = function(){
  left_join(df1,df2,by="CustomerId")
}
dplyr_left()
##   CustomerId    Product          State
## 1          1     Laptop           <NA>
## 2          2     Laptop      Tennessee
## 3          3     Tablet      Tennessee
## 4          4     Tablet      Tennessee
## 5          5 Smartphone           <NA>
## 6          6 Smartphone North Carolina
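For completeness (this bit is mine, not the original post's), dplyr ships a whole family of join verbs beyond left_join(); a couple of examples:

inner_join(df1, df2, by = "CustomerId")  # only customers that appear in both tables
anti_join(df1, df2, by = "CustomerId")   # customers in df1 with no matching location record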

Profiling

Let’s profile these just to get a sense of where they stand.

profmem

Ecosystem | profmem bytes
----------|--------------
Base R | 0
data.table | 58,048
sqldf | 183,888
dplyr | 5,088
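The profiling code itself isn't shown above; here's a minimal sketch of how I'd collect numbers like these with the profmem package (an assumption about the method, since the post only reports the results):

library(profmem)
p = profmem(dplyr_left())  # run one call under the memory profiler
total(p)                   # total bytes allocated during that call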

microbenchmark

library(microbenchmark)
microbenchmark(base = baseR_left(),
               dt = dt_left(),
               sqldf = sqldf_left(),
               dplyr = dplyr_left(),
               times = 100)
## Unit: microseconds
##   expr       min        lq      mean     median        uq       max neval cld
##   base   548.602   670.702   842.999   718.5015  1032.251  2141.200   100  a 
##     dt   807.002  1100.651  1379.426  1204.6010  1650.751  3164.802   100  a 
##  sqldf 20937.301 23652.301 27614.825 25711.2005 28242.851 66439.100   100   b
##  dplyr  1106.402  1321.050  1667.317  1415.2510  1834.452  5005.301   100  a

Results

Base R has the smallest memory footprint (duh) and actually posts the fastest times on this toy set; merge is pretty flexible and can do some cool stuff. data.table and dplyr are in the same ballpark, with data.table a touch faster at the median but carrying roughly 10x the memory footprint of dplyr. sqldf is slow and bloated, but what else could you expect from a package that is literally spinning up a database in the background? It's the price you pay to use SQL in R.

Different Approaches for Theta Joins

The (small) data

We'll use the same data for the toy problem and the more realistic version. This is the result of some text mining, specifically named entity extraction from corpora of text. It's a two-part, non-sequential process. First we identify sentences by extracting their starting and ending character positions.

sent_df = read.table(text="id start end
                            1     1  39
                            2    42  63", header=TRUE)

Then we extract entities by getting the type, starting and ending characters.

ent_df = read.table(text="start end     ent_type
                              1   9 organization
                             23  31        money
                             45  50 organization", header=TRUE)

But, to help our users get the context of where entities appear, as well as to get co-occurrence information, we need to know which sentence each entity appeared in. Additionally, we only want to keep the sentence id and the entity type.

In this case, we can see the first entity (an organization) starts at location 1 and ends at 9 which puts it in sentence 1 (which starts at location 1 and ends at 39). The second entity (money!) is also in sentence 1 (starts at 23 and ends at 31) while the last entity belongs in sentence 2. An equi join won’t work because I need to check if entity.start >= sentence.start AND entity.end <= sentence.end. Luckily, we have a few approaches.
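Just to make that condition concrete, here it is spelled out by hand for two of the entities (a throwaway illustration, not part of the original post):

1 >= 1 & 9 <= 39    # first entity (1..9) sits inside sentence 1 (1..39): TRUE
45 >= 1 & 50 <= 39  # last entity (45..50) fails for sentence 1; it belongs to sentence 2 (42..63)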

Base R

What we'll do is cross join the data together, then do some filtering. I'm doing this all with base R functionality, which I'm not totally used to, so this may not be the most elegant form of filtering. I also decided to try the %between% operator (note that %between% actually comes from data.table, which we already have loaded, so that version isn't strictly base R). Since an entity cannot be extracted which spans sentences, all we really need to check is that the entity start is between the sentence start and end.

baseR_fuzzy = function(){
  tmp = merge(x = sent_df, y = ent_df, by = NULL)
  tmp[tmp$start.x <= tmp$start.y & tmp$end.x >= tmp$end.y,c("id","ent_type")]
}

baseR_fuzzy_between = function(){
  tmp = merge(x = sent_df, y = ent_df, by = NULL)
  tmp[tmp$start.y %between% tmp[,c("start.x","end.x")],c("id","ent_type")]
}
  
baseR_fuzzy()
##   id     ent_type
## 1  1 organization
## 3  1        money
## 6  2 organization

data.table

One great thing about data.table is that it supports fuzzy (non-equi) joins right out of the gate. However, the join logic can read ambiguously. Because a left join takes the form Y[X], and we're joining the entities onto the sentences, the column names inside on=.() are listed in Y-then-X order, not in left-join order. So, where we have on=.(start >= start), you can think of it as on=.(start.Y >= start.X), which is equivalent to on=.(start.ent >= start.sent).

sent_dt = data.table(sent_df)
ent_dt  = data.table(ent_df)

dt_fuzzy = function(){
  ent_dt[sent_dt, on=.(start >= start, end <= end)][,.(id, ent_type)]
}
dt_fuzzy()
##    id     ent_type
## 1:  1 organization
## 2:  1        money
## 3:  2 organization
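If the left/right mapping inside on=.() bothers you, one option (my sketch, not the post's) is to give the sentence table distinct column names so the condition reads unambiguously:

sent_dt2 = data.table(sent_df)
setnames(sent_dt2, c("start", "end"), c("sent_start", "sent_end"))
ent_dt[sent_dt2, on = .(start >= sent_start, end <= sent_end)][, .(id, ent_type)]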

sqldf

sqldf also supports SQL's BETWEEN operator, which lets us simplify the inequalities in our join condition.

sqldf_fuzzy = function() {
  sqldf(
      "
      SELECT id, ent_type
      FROM sent_df a, ent_df b
      WHERE b.start BETWEEN a.start AND a.end 
      "
  )
}
sqldf_fuzzy()
##   id     ent_type
## 1  1 organization
## 2  1        money
## 3  2 organization

fuzzyjoin

Here I'll introduce a new package aptly named fuzzyjoin (I'm using version 0.1.5). It adds 'fuzzy' versions of all the dplyr joins and plays nicely with the rest of the tidyverse. In addition to providing both tables, we need to provide match functions that return logical indicators of fuzzy matching. What's interesting is that each row is returned twice, which I suspect is a consequence of providing two match functions; that's why the code below finishes with distinct().

This is also the first time I use the magrittr pipes (%>%). If you’re not familiar with them I’d suggest learning about them because they’re great and I have stickers of them on my truck and motorcycle.

library(fuzzyjoin)
## Warning: package 'fuzzyjoin' was built under R version 3.5.3
startMatch = function(x,y){
  x <= y
}
endMatch = function(x,y){
  x >= y
}
fuzzy_fuzzy = function(){
  fuzzy_join(sent_df,ent_df,by=c("start","end"),match_fun = list(start = startMatch,end = endMatch)) %>% 
    select(id,ent_type) %>% 
    distinct()
}
fuzzy_fuzzy()
##   id     ent_type
## 1  1 organization
## 2  1        money
## 3  2 organization

Pure tidyverse

I'm a fan of limiting the number of packages your code relies upon. Adding fuzzyjoin to handle this problem feels unnecessary (it provides great value in other applications because of the flexibility of the match functions), so I'd like to avoid it while still leveraging the tidyverse. So I'm going to load the tidyr package (I'm using version 0.8.1), which is a solid complement to dplyr and does lots of useful stuff. I know it looks like I'm just swapping out one package for another, but given the general practicality of tidyr compared to fuzzyjoin, I think it's OK.

tidyr provides a crossing() function for cross joins (dplyr itself has no cross join verb!), which has a funky quirk of not adjusting the names of duplicate columns. That breaks the ability of dplyr and other tools to work on the resulting tibble (data frame). So, I manually adjust the names using names<-, filter down to what I want, and then select the columns. The approach is identical to the base R code, just using the tidyverse.

library(tidyr)
tidy_fuzzy = function(){
  sent_df %>% 
    crossing(ent_df) %>% 
    `names<-`(c("id","start.x","end.x","start.y","end.y","ent_type")) %>% 
    filter(end.x >= end.y & start.x <= start.y) %>% 
    select(id,ent_type)
}
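An alternative that avoids the names<- gymnastics (again, my own sketch rather than the post's approach) is to rename the sentence columns before crossing:

tidy_fuzzy_renamed = function(){
  sent_df %>% 
    rename(sent_start = start, sent_end = end) %>%   # no duplicate names after this
    crossing(ent_df) %>% 
    filter(start >= sent_start, end <= sent_end) %>% 
    select(id, ent_type)
}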

Profiling

Let’s profile these (small) fuzzy joins just to get a sense of where they stand.

profmem

Ecosystem | profmem bytes
----------|--------------
Base R | 0
Base R (%between%) | 0
data.table | 83,136
sqldf | 184,480
fuzzyjoin | 117,104
tidyverse | 41,848

microbenchmark

microbenchmark(base = baseR_fuzzy(),
               base_between = baseR_fuzzy_between(),
               dt = dt_fuzzy(),
               sqldf = sqldf_fuzzy(),
               fuzzyjoin = fuzzy_fuzzy(),
               tidyverse = tidy_fuzzy(),
               times = 100)
## Unit: microseconds
##          expr       min         lq       mean     median        uq        max
##          base   526.701   667.2005   972.9729   836.8515  1058.001   6870.201
##  base_between   585.300   711.1505  1052.8059   975.5010  1193.802   4877.101
##            dt  1839.401  2091.8010  2769.4180  2428.2005  2985.002   9181.001
##         sqldf 20464.601 24900.9005 32261.5380 28242.0010 34966.951  75240.401
##     fuzzyjoin 34239.301 41375.1010 50601.5500 44211.1515 51053.701 152096.201
##     tidyverse  3239.801  3898.6015  5429.9969  4823.4510  6344.301  12026.900
##  neval  cld
##    100 a   
##    100 a   
##    100 ab  
##    100   c 
##    100    d
##    100  b

Results

Both base R functions have essentially no measured memory footprint and actually run the fastest, with the %between% version a touch slower, though not enough to outweigh its improved readability. fuzzyjoin has a large footprint and runs the slowest, in the same rough ballpark as sqldf; neither is a great choice for the task we have. data.table and tidyverse are in the same ballpark as each other but substantially slower than base R. BUT this was a very small toy problem. Let's do something more meaningful.

Bigger Data

It turned out that in the real-world case we only needed to do fuzzy joining on a per-document basis which meant the data never got so big that scaling issues of each approach would matter. It was scaling how much parallel processing we could do to slam documents through.
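As a rough sketch of what that per-document parallelism looks like (doc_sentences and doc_entities below are hypothetical lists of per-document tables, not objects defined in this post):

library(parallel)
library(data.table)
cl = makeCluster(4)                    # one worker per core; tune for your machine
clusterEvalQ(cl, library(data.table))  # each worker needs data.table loaded
per_doc = clusterMap(cl, function(sent, ent) {
  # same non-equi join as dt_fuzzy(), applied to one document's tables
  ent[sent, on = .(start >= start, end <= end)][, .(id, ent_type)]
}, doc_sentences, doc_entities)
stopCluster(cl)
all_entities = rbindlist(per_doc)      # stack the per-document results back together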

But that’s boring. We have scaling issues we need to test and solve.


Originally I created 5000 sentences with ~23,000 entities, but this proved to be too much for some of the functions to handle on my laptop (having closed all other applications first!). So, I shrunk it to 2000 sentences and ~9,000 entities.

set.seed(2300)
nsent = 2000
# draw sentence lengths ~ N(100, 10) and derive character start/end positions
sentence_length = floor(rnorm(nsent,100,10))
sent_df = data.frame(id = 1:nsent,
                     start = Reduce(sum,sentence_length,init=1,accumulate = T)[1:nsent],
                     end = cumsum(sentence_length))
# walk through the character positions and randomly drop entities into sentences
stop = FALSE
i = 1
ent_df = data.frame(start = NA, end = NA, ent_type = NA)  # seed an NA starter row to rbind onto
while(!stop){
  ent_start = as.logical(rbinom(n=1,size=1,prob = 0.2))   # ~20% chance an entity starts here
  if(ent_start){
    sent = max(which(sent_df$start <= i))                 # which sentence are we in?
    sent_end = sent_df$end[sent_df$id == sent]
    ent_end = min(c(i+rpois(1,1)+3,sent_end))             # entity end, capped at the sentence end
    if(ent_end-i < 3){                                    # skip entities that would be too short
      i = i+3
      next
    }
    ent_type = sample(c("organization","person","money"),size = 1)
    ent_df = rbind(ent_df,data.frame(start=i, end=ent_end, ent_type=ent_type))
    i = ent_end+2
  }
  i = i+3
  if(i > max(sent_df$end)) stop = TRUE
}
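One caveat: dt_fuzzy() uses sent_dt and ent_dt, which were built from the small data frames earlier. The post doesn't show this step, so I'm assuming the data.table copies get rebuilt from the new, larger data before benchmarking:

sent_dt = data.table(sent_df)
ent_dt  = data.table(ent_df)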

Now we have a bunch of data - so let’s test stuff!

microbenchmark

I suspect some of these are going to run very poorly, so we'll only do a handful of runs (times = 3) and leave fuzzyjoin commented out of the benchmark call. This isn't a rigorous benchmark, but when you see the eval times, you'll realize why.

microbenchmark(base = baseR_fuzzy(),
               base_between = baseR_fuzzy_between(),
               dt = dt_fuzzy(),
               sqldf = sqldf_fuzzy(),
               #fuzzyjoin = fuzzy_fuzzy(),
               tidyverse = tidy_fuzzy(),
               times = 3)

Method | Small Eval Time (ms) | Big(ger) Eval Time (ms) | Scale Up
-------|----------------------|-------------------------|---------
Base | 1.2 | 102,039 | 85,000x
Base (%between%) | 1.3 | 93,731 | 72,100x
data.table | 3.8 | 17 | 4x
sqldf | 47 | 2,501 | 53x
fuzzyjoin | 75 | 470,068 | 6,267x
tidyverse | 7.2 | 83,413 | 11,585x

The 'scale up' column simply divides the execution time on the 1000x-sentences (3000x-entities) data by the execution time on the original toy data (e.g., for base R: 102,039 / 1.2 ≈ 85,000x). I find the results somewhat surprising.

Results

  • data.table is the clear winner overall: it was on the fast end with the toy data and it scales far better than the data grows (only ~4x slower on 1000x the sentences).
  • sqldf has a lot of overhead and ran (relatively) poorly on the toy problem but scales well with the larger data.
  • fuzzyjoin took nearly 8 minutes to run where the next slowest was closer to 2 minutes. Even though the scaling factor is an order of magnitude smaller than the worst, the total time involved is absurd. This is not a good choice for the type of fuzzy joining we’re doing.
  • Base R (both versions) and tidyverse were in the same rough ballpark, ~1.5 minutes with ridiculous scaling problems.

Conclusion

I've shown you six different approaches to fuzzy joins (aka theta joins) and demonstrated their relative merits. We've seen that performance on small joins is no indication of performance on large joins: data.table scales very well across a variety of sizes, and the fixed overhead that makes sqldf look bloated on small data pays off at larger sizes. Also important to note: base R has reasonable performance and beats a 'tidy' solution, at the (potential) cost of readability.

Should you pick up data.table and replace all your tidy work with it? I'm beginning to believe that maybe I should. data.table has at least one other practical advantage over the tidyverse: stability. data.table is past version 1.0 where dplyr and tidyr are not. This is especially apparent in the fairly frequent breaking changes made to the tidyverse, which can have a strong, negative impact in an enterprise environment.

Well, I hope you have fun with theta joins (and start calling them that!).