TAGS :Viewed: 7 - Published at: a few seconds ago

[ Joining two dataframes with sqldf results in NA Column ]

I am trying to join two data frames using the sqldf package.

These are reproducible examples of my two dataframes:

a <- c(1,2,3,4,5)
b <- c(1,2,3,4,5)
c <- c(1,2,3,4,5)
d <- c(1,2,3,4,5)
e <- c(1,2,3,4,5)

dataframe1 <- data.frame(a,b,c,d,e)

a <- c(NA,NA,NA,NA,5)
b <- c(NA,NA,NA,4,NA)
c <- c(NA,NA,3,NA,NA)
d <- c(NA,2,NA,NA,NA)
e <- c(1,NA,NA,NA,NA)
f <- c(1,2,3,4,5)

dataframe2 <- data.frame(a,b,c,d,e,f)

This is a reproducible example of the join on the dataframes:

final_data <- sqldf("SELECT *
                     FROM dataframe1
                     LEFT OUTER JOIN dataframe2 USING(a,b,c,d,e)")

The resulting f column in final_data which is introduced via the join is filled with NAs. Why? The ideal f column would have the corresponding values for a,b,c,d, and e that exist in dataframe2. How could I solve this?

Answer 1


If you want OR criteria, the corresponding SQL is

SELECT *
FROM dataframe1 d1
LEFT OUTER JOIN dataframe2 d2
ON (
  d1.a = d2.a
  OR d1.b = d2.b
  OR d1.c = d2.c
  OR d1.d = d2.d
  OR d1.e = d2.e
)

Since the column values from df1 aren't necessarily equal to those from df2, you get a copy from each of them. If you only want to retain the values from df1 as well as f from df2, then do this:

SELECT d1.*, d2.f
FROM dataframe1 d1
LEFT OUTER JOIN dataframe2 d2
ON (
  d1.a = d2.a
  OR d1.b = d2.b
  OR d1.c = d2.c
  OR d1.d = d2.d
  OR d1.e = d2.e
)

The R join functions I know about (merge, plyr::join and dplyr::left_join) all require all of the columns in the join to be equal, though you could pretty easily hack together a custom function that joins one at a time and then coalesces the results.