[ Optimized version of grep to match vector against vector ]
Suppose I have two vectors of characters a
and b
:
set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
I want to apply the grep
function in order to find the match of each value of a
in b
.
Of course I could do:
sapply(a, grep, b)
but I wonder if there is something more efficient as I will have to run this a lot of times for much larger vectors in simulations (note that I don't want to use mclapply
either as I already use it to run each iteration of my simulations):
system.time(lapply(seq(100000), function(x) sapply(a, grep, b)))
library(parallel)
system.time(mclapply(seq(100000), function(x) sapply(a, grep, b), mc.cores=8))
Answer 1
Since you don't use regular expressions but want to find substrings in longer strings, you can use fixed = TRUE
. It is much faster.
library(microbenchmark)
microbenchmark(lapply(a, grep, b), # original
lapply(paste0("^", a), grep, b), # @flodel
lapply(a, grep, b, fixed = TRUE))
Unit: microseconds
expr min lq median uq max neval
lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857 100
lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851 100
lapply(a, grep, b, fixed = TRUE) 21.004 22.5885 23.8580 24.6110 33.608 100
Testing with longer vectors (1000 times the original length).
ar <- rep(a, 1000)
br <- rep(b, 1000)
library(microbenchmark)
microbenchmark(lapply(ar, grep, br), # original
lapply(paste0("^", ar), grep, br), # @flodel
lapply(ar, grep, br, fixed = TRUE))
Unit: seconds
expr min lq median uq max neval
lapply(ar, grep, br) 32.288139 32.564223 32.726149 32.97529 37.818299 100
lapply(paste0("^", ar), grep, br) 24.997339 25.343401 25.531138 25.71615 28.238802 100
lapply(ar, grep, br, fixed = TRUE) 2.461934 2.494759 2.513931 2.55375 4.194093 100
(This took quite a while...)
Answer 2
Following on my last suggestion...
The big problem with what you are asking is that, a priori, you need to do length(a) * length(b)
comparisons. However, you can take advantage of the fact that the matches here will only happen at the beginning of the strings (what I gathered from the comments.).
I suggested you first split your a
and b
vectors into lists, after looking at the first word ( "Or", "Gr", "Control", "PMT", etc.) in each item, then only look for matches in the corresponding sets. In other words, take the items in a
that start with Or_
and only look for matches in the items in b
that also start with Or_
.
To give you an idea of why this is efficient in terms of complexity. Imagine a
and b
both have length n
; that there are x
possible prefixes, uniformly distributed throughout a
and b
. Then you would only have to do x * (n/x * n/x)
comparisons versus n * n
in your case. That's x
times fewer comparisons. And you could even imagine repeating the process using the second word, third, etc. in a recursive way.
Now here is the code for it:
reduced.match <- function(a, b) {
first.word <- function(string) sub("_.*", "", string)
a.first <- first.word(a)
b.first <- first.word(b)
l.first <- unique(c(a.first, b.first))
a.first <- factor(a.first, l.first)
b.first <- factor(b.first, l.first)
a.split <- split(a, a.first)
b.split <- split(b, b.first)
a.idx.split <- split(seq_along(a), a.first)
b.idx.split <- split(seq_along(b), b.first)
unsorted.matches <-
Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
a.split, b.split, b.idx.split, USE.NAMES = FALSE)
sorted.matches <-
unlist(unsorted.matches, recursive = FALSE)[
match(seq_along(a), unlist(a.idx.split))]
return(sorted.matches)
}
# sample data
set.seed(123)
n <- 10000
words <- paste0(LETTERS, LETTERS, LETTERS)
a <- paste(sample(words[-1], n, TRUE),
sample(words, n, TRUE), sep = "_")
b <- paste(sample(words[-2], n, TRUE),
sample(words, n, TRUE), sep = "_")
# testing
identical(reduced.match(a, b), lapply(a, grep, b, fixed = TRUE))
# [1] TRUE
# benchmarks
system.time(reduced.match(a, b))
# user system elapsed
# 0.187 0.000 0.187
system.time(lapply(a, grep, b, fixed = TRUE))
# user system elapsed
# 2.915 0.002 2.920
Answer 3
If a and b are sorted (and a unique) and one is interested in exact matches at the beginning of the string, then the following C code will usually be relatively efficient (something on the order of length(a) + length(b) string comparisons?). The R wrapper makes sure the C code and R user get appropriate data.
f3 <- local({
library(inline)
.amatch <- cfunction(c(a="character", b="character"),
includes="#include <string.h>", '
int len_a = Rf_length(a), len_b = Rf_length(b);
SEXP ans = PROTECT(allocVector(INTSXP, len_b));
memset(INTEGER(ans), 0, sizeof(int) * len_b);
int cmp, i = 0, j = 0;
while (i < len_a) {
const char *ap = CHAR(STRING_ELT(a, i));
while (j < len_b) {
cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
if (cmp > 0) {
j += 1;
} else break;
}
if (j == len_b)
break;
if (cmp == 0)
INTEGER(ans)[j++] = i + 1;
else if (cmp < 0) i += 1;
}
UNPROTECT(1);
return(ans);')
function(a, b) {
locale = Sys.getlocale("LC_COLLATE")
if (locale != "C") {
warning('temporarily trying to set LC_COLLATE to "C"')
Sys.setlocale("LC_COLLATE", "C")
on.exit(Sys.setlocale("LC_COLLATE", locale))
}
a0 <- a
lvls <- unique(a)
a <- sort(lvls)
o <- order(b)
idx <- .amatch(a, b[o])[order(o)]
f <- factor(a[idx[idx != 0]], levels=lvls)
split(which(idx != 0), f)[a0]
}
})
In comparison with this semi-friendly grep
f0 <- function(a, b) {
a0 <- a
a <- unique(a)
names(a) <- a
lapply(a, grep, b, fixed=TRUE)[a0]
}
that allows for (but doesn't pay too much of a price for) duplicate 'a' values the timings for @flodel's data set are
> microbenchmark(f0(a, b), f3(a, b), times=5)
Unit: milliseconds
expr min lq median uq max neval
f0(a, b) 431.03595 431.45211 432.59346 433.96036 434.87550 5
f3(a, b) 15.70972 15.75976 15.93179 16.05184 16.06767 5
Unfortunately, this simple algorithm fails when one element is a prefix of another
> str(f0(c("a", "ab"), "abc"))
List of 2
$ : chr "abc"
$ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
$ : chr "abc"
$ : chr(0)
Contrary to a comment, for this data set (the random number seed needs to be specified for reproducibility)
set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste0(genes, "_", rep(1:50, each=length(genes)), "_")
b0 <- paste0(a0, "1")
ite <- 50
lg <- 1000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
f3()
returns the same values as grep
> identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE))
[1] TRUE
The algorithms f0 and f3 have been modified to return indexes in a named list.
Answer 4
I tested out on my own data the different solutions proposed by @flodel and @Sven Hohenstein (Note that @Martin Morgan's method cannot be tested for the moment as it doesn't support when elements of a
that are prefix of other elements of a
).
IMPORTANT NOTE: altough all methods give the same result in my specific case, remind that they all have their own way, and thus can give different results depending on the structure of the data
Here is a quick summary (the results are shown below):
- In my tests,
length(a)
andlength(b)
are set to 200 or 400 and 2,000 or 10,000 respectively - there is only a single match of each value of
a
inb
- the best method really depends of the problem and all deserves to be tested for each specific cases
pmatch
always performs very well (notably for small length of vectorsa
andb
, say less than 100 and 1,000 respectively - not shown below),sapply(a, grep, b, fixed=T)
andreduced.match
(flodel's method) functions always perform better thansapply(a, grep, b))
andsapply(paste0("^", a), grep, b)
.
Here is the reproductible code along with the results of the tests
# set up the data set
library(microbenchmark)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")
# length(a)==200 & length(b)==2,000
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))
Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 188.810585 189.256705 189.827765
as.vector(sapply(paste0("^", a), grep, b)) 157.600510 158.113507 158.560619
as.vector(sapply(a, grep, b, fixed = TRUE)) 23.954520 24.109275 24.269991
unlist(reduced.match(a, b)) 7.999203 8.087931 8.140260
pmatch(a, b) 7.459394 7.489923 7.586329
uq max neval
191.412879 222.131220 100
160.129008 186.695822 100
25.923741 26.380578 100
8.237207 10.063783 100
7.637560 7.888938 100
# length(a)==400 & length(b)==2,000
ite <- 400
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))
Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 376.85638 379.58441 380.46107
as.vector(sapply(paste0("^", a), grep, b)) 314.38333 316.79849 318.33426
as.vector(sapply(a, grep, b, fixed = TRUE)) 49.56848 51.54113 51.90420
unlist(reduced.match(a, b)) 13.31185 13.44923 13.57679
pmatch(a, b) 15.15788 15.24773 15.36917
uq max neval
383.26959 415.23281 100
320.92588 346.66234 100
52.02379 81.65053 100
15.56503 16.83750 100
15.45680 17.58592 100
# length(a)==200 & length(b)==10,000
ite <- 200
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))
Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 975.34831 978.55579 981.56864
as.vector(sapply(paste0("^", a), grep, b)) 808.79299 811.64919 814.16552
as.vector(sapply(a, grep, b, fixed = TRUE)) 119.64240 120.41718 120.73548
unlist(reduced.match(a, b)) 34.23893 34.56048 36.23506
pmatch(a, b) 37.57552 37.82128 38.01727
uq max neval
986.17827 1061.89808 100
824.41931 854.26298 100
121.20605 151.43524 100
36.57896 43.33285 100
38.21910 40.87238 100
# length(a)==400 & length(b)==10500
ite <- 400
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))
Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 1977.69564 2003.73443 2028.72239
as.vector(sapply(paste0("^", a), grep, b)) 1637.46903 1659.96661 1677.21706
as.vector(sapply(a, grep, b, fixed = TRUE)) 236.81745 238.62842 239.67875
unlist(reduced.match(a, b)) 57.18344 59.09308 59.48678
pmatch(a, b) 75.03812 75.40420 75.60641
uq max neval
2076.45628 2223.94624 100
1708.86306 1905.16534 100
241.12830 283.23043 100
59.76167 88.71846 100
75.99034 90.62689 100