mgsub v1.0 Launched to CRAN

Multiple, simultaneous string substitutions done safely.

Mark Ewing

Official CRAN Launch

Earlier this week I submitted mgsub to CRAN, and after a couple of days it was accepted. Now it’s live! I’m very excited to have published my second package, and one that I think is a more valuable contribution than my first. The package also represented a few firsts for me: it’s the first package for which I wrote tests, checked code coverage and wrote a vignette. Woot!

In my nervous anticipation after submission I worried I might have missed the email indicating it went live, so I did a quick search for mgsub to see what would come back. It turns out there are at least 4 other implementations of mgsub in packages already on CRAN. Nothing kicks imposter syndrome into overdrive like seeing that 4 other people already did what you did. But the focus of my implementation is safety: it’s important to know that a string manipulation function is going to work the way you intend. If you’re processing enough strings (or even one big enough string), it is difficult to do QA on the output by hand. Spot checking could miss things, and if the function isn’t safe, there may be no way of accomplishing your goal.

So, I decided to download all 4 implementations and test them to see how they stack up. Get ready for lots of examples!

Contenders

The 4 contenders are qdap (which I covered in a previous post), bayesbio (which actually doesn’t export its function), bazar and textclean. Note that I’m not actually loading the libraries, because they would all collide on the name mgsub. Instead I call each function through its package namespace (for example, qdap::mgsub).

# library(mgsub)
# library(qdap)
# library(bayesbio) #not exported
# library(bazar)
# library(textclean)

To make testing and comparing easier, I wrote a wrapper function for each package which accepts a list containing the string to be modified, the vector of matches and the vector of replacements. This way I can put the functions in a list and just lapply over it (the name f for my list is super expressive). One thing to note is that I have ensured that each function accepts regular expression input (by setting fixed=FALSE where necessary).

mgsub = function(a){
  replace = a$replace
  names(replace) = a$match
  mgsub::mgsub(a$orig,replace)
}
qdap = function(a){
  qdap::mgsub(a$match,a$replace,a$orig,fixed=FALSE)
}
bayesbio = function(a){
  bayesbio:::mgsub(a$match,a$replace,a$orig)
}
bazar = function(a){
  bazar::mgsub(a$match,a$replace,a$orig)
}
textclean = function(a){
  textclean::mgsub(a$orig,a$match,a$replace,fixed = FALSE)
}
f = list(mgsub,qdap,bayesbio,bazar,textclean)
names(f) = c("mgsub","qdap","bayesbio","bazar","textclean")

Each test covers how well each package handles a different scenario in multiple, global string substitution. For each test I provide the original string, the matches, the replacements and the target result. Each test returns a table which contains the result of the call and indicates whether each library produced the target.
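The prepTable helper used to build those tables isn’t defined in this post. A minimal sketch of what such a helper might look like is below, assuming it takes the named list of results plus the target string and returns a data frame that htmlTable can render. The function body is a hypothetical reconstruction for illustration, not the original code.

```r
# Hypothetical reconstruction of prepTable(); the original definition is not shown.
prepTable <- function(results, target) {
  # Collapse each result to one string so it fits in a single table cell.
  res <- vapply(results,
                function(x) paste(as.character(x), collapse = " "),
                character(1))
  data.frame(
    Library = names(results),
    Result  = unname(res),
    Correct = unname(res == target),
    stringsAsFactors = FALSE
  )
}

# Example with made-up results for two imaginary libraries.
prepTable(list(libA = "yo ho, we go!", libB = "we"), "yo ho, we go!")
```

Each row pairs a library name with what it returned and a TRUE/FALSE flag for an exact match against the target.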

Simple Test

We’ll start off nice and easy. We’ll modify a string by replacing “hey” with “yo” and “let’s” with “we”.

a = list(
orig = "hey ho, let's go!",
match = c("hey","let's"),
replace = c("yo","we")
)
simple_target = "yo ho, we go!"
simple = lapply(f,function(x) x(a))

htmlTable(prepTable(simple,simple_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")
  Library    Result         Correct
1 mgsub      yo ho, we go!  TRUE
2 qdap       yo ho, we go!  TRUE
3 bayesbio   we             FALSE
4 bazar      yo ho, we go!  TRUE
5 textclean  yo ho, we go!  TRUE

Right away we see that bayesbio is doing something very different from what’s expected. Spoiler alert: this continues. Given that it’s not an exported function, it may have a very specific use case in mind. The other packages all make the correct substitutions.

Substring

Next we’ll test how the libraries protect against substring substitution. In this case we want to replace “the” with “an” and “they” with “we”. I specifically put “the” earlier in the list of matches. If the functions aren’t safe, they may detect “the” as a substring of “they”, and “they” would become “any”. What’s very problematic about this example is that “any” is a real word, which would make the error even more difficult to detect when scanning results.

a = list(
orig = "they don't know the answer",
match = c("the","they"),
replace = c("an","we")
)
substring_target = "we don't know an answer"
substring = lapply(f,function(x) x(a))

htmlTable(prepTable(substring,substring_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")
  Library    Result                    Correct
1 mgsub      we don't know an answer   TRUE
2 qdap       any don't know an answer  FALSE
3 bayesbio   we                        FALSE
4 bazar      any don't know an answer  FALSE
5 textclean  any don't know an answer  FALSE

All of the other packages made the substring mistake. Now, this was an engineered example: by providing “the” before “they”, anything that just applies matches in order will have this problem. The user could avoid it by sorting their matches by nchar, longest first. qdap actually does this by default when fixed = TRUE (the relevant argument is order.pattern, but it seems to be ignored when fixed = FALSE, even if it is set to TRUE). So, just know that if you’re only working with fixed matches, qdap would have worked correctly here.

qdap::mgsub(c("the","they"),c("an","we"),"they don't know the answer",fixed=TRUE)
## [1] "we don't know an answer"
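The effect of match order can be reproduced with base R alone. The sketch below uses a hypothetical helper, sequential_gsub, that applies fixed-string substitutions one after another. It is only meant to illustrate the ordering problem and is not how any of the packages are actually implemented.

```r
# Hypothetical helper: apply each substitution in turn, like a simple loop would.
sequential_gsub <- function(x, patterns, replacements) {
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

s <- "they don't know the answer"
m <- c("the", "they")
r <- c("an", "we")

# In the order given, "the" is replaced inside "they" first.
in_order <- sequential_gsub(s, m, r)

# Longest match first protects "they" from the shorter "the".
o <- order(nchar(m), decreasing = TRUE)
longest_first <- sequential_gsub(s, m[o], r[o])

print(in_order)       # "any don't know an answer"
print(longest_first)  # "we don't know an answer"
```

Sorting by nchar only helps with fixed strings, where the length of the pattern equals the length of the text it matches.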

Transpose words

Transposing words means taking a pair of words and replacing each with the other. Here, we replace “hey” with “ho” and “ho” with “hey”.

a = list(
orig = "hey ho, let's go!",
match = c("hey","ho"),
replace = c("ho","hey")
)
transpose_target = "ho hey, let's go!"
transpose = lapply(f,function(x) x(a))

htmlTable(prepTable(transpose,transpose_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")
  Library    Result              Correct
1 mgsub      ho hey, let's go!   TRUE
2 qdap       hey hey, let's go!  FALSE
3 bayesbio   hey                 FALSE
4 bazar      hey hey, let's go!  FALSE
5 textclean  hey hey, let's go!  FALSE

Every contender fails again. Note that in this case qdap fails even if fixed = TRUE. The problem (at least for bazar and textclean) is that they simply loop through the matches in order: “hey” is replaced with “ho”, leaving “ho ho, let’s go!”, and then every “ho” is replaced with “hey”, generating the result.

qdap::mgsub(c("hey","ho"),c("ho","hey"),"hey ho, let's go!",fixed=TRUE)
## [1] "hey hey, let's go!"
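The two looping steps described above can be traced by hand with base gsub. This is just a sketch of the looping behaviour, not code taken from any of the packages.

```r
s <- "hey ho, let's go!"

# Step 1: replace "hey" with "ho"; the string now holds two copies of "ho".
step1 <- gsub("hey", "ho", s, fixed = TRUE)

# Step 2: replace "ho" with "hey"; both the original and the new "ho" are hit.
step2 <- gsub("ho", "hey", step1, fixed = TRUE)

print(step1)  # "ho ho, let's go!"
print(step2)  # "hey hey, let's go!"
```

The second step cannot tell the “ho” that was in the original string apart from the one created by the first step.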

Shifting Words

This is similar to transposing words, except it’s a chain of shifts. By shifting each word one spot to the left (with wraparound) we check whether it’s a simple looping problem or something else. Also note that each match here has the same number of characters, so any placeholder or ordering work based on nchar would fail.

a = list(
orig = "hey, how are you?",
match = c("hey","how","are","you"),
replace = c("how","are","you","hey")
)
shift_target = "how, are you hey?"
shift = lapply(f,function(x) x(a))

htmlTable(prepTable(shift,shift_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")

  Library    Result             Correct
1 mgsub      how, are you hey?  TRUE
2 qdap       hey, hey hey hey?  FALSE
3 bayesbio   hey                FALSE
4 bazar      hey, hey hey hey?  FALSE
5 textclean  hey, hey hey hey?  FALSE

In this case, every word has been replaced with the last replacement value (again, due to the looping).
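One way to avoid the looping problem, at least for fixed strings, is to locate every match in the original string first and then swap them all in a single pass. The sketch below does that with base R’s gregexpr and regmatches. The helper name single_pass_sub is hypothetical, it handles fixed strings only, and it is not a description of how mgsub::mgsub works internally.

```r
# Hypothetical helper for fixed (non-regex) strings only.
single_pass_sub <- function(x, patterns, replacements) {
  # Try longer patterns first so substrings can't steal a match.
  o <- order(nchar(patterns), decreasing = TRUE)
  patterns <- patterns[o]
  replacements <- replacements[o]

  # \Q...\E makes PCRE treat each pattern literally.
  alternation <- paste0("\\Q", patterns, "\\E", collapse = "|")

  # Locate every match in the ORIGINAL string, then swap them all at once.
  hits <- gregexpr(alternation, x, perl = TRUE)
  regmatches(x, hits) <- lapply(
    regmatches(x, hits),
    function(found) replacements[match(found, patterns)]
  )
  x
}

transposed <- single_pass_sub("hey ho, let's go!", c("hey", "ho"), c("ho", "hey"))
shifted <- single_pass_sub("hey, how are you?",
                           c("hey", "how", "are", "you"),
                           c("how", "are", "you", "hey"))

print(transposed)  # "ho hey, let's go!"
print(shifted)     # "how, are you hey?"
```

Because the replacements are never scanned again, a replacement value that happens to equal another match is left alone.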

Regex

Next I test regular expression support (including backreferences).

a = list(
orig = "Dopazamine is not the same as dopachloride or dopamezamine and is still fake.",
match = c("[Dd]opa([^ ]*?mine)","fake"),
replace = c("Meta\\1","real")
)
regex_target = "Metazamine is not the same as dopachloride or Metamezamine and is still real."
regex = lapply(f,function(x) x(a))

htmlTable(prepTable(regex,regex_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")
  Library    Result                                                                         Correct
1 mgsub      Metazamine is not the same as dopachloride or Metamezamine and is still real.  TRUE
2 qdap       Metazamine is not the same as dopachloride or Metamezamine and is still real.  TRUE
3 bayesbio   real                                                                           FALSE
4 bazar      Metazamine is not the same as dopachloride or Metamezamine and is still real.  TRUE
5 textclean  Metazamine is not the same as dopachloride or Metamezamine and is still real.  TRUE

This one passes easily (every package is ultimately relying on some form of gsub with regular expressions enabled), though remember we did have to explicitly set fixed = FALSE for qdap and textclean. Forgetting to do so would have resulted in failed matches or weird replacements.
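The difference that fixed makes can be seen with base gsub on its own. The sketch below is an analogy using base R rather than qdap or textclean, and it passes perl = TRUE for the regex case as an assumption to keep the lazy quantifier unambiguous.

```r
s <- "Dopazamine is not the same as dopachloride or dopamezamine and is still fake."
pattern <- "[Dd]opa([^ ]*?mine)"

# Interpreted as a regex: the group is captured and reused via \\1.
as_regex <- gsub(pattern, "Meta\\1", s, perl = TRUE)

# Interpreted as a fixed string: the literal text "[Dd]opa(..." never occurs,
# so nothing is replaced and no error or warning is raised.
as_fixed <- gsub(pattern, "Meta\\1", s, fixed = TRUE)

print(as_regex)
print(identical(as_fixed, s))  # TRUE
```

The silent no-op in the fixed case is what makes a forgotten flag easy to miss.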

Regex Substring

Most of the functions failed to protect against substring matches, but what about when the shorter pattern (in terms of nchar) is a regular expression that matches a variable number of characters? Does the function actively determine which match is the substring and which is the longer string? The example below is super engineered to create the scenario.

a = list(
orig = "Dopazamine is a fake chemical",
match = c("Dopazamin","Do.*ne"),
replace = c("freakout","metazamine")
)
regex_substring_target = "metazamine is a fake chemical"
regex_substring = lapply(f,function(x) x(a))

htmlTable(prepTable(regex_substring,regex_substring_target),
          align="llr",
          css.cell = "padding-left: 1em; padding-right: 1em")
  Library    Result                         Correct
1 mgsub      metazamine is a fake chemical  TRUE
2 qdap       freakoute is a fake chemical   FALSE
3 bayesbio   metazamine                     FALSE
4 bazar      freakoute is a fake chemical   FALSE
5 textclean  freakoute is a fake chemical   FALSE

Note that only mgsub::mgsub correctly protects substrings in the presence of variable-length regular expressions, even when the variable-length pattern is presented later in the list of inputs.
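The reason sorting by nchar can’t rescue this case is that the length of a regular expression says little about the length of the text it matches. A small base R sketch (illustrative only, and not a description of mgsub’s internals) compares the two for the patterns above.

```r
s <- "Dopazamine is a fake chemical"
patterns <- c("Dopazamin", "Do.*ne")

# Length of each pattern as written.
pattern_length <- nchar(patterns)

# Length of the text each pattern actually matches in s.
matched_length <- vapply(
  patterns,
  function(p) attr(regexpr(p, s), "match.length"),
  integer(1),
  USE.NAMES = FALSE
)

print(pattern_length)  # 9 6
print(matched_length)  # 9 10
```

Ordering by pattern length puts “Dopazamin” first, while ordering by matched length correctly puts “Do.*ne” first.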

Speed

So I’ve shown several examples of cases where other implementations of mgsub fail to safely perform substitutions. But what is the performance cost?

Simple

library(microbenchmark)

s = "Hi, my name is Mark"
m = c("Hi","Mark")
r = c("Goodbye","Tom")
names(r) = m

smb = microbenchmark(
  mgsub = mgsub::mgsub(s,r,fixed=TRUE),
  qdap = qdap::mgsub(m,r,s),
  bayesbio = bayesbio:::mgsub(m,r,s),
  bazar = bazar::mgsub(m,r,s),
  textclean = textclean::mgsub(s,m,r)
)
smb = print(smb)
## Unit: microseconds
##       expr     min       lq      mean   median       uq     max neval
##      mgsub  95.545 140.9460 160.71527 161.9150 180.8780 304.137   100
##       qdap 184.890 242.8725 273.25684 286.8155 298.3025 422.291   100
##   bayesbio  16.411  28.6270  33.59056  33.5500  37.9260  88.252   100
##      bazar  26.257  41.9380  49.17671  48.3195  53.2430 114.143   100
##  textclean  43.397  68.1940  77.61020  77.6755  84.7875 216.251   100

In this simple case, mgsub and qdap are in the same order of magnitude. The other three are significantly faster: the slowest of them, textclean at a median of about 78 microseconds, is still roughly 2x faster than mgsub (about 162) or qdap (about 287).

Regex

s = "Dopazamine is not the same as Dopachloride and is still fake."
m = c("[Dd]opa(.*?mine)","fake")
r = c("Meta\\1","real")
names(r) = m

rmb = microbenchmark(
  mgsub = mgsub::mgsub(s,r),
  qdap = qdap::mgsub(m,r,s,fixed=FALSE),
  bayesbio = bayesbio:::mgsub(m,r,s),
  bazar = bazar::mgsub(m,r,s),
  textclean = textclean::mgsub(s,m,r,fixed=FALSE)
)
rmb = print(rmb)
## Unit: microseconds
##       expr     min       lq      mean   median       uq     max neval
##      mgsub 186.348 308.8780 345.39981 347.8975 381.4475 605.356   100
##       qdap 168.479 266.3935 275.46668 283.8980 297.9375 735.544   100
##   bayesbio  19.328  34.6445  41.66810  42.6675  49.5960  94.450   100
##      bazar  32.092  54.7015  59.58088  59.8070  69.1060  94.816   100
##  textclean  32.092  58.3485  65.30250  64.9120  73.3000 161.551   100

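When we add regular expressions (with backreferences), the gap between mgsub and the fastest packages grows even bigger: mgsub’s median roughly doubles (from about 162 to about 348 microseconds) while the other medians change comparatively little.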
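To put numbers on that gap, the medians from the two benchmark outputs above can be turned into ratios. The sketch below uses the medians rounded to whole microseconds, so the ratios are approximate.

```r
# Median runtimes in microseconds, rounded from the benchmark output above.
simple <- c(mgsub = 162, qdap = 287, bayesbio = 34, bazar = 48, textclean = 78)
regex  <- c(mgsub = 348, qdap = 284, bayesbio = 43, bazar = 60, textclean = 65)

# How many times slower mgsub is than each package (values below 1 mean faster).
slowdown <- rbind(
  simple = simple["mgsub"] / simple,
  regex  = regex["mgsub"] / regex
)
print(round(slowdown, 1))
```

Against bayesbio, for example, the ratio goes from just under 5x in the simple case to about 8x in the regex case.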

All that being said, things are still in microseconds, so it’s not necessarily a practical difference. And, is it worth being so much faster knowing you could be getting wrong results?

Conclusion

            Correctness Results                                           Median Runtime (microseconds)
Library     Simple  Substring  Transpose  Shift  Regex  Regex Substring   Simple  Regex
mgsub       Y       Y          Y          Y      Y      Y                 162     348
qdap        Y       N          N          N      Y      N                 287     284
bayesbio    N       N          N          N      N      N                 34      43
bazar       Y       N          N          N      Y      N                 48      60
textclean   Y       N          N          N      Y      N                 78      65

Of the 5 packages that have a function named mgsub, I’ve shown that only mgsub::mgsub produces correct multiple, global string substitution in every scenario tested here. While it is on the slower end for runtime, that’s a small price to pay for safety.
