# Safe, Multiple String Substitutions with mgsub::mgsub

Solving an infrequent problem with a package.

## String substitutions

Note: the package I wrote was originally inspired by a challenge a coworker tossed out. It also happened to provide a solution to this SO question, which was really cool!

Substitutions in strings are best handled with regular expressions, which are an amazingly powerful and flexible tool for expressing patterns in strings. In the example below I want to find the four letters “dopa” and replace them with “meta”. I accomplish this with the base R function sub.

sub("dopa","meta","The chemical dopaziamine is fake, long live dopaziamine!")
## [1] "The chemical metaziamine is fake, long live dopaziamine!"

## Multiple substitutions

Sometimes you need to substitute many things, all at once.

### Multiple instances of the same pattern

In the first example, only the first instance of “dopa” was replaced with “meta”. If the goal is to replace all of them, I can use the base R function gsub (the g stands for global!).

gsub("dopa","meta","The chemical dopaziamine is fake, long live dopaziamine!")
## [1] "The chemical metaziamine is fake, long live metaziamine!"

However, the base regex functions aren’t vectorized over the pattern or replacement arguments (only over the input strings), so I can’t pass multiple patterns.

gsub(c("dopa","fake"),c("meta","real"),"The chemical dopaziamine is fake, long live dopaziamine!")
## Warning in gsub(c("dopa", "fake"), c("meta", "real"), "The chemical
## dopaziamine is fake, long live dopaziamine!"): argument 'pattern' has
## length > 1 and only the first element will be used
## Warning in gsub(c("dopa", "fake"), c("meta", "real"), "The chemical
## dopaziamine is fake, long live dopaziamine!"): argument 'replacement' has
## length > 1 and only the first element will be used
## [1] "The chemical metaziamine is fake, long live metaziamine!"
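
To make the distinction concrete, here is a small sketch (the input vector is made up for illustration) showing that gsub happily accepts several input strings, as long as there is a single pattern and a single replacement.

```r
# gsub is vectorized over its input (x), not over pattern or replacement.
inputs <- c("dopaziamine", "dopachloride")

# One pattern, one replacement, many inputs: no warning is raised.
result <- gsub("dopa", "meta", inputs)

# result is c("metaziamine", "metachloride")
result
```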

### Multiple patterns

There are a few string substitution methods that handle multiple patterns.

#### stringr

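The function str_replace_all in the stringr package supports vectorized patterns and replacements. However, called this way it pairs each pattern with its replacement and applies each pair separately, returning n results (where n is the length of the longer of the pattern and replacement vectors) instead of a single string with every substitution applied. So, this doesn’t really work.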

stringr::str_replace_all("The chemical dopaziamine is fake, long live dopaziamine!",
c("dopa","fake"),c("meta","real"))
## [1] "The chemical metaziamine is fake, long live metaziamine!"
## [2] "The chemical dopaziamine is real, long live dopaziamine!"

#### qdap

The function mgsub in the qdap package also supports vectorized patterns and replacements. It works by using placeholders and then iteratively applying the base regex functions.

qdap::mgsub(c("dopa","fake"),c("meta","real"),"The chemical dopaziamine is fake, long live dopaziamine!")
## [1] "The chemical metaziamine is real, long live metaziamine!"

#### chartr

There’s a special case, mostly meant for transliteration, that only works on single characters. chartr, a base R function, takes a string of old characters and a string of new characters and replaces them simultaneously, character by character, in the input string. It does not support regex or anything longer than single characters, so it’s pretty limited.

chartr("ho","oh","ho ho hoot")
## [1] "oh oh ohht"

### Problems with safety

I noted that qdap::mgsub uses placeholders. This can actually cause a problem in certain cases where the patterns are the same length. Consider the phrase “Hey, how are you?” where I want to shift each word to the left. So, “hey” should be replaced by “how”, “how” by “are”, etc. Note that each pattern to match is 3 characters long. Note also that I pass in the ignore.case=TRUE argument so my patterns won’t be bothered by capitalization.

qdap::mgsub(c("hey","how","are","you"),c("how","are","you","hey"),"Hey, how are you?",
fixed=FALSE,ignore.case=TRUE)
## [1] "hey, hey hey hey?"

The placeholders became indistinguishable and so every word was replaced with the same word.
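
A related hazard shows up whenever substitutions are applied one after another, because the output of one step becomes the input of the next. The sketch below is not qdap’s code; sequential_replace is a hypothetical helper written only to illustrate that kind of cascade with base R’s gsub.

```r
# Hypothetical helper for illustration only; this is NOT how qdap works.
# It applies each pattern/replacement pair in turn with base gsub.
sequential_replace <- function(string, patterns, replacements, ...) {
  for (i in seq_along(patterns)) {
    # Each step sees the result of the previous one, so earlier
    # replacements can be matched (and overwritten) by later patterns.
    string <- gsub(patterns[i], replacements[i], string, ...)
  }
  string
}

cascade <- sequential_replace("Hey, how are you?",
                              c("hey", "how", "are", "you"),
                              c("how", "are", "you", "hey"),
                              ignore.case = TRUE)
cascade
## [1] "hey, hey hey hey?"
```

Each word is rewritten into the next pattern’s target, so by the final step every word has turned into “you” and is then replaced by “hey”.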

### A safer option

I just published a GitHub repo which contains a new R package called mgsub. It is a safe alternative to qdap::mgsub, fully supporting regular expression matching and replacement in a way that guarantees safety. It also replaces longer matches first, so sub-matches won’t mess things up.

Rather than passing in vectors of matches and replacements (which could be recycled), I require a named list, where each name is a pattern and each value is its replacement.

Finally, the code is pure R (for now) with no dependencies, so you won’t get a lot of bloat.
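
To illustrate the idea of simultaneous, longest-first replacement, here is a minimal base R sketch. It is not the package’s implementation: simultaneous_replace is a hypothetical helper, and it assumes the names in the list are plain literal strings (no regex metacharacters) written in lower case when ignore.case is TRUE.

```r
# Hypothetical helper for illustration only; NOT the mgsub implementation.
# Assumes names(replacements) are literal strings without regex
# metacharacters, and lower case when ignore.case = TRUE.
simultaneous_replace <- function(string, replacements, ignore.case = FALSE) {
  patterns <- names(replacements)
  # Put longer patterns first so "they" is preferred over "the".
  patterns <- patterns[order(nchar(patterns), decreasing = TRUE)]
  combined <- paste(patterns, collapse = "|")

  # Locate every match in a single pass over the ORIGINAL string, so no
  # replacement can be matched again by another pattern.
  m <- gregexpr(combined, string, ignore.case = ignore.case)
  regmatches(string, m) <- lapply(regmatches(string, m), function(found) {
    key <- if (ignore.case) tolower(found) else found
    # Look up the replacement for each matched piece of text.
    vapply(key, function(k) replacements[[k]], character(1), USE.NAMES = FALSE)
  })
  string
}

shifted <- simultaneous_replace("Hey, how are you?",
                                list(hey = "how", how = "are",
                                     are = "you", you = "hey"),
                                ignore.case = TRUE)
shifted
## [1] "how, are you hey?"
```

Because all matches are found before anything is replaced, the order of the list no longer matters and nothing cascades.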

#### Installing from github

devtools::install_github("bmewing/mgsub")

#### Examples

First, the case that broke qdap.

mgsub::mgsub("Hey, how are you?",list("hey"="how","how"="are","are"="you","you"="hey"),
ignore.case=TRUE)
## [1] "how, are you hey?"

We can also try a complex regular expression. Note that we use regular expressions in both the match and the replacement, and it works exactly as expected. We only replace “dopa” with “meta” when it is part of a -mine group as opposed to a -ride group. Disclaimer: I know nothing about chemistry and stuff, so I don’t know if those are real.

mgsub::mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
list("[Dd]opa(.*?mine)"="Meta\\1","fake"="real"),
ignore.case=FALSE)
## [1] "Metazamine is not the same as Dopachloride and is still real."

We can see the substring protection here. Even though “the” is a substring of “they” and appears in the list first, “they” is given priority when it is also matched.

mgsub::mgsub("They don't understand the value of what they seek.",
list("the"="a","they"="we"),ignore.case=TRUE)
## [1] "we don't understand a value of what we seek."

You can also use it on single characters.

mgsub::mgsub("ho ho hoot",list("h"="o","o"="h"))
## [1] "oh oh ohht"

### Development plans

Once I write unit tests and get some error handling in place, I will submit to CRAN. Then I’ll start working on porting the code to C++ to test the performance enhancements. The overall goal is low overhead.
