I am trying to extract substrings from unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I pass this vector of character values to extract exact matches from unstructured text?
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub, where I remove all non-matches and then remove the rows with empty values. I have also been working with str_extract
from the stringr package, but haven't had success getting the arguments for the regular expression correct. Any assistance would be greatly appreciated!
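For reference, here is a minimal sketch of how the str_extract route might look. It is one possible approach, not a definitive one. It assumes the country names contain no regex metacharacters (otherwise they would need escaping first) and that only the first match per row is wanted. The names pattern, matched, and result are made up for illustration.

```r
library(stringr)

countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

# Collapse the vector into a single alternation pattern.
# \\b anchors each alternative on word boundaries so partial words are not matched.
# Assumption: the names contain no regex metacharacters.
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# str_extract returns the first match in each string, or NA when there is none.
matched <- str_extract(text.df$text, pattern)

# Keep only the rows where a country was found.
keep <- !is.na(matched)
result <- data.frame(ID = text.df$ID[keep], text = matched[keep])
```

With the sample data above, result should keep IDs 1, 4, and 5 with the matched country name in the text column. If several countries can appear in one row, str_extract_all would be the function to look at instead.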