I have been working on this problem since last night and cannot figure out how to do it.
What I want is to match the strings in df1 against the strings in df2 and extract the ones that match.
Here is what I do:
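# a function that maps every delimited token in x to the row number (ID) it came from
normalize <- function(x, delim) {
  # strip parentheses so that "(A,B)" is treated like "A,B"
  x <- gsub(")", "", x, fixed=TRUE)
  x <- gsub("(", "", x, fixed=TRUE)
  # row i is repeated once per token: number of delimiters in the row + 1
  idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
  # named "tokens" rather than "names" to avoid shadowing base::names
  tokens <- unlist(strsplit(as.character(x), delim))
  return(setNames(idx, tokens))
}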
# build the lookup (token -> row number) from the first column of df2
lookup <- normalize(df2[,1], ",")
# a function to match the tokens of s against the lookup and give the IDs
process <- function(s) {
  # df2 row for each token of s; NA where the token does not occur in df2
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  # change the last line to "return(as.character(pos))" to get only the result as in the comment
  return(paste(s[found], pos, sep="-"))
}
Then I get the results like this:
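# setNames() labels each element with its df1 column name (lapply alone returns an unnamed list)
res <- setNames(lapply(colnames(df1), function(x) process(normalize(df1[,x], ";"))), colnames(df1))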
This gives me the row number of each string in df1 together with the row number of the string in df2 that it matched. The output looks like this:
> res
$s1
[1] "3-4" "4-1" "5-4"
$s2
[1] "2-4" "3-15" "7-16"
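To make the "df1 row-df2 row" format concrete, here is a small self-contained sketch of the same pipeline. The data frames are made up for illustration (they are not my real data), and normalize is written with lengths(strsplit(...)), which I assume is equivalent to the delimiter count above for inputs without empty or trailing tokens.

```r
# Toy data (hypothetical, only to illustrate the output format)
df1 <- data.frame(
  s1 = c("A;B", "C", "D;E"),
  s2 = c("F", "B;G", "H"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(v = c("(B,X)", "C,Y", "Z", "E,F"), stringsAsFactors = FALSE)

# Map every token to the row number it came from
normalize <- function(x, delim) {
  x <- gsub(")", "", x, fixed = TRUE)
  x <- gsub("(", "", x, fixed = TRUE)
  pieces <- strsplit(as.character(x), delim, fixed = TRUE)
  idx <- rep(seq_along(pieces), times = lengths(pieces))
  setNames(idx, unlist(pieces))
}

# token -> df2 row
lookup <- normalize(df2[, 1], ",")

# Pair each matching df1 row with the df2 row holding the same token
process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  paste(s[found], lookup_try[found], sep = "-")
}

res <- lapply(colnames(df1), function(x) process(normalize(df1[, x], ";")))
names(res) <- colnames(df1)
print(res)
# $s1: "1-1" "2-2" "3-4"
# $s2: "1-4" "2-1"
```

Here "3-4" in s1 means that a token in row 3 of df1$s1 (the "E") also occurs in row 4 of df2.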
The first column, IDs, is the row number in df2 that matched strings in df1. The second column, No, is the number of times it matched. The third column, ID-col-n, is the row number of each string in df1 that matched that string, plus its column name. The fourth column is the string from the first column of df1 that matched that string, the fifth column is the string from the second column of df1 that matched it, and so on.
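One possible way to build a table with these columns is sketched below in base R. This is only a sketch under assumptions: the toy data and the helper to_long are hypothetical, the column is called ID_col_n instead of ID-col-n so that it is a valid R name, and I assume the "string which matched" columns should hold the matched tokens rather than the whole df1 cell.

```r
# Toy data (hypothetical)
df1 <- data.frame(s1 = c("A;B", "C", "D;E"), s2 = c("F", "B;G", "H"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v = c("(B,X)", "C,Y", "Z", "E,F"), stringsAsFactors = FALSE)

# Hypothetical helper: one row per token, remembering the source row
to_long <- function(x, delim) {
  x <- gsub("[()]", "", as.character(x))
  pieces <- strsplit(x, delim, fixed = TRUE)
  data.frame(row = rep(seq_along(pieces), times = lengths(pieces)),
             token = unlist(pieces), stringsAsFactors = FALSE)
}

long2 <- to_long(df2[, 1], ",")
names(long2)[1] <- "IDs"                      # df2 row number

long1 <- do.call(rbind, lapply(colnames(df1), function(col) {
  d <- to_long(df1[, col], ";")
  d$col <- col                                # remember the df1 column
  d
}))

# Every (df2 row, df1 row, df1 column) combination that shares a token
hits <- merge(long2, long1, by = "token")
hits <- hits[order(hits$IDs, hits$col, hits$row), ]
hits$id_col <- paste(hits$row, hits$col, sep = "-")

ids <- sort(unique(hits$IDs))
summary_tbl <- data.frame(IDs = ids)
summary_tbl$No <- vapply(ids, function(i) sum(hits$IDs == i), integer(1))
summary_tbl$ID_col_n <- vapply(ids, function(i)
  paste(hits$id_col[hits$IDs == i], collapse = ","), character(1))
# One column of matched tokens per df1 column
for (col in colnames(df1)) {
  summary_tbl[[col]] <- vapply(ids, function(i)
    paste(hits$token[hits$IDs == i & hits$col == col], collapse = ","),
    character(1))
}
print(summary_tbl)
```

With this toy data, df2 row 1 gets No = 2 and ID_col_n = "1-s1,2-s2", because its token "B" appears in row 1 of s1 and row 2 of s2. Rows of df2 with no match (row 3 here) are left out of the table.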