Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am trying to check the overlap between one main file and several other files (overlap_files in the code below).

Main file:

chr1    8014812 8014812
chr1    22371954    22371954
chr1    35328666    35328666

Example of overlap_files:

chr1    8014812 8014812
chr1    22371954    22371954

My code looks like this:

# Load variants
a1 <- read.table("main.txt", header=FALSE)

#Begin looping
overlap=lapply(overlap_files, 
function(x) {

#Load in "x" file skipping empty files
t=if(!file.size(x) == 0) {
read.table(x, header=FALSE)
}
#Overlap
apply(a1, 1, function(x) 
    ifelse(any(x[1]==t$V1 & x[2]==t$V2 & x[3]==t$V3), '1','0')) 
})

Although the first two rows exist in both files, the output marks the first variant as 0 (it should be 1), the second as 1 (correct) and the third as 0 (correct). This seems to be caused by the difference in length (i.e. 8014812 has 7 digits, while the other two numbers have 8 digits). Is there a way of fixing this? Thank you.



1 Answer

From your example, I am not entirely sure what the separators in your files are (tabs?).

Either way, I would propose the following approach:

  1. Read in files as data frames (one per file)
  2. Use one of the dplyr join functions (e.g. inner_join or semi_join) to get all rows that match; the by argument lets you match across multiple columns
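
The two steps above could be sketched roughly as follows. This is only an illustration, not tested against your real files: the data frames are typed in by hand from the example rows in the question, and flag_overlap is a made-up helper name. Joining on whole columns also avoids apply(), which converts the data frame to a character matrix and pads the numbers to equal width; that is the likely reason the 7-digit position fails to match.

```r
library(dplyr)

# Hand-typed stand-ins for main.txt and one of the overlap files
# (in practice these would come from read.table(..., header = FALSE)).
main <- data.frame(
  V1 = c("chr1", "chr1", "chr1"),
  V2 = c(8014812L, 22371954L, 35328666L),
  V3 = c(8014812L, 22371954L, 35328666L)
)
other <- data.frame(
  V1 = c("chr1", "chr1"),
  V2 = c(8014812L, 22371954L),
  V3 = c(8014812L, 22371954L)
)

# Hypothetical helper: returns 1 for rows of `main` that also occur in
# `other` (matching on all three columns), and 0 otherwise.
flag_overlap <- function(main, other) {
  # An empty or missing overlap file means nothing can match
  if (is.null(other) || nrow(other) == 0) {
    return(rep(0L, nrow(main)))
  }
  # Deduplicate so the join cannot add extra rows to `main`
  hits <- other %>%
    distinct(V1, V2, V3) %>%
    mutate(overlap = 1L)
  main %>%
    left_join(hits, by = c("V1", "V2", "V3")) %>%
    mutate(overlap = coalesce(overlap, 0L)) %>%
    pull(overlap)
}

flag_overlap(main, other)
```

For several files, the same helper could be called inside the lapply() over overlap_files, after reading each non-empty file with read.table().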
