Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have two files

encode

X.pattern.name       chr       start    stop    strand  score   p.value q.value matched.sequence
1   V_CETS1P54_01   chr1    98769545    98769554    +   11.42280    8.89e-05    NA  TCAGGATGTA
2   V_CETS1P54_01   chr1    152013037   152013046   +   11.98020    4.74e-05    NA  ACAGGAAGTT
3   V_CETS1P54_01   chr1    168932563   168932572   +   11.60860    7.59e-05    NA  ACCGGATGCT

encode.total

    chr     start           stop
1   chr1    58708485    58708713
2   chr1    58709084    58710538
3   chr1    98766295    98766639
4   chr1    98766902    98770338
5   chr1    107885506   107889414
6   chr1    138589531   138590856
7   chr1    138591180   138593378
8   chr1    152011746   152013185
9   chr1    152014263   152014695
10  chr1    168930561   168933076
11  chr1    181808064   181808906
12  chr1    184609002   184611519
13  chr1    193288453   193289567
14  chr1    193290105   193290490
15  chr1    193290744   193291092
16  chr1    196801920   196804954

I want to compare the two files, each entry by chr, start and stop. If the start and stop values in first file falls between the start and stop of the second file for the same chromosome, then that start & stop values in the first file should be replaced by the second file's start & stop values. I have written a for loop for that purpose but it is taking too long. What are the alternatives?

Code:

for(i in 1:nrow(encode))
{
     for(j in 1:nrow(encode.total))
     {
           if(encode[i,2]==encode.total[j,1])
           {
               if((encode[i,3]>=encode.total[j,2]) & (encode[i,4]<=encode.total[j,3]))
               {
                   encode[i,3]=encode.total[j,2]
                   encode[i,4]=encode.total[j,3]
               }
           }
     }
}

For the same purpose I also tried the GenomicRanges package like below. The size of my dataframes is huge and the merge function on them creates a VERY LARGE dataframe (>2 billion rows which is not allowed), although I eventually subset the dataframe in a smaller one. But merge() is taking a LOT of memory and terminating R.

gr1<-GRanges(seqnames=encode$chr,IRanges(start=encode$start,end=encode$end))
gr2<-GRanges(seqnames=encode.total$chr, IRanges(start=encode.total$start,end=encode.total$end))

ranges <- merge(as.data.frame(gr1),as.data.frame(gr2),by="seqnames",suffixes=c("A","B"))
ranges <- ranges[with(ranges, startB <= startA & endB >= endA),]
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
466 views
Welcome To Ask or Share your Answers For Others

1 Answer

Use the Bioconductor GenomicRanges package.

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

Suppose there are two data frames x0 and x1, like encode and encode.total in the original example. We'd like to make these into a GRanges instance. I did this with

library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

It will often be possible to simply say makeGRangesFromDataFrame(x0), or use standard R commands to create a GRanges instance 'by hand'. Here we use with(), so that we can write GRanges(chr, IRanges(start=start, end=stop)) instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop)).

The next step is to find overlaps between query (gr0) and subject (gr1)

hits = findOverlaps(gr0, gr1)

which leads to

> hits
Hits of length 3
queryLength: 3
subjectLength: 16
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           4 
 2         2           8 
 3         3          10 

Then updated the relevant start / end coordinates

ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]

giving

> gr0
GRanges with 3 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [ 98766902,  98770338]      *
  [2]     chr1 [152011746, 152013185]      *
  [3]     chr1 [168930561, 168933076]      *
  ---
  seqlengths:
   chr1
     NA

This will be fast up into the millions of ranges.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...