I have a data frame, which I read with Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F)
and which looks like this:
> ab
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Inspecting sequence ID chr1:173244300-173244500 NA NA
2 V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
3 V$ATF3_Q6 | 34 (-) | 0.788 | 0.655 | agggaaCGACAcag
4 V$ATF3_Q6 | 102 (+) | 0.738 | 0.685 | cccTGAGCttagga
5 V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
72 V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag
73 Inspecting sequence ID chr1:173244350-173244550 NA NA
74 V$ATF3_Q6 | 52 (+) | 0.738 | 0.685 | cccTGAGCttagga
75 V$ATF3_Q6 | 160 (+) | 0.862 | 0.687 | gtcTGACCtggaga
76 V$CEBPB_01 | 57 (+) | 0.966 | 0.958 | agcttagGAAACtt
It contains millions of such repetitions, where the first line of each is: Inspecting sequence ID chr1:173244300-173244500
followed by some values, as can be seen above. I want to process it keeping the following things in mind:
- Extract the first line and split it on : and - so that I get three columns, like: chr1 173244300 173244500
- The 4th column should come from the V1 value of row 2: split it on $ and _ and take just the 2nd piece, which will be ATF3. I have 30 definite cases like this (let's call them names); in each case some will be observed while others will not (the 1st case is from row 1 to row 72, the second starts at row 73).
- If a name appears in a case, the value B will be assigned to that column; if not, the value U will be assigned.
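The two splitting steps described above can be sketched in base R as follows. This is only a minimal illustration on a hypothetical three-row sample (the vectors v1 and v4 stand in for the V1 and V4 columns shown earlier), not a full solution:

```r
# Hypothetical sample mirroring the V1 and V4 columns shown above
v1 <- c("Inspecting", "V$ATF3_Q6", "V$CEBPB_01")
v4 <- c("chr1:173244300-173244500", "(-)", "(+)")

# Header row: split the coordinate string on ":" or "-"
coords <- strsplit(v4[v1 == "Inspecting"], "[:-]")[[1]]

# Hit rows: split on "$" or "_" and keep the 2nd piece of each
pieces    <- strsplit(v1[v1 != "Inspecting"], "[$_]")
hit_names <- vapply(pieces, `[`, character(1), 2)

print(coords)     # chromosome, start, stop
print(hit_names)  # one name per hit row
```

Inside a bracket expression both - (placed last) and $ are treated as literal characters, so no escaping is needed.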
So based on my input, I want to get the following output:
chr start stop ATF3 CEBPB YY1 ..(All which appear e.g from row 1 to 72, ignoring duplicates)
chr1 173244300 173244500 B B B
chr1 173244350 173244550 B B U
I want a fixed number of columns in the header (I know there are 32 such names), so if a name appears in a case B will be assigned, otherwise U will be assigned.
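To make the target shape concrete, here is a rough base R sketch of one way such a fixed-column table could be built. It assumes the full set of names is known up front (all_names below is a hypothetical stand-in with only three of them) and uses a trimmed copy of the V1 and V4 columns rather than the real file:

```r
# Trimmed stand-in for the V1 and V4 columns of the data frame above
Match <- data.frame(
  V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01",
         "Inspecting", "V$ATF3_Q6", "V$CEBPB_01"),
  V4 = c("chr1:173244300-173244500", "(-)", "(+)", "(+)", "(+)",
         "chr1:173244350-173244550", "(+)", "(+)"),
  stringsAsFactors = FALSE
)

# Assumption: the fixed list of names is known in advance (3 shown here)
all_names <- c("ATF3", "CEBPB", "YY1")

is_header <- Match$V1 == "Inspecting"
case_id   <- cumsum(is_header)   # 1 for rows of the first case, 2 for the second, ...

# One row of chr/start/stop per header line, split on ":" or "-"
coords <- do.call(rbind, strsplit(Match$V4[is_header], "[:-]"))

# Name of every hit row: 2nd piece after splitting on "$" or "_"
hit_name <- vapply(strsplit(Match$V1[!is_header], "[$_]"), `[`, character(1), 2)

# Count hits per case and name; factor levels keep every case and every name,
# so unobserved names still get a column and duplicates are simply counted
counts <- table(factor(case_id[!is_header], levels = seq_len(sum(is_header))),
                factor(hit_name, levels = all_names))

# Turn counts into B (observed at least once) or U (not observed)
flags <- matrix(ifelse(as.vector(counts) > 0, "B", "U"),
                nrow = nrow(counts), dimnames = list(NULL, all_names))

result <- data.frame(chr   = coords[, 1],
                     start = as.integer(coords[, 2]),
                     stop  = as.integer(coords[, 3]),
                     flags, stringsAsFactors = FALSE)
print(result)
```

Because the columns come from the factor levels rather than from the observed values, the header stays the same for every case regardless of which names actually occur.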
If anybody can help me do this, it would be a great help.
Here is the dput of this sample dataframe:
> ab <- dput(Match[c(1:5,72:76), ])
structure(list(V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6",
"V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01", "Inspecting", "V$ATF3_Q6",
"V$ATF3_Q6", "V$CEBPB_01"), V2 = c("sequence", "|", "|", "|",
"|", "|", "sequence", "|", "|", "|"), V3 = c("ID", "19", "34",
"102", "24", "117", "ID", "52", "160", "57"), V4 = c("chr1:173244300-173244500",
"(-)", "(-)", "(+)", "(+)", "(+)", "chr1:173244350-173244550",
"(+)", "(+)", "(+)"), V5 = c("", "|", "|", "|", "|", "|", "",
"|", "|", "|"), V6 = c(NA, 0.877, 0.788, 0.738, 0.95, 0.996,
NA, 0.738, 0.862, 0.966), V7 = c("", "|", "|", "|", "|", "|",
"", "|", "|", "|"), V8 = c(NA, 0.622, 0.655, 0.685, 0.882, 0.984,
NA, 0.685, 0.687, 0.958), V9 = c("", "|", "|", "|", "|", "|",
"", "|", "|", "|"), V10 = c("", "aagtccCATCAggg", "agggaaCGACAcag",
"cccTGAGCttagga", "ccatcagGGAAGgg", "acttCCCATcttttaag", "",
"cccTGAGCttagga", "gtcTGACCtggaga", "agcttagGAAACtt")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(1L,
2L, 3L, 4L, 5L, 72L, 73L, 74L, 75L, 76L), class = "data.frame")