In R 2.15.0 and data.table 1.8.9:
library(data.table)
d = data.table(a = 1:5, value = 2:6, key = "a")
d[J(3), value]
# a value
# 3 4
d[J(3)][, value]
# 4
I expected both to produce the same output (the 2nd one) and I believe they should.
To make clear that this is not a J syntax issue, the same expectation applies to the following expressions, which are equivalent to the ones above:
t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]
I would expect both of the above to return the exact same output.
So let me rephrase the question: why is data.table designed so that the key column is printed automatically in d[t, value]?
Update (based on answers and comments below): Thanks @Arun et al., I understand the design rationale now. The above prints the key because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. It seems to be designed this way for the following reason: since the by operation has to be performed when merging anyway, one might as well take advantage of that and not do another by if you are going to group by the key of the merge.
Now that said, I believe that's a syntax design flaw. The way I read the data.table syntax d[i, j, by = b] is:
take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b.
The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc.). I believe this should be the job of data.table itself: the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally whether the by expression is actually the key of the merge).