data.table and paste aggregation very slow for big data: bottleneck in large code, large data
I just finished profiling my R code, and it turns out that most operations run relatively fast, but for my large data this part of the code is taking most of the time:
collapsed.bam.dt =
    split.bam.dt[, list(read.parent.cigar = paste(block.read.parent.cigar, collapse = ""),
                        parent.ref.cigar  = paste(block.parent.ref.cigar,  collapse = ""),
                        read.ref.cigar    = paste(block.read.ref.cigar,    collapse = ""),
                        read.ref.start    = block.ref.start[1]),
                 by = read.index];
Note that block.read.parent.cigar, block.parent.ref.cigar, and block.read.ref.cigar are of type "character" and I want to concatenate them within each group. In the case of block.ref.start, all values within a group are the same, so I just take the first one.
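One thing I have been experimenting with is swapping paste(..., collapse = "") for stringi::stri_flatten(), which tends to have less per-call overhead when j is evaluated once per group. This is a minimal sketch, assuming the stringi package is installed (stri_flatten() concatenates with an empty separator by default); I have not benchmarked it on the full data:

library(data.table)
library(stringi)

# Same aggregation as above, with stri_flatten() doing the concatenation
collapsed.bam.dt =
    split.bam.dt[, list(read.parent.cigar = stri_flatten(block.read.parent.cigar),
                        parent.ref.cigar  = stri_flatten(block.parent.ref.cigar),
                        read.ref.cigar    = stri_flatten(block.read.ref.cigar),
                        read.ref.start    = block.ref.start[1]),
                 by = read.index];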
I am not sure whether others have found that aggregating character data in very large data.tables takes this long.
Most of my code runs with mclapply on 48 cores, so it is pretty fast, but beyond the chunked attempt sketched below I cannot think of other ways to improve/parallelize the line above. If you have any ideas/suggestions, please let me know.
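The best I have come up with so far is to partition the read.index groups across cores and run the same aggregation per chunk. This is a minimal sketch, not a tested implementation: n.cores, chunk.id, and chunks are placeholder names I introduced, and it assumes forked workers (Linux/macOS) that share split.bam.dt without copying:

library(data.table)
library(parallel)

n.cores  <- 48                                            # same core count as elsewhere
read.idx <- unique(split.bam.dt$read.index)
chunk.id <- cut(seq_along(read.idx), n.cores, labels = FALSE)  # assign groups to chunks
chunks   <- split(read.idx, chunk.id)

# Aggregate each chunk of groups in parallel, then stack the results
collapsed.bam.dt <- rbindlist(mclapply(chunks, function(ids) {
    split.bam.dt[read.index %in% ids,
                 list(read.parent.cigar = paste(block.read.parent.cigar, collapse = ""),
                      parent.ref.cigar  = paste(block.parent.ref.cigar,  collapse = ""),
                      read.ref.cigar    = paste(block.read.ref.cigar,    collapse = ""),
                      read.ref.start    = block.ref.start[1]),
                 by = read.index]
}, mc.cores = n.cores))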
Thanks!