Scala - Preferred way of processing this data with parallel arrays
Imagine a sequence of java.io.File objects. The sequence is not in any particular order, and it gets populated after a directory traversal. The names of the files can look like this:

```
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
```
Basically, there can be 3 types of files. The first type are simple ones, which just have a .bin extension. The second type of file is formed of 5 pieces, from _x1.bin till _x5.bin. And the third type of file can be formed of 10 smaller parts, from _part1 till _part10. I know the naming may seem strange, but I have to work with it :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how I can perform the reduce/accumulation part, since multiple threads will be working on the same array.
```scala
val allBinFiles = allBins.toArray // an Array of java.io.File
```

I was thinking of handling it like this:
```scala
import java.io.File
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

// asScala wraps the synchronized Java map as a mutable.Map,
// which is what provides getOrElseUpdate
val mapAcumulator = java.util.Collections.synchronizedMap(
  new java.util.HashMap[String, ListBuffer[File]]()).asScala

allBinFiles.par.foreach { file =>
  file match {
    // e.g. for /some/x_file_x4.bin, nameTillPart is /some/x_file
    case ComposedOf5Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    case ComposedOf10Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    // simple file, without pieces
    case _ =>
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
  }
}
```
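The ComposedOf5Name and ComposedOf10Name extractors are not defined in the question; a minimal regex-based sketch of what they might look like, assuming the naming scheme described above (the patterns are my guess, not the asker's code):

```scala
import java.io.File

// Hypothetical extractor for the _x1.bin .. _x5.bin scheme:
// yields the path prefix before the _xN suffix.
object ComposedOf5Name {
  private val Pattern = """(.*)_x[1-5]\.bin""".r
  def unapply(f: File): Option[String] = f.toString match {
    case Pattern(prefix) => Some(prefix)
    case _               => None
  }
}

// Hypothetical extractor for the _part1.bin .. _part10.bin scheme.
object ComposedOf10Name {
  private val Pattern = """(.*)_part(?:[1-9]|10)\.bin""".r
  def unapply(f: File): Option[String] = f.toString match {
    case Pattern(prefix) => Some(prefix)
    case _               => None
  }
}
```

With these in scope, `new File("/some/other_file_x3.bin")` matches `ComposedOf5Name("/some/other_file")`.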
I was thinking of doing it as shown in the code above: having extractors for the files, and using part of the path as the key in the map. For example, the key /some/x_file can hold the values /some/x_file_x1.bin through /some/x_file_x5.bin. Still, I think there is a better way of handling this. I'm interested in your opinions.
The alternative is to use groupBy:

```scala
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x)  => x
  case ComposedOf10Name(x) => x
  case f                   => f.toString
}
```
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files at this point:

```scala
val sqmp = mp.map { case (k, v) => (k, v.seq) }.seq
```
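To make the answer concrete, here is a self-contained sketch of the whole pipeline on sample names from the question, with inline regexes standing in for the extractors (assuming Scala 2.12, or 2.13+ with the scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`):

```scala
import java.io.File

// Assumed patterns for the two multi-part naming schemes.
val X5   = """(.*)_x[1-5]\.bin""".r
val Part = """(.*)_part(?:[1-9]|10)\.bin""".r

val allBinFiles = Array(
  new File("/some/file.bin"),
  new File("/some/other_file_x1.bin"),
  new File("/some/other_file_x2.bin"),
  new File("/some/x_file_part1.bin"),
  new File("/some/x_file_part2.bin"))

// Group in parallel: pieces of the same file share a key (the path prefix),
// simple files use their full path as the key.
val groups = allBinFiles.par.groupBy { f =>
  f.toString match {
    case X5(prefix)   => prefix
    case Part(prefix) => prefix
    case other        => other
  }
}

// Convert back to a sequential Map[String, Seq[File]]
val sqGroups = groups.map { case (k, v) => (k, v.seq) }.seq
```

On these five files, sqGroups ends up with three keys: /some/other_file, /some/x_file, and /some/file.bin.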
To ensure that parallelism kicks in, make sure you have enough elements in the parallel array (10k+).
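As a side note on the accumulation concern raised in the question: getOrElseUpdate on a wrapped synchronized map is a check-then-act sequence, so another option (my suggestion, not from either post) is java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent is atomic per key:

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Hypothetical key function; the X5 regex stands in for ComposedOf5Name.
val X5 = """(.*)_x[1-5]\.bin""".r
def keyFor(f: File): String = f.toString match {
  case X5(prefix) => prefix
  case other      => other
}

val acc = new ConcurrentHashMap[String, java.util.List[File]]()

Array(new File("/some/a_x1.bin"), new File("/some/a_x2.bin"), new File("/some/b.bin"))
  .par // Scala 2.12, or 2.13+ with scala-parallel-collections
  .foreach { f =>
    // computeIfAbsent creates the per-key list at most once, even under
    // contention; the synchronizedList makes the concurrent add calls safe.
    acc.computeIfAbsent(keyFor(f),
        (k: String) => java.util.Collections.synchronizedList(new java.util.ArrayList[File]()))
      .add(f)
  }
```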