Scala - Preferred way of processing this data with parallel arrays
Imagine a sequence of java.io.File objects. The sequence is not in any particular order, and it gets populated after a directory traversal. The names of the files can look like this:

```
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
```
Basically, there can be 3 types of files. The first type are simple ones, which just have a .bin extension. The second type of file is formed of 5 pieces, from _x1.bin till _x5.bin. And the third type of file can be formed of 10 smaller parts, from _part1 till _part10. I know the naming may seem strange, but I have to work with it :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how I can perform the reduce/accumulation part, since multiple threads will be working on the same array.
```scala
val allBinFiles = allBins.toArray // an Array of java.io.File
```

I was thinking of handling it like this:
```scala
import java.io.File
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

// asScala wraps the synchronized Java map as a mutable.Map,
// which is what provides getOrElseUpdate
val mapAcumulator = java.util.Collections.synchronizedMap(
  new java.util.HashMap[String, ListBuffer[File]]()).asScala

allBinFiles.par.foreach { file =>
  file match {
    // e.g. for /some/x_file_x4.bin, nameTillPart is /some/x_file
    case ComposedOf5Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    case ComposedOf10Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    // simple file, without pieces
    case _ =>
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
  }
}
```
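The ComposedOf5Name and ComposedOf10Name extractors are not defined in the question; a minimal regex-based sketch of what they might look like, assuming the naming scheme described above (the patterns are my guess, not the asker's code):

```scala
import java.io.File

// Hypothetical extractor for the _x1.bin .. _x5.bin scheme:
// yields the path prefix before the _xN suffix.
object ComposedOf5Name {
  private val Pattern = """(.*)_x[1-5]\.bin""".r
  def unapply(f: File): Option[String] = f.toString match {
    case Pattern(prefix) => Some(prefix)
    case _               => None
  }
}

// Hypothetical extractor for the _part1.bin .. _part10.bin scheme.
object ComposedOf10Name {
  private val Pattern = """(.*)_part(?:[1-9]|10)\.bin""".r
  def unapply(f: File): Option[String] = f.toString match {
    case Pattern(prefix) => Some(prefix)
    case _               => None
  }
}
```

With these in scope, `new File("/some/other_file_x3.bin")` matches `ComposedOf5Name("/some/other_file")`.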
I was thinking of doing it as shown in the code above: having extractors for the files, and using part of the path as the key in the map. For example, the key /some/x_file can hold the values /some/x_file_x1.bin through /some/x_file_x5.bin. Still, I think there is a better way of handling this. I'm interested in your opinions.
The alternative is to use groupBy:

```scala
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x)  => x
  case ComposedOf10Name(x) => x
  case f                   => f.toString
}
```
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files at this point:

```scala
val sqmp = mp.map { case (k, v) => (k, v.seq) }.seq
```
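To make the answer concrete, here is a self-contained sketch of the whole pipeline on sample names from the question, with inline regexes standing in for the extractors (assuming Scala 2.12, or 2.13+ with the scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`):

```scala
import java.io.File

// Assumed patterns for the two multi-part naming schemes.
val X5   = """(.*)_x[1-5]\.bin""".r
val Part = """(.*)_part(?:[1-9]|10)\.bin""".r

val allBinFiles = Array(
  new File("/some/file.bin"),
  new File("/some/other_file_x1.bin"),
  new File("/some/other_file_x2.bin"),
  new File("/some/x_file_part1.bin"),
  new File("/some/x_file_part2.bin"))

// Group in parallel: pieces of the same file share a key (the path prefix),
// simple files use their full path as the key.
val groups = allBinFiles.par.groupBy { f =>
  f.toString match {
    case X5(prefix)   => prefix
    case Part(prefix) => prefix
    case other        => other
  }
}

// Convert back to a sequential Map[String, Seq[File]]
val sqGroups = groups.map { case (k, v) => (k, v.seq) }.seq
```

On these five files, sqGroups ends up with three keys: /some/other_file, /some/x_file, and /some/file.bin.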
To ensure that parallelism kicks in, make sure you have enough elements in the parallel array (10k+).
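As a side note on the accumulation concern raised in the question: getOrElseUpdate on a wrapped synchronized map is a check-then-act sequence, so another option (my suggestion, not from either post) is java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent is atomic per key:

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Hypothetical key function; the X5 regex stands in for ComposedOf5Name.
val X5 = """(.*)_x[1-5]\.bin""".r
def keyFor(f: File): String = f.toString match {
  case X5(prefix) => prefix
  case other      => other
}

val acc = new ConcurrentHashMap[String, java.util.List[File]]()

Array(new File("/some/a_x1.bin"), new File("/some/a_x2.bin"), new File("/some/b.bin"))
  .par // Scala 2.12, or 2.13+ with scala-parallel-collections
  .foreach { f =>
    // computeIfAbsent creates the per-key list at most once, even under
    // contention; the synchronizedList makes the concurrent add calls safe.
    acc.computeIfAbsent(keyFor(f),
        (k: String) => java.util.Collections.synchronizedList(new java.util.ArrayList[File]()))
      .add(f)
  }
```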