Performance of StringTokenizer class vs. split method in Java


In my software I need to split strings into words. I have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is best (in terms of performance)?

    StringTokenizer stokenize = new StringTokenizer(s, " ");
    while (stokenize.hasMoreTokens()) { ... }

or

    String[] splits = s.split(" ");
    for (int i = 0; i < splits.length; i++) { ... }
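For reference, here is a minimal, self-contained sketch of the two approaches being compared (the class name and the sample string are mine, added just to make the snippets runnable):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitDemo {
    public static void main(String[] args) {
        String s = "the quick brown fox";

        // Approach 1: StringTokenizer (default behaviour skips empty tokens)
        List<String> tokens = new ArrayList<String>();
        StringTokenizer stokenize = new StringTokenizer(s, " ");
        while (stokenize.hasMoreTokens())
            tokens.add(stokenize.nextToken());

        // Approach 2: String.split, which compiles a regex under the hood
        String[] splits = s.split(" ");

        System.out.println(tokens);                  // [the, quick, brown, fox]
        System.out.println(Arrays.toString(splits)); // [the, quick, brown, fox]
    }
}
```

Note the two are not always equivalent: StringTokenizer silently drops empty tokens (e.g. around consecutive spaces), while split returns them.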

If your data is already in a database and you need to parse a string of words, I suggest using indexOf repeatedly. It is many times faster than either solution.

However, getting the data out of a database is still likely to be much more expensive.
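A minimal sketch of the repeated-indexOf approach (the method name splitOnSpaces is mine, not from the original post):

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {
    // Splits on single spaces with indexOf/substring, avoiding regex overhead.
    // Method name is illustrative only.
    static List<String> splitOnSpaces(String s) {
        List<String> words = new ArrayList<String>();
        int pos = 0, end;
        while ((end = s.indexOf(' ', pos)) >= 0) {
            words.add(s.substring(pos, end));
            pos = end + 1;
        }
        words.add(s.substring(pos)); // the trailing word after the last space
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("a b c")); // [a, b, c]
    }
}
```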

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.StringTokenizer;
    import java.util.regex.Pattern;

    // Wrapper class added to make the snippet runnable.
    public class TokenizerBenchmark {
        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            for (int i = 100000; i < 100000 + 60; i++)
                sb.append(i).append(' ');
            String sample = sb.toString();

            int runs = 100000;
            for (int i = 0; i < 5; i++) {
                {
                    long start = System.nanoTime();
                    for (int r = 0; r < runs; r++) {
                        StringTokenizer st = new StringTokenizer(sample);
                        List<String> list = new ArrayList<String>();
                        while (st.hasMoreTokens())
                            list.add(st.nextToken());
                    }
                    long time = System.nanoTime() - start;
                    System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
                }
                {
                    long start = System.nanoTime();
                    Pattern spacePattern = Pattern.compile(" ");
                    for (int r = 0; r < runs; r++) {
                        List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                    }
                    long time = System.nanoTime() - start;
                    System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
                }
                {
                    long start = System.nanoTime();
                    for (int r = 0; r < runs; r++) {
                        List<String> list = new ArrayList<String>();
                        int pos = 0, end;
                        while ((end = sample.indexOf(' ', pos)) >= 0) {
                            list.add(sample.substring(pos, end));
                            pos = end + 1;
                        }
                    }
                    long time = System.nanoTime() - start;
                    System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
                }
            }
        }
    }

This prints:

    StringTokenizer took an average of 5.8 us
    Pattern.split took an average of 4.8 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 4.9 us
    Pattern.split took an average of 3.7 us
    indexOf loop took an average of 1.7 us
    StringTokenizer took an average of 5.2 us
    Pattern.split took an average of 3.9 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 5.1 us
    Pattern.split took an average of 4.1 us
    indexOf loop took an average of 1.6 us
    StringTokenizer took an average of 5.0 us
    Pattern.split took an average of 3.8 us
    indexOf loop took an average of 1.6 us

The cost of opening a file is about 8 ms. As the files are small, your cache may improve performance by a factor of 2-5x. Even so, you are going to spend ~10 hours just opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents x 30 words * 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
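The back-of-the-envelope arithmetic above can be checked directly (the throughput figures are the answer's estimates, not measurements):

```java
public class Estimate {
    public static void main(String[] args) {
        long docs = 19000000L;

        // Opening each file costs ~8 ms; a warm cache might cut that 2-5x.
        double openHours = docs * 8 / 1000.0 / 3600.0;   // roughly 42 h uncached
        System.out.printf("opening files: ~%.0f h uncached, ~%.0f h with a 4x cache speedup%n",
                openHours, openHours / 4);

        // Parsing: 19M docs * 30 words * 8 letters, at ~1 GB per 2 seconds.
        long bytes = docs * 30 * 8;                      // about 4.6 GB of text
        System.out.printf("parsing %d bytes: ~%.0f s%n", bytes, bytes / 1e9 * 2);
    }
}
```

So file-open overhead dominates by three orders of magnitude, which is why the choice of tokenizer barely matters here.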

If you want to improve performance, I suggest having far fewer files, e.g. by using a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

