I have tried to optimize the TigerXML parser in STA [↪] a bit. The results (graphs stored in memory, a 55 MiB corpus, on my Core 2 Duo 2.2 GHz):
- unoptimized: 61.01s
- optimized: 39.68s
That still seemed a bit too slow to me, so I decided to try some parallelization: since the ID reference remapping and graph store code in the parser is decoupled from parsing and the creation of the node data structures, I tried to put it into a separate thread. Both threads are mostly CPU-bound, so I quickly ran into the good old GIL problem. In short: in a Python process, only one thread may execute Python bytecode at a time. This is not a problem if you are doing a lot of I/O or lurking around in message loops, but it’s deadly for CPU-bound tasks. The second thread made the parser considerably slower, even when I used the shelve module instead of storing the graphs in memory (which involves a lot more disk I/O).
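To make the setup concrete, here is a minimal sketch of such a two-stage pipeline. It is not the actual STA code: the function names `parse`, `remap_and_store` and `run_pipeline`, the toy corpus format and the queue size are all made up for illustration. The parsing stage hands finished node lists to a second thread through a `queue.Queue`, and that thread does the ID remapping and storing.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def parse(raw_graphs, out_queue):
    """Stage 1 (hypothetical): turn raw records into node structures."""
    for graph_id, edges in raw_graphs:
        nodes = [{"id": src, "ref": dst} for src, dst in edges]
        out_queue.put((graph_id, nodes))
    out_queue.put(_DONE)


def remap_and_store(in_queue, store):
    """Stage 2 (hypothetical): remap ID references to positions, store the graph."""
    while True:
        item = in_queue.get()
        if item is _DONE:
            break
        graph_id, nodes = item
        index = {node["id"]: pos for pos, node in enumerate(nodes)}
        # Dangling references (IDs not present in the graph) become None.
        store[graph_id] = [index.get(node["ref"]) for node in nodes]


def run_pipeline(raw_graphs):
    work = queue.Queue(maxsize=64)  # bounded, so the parser cannot run far ahead
    store = {}
    consumer = threading.Thread(target=remap_and_store, args=(work, store))
    consumer.start()
    parse(raw_graphs, work)  # the parsing stage runs in the calling thread
    consumer.join()
    return store


# Toy corpus: (graph id, list of (node id, referenced node id)) pairs.
corpus = [("s1", [("a", "b"), ("b", "a")]), ("s2", [("x", "x"), ("y", "z")])]
result = run_pipeline(corpus)
print(result)
```

The structure is tidy, but under the GIL both stages execute Python bytecode and therefore take turns instead of running side by side. The queue hand-off and the lock contention only add overhead on top, which fits the slowdown described above.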
At a time when multi-core CPUs are becoming a commodity, more fine-grained locking would be desirable in CPython, or, even better, more advanced types of parallelization support.
There are other approaches to parallel computing in Python readily available (based on forks, some with shared memory, some with pipes [↪]), but they seem too heavyweight to be dragged in for this small problem: a non-trivial dependency just to shave off some seconds from the parsing process is not a good tradeoff.
