I just stumbled upon this video from the 2011 RuPy conference. In it, Dave Beazley shows some pretty weird corner cases of Python and Ruby threading performance, then explains in detail why the Global Interpreter Lock (GIL) is the actual show-stopper for threading in both languages (since both use pretty much the same model).
Also, I think the question period at the end is very interesting: the attendees ask some pretty relevant questions, and Mr. Beazley has good answers to them.
Basically, there has already been an effort to get rid of the GIL, back in Python 1.4, by placing mutex locks around mutable data structures, ref-count handling and other relevant places to make the Python interpreter actually thread-safe. But the GIL was brought back soon afterwards because the change made the interpreter roughly 2x slower, even without threading.
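It turns out that Python's ref-counting method of garbage collection is the thread and/or performance killer: every object carries a reference count that changes all the time, so once several threads can touch the same object, each of those updates has to be protected somehow.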
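To get a feel for how often reference counts move, here is a small sketch using `sys.getrefcount` from the standard library. The names `obj` and `holders` are just for illustration, and the absolute numbers printed by `getrefcount` vary between CPython versions, so only the differences are compared.

```python
import sys

obj = object()

# Baseline count (includes temporary references made by the call itself).
before = sys.getrefcount(obj)

# Putting the object in a list three times creates three new references,
# and each one bumps the counter stored inside the object.
holders = [obj, obj, obj]
during = sys.getrefcount(obj)

# Dropping the list releases those references again.
del holders
after = sys.getrefcount(obj)

print("extra references while the list existed:", during - before)
print("extra references after deleting it:", after - before)
```

Every one of those increments and decrements is a write to shared memory, which is the part that needs either a global lock or some finer-grained protection when threads are involved.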
Also, in version 3.2, Python tried to make some changes to the GIL to make it behave a bit better. While the changes fixed some cases, they haven't fixed all of the corner cases, and even made some of them worse.
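A rough way to see the effect yourself is the kind of CPU-bound countdown used in GIL demonstrations: run it twice in a row, then run it in two threads. This is only a sketch, assuming a standard CPython build with the GIL enabled. The function names and the value of `N` are my own choices, and the timings depend entirely on your machine, so no expected numbers are shown. `sys.getswitchinterval()` is the knob that came with the 3.2 GIL rewrite.

```python
import sys
import threading
import time

N = 1_000_000  # arbitrary workload size, chosen only for the demo


def count_down(n):
    # Pure-Python CPU-bound loop: it holds the GIL while it runs.
    while n > 0:
        n -= 1


def run_sequential():
    count_down(N)
    count_down(N)


def run_threaded():
    threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


sequential_time = timed(run_sequential)
threaded_time = timed(run_threaded)

print("GIL switch interval (seconds):", sys.getswitchinterval())
print("sequential: %.3f s" % sequential_time)
print("two threads: %.3f s" % threaded_time)
```

With the GIL in place, the threaded run should not be expected to finish in half the time, since only one thread executes Python bytecode at any moment.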
In some cases, to get real threading in Python, escaping to C and then releasing the GIL within the C code will give you threading that's unburdened by the interpreter. But in some cases that's too complicated for no good reason. The other way to get parallelism in Python is to create other processes, but then you're duplicating the whole Python interpreter and you can't share variables directly: you need to implement a means of communication between processes.
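You don't always have to write the C yourself to benefit from the first approach. As a sketch: the `hashlib` documentation says the GIL is released while hashing data larger than 2047 bytes, so hashing several big buffers from a thread pool lets the C code run without the interpreter lock. The helper names `digest` and `digest_all`, the buffer size and the worker count are assumptions made for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block):
    # The hashing happens in C; for large inputs hashlib releases the GIL
    # while it works, so other threads can run in the meantime.
    return hashlib.sha256(block).hexdigest()


def digest_all(blocks, workers=4):
    # Plain threads, no extra processes and no inter-process communication.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blocks))


# Four 4 MiB buffers, each filled with a different byte value.
blocks = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]

threaded = digest_all(blocks)
sequential = [digest(b) for b in blocks]

print("same digests either way:", threaded == sequential)
```

The threads here share the `blocks` list directly, which is exactly what you give up when you go the multi-process route.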
So, now the question that remains is: how could that be made better?
The GIL seems to me like a longstanding problem for Python multithreading performance. Refcounting GC has its limits when it comes to having multiple concurrent threads. It's interesting to note that Lua doesn't have a GIL, but it has a tracing GC instead of refcounting. LuaJIT, a JIT compiler for Lua, has documented a design for a new GC that uses a generational, write-barrier based method (http://wiki.luajit.org/New-Garbage-Collector/01fd5e5ca4f95d45e0c4b8a98b49f2b656cc23dd).
This isn't to say that tracing GCs are a magic bullet of any kind. Anyone who has used (or debugged) Java programs knows that you can easily create memory leaks when you're not careful (keeping object references in long-lived data structures, etc.).