Python concurrency lessons
Now that everyone is talking about and using async/await
and event loops everywhere, I thought I would go to the basics (well not to the fundamentals, but practical basics) and write a small program to download a bunch of text files and process them.
The program
Fetch n
ebooks from project gutenberg site and do some elemental processing on the content.
Some of the code snippets:
def download(url: str) -> str:
"""download and decode content from a url."""
= requests.get(url)
page = url.split("/")[-1]
name return page.content.decode("utf-8"), f"texts/{name}"
def save(content: str, fname: str = None):
"""save content to file with name."""
with open(fname, "w") as fh:
fh.write(content)
def process(string: list) -> list:
"""cpu bound tasks"""
...
...return string
def main():
= [
urls "https://www.gutenberg.org/cache/epub/376/pg376.txt",
"https://www.gutenberg.org/files/84/84-0.txt",
"https://www.gutenberg.org/cache/epub/844/pg844.txt",
]
# call download
# call save
#call string process
The full source code is on github at pyjuggle.
Python and concurrency
Let’s get this straight, concurrency and parallelism are not the same. Concurrency means, running multiple programs as though they are running at the same time, where as parallelism is running the programs truly in parallel in multiple cores of your machine.
The interesting aspect of Python (Ruby and few others) is that there is something called the global interpreter lock or GIL. This is part of the cpython implementation, where only one Python object can have access to the interpreter at one time. This means that, we cannot run two threads of execution parallely in Python but that all threads of execution are time sliced. You can read more on GIL here.
Threads come handy though when there is a lot of IO involved or when there are subprocess calls, where the interpreter is handing-off to work to another subsystem, and a new thread of execution can start, while the previous thread waits for a response.
When we need to real multi-core parallel tasks in Python, mulitprocessing library comes to our favour. It essentially spins up multiple Python interpreters and works on orthogal exlusive chunks of data, this method is good for CPU bound tasks.
When to use what?
Although, there are exceptions, I would say when IO or subprocess calls are involved or when the interpreter is giving of control to another sub-system use threads.
The module of choice for me for concurrency in python is concurrent.futures
.
To use a thread pool, we can do this:
from concurrent.futures import ThreadPoolExecutor as tpe
# call download method concurrently
with tpe(max_workers=3) as exe:
= exe.map(download, urls) results
Here, maximum of 4 threads will be spun up, each time sliced, and will start downloading the urls
. One thing to note here is that, depending upon the amount of data, here the urls
threaded code may not show significant performance improvement when compared to one without threading, so it is always good to time sequential and threaded code before you decide.
{% include info.html text=“use threads when the task is IO bound” %}
The same approach can be used to save the files, create a threadpool and save the downloaded content:
from concurrent.futures import ThreadPoolExecutor as tpe
...
...# call save method concurrently
with tpe(max_workers=3) as exe:
map(save, results)
exe.
... ...
Finally we come across, a CPU bound task, which has to tokenize the saved files, here if we use threading, although there could be some speedup than serial execution, it is advicable to use multiprocessing, in some cases, threading could actually slow down the program if the program is mostly CPU bound.
An example of using a multiprocess executor is:
from concurrent.futures import ProcessPoolExecutor as ppe
# call process method parallely
with ppe(max_workers=4) as exe:
map(process, open("./all.txt").readlines()) exe.
{% include info.html text=“use processes when the task is CPU bound” %}
Here we are mapping 4 process
methods with chunks of data from the file, one thing to note here is that the amount of work distributed to the 4 processes is not controlled by the programmer and may not be equal.
The end
This was an interesting experience, I wanted to write about this for few years now, finally got to it. I am pretty sure I have got some things wrong, if you find any such techincal inacurracies ping me on twitter or on github.
May be next time I will write, when to use async-await
.