Asynchronous HTTP Requests in Python 3.5+
So you’ve heard that Python now supports that fancy async/await syntax. You want to play with it, but asyncio seems intimidating.
In this tutorial, I’m going to show you how to make non-blocking HTTP requests using Curio.
Let’s start with a single request:
    import curio
    import curio_http


    async def main():
        async with curio_http.ClientSession() as session:
            response = await session.get('https://httpbin.org/get')
            print('Status code:', response.status_code)
            content = await response.json()
            print('Content:', content)


    if __name__ == '__main__':
        curio.run(main())
Here we use async def to declare what’s called a coroutine. The last line, curio.run(main()), kicks off the coroutine.
What’s inside the main coroutine should look familiar if you’ve ever used the requests library.
At each point where await is used, the coroutine could theoretically yield control to a different coroutine. However, since there are no other coroutines here, the script behaves roughly like a synchronous program:
- Create an HTTP session.
- Make an HTTP request.
- Wait for the response headers.
- Print the response status code.
- Wait for the response content.
- Print the content.
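To see what “yielding control” means in practice, here is a minimal sketch. It uses the standard library’s asyncio rather than Curio, purely so that it runs without extra packages, and fake_request and ticker are made-up stand-ins (a sleep replaces the network round trip). Run alone, the coroutine goes through its steps in order; with a second coroutine around, that coroutine gets to run while the first one is waiting.

```python
import asyncio


async def fake_request(log, delay=0.05):
    # Hypothetical stand-in for session.get(): sleeps instead of doing network I/O.
    log.append("request sent")
    await asyncio.sleep(delay)  # the coroutine gives up control here
    log.append("response received")


async def ticker(log, interval=0.01):
    # A second coroutine that gets to run whenever fake_request is waiting.
    while True:
        log.append("tick")
        await asyncio.sleep(interval)


async def alone():
    log = []
    await fake_request(log)
    return log


async def with_ticker():
    log = []
    background = asyncio.create_task(ticker(log))
    await fake_request(log)
    background.cancel()  # stop the ticker once the request is done
    return log


log_alone = asyncio.run(alone())
log_shared = asyncio.run(with_ticker())
print(log_alone)
print(log_shared)
```

The first log contains only the two request steps. The second has "tick" entries between them, because the ticker ran during the wait.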
To reap the benefits of asynchronous I/O, it’s not enough to sprinkle our programs with await keywords. We need to encode which operations can be executed independently (concurrently) and which need to happen one after the other (sequentially).
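For example, these two requests run one after the other:

    response1 = await session.get('https://foo.com')
    response2 = await session.get('https://bar.com')

To run them concurrently, we spawn each request as a task instead:

    task1 = await curio.spawn(session.get('https://foo.com'))
    task2 = await curio.spawn(session.get('https://bar.com'))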
curio.spawn() is how you express the idea “I want this coroutine to be executed in the background”. The thing that’s spawned is what Curio calls a task.
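The difference between the two styles is easy to observe without touching the network. The sketch below is an approximation, not Curio code: it assumes asyncio.create_task() as a rough counterpart to curio.spawn(), and fake_get is a hypothetical helper that sleeps instead of making a request. Each call records when it starts and finishes.

```python
import asyncio


async def fake_get(url, log, delay=0.05):
    # Hypothetical stand-in for session.get(url); no real request is made.
    log.append("start " + url)
    await asyncio.sleep(delay)
    log.append("done " + url)
    return url


async def sequential(log):
    # The second request is not started until the first has finished.
    await fake_get("foo", log)
    await fake_get("bar", log)


async def concurrent(log):
    # Both requests are started as background tasks, then awaited.
    task1 = asyncio.create_task(fake_get("foo", log))
    task2 = asyncio.create_task(fake_get("bar", log))
    await task1
    await task2


seq_log, conc_log = [], []
asyncio.run(sequential(seq_log))
asyncio.run(concurrent(conc_log))
print(seq_log)
print(conc_log)
```

In the sequential log, "done foo" comes before "start bar". In the concurrent log, both "start" entries come first, which is the whole point of spawning tasks.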
Let’s look at an example that fetches a list of URLs concurrently by spawning a task for each one:
    import curio
    import curio_http


    async def fetch_one(url):
        async with curio_http.ClientSession() as session:
            response = await session.get(url)
            content = await response.json()
            return response, content


    async def main(url_list):
        # Spawn one background task per URL.
        tasks = []
        for url in url_list:
            task = await curio.spawn(fetch_one(url))
            tasks.append(task)

        # Wait for each task to finish and print its result.
        for task in tasks:
            response, content = await task.join()
            print('GET %s' % response.url)
            print(content)
            print()


    url_list = [
        'http://httpbin.org/delay/1',
        'http://httpbin.org/delay/2',
        'http://httpbin.org/delay/3',
        'http://httpbin.org/delay/4',
    ]


    if __name__ == '__main__':
        curio.run(main(url_list))
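Each URL in the list takes a known number of seconds to fetch: httpbin’s /delay/N endpoint waits N seconds before responding.
If we were to fetch them sequentially, it would take 1+2+3+4=10 seconds in total.
Since we’re using tasks, the run time will only be around 4 seconds, roughly the duration of the slowest request.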
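If you want to check that arithmetic without waiting on httpbin, here is a rough sketch that swaps in asyncio from the standard library and replaces each request with a sleep. The delays are scaled down (0.05 to 0.20 seconds instead of 1 to 4), and fake_fetch and timed are hypothetical helpers for this illustration only.

```python
import asyncio
import time

# Scaled-down stand-ins for /delay/1 .. /delay/4.
DELAYS = [0.05, 0.10, 0.15, 0.20]


async def fake_fetch(delay):
    # Pretend to fetch a URL that takes `delay` seconds to respond.
    await asyncio.sleep(delay)
    return delay


async def fetch_sequentially(delays):
    # Each fetch starts only after the previous one has finished.
    return [await fake_fetch(d) for d in delays]


async def fetch_concurrently(delays):
    # All fetches are started together; results come back in input order.
    return await asyncio.gather(*(fake_fetch(d) for d in delays))


def timed(coro):
    # Run a coroutine to completion and measure the wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_time = timed(fetch_sequentially(DELAYS))
conc_result, conc_time = timed(fetch_concurrently(DELAYS))
print("sequential: %.2fs" % seq_time)
print("concurrent: %.2fs" % conc_time)
```

The sequential run should take about the sum of the delays, and the concurrent run about as long as the largest one. Exact numbers will vary a little from machine to machine.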
What if we want to scrape a site, but we don’t want to hammer it with too many concurrent connections?
The simplest approach is to use what’s called a bounded semaphore.
Let’s see what changes we would need to make to the above example:
     import curio
     import curio_http

    +MAX_CONNECTIONS_PER_HOST = 2
    +
    +sema = curio.BoundedSemaphore(MAX_CONNECTIONS_PER_HOST)
    +

     async def fetch_one(url):
    -    async with curio_http.ClientSession() as session:
    +    async with sema, curio_http.ClientSession() as session:
             response = await session.get(url)
             content = await response.json()
             return response, content
Here, we’re using not one, but two context managers: the semaphore and the HTTP session.
The semaphore is acquired each time a task starts fetching. It’s released right after the URL has finished being fetched.
If MAX_CONNECTIONS_PER_HOST tasks already hold the semaphore, the next task that tries to acquire it will wait until a release happens.
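To convince yourself that the semaphore really caps the concurrency, you can count how many fetches are in flight at once. This is only a sketch under a few assumptions: it uses asyncio.BoundedSemaphore from the standard library in place of Curio’s, a sleep in place of the HTTP request, and a hypothetical stats dictionary for bookkeeping.

```python
import asyncio

MAX_CONNECTIONS_PER_HOST = 2


async def fetch_one(url, sema, stats):
    async with sema:
        # Between acquire and release, this task counts as "in flight".
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.02)  # pretend to fetch the URL
        stats["active"] -= 1
    return url


async def main(url_list):
    # Created inside the coroutine so it belongs to the running event loop.
    sema = asyncio.BoundedSemaphore(MAX_CONNECTIONS_PER_HOST)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_one(url, sema, stats) for url in url_list)
    )
    return results, stats["peak"]


urls = ["url-%d" % i for i in range(6)]  # placeholder names, not real URLs
results, peak = asyncio.run(main(urls))
print("peak concurrent fetches:", peak)
```

Even though six tasks are started at the same time, the peak never goes above MAX_CONNECTIONS_PER_HOST. Note that a single module-level semaphore limits all fetches together, not each host separately.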
To learn about more neat features of Curio, such as timeout handling and events, check out the excellent introductory tutorial.
I hope this has given you a glimpse of what modern async I/O can look like in Python. All of the libraries used in this tutorial are in a very early state right now, but I think they have a lot of potential.