Asynchronous HTTP Requests in Python 3.5+

So you’ve heard that Python now supports that fancy async/await syntax. You want to play with it, but asyncio seems intimidating.

Well, someone wrote a simpler alternative to asyncio. It’s called Curio and people are saying good things about it. [1]

In this tutorial, I’m going to show you how to make non-blocking HTTP requests using Curio.

Since Curio doesn’t have a high-level HTTP client yet, I whipped up a small library called curio-http [2], so you’ll need to install that as well.

The syntax

Let’s start with a single request:

import curio
import curio_http

async def main():
    async with curio_http.ClientSession() as session:
        response = await session.get('https://httpbin.org/get')

        print('Status code:', response.status_code)

        content = await response.json()

        print('Content:', content)

if __name__ == '__main__':
    curio.run(main())

You use async def to declare what’s called a coroutine. The last line — curio.run(main()) — kicks off the coroutine.
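
To make the “declare” versus “kick off” distinction concrete, here is a minimal sketch that needs nothing beyond the standard library. The names greet and drive are made up for illustration. drive is only a toy stand-in for what a real kernel such as curio.run() does, and it assumes the coroutine never actually has to wait for anything.

```python
import inspect


async def greet(name):
    # Nothing in this body runs until something drives the coroutine.
    return 'Hello, %s!' % name


def drive(coro):
    """Toy runner (hypothetical): step a coroutine that never suspends."""
    try:
        coro.send(None)
    except StopIteration as stop:
        # A finished coroutine hands back its return value this way.
        return stop.value
    # Reaching this point means the coroutine suspended on real I/O,
    # which is exactly the job a proper kernel takes care of.
    coro.close()
    raise RuntimeError('coroutine suspended; a real kernel is needed')


coro = greet('Curio')                     # creates a coroutine object only
is_coroutine = inspect.iscoroutine(coro)
result = drive(coro)                      # now the body actually runs
print(is_coroutine, result)
```

Calling greet('Curio') on its own only builds the coroutine object; the body runs once something steps through it.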

What’s inside the main coroutine should look familiar, if you’ve ever used the requests library.

At each point where await is called, the coroutine could theoretically yield control to a different coroutine. However, since there are no other coroutines here, the script behaves roughly like a synchronous program:

  1. Create an HTTP session.
  2. Make an HTTP request.
  3. Wait for the response headers.
  4. Print the response status code.
  5. Wait for the response content.
  6. Print the content.

Achieving concurrency

To reap the benefits of asynchronous I/O, it’s not enough to sprinkle our programs with the async and await keywords. We need to encode which operations can be executed independently (concurrently) and which need to happen one after the other (sequentially).

Sequential execution:

response1 = await session.get('https://foo.com')
response2 = await session.get('https://bar.com')

Concurrent execution:

task1 = await curio.spawn(session.get('https://foo.com'))
task2 = await curio.spawn(session.get('https://bar.com'))

curio.spawn() is how you express the idea “I want this coroutine to be executed in the background”. The thing that’s spawned is what Curio calls a task.
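
The difference in ordering can be observed without any network access. Note that the sketch below is not Curio code: it swaps in the standard library’s asyncio (Python 3.7 or later), with asyncio.create_task() playing roughly the role of curio.spawn() and asyncio.sleep() standing in for a slow request. fake_get and the events log are hypothetical names used only for this illustration.

```python
import asyncio

events = []  # records the order in which the fake requests start and finish


async def fake_get(name, delay):
    # Stand-in for session.get(): sleeps instead of touching the network.
    events.append('start ' + name)
    await asyncio.sleep(delay)
    events.append('finish ' + name)
    return name


async def sequential():
    # The second request only starts after the first one has finished.
    await fake_get('foo', 0.2)
    await fake_get('bar', 0.1)


async def concurrent():
    # Both requests run in the background; we wait for them afterwards.
    task1 = asyncio.create_task(fake_get('foo', 0.2))
    task2 = asyncio.create_task(fake_get('bar', 0.1))
    await task1
    await task2


asyncio.run(sequential())
sequential_events = list(events)

events.clear()
asyncio.run(concurrent())
concurrent_events = list(events)

print(sequential_events)
print(concurrent_events)
```

In the sequential run, 'bar' cannot start until 'foo' has finished. In the concurrent run, both start right away and the shorter one finishes first.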

Let’s look at an example that fetches a list of URLs concurrently by spawning a task for each one:

import curio
import curio_http

async def fetch_one(url):
    async with curio_http.ClientSession() as session:
        response = await session.get(url)
        content = await response.json()
        return response, content


async def main(url_list):
    tasks = []

    for url in url_list:
        task = await curio.spawn(fetch_one(url))
        tasks.append(task)

    for task in tasks:
        response, content = await task.join()

        print('GET %s' % response.url)
        print(content)
        print()


url_list = [
    'http://httpbin.org/delay/1',
    'http://httpbin.org/delay/2',
    'http://httpbin.org/delay/3',
    'http://httpbin.org/delay/4',
]

if __name__ == '__main__':
    curio.run(main(url_list))

Each URL in the list takes between 1 and 4 seconds to fetch: httpbin’s /delay/n endpoint waits n seconds before responding.

If we were to fetch them sequentially, it would take 1+2+3+4=10 seconds in total.

Since we’re using tasks, the requests overlap and the run time will only be around 4 seconds, the duration of the slowest request.
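
The arithmetic can be checked with a scaled-down simulation. This is a sketch under a few assumptions: it uses asyncio from the standard library rather than Curio (Python 3.7 or later), the delays are divided by ten so it finishes quickly, and fake_fetch and timed are hypothetical helpers that sleep instead of making requests.

```python
import asyncio
import time

# Scaled-down stand-ins for /delay/1 ... /delay/4 (tenths of a second).
DELAYS = [0.1, 0.2, 0.3, 0.4]


async def fake_fetch(delay):
    # Pretend to wait for a response that takes `delay` seconds.
    await asyncio.sleep(delay)
    return delay


async def fetch_sequentially():
    # Each fetch must finish before the next one starts: total is the sum.
    return [await fake_fetch(d) for d in DELAYS]


async def fetch_concurrently():
    # All fetches start at once: total is roughly the longest delay.
    tasks = [asyncio.create_task(fake_fetch(d)) for d in DELAYS]
    return [await t for t in tasks]


def timed(coro_func):
    # Run a coroutine function to completion and return the elapsed time.
    start = time.perf_counter()
    asyncio.run(coro_func())
    return time.perf_counter() - start


sequential_time = timed(fetch_sequentially)
concurrent_time = timed(fetch_concurrently)
print('sequential: %.2fs, concurrent: %.2fs' % (sequential_time, concurrent_time))
```

The sequential figure should land near the sum of the delays and the concurrent one near the largest delay, mirroring the 10 seconds versus 4 seconds above.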

Controlling concurrency

What if we want to scrape a site, but we don’t want to hammer it with too many concurrent connections?

The simplest approach is to use what’s called a bounded semaphore.

Let’s see what changes we would need to make to the above example:

 import curio
 import curio_http

+MAX_CONNECTIONS_PER_HOST = 2
+
+sema = curio.BoundedSemaphore(MAX_CONNECTIONS_PER_HOST)
+
 async def fetch_one(url):
-    async with curio_http.ClientSession() as session:
+    async with sema, curio_http.ClientSession() as session:
         response = await session.get(url)
         content = await response.json()
         return response, content

Here, we’re using not one, but two context managers: the semaphore and the HTTP session.

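The semaphore is acquired when a task enters the async with block. It’s released as soon as the block exits, right after the URL has finished being fetched.

If MAX_CONNECTIONS_PER_HOST tasks already hold the semaphore, the next task that tries to acquire it will wait until one of them releases it. Because every URL in this example points at the same host, a single global semaphore is enough to enforce a per-host limit.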
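
To watch the limit being enforced, the sketch below counts how many fake fetches are in flight at the same moment. It is an illustration under assumptions, not Curio code: it uses asyncio.BoundedSemaphore from the standard library (Python 3.7 or later), and LIMIT, fake_fetch, active and peak are hypothetical names.

```python
import asyncio

LIMIT = 2    # hypothetical cap, playing the role of MAX_CONNECTIONS_PER_HOST
active = 0   # number of fake fetches currently inside the semaphore
peak = 0     # highest value `active` ever reached


async def fake_fetch(sema, url):
    global active, peak
    async with sema:                # waits here while LIMIT tasks hold it
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.05)   # stand-in for the actual HTTP request
        active -= 1
    return url


async def main(urls):
    # Create the semaphore inside the running event loop.
    sema = asyncio.BoundedSemaphore(LIMIT)
    tasks = [asyncio.create_task(fake_fetch(sema, u)) for u in urls]
    return [await t for t in tasks]


urls = ['url-%d' % i for i in range(6)]
results = asyncio.run(main(urls))
print('peak concurrency:', peak)
```

Six tasks are started at once, yet the counter never rises above the limit, because the other tasks are parked at the async with line until a slot frees up.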

To learn about more neat features of Curio, such as timeout handling and events, check out the excellent introductory tutorial.

I hope this has given you a glimpse of what modern async I/O can look like in Python. All of the libraries used in this tutorial are in a very early state right now, but I think they have a lot of potential.

  1. The downside is that it doesn’t work on Windows right now or with older versions of Python.

  2. Under the hood, it leverages this new thing called a sans I/O network protocol.
