The ray library for distributed computing has been around for a while. I first noticed it a few years back, when I had started doing some Deep Reinforcement Learning at work and needed to run distributed policy explorations. Since then, the library has expanded its role in many ways and is now a true distributed framework for scaling compute-intensive workloads. Today I want to try using it for a non machine learning workload and see how different it is from using the multiprocessing library that comes as part of the Python standard library.
I was doing some web scraping and parsing and needed to parallelize part of my code; specifically, I wanted to run 2 functions in parallel.
Python multiprocessing
The usual approach I would choose for this is the multiprocessing library, although threads or asyncio could be used as well, since these were heavy IO-bound functions.
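For the record, since the real functions were IO-bound, a thread-based version would also have done the job. A minimal sketch using concurrent.futures, where fetch_listings and parse_pages are just hypothetical stand-ins for my scraping functions:

from concurrent.futures import ThreadPoolExecutor

def fetch_listings():
    # hypothetical IO-bound function, e.g. downloading pages
    ...

def parse_pages():
    # hypothetical IO-bound function, e.g. parsing downloaded pages
    ...

with ThreadPoolExecutor(max_workers=2) as executor:
    f1 = executor.submit(fetch_listings)
    f2 = executor.submit(parse_pages)
    listings, pages = f1.result(), f2.result()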
Consider these two functions as an example:
def sum():
    a = 0
    for i in range(1000000):
        a += i
    return a

def product():
    a = 1
    for i in range(1, 100000):
        a *= i
    return a
One way to multiprocess sum() and product() would be:

from multiprocessing import Process

p1 = Process(target=sum)
p2 = Process(target=product)
p1.start()
p2.start()
p1.join()
p2.join()
That was fairly simple, wasn't it? Real-world functions will have parameters and might need to share memory, etc.; I am avoiding all of that here for simplicity.
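One detail worth noting: Process.join() only waits for the child process and returns None, so the return values of sum() and product() cannot actually be collected that way; you need something like a Queue or a Pool for that. A minimal sketch using multiprocessing.Pool, with hypothetical parameterized versions of the two functions to show how arguments would be passed as well:

from multiprocessing import Pool

def sum_up_to(n):
    # hypothetical parameterized version of sum()
    a = 0
    for i in range(n):
        a += i
    return a

def product_up_to(n):
    # hypothetical parameterized version of product()
    a = 1
    for i in range(1, n):
        a *= i
    return a

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        r_sum = pool.apply_async(sum_up_to, (1000000,))
        r_prod = pool.apply_async(product_up_to, (100000,))
        # get() blocks until each worker is done and returns its result
        n_sum, n_prod = r_sum.get(), r_prod.get()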
Using Ray to do the same thing
import ray

# initialize ray
ray.init()

# decorate the functions that need to be parallelized
@ray.remote
def sum():
    a = 0
    for i in range(1000000):
        a += i
    return a

@ray.remote
def product():
    a = 1
    for i in range(1, 100000):
        a *= i
    return a

# kick off both functions; each call returns an object id immediately
sum_id, prod_id = sum.remote(), product.remote()

# block until finished and fetch the results
n_sum, n_prod = ray.get([sum_id, prod_id])
That's it; that was as simple as using the multiprocessing library. The interesting thing about ray is that the same code can be used to distribute the workload across a cluster without changing it in any way. This, I think, is a game changer, especially since I don't have to deal with MPI configs or Spark clusters to get that ability.
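For reference, pointing the same script at a cluster is mostly a matter of starting Ray on the nodes and changing how ray.init() is called; a rough sketch of the usual setup (the head-node address is a placeholder):

# on the head node (shell):
#     ray start --head
# on each worker node (shell):
#     ray start --address='<head-node-ip>:6379'

import ray

# connect to the already-running cluster instead of starting a local instance
ray.init(address="auto")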
Running this code, I get an INFO message:
2020-05-23 18:05:18,848 INFO resource_spec.py:204 -- Starting Ray with 7.28 GiB memory available for workers and up to 3.66 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-23 18:05:19,246 INFO services.py:1168 -- View the Ray dashboard at localhost:8265
The dashboard running on localhost:8265 gives me an overview of resource consumption (CPU, memory), the amount of data sent and received, etc. This is especially useful in a distributed cluster environment. Very neat!
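Following the hint in that log message, the memory pools can also be capped explicitly when initializing Ray; a minimal sketch (the sizes here are arbitrary):

import ray

# limit worker heap memory and the object store; values are in bytes
ray.init(memory=4 * 1024**3, object_store_memory=2 * 1024**3)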
The end
For distributed Machine Learning training I have used Horovod exclusively. The one thing with Horovod is that it needs MPI, and at times it's a pain to debug MPI. I would like to try out Ray and see if it is easier to use in place of Horovod. My initial thoughts are all positive, and it looks really great.