Ray Tips and Tricks, Part 2 — ray.get() (2024)

March 25: Updated the discussion of the timeout option for ray.get(). See, in particular, the comparison of ray.get() and ray.wait() at the end of this post.

In the first part of this series, I listed the most commonly-used API features for Ray, then explored how ray.wait() works.

In this installment, I’ll explore using ray.get(object_ids) effectively. The ray.get() documentation provides a good explanation of this method and how to use it. You pass the id of a remote object or a list of them.

When you callray.get(), it blocks until the corresponding object is available or all the objects in the list are available. It returns the objects.

In more detail, it blocks until the object corresponding to the object ID is available in Ray’s local object store (the objects cached in memory with Plasma). If necessary, the object will be copied from an object store on another node. In either case, the computation has to finish first so the object is available. When object_ids is a list, then the objects corresponding to each object ID in the list will be returned. Hence, the method blocks for all of them.

Note: Ray uses “zero-copy” access to the objects in the object store. Any task (or Ray actor) that reads an object will read the one in the cache, not pay the overhead of copying that object to the local process memory.

There is an optional timeout field that defaults to None, meaning no timeout. If you specify a value and the timeout is hit before the objects are available locally, then RayTimeoutError is raised. However, this option may be removed in a future version of Ray, to simplify the API. Later, I’ll discuss the alternative approach you should use instead.

AnException is raised if the task that created the object or one of the objects in the list raised an exception.

Now let’s discuss tips and tricks.

Don’t call ray.get() inside an async context

A warning is issued if ray.get(object_id) is called inside an async context. Use await object_id instead. For a list of object ids, use await asyncio.gather(*object_ids).

Minimizing blocking when calling ray.get()

Let’s adapt the example code from Part 1. This time, we’ll start by defining a busy function that isn’t a Ray remote task.

We don’t need to initialize Ray yet, but we used the previous example with the single change that busy is not a Ray task. Calling ray.init() is harmless here.

Before using Ray, let’s see how this function performs. We will time five sequential invocations of this function. The output is shown as a :

duration: 5.018 seconds
0: 1.0
1: 1.0
2: 1.0
3: 1.0
4: 1.0

Your times should be similar. The loop output isn’t all that interesting, but the duration shows us that we waited five seconds for the loop to finish. Since each call to busy sleeps for one second, the five second duration should make sense.

Now let’s wrap busy in a Ray task:

We can just wrap the normal Python function busy in a new function that defined to be a Ray task.

As usual, we invoke rbusy using rbusy.remote(...) and use ray.get() to fetch the result, so let’s just do that in the loop:

The results are disappointing:

duration: 5.034 seconds
0: 1.0 (time: 1.007 seconds)
1: 1.0 (time: 2.015 seconds)
2: 1.0 (time: 3.022 seconds)
3: 1.0 (time: 4.026 seconds)
4: 1.0 (time: 5.034 seconds)

They are no better than before. Now we tracked how long each loop iteration took to run, about one second (the time values, which are cumulative).

The problem is that we immediately called ray.get() on each object ID returned by arbusy.remote()call. This effectively blocked progress, because we waited for each remote call to finish before spawning the next task.

Instead, we need fire off the rbusy invocations, which will run in parallel because rbusy.remote() is nonblocking. Then, we can call ray.get() to retrieve the objects, blocking if necessary.

We could call ray.get(ids) for all the ids at once, but because we’re going to grab some extra information in our list, i.e., times, we’ll just call ray.get() on each id, one at a time:

Here’s the output, which is a bit busy, so we’ll walk through it carefully:

duration1: 1.892 MILLISECONDS
0: ObjectID(e725323c902523f0ffffffff010000c801000000) (time: 0.602 MILLISECONDS)
1: ObjectID(b63236165db0eea7ffffffff010000c801000000) (time: 1.053 MILLISECONDS)
2: ObjectID(53a5a3a325aa2675ffffffff010000c801000000) (time: 1.332 MILLISECONDS)
3: ObjectID(04df2a355e22995fffffffff010000c801000000) (time: 1.546 MILLISECONDS)
4: ObjectID(a438aa430f188597ffffffff010000c801000000) (time: 1.745 MILLISECONDS)
0: 1.0 (time: 1001.139 MILLISECONDS)
1: 1.0 (time: 1001.476 MILLISECONDS)
2: 1.0 (time: 1004.851 MILLISECONDS)
3: 1.0 (time: 1005.418 MILLISECONDS)
4: 1.0 (time: 1005.659 MILLISECONDS)
duration2: 1.006 seconds

The duration1 value shows that it takes about 2 milliseconds to process the whole first loop, because the calls to rbusy.remote() are nonblocking.

We then print out the object ids and the cumulative time it took for those five rbusy.remote()calls. The last value, for pass “4”, shows a time that’s a bit less than the total duration1 time, as we would expect.

The times for the second loop are interesting. It takes just over one second (1001 milliseconds) for the first iteration through the loop to complete. Why? It calls ray.get(), so it has to wait for the first task to finish. But then the rest of the loop iterations go very quickly. Each one takes a fraction of a millisecond. Why is that?? Recall that we’re running the five tasks in parallel, so when the first one finishes, the other four finish at about the same time, so for the rest of the loop iterations, each call to ray.get() doesn’t wait very long, if at all, before the object is ready. If you watch this run, you’ll notice a short pause after the last ObjectID line, then the rest of the output “bursts” out.

This is the best we can do for this contrived example. Our tasks ran in parallel, but when we needed the results, we had to wait for them to finish. If we had some other busy work to do and then called ray.get() later, we might have seen no pauses at all.

Ray.get() vs. ray.wait()

Let’s discuss two considerations where ray.wait() should be used instead of ray.get().

First, we used a simple example where all tasks took the same amount of time. What if we had very different task durations and what if the longest running task were at the beginning of our list? In this case, we would block waiting for it to finish, even though other tasks would finish sooner and their objects would be ready to process.

This is why you should consider using ray.wait() for real-world applications when you need to wait on multiple running tasks. It will return as soon as results are available, subject to the options you provide when calling it, as we discussed in Part 1. Hence, you can process available results while longer-running tasks continue to run asynchronously.

Second, I mentioned above that the timeout field for ray.get() may be removed in a subsequent release of Ray. Going forward, ray.wait() will be the API call of choice for this blocking operation when you want to specify a timeout.

In general, you should really avoid making blocking calls in production distributed systems without specifying timeout values. Otherwise, the situation can sometimes occur where an application freezes waiting for a blocking call to complete that may never complete for some reason. It’s better to keep the application “live” with a timeout and appropriate handling of the corresponding error.

Hence, for these two reasons, you should strongly consider using ray.wait(), with a timeout value, in production code and limit the use of ray.get(), except when you only need one task result and you’re confident that having no timeout is “safe”.

See Part 1 for more details about ray.wait().

If you have any questions or thoughts about Ray, you can join our community through Discourse or Slack. If you would like to see how Ray is being used throughout industry, please consider attending or watching Ray Summit.

Ray Tips and Tricks, Part 2 — ray.get() (2024)

FAQs

Ray Tips and Tricks, Part 2 — ray.get()? ›

When you call ray. get() , it blocks until the corresponding object is available or all the objects in the list are available. It returns the objects. In more detail, it blocks until the object corresponding to the object ID is available in Ray's local object store (the objects cached in memory with Plasma).

Is Ray Get blocking? ›

A call to ray. get() fetches the results of remotely executed functions. However, it is a blocking call, which means that it always waits until the requested result is available.

What does Ray get do? ›

get. Get a remote object or a list of remote objects from the object store. This method blocks until the object corresponding to the object ref is available in the local object store.

What does Ray Put do? ›

When we pass a large object as an argument to a remote function, Ray calls ray. put() under the hood to store that object in the local object store.

How to use ray in Python? ›

A Simple Python Example: Running a Ray Task on a Remote Cluster. With Ray, you can run functions on a cluster as remote tasks. To use Ray, you need to add the @ray. remote decorator to the function you want to run remotely.

What is a Ray actor? ›

Actors extend the Ray API from functions (tasks) to classes. An actor is essentially a stateful worker (or a service). When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker.

What is the ray task? ›

Ray enables arbitrary functions to be executed asynchronously on separate Python workers. Such functions are called Ray remote functions and their asynchronous invocations are called Ray tasks.

Why is Ray so powerful? ›

Ultimately, the Force dyad that links Rey and Kylo Ren makes Rey the most powerful Jedi between her and the Skywalkers. A Force dyad is described as being more powerful than life itself, striking a perfect balance between the light and dark side.

Who does Ray fall in love with? ›

Ben Solo is the love of Rey's life. Initially unknown to Rey, she forms a dyad in the Force with Ben. The dyad is an unbreakable Force-bond that makes them one in the Force, despite being born as two physically separated individuals.

What is a raylet? ›

Worker node A Ray instance consists of one or more worker nodes, each of which consists of the following physical processes: Raylet: A raylet is shared among all jobs on the same cluster. That is other nodes on can talk and communicate with it.

What is Ray good for? ›

The RAY framework is an open-source project that provides a simple, universal API for building distributed applications. It is particularly well-suited for applications that require parallel and distributed computing, making it a popular choice for machine learning, deep learning, and big data processing tasks.

What is a ray client? ›

What is the Ray Client? The Ray Client is an API that connects a Python script to a remote Ray cluster. Effectively, it allows you to leverage a remote Ray cluster just like you would with Ray running on your local machine.

What is Ray AI? ›

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI libraries for simplifying ML compute: Learn more about Ray AI Libraries: Data: Scalable Datasets for ML. Train: Distributed Training.

What is a ray cluster? ›

A Ray cluster consists of a single head node and any number of connected worker nodes: A Ray cluster with two worker nodes.

What is ray code? ›

Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don't need to be a distributed systems expert.

What is ray serve? ›

Get in touch with us if you're using or considering using Ray Serve. Ray Serve is a scalable model serving library for building online inference APIs.

How does ray object store work? ›

Remote objects are cached in Ray's distributed shared-memory object store, and there is one object store per node in the cluster. In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s).

What does Ray Remote do? ›

Ray lets you run functions as remote tasks in the cluster. Remote functions can be run in a separate process on the local machine - spreading out the workload over several cores. Or can be executed on remote machines in your server cluster.

What is ray data? ›

Ray Data is a scalable data processing library for ML workloads. It provides flexible and performant APIs for scaling Offline batch inference and Data preprocessing and ingest for ML training.

Top Articles
Latest Posts
Article information

Author: Sen. Ignacio Ratke

Last Updated:

Views: 6045

Rating: 4.6 / 5 (56 voted)

Reviews: 87% of readers found this page helpful

Author information

Name: Sen. Ignacio Ratke

Birthday: 1999-05-27

Address: Apt. 171 8116 Bailey Via, Roberthaven, GA 58289

Phone: +2585395768220

Job: Lead Liaison

Hobby: Lockpicking, LARPing, Lego building, Lapidary, Macrame, Book restoration, Bodybuilding

Introduction: My name is Sen. Ignacio Ratke, I am a adventurous, zealous, outstanding, agreeable, precious, excited, gifted person who loves writing and wants to share my knowledge and understanding with you.