Async & Parallel R with {mirai} | Charlie Gao | Data Science Lab

Transcript

This transcript was generated automatically and may contain errors.

Alright, it's time to announce our lab manager for today. We are joined by Charlie. Charlie Gao, would you like to introduce yourself?

Sure. Hi everyone. Nice to see you all. I'm Charlie Gao. I'm a member of the Open Source team at Posit and, as a sort of icebreaker, I have on this posit::conf t-shirt from Seattle in 2024. As you can see, this is actually a participant's t-shirt. So at that conference, two conferences ago, I was a member of the community, and it was really after that conference that I joined Posit, first as a contractor and then full time.

So what do I work on? Well, I work on a number of things across Open Source. What I've been working on most recently has actually been Quarto. I've been actively involved in Quarto 2, which is the new Rust rewrite that has been announced, so I'm free to say that. As part of that, you will all be getting a collaborative editor. So that's been an incredibly exciting project to be working on.

But apart from that, I'm also the creator and maintainer of a few R packages: mirai, nanonext, secretbase, et cetera. And these are all packages that are mostly written in C, actually, for high-performance computing and networking. So I'm very deep into async. Apart from working on mirai, which is the package I'm probably best known for, I've also implemented modern async in packages that you all use often, like httr2 and ellmer. I also maintain the later package on behalf of Joe and Winston.

Those are packages that they created very early on, and I've taken over the maintainership as they've focused more on AI-related applications. later is the core event loop in Shiny and other key R packages like plumber. It is basically a way for different packages to cooperate with each other, even though they're running at the same time. So as you can see, I'm very much into async, into working on R internals and C code, and also communications. So this is things like HTTP and WebSockets, hence my work with Shiny and now on Quarto and the collaborative editor. So today, I'm mostly going to be focusing on mirai and how to use it for parallel and async programming.

Yes, I would love to start with what the heck is async programming, and how is it different from parallel?

then we're free in our own session to do other tasks. And that is essentially what async is.

Why was mirai created?

So this is where I would love to hop in and ask Lauren's question, which is, what was the motivation or the need for developing mirai? What problem is this built to solve? I'm imagining as a person who in undergrad and grad school had to leave her computer running overnight to run projects, that this would mean if you have something that's a very, very long process that's going to take a very long time, you can do something else while you're waiting for that to finish. Am I right?

Yes. So there are many ways you can use it. One is as you described. As for my background, I'd say I'm not a data scientist now, but I did do data science-y type things previously. I was actually building deep neural nets. So I was training these neural nets and then running inference on them, and I was using mirai for both of those purposes, right. So first of all, if you have things that run for a long time, you can run them in parallel in background processes, and each of them is self-contained. So if something errors and goes wrong in one of them, all the others will still progress to completion, hopefully, because they don't interact with each other.

And the second part is really what led me to develop mirai, which is that I was running inference with one of these neural nets, one of these models, right? And I was also ingesting real-time data. This was financial market data that I was ingesting through an API in real time, running it through the model, and then saving the inference results in a kind of database, right? Because this all had to happen in real time, I couldn't afford to just run things in a normal loop, because if something falls over, that basically stops the whole thing. So what mirai allows you to do is offload different tasks into different processes so that your main loop can stay reactive. Essentially, you're not doing very much at all in your main loop. You're just checking to see whether the other parts that you need are done. So whether that's ingesting data or writing data, those can be done in separate processes. And you can have redundancy and failover in your main loop. That is a very high-level overview, but hopefully that makes sense to everyone.

Async with promises

So one way we can work with async is we have this thing that's running in another process. And then in mirai, if you use these square brackets on a mirai, this is the collect method. This is equivalent to collect_mirai(). Here it returns straight away because it's already done, but in general this will wait for the mirai to complete and then return the result. This is basically parallel; it's not really async, because you're waiting, but it is one way that you can get the results. Where this gets interesting is that you can pair mirai with promises to have actions happen as soon as a result is complete, without you having to wait for it.
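
The blocking way of collecting a result described here might look like the following. This is a minimal sketch rather than the code from the session; the object name m and the one-second task are illustrative assumptions.

```r
library(mirai)

# Start a task in a background process; this call returns immediately.
m <- mirai({
  Sys.sleep(1)
  "done"
})

# Check, without blocking, whether the task is still running.
unresolved(m)

# Block until the task has completed, then return its value.
result <- m[]

# collect_mirai() is the function form of the same operation.
same <- collect_mirai(m)
```

Both calls wait for completion, which is why this pattern is parallel but not asynchronous: the calling session is blocked while it waits.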

So this works something like this, and this is basically how mirai with Shiny works. So Shiny, you know, everything is async, your Shiny server serves many different users, you can't have that server sort of like stop and wait for one particular user to finish.

So again, if I create a mirai that just sleeps and then returns a value, how promises work is you can say, you know, once that's done, then do something else with the return value. And an example I like using is the beep function from the beepr package, which will play a sound. So here, this mirai is going to run for two seconds, then it's going to return two, and then it's going to call beepr's beep with two, which is the second beep sound.
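
A rough sketch of that pattern, assuming the promises package is installed. To avoid depending on audio, the callback prints a message instead of calling beepr::beep(); the names m and p and the multiplication by ten are illustrative only.

```r
library(mirai)
library(promises)

# A task that runs for two seconds in a background process and returns 2.
m <- mirai({
  Sys.sleep(2)
  2
})

# Register a callback to run once the mirai has resolved.
# then() returns immediately; it does not wait for the task.
# In the session described above, the callback was beepr::beep(value).
p <- then(m, function(value) {
  message("mirai returned ", value)
  value * 10
})
```

In an interactive session or a Shiny app, the callback runs through the later event loop once the result is ready and R is idle. In a non-interactive script, the event loop has to be run explicitly, for example with later::run_now().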

mirai map and daemons

And to give you a more concrete example of how this works, if I do a mirai map, this is basically the equivalent of lapply or purrr's map, but it maps each individual element in a different background process, right? So I map one to five to a function that just sleeps for the specified number of seconds, so from one to five seconds, and returns that number. And mirai_map() has an argument called .promise, which works exactly the same as then() from promises. So we can pass a function to this.

So I hit an error. So this is basically telling me that I need to set daemons. I can just click through here and set daemons. So daemons, again, is another piece of jargon. Daemons are basically just workers. They are background processes, right? And the reason it asked me to set daemons for this map is for a map, here I'm mapping only five elements, but that map could be over like 100 or 1,000 elements, right? And if daemons weren't set, then you'd basically be launching 1,000 different sessions on your own machine, and that will probably crash your machine. So for a map, it will ask you to set the daemons beforehand. Because it's usually sort of an oversight or mistake, as it was in this case. So I've just gone ahead and set six daemons, which are basically background processes.

At any time in mirai, there's a function which is just info(), and you can see exactly the status of what's happened. So we have six connections here.
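
The map described in this section could be sketched roughly as follows. It assumes a recent mirai release that provides info(), and that the promises package is installed for the .promise argument. The sleep is scaled down to tenths of a second to keep the example quick.

```r
library(mirai)

# mirai_map() requires daemons (background worker processes) to be set first.
daemons(6)

# Connection and task counts for the current daemons.
info()

# Map each element to a daemon; the call returns immediately.
mp <- mirai_map(
  1:5,
  function(x) {
    Sys.sleep(x / 10)
    x
  },
  # Passed to promises::then() for each element; callbacks run via the
  # event loop once an element has finished and the session is idle.
  .promise = function(x) message("finished element ", x)
)

# Wait for all elements and collect the results as a list.
res <- mp[]

# Reset daemons when finished.
daemons(0)
```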

Does mirai inherit from the global environment?

Do the contents of mirai inherit definitions from the global environment?

No. And this is a conscious design decision, and it helps you avoid mistakes. I know that future does it differently, and that package has been around for a while, so people know that behavior. But the danger in trying to automatically infer what's needed in your mirai is that, well, in the best case, it can lead to errors. But in the worst case, something is wrong, you get a plausible answer, and you don't even know you're wrong. So that arguably is worse. mirai is very explicit in that sense, and I'll show you exactly what I mean.

So if I define something like slow func, right? Say this is just a function that, again, just sleeps, and then it returns something like "done". Okay. So if I attempt to run mirai with slow func, you might think that would work. But what you would actually see is an error, and it just says it could not find function slow func. That's because mirai runs your expression in another process, and this is a clean process. Every time you call mirai, that is a clean invocation. There's nothing in that evaluation environment. So what you would do here is call it, but then pass slow func in to the mirai. The expression is the same, but you define slow func inside the mirai to be the slow func that's living in my current environment. Okay. And then we get the result we expect.
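
As an illustration of that behavior, here is a small sketch. The function name slow_func and its body are assumed from the spoken description rather than copied from the session.

```r
library(mirai)

# A function defined only in the current session.
slow_func <- function() {
  Sys.sleep(1)
  "done"
}

# The expression is evaluated in a clean process, where slow_func
# does not exist. The error is returned as a value, not thrown.
m1 <- mirai(slow_func())
bad <- m1[]
is_mirai_error(bad)

# Pass the function explicitly as a named argument so that it is
# sent to the other process along with the expression.
m2 <- mirai(slow_func(), slow_func = slow_func)
good <- m2[]
```

Named arguments given after the expression are made available in the evaluation environment of the mirai, which keeps the dependencies of each task explicit.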

Parallel vs. sequential: a timing demo

So sequential is basically, so you're right, R is single threaded. Without a package like mirai, everything will run sequentially, and mirai is basically designed to overcome that. And why is this important? Well, R has been around for, you know, 30 years or so. Back then, you were lucky if you had two or four cores; that was the norm. But a modern laptop has at least eight cores. On my MacBook, I have 14 cores. And all your computations in R by default only use one single core.

So the easiest way to demonstrate that is if I define a function, long task. And again, I'm using these generic functions because it's easier for me to control how long they take, rather than an actual function that takes a long time. So this will just sleep for two seconds and then return the number. Okay? So if I time how long this takes when I lapply long task over, say, five elements, then, you know, five times two, you expect this to last ten seconds.

As you can see. And this is just sleeping for two seconds, but you can imagine that it could be doing some complex matrix multiplications, or it could be hitting some kind of API and waiting for data to come in, right? So what you would want is for those to run in parallel. So if we set six daemons, which are six workers, then we do the same thing and time a mirai map, which is the parallel equivalent of lapply, with long task. And here we actually wait for these to complete, because otherwise it will return immediately. And then we can see that because they're done in parallel, this whole thing only takes two seconds.
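
The comparison in this demo can be reproduced with something like the sketch below. The name long_task and the variables holding the timings are assumptions for illustration.

```r
library(mirai)

# A stand-in for real work: sleep for two seconds, then return the input.
long_task <- function(x) {
  Sys.sleep(2)
  x
}

# Sequential: five tasks of two seconds each, one after another.
seq_time <- system.time(lapply(1:5, long_task))

# Parallel: six daemons, so all five tasks can run at the same time.
daemons(6)

# The trailing [] waits for completion; without it the call returns at once.
par_time <- system.time(mirai_map(1:5, long_task)[])

daemons(0)

# Compare elapsed wall-clock time in seconds.
seq_time[["elapsed"]]
par_time[["elapsed"]]
```

The sequential run should take roughly ten seconds and the parallel run a little over two, with the difference from exactly two being the overhead of sending tasks to the daemons.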

All right. That made a lot more sense to my brain. We have each of these elements assigned to a different process and they can all happen at the same time. So I can say I can be like, all right, I have these, let's say, five modeling tasks, I think in statistical modeling. So I have these five modeling tasks and they're going to take a while because they are each going to iterate over a bunch of things themselves. And I can run all of them at the same time on different cores of my computer. And I don't have to keep my laptop awake overnight in order to run them all and get all of those and wait for all of them sequentially to run on the same core.

That's right. And, you know, these don't have to be simple functions. They can be full scripts. You can literally do mirai and say source, you know, whatever file you have, and it can run your entire script, which may take hours, may take days even, right? And you can have them running in different processes. And the key thing about mirai is that you don't need to care whether that's just running in another process on your own machine, so you're utilizing all the cores that your laptop has, or whether they're running in processes on another machine. So if you have a network server, right, if you have a workstation where you can send jobs, or if you're, yeah, distributed, exactly. If you're at a university and you have access to an HPC cluster, mirai allows you to do all of that.

Dynamic scaling with daemons

So mirai basically works as a kind of hub model, if you think about it. When you set up daemons, you basically set up something that listens for incoming connections, right? So what you can do is set up daemons with a URL, and we have this helper function called local_url(). If you do that, then what happens is that it sets up what you can think of as a base station. This listens for incoming connections from daemons that are started, right? As long as they have access to this URL, they can connect in, and then you have a worker that's connected.

So if I set this up, then this info function is going to come in handy. You can see that there are no connections at the moment. And what you can do is launch local daemons with this function, launch_local(). Okay, so launch. And you can say something like, you know, launch four local processes. And that basically goes ahead and does that. So if we look again, we have four processes.

So I'm just going to copy this thing here. This just runs a bunch of long tasks. And, okay, so there are two executing, five completed, and now all seven tasks have completed. But if you have a bunch of things and you realize you have a large backlog, then you can simply launch more daemons. So you can say, launch another four daemons. And now we have eight workers online. So any tasks that have been queued will automatically use the available workers.
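
A sketch of this scaling workflow, under the assumption of a recent mirai release that provides info(). The seven one-second tasks are illustrative, and the counts reported by info() depend on timing, so none are shown.

```r
library(mirai)

# Listen at a local URL without launching any daemons yet.
daemons(url = local_url())
info()

# Launch four daemons on this machine; they connect in to the URL.
launch_local(4)

# Queue seven tasks; they are distributed to daemons as these become free.
mp <- mirai_map(1:7, function(x) {
  Sys.sleep(1)
  x
})

# Add four more daemons while tasks are queued; waiting tasks are
# picked up by the new workers automatically.
launch_local(4)
info()

# Wait for everything to finish, then shut the daemons down.
res <- mp[]
daemons(0)
```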

Okay, wait, wait, pause for two seconds. We have produced daemons, we have created daemons automatically. Do we need to manually close them? Same with these processes that we're launching or these connections that we're launching. Do we need to explicitly close both daemons and connections?

Yes, so, I mean, daemons are connections; they're the same thing. It's always good practice to close them, and the way to do that is to set daemons to zero, which will close them straight away. But if you forget to close them and you just end your R session, they will all disappear as well. There's no danger of leaving hanging processes at all. As soon as the connection drops, they're all designed to terminate themselves.

Yes, but, you know, if you have a script or if you have a markdown document, it's always good to pair a daemons() call with a daemons(0) at the end.

Running daemons on remote machines via SSH

So apart from the scaling and the ability to add and subtract daemons at any time, the other thing is that you can run them anywhere. If you just do daemons and a number, this will just launch processes on your own machine. But those processes can also be on another machine, and there are different ways you can launch them. One is over SSH. One is via a cluster manager, if you have access to a cluster. The other is via Posit Workbench. So if you're lucky enough to be in an enterprise environment that actually has Workbench, you can easily launch workers as Workbench jobs.

But I won't cover that. I will try to cover SSH as an example. The way you do that is you call daemons and you create a URL using this helper, which is host_url(). Okay, so this creates a network socket which is available to other computers on your network. So other computers will be able to connect to this address, this URL, right? And if I look at the info again, nothing's connected yet. What I can do is create a remote configuration, and in this case, this will be an SSH configuration. Again, we have these helpers that really minimize what needs to be done. So here I create an SSH configuration with ssh_config(). There are other arguments, but the only required argument is this remotes argument, and that is simply the URL of the computer that you have SSH access to.

So to give you an example, let me switch to my terminal here. I'm actually connected to a VPN, so I have access. So I can SSH, right, where -p is the port that you connect to, and this is the address of the machine that you connect to. I have access to this machine over SSH, and now I'm actually running on this machine. So I have a MacBook, but this is connected to a Linux machine now.

But essentially that address there, which was 192.168.0.101, is the only information I need to set up this configuration in mirai. Okay, so once I have that, I can actually launch using launch_remote(), not launch_local(). So again, I can launch maybe two daemons, and I can pass that configuration in, okay?

Ah, so we have a timeout. This is a useful illustration because, as I was explaining, the way mirai works is you spin up this base station, right, which listens for incoming connections. That means the computer that you have SSH access to has to be able to dial in to the machine that you're running mirai on. And in this case, that's not possible because I don't have those ports open on my local laptop.

Fortunately, mirai allows you to connect using SSH tunneling. People tend to get confused by SSH tunneling, so I will try to explain it in as simple terms as possible. When you connect to a computer using SSH, that creates a tunnel immediately. So imagine there's a tunnel between the two sides. Now, instead of a connection being made directly between the two sides, each side connects to a local port. So on the mirai side, if I set daemons, I'm listening on a local port, and on the daemon side, that computer is dialing in to a local port. And then the tunnel bridges the two sides. So that computer no longer needs to open a connection directly to my computer.

And very concretely, if you follow this, you will get it. So I'm just going to shut down the previous daemons. And in this case, what I'm going to do is create a local URL, right? So this is not the URL that's open to other computers on the network. This is a local URL. I'm going to set tcp = TRUE because this is a TCP connection. If I use status(), this just gives me a little bit more information than info(), and you can see we're actually listening on this 127.0.0.1 address, which is the localhost address. So I'm now listening on a port on my own computer. And this is the port that the tunnel is going to be opened on.

I can then create this SSH configuration. And again, I connect to this address, which is 192.168.0.101. I actually need to use this particular port because that's how I've configured it, so I'm just going to connect to this port. And I just need to specify this argument, which is tunnel = TRUE, okay? And then if I try to launch now using this configuration, this should hopefully work. So yay, now I have two workers, and they are on another machine that I have that's in London. It's actually a very old converted Mac mini, but it's running Linux now. And to prove to you that it's running Linux, I'm going to run a mirai and just ask for Sys.info() from the worker, okay? And you can see that, hey, this is running Linux. And just to prove I'm not running Linux locally, this is my machine. It's definitely some form of Mac.
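
Put together, the tunneled setup described above might look like the following sketch. It cannot run without a machine reachable over SSH that has R and mirai installed. The address is the one mentioned in the session, but port 22 is a placeholder assumption, since the port actually used is not given.

```r
library(mirai)

# Listen on a local TCP port (127.0.0.1); the SSH tunnel is opened on it.
daemons(url = local_url(tcp = TRUE))

# SSH launch configuration. Port 22 is a placeholder, not the port
# used in the session. tunnel = TRUE means the daemon connects back
# through the SSH tunnel, not directly over the network.
cfg <- ssh_config(remotes = "ssh://192.168.0.101:22", tunnel = TRUE)

# Start two daemons on the remote machine.
launch_remote(2, remote = cfg)

# Ask a remote worker which operating system it is running on.
m <- mirai(Sys.info()[["sysname"]])
m[]

# Close the connections; the remote daemons then terminate themselves.
daemons(0)
```

Because the daemons dial in to a port on their own machine and the tunnel carries the traffic, no inbound ports need to be open on the laptop running the main session.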

Okay, that was a whirlwind, but if you got that, you'll understand how SSH tunneling works. And as you can see, all you had to do was create this configuration, which is just this URL. If you can access that machine over SSH, you have all the information you need to run mirai over SSH.

Q&A

Okay, Charlie, this was so much information all at one time. We have so many questions that we will not be able to get to all of them, but I want in our last five, four or five minutes here to see if we can get to some of them. So one of them from Brent was, I believe NumPy distributes calculations over multiple cores. So does mirai allow R to perform the same thing as well?

Yes, it does. But NumPy is very specific: it will only run the numerical work that NumPy handles. mirai allows you to run arbitrary R expressions. So anything you can do in R, you can put in a mirai. That's the difference.

Okay, Dan had also asked, does mirai ship with some kind of task viewing dashboard, like Python's Dask visualization, so that you don't have to type info over and over?

Not built in.

We had another question that was, can we use parallel::detectCores() to see how many daemons we can use? I really like this question because I don't know how many daemons I would have access to.

So mirai by design forces you to actually specify the number you want. That can be, you know, 14, which is the number of cores I have on my laptop. So you can use something like parallel::detectCores(). But you could easily be on a server with 256 cores or something, and sometimes you don't want to use all your cores because you might be running something else, right? An automatic function never has that context. The other reason is this: you might want to use all your cores if what you're doing is compute intensive. Fine. But if what you're doing is actually waiting on IO, like requesting a download, that is not taking all your computation power. So for those kinds of tasks, you might actually want to use more daemons than the number of cores you have. You might want to use 28, say, if all you're doing is querying some remote API. So that's another reason why I don't try to assume the number that people want.
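
One way to act on this advice is sketched below. The specific choices here (leaving one core free, doubling for IO-bound work, and capping at eight daemons for the demonstration) are illustrative assumptions, not recommendations from the session.

```r
library(mirai)

# Number of cores reported for this machine; may be NA on some platforms.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 2L

# Compute-bound work: roughly one daemon per core, leaving one core free.
n_compute <- max(1L, n_cores - 1L)

# IO-bound work (downloads, API calls): daemons mostly wait, so more
# daemons than cores can be useful.
n_io <- 2L * n_cores

# Capped at 8 here only so that the example stays small on large servers.
daemons(min(n_compute, 8L))
res <- mirai_map(1:4, function(x) x * 2)[]
daemons(0)
```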

So with our last two minutes, Zach said he had to leave, but he had a great question. He said, in typical data science work, would parallel processing be more useful than async processing, do you think?

Yes. And mirai actually powers a lot of things under the hood, right? So if you're using parallel purrr, that uses mirai: you set daemons, but you use the normal purrr syntax, and that is just parallel. It's not async. Where async really comes into its own is if you're using mirai with Shiny, or with something like plumber2. So again, mirai powers plumber2 under the hood. If you use an async function in plumber2, that implicitly uses mirai.
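
For reference, parallel purrr might look roughly like this. The sketch assumes purrr 1.1.0 or later, which provides in_parallel(), along with the packages it depends on for this feature; the squared-number task is illustrative.

```r
library(purrr)
library(mirai)

# Set daemons first; purrr sends the work to them through mirai.
daemons(4)

# Ordinary purrr syntax, with the function wrapped in in_parallel().
# The function must be self-contained, because it runs in another process.
res <- map(1:5, in_parallel(\(x) {
  Sys.sleep(1)
  x^2
}))

daemons(0)
```

The call to map() still blocks until all results are back, which is what makes this parallel and not async.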

Okay, I bet we can get in one more question. So Rob had asked, can you print progress slash percentages that only get shown on call? Like while things are running?

Yes, right. So very quickly, if you do a mirai map, one of the collection options, instead of just collecting, is .progress, and that gives you a progress indicator.
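
A minimal sketch of collecting with a progress indicator; the eight half-second tasks are an illustrative assumption.

```r
library(mirai)

daemons(4)

# Collect with [.progress] in place of [] to show a progress
# indicator while waiting for the results.
res <- mirai_map(1:8, function(x) {
  Sys.sleep(0.5)
  x
})[.progress]

daemons(0)
```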

I'm going through and just looking at all these questions. I think that I will get with Charlie since we are out of time, and we'll see if we can answer some of these. There is one from Notobeko, there's one from David Diaz, and they look like they could be answered pretty quickly. So we'll answer them in the chat after the session. Charlie will help.

The last thing I want to mention, oh, no, I'm out of time. You are out of time. There is a mirai skill. So if you folks out there have access to Claude Code or OpenCode or some great AI agent, install the r-lib skill mirai, and you can just invoke that. You can transform all your normal scripts into parallel and async code.

All right. Thank you so much, Charlie. That was amazing. You're getting thank yous in the chat. Thank you for being patient with us while we all try to muddle through what async is, what parallel is, what mirai does, what future and promises are. I think we all learned a lot and have a lot to like go Google and think about. I wish that I knew about this in undergrad and grad school. I definitely didn't, and it would have helped my quality of life with coding so much. So thank you so much. We will get with Charlie to answer some of the more simple questions afterwards in the chat. We will see you all on Thursday at the Data Science Hangout. If you would like, thank you, everybody. Have a fantastic week. I'll see you when you see you.

Thanks, Pauline. Thanks, everybody. Bye. Thank you, Charlie.

Featured software

mirai

mori

plumber

plumber2

Positron