I wrote an article in July 2020, Ruby 3 Fiber changes preview (in Chinese), and followed it up in August with another post, A Walkthrough of Ruby 3 Scheduler. Ruby 3 has shipped several releases in the months since, including ruby-3.0.0-preview1, ruby-3.0.0-preview2 and ruby-3.0.0-rc1, which bring many improvements to the Fiber Scheduler API.
But as I mentioned before, what Ruby 3 implements is only the interface. Ruby does not use a scheduler unless a scheduler implementation is supplied.
I have been very busy with work and study over the past four months, but I took some time in recent days to catch up with the API.
Suppose we have a pair of fds generated by IO.pipe. When we write Hello World to one of them, we could read it from the other side of the pipe. We would have code like this:
rd, wr = IO.pipe

wr.write("Hello World")
wr.close

message = rd.read(20)
puts message
rd.close
This program has lots of limitations. For example, you can't write a string longer than the pipe buffer: since the other side is not reading at the same time, the write would get stuck if the string is too long. You also have to write first; otherwise the read would get stuck. Of course, we could use multi-threading to solve this problem.
require 'thread'

rd, wr = IO.pipe

t1 = Thread.new do
  message = rd.read(20)
  puts message
  rd.close
end

t2 = Thread.new do
  wr.write("Hello World")
  wr.close
end

t1.join
t2.join
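The buffer limit that makes the single-threaded version hang can be observed directly. The following is a minimal sketch, not part of the original examples: `pipe_capacity` is a hypothetical helper that fills a pipe with `write_nonblock` while nobody reads from it, and reports how many bytes fit before the kernel refuses more.

```ruby
# Fill a pipe that nobody reads from and count how many bytes it accepts.
# pipe_capacity is a hypothetical helper written for this illustration.
def pipe_capacity(chunk_size = 4096)
  rd, wr = IO.pipe
  chunk = "x" * chunk_size
  written = 0
  loop do
    # With exception: false, a full buffer yields :wait_writable instead of raising.
    result = wr.write_nonblock(chunk, exception: false)
    break if result == :wait_writable
    written += result
  end
  written
ensure
  rd&.close
  wr&.close
end

puts "the pipe accepted #{pipe_capacity} bytes before it would block"
```

The exact number depends on the operating system (64 KiB is a common default on Linux). A blocking `wr.write` of anything larger, with no reader running, would wait forever.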
But as we all know, using threads to solve I/O problems is very inefficient. OS context switches are slow, and fair thread scheduling is still a hard problem in operating systems. An I/O-bound task does not need the CPU while it waits; all we need is to suspend it and resume it when the I/O is ready. In this case, all you need is the Ruby 3 scheduler.
require 'evt'

rd, wr = IO.pipe
scheduler = Evt::Scheduler.new

Fiber.set_scheduler scheduler

Fiber.schedule do
  message = rd.read(20)
  puts message
  rd.close
end

Fiber.schedule do
  wr.write("Hello World")
  wr.close
end

scheduler.run
In general, asynchronous code requires callbacks or keywords such as async and await. This is not necessary in Ruby 3. Ruby 3 identifies the common situations where a fiber would otherwise block (I/O multiplexing, waiting on a process, Kernel#sleep, and mutex locking) and exposes each of them as a scheduler hook, so performance can improve without adding any new keywords. My project evt is a scheduler that implements this Ruby 3 Scheduler interface.
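To make the idea of "interface versus implementation" concrete, here is a minimal sketch of a scheduler built on `IO.select`. It is not how evt works internally; `SelectScheduler` and `pipe_round_trip` are hypothetical names, timeouts passed to `io_wait` and `block` are ignored, and wake-ups from other threads are not handled. It assumes Ruby 3.0 or later on a platform where pipes are non-blocking.

```ruby
# A deliberately tiny Fiber scheduler on top of IO.select (illustration only).
class SelectScheduler
  def initialize
    @readable = {} # io => fiber waiting to read
    @writable = {} # io => fiber waiting to write
    @sleeping = {} # fiber => monotonic wake-up time
    @ready = []    # fibers woken up through #unblock
  end

  # Fiber.schedule delegates to this method.
  def fiber(&block)
    Fiber.new(blocking: false, &block).tap(&:resume)
  end

  # Hook: an I/O operation would block. Park the fiber until the io is ready.
  def io_wait(io, events, _timeout)
    @readable[io] = Fiber.current if (events & IO::READABLE) != 0
    @writable[io] = Fiber.current if (events & IO::WRITABLE) != 0
    Fiber.yield
    @readable.delete(io)
    @writable.delete(io)
    events
  end

  # Hook: Kernel#sleep.
  def kernel_sleep(duration = nil)
    @sleeping[Fiber.current] = clock + duration if duration
    Fiber.yield
  end

  # Hooks: Mutex, Queue and similar primitives.
  def block(_blocker, _timeout = nil)
    Fiber.yield
  end

  def unblock(_blocker, fiber)
    @ready << fiber
  end

  # Event loop: wait for readiness, then resume the fibers that can continue.
  def run
    until @readable.empty? && @writable.empty? && @sleeping.empty? && @ready.empty?
      rs, ws = IO.select(@readable.keys, @writable.keys, [], select_timeout)
      woken = @ready.shift(@ready.size)
      woken.concat(rs.map { |io| @readable[io] }) if rs
      woken.concat(ws.map { |io| @writable[io] }) if ws
      now = clock
      due = @sleeping.select { |_fiber, at| at <= now }.keys
      due.each { |fiber| @sleeping.delete(fiber) }
      (woken + due).compact.uniq.each { |fiber| fiber.resume if fiber.alive? }
    end
  end
  alias close run

  private

  def clock
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def select_timeout
    return 0 unless @ready.empty?
    wake = @sleeping.values.min
    wake && [wake - clock, 0].max
  end
end

# Send text through a pipe using two fibers and return what the reader received.
def pipe_round_trip(text)
  rd, wr = IO.pipe
  received = nil
  scheduler = SelectScheduler.new
  Fiber.set_scheduler(scheduler)
  Fiber.schedule { received = rd.read; rd.close }
  Fiber.schedule { wr.write(text); wr.close }
  scheduler.run
  received
ensure
  Fiber.set_scheduler(nil)
end

puts pipe_round_trip("Hello World")
```

Because the reader and writer take turns whenever one of them would block, this version is not limited by the pipe buffer size: a string much larger than the buffer also passes through. A native backend such as epoll or kqueue replaces the `IO.select` call in `run`; the hooks stay the same.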
Beyond the simple pipe example, here is an HTTP/1.1 server:
def handle_socket(socket)
  until socket.closed?
    line = socket.gets
    until line == "\r\n" || line.nil?
      line = socket.gets
    end
    socket.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
  end
end

Fiber.schedule do
  loop do
    socket, addr = @server.accept
    Fiber.schedule do
      handle_socket(socket)
    end
  end
end

@scheduler.run
We can see from this that the code is almost the same as synchronous code. All you need to do is set up the scheduler with Fiber.set_scheduler, and wrap in Fiber.schedule the work you would otherwise hand to threads. Finally, call scheduler.run to start the scheduler.
Backend support
io_uring Support
Not only has the Ruby API received many updates in recent months, but so has my scheduler, especially in its support for better I/O multiplexing backends. io_uring is available on Linux 5.4 and later. Because io_uring reduces the number of syscalls and can issue direct iov calls, it can achieve better performance than epoll, so supporting it is important. Direct iov support requires further changes to the Ruby Fiber scheduler; these were introduced by ioquatix in Ruby 3.0.0-preview2. What we need to implement comes in two parts. One of them is an epoll-compatible API:
VALUE method_scheduler_init(VALUE self) {
    int ret;
    struct io_uring* ring;
    ring = xmalloc(sizeof(struct io_uring));
    ret = io_uring_queue_init(URING_ENTRIES, ring, 0);
    if (ret < 0) {
        rb_raise(rb_eIOError, "unable to initialize io_uring");
    }
    rb_iv_set(self, "@ring", TypedData_Wrap_Struct(Payload, &type_uring_payload, ring));
    return Qnil;
}

VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
    VALUE ring_obj;
    struct io_uring* ring;
    struct io_uring_sqe *sqe;
    struct uring_data *data;
    short poll_mask = 0;
    ID id_fileno = rb_intern("fileno");

    int ruby_interest = NUM2INT(interest);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));

    /* POLLIN and POLLOUT are the poll(2) event flags from <poll.h>. */
    if (ruby_interest & readable) {
        poll_mask |= POLLIN;
    }

    if (ruby_interest & writable) {
        poll_mask |= POLLOUT;
    }
IOCP Support
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
    HANDLE iocp;
    VALUE iocp_obj = rb_iv_get(self, "@iocp");
    struct iocp_data* data;
    TypedData_Get_Struct(iocp_obj, HANDLE, &type_iocp_payload, iocp);
    int fd = NUM2INT(rb_funcallv(io, rb_intern("fileno"), 0, 0));
    HANDLE io_handler = (HANDLE)rb_w32_get_osfhandle(fd);
    int ruby_interest = NUM2INT(interest);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
    data = (struct iocp_data*) xmalloc(sizeof(struct iocp_data));
    data->io = io;
    data->is_poll = true;
    data->interest = 0;

    /* Record the requested events on the completion key, not on the Ruby argument. */
    if (ruby_interest & readable) {
        data->interest |= readable;
    }

    if (ruby_interest & writable) {
        data->interest |= writable;
    }

    HANDLE res = CreateIoCompletionPort(io_handler, iocp, (ULONG_PTR) data, 0);
    printf("IO at address: %p\n", (void *)data);

    return Qnil;
}

VALUE method_scheduler_wait(VALUE self) {
    ID id_next_timeout = rb_intern("next_timeout");
    ID id_push = rb_intern("push");
    VALUE iocp_obj = rb_iv_get(self, "@iocp");
    VALUE next_timeout = rb_funcall(self, id_next_timeout, 0);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));

    // for (ULONG i = 0; i < ulNumEntriesRemoved; i++) {
    //     OVERLAPPED_ENTRY entry = lpCompletionPortEntries[i];
    //     struct iocp_data *data = (struct iocp_data*) entry.lpCompletionKey;

    //     int interest = data->interest;
    //     VALUE obj_io = data->io;
    //     if (interest & readable) {
    //         rb_funcall(readables, id_push, 1, obj_io);
    //     } else if (interest & writable) {
    //         rb_funcall(writables, id_push, 1, obj_io);
    //     }

    //     xfree(data);
    // }

    return result;
}
But the I/O scheduler receives wrong pointers in the callback. After some research, I found that to support IOCP you have to open the I/O handle with the FILE_FLAG_OVERLAPPED flag. This may require changes to Ruby's win32/win32.c. At least I solved the problems with the IO.select fallback. That is fine for now, since nobody cares about Windows production performance…
kqueue Improvements
Another improvement is to macOS kqueue. kqueue on FreeBSD is good, but its performance on macOS is really weird. Since all of our I/O registrations are one-shot, I used the one-shot mode of kqueue to reduce the number of syscalls.
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
    struct kevent event;
    u_short event_flags = 0;
    ID id_fileno = rb_intern("fileno");
    int kq = NUM2INT(rb_iv_get(self, "@kq"));
    int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
    int ruby_interest = NUM2INT(interest);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));

    if (ruby_interest & readable) {
        event_flags |= EVFILT_READ;
    }

    if (ruby_interest & writable) {
        event_flags |= EVFILT_WRITE;
    }
In the end, we support almost all I/O multiplexing backends of the most commonly used operating systems:
| | Linux | Windows | macOS | FreeBSD |
| --- | --- | --- | --- | --- |
| io_uring | ✅ (See 1) | ❌ | ❌ | ❌ |
| epoll | ✅ (See 2) | ❌ | ❌ | ❌ |
| kqueue | ❌ | ❌ | ✅ (⚠️See 5) | ✅ |
| IOCP | ❌ | ❌ (⚠️See 3) | ❌ | ❌ |
| Ruby (IO.select) | ✅ Fallback | ✅ (⚠️See 4) | ✅ Fallback | ✅ Fallback |

1. when liburing is installed
2. when kernel version >= 2.6.8
3. WOULD NOT WORK until FILE_FLAG_OVERLAPPED is included in I/O initialization process.
4. Some I/Os are not able to be nonblock under Windows. See Scheduler Docs.
5. kqueue performance in Darwin is very poor. MAY BE DISABLED IN THE FUTURE.
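The support matrix can be read as a lookup from operating system to backend. The sketch below expresses that lookup in Ruby. It is only an illustration: `BACKENDS`, `os_family` and `pick_backend` are hypothetical names, the preference order within each OS is an assumption, and whether liburing is present is passed in by the caller rather than detected.

```ruby
require "rbconfig"

# Candidate backends per OS, in an ASSUMED order of preference.
# :select stands for the pure-Ruby IO.select fallback.
BACKENDS = {
  linux:   %i[io_uring epoll select],
  darwin:  %i[kqueue select],
  freebsd: %i[kqueue select],
  windows: %i[select]
}.freeze

# Map a host_os string (as found in RbConfig) to one of the keys above.
def os_family(host_os)
  case host_os
  when /linux/ then :linux
  when /darwin/ then :darwin
  when /freebsd/ then :freebsd
  when /mswin|mingw|cygwin/ then :windows
  else :unknown
  end
end

# Pick the first usable backend; io_uring only counts when liburing is available.
def pick_backend(host_os = RbConfig::CONFIG["host_os"], liburing: false)
  candidates = BACKENDS.fetch(os_family(host_os), [:select])
  candidates -= [:io_uring] unless liburing
  candidates.first
end

puts "backend for this machine (assuming no liburing): #{pick_backend}"
```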
Benchmark
How is the overall performance?
The benchmark was run with v0.2.2 and Ruby 3.0.0-rc1. See evt-server-benchmark for the test code; the test runs against a single-threaded server.
The test command is wrk -t4 -c8192 -d30s http://localhost:3001.
All of the systems have their file descriptor limit set to the maximum.
| OS | CPU | Memory | Backend | req/s |
| --- | --- | --- | --- | --- |
| Linux | Ryzen 2700x | 64GB | epoll | 54680.08 |
| Linux | Ryzen 2700x | 64GB | io_uring | 50245.53 |
| Linux | Ryzen 2700x | 64GB | IO.select (using poll) | 44159.23 |
| macOS | i7-6820HQ | 16GB | kqueue | 37855.53 |
| macOS | i7-6820HQ | 16GB | IO.select (using poll) | 28293.36 |
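To put the table in relative terms, the short script below divides each native backend's throughput by the IO.select result on the same machine. The numbers are copied from the table; `RESULTS` and `speedups` are names made up for this illustration.

```ruby
# Requests per second from the benchmark table, grouped by OS.
RESULTS = {
  "Linux" => { "epoll" => 54_680.08, "io_uring" => 50_245.53, "IO.select" => 44_159.23 },
  "macOS" => { "kqueue" => 37_855.53, "IO.select" => 28_293.36 }
}.freeze

# For every OS, express each native backend as a multiple of the IO.select baseline.
def speedups(results)
  results.transform_values do |backends|
    baseline = backends.fetch("IO.select")
    backends.reject { |name, _rps| name == "IO.select" }
            .transform_values { |rps| (rps / baseline).round(2) }
  end
end

speedups(RESULTS).each do |os, ratios|
  ratios.each { |backend, ratio| puts "#{os} #{backend}: #{ratio}x the IO.select throughput" }
end
```

On these machines the native backends come out roughly 14% to 34% ahead of the IO.select fallback, so the fallback itself already accounts for most of the throughput.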
Very impressive. The improvements come from several aspects. Current async frameworks like Falcon use nio4r, whose backend is libev. The performance of libev is average because it is designed for extreme compatibility. Existing async frameworks also require lots of meta-programming. This extension, by contrast, is written almost entirely in C, with only the features the scheduler needs.
Compared with my previous tests on preview 1, this version uses persistent (keep-alive) connections, and Ruby's non-blocking I/O has also been fixed a lot. The wrk results are very error-sensitive. All of these things make our performance 10 times faster than what we had 3 months ago.
wrk is very error-sensitive, and the parser in the benchmark is incorrect: it cannot close the socket properly. I updated my Midori to a Ruby 3 Scheduler project, and its performance reaches 247k req/s with kqueue and 647k req/s with epoll, which is more than 100 times faster than blocking I/O.
Combining with Ractor
I also wrote a post in November about Ractor, Ruby 3 Ractor Dev Guide (in Chinese). Combining Fiber with Ractor is always an interesting thing. We have two routes for that:
Accept connections in the main Ractor, and dispatch the requests to sub-Ractors. After the results are transferred back, return them from the main Ractor with the scheduler.
Use the Linux SO_REUSEPORT feature to let all Ractors listen on the port at the same time, which is very easy to handle with existing server architectures.
Unfortunately, neither of these is functioning correctly now. Some Fiber features are not available in Ractor. I suppose this is a bug, and have submitted a patch, GitHub #3971. According to my previous benchmarks, Ractor may increase performance about 4 times by using multiple cores.
But since API servers are usually stateless, these improvements could also be achieved with multiple processes. Ractor's major contribution may be lower memory consumption.
I will test it with future Ruby 3.0 updates.
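The SO_REUSEPORT route can be tried without Ractor at all. The sketch below opens several listening sockets on one port inside a single process, to show that the second bind succeeds once the option is set. `reuseport_listener` and `shared_port_listeners` are hypothetical helpers, and the example assumes a platform that defines `Socket::SO_REUSEPORT` (Linux, macOS and the BSDs do).

```ruby
require "socket"

# Open a listening TCP socket with SO_REUSEPORT set before bind.
def reuseport_listener(port, host = "127.0.0.1")
  sock = Socket.new(:INET, :STREAM)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_REUSEPORT, true)
  sock.bind(Addrinfo.tcp(host, port))
  sock.listen(128)
  sock
end

# Create `count` listeners that all share one kernel-chosen port.
def shared_port_listeners(count)
  first = reuseport_listener(0) # port 0 lets the kernel pick a free port
  port = first.local_address.ip_port
  [first] + Array.new(count - 1) { reuseport_listener(port) }
end

listeners = shared_port_listeners(2)
puts "#{listeners.size} listeners share port #{listeners.first.local_address.ip_port}"
listeners.each(&:close)
```

In a multi-process or multi-Ractor server, each worker would open its own listener this way and run its own accept loop; on Linux the kernel then distributes incoming connections among the listeners.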
Conclusion
We achieved a 10 times performance improvement compared with preview 1, and are almost 36 times faster than blocking I/O. The major performance issue of Ruby servers is blocking I/O rather than VM performance. With the I/O scheduler included, we can bring the I/O performance of Ruby 3 into a new era. The future work is to wait for updates to some C extension libraries, such as database connectors. Then, if we use an async scheduler with a Fiber-based web server like Falcon, you don't have to change anything in your business code to get dozens of times of performance improvement.