Facebook: www.facebook.com/100009201465316/videos/2530411593942198
Bilibili mirror: https://www.bilibili.com/video/av845257746/
The Malay drink names are fascinating, though. A friend currently living in Singapore, @david92, explained to me how to order tea in a shop. It is essentially a compositional grammar, and a very regular one. The base is black tea (Teh) or coffee (Kopi). The default drink comes with sugar and condensed milk, but you can customize it: appending O means no condensed milk, while C swaps the condensed milk for fresh milk. Similarly, Kosong means no sugar, Siu Dai means less sugar, and Gah Dai means extra sugar.
I read up on the idea a bit more and found that the scheme is entirely "compilable", and quite simple. I quickly wrote a BNF grammar to describe it:
<water> ::= "Kopi" | "Teh" | "Milo"
<sugar> ::= "Kosong" | "Siu Dai" | "Gah Dai"
<milk> ::= "O" | "C"
<thickness> ::= "Po" | "Gau"
<extra> ::= "Peng" | "Bubble" | "Halia"
<upsize> ::= "Nga Lat"
<takeout> ::= "Bungkus"
<plastic> ::= "Ikat"
<knot> ::= "Mati" | "Tepi"
<drink> ::=
(<takeout> (" " <plastic> (" " <knot>)?)? " ")?
(<upsize> " ")?
<water> (" " <milk>)? (" " <sugar>)?
(" " <thickness>)?
(" " <extra>)*
Here Po means weak and Gau means extra strong; Peng adds ice, Bubble adds tapioca pearls, and Halia adds ginger juice. Nga Lat means upsized. Takeout is a bit more involved: Bungkus means takeout, normally in a cup. The plastic-bag packaging seen in Bernard Tee's video is called Ikat, and the bag can be tied two ways: with a dead knot (Mati), or tied at the side with an opening (Tepi).
For example, Bungkus Ikat Mati Nga Lat Kopi O Siu Dai Gau Peng Bubble is an upsized takeout in a dead-knotted plastic bag: less-sugar extra-strong black coffee with ice and pearls.
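Before building the real parser, the trickiest property of this grammar, multi-word tokens such as Siu Dai and Nga Lat, can be checked with a greedy longest-match tokenizer. This is a hypothetical helper for illustration only, not part of the program below:

```ruby
# Vocabulary from the grammar above, sorted longest-first so greedy matching
# prefers "Gah Dai" or "Nga Lat" over any shorter prefix.
TOKENS = [
  "Bungkus", "Ikat", "Mati", "Tepi", "Nga Lat",
  "Kopi", "Teh", "Milo", "O", "C",
  "Kosong", "Siu Dai", "Gah Dai", "Po", "Gau",
  "Peng", "Bubble", "Halia"
].sort_by { |t| -t.length }

def tokenize(order)
  rest = order.strip
  result = []
  until rest.empty?
    tok = TOKENS.find { |t| rest.start_with?(t) }
    raise ArgumentError, "unknown token near #{rest.inspect}" unless tok
    result << tok
    rest = rest[tok.length..].to_s.lstrip
  end
  result
end

p tokenize("Bungkus Ikat Mati Nga Lat Kopi O Siu Dai Gau Peng Bubble")
# => ["Bungkus", "Ikat", "Mati", "Nga Lat", "Kopi", "O", "Siu Dai", "Gau", "Peng", "Bubble"]
```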
We can test this grammar in an EBNF playground. The site can even generate random strings conforming to a given rule (here, Drink) so we can check the grammar by eye. So by this point we already have an automatic drink-name generator: https://www.bilibili.com/video/BV1SK4y197hc
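Such a generator is also easy to sketch in plain Ruby: sample each optional production independently and join the chosen terminals. The probabilities below are arbitrary choices of mine, not part of any spec:

```ruby
# Random drink-name generator following the Drink rule above.
# Note the grammar's nesting: Ikat only appears under Bungkus,
# and a knot only appears under Ikat.
def random_drink(rng = Random.new)
  parts = []
  if rng.rand < 0.5
    parts << "Bungkus"
    if rng.rand < 0.5
      parts << "Ikat"
      parts << ["Mati", "Tepi"].sample(random: rng) if rng.rand < 0.5
    end
  end
  parts << "Nga Lat" if rng.rand < 0.5
  parts << ["Kopi", "Teh", "Milo"].sample(random: rng)
  parts << ["O", "C"].sample(random: rng) if rng.rand < 0.5
  parts << ["Kosong", "Siu Dai", "Gah Dai"].sample(random: rng) if rng.rand < 0.5
  parts << ["Po", "Gau"].sample(random: rng) if rng.rand < 0.5
  rng.rand(0..2).times { parts << ["Peng", "Bubble", "Halia"].sample(random: rng) }
  parts.join(" ")
end

puts random_drink   # e.g. "Nga Lat Teh C Peng"
```

Seeding the generator with `Random.new(some_seed)` makes the output reproducible, which is handy when eyeballing the grammar.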
Ruby has a gem called ebnf that reads an EBNF file and generates the corresponding parser; we then only need to write a generator that emits Chinese to translate the Malay drink names.
Since EBNF itself has no single standardized syntax, the Ruby ebnf gem differs slightly from the playground grammar above, so a few changes were made.
require "ebnf"
TEA_GRAMMER = <<-EOF
Water ::= "Kopi" | "Teh" | "Milo"
Sugar ::= "Kosong" | "Siu Dai" | "Gah Dai"
Milk ::= "O" | "C"
Thickness ::= "Po" | "Gau"
Extra ::= "Peng" | "Bubble" | "Halia"
Upsize ::= "Nga Lat"
Knot ::= "Mati" | "Tepi"
Plastic ::= "Ikat" Knot?
Takeout ::= "Bungkus" Plastic?
Drink ::= Takeout? Upsize? Water Milk? Sugar? Thickness? Extra*
EOF
Since this library must first convert the grammar into a Parsing Expression Grammar (PEG) before parsing, which generates extra sub-rules, let's print the auto-generated sub-rules to make developing the generator easier.
EBNF.parse(TEA_GRAMMER).make_peg.ast
=begin
(rule Water (alt "Kopi" "Teh" "Milo"))
(rule Sugar (alt "Kosong" "Siu Dai" "Gah Dai"))
(rule Milk (alt "O" "C"))
(rule Thickness (alt "Po" "Gau"))
(rule Extra (alt "Peng" "Bubble" "Halia"))
(rule Upsize (seq "Nga Lat"))
(rule Knot (alt "Mati" "Tepi"))
(rule Plastic (seq "Ikat" _Plastic_1))
(rule _Plastic_1 (opt Knot))
(rule Takeout (seq "Bungkus" _Takeout_1))
(rule _Takeout_1 (opt Plastic))
(rule Drink (seq _Drink_1 _Drink_2 Water _Drink_3 _Drink_4 _Drink_5 _Drink_6))
(rule _Drink_1 (opt Takeout))
(rule _Drink_2 (opt Upsize))
(rule _Drink_3 (opt Milk))
(rule _Drink_4 (opt Sugar))
(rule _Drink_5 (opt Thickness))
(rule _Drink_6 (star Extra))
=end
We create a generator class as follows:
class MalayTea
include EBNF::PEG::Parser
attr_reader :rules
def initialize
@rules = EBNF.parse(TEA_GRAMMER).make_peg.ast
end
def evaluate(input)
parse(input, :Drink, @rules)
end
end
p MalayTea.new.evaluate(gets.chomp)
Now when you type in a sentence, it dutifully prints the AST (abstract syntax tree). All we have to do is reduce the tree by rule name until we reach the desired output.
production(:_Drink_1, clear_packrat: true) do |value|
if value.nil?
""
elsif value.length == 1
"外带"
else
"外带#{value[-1].values.join}"
end
end
production(:Plastic, clear_packrat: true) do |value|
if value.nil?
""
elsif value.length == 1
"塑料袋装"
else
"塑料袋装#{value[-1].values.join}"
end
end
production(:_Plastic_1, clear_packrat: true) do |value|
value.nil? ? "" : { Mati: "打死结", Tepi: "侧面打结"}[value.to_sym]
end
production(:_Drink_2, clear_packrat: true) do |value|
value.nil? ? "" : "大杯"
end
production(:Water, clear_packrat: true) do |value|
{ Kopi: "咖啡", Teh: "红茶", Milo: "美禄" }[value.to_sym]
end
production(:_Drink_3, clear_packrat: true) do |value|
value.nil? ? "炼乳" : { O: "", C: "鲜奶" }[value.to_sym]
end
production(:_Drink_4, clear_packrat: true) do |value|
value.nil? ? "" : { Kosong: "无糖", "Siu Dai": "少糖", "Gah Dai": "加糖"}[value.to_sym]
end
production(:_Drink_5, clear_packrat: true) do |value|
value.nil? ? "" : { Gau: "浓缩", Po: "清淡"}[value.to_sym]
end
production(:_Drink_6, clear_packrat: true) do |value|
extras = value.map do |a|
{ Peng: "冰块", Bubble: "珍珠", Halia: "姜汁" }[a.to_sym]
end
extras.empty? ? "" : "加#{extras.join}"
end
In the last reduction step we need to special-case fresh-milk black tea and fresh-milk coffee, because in Chinese these are normally just called 奶茶 (milk tea) and 咖啡拿铁 (caffè latte).
production(:Drink, clear_packrat: true) do |value|
h = value.inject(&:merge)
if h[:Water] == "红茶" and h[:_Drink_3] == "鲜奶"
h[:Water] = "奶茶"
h[:_Drink_3] = ""
elsif h[:Water] == "咖啡" and h[:_Drink_3] == "鲜奶"
h[:Water] = "咖啡拿铁"
h[:_Drink_3] = ""
end
"#{h[:_Drink_1]}#{h[:_Drink_2]}#{h[:_Drink_4]}#{h[:_Drink_3]}#{h[:_Drink_5]}#{h[:Water]}#{h[:_Drink_6]}"
end
With that, we have a compiler (well, a translator) from Malay drink names to Chinese.
The complete code is as follows:
require "ebnf"
TEA_GRAMMER = <<-EOF
Water ::= "Kopi" | "Teh" | "Milo"
Sugar ::= "Kosong" | "Siu Dai" | "Gah Dai"
Milk ::= "O" | "C"
Thickness ::= "Po" | "Gau"
Extra ::= "Peng" | "Bubble" | "Halia"
Upsize ::= "Nga Lat"
Knot ::= "Mati" | "Tepi"
Plastic ::= "Ikat" Knot?
Takeout ::= "Bungkus" Plastic?
Drink ::= Takeout? Upsize? Water Milk? Sugar? Thickness? Extra*
EOF
class MalayTea
include EBNF::PEG::Parser
attr_reader :rules
production(:_Drink_1, clear_packrat: true) do |value|
if value.nil?
""
elsif value.length == 1
"外带"
else
"外带#{value[-1].values.join}"
end
end
production(:Plastic, clear_packrat: true) do |value|
if value.nil?
""
elsif value.length == 1
"塑料袋装"
else
"塑料袋装#{value[-1].values.join}"
end
end
production(:_Plastic_1, clear_packrat: true) do |value|
value.nil? ? "" : { Mati: "打死结", Tepi: "侧面打结"}[value.to_sym]
end
production(:_Drink_2, clear_packrat: true) do |value|
value.nil? ? "" : "大杯"
end
production(:Water, clear_packrat: true) do |value|
{ Kopi: "咖啡", Teh: "红茶", Milo: "美禄" }[value.to_sym]
end
production(:_Drink_3, clear_packrat: true) do |value|
value.nil? ? "炼乳" : { O: "", C: "鲜奶" }[value.to_sym]
end
production(:_Drink_4, clear_packrat: true) do |value|
value.nil? ? "" : { Kosong: "无糖", "Siu Dai": "少糖", "Gah Dai": "加糖"}[value.to_sym]
end
production(:_Drink_5, clear_packrat: true) do |value|
value.nil? ? "" : { Gau: "浓缩", Po: "清淡"}[value.to_sym]
end
production(:_Drink_6, clear_packrat: true) do |value|
extras = value.map do |a|
{ Peng: "冰块", Bubble: "珍珠", Halia: "姜汁" }[a.to_sym]
end
extras.empty? ? "" : "加#{extras.join}"
end
production(:Drink, clear_packrat: true) do |value|
h = value.inject(&:merge)
if h[:Water] == "红茶" and h[:_Drink_3] == "鲜奶"
h[:Water] = "奶茶"
h[:_Drink_3] = ""
elsif h[:Water] == "咖啡" and h[:_Drink_3] == "鲜奶"
h[:Water] = "咖啡拿铁"
h[:_Drink_3] = ""
end
"#{h[:_Drink_1]}#{h[:_Drink_2]}#{h[:_Drink_4]}#{h[:_Drink_3]}#{h[:_Drink_5]}#{h[:Water]}#{h[:_Drink_6]}"
end
def initialize
@rules = EBNF.parse(TEA_GRAMMER).make_peg.ast
end
def evaluate(input)
parse(input, :Drink, @rules)
end
end
loop do
puts "输入饮料名:"
print "> "
puts "< #{MalayTea.new.evaluate(gets.chomp)}"
end
Let's run it:
❯ ruby main.rb
输入饮料名:
> Bungkus Ikat Mati Nga Lat Kopi O Siu Dai Gau Peng Bubble
< 外带塑料袋装打死结大杯少糖浓缩咖啡加冰块珍珠
Feel free to take this program with you next time you go for tea in Malaysia or Singapore.
krpc

krpc is a plugin for Kerbal Space Program that lets you control the game via RPC; it also has a third-party Ruby client, krpc-rb. Unlike flying a real aircraft, flying a plane in Kerbal is quite painful. Without helper mods you cannot see your exact GPS coordinates; much of what you do see are orbital parameters irrelevant to flying an airplane, while some key control parameters are hard to obtain. The game also has no autopilot. Landing a spaceplane is especially hard to control: the touchdown itself is not a big problem, since the game is quite tolerant of hard landings, but steering the plane toward the airport is a long and painful process. So let's try to use krpc to implement the autopilot features that any ordinary airliner has.
PID is the most basic and most widely used control algorithm in automation.
Suppose we want to control a car's throttle so that it reaches a target speed. The most direct idea is to adjust the throttle in proportion to how far we are from the target: the closer we get, the lighter we press the pedal.
But this creates a problem. What we actually control is the fuel flow, and there is a lag between throttle and the resulting acceleration, so the speed may keep rising for a few moments after we ease off. Meanwhile, as the car speeds up, air drag grows, so the speed soon falls back instead of settling into equilibrium with the acceleration. The result is that we oscillate back and forth around the target speed. Depending on how aggressive the proportional gain is, the amplitude varies; in the worst case the oscillation grows until the system is completely out of control.
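The oscillation is easy to reproduce in a toy simulation. All constants below are invented for illustration: a proportional-only throttle with a few ticks of actuation lag overshoots the target before settling back:

```ruby
target = 30.0                  # target speed, m/s
dt = 0.1                       # simulation step, s
kp = 1.0                       # aggressive proportional gain
v = 0.0
delay_line = [0.0] * 5         # throttle only takes effect 5 ticks after commanded
history = []
300.times do
  command = (kp * (target - v)).clamp(0.0, 1.0)
  delay_line.push(command)
  applied = delay_line.shift
  v += (applied * 3.0 - 0.02 * v) * dt  # up to 3 m/s^2 of thrust, linear drag
  history << v
end
puts history.max               # peaks above 30 before drag pulls it back down
```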
The most direct fix is an integral term: we react not only to the current error but also to the error accumulated over time, since a persistently large error suggests we need more throttle.
Finally, real control must also cope with disturbances: we may need to react to how fast the error has been changing recently. For instance, if a jolt makes the speed drop sharply, we should quickly add throttle to cancel the error. That means adding a derivative term. Combining all three gives the formula \(u(t) = K_pe(t) + K_i\int_0^te(\tau)d\tau+K_d\frac{d}{dt}e(t)\), forming a general-purpose P (proportional) I (integral) D (derivative) controller. General-purpose as it is, tuning the gain of each term well is not easy. We will use this controller to drive the aircraft's throttle, roll, and pitch separately.
In Ruby it looks like this:
class PIDController
def initialize(kp, ki, kd, clip_min=0.0, clip_max=1.0)
@prev_err = 0.0
@integral = 0.0
@kp = kp
@ki = ki
@kd = kd
@clip_min = clip_min
@clip_max = clip_max
@last_frame = 0.0
end
def trigger(goal, measured)
trigger_err(goal - measured)
end
def trigger_err(err)
current_frame = Time.now.to_f
dt = current_frame - @last_frame
if dt > 1.0
@last_frame = current_frame
return 0.0
end
@integral = @integral + err * dt
@integral = @clip_min if @integral < @clip_min
@integral = @clip_max if @integral > @clip_max
d = (err - @prev_err) / dt
res = @kp * err + @ki * @integral + @kd * d
@prev_err = err
@last_frame = current_frame
return @clip_min if res < @clip_min
return @clip_max if res > @clip_max
res
end
end
The actual implementation differs slightly from the formula: the integral term is clamped so it can never exceed the maximum controllable value or fall below the minimum (a simple anti-windup measure). Otherwise, when the error and the control output are on different orders of magnitude, the controller often becomes either hard to control or very sluggish.
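The clamping behavior is easiest to demonstrate with a deterministic, fixed-timestep variant of the controller above; passing dt in explicitly instead of measuring Time.now is an assumption made only for this example:

```ruby
# Fixed-timestep PID variant: same math as PIDController above, but dt is an
# explicit argument so the demonstration is reproducible.
class FixedStepPID
  def initialize(kp, ki, kd, clip_min = 0.0, clip_max = 1.0)
    @kp, @ki, @kd = kp, ki, kd
    @clip_min, @clip_max = clip_min, clip_max
    @integral = 0.0
    @prev_err = 0.0
  end

  def step(err, dt)
    @integral = (@integral + err * dt).clamp(@clip_min, @clip_max) # anti-windup
    d = (err - @prev_err) / dt
    @prev_err = err
    (@kp * err + @ki * @integral + @kd * d).clamp(@clip_min, @clip_max)
  end
end

pid = FixedStepPID.new(0.5, 0.1, 0.0)
p pid.step(100.0, 0.1)  # => 1.0  (output saturates at clip_max)
p pid.step(0.5, 0.1)    # proportional 0.25 plus integral 0.1 (integral clamped at 1.0)
```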
Takeoff control is fairly simple. With SAS keeping us stable, we only need to accelerate to rotation speed, pitch the nose up to 10 degrees, retract the landing gear, and finally climb to the given altitude.
class TakeoffProcess
def initialize(vessel, vr, velocity, height)
@vessel = vessel
@control = vessel.control
@vr = vr
@velocity = velocity
@height = height
@throttle_controller = PIDController.new(0.1, 0.01, 0.2)
@pitch_controller = PIDController.new(0.05, 0.05, 0.1, -1.0, 1.0)
end
def run
@control.brakes = false
@control.sas = true
loop do
orbit = @vessel.flight(@vessel.orbit.body.reference_frame)
surface = @vessel.flight(@vessel.surface_reference_frame)
break unless orbit.speed < @vr
throttle = @throttle_controller.trigger(@velocity, orbit.speed)
@control.throttle = throttle
end
# Rotate
puts "Rotate, Gear Up!"
@control.gear = false
loop do
orbit = @vessel.flight(@vessel.orbit.body.reference_frame)
surface = @vessel.flight(@vessel.surface_reference_frame)
break unless orbit.mean_altitude < @height - 100
throttle = @throttle_controller.trigger(@velocity, orbit.speed)
@control.throttle = throttle
pitch = @pitch_controller.trigger(10, surface.pitch)
@control.pitch = pitch
end
puts "Takeoff Process Finished."
end
end
The tricky part is that KSP has a notion of reference_frame, and many parameters are relative to a given reference frame. Ask for speed, for example, and you may get orbital speed or you may get surface speed. Or take pitch: relative to the orbital frame, pitch should in theory stay near 0, since the orbit itself changes as you pitch. So be careful which frame you compute in.
After takeoff, the most important function of an autopilot is to follow the flight plan preset in the flight computer. A waypoint generally has a few key parameters: altitude, speed, and latitude/longitude. Altitude and speed are handled much as during takeoff, but turning is more complicated.
First we need to determine how many degrees the aircraft must turn to point at the target, i.e. compute the bearing. Consider first a flat earth, with longitude and latitude as coordinates on the x and y axes: how would we compute the bearing? The direction vector is easy: \((x, y) = (x_b-x_a, y_b-y_a)\) so the bearing (measured clockwise from due north) satisfies \(\tan(\theta)=\frac{x_b-x_a}{y_b-y_a}\), i.e. \(\theta = \arctan(\frac{x_b-x_a}{y_b-y_a})\). On a sphere, however, the problem is harder: essentially we need the distance between two arbitrary points along a great circle, which calls for the haversine formula, a key result of spherical trigonometry.
By the haversine theorem we have \(hav(c) = hav(a-b)+\sin(a)\sin(b)\,hav(C)\) where \(hav(\theta)=\sin^2\frac{\theta}{2}=\frac{1-\cos\theta}{2}\)
From this we can further derive:
\(A = (lat_a, lng_a)\)
\(B = (lat_b, lng_b)\)
\(N = (\frac{\pi}{2}, 0)\)
\(hav(NB) = hav(AB-AN)+sin(AB)sin(AN)hav(\angle NAB)\)
Grinding this out by hand is nauseating; the formula gets so long that Wolfram Alpha flatly refuses to parse it. So I looked up a precomputed result instead:
\(tan(\theta)=\frac{|lng_b-lng_a|}{ln(\frac{tan(\frac{lat_B}{2}+\frac{\pi}{4})}{tan(\frac{lat_A}{2}+\frac{\pi}{4})})}\)
Note the special case \(lat_a = lat_b\), where the denominator is 0. For such cases most programming languages provide an atan2(y, x) function, which returns 90 degrees when x = 0 (and y > 0). Finally we have the code:
delta_phi = Math.log(Math.tan((@latitude / 180 * Math::PI) / 2 + Math::PI / 4) / Math.tan((orbit.latitude / 180 * Math::PI) / 2 + Math::PI / 4))
delta_lon = (@longitude - orbit.longitude) / 180 * Math::PI
theta = Math.atan2(delta_lon, delta_phi)
target_heading = theta * 180 / Math::PI
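Wrapped into a standalone function (same formula as above, inputs in degrees), the edge cases are easy to verify:

```ruby
include Math

# Rhumb-line bearing from A to B, in degrees clockwise from due north.
def bearing(lat_a, lng_a, lat_b, lng_b)
  rad = PI / 180
  delta_phi = log(tan(lat_b * rad / 2 + PI / 4) / tan(lat_a * rad / 2 + PI / 4))
  delta_lon = (lng_b - lng_a) * rad
  theta = atan2(delta_lon, delta_phi) * 180 / PI
  theta < 0 ? theta + 360 : theta
end

puts bearing(0.0, 0.0, 0.0, 90.0)    # due east, the atan2(y, 0) case: ~90
puts bearing(10.0, 0.0, 0.0, 0.0)    # due south: ~180
puts bearing(10.0, 20.0, 30.0, 10.0) # north-west quadrant: between 270 and 360
```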
An aircraft has two ways to turn: yaw, or a combination of roll and pitch. Yaw torque usually comes from the vertical stabilizer, while roll and pitch come from the ailerons and the horizontal stabilizer. The ailerons and horizontal stabilizer are much larger than the vertical stabilizer, so their control is faster and more responsive. In fact we only need to command roll: when the aircraft rolls, its effective lift surface shrinks, lift drops, and the plane descends; to hold altitude, our altitude-based pitch controller naturally pitches up, which produces a torque in the yaw direction.
Two caveats apply: in the control program I clamp the pitch command to the range [-5, 10] degrees, and the roll to within ±25 degrees. The full program is as follows:
class WaypointProcess
def initialize(vessel, velocity, height, latitude, longitude)
@vessel = vessel
@control = vessel.control
@velocity = velocity
@height = height
@longitude = longitude
@latitude = latitude
@throttle_controller = PIDController.new(0.1, 0.01, 0.2)
@pitch_controller = PIDController.new(0.05, 0.05, 0.1, -1.0, 1.0)
@roll_controller = PIDController.new(0.0005, 0.01, 0.7, -1.0, 1.0)
end
def run
@control.sas = false
loop do
orbit = @vessel.flight(@vessel.orbit.body.reference_frame)
surface = @vessel.flight(@vessel.surface_reference_frame)
break if (orbit.latitude - @latitude).abs < 1e-4 and (orbit.longitude - @longitude).abs < 1e-4
throttle = @throttle_controller.trigger(@velocity, orbit.speed)
@control.throttle = throttle
pitch = @pitch_controller.trigger(@height, orbit.mean_altitude)
if pitch > 0.0 and surface.pitch > 10
pitch = -0.1
elsif pitch < 0.0 and surface.pitch < -5
pitch = 0.1
end
@control.pitch = pitch
delta_phi = Math.log(Math.tan((@latitude / 180 * Math::PI) / 2 + Math::PI / 4) / Math.tan((orbit.latitude / 180 * Math::PI) / 2 + Math::PI / 4))
delta_lon = (@longitude - orbit.longitude) / 180 * Math::PI
theta = Math.atan2(delta_lon, delta_phi)
target_heading = theta * 180 / Math::PI
target_heading = 360 + target_heading if target_heading < 0
delta_heading = target_heading - surface.heading
delta_heading = 360 - delta_heading if delta_heading > 180
delta_heading = delta_heading + 360 if delta_heading < -180
bank_angle = delta_heading
bank_angle = 25 if delta_heading > 25
bank_angle = -25 if delta_heading < -25
roll = @roll_controller.trigger(bank_angle, surface.roll)
@control.roll = roll
current_roll = surface.roll
end
puts "Waypoint lat: #{@latitude}, lng: #{@longitude} Reached."
end
end
We use the following flight plan:
require 'matrix'
require 'krpc'
require './libs/controller/pid'
require './libs/process/takeoff'
require './libs/process/waypoint'
KRPC.connect do |client|
vessel = client.space_center.active_vessel
# VR: 100m/s, Climb at 390 knots to 2000m
TakeoffProcess.new(vessel, 100, 200, 2000).run
# Fly to North pole (N90, S0.0) at 330 knots at FL300
WaypointProcess.new(vessel, 170, 7000, 90.0, 0.0).run
end
Rotate at 100 m/s, climb at 390 knots to 2000 m, then climb toward 30,000 ft while turning for the North Pole. The test worked perfectly.
I tested with the stock 强翼 A300, a craft designed as a heavy transport for atmospheric flight. Since Kerbin's atmosphere is thinner than Earth's, the plane could barely hold 300 knots by FL300; a spaceplane used for the same test has no such problem.
The current PID tuning is fairly conservative, and the altitude control oscillates quite a bit. It can be tuned per aircraft; with Ferram Aerospace Research installed you can read an actual lift coefficient, and controlling altitude through the lift coefficient is the most reliable method, which is also what real aircraft do.
It would be even more fun to fetch the airport's GPS coordinates next and implement automatic approach and landing.
Even so, the project was well worth it: besides practicing Ruby, I got to study solid geometry, physics, flight dynamics, and automatic control.
I wrote an article in July 2020, Ruby 3 Fiber changes preview (in Chinese),
and followed up by another post in August A Walkthrough of Ruby 3 Scheduler.
Ruby 3 has shipped several versions during these months, including ruby-3.0.0-preview1, ruby-3.0.0-preview2, and ruby-3.0.0-rc1, which bring lots of improvements to the Fiber Scheduler API.
But as I mentioned before, what Ruby 3 implements is only the interface: the scheduler is not used unless a scheduler implementation is plugged in.
I was very busy working and studying in the past four months, but in recent days I took some time to catch up with the API.
GitHub: Evt
Suppose we have a pair of fds generated by IO.pipe
. When we write Hello World
to one of them, we could read it from the other side of the pipe.
We would have code like this:
rd, wr = IO.pipe
wr.write("Hello World")
wr.close
message = rd.read(20)
puts message
rd.close
This program has lots of limitations. For example, you can't write a string longer than the pipe buffer: since the other side is not reading at the same time, the write would get stuck if the string is too long. You also have to write first, otherwise the read would get stuck as well. Of course, we could use multithreading to solve this problem.
require 'thread'
rd, wr = IO.pipe
t1 = Thread.new do
message = rd.read(20)
puts message
rd.close
end
t2 = Thread.new do
wr.write("Hello World")
wr.close
end
t1.join
t2.join
But as we all know, using threads to solve I/O problems is very inefficient. OS context switches are slow, and fair thread scheduling is still a very hard problem in OS research. An I/O problem is not CPU-bound: all we need is to suspend the task and wait for the proper callback. In this case, all you need is the Ruby 3 scheduler.
require 'evt'
rd, wr = IO.pipe
scheduler = Evt::Scheduler.new
Fiber.set_scheduler scheduler
Fiber.schedule do
message = rd.read(20)
puts message
rd.close
end
Fiber.schedule do
wr.write("Hello World")
wr.close
end
scheduler.run
In general, an async function requires keywords like callback, async, or await. This is not necessary in Ruby 3. Ruby 3 enumerates all the common situations where you need async behavior: I/O multiplexing, waiting on processes, kernel sleep, and mutexes. It exposes all of these as scheduler hooks, improving performance without adding any new keywords. My project evt is such a scheduler, built to meet the needs of the Ruby 3 scheduler interface.
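To make the hook set concrete, here is a deliberately minimal scheduler sketch: just enough to run the earlier pipe example. A real scheduler such as evt must also handle timeouts, sleeping fibers, and mutex blocking; the stubs below exist only to satisfy the scheduler interface, and the sketch assumes plain pipe I/O only ever hits io_wait:

```ruby
class MiniScheduler
  def initialize
    @readable = {}   # io => fiber parked in io_wait
    @writable = {}
  end

  # Called by Fiber.schedule: run the block in a non-blocking fiber.
  def fiber(&block)
    f = Fiber.new(blocking: false, &block)
    f.resume
    f
  end

  # Called when a non-blocking fiber would block on I/O.
  def io_wait(io, events, _timeout)
    @readable[io] = Fiber.current if events.anybits?(IO::READABLE)
    @writable[io] = Fiber.current if events.anybits?(IO::WRITABLE)
    Fiber.yield
    events
  end

  # Drain pending I/O; close is also invoked automatically at thread exit.
  def run
    until @readable.empty? && @writable.empty?
      rs, ws = IO.select(@readable.keys, @writable.keys)
      rs&.each { |io| @readable.delete(io).resume }
      ws&.each { |io| @writable.delete(io).resume }
    end
  end

  def close = run

  # Interface stubs, never reached by this example:
  def kernel_sleep(duration = nil) = raise(NotImplementedError)
  def block(blocker, timeout = nil) = raise(NotImplementedError)
  def unblock(blocker, fiber) = raise(NotImplementedError)
end

rd, wr = IO.pipe
scheduler = MiniScheduler.new
Fiber.set_scheduler scheduler
messages = []
Fiber.schedule { messages << rd.read(20); rd.close }
Fiber.schedule { wr.write("Hello World"); wr.close }
scheduler.run
p messages   # => ["Hello World"]
```

When the program ends, Ruby calls close on the scheduler automatically; calling run explicitly here just makes the order of events obvious.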
Compared to the simple example above, here is an HTTP/1.1 server:
require 'evt'
@scheduler = Evt::Scheduler.new
Fiber.set_scheduler @scheduler
@server = Socket.new Socket::AF_INET, Socket::SOCK_STREAM
@server.bind Addrinfo.tcp '127.0.0.1', 3002
@server.listen Socket::SOMAXCONN
def handle_socket(socket)
until socket.closed?
line = socket.gets
until line == "\r\n" || line.nil?
line = socket.gets
end
socket.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
end
end
Fiber.schedule do
loop do
socket, addr = @server.accept
Fiber.schedule do
handle_socket(socket)
end
end
end
@scheduler.run
We can see that the code is almost the same as synchronous code. All you need to do is set up the scheduler with Fiber.set_scheduler, wrap the places you would usually solve with multithreading in Fiber.schedule, and finally call scheduler.run to start the scheduler.
io_uring Support

Not only has the Ruby API received lots of updates in recent months; so has my scheduler, especially around better I/O multiplexing backend support.
io_uring has been included since Linux 5.4. Since io_uring reduces the number of syscalls and offers direct iov calls, it can achieve better performance than epoll, so supporting io_uring is important. Direct iov support required some further changes to the Ruby Fiber scheduler interface; these were introduced by ioquatix since Ruby 3.0.0-preview2. What we need to implement has two parts. One of them is the epoll-compatible API:
#include <liburing.h>
#define URING_ENTRIES 64
#define URING_MAX_EVENTS 64
struct uring_data {
bool is_poll;
short poll_mask;
VALUE io;
};
void uring_payload_free(void* data);
size_t uring_payload_size(const void* data);
static const rb_data_type_t type_uring_payload = {
.wrap_struct_name = "uring_payload",
.function = {
.dmark = NULL,
.dfree = uring_payload_free,
.dsize = uring_payload_size,
},
.data = NULL,
.flags = RUBY_TYPED_FREE_IMMEDIATELY,
};
void uring_payload_free(void* data) {
io_uring_queue_exit((struct io_uring*) data);
xfree(data);
}
size_t uring_payload_size(const void* data) {
return sizeof(struct io_uring);
}
VALUE method_scheduler_init(VALUE self) {
int ret;
struct io_uring* ring;
ring = xmalloc(sizeof(struct io_uring));
ret = io_uring_queue_init(URING_ENTRIES, ring, 0);
if (ret < 0) {
rb_raise(rb_eIOError, "unable to initialize io_uring");
}
rb_iv_set(self, "@ring", TypedData_Wrap_Struct(Payload, &type_uring_payload, ring));
return Qnil;
}
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
VALUE ring_obj;
struct io_uring* ring;
struct io_uring_sqe *sqe;
struct uring_data *data;
short poll_mask = 0;
ID id_fileno = rb_intern("fileno");
ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
if (ruby_interest & readable) {
poll_mask |= POLL_IN;
}
if (ruby_interest & writable) {
poll_mask |= POLL_OUT;
}
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = true;
data->io = io;
data->poll_mask = poll_mask;
io_uring_prep_poll_add(sqe, fd, poll_mask);
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
return Qnil;
}
VALUE method_scheduler_deregister(VALUE self, VALUE io) {
// io_uring runs under oneshot mode. No need to deregister.
return Qnil;
}
The other part is direct iov support:
VALUE method_scheduler_io_read(VALUE self, VALUE io, VALUE buffer, VALUE offset, VALUE length) {
struct io_uring* ring;
struct uring_data *data;
char* read_buffer;
ID id_fileno = rb_intern("fileno");
// @iov[io] = Fiber.current
VALUE iovs = rb_iv_get(self, "@iovs");
rb_hash_aset(iovs, io, rb_funcall(Fiber, rb_intern("current"), 0));
// register
VALUE ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
read_buffer = (char*) xmalloc(NUM2SIZET(length));
struct iovec iov = {
.iov_base = read_buffer,
.iov_len = NUM2SIZET(length),
};
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = false;
data->io = io;
data->poll_mask = 0;
io_uring_prep_readv(sqe, fd, &iov, 1, NUM2SIZET(offset));
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
VALUE result = rb_str_new(read_buffer, strlen(read_buffer));
if (buffer != Qnil) {
rb_str_append(buffer, result);
}
rb_funcall(Fiber, rb_intern("yield"), 0); // Fiber.yield
return result;
}
VALUE method_scheduler_io_write(VALUE self, VALUE io, VALUE buffer, VALUE offset, VALUE length) {
struct io_uring* ring;
struct uring_data *data;
char* write_buffer;
ID id_fileno = rb_intern("fileno");
// @iov[io] = Fiber.current
VALUE iovs = rb_iv_get(self, "@iovs");
rb_hash_aset(iovs, io, rb_funcall(Fiber, rb_intern("current"), 0));
// register
VALUE ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
write_buffer = StringValueCStr(buffer);
struct iovec iov = {
.iov_base = write_buffer,
.iov_len = NUM2SIZET(length),
};
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = false;
data->io = io;
data->poll_mask = 0;
io_uring_prep_writev(sqe, fd, &iov, 1, NUM2SIZET(offset));
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
rb_funcall(Fiber, rb_intern("yield"), 0); // Fiber.yield
return length;
}
But in some cases the iov path is not taken; I'm still tracking down that bug. At least the performance is already very close to epoll.
Another problem is supporting Windows IOCP. I tried to implement something like this:
VALUE method_scheduler_init(VALUE self) {
HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
rb_iv_set(self, "@iocp", TypedData_Wrap_Struct(Payload, &type_iocp_payload, iocp));
return Qnil;
}
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
HANDLE iocp;
VALUE iocp_obj = rb_iv_get(self, "@iocp");
struct iocp_data* data;
TypedData_Get_Struct(iocp_obj, HANDLE, &type_iocp_payload, iocp);
int fd = NUM2INT(rb_funcallv(io, rb_intern("fileno"), 0, 0));
HANDLE io_handler = (HANDLE)rb_w32_get_osfhandle(fd);
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
data = (struct iocp_data*) xmalloc(sizeof(struct iocp_data));
data->io = io;
data->is_poll = true;
data->interest = 0;
if (ruby_interest & readable) {
interest |= readable;
}
if (ruby_interest & writable) {
interest |= writable;
}
HANDLE res = CreateIoCompletionPort(io_handler, iocp, (ULONG_PTR) data, 0);
printf("IO at address: 0x%08x\n", (void *)data);
return Qnil;
}
VALUE method_scheduler_wait(VALUE self) {
ID id_next_timeout = rb_intern("next_timeout");
ID id_push = rb_intern("push");
VALUE iocp_obj = rb_iv_get(self, "@iocp");
VALUE next_timeout = rb_funcall(self, id_next_timeout, 0);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
HANDLE iocp;
OVERLAPPED_ENTRY lpCompletionPortEntries[IOCP_MAX_EVENTS];
ULONG ulNumEntriesRemoved;
TypedData_Get_Struct(iocp_obj, HANDLE, &type_iocp_payload, iocp);
DWORD timeout;
if (next_timeout == Qnil) {
timeout = 0x5000;
} else {
timeout = NUM2INT(next_timeout) * 1000; // seconds to milliseconds
}
DWORD NumberOfBytesTransferred;
LPOVERLAPPED pOverlapped;
ULONG_PTR CompletionKey;
BOOL res = GetQueuedCompletionStatus(iocp, &NumberOfBytesTransferred, &CompletionKey, &pOverlapped, timeout);
// BOOL res = GetQueuedCompletionStatusEx(
// iocp, lpCompletionPortEntries, IOCP_MAX_EVENTS, &ulNumEntriesRemoved, timeout, TRUE);
VALUE result = rb_ary_new2(2);
VALUE readables = rb_ary_new();
VALUE writables = rb_ary_new();
rb_ary_store(result, 0, readables);
rb_ary_store(result, 1, writables);
if (!result) {
return result;
}
printf("--------- Received! ---------\n");
printf("Received IO at address: 0x%08x\n", (void *)CompletionKey);
printf("dwNumberOfBytesTransferred: %lld\n", NumberOfBytesTransferred);
// if (ulNumEntriesRemoved > 0) {
// printf("Entries: %ld\n", ulNumEntriesRemoved);
// }
// for (ULONG i = 0; i < ulNumEntriesRemoved; i++) {
// OVERLAPPED_ENTRY entry = lpCompletionPortEntries[i];
// struct iocp_data *data = (struct iocp_data*) entry.lpCompletionKey;
// int interest = data->interest;
// VALUE obj_io = data->io;
// if (interest & readable) {
// rb_funcall(readables, id_push, 1, obj_io);
// } else if (interest & writable) {
// rb_funcall(writables, id_push, 1, obj_io);
// }
// xfree(data);
// }
return result;
}
But the I/O scheduler receives the wrong pointers in the callback. After some research, it turns out that to support IOCP you have to initialize the I/O with the FILE_FLAG_OVERLAPPED flag, which may require some changes in Ruby's win32/win32.c. But at least I solved the problems with the IO.select fallback, which is fine for now, since nobody cares about Windows production performance…
kqueue Improvements

Another improvement concerns kqueue on macOS. kqueue on FreeBSD is good, but its performance on macOS is really weird. Since all of our I/O registrations are one-shot, I used the one-shot mode of kqueue to reduce the number of syscalls.
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
struct kevent event;
u_short event_flags = 0;
ID id_fileno = rb_intern("fileno");
int kq = NUM2INT(rb_iv_get(self, "@kq"));
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
if (ruby_interest & readable) {
event_flags |= EVFILT_READ;
}
if (ruby_interest & writable) {
event_flags |= EVFILT_WRITE;
}
EV_SET(&event, fd, event_flags, EV_ADD|EV_ENABLE|EV_ONESHOT, 0, 0, (void*) io);
kevent(kq, &event, 1, NULL, 0, NULL); // TODO: Check the return value
return Qnil;
}
At last, we support nearly all the I/O multiplexing backends of the most-used operating systems:

|  | Linux | Windows | macOS | FreeBSD |
|---|---|---|---|---|
| io_uring | ✅ (See 1) | ❌ | ❌ | ❌ |
| epoll | ✅ (See 2) | ❌ | ❌ | ❌ |
| kqueue | ❌ | ❌ | ✅ (⚠️ See 5) | ✅ |
| IOCP | ❌ | ❌ (⚠️ See 3) | ❌ | ❌ |
| Ruby (`IO.select`) | ✅ Fallback | ✅ (⚠️ See 4) | ✅ Fallback | ✅ Fallback |

3. Requires `FILE_FLAG_OVERLAPPED` to be included in the I/O initialization process.
5. `kqueue` performance in Darwin is very poor. MAY BE DISABLED IN THE FUTURE.

How is the overall performance?
The benchmark was run on version v0.2.2 and Ruby 3.0.0-rc1.
See evt-server-benchmark for the test code; the test runs against a single-threaded server.
The test command is wrk -t4 -c8192 -d30s http://localhost:3001.
All of the systems have their file descriptor limit raised to the maximum.
| OS | CPU | Memory | Backend | req/s |
|---|---|---|---|---|
| Linux | Ryzen 2700x | 64GB | epoll | 54680.08 |
| Linux | Ryzen 2700x | 64GB | io_uring | 50245.53 |
| Linux | Ryzen 2700x | 64GB | IO.select (using poll) | 44159.23 |
| macOS | i7-6820HQ | 16GB | kqueue | 37855.53 |
| macOS | i7-6820HQ | 16GB | IO.select (using poll) | 28293.36 |
Very impressive. The improvements come from several aspects. Current async frameworks like Falcon use nio4r, whose backend is libev; libev's performance is only average because of its extreme-compatibility design. Existing async frameworks also require lots of metaprogramming. This extension is written almost entirely in C, with only the features the scheduler needs.
Compared to my previous tests on preview1, this version uses long connections, and Ruby's nonblocking I/O has also been fixed a lot. wrk results are very error-sensitive: the parser in the earlier benchmark was incorrect and could not close the socket properly. All of these things make our performance 10 times faster than what we had 3 months ago. I have since updated my Midori to a Ruby 3 scheduler project; its performance reaches 247k req/s with kqueue and 647k req/s with epoll, more than 100x faster than blocking I/O.
I also wrote a post in November about Ractor, Ruby 3 Ractor Dev Guide (in Chinese). Combining Fiber with Ractor is always an interesting idea. We have two routes for that:

- Use the SO_REUSEPORT feature to let all Ractors listen on the port at the same time, which is very easy to integrate with existing server architectures.

Unfortunately, neither of these is functioning correctly yet. Some Fiber features are not available in Ractor; I believe this is a bug and have submitted a patch, GitHub #3971. According to my previous benchmarks, Ractor may increase performance by about 4 times through multi-core. But since API servers are usually stateless, that improvement can also be achieved with multiple processes; Ractor's major contribution may be lower memory consumption.
I will test it with future Ruby 3.0 updates.
We achieved a 10x performance improvement over preview1, and are almost 36 times faster than blocking I/O. The major performance issue of Ruby servers is I/O blocking rather than VM speed. With the I/O scheduler included, Ruby 3's I/O performance moves into a new era. The next step is to wait for updates to C extension libraries such as database connectors; then, with an async scheduler and a Fiber-based web server like Falcon, you won't have to change any business code to gain dozens of times the performance.
Let’s continue happy programming with Ruby.
I wrote an article in July 2020, Ruby 3 Fiber changes preview, and a follow-up in August, A Walkthrough of Ruby 3 Scheduler, briefly introducing the Fiber scheduler. Ruby 3 has shipped several versions in these months, including ruby-3.0.0-preview1, ruby-3.0.0-preview2, and ruby-3.0.0-rc1, with further improvements to the Fiber scheduler API.
But as I said before, Ruby 3 implements only the scheduler interface; without a matching implementation, it is not enabled by default. Work has kept me very busy for the past four months, but I recently found some time to keep up with the API updates, so this project has moved forward again.
Project repository: Evt
Suppose we have a pair of pipes from IO.pipe: we write Hello World into one end and read it back from the other. We might write code like this:
rd, wr = IO.pipe
wr.write("Hello World")
wr.close
message = rd.read(20)
puts message
rd.close
But this program has many limitations. For example, a write larger than the pipe buffer will deadlock, because nothing is reading asynchronously on the other end; and we must write before reading, or it deadlocks as well. Of course, we can solve this with multithreading:
require 'thread'
rd, wr = IO.pipe
t1 = Thread.new do
message = rd.read(20)
puts message
rd.close
end
t2 = Thread.new do
wr.write("Hello World")
wr.close
end
t1.join
t2.join
But as we know, multiplexing I/O with threads is extremely inefficient. OS thread switches are very expensive, and even fairness of scheduling between threads is still a nightmare for operating systems research. An I/O problem, however, is not CPU-bound; it only needs the scheduler to provide the right sleep and callback. Here you can simply call Ruby 3's scheduler interface instead of using threads.
require 'evt'
rd, wr = IO.pipe
scheduler = Evt::Scheduler.new
Fiber.set_scheduler scheduler
Fiber.schedule do
message = rd.read(20)
puts message
rd.close
end
Fiber.schedule do
wr.write("Hello World")
wr.close
end
scheduler.run
Generally speaking, async code requires callbacks or async / await keywords, but that is unnecessary in Ruby 3. Ruby 3 enumerates all the common scenarios that need a context switch for scheduling: I/O multiplexing, waiting for process exit, kernel sleep, and spinlocks. It exposes these as interfaces so that developers can handle them with a scheduler of their own, without introducing any extra keywords. Evt, which I have been writing over these months, is exactly such a scheduler.
Compared with the simple example above, the next example is an HTTP/1.1 server:
require 'evt'
@scheduler = Evt::Scheduler.new
Fiber.set_scheduler @scheduler
@server = Socket.new Socket::AF_INET, Socket::SOCK_STREAM
@server.bind Addrinfo.tcp '127.0.0.1', 3002
@server.listen Socket::SOMAXCONN
def handle_socket(socket)
until socket.closed?
line = socket.gets
until line == "\r\n" || line.nil?
line = socket.gets
end
socket.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
end
end
Fiber.schedule do
loop do
socket, addr = @server.accept
Fiber.schedule do
handle_socket(socket)
end
end
end
@scheduler.run
As you can see, development is basically no different from synchronous, blocking, threaded code. Just set your scheduler with Fiber.set_scheduler, replace each I/O-blocking spot that previously needed a thread with Fiber.schedule, and finally call scheduler.run to start the scheduler.
io_uring Support

These months brought many optimizations not only to the Ruby API but also to my scheduler, including many I/O multiplexing backend improvements. One is support for io_uring multiplexing, introduced in Linux 5.4. Since io_uring can reduce the number of syscall invocations, and direct iov calls can in theory outperform epoll, it is worth supporting. Direct iov calls needed extra support in the Ruby Fiber scheduler interface; after a discussion with ioquatix, the interfaces were introduced in Ruby 3.0.0-preview2. The io_uring implementation therefore has two parts. One is the one-shot polling code compatible with the epoll mode:
#include <liburing.h>
#define URING_ENTRIES 64
#define URING_MAX_EVENTS 64
struct uring_data {
bool is_poll;
short poll_mask;
VALUE io;
};
void uring_payload_free(void* data);
size_t uring_payload_size(const void* data);
static const rb_data_type_t type_uring_payload = {
.wrap_struct_name = "uring_payload",
.function = {
.dmark = NULL,
.dfree = uring_payload_free,
.dsize = uring_payload_size,
},
.data = NULL,
.flags = RUBY_TYPED_FREE_IMMEDIATELY,
};
void uring_payload_free(void* data) {
io_uring_queue_exit((struct io_uring*) data);
xfree(data);
}
size_t uring_payload_size(const void* data) {
return sizeof(struct io_uring);
}
VALUE method_scheduler_init(VALUE self) {
int ret;
struct io_uring* ring;
ring = xmalloc(sizeof(struct io_uring));
ret = io_uring_queue_init(URING_ENTRIES, ring, 0);
if (ret < 0) {
rb_raise(rb_eIOError, "unable to initialize io_uring");
}
rb_iv_set(self, "@ring", TypedData_Wrap_Struct(Payload, &type_uring_payload, ring));
return Qnil;
}
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
VALUE ring_obj;
struct io_uring* ring;
struct io_uring_sqe *sqe;
struct uring_data *data;
short poll_mask = 0;
ID id_fileno = rb_intern("fileno");
ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
if (ruby_interest & readable) {
poll_mask |= POLL_IN;
}
if (ruby_interest & writable) {
poll_mask |= POLL_OUT;
}
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = true;
data->io = io;
data->poll_mask = poll_mask;
io_uring_prep_poll_add(sqe, fd, poll_mask);
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
return Qnil;
}
VALUE method_scheduler_deregister(VALUE self, VALUE io) {
// io_uring runs under oneshot mode. No need to deregister.
return Qnil;
}
另一部分则是直接的 iov 支持:
VALUE method_scheduler_io_read(VALUE self, VALUE io, VALUE buffer, VALUE offset, VALUE length) {
struct io_uring* ring;
struct uring_data *data;
char* read_buffer;
ID id_fileno = rb_intern("fileno");
// @iov[io] = Fiber.current
VALUE iovs = rb_iv_get(self, "@iovs");
rb_hash_aset(iovs, io, rb_funcall(Fiber, rb_intern("current"), 0));
// register
VALUE ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
read_buffer = (char*) xmalloc(NUM2SIZET(length));
struct iovec iov = {
.iov_base = read_buffer,
.iov_len = NUM2SIZET(length),
};
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = false;
data->io = io;
data->poll_mask = 0;
io_uring_prep_readv(sqe, fd, &iov, 1, NUM2SIZET(offset));
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
VALUE result = rb_str_new(read_buffer, strlen(read_buffer));
if (buffer != Qnil) {
rb_str_append(buffer, result);
}
rb_funcall(Fiber, rb_intern("yield"), 0); // Fiber.yield
return result;
}
VALUE method_scheduler_io_write(VALUE self, VALUE io, VALUE buffer, VALUE offset, VALUE length) {
struct io_uring* ring;
struct uring_data *data;
char* write_buffer;
ID id_fileno = rb_intern("fileno");
// @iov[io] = Fiber.current
VALUE iovs = rb_iv_get(self, "@iovs");
rb_hash_aset(iovs, io, rb_funcall(Fiber, rb_intern("current"), 0));
// register
VALUE ring_obj = rb_iv_get(self, "@ring");
TypedData_Get_Struct(ring_obj, struct io_uring, &type_uring_payload, ring);
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
write_buffer = StringValueCStr(buffer);
struct iovec iov = {
.iov_base = write_buffer,
.iov_len = NUM2SIZET(length),
};
data = (struct uring_data*) xmalloc(sizeof(struct uring_data));
data->is_poll = false;
data->io = io;
data->poll_mask = 0;
io_uring_prep_writev(sqe, fd, &iov, 1, NUM2SIZET(offset));
io_uring_sqe_set_data(sqe, data);
io_uring_submit(ring);
rb_funcall(Fiber, rb_intern("yield"), 0); // Fiber.yield
return length;
}
不过目前不知道为什么 iov
调用没有被 Ruby Scheduler 识别到,目前还在修复相关的问题。不过好消息是至少达到了接近 epoll
的性能了。
另一个麻烦的地方是 Windows IOCP 支持。我试图写了一个 IOCP 的调度器:
VALUE method_scheduler_init(VALUE self) {
HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
rb_iv_set(self, "@iocp", TypedData_Wrap_Struct(Payload, &type_iocp_payload, iocp));
return Qnil;
}
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
HANDLE iocp;
VALUE iocp_obj = rb_iv_get(self, "@iocp");
struct iocp_data* data;
TypedData_Get_Struct(iocp_obj, HANDLE, &type_iocp_payload, iocp);
int fd = NUM2INT(rb_funcallv(io, rb_intern("fileno"), 0, 0));
HANDLE io_handler = (HANDLE)rb_w32_get_osfhandle(fd);
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
data = (struct iocp_data*) xmalloc(sizeof(struct iocp_data));
data->io = io;
data->is_poll = true;
data->interest = 0;
if (ruby_interest & readable) {
data->interest |= readable;
}
if (ruby_interest & writable) {
data->interest |= writable;
}
HANDLE res = CreateIoCompletionPort(io_handler, iocp, (ULONG_PTR) data, 0);
printf("IO at address: 0x%08x\n", (void *)data);
return Qnil;
}
VALUE method_scheduler_wait(VALUE self) {
ID id_next_timeout = rb_intern("next_timeout");
ID id_push = rb_intern("push");
VALUE iocp_obj = rb_iv_get(self, "@iocp");
VALUE next_timeout = rb_funcall(self, id_next_timeout, 0);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
HANDLE iocp;
OVERLAPPED_ENTRY lpCompletionPortEntries[IOCP_MAX_EVENTS];
ULONG ulNumEntriesRemoved;
TypedData_Get_Struct(iocp_obj, HANDLE, &type_iocp_payload, iocp);
DWORD timeout;
if (next_timeout == Qnil) {
timeout = 0x5000;
} else {
timeout = NUM2INT(next_timeout) * 1000; // seconds to milliseconds
}
DWORD NumberOfBytesTransferred;
LPOVERLAPPED pOverlapped;
ULONG_PTR CompletionKey;
BOOL res = GetQueuedCompletionStatus(iocp, &NumberOfBytesTransferred, &CompletionKey, &pOverlapped, timeout);
// BOOL res = GetQueuedCompletionStatusEx(
// iocp, lpCompletionPortEntries, IOCP_MAX_EVENTS, &ulNumEntriesRemoved, timeout, TRUE);
VALUE result = rb_ary_new2(2);
VALUE readables = rb_ary_new();
VALUE writables = rb_ary_new();
rb_ary_store(result, 0, readables);
rb_ary_store(result, 1, writables);
if (!res) {
return result;
}
printf("--------- Received! ---------\n");
printf("Received IO at address: 0x%08x\n", (void *)CompletionKey);
printf("dwNumberOfBytesTransferred: %lld\n", NumberOfBytesTransferred);
// if (ulNumEntriesRemoved > 0) {
// printf("Entries: %ld\n", ulNumEntriesRemoved);
// }
// for (ULONG i = 0; i < ulNumEntriesRemoved; i++) {
// OVERLAPPED_ENTRY entry = lpCompletionPortEntries[i];
// struct iocp_data *data = (struct iocp_data*) entry.lpCompletionKey;
// int interest = data->interest;
// VALUE obj_io = data->io;
// if (interest & readable) {
// rb_funcall(readables, id_push, 1, obj_io);
// } else if (interest & writable) {
// rb_funcall(writables, id_push, 1, obj_io);
// }
// xfree(data);
// }
return result;
}
但实际发现收到的 I/O 全部都是错误的指针。一番研究后发现,如果要让 IOCP 调度对应的 I/O,该 I/O 在初始化时就要有 FILE_FLAG_OVERLAPPED
Flag 的支持。这意味着还需要 Ruby 的 win32/win32.c
中做出一些改进,才能在调度器中正确调度 IOCP。不过 Windows 上的 fallback IO.select
调度器还是能正常使用的,这问题就不大,毕竟谁在乎 Windows 的生产性能呢…
kqueue 支持改进

另一个做出的改进是在 macOS 的 kqueue
上。kqueue
在 FreeBSD 上的性能相当好,但是在 macOS 上就比较拉跨。只能通过减少 syscall
来提高性能。这几个月的一个改进是使用了 kqueue
的 one-shot 模式,来减少一次 deregister 需要的 syscall
。
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
struct kevent event;
u_short event_flags = 0;
ID id_fileno = rb_intern("fileno");
int kq = NUM2INT(rb_iv_get(self, "@kq"));
int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
int ruby_interest = NUM2INT(interest);
int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("READABLE")));
int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WRITABLE")));
if (ruby_interest & readable) {
event_flags |= EVFILT_READ;
}
if (ruby_interest & writable) {
event_flags |= EVFILT_WRITE;
}
EV_SET(&event, fd, event_flags, EV_ADD|EV_ENABLE|EV_ONESHOT, 0, 0, (void*) io);
kevent(kq, &event, 1, NULL, 0, NULL); // TODO: Check the return value
return Qnil;
}
最后我们把主流的操作系统 I/O 多路复用都写了一遍集成到了我们的事件处理库中,整体情况如下:
| | Linux | Windows | macOS | FreeBSD |
|---|---|---|---|---|
| io_uring | ✅ (见 1) | ❌ | ❌ | ❌ |
| epoll | ✅ (见 2) | ❌ | ❌ | ❌ |
| kqueue | ❌ | ❌ | ✅ (⚠️见 5) | ✅ |
| IOCP | ❌ | ❌ (⚠️见 3) | ❌ | ❌ |
| Ruby (IO.select) | ✅ Fallback | ✅ (⚠️见 4) | ✅ Fallback | ✅ Fallback |

注 1:需要 liburing-dev 已被安装。
注 3:在 FILE_FLAG_OVERLAPPED flag 被引入前无法工作。
注 5:kqueue 在 Darwin 下的一些特殊情况性能很烂,可能会在未来被禁用。

那么总体性能如何呢?
下面的测试是在 evt v0.2.2
和 Ruby 3.0.0-rc1 上运行的,详细的测试代码见 evt-server-benchmark。测试仅使用单线程服务器。
测试命令是 wrk -t4 -c8192 -d30s http://localhost:3001。
| 操作系统 | CPU | 内存 | 后端 | 请求/秒 |
|---|---|---|---|---|
| Linux | Ryzen 2700x | 64GB | epoll | 54680.08 |
| Linux | Ryzen 2700x | 64GB | io_uring | 50245.53 |
| Linux | Ryzen 2700x | 64GB | Ruby (使用 poll) | 44159.23 |
| macOS | i7-6820HQ | 16GB | kqueue | 37855.53 |
| macOS | i7-6820HQ | 16GB | Ruby (使用 poll) | 28293.36 |
相当惊人。这个结果有几方面因素。现在的 Falcon 等异步框架使用的都是基于 nio4r 来实现的,其背后是 libev。libev 在各个异步事件库中的性能本来就是比较一般的,再加上其为了更好的兼容性做了大量的妥协。另一方面,以前的调度库需要大量 Ruby 元编程帮助,而现在几乎都是在 C extension 间完成的,性能也有了很大的提升。
另外比起我们之前在 preview1 上做的测试,这个版本的 Fiber 调度器修复了大量的错误,而 wrk 的测试结果是非常错误敏感的,这使得我们最终的请求速度比起之前又提升了 10 倍。
wrk 的测试结果对错误非常敏感,而这个 benchmark 中的 parser 有问题,无法正确关闭 socket。后来我把我的 Midori 重新捡起来,改造成了基于 Ruby 3 Scheduler 的项目:单线程使用 kqueue 性能达到了 247k req/s,使用 epoll 更是达到了 647k req/s,相当于上百倍的性能提升。
我在 2020 年 11 月 17 日写过一篇关于 Ractor 的扫盲贴 《Ractor 下多线程 Ruby 程序指南》,Ractor 和 Fiber 的结合始终是一个有意思的话题。目前情况下 Fiber 与 Ractor 结合来实现 Web 服务器有两个可能的路径:
- 利用 SO_REUSEPORT 特性让多个 Ractor 同时监听请求,即可直接将单线程服务器扩展成多线程服务器。

比较可惜的是,目前这两者都是无法实现的,因为目前 Fiber 的一些特性无法在 Ractor 中使用。我个人倾向认为这是误报,目前已提交了一个 patch GitHub #3971。根据我之前的测试,Ractor 的加入实际上应该还能再提升 4 倍左右的吞吐量。不过由于 API 服务器通常是无状态的,主要矛盾也不是 CPU-bound,所以这些吞吐量也可以由多进程来实现;Ractor 的引入更大的意义在于比多进程实现更低的内存消耗。
等 Ruby 3.0 更新后我们可以进一步测试。
这比起 preview1 10 倍的性能提升,和比起以前阻塞 I/O 近 36 倍的性能提升足以证明 Ruby 目前服务器的性能问题的本质是 I/O 阻塞问题,而不是 Ruby CPU 执行慢的问题。而随着 I/O 调度器的引入,Ruby 3 的 I/O 性能能更上一个台阶。接下来我们要等待的就是一些使用 C 原生组件的,比如数据库驱动和 Redis 驱动的更新。然后使用一个基于 Fiber 的 Web 服务器,例如 Falcon。无需任何业务上代码的变化,就能得到数倍甚至数十倍的性能提升。
让我们继续享受 Ruby 的快乐编程。
]]>Ractor 是 Ruby 3 新引入的特性。Ractor 顾名思义是 Ruby 和 Actor 的组合词。Actor 模型是一个基于通讯的、非锁同步的并发模型。基于 Actor 的并发模型在 Ruby 中有很多应用,比如 concurrent-ruby
中的 Concurrent::Actor
。Concurrent Ruby 虽然引入了大量的抽象模型,允许开发高并发的应用,但是它并不能摆脱 Ruby 的 GIL (Global Interpreter Lock),这使得同一时间,只有一个线程是活跃的。所以通常 concurrent-ruby
需要搭配无锁的 JRuby 解释器使用。然而,直接解除 GIL 锁会导致大量默认 GIL 可用的依赖出现问题,在多线程开发中会产生难以预料的线程竞争问题。
去年在 RubyConf China 的时候,我问 matz 说 90 年代多核的小型机以及超级计算机已经变得非常普遍了,为什么会把 Ruby 的多线程设计成这样呢?matz 表示,他当时还在用装着 Windows 95 的 PC,如果他知道以后多核会那么普遍,他也不会把 Ruby 设计成这样。
但是,历史遗留问题依然需要解决。随着 Fiber Scheduler 在 Ruby 3 引入来提高 I/O 密集场景下单一线程利用率极低的问题;我们需要进一步解决,计算密集场景下,多线程的利用率。
为了解决这一问题,Ruby 3 引入了 Ractor 模型。Ractor 本质来说还是 Thread 线程,但是 Ractor 做了一系列的限制。首先,锁是不会在 Ractor 之间共享的;也就是说,不可能有两个线程争抢同一个锁。Ractor 和 Ractor 之间可以传递消息。Ractor 内部具有全局锁,确保 Ractor 内的行为和原先 Thread 是一致的。传递消息必须是值类型的,这意味着不会有指针跨 Ractor 生存,也会避免数据竞争问题。简而言之,Ractor 把每个 Thread 当作一个 Actor。
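把 Thread 当作 Actor 的消息传递,大致可以用下面这个最小示例说明(需要 Ruby 3.0 及以上,运行时会有 Ractor 处于实验阶段的警告):

```ruby
# Ractor 通过 receive 阻塞等待消息,块的最后一个表达式作为返回值
r = Ractor.new do
  msg = Ractor.receive # 只能收到可共享(值语义)的对象,比如 Integer
  msg * 2
end
r.send(21)       # 发送消息,21 按值传递,不存在跨 Ractor 的指针
result = r.take  # 取出 Ractor 的返回值
p result         # => 42
```

消息只能是可共享对象这一限制,正是下文讨论值类型的原因。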
但 Ruby 没有真正的值类型,而值类型的本质就是用拷贝来替代引用。我们要做的就是确保 Ruby 对象的可拷贝性。查看 Ractor 的文档,可以看到严格的描述:
Ractors don't share everything, unlike threads.
* Most objects are *Unshareable objects*, so you don't need to care about thread-safety problem which is caused by sharing.
* Some objects are *Shareable objects*.
* Immutable objects: frozen objects which don't refer to unshareable-objects.
* `i = 123`: `i` is an immutable object.
* `s = "str".freeze`: `s` is an immutable object.
* `a = [1, [2], 3].freeze`: `a` is not an immutable object because `a` refer unshareable-object `[2]` (which is not frozen).
* Class/Module objects
* Special shareable objects
* Ractor object itself.
* And more...
为了测试出 Ractor 的效果,我们需要一个计算密集的场景。最计算密集的场景,当然就是做数学计算本身。比如我们有下面一个程序:
DAT = (0...72072000).to_a
p DAT.map { |a| a**2 }.reduce(:+)
这个程序计算 0 到 72072000 的平方和。我们运行一下这个程序,得到运行时间是 8.17s。
如果我们用传统的多线程来写,我们可以把程序写成这样:
THREADS = 8
LCM = 72072000
t = []
res = []
(0...THREADS).each do |i|
r = Thread.new do
dat = (((LCM/THREADS)*i)...((LCM/THREADS)*(i+1))).to_a
res << dat.map{ |a| a ** 2 }.reduce(:+)
end
t << r
end
t.each { |t| t.join }
p res.reduce(:+)
运行后,我们发现,虽然确实创建了 8 个系统线程,但是总运行时间变成了 8.21s。没有显著的性能提升。
使用 Ractor 重写这个程序,主要的变化是子线程内不能再直接访问外面的 i 变量,需要用消息的方式把它传递进去。改进后的代码会变成这样:
THREADS = 8
LCM = 72072000
t = []
(0...THREADS).each do |i|
r = Ractor.new i do |j|
dat = (((LCM/THREADS)*j)...((LCM/THREADS)*(j+1))).to_a
dat.map{ |a| a ** 2 }.reduce(:+)
end
t << r
end
p t.map { |t| t.take }.reduce(:+)
其结果如何呢?我们根据不同的线程数量进行了测试。
| Threads | AMD Ryzen 7 2700x | Intel i7-6820HQ |
|---|---|---|
| 1 | 8.171 | 12.027 |
| 2 | 4.483 | 6.913 |
| 3 | 4.874 | 6.755 |
| 4 | 2.353 | 6.188 |
| 5 | 2.429 | 5.154 |
| 6 | 2.259 | 5.320 |
| 7 | 1.908 | 5.368 |
| 8 | 2.156 | 5.754 |
| 9 | 2.136 | |
| 10 | 3.159 | |
| 11 | 2.577 | |
| 12 | 2.679 | |
| 13 | 2.787 | |
| 14 | 2.615 | |
| 15 | 2.197 | |
| 16 | 2.303 | |
Ractor 确实改善了多线程全局解释锁的问题。
我使用了 AMD uProf(对于 Intel CPU,可以使用 Intel VTune)进行 CPU 运算情况的统计。为了降低睿频对单线程性能的影响,我将 AMD Ryzen 7 2700x 全核心锁死 4.2GHz。
对于 AMD Ryzen 7 2700x,4 线程比单一线程快了 3 倍多;到 7 线程,比单一线程快了约 4 倍。AMD Ryzen 7 2700x 是一款 8 核心 16 线程的 CPU。同时,每 4 个核心组成一个 CCX,跨 CCX 的内存访问有额外的代价。这使得 4 线程内性能提升很显著,超过 4 线程后受限于 CCX 和 SMT,性能提升变得比较有限。其表现是随着线程数的增加,IPC(每时钟周期指令数)开始下降:在单线程运算时,每时钟周期 CPU 可以执行 2.42 个指令;但到了 16 线程运算时,每时钟周期 CPU 只能执行 1.40 个指令。同时,更多的线程意味着更复杂的操作系统线程调度,使得多核的利用率越来越低。
同样,对于 Intel i7-6820HQ,我们得到了类似的结论。这是一款 4 核 8 线程的 CPU,由于第 5 个线程开始需要使用 HT,从而提升变得很有限。
Ractor 的引入除了可以改善计算密集场景下的运算效率,对于现有大型 Ruby Web 程序的内存占用也是有积极意义的。现有 Web 服务器,比如 puma,由于 I/O 多路复用性能极其低下,通常会使用多线程 + 多进程的形式来提升性能。由于 Web 服务器可以自由水平扩展,使用多进程的形式来管理,可以完全解开 GIL 锁的问题。
但是 fork 指令效率低下。微软在 2019 年 HOTOS 上给出了一篇论文:A fork() in the road,和 spawn 相比,fork 模式会导致启动速度变得非常慢。为了缓解这一问题,在 Ruby 2.7 引入 GC.compact
后,通常需要执行多次 compact
来降低 fork 启动的消耗。进一步地,使用 Ractor 来替代多进程管理,可以更容易地传递消息,复用可冻结的常量,从而降低内存占用。
Ruby 3 打开了多线程的潘多拉盒子。我们可以更好利用多线程来改善性能。但是看着 CPU Profiler 下不同线程调用会导致 CPU IPC 下降和缓存命中下降,对程序调优也提出了更高的要求。
我们边走边看吧。
]]>眼镜镜片减少反射的主要原理是利用镀膜。为什么镀膜可以减少反射呢?我们需要一些简单易懂的物理课。我们将光看成一束波(这里就先不讲量子力学里光的特性,太复杂了),当光打在镜片上时,大部分的光会透过去,少部分会发生反射。如果现在在镜片前还有一层镀膜,由于这是两个不同介质的物质,光线会先发生折射,再发生刚刚的透过和反射的过程。同时在镀膜的表面,还有可能再发生一次反射。于是我们得到了两束平行的反射光,反射 1 和反射 2,这两束光有一个相位差 φ,如果这个相位差恰好能使两束光线的波峰和波谷叠加(发生干涉),那么这部分能量就会被抵消,反射光减小;但是又由于能量守恒,所以透过镜片的光线就会增加。所以这一类镀膜也称为「增透膜」。
我们现在从原理上了解了增透膜如何减少反光,接下来的问题是,镀膜减少的反射光由什么决定?基本上就是膜层的折射率乘以厚度(即光学厚度)。因为这个光学厚度的两倍是光线 2 比光线 1 多走的路程,如果这个厚度恰好是半波长的一半(或者半波长的整倍数加 1/4 波长),那么这束光线就能被完全抵消。
那么问题来了,最常见的摄影灯光是白光,白光不是单一波长的光,显然不可能由一层增透膜抵消掉。事实上镜片会使用多层镀膜,多层镀膜的效果比较难计算,但要想抵消掉全部白光里各种波长的光也是困难重重。
那么为什么配了这个减少反光的眼镜后,反光从绿色变成了紫色呢?其实也很好理解:因为 400nm 波长的紫光比 550nm 左右的绿光波长更短,这要求镀膜材料的厚度和折射率都要更低。波长越短的光对于材料和镀膜的技术要求都更高,自然就更难处理。
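用一个小算式就能直观感受这一点。假设镀膜材料是常见的氟化镁(折射率 n ≈ 1.38,这是一个示意性的假设值),四分之一波长膜的物理厚度是 d = λ/(4n):

```ruby
# 计算不同波长对应的四分之一波长增透膜厚度(光学厚度 n*d = λ/4)
n = 1.38 # 假设为氟化镁的折射率
[400, 550].each do |wavelength_nm|
  d = wavelength_nm / (4.0 * n)
  puts "λ = #{wavelength_nm}nm -> 膜厚约 #{d.round(1)}nm"
end
```

可见 400nm 紫光需要的膜厚比 550nm 绿光薄四分之一以上,对镀膜工艺的要求自然更高。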
事实上,像是相机镜头中,由于通常需要多组镜片,问题会叠加,提高透光率的要求更高。但随便拿哪个再贵的相机镜头出来,放在太阳地下还是能看到绿色、紫色或者红色的反光,没有能完全抵消掉反光的镜片。
不过平心而论,各个镜片厂商对于这个镀膜减反射特性的描述都是很详细的。比如说我随便找了一下蔡司的镀膜规格,说的是主要和传统镀膜比起来降低了 1% 左右的绿光反射率,由于人眼对绿光更敏感,所以比较有效果。但是看着这张图就知道在紫色光的 400nm 部分,反射率还是高达 5%,和传统镀膜并没有明显差异。
所以这个问题的最后,为什么这个几万元的眼镜明明只是降低了 1% 绿光反射,会让人觉得能降低大部分反射,最后花了冤枉钱呢?其实镜片厂没有骗人,详细的资料列得很清楚,但买镜片的客户并不是从镜片厂直接获取的资讯,经过了一层眼镜店老板。眼镜店老板又不用学习光学,他的职责就是怎么卖出更大的利润,结果就是被眼镜店老板骗了。
]]>One day in 2016, we found that our users could not pass the CDN authentication with their iPhones, and we spent several days debugging it. The situation was that we needed to upload three files at the same time. We used the user's token to generate three random ids, and the CDN server would use these ids to authenticate the upload of user files. This way, we didn't have to relay the files to the CDN through our own server.
But soon, iOS users ran into a weird problem: they could only upload one of the three files. After debugging, we found that after uploading the first file, the next two ids became illegal. Furthermore, we found the three ids fetched by Safari were precisely the same?!
I soon designed a reproduction of this bug:
require 'sinatra'
get '/' do
<<-EOF
<html>
<script type="text/javascript">
function reqListener () {
console.log(this.responseText);
}
for (i = 0; i < 3; i++) {
var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.open("GET", "/count");
oReq.send();
}
</script>
</html>
EOF
end
count = 0
get '/count' do
count += 1
count.to_s
end
On Firefox, you would get 1 2 3, but on Safari, you would get three 1s.
For the same API endpoint, if the parameters are identical and identical requests are sent asynchronously before the first one returns, Safari may return the same result for all of these requests.
In general, we may think that GET requests in HTTP/1.1 are idempotent. If we treat x as the state of the server, and f as the GET request, we have:
\[f(f(x)) = f(x)\]
Idempotence ensures that the side effects of multiple calls are identical to those of a single call. From this we might infer that all responses to the same GET request should also be identical.
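As a quick sketch of what idempotence does promise, consider a hash standing in for server state and a lambda standing in for an idempotent request handler (the names here are made up for illustration):

```ruby
# An idempotent operation: applying it twice produces the same state as once
f = ->(state) { state.merge("name" => "alice") }

x = {}
p f.call(x) == f.call(f.call(x))  # => true
```

This guarantees identical resulting *state*, not identical or cacheable *responses*; that gap is exactly where Safari's optimization goes wrong.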
But if we check rfc7231 carefully, idempotence is defined to make resending a failed request safe, rather than to forbid the backend from performing any non-idempotent operations.
What if we change GET to POST?
require 'sinatra'
get '/' do
<<-EOF
<html>
<script type="text/javascript">
function reqListener () {
console.log(this.responseText);
}
for (i = 0; i < 3; i++) {
var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.open("POST", "/count");
oReq.send();
}
</script>
</html>
EOF
end
post '/count' do
count += 1
count.to_s
end
IT IS STILL THREE 1s! No HTTP specification defines POST as idempotent or cacheable. This violates the basic semantics of HTTP methods and is bound to cause serious problems.
If we check the output from the backend, there is only one 1, which means the three POST requests were cached by Safari?!
If idempotence could be relied on in this way, a browser could hash the request parameters, cache the responses to reduce response time, and improve the performance of callbacks in its event engine. But apparently this assumption is incorrect, and Safari does make such an optimization, which causes the bug.
If we search for this problem on the Internet, we find people asking about Safari caching POST requests and Safari caching GET requests with cache disabled since around 2012.
I submitted this bug through Apple's feedback system in 2016. In the four years since, the feedback system has evolved into Feedback Assistant; Mac OS X has been renamed macOS; El Capitan has been upgraded to Big Sur. But this bug is still present in the latest Safari (16610.2.8.1.1), and my ticket is still open, with NO RESPONSE.
Safari is fast, efficient, and power-saving. But if Safari can't keep essential compatibility with the W3C Web API standards, how dare we use this browser? Due to the monopoly of iOS and the App Store, iOS developers are not allowed to use third-party Webview engines, including Chrome and Firefox. Before iOS, nobody would have cared about Safari. But now we, the web developers, have to compromise with Safari's incorrect implementation. Even the evil IE didn't use the monopoly of the operating system to force users to accept the specification of a browser.
Safari is not only the new IE; it is even more evil than IE. Apple is the destroyer of the free Internet.
F**k you, Apple.
]]>2016 年的一天,当我们发现 iPhone 上的浏览器不能正确通过我们的 CDN 鉴权后,我们花了数天的时间来 debug。简单来说当时的情况是,我们需要同时上传 3 个文件,我们会用用户 token 来换 3 个独立的随机数 id,这三个 id 会被 CDN 服务器认为合法,用户可以直接上传到 CDN 上而无需在我们自己服务器上中转。
但 iOS 用户很快就出现了一个奇怪的问题,用户 3 个文件只能成功上传 1 个,剩下 2 个无法正常上传。再进一步调试后我们发现,在上传任意一个文件后,剩下两个 id 变成了非法。再进一步地,我们发现 Safari 获得的 3 个 id 竟然是完全相同的?!
我很快设计出了能够构建出这个问题的重现:
require 'sinatra'
get '/' do
<<-EOF
<html>
<script type="text/javascript">
function reqListener () {
console.log(this.responseText);
}
for (i = 0; i < 3; i++) {
var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.open("GET", "/count");
oReq.send();
}
</script>
</html>
EOF
end
count = 0
get '/count' do
count += 1
count.to_s
end
在 Firefox 上,你会得到 1 2 3 的输出,而在 Safari 上你会得到 3 个 1。
针对同一个 API 接口,只要请求参数完全一致,并且在一个请求返回前,相同的请求已经被发出,那么这些请求都会得到完全相同的结果。
HTTP/1.1 规格上 API 的 GET 是幂等的,如果我们把 x 当作服务器的状态,f 是 GET 请求操作,那么我们有:
\[f(f(x)) = f(x)\]这确保了多次调用接口产生的副作用,和一次调用是一致的。我们似乎可以得到推论认为每次 GET 请求的返回都应该是一样的。
但如果我们仔细来看 rfc7231 对于 HTTP 幂等的定义,其只是为了确保请求重新发送的可靠性,而不是不允许后端进行任何非幂等的操作。
如果我们把 GET 换成 POST 结果如何呢?
require 'sinatra'
get '/' do
<<-EOF
<html>
<script type="text/javascript">
function reqListener () {
console.log(this.responseText);
}
for (i = 0; i < 3; i++) {
var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.open("POST", "/count");
oReq.send();
}
</script>
</html>
EOF
end
post '/count' do
count += 1
count.to_s
end
竟然得到的也是三个 1!任何关于 HTTP 的规格都没有这样的描述,这对于 HTTP 动词的基本概念相违背,显然会带来非常严重的问题。
我们检查一下后端的输出,只有一个 1。也就是这三个 POST 请求被 Safari 缓存了?!
一些接口必然无法满足幂等的要求,比如统计接口和随机数接口。
如果我们假设这个幂等的可靠性,那么我们自然可以把请求参数进行哈希,缓存降低响应时间,以及提高事件回调时事件引擎的处理速度。但是显然这个假设是错的。但显然 Safari 做了相关的优化,从而导致了问题。
如果仔细找一下会发现,2012 年左右开始几乎每年都有人在网上问 Safari cache POST 请求和 Safari cache GET requests with cache disabled 的问题。
我在 2016 年通过 Apple 当时非常丑的 Feedback 系统提交了这个 bug。然而四年过去,这个 Feedback 系统升级成了 Feedback Assistant,Mac OS X 改名成了 macOS,从 El Capitan 升级到了 Big Sur,这个 Bug 在最新的 Safari 14.0.1 (16610.2.8.1.1) 中依然存在,这个 Ticket 也没有得到任何回复。
Safari 很快,Safari 效率很高,Safari 很省电。但如果连基本的 W3C Web API 的兼容性、可靠性都不能保证,我们怎么敢使用这个浏览器?更何况由于 iOS 和 App Store 的垄断性,iOS 设备被要求不允许使用第三方 Webview 内核,包括 iOS 上的 Chrome 和 Firefox。在 iOS 之前不会有人理 Safari 的,但现在我们这些 Web 开发者被迫为 Safari 的无下限进行妥协。就连邪恶的微软 IE,也没有敢利用操作系统的垄断来强制浏览器的规格。
Safari 何止是新的 IE,它比 IE 邪恶多了。Apple 根本就是自由互联网的摧毁者,这和中国开发者痛恨的微信浏览器又有什么本质上的区别呢?
F**k you, Apple.
]]>之前需要在某个单片机下塞点阵字库,为了能多覆盖一些字,准备在字库上做一点压缩。由于常用字相邻编码通常是按形码编码的,所以形状上有很多相似性,因此应该是比较可以压缩的。读取的时候,把一整块相邻编码解压塞到内存里,在内存里做个 LRU 缓存。由于常用字的编码也比较靠近,所以可以一定程度上在覆盖生僻字的同时,达到比较好的读取性能。
不过单片机上写个压缩算法比较麻烦。单片机本身的性能就很差,压缩算法本身越简单越好。最好实现的压缩算法恐怕就是 LZW 了。由于 LZW 的字典是自解释的,也不需要单独构建霍夫曼树,one-pass 一遍读完就解决。于是就考虑写个 LZW。
LZW 压缩需要两件东西,一个是字符集一个是字典。最常见的字符集就是 0x00
到 0xff
的 256 个字符当作字符集,即我们把所以 8-bit 数据看成一个单独字符。字典是这个字符集的组合。字典构成一个更大的空间,通常教科书上的例子的是 14-bit 的字典空间。也就是 0x0000-0x1fff
,其中 0x0000-0x00ff
是基础字符集,0x0100-0x1fff
7935 个编号作为可编码的字典。这对于压缩率是比较好的,不过 14-bit 的读取实在太麻烦了,因为它不是 byte 的整倍数。于是我把可编码的字典空间放达到 2 个 bytes 也就是 0x0100-0xffff
65279 个字符,和字符集一起构成一个 65536 字符的空间。
头文件和函数签名非常简单:
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>
struct dictionary {
size_t count;
size_t entries_sizes[65536];
char* entries[65536];
};
struct dictionary* dictionary_init();
void dictionary_free(struct dictionary* dict);
size_t dictionary_insert(struct dictionary* dict, const char* word, size_t size); // The string would be copied, be sure to free the word.
size_t dictionary_find(struct dictionary* dict, const char* word, size_t size);
void compress(const char* source, size_t size, const char* filepath);
char* decompress(size_t* size, const char* filepath);
维护字典大小、插入字典、根据字符查找在字典的位置、执行压缩和解压缩。
其中字典的创建、释放和插入是基本的 C 语言常识:
struct dictionary* dictionary_init() {
struct dictionary* dict = (struct dictionary*) malloc(sizeof(struct dictionary));
dict->count = 256;
for (size_t i = 0; i < 256; i++) {
dict->entries[i] = (char*) malloc(sizeof(char));
dict->entries[i][0] = (char) i;
dict->entries_sizes[i] = 1;
}
return dict;
}
void dictionary_free(struct dictionary* dict) {
for (size_t i = 0; i < dict->count; i++) {
free(dict->entries[i]);
}
free(dict);
}
size_t dictionary_insert(struct dictionary* dict, const char* word, size_t size) {
char* w = (char*) malloc(sizeof(char) * size);
size_t idx = dict->count;
assert(idx < 65536);
memcpy(w, word, size);
dict->entries[idx] = w;
dict->entries_sizes[idx] = size;
dict->count++;
return idx;
}
查找的一个比较好的实现方法是使用哈希(unordered_map
)。不过由于我在使用 C 语言,没有炫酷的 C++ STL 标准库。考虑到字典最大就 65536 个,而且用 size 大小就能很好做初步的筛选,我还是整个遍历一遍算了。
size_t dictionary_find(struct dictionary* dict, const char* word, size_t size) {
size_t i;
for (i = 0; i < dict->count; i++) {
if (dict->entries_sizes[i] == size) {
bool found = true;
for (size_t j = 0; j < size; j++) {
if (dict->entries[i][j] != word[j]) {
found = false;
break;
}
}
if (found) { return i; }
}
}
return i;
}
LZW 之所以不需要单独维护字典是因为 LZW 对于如何建立字典这件事情是 隐含 在算法中的。对于当前压缩过程,如果前一个字典字和当前的字符的组合没有出现在字典中,那么就插入到字典中。为了理解这个概念,我们先简化一下模型。我们假设基本字符集只有 0
和 1
两个字符,然后我们编码这个序列:01001101
| 前一个字典字 | 当前字符 | 构成的组合 | 是否出现在字典中? | 输出 |
|---|---|---|---|---|
| - | 0 | 0 | 出现(基本字符 0 -> 0) | - |
| 0 | 1 | 01 | 没有(插入字典 2 -> 01) | 0 |
| 1 | 0 | 10 | 没有(插入字典 3 -> 10) | 1 |
| 0 | 0 | 00 | 没有(插入字典 4 -> 00) | 0 |
| 0 | 1 | 01 | 出现(2 -> 01) | - |
| 01 | 1 | 011 | 没有(插入字典 5 -> 011) | 2 |
| 1 | 0 | 10 | 出现(3 -> 10) | - |
| 10 | 1 | 101 | 没有(插入字典 6 -> 101) | 3 |
| 1 | - | 1 | 出现(基本字符 1 -> 1) | 1 |

于是我们把 01001101 编码成了 010231。这个序列的压缩率是非常糟糕的:因为 0 和 1 各自只需要 1 个 bit 表示,01001101 本身只有 1 byte;而压缩后的 010231 每个字符至少需要 2 个 bit 来表示,虽然整体数量减少到了 6 个字符,实际上仍需要 12-bit(1.5 bytes)才能存储。但随着输入变长,字典覆盖会越来越好,压缩率也会越来越低。
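这个字典构建过程可以用几行 Ruby 示意性地跑一遍(字符集作为参数传入,输出的是字典编号序列,只是示意实现,不是正文里的 C 版本):

```ruby
# LZW 编码的示意实现:p 是当前匹配的前缀,
# 遇到不在字典里的组合就输出前缀编号并建新字典项
def lzw_encode(input, alphabet)
  dict = {}
  alphabet.each_with_index { |ch, i| dict[ch] = i }
  output = []
  p = ""
  input.each_char do |c|
    if dict.key?(p + c)
      p += c # 继续扩大匹配
    else
      output << dict[p]        # 输出已匹配前缀的编号
      dict[p + c] = dict.size  # 新组合插入字典
      p = c
    end
  end
  output << dict[p] unless p.empty? # 收尾:输出剩余前缀
  output
end

p lzw_encode("01001101", ["0", "1"])  # => [0, 1, 0, 2, 3, 1]
```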
我在实际实现的时候还标注了一个元信息,就是用第一个 size_t
来标注源文件的大小,以便于之后解压的时候申请内存。
void compress(const char* source, size_t size, const char* filepath) {
FILE* destination = fopen(filepath, "wb+");
struct dictionary* dict = dictionary_init();
// First size_t indicates the file size
fwrite(&size, sizeof(size_t), 1, destination);
char* p = NULL;
size_t p_size = 0;
char c;
for (size_t i = 0; i < size; i++) {
char* ppc = (char*) malloc(p_size + 1); // p + c
c = source[i];
memcpy(ppc, p, p_size);
ppc[p_size] = c;
size_t idx = dictionary_find(dict, ppc, p_size+1);
if (idx < dict->count) {
// Found
if (p != NULL) { free(p); }
p_size = p_size + 1;
p = malloc(sizeof(char) * p_size);
memcpy(p, ppc, p_size);
} else {
// Not found
assert(p != NULL);
unsigned short p_res = dictionary_find(dict, p, p_size);
fwrite(&p_res, sizeof(unsigned short), 1, destination);
free(p); p = NULL;
dictionary_insert(dict, ppc, p_size + 1);
p = (char*) malloc(sizeof(char));
p[0] = c;
p_size = 1;
if (i == size - 1) {
unsigned short c_res = dictionary_find(dict, p, 1);
fwrite(&c_res, sizeof(unsigned short), 1, destination);
}
}
free(ppc);
}
if (p != NULL) {
free(p);
}
dictionary_free(dict);
fclose(destination);
}
LZW 的压缩是比较简单的,但是解压却是有点 tricky 的。如果我们顺着压缩的思路来解压,我们会认为,顺着我们解压的过程,我们会慢慢构建出我们需要的字典。我们考虑下面一个序列 10101
的压缩过程:
| 前一个字典字 | 当前字符 | 构成的组合 | 是否出现在字典中? | 输出 |
|---|---|---|---|---|
| - | 1 | 1 | 出现(基本字符 1 -> 1) | - |
| 1 | 0 | 10 | 没有(插入字典 2 -> 10) | 1 |
| 0 | 1 | 01 | 没有(插入字典 3 -> 01) | 0 |
| 1 | 0 | 10 | 出现(2 -> 10) | - |
| 10 | 1 | 101 | 没有(插入字典 4 -> 101) | 2 |
| 1 | - | 1 | 出现(基本字符 1 -> 1) | 1 |

输出的结果是 1021。如果我们顺着同样的思路来解压,过程如下:

| 前一个字典字 | 当前字符 | 构成的组合 | 是否出现在字典中? | 输出 |
|---|---|---|---|---|
| - | 1 | 1 | 出现(基本字符 1 -> 1) | 1 |
| 1 | 0 | 10 | 没有(插入字典 2 -> 10) | 0 |
| 0 | 2 (10) | 010 | 没有(插入字典 3 -> 010) | 10 |

显然我们把编号 3 的字典值插入错了:应该是 01,而我们却组合出了 010,这会进一步导致之后的解压出错。而且我们很容易想到一个问题:在压缩过程中我们组合的都是前一个字典字和一个 单一字符,而这里我们让 0 和 10 相加,后者显然不是单一字符。进一步思考会发现:压缩时新字典项是前缀 p 加当前字符 c,而解压时拿到的下一个字典字恰好是以这个 c 开头的。所以这里编号 3 应该是前一个字典字 0 加上当前字典字 10 的第一个字符 1 的组合,也就是 01。
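这个规则(包括「编号还不在字典里」的极端情况)可以用一个 Ruby 示意实现来验证:当读到的编号尚未建立时,它对应的字符串只能是前一个字典字加上它自己的第一个字符:

```ruby
# LZW 解码的示意实现(字符集作为参数传入,仅用于演示字典重建的时机)
def lzw_decode(codes, alphabet)
  dict = alphabet.dup
  previous = dict[codes.first]
  out = previous.dup
  codes.drop(1).each do |code|
    entry = if code < dict.size
      dict[code]
    else
      previous + previous[0]   # 编号还不在字典里:只能是 p + p 的第一个字符
    end
    out << entry
    dict << previous + entry[0] # 按压缩时同样的时机重建字典项
    previous = entry
  end
  out
end

# "aaaa" 压缩后是 [0, 1, 0]:编号 1 在解码到它时尚未建立,触发特殊情况
p lzw_decode([0, 1, 0], ["a"])  # => "aaaa"
```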
于是我们以此正确构建我们的解压代码:
char* decompress(size_t* size, const char* filepath) {
FILE* source = fopen(filepath, "rb");
struct dictionary* dict = dictionary_init();
// First size_t indicates the file size
fread(size, sizeof(size_t), 1, source);
char* result = malloc(sizeof(char) * (*size));
size_t counter = 0;
if (*size == 0) {
dictionary_free(dict);
return result;
}
unsigned short p_index;
unsigned short c_index;
char* c_word;
size_t c_word_size;
char* p_word = NULL;
size_t p_word_size = 0;
fread(&p_index, sizeof(unsigned short), 1, source);
p_word = dict->entries[p_index];
p_word_size = dict->entries_sizes[p_index];
memcpy(result+counter, p_word, p_word_size);
counter += p_word_size;
while (counter < *size) {
fread(&c_index, sizeof(unsigned short), 1, source);
if (c_index < dict->count) {
// Found
c_word = dict->entries[c_index];
c_word_size = dict->entries_sizes[c_index];
memcpy(result+counter, c_word, c_word_size);
counter += c_word_size;
char* ppc = malloc(sizeof(char) * (p_word_size + 1));
memcpy(ppc, p_word, p_word_size);
memcpy(ppc+p_word_size, c_word, 1);
dictionary_insert(dict, ppc, p_word_size + 1);
free(ppc);
} else {
char* ppc = malloc(sizeof(char) * (p_word_size + 1));
memcpy(ppc, p_word, p_word_size);
memcpy(ppc+p_word_size, p_word, 1);
c_index = dictionary_insert(dict, ppc, p_word_size + 1);
memcpy(result+counter, ppc, p_word_size + 1);
c_word = dict->entries[c_index];
c_word_size = p_word_size + 1;
counter += p_word_size + 1;
free(ppc);
}
p_word = c_word;
p_word_size = c_word_size;
}
dictionary_free(dict);
fclose(source);
return result;
}
我尝试用马丁路德金的《I have a dream》作为例子实验了一下这个实现:
int main() {
const char path[] = "/tmp/compressed.lzw";
char test[] = "I am happy to ...";
compress(test, sizeof(test), path);
size_t s = 0;
FILE* f = fopen(path, "rb");
fseek(f, 0, SEEK_END); // seek to end of file
size_t file_size = ftell(f);
fclose(f);
char* res = decompress(&s, path);
assert(sizeof(test) == s);
assert(strcmp(test, res) == 0);
printf(" Raw Text: %s\n", test);
printf("Decompressed Text: %s\n", res);
printf(" Compression Rate: %d%%\n", (unsigned int)(file_size * 1.0 / sizeof(test) * 100));
return 0;
}
最后的输出如下:
/Users/delton/CLionProjects/playgound/cmake-build-debug/playgound
Raw Text: I am happy to ...
Decompressed Text: I am happy to ...
Compression Rate: 73%
整体压缩率有 73%,对比了一下 zip 42% 的压缩率,实在是望尘莫及。
如果我们把文本重复 5 遍,压缩率能提升到 49%。不过在这个条件下,zip 能提高到 9% 的压缩率。果然还是不能和 DEFLATE 这种 LZ77 + 霍夫曼树的怪物比啊。不过简简单单 200 行代码就能实现一个性能上还不错的压缩、解压缩算法,我已经比较满意了。
]]>在准备 RubyConf China 2020 的时候,我仔细检查了 Fiber 调度器提案的补丁。当我看调度器的样例代码的时候,我发现其调用的是 Ruby 中的 IO.select
API。IO.select
API 在 Ruby 内部有多种实现,它可能调用 poll
、大尺寸 select
、POSIX 兼容的 select
取决于不同的操作系统。于是我想用一些更快的 syscall 来实现,比如 epoll
kqueue
和 IOCP
。
我做了一个相关的提案但是被拒绝了。主要问题是 Ruby 的 IO.select
API 是无状态的。如果没有含状态的注册,这些新 API 的性能甚至会不如 poll
。在 Koichi Sasada 跑了 benchmark 证明了这一点后,提案被正式拒绝。在和 Samuel Williams 在 Twitter 上讨论后,他建议我从 Scheduler
的实现上来进行注入,因为 Scheduler
本身是有状态的。于是我开始写一个 gem 作为 Ruby 3 调度器接口的概念证明。
本文中的 Ruby 版本是:
ruby 2.8.0dev (2020-08-18T10:10:09Z master 172d44e809) [x86_64-linux]
基本的 Scheduler 例子来自于 Ruby 的单元测试。这是 Ruby 3 调度器的测试,而不是真正用于生产的,因此是使用 IO.select
进行 I/O 多路复用。因此我们可以基于此,开发一个性能更好的 Ruby 调度器。
我们需要做一些 C 开发来支持其它 syscall,因此第一件事是兼容原始的实现。
IO.select
对于 select/poll API, 不需要预先创建文件描述符,也不需要在运行时注册文件描述符。所以唯一要做的就是处理调度器触发时的行为。
VALUE method_scheduler_wait(VALUE self) {
// return IO.select(@readable.keys, @writable.keys, [], next_timeout)
VALUE readable, writable, readable_keys, writable_keys, next_timeout;
ID id_select = rb_intern("select");
ID id_keys = rb_intern("keys");
ID id_next_timeout = rb_intern("next_timeout");
readable = rb_iv_get(self, "@readable");
writable = rb_iv_get(self, "@writable");
readable_keys = rb_funcall(readable, id_keys, 0);
writable_keys = rb_funcall(writable, id_keys, 0);
next_timeout = rb_funcall(self, id_next_timeout, 0);
return rb_funcall(rb_cIO, id_select, 4, readable_keys, writable_keys, rb_ary_new(), next_timeout);
}
我们花了 10 行 C 干了原来 1 行 Ruby 就干好了的事。主要是这允许我们用 C 的宏定义来控制,从而使用其它 I/O 多路复用方法,例如 epoll
和 kqueue
。我们需要实现 4 个 C 方法:
Scheduler.backend
scheduler = Scheduler.new
scheduler.register(io, interest)
scheduler.deregister(io)
scheduler.wait
#include <ruby.h>
VALUE Evt = Qnil;
VALUE Scheduler = Qnil;
void Init_evt_ext();
VALUE method_scheduler_init(VALUE self);
VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest);
VALUE method_scheduler_deregister(VALUE self, VALUE io);
VALUE method_scheduler_wait(VALUE self);
VALUE method_scheduler_backend();
void Init_evt_ext()
{
Evt = rb_define_module("Evt");
Scheduler = rb_define_class_under(Evt, "Scheduler", rb_cObject);
rb_define_singleton_method(Scheduler, "backend", method_scheduler_backend, 0);
rb_define_method(Scheduler, "init_selector", method_scheduler_init, 0);
rb_define_method(Scheduler, "register", method_scheduler_register, 2);
rb_define_method(Scheduler, "deregister", method_scheduler_deregister, 1);
rb_define_method(Scheduler, "wait", method_scheduler_wait, 0);
}
Scheduler.backend
是专门给调试用的,剩下 4 个 API 会注入到调度器的 Scheduler#run、Scheduler#wait_readable、Scheduler#wait_writable、Scheduler#wait_any 中。
epoll 和 kqueue
The three core epoll APIs are epoll_create, epoll_ctl, and epoll_wait. The mapping is straightforward: create the epoll fd when the scheduler is initialized, call epoll_ctl when an I/O event is registered, and finally replace IO.select with epoll_wait.
#if defined(__linux__) // TODO: Do more checks for using epoll
#include <sys/epoll.h>
#define EPOLL_MAX_EVENTS 64

VALUE method_scheduler_init(VALUE self) {
    rb_iv_set(self, "@epfd", INT2NUM(epoll_create(1))); // Size of epoll is ignored after Linux 2.6.8.
    return Qnil;
}

VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
    struct epoll_event event;
    ID id_fileno = rb_intern("fileno");
    int epfd = NUM2INT(rb_iv_get(self, "@epfd"));
    int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
    int ruby_interest = NUM2INT(interest);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WAIT_READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WAIT_WRITABLE")));

    event.events = 0; // the struct lives on the stack, so clear the mask before OR-ing into it
    if (ruby_interest & readable) {
        event.events |= EPOLLIN;
    }
    if (ruby_interest & writable) {
        event.events |= EPOLLOUT;
    }
    event.data.ptr = (void*) io;

    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event); // TODO: Check the return value
    return Qnil;
}

VALUE method_scheduler_deregister(VALUE self, VALUE io) {
    ID id_fileno = rb_intern("fileno");
    int epfd = NUM2INT(rb_iv_get(self, "@epfd"));
    int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); // Requires Linux 2.6.9+ for a NULL event.
    return Qnil;
}

VALUE method_scheduler_wait(VALUE self) {
    int n, epfd, i, event_flag, timeout;
    VALUE next_timeout, obj_io, readables, writables, result;
    ID id_next_timeout = rb_intern("next_timeout");
    ID id_push = rb_intern("push");

    epfd = NUM2INT(rb_iv_get(self, "@epfd"));
    next_timeout = rb_funcall(self, id_next_timeout, 0);
    readables = rb_ary_new();
    writables = rb_ary_new();

    if (next_timeout == Qnil) {
        timeout = -1; // block until an event arrives
    } else {
        timeout = NUM2INT(next_timeout); // milliseconds
    }

    struct epoll_event* events = (struct epoll_event*) xmalloc(sizeof(struct epoll_event) * EPOLL_MAX_EVENTS);
    n = epoll_wait(epfd, events, EPOLL_MAX_EVENTS, timeout);
    // TODO: Check if n >= 0
    for (i = 0; i < n; i++) {
        event_flag = events[i].events;
        if (event_flag & EPOLLIN) { // an fd may be both readable and writable
            obj_io = (VALUE) events[i].data.ptr;
            rb_funcall(readables, id_push, 1, obj_io);
        }
        if (event_flag & EPOLLOUT) {
            obj_io = (VALUE) events[i].data.ptr;
            rb_funcall(writables, id_push, 1, obj_io);
        }
    }

    result = rb_ary_new2(2);
    rb_ary_store(result, 0, readables);
    rb_ary_store(result, 1, writables);
    xfree(events);
    return result;
}

VALUE method_scheduler_backend(VALUE klass) {
    return rb_str_new_cstr("epoll");
}
#endif
kqueue is similar. The only difference is that BSD uses a single API, kevent, for both registering and waiting, distinguished only by its arguments, which makes it a bit harder to read.
#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__APPLE__)
#include <sys/event.h>
#define KQUEUE_MAX_EVENTS 64

VALUE method_scheduler_init(VALUE self) {
    rb_iv_set(self, "@kq", INT2NUM(kqueue()));
    return Qnil;
}

VALUE method_scheduler_register(VALUE self, VALUE io, VALUE interest) {
    struct kevent event;
    short filter = 0; // kqueue filters are enumerated values (EVFILT_READ == -1), not bit flags
    ID id_fileno = rb_intern("fileno");
    int kq = NUM2INT(rb_iv_get(self, "@kq"));
    int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
    int ruby_interest = NUM2INT(interest);
    int readable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WAIT_READABLE")));
    int writable = NUM2INT(rb_const_get(rb_cIO, rb_intern("WAIT_WRITABLE")));

    if (ruby_interest & readable) {
        filter = EVFILT_READ;
    } else if (ruby_interest & writable) {
        filter = EVFILT_WRITE;
    }

    EV_SET(&event, fd, filter, EV_ADD|EV_ENABLE, 0, 0, (void*) io);
    kevent(kq, &event, 1, NULL, 0, NULL); // TODO: Check the return value
    return Qnil;
}

VALUE method_scheduler_deregister(VALUE self, VALUE io) {
    struct kevent event;
    ID id_fileno = rb_intern("fileno");
    int kq = NUM2INT(rb_iv_get(self, "@kq"));
    int fd = NUM2INT(rb_funcall(io, id_fileno, 0));
    // A kevent is identified by (fd, filter), so remove both possible filters;
    // the call for the filter that was never added fails harmlessly with ENOENT.
    EV_SET(&event, fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
    kevent(kq, &event, 1, NULL, 0, NULL);
    EV_SET(&event, fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
    kevent(kq, &event, 1, NULL, 0, NULL);
    return Qnil;
}

VALUE method_scheduler_wait(VALUE self) {
    int n, kq, i, t;
    struct kevent* events; // Events triggered
    struct timespec timeout;
    VALUE next_timeout, obj_io, readables, writables, result;
    ID id_next_timeout = rb_intern("next_timeout");
    ID id_push = rb_intern("push");

    kq = NUM2INT(rb_iv_get(self, "@kq"));
    next_timeout = rb_funcall(self, id_next_timeout, 0);
    readables = rb_ary_new();
    writables = rb_ary_new();
    events = (struct kevent*) xmalloc(sizeof(struct kevent) * KQUEUE_MAX_EVENTS);

    if (next_timeout == Qnil || NUM2INT(next_timeout) == -1) {
        n = kevent(kq, NULL, 0, events, KQUEUE_MAX_EVENTS, NULL); // block indefinitely
    } else {
        t = NUM2INT(next_timeout); // convert the Ruby VALUE to an int (milliseconds) before doing arithmetic
        timeout.tv_sec = t / 1000;
        timeout.tv_nsec = (t % 1000) * 1000 * 1000;
        n = kevent(kq, NULL, 0, events, KQUEUE_MAX_EVENTS, &timeout);
    }

    // TODO: Check if n >= 0
    for (i = 0; i < n; i++) {
        // filter is an enumerated value, so compare with ==, not with a bitmask
        if (events[i].filter == EVFILT_READ) {
            obj_io = (VALUE) events[i].udata;
            rb_funcall(readables, id_push, 1, obj_io);
        } else if (events[i].filter == EVFILT_WRITE) {
            obj_io = (VALUE) events[i].udata;
            rb_funcall(writables, id_push, 1, obj_io);
        }
    }

    result = rb_ary_new2(2);
    rb_ary_store(result, 0, readables);
    rb_ary_store(result, 1, writables);
    xfree(events);
    return result;
}

VALUE method_scheduler_backend(VALUE klass) {
    return rb_str_new_cstr("kqueue");
}
#endif
With the scheduler implemented, we need to measure its performance, so I wrote a simple HTTP server benchmark.
require 'socket'
require 'evt'

puts "Using Backend: #{Evt::Scheduler.backend}"
Thread.current.scheduler = Evt::Scheduler.new

@server = Socket.new Socket::AF_INET, Socket::SOCK_STREAM
@server.bind Addrinfo.tcp '127.0.0.1', 3002
@server.listen Socket::SOMAXCONN
@scheduler = Thread.current.scheduler

def handle_socket(socket)
  line = socket.gets
  until line == "\r\n" || line.nil?
    line = socket.gets
  end
  socket.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
  socket.close
end

Fiber.new(blocking: false) do
  while true
    socket, addr = @server.accept
    Fiber.new(blocking: false) do
      handle_socket(socket)
    end.resume
  end
end.resume

@scheduler.run
Compared with the original blocking I/O, Ruby 3's non-blocking I/O reaches 3.33x the throughput, and with epoll it reaches 4.21x. Since this server example is very simple, enabling the JIT rarely causes ICache misses, which pushes performance further to 4.54x.
The benchmark ran single-threaded on an Intel(R) Xeon(R) CPU E3-1220L V2 @ 2.30GHz; with a faster CPU, the gap between epoll and poll would be even larger. You are welcome to try it out; the gem's code is open source.
Future work falls into two parts: improving the stability of the existing APIs, and adding support for io_uring and IOCP. io_uring should be manageable, but I know nothing about Windows development, so suggestions and contributions are very welcome.