Technology

  • Convert a Windows IIS *.pfx file into SSL certificate files usable on Linux:
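# Dump the private key and certificate(s) from the PFX into one unencrypted PEM file
openssl pkcs12 -in ssl.pfx -nodes -out ssl.pem
# Extract the RSA private key
openssl rsa -in ssl.pem -out ssl.key
# Extract the certificate
openssl x509 -in ssl.pem -out ssl.crt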
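
After the conversion it can be worth confirming that the extracted key and certificate actually belong together. The following is a minimal sketch, assuming an RSA key (as the `openssl rsa` step implies) and the file names used above; `check_pair` is a hypothetical helper name, not something from the original steps.

```shell
#!/bin/sh
# check_pair KEY CRT: hypothetical helper. Succeeds (exit 0) when the RSA key
# and the certificate share the same modulus, i.e. they form a matching pair.
# Returns 2 if either file cannot be parsed, 1 if the moduli differ.
check_pair() {
    key_mod=$(openssl rsa -in "$1" -noout -modulus 2>/dev/null) || return 2
    crt_mod=$(openssl x509 -in "$2" -noout -modulus 2>/dev/null) || return 2
    [ "$key_mod" = "$crt_mod" ]
}

# Usage (file names from the conversion above):
# check_pair ssl.key ssl.crt && echo "key and certificate match"
```

If the check fails, the PEM file probably contains more than one certificate and `openssl x509` picked up one that is not the server certificate.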

MongoDB 3.6 & mongoose Issue

MongoDB server version in use: 3.6.7

Crash Log:

pthread_create failed: Resource temporarily unavailable in sharding cluster

Terminating session due to error: InternalError: failed to create service entry worker thread

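In short, the OS ran out of resources for new connections: each incoming connection needs a worker thread, and pthread_create could no longer create one. vm.max_map_count defaults to 65530, and the MongoDB Operations Checklist lists recommended settings for production environments, but this was clearly not a problem that tuning those values would solve.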
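
To see how close a host is to these limits, the relevant values can be read straight from /proc. This is a rough sketch assuming a Linux host; `show_thread_limits` and `count_threads` are hypothetical helper names, and the set of limits shown is a selection of the ones that commonly come up when thread creation fails, not an exhaustive list.

```shell
#!/bin/sh
# show_thread_limits: hypothetical helper printing Linux limits that are
# commonly involved when pthread_create fails with
# "Resource temporarily unavailable".
show_thread_limits() {
    echo "vm.max_map_count=$(cat /proc/sys/vm/max_map_count)"
    echo "kernel.threads-max=$(cat /proc/sys/kernel/threads-max)"
    # Per-process limits inherited from the current shell; run this as the
    # user that owns mongod to see the values mongod would get.
    grep -E 'Max (processes|open files)' /proc/self/limits
}

# count_threads PID: number of threads currently owned by a process.
count_threads() {
    ls "/proc/$1/task" | wc -l
}

# Usage (assumes a single mongod process on the host):
# show_thread_limits
# count_threads "$(pidof mongod)"
```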

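Digging further for the root cause, I found that the log showed a large number of connections that were never fully closed. I had long assumed these messages were normal.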

[thread4] Starting new replica set monitor for rs/172.31.15.27:27017,172.31.5.133:27017,172.31.5.84:27017
[thread4] Successfully connected to 172.31.5.133:27017 (1 connections now open to 172.31.5.133:27017 with a 5 second timeout)
[ReplicaSetMonitor-TaskExecutor-0] Successfully connected to 172.31.5.84:27017 (1 connections now open to 172.31.5.84:27017 with a 5 second timeout)
[listener] connection accepted from 172.31.15.27:49040 #4 (4 connections now open)
[conn4] received client metadata from 172.31.15.27:49040 conn4: { driver: { name: "MongoDB Internal Client", version: "3.6.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
[thread4] Successfully connected to 172.31.15.27:27017 (1 connections now open to 172.31.15.27:27017 with a 5 second timeout)
[thread4] Successfully connected to 172.31.5.84:27017 (1 connections now open to 172.31.5.84:27017 with a 0 second timeout)
[thread4] scoped connection to 172.31.5.84:27017 not being returned to the pool
[thread4] Starting new replica set monitor for rs/172.31.15.27:27017,172.31.5.133:27017,172.31.5.84:27017
[thread4] Successfully connected to 172.31.5.84:27017 (2 connections now open to 172.31.5.84:27017 with a 0 second timeout)
[thread4] scoped connection to 172.31.5.84:27017 not being returned to the pool
[thread4] Starting new replica set monitor for rs/172.31.15.27:27017,172.31.5.133:27017,172.31.5.84:27017
[thread4] Successfully connected to 172.31.5.84:27017 (3 connections now open to 172.31.5.84:27017 with a 0 second timeout)
[thread4] scoped connection to 172.31.5.84:27017 not being returned to the pool

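What follows is an endless stream of "scoped connection not being returned to the pool" messages. Over three months, the test environment accumulated about 125,000 connections that were never closed, even though the log reported only "11 connections now open".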
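
Counting these messages is a quick way to gauge the size of the leak. A minimal sketch, assuming the log lines have the same shape as the excerpt above; `count_leaked` and `leaked_by_host` are hypothetical helper names, and the log path in the usage comment is an assumption.

```shell
#!/bin/sh
# count_leaked LOGFILE: number of "not being returned to the pool" messages.
count_leaked() {
    grep -c 'scoped connection to .* not being returned to the pool' "$1"
}

# leaked_by_host LOGFILE: the same messages, tallied per target host:port,
# most frequent first.
leaked_by_host() {
    grep -o 'scoped connection to [^ ]* not being returned to the pool' "$1" \
        | awk '{ print $4 }' | sort | uniq -c | sort -rn
}

# Usage (path is an assumption; adjust to the actual mongod/mongos log):
# count_leaked /var/log/mongodb/mongod.log
# leaked_by_host /var/log/mongodb/mongod.log
```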

Eventually I found a few tickets. This appears to be a bug in the 3.6 series that was not fixed until 3.6.8 (2018-09-19).

Related issues:

But that still does not seem to explain why 65k connections were open…

The next morning, a star teammate working remotely sent over a link that tells the mongoose side of the story:

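The problem appears to be triggered on reconnect, so it never shows up while the connection is healthy. The likely triggers are MongoDB going down or a failover in progress: either one forces mongoose to reconnect, and that is when the number of connections balloons.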

The fix is simple: upgrade mongoose to 5.2.9 (2018-08-17) or later.
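
Whether a project already meets that minimum can be checked with a small version comparison. This is only a sketch: `version_ge` is a hypothetical helper that relies on GNU `sort -V`, and the usage comment assumes it is run from a project root where mongoose is installed.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (run from the project root; assumes mongoose is in node_modules):
# installed=$(node -p "require('mongoose').version")
# version_ge "$installed" 5.2.9 || echo "mongoose $installed is older than 5.2.9"
```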

Other