MySQL 產(chǎn)生大量小relay log的故障一例: 一、案例來(lái)源和現(xiàn)象 這個(gè)案例是朋友 @peaceful遇到的線上問(wèn)題,最終線索也是他自己找到的。現(xiàn)象如下: 1、出現(xiàn)了大量很小的relay log如下,堆積量大約2600個(gè): ... -rw-r----- 1 mysql dba 12827 Oct 11 12:28 mysql-relay-bin.036615 -rw-r----- 1 mysql dba 4908 Oct 11 12:28 mysql-relay-bin.036616 -rw-r----- 1 mysql dba 1188 Oct 11 12:28 mysql-relay-bin.036617 -rw-r----- 1 mysql dba 5823 Oct 11 12:29 mysql-relay-bin.036618 -rw-r----- 1 mysql dba 507 Oct 11 12:29 mysql-relay-bin.036619 -rw-r----- 1 mysql dba 1188 Oct 11 12:29 mysql-relay-bin.036620 -rw-r----- 1 mysql dba 3203 Oct 11 12:29 mysql-relay-bin.036621 -rw-r----- 1 mysql dba 37916 Oct 11 12:30 mysql-relay-bin.036622 -rw-r----- 1 mysql dba 507 Oct 11 12:30 mysql-relay-bin.036623 -rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036624 -rw-r----- 1 mysql dba 4909 Oct 11 12:31 mysql-relay-bin.036625 -rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036626 -rw-r----- 1 mysql dba 507 Oct 11 12:31 mysql-relay-bin.036627 -rw-r----- 1 mysql dba 507 Oct 11 12:32 mysql-relay-bin.036628 -rw-r----- 1 mysql dba 1188 Oct 11 12:32 mysql-relay-bin.036629 -rw-r----- 1 mysql dba 454 Oct 11 12:32 mysql-relay-bin.036630 -rw-r----- 1 mysql dba 6223 Oct 11 12:32 mysql-relay-bin.index 2、主庫(kù)錯(cuò)誤日志有如下錯(cuò)誤 2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421). 2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4) 2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425). 2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4) 二、slave_net_timeout參數(shù)分析 實(shí)際上第一眼看這個(gè)案例我也覺(jué)得很奇怪,因?yàn)楹苌儆腥藭?huì)去設(shè)置slave_net_timeout參數(shù),同樣我們也沒(méi)有設(shè)置過(guò),因此關(guān)注較少。但是 @peaceful自己找到了可能出現(xiàn)問(wèn)題的設(shè)置就是當(dāng)前從庫(kù)slave_net_timeout參數(shù)設(shè)置為10。我就順著這個(gè)線索往下分析,我們先來(lái)看看slave_net_timeout參數(shù)的功能。 當(dāng)前看來(lái)從庫(kù)的slave_net_timeout有如下兩個(gè)功能: 1、設(shè)置IO線程在空閑情況下(沒(méi)有Event接收的情況下)的連接超時(shí)時(shí)間。 這個(gè)參數(shù)5.7.7過(guò)后是60秒,以前是3600秒,修改后需要重啟主從才會(huì)生效。 2、如果change master沒(méi)有指定MASTER_HEARTBEAT_PERIOD的情況下會(huì)設(shè)置為slave_net_timeout/2 一般我們配置主從都沒(méi)有去指定這個(gè)心跳周期,因此就是slave_net_timeout/2,它控制的是如果在主庫(kù)沒(méi)有Event產(chǎn)生的情況下,多久發(fā)送一個(gè)心跳Event給從庫(kù)的IO線程,用于保持連接。但是一旦我們配置了主從(change master)這個(gè)值就定下來(lái)了,不會(huì)隨著slave_net_timeout參數(shù)的更改而更改,我們可以在slave_master_info表中找到相應(yīng)的設(shè)置如下: mysql> select Heartbeat from slave_master_info G *************************** 1. row *************************** Heartbeat: 30 1 row in set (0.01 sec) 如果我們要更改這個(gè)值只能重新 change master才行。 三、原因總結(jié) 如果滿(mǎn)足下面三個(gè)條件,將會(huì)出現(xiàn)案例中的故障: 主從中的MASTER_HEARTBEAT_PERIOD的值大于從庫(kù)slave_net_timeout 主庫(kù)當(dāng)前壓力很小持續(xù)slave_net_timeout設(shè)置時(shí)間沒(méi)有產(chǎn)生新的Event 之前主從有一定的延遲 那么這種情況下在主庫(kù)心跳Event發(fā)送給從庫(kù)的IO線程之前,IO線程已經(jīng)斷開(kāi)了。斷開(kāi)后IO線程會(huì)進(jìn)行重連,每次重連將會(huì)生成新的relay log,但是這些relay log由于延遲問(wèn)題不能清理就出現(xiàn)了案例中的情況。 下面是官方文檔中關(guān)于這部分說(shuō)明: If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen as the value of the Heartbeat column of the mysql.slave_master_info table. Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the value of slave_net_timeout divided by 2. Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to reset it to the default value. 四、案例模擬 有了理論基礎(chǔ)就很好了模擬了,但是延遲這一點(diǎn)不太好模擬,因此我模擬的時(shí)候關(guān)閉了從庫(kù)的SQL線程來(lái)模擬積壓的情況。 提前配置好主從,查看當(dāng)前的心跳周期和slave_net_timeout參數(shù)如下: mysql> show variables like '%slave_net_timeout%'; +-------------------+-------+ | Variable_name | Value | +-------------------+-------+ | slave_net_timeout | 60 | +-------------------+-------+ 1 row in set (0.01 sec) mysql> select Heartbeat from slave_master_info G *************************** 1. row *************************** Heartbeat: 30 1 row in set (0.00 sec) 1、停止從庫(kù)的SQL線程 stop slave sql_thread; 2、設(shè)置slave_net_timeout為10 mysql> set global slave_net_timeout=10; Query OK, 0 rows affected, 1 warning (0.00 sec) mysql> show warnings; +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Level | Code | Message | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) 可以看到這里實(shí)際上已經(jīng)有一個(gè)警告了。 3、重啟IO線程 這樣才會(huì)讓slave_net_timeout參數(shù)生效 mysql> stop slave ; Query OK, 0 rows affected (0.01 sec) mysql> start slave io_thread; Query OK, 0 rows affected (0.01 sec) 4、觀察現(xiàn)象 大概每10秒會(huì)生成一個(gè)relay log文件如下: -rw-r----- 1 mysql mysql 500 2019-09-27 23:48:32.655001361 +0800 relay.000142 -rw-r----- 1 mysql mysql 500 2019-09-27 23:48:42.943001355 +0800 relay.000143 -rw-r----- 1 mysql mysql 500 2019-09-27 23:48:53.293001363 +0800 relay.000144 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:03.502000598 +0800 relay.000145 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:13.799001357 +0800 relay.000146 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:24.055001354 +0800 relay.000147 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:34.280001827 +0800 relay.000148 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:44.496001365 +0800 relay.000149 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:54.789001353 +0800 relay.000150 -rw-r----- 1 mysql mysql 500 2019-09-27 23:50:05.485001371 +0800 relay.000151 -rw-r----- 1 mysql mysql 500 2019-09-27 23:50:15.910001430 +0800 relay.000152 大概每10秒主庫(kù)的日志會(huì)輸出如下日志: 2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216). 2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4) 2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217). 2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4) 2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218). 2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4) 2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219). 2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4) 這個(gè)日志就和案例中的一模一樣了。 解決問(wèn)題 知道原因后解決也就很簡(jiǎn)單了我們只需設(shè)置slave_net_timeout參數(shù)為MASTER_HEARTBEAT_PERIOD的2倍就可以了,設(shè)置后重啟主從即可。 五、實(shí)現(xiàn)方式 這里我們將通過(guò)簡(jiǎn)單的源碼調(diào)用分析來(lái)看看到底slave_net_timeout參數(shù)和MASTER_HEARTBEAT_PERIOD對(duì)主從的影響。 1、從庫(kù)使用參數(shù)slave_net_timeout 從庫(kù)IO線程啟動(dòng)時(shí)候會(huì)通過(guò)參數(shù)slave_net_timeout設(shè)置超時(shí): ->connect_to_master -> mysql_options case MYSQL_OPT_CONNECT_TIMEOUT: //MYSQL_OPT_CONNECT_TIMEOUT mysql->options.connect_timeout= *(uint*) arg; break; 而在建立和主庫(kù)的連接時(shí)候會(huì)使用這個(gè)值 connect_to_master ->mysql_real_connect -> get_vio_connect_timeout timeout_sec= mysql->options.connect_timeout; 因此我們也看到了slave_net_timeout參數(shù)只有在IO線程重啟的時(shí)候才會(huì)生效 2、從庫(kù)設(shè)置MASTER_HEARTBEAT_PERIOD值 在每次使用從庫(kù)change master時(shí)候會(huì)設(shè)置這個(gè)值如下,默認(rèn)為slave_net_timeout/2: ->change_master ->change_receive_options mi->heartbeat_period= min<float>(SLAVE_MAX_HEARTBEAT_PERIOD, (slave_net_timeout/2.0f)); 因此我們看到只有change master才會(huì)重新設(shè)置這個(gè)值,重啟主從是不會(huì)重新設(shè)置的。 3、使用MASTER_HEARTBEAT_PERIOD值 每次IO線程啟動(dòng)時(shí)候會(huì)將這個(gè)值傳遞給主庫(kù)的DUMP線程,方式應(yīng)該是通過(guò)構(gòu)建語(yǔ)句‘SET @master_heartbeat_period’來(lái)完成的。如下: ->handle_slave_io ->get_master_version_and_clock if (mi->heartbeat_period != 0.0) { char llbuf[22]; const char query_format[]= "SET @master_heartbeat_period= %s"; char query[sizeof(query_format) - 2 + sizeof(llbuf)]; 主庫(kù)啟動(dòng)DUMP線程的時(shí)候會(huì)通過(guò)搜索的方式找到這個(gè)值如下 ->Binlog_sender::init ->Binlog_sender::init_heartbeat_period user_var_entry *entry= (user_var_entry*) my_hash_search(&m_thd->user_vars, (uchar*) name.str, name.length); m_heartbeat_period= entry ? entry->val_int(&null_value) : 0; 4、DUMP線程使用MASTER_HEARTBEAT_PERIOD發(fā)送心跳Event 這里主要是通過(guò)一個(gè)超時(shí)等待來(lái)完成,如下: ->Binlog_sender::wait_new_events ->Binlog_sender::wait_with_heartbeat set_timespec_nsec(&ts, m_heartbeat_period); //心跳超時(shí) ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//等待 if (ret != ETIMEDOUT && ret != ETIME) //如果是正常收到則收到信號(hào),說(shuō)明有新的Event到來(lái),否則如果是超時(shí)則發(fā)送心跳Event break; //正常返回0 是超時(shí)返回ETIMEDOUT 繼續(xù)循環(huán) if (send_heartbeat_event(log_pos)) //發(fā)送心跳Event return 1; 5、重連會(huì)殺掉可能的存在的DUMP線程 根據(jù)UUID進(jìn)行比對(duì)如下: ->kill_zombie_dump_threads Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid); THD *tmp= Global_THD_manager::get_instance()-> find_thd(&find_zombie_dump_thread); if (tmp) { /* Here we do not call kill_one_thread() as it will be slow because it will iterate through the list again. We just to do kill the thread ourselves. */ if (log_warnings > 1) { if (slave_uuid.length()) { sql_print_information("While initializing dump thread for slave with " "UUID <%s>, found a zombie dump thread with the " "same UUID. Master is killing the zombie dump " "thread(%u).", slave_uuid.c_ptr(), tmp->thread_id()); }//這里就是本案例中的日志了 ..... 這里我們看到了案例中的日志。 |
免責(zé)聲明:本站部分文章和圖片均來(lái)自用戶(hù)投稿和網(wǎng)絡(luò)收集,旨在傳播知識(shí),文章和圖片版權(quán)歸原作者及原出處所有,僅供學(xué)習(xí)與參考,請(qǐng)勿用于商業(yè)用途,如果損害了您的權(quán)利,請(qǐng)聯(lián)系我們及時(shí)修正或刪除。謝謝!
始終以前瞻性的眼光聚焦站長(zhǎng)、創(chuàng)業(yè)、互聯(lián)網(wǎng)等領(lǐng)域,為您提供最新最全的互聯(lián)網(wǎng)資訊,幫助站長(zhǎng)轉(zhuǎn)型升級(jí),為互聯(lián)網(wǎng)創(chuàng)業(yè)者提供更加優(yōu)質(zhì)的創(chuàng)業(yè)信息和品牌營(yíng)銷(xiāo)服務(wù),與站長(zhǎng)一起進(jìn)步!讓互聯(lián)網(wǎng)創(chuàng)業(yè)者不再孤獨(dú)!
掃一掃,關(guān)注站長(zhǎng)網(wǎng)微信