Restore missing IPv4 on IB interface and recover Weka
Last updated: September 12, 2025
Switch reboots or other network blips can cause the IPv4 address to drop off an IB interface. This guide explains how to restore it without rebooting the entire node.
Confirm the symptoms
If a customer notices before we do, the usual symptoms are df or ls type commands hanging, nodes failing to deploy in k8s (e.g. applied compute ray cluster), etc.
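Because a plain df can hang the shell on an affected node, it can help to probe the mount in the background with a time limit. The following is a minimal sketch, not an established tool: the function name probe_mount, the default path /data (taken from the df example in this guide) and the 5 second limit are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: check whether df on a mount point returns within a time limit.
# Assumptions: /data is the Weka mount; 5 seconds is an arbitrary cutoff.
probe_mount() {
    local path="${1:-/data}"
    local limit="${2:-5}"
    local waited=0

    # Run df in the background so a hang cannot block this shell.
    df "$path" >/dev/null 2>&1 &
    local pid=$!

    # Poll once per second until df exits or the limit is reached.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "HUNG: df $path still running after ${limit}s (pid $pid)"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # df finished; report whether it succeeded.
    if wait "$pid"; then
        echo "OK: df $path returned"
    else
        echo "FAILED: df $path exited with an error"
        return 2
    fi
}
```

Calling probe_mount /data 5 prints a line starting with OK, HUNG or FAILED. A HUNG result matches the symptom described above. The stuck df process is deliberately left alone rather than killed, since a process blocked on a hung filesystem may not respond to signals.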
Run ip -br a to see the list of interfaces:
sbroomhead@c3b5bb36-02:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
enp93s0f0 DOWN
enp90s0f0np0 UP 10.56.4.64/24 fe80::a288:c2ff:fe09:8044/64
enp93s0f1 DOWN
enp90s0f1np1 DOWN
ibp26s0 UP fe80::a288:c203:a:4404/64 <-------
ibp44s0 UP 172.16.64.2/16 fe80::a288:c203:a:47bc/64
ibp64s0 UP 172.16.64.3/16 fe80::a288:c203:d:a30/64
ibp101s0 UP 172.16.64.4/16 fe80::a288:c203:a:435c/64
ibp156s0 UP 172.16.64.5/16 fe80::a288:c203:d:938/64
ibp173s0 UP 172.16.64.6/16 fe80::a288:c203:a:443c/64
ibp192s0 UP 172.16.64.7/16 fe80::a288:c203:a:2ba4/64
ibp227s0 UP 172.16.64.8/16 fe80::a288:c203:d:918/64
docker0 DOWN 172.17.0.1/16 fe80::42:cff:fe71:fb0/64
tailscale0 UNKNOWN 100.64.39.12/32 fd7a:115c:a1e0::e301:270e/128 fe80::eec6:f5f8:7c3:e7a4/64
If the above symptoms are seen and an IB interface is missing its IPv4 address, like this one, the node has this problem:
ibp26s0 UP fe80::a288:c203:a:4404/64 <-------
If the weka agent has been restarted at any point while the IPv4 address is missing, you'll likely see errors like this in the weka-agent logs (systemctl status weka-agent):
Sep 11 13:48:03 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
Sep 11 13:48:08 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
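Sep 11 13:48:13 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
On Linux, errno 99 is EADDRNOTAVAIL: the address the agent is trying to bind to is not assigned to any local interface.
To restore the IPv4 address to the interface, run sudo netplan apply: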
sbroomhead@c3b5bb36-02:~$ sudo netplan apply
** (generate:3239692): WARNING **: 17:13:34.451: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
WARNING:root:Cannot call Open vSwitch: ovsdb-server.service is not running.
** (process:3239689): WARNING **: 17:15:04.680: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
** (process:3239689): WARNING **: 17:15:05.145: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
** (process:3239689): WARNING **: 17:15:05.145: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
If df commands still hang and/or weka local ps says initializing after having run sudo netplan apply, run sudo weka local restart:
sbroomhead@c3b5bb36-02:~$ sudo weka local restart
Restarting weka on this host
Container "client" is RUNNING (pid = 3240684)
client: Allocated network device "ibp26s0" (with identifier "0000:1a:00.0") to slots [1] on "cabb7117-54.cloud.together.ai":"client" (1/8)
client: Allocated network device "ibp44s0" (with identifier "0000:2c:00.0") to slots [2] on "cabb7117-54.cloud.together.ai":"client" (2/8)
client: Allocated network device "ibp64s0" (with identifier "0000:40:00.0") to slots [3] on "cabb7117-54.cloud.together.ai":"client" (3/8)
client: Allocated network device "ibp101s0" (with identifier "0000:65:00.0") to slots [4] on "cabb7117-54.cloud.together.ai":"client" (4/8)
client: Allocated network device "ibp156s0" (with identifier "0000:9c:00.0") to slots [5] on "cabb7117-54.cloud.together.ai":"client" (5/8)
client: Allocated network device "ibp173s0" (with identifier "0000:ad:00.0") to slots [6] on "cabb7117-54.cloud.together.ai":"client" (6/8)
client: Allocated network device "ibp192s0" (with identifier "0000:c0:00.0") to slots [7] on "cabb7117-54.cloud.together.ai":"client" (7/8)
client: Allocated network device "ibp227s0" (with identifier "0000:e3:00.0") to slots [8] on "cabb7117-54.cloud.together.ai":"client" (8/8)
client: Allocated core 1 to slot 1 on "cabb7117-54.cloud.together.ai":"client" (1/8)
client: Allocated core 2 to slot 3 on "cabb7117-54.cloud.together.ai":"client" (2/8)
client: Allocated core 3 to slot 5 on "cabb7117-54.cloud.together.ai":"client" (3/8)
client: Allocated core 4 to slot 7 on "cabb7117-54.cloud.together.ai":"client" (4/8)
client: Allocated core 32 to slot 2 on "cabb7117-54.cloud.together.ai":"client" (5/8)
client: Allocated core 35 to slot 8 on "cabb7117-54.cloud.together.ai":"client" (6/8)
client: Allocated core 34 to slot 6 on "cabb7117-54.cloud.together.ai":"client" (7/8)
client: Allocated core 33 to slot 4 on "cabb7117-54.cloud.together.ai":"client" (8/8)
client: Starting hugepages allocation for "cabb7117-54.cloud.together.ai":"client"
client: Allocated 11264MB hugepages memory from 2 NUMA nodes for "cabb7117-54.cloud.together.ai":"client"
client: Bandwidth of "cabb7117-54.cloud.together.ai":"client" set to unlimited
client: WekaFS driver attached by "NodeId<34980>" on "cabb7117-54.cloud.together.ai":"client"
Container "client" is ready (pid = 3240684)
Confirm the IPv4 address is there and commands like df don't hang:
sbroomhead@c3b5bb36-02:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
enp93s0f0 DOWN
enp90s0f0np0 UP 10.56.4.64/24 fe80::a288:c2ff:fe09:8044/64
enp93s0f1 DOWN
enp90s0f1np1 DOWN
ibp26s0 UP 172.16.64.1/16 fe80::a288:c203:a:4404/64
ibp44s0 UP 172.16.64.2/16 fe80::a288:c203:a:47bc/64
ibp64s0 UP 172.16.64.3/16 fe80::a288:c203:d:a30/64
ibp101s0 UP 172.16.64.4/16 fe80::a288:c203:a:435c/64
ibp156s0 UP 172.16.64.5/16 fe80::a288:c203:d:938/64
ibp173s0 UP 172.16.64.6/16 fe80::a288:c203:a:443c/64
ibp192s0 UP 172.16.64.7/16 fe80::a288:c203:a:2ba4/64
ibp227s0 UP 172.16.64.8/16 fe80::a288:c203:d:918/64
docker0 DOWN 172.17.0.1/16 fe80::42:cff:fe71:fb0/64
tailscale0 UNKNOWN 100.64.39.12/32 fd7a:115c:a1e0::e301:270e/128 fe80::eec6:f5f8:7c3:e7a4/64
sbroomhead@c3b5bb36-02:~$ df -h /data
Filesystem Size Used Avail Use% Mounted on
172.16.201.19/c3b5bb36_data 91T 80T 12T 88% /data
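
The two diagnostics used in this guide (the interface listing and the weka-agent bind errors) can also be checked with small filters. This is a sketch under stated assumptions: the function names are hypothetical, IB interface names are assumed to start with "ib" as in the listings above, and the log format is assumed to match the "Failed to bind on ADDRESS:PORT, errno=99" lines shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: filters for the two checks in this guide. Both read text on stdin,
# so they can be tried on saved output before being used on a live node.

# Print IB interfaces that are UP but have no IPv4 address.
# Expects `ip -br a` output: NAME STATE [ADDRESS/PREFIX ...]
# Assumption: IB interface names begin with "ib".
find_ib_missing_ipv4() {
    awk '$1 ~ /^ib/ && $2 == "UP" {
        has_v4 = 0
        # Address fields start at column 3; IPv4 looks like a.b.c.d/len.
        for (i = 3; i <= NF; i++)
            if ($i ~ /^[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+\//) has_v4 = 1
        if (!has_v4) print $1
    }'
}

# Print the unique addresses that failed to bind with errno=99.
# Assumption: log lines look like the weka-agent examples in this guide.
failed_bind_addrs() {
    sed -n 's/.*Failed to bind on \([0-9.]*\):[0-9]*, errno=99.*/\1/p' | sort -u
}
```

For example, ip -br a | find_ib_missing_ipv4 prints nothing on a healthy node and prints the interface name (ibp26s0 in the case above) when its IPv4 address has dropped. Similarly, journalctl -u weka-agent --since "1 hour ago" | failed_bind_addrs lists the addresses the agent could not bind to. Neither filter changes anything on the node. The fix remains sudo netplan apply, followed by sudo weka local restart if needed.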