Restore missing IPv4 on IB interface and recover Weka

Last updated: September 12, 2025

Switch reboots or other network blips can cause the IPv4 address to drop off an IB interface. This guide explains how to restore it without rebooting the entire node.


Confirm the symptoms

If a customer notices before we do, the usual symptoms are df or ls type commands hanging, nodes failing to deploy in k8s (e.g. applied compute ray cluster), and so on.

Run ip -br a to see the list of interfaces:

sbroomhead@c3b5bb36-02:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
enp93s0f0        DOWN
enp90s0f0np0     UP             10.56.4.64/24 fe80::a288:c2ff:fe09:8044/64
enp93s0f1        DOWN
enp90s0f1np1     DOWN
ibp26s0          UP             fe80::a288:c203:a:4404/64   <-------
ibp44s0          UP             172.16.64.2/16 fe80::a288:c203:a:47bc/64
ibp64s0          UP             172.16.64.3/16 fe80::a288:c203:d:a30/64
ibp101s0         UP             172.16.64.4/16 fe80::a288:c203:a:435c/64
ibp156s0         UP             172.16.64.5/16 fe80::a288:c203:d:938/64
ibp173s0         UP             172.16.64.6/16 fe80::a288:c203:a:443c/64
ibp192s0         UP             172.16.64.7/16 fe80::a288:c203:a:2ba4/64
ibp227s0         UP             172.16.64.8/16 fe80::a288:c203:d:918/64
docker0          DOWN           172.17.0.1/16 fe80::42:cff:fe71:fb0/64
tailscale0       UNKNOWN        100.64.39.12/32 fd7a:115c:a1e0::e301:270e/128 fe80::eec6:f5f8:7c3:e7a4/64

If you see the symptoms above and an IB interface is missing its IPv4 address, like this one, the node is affected by this issue:

ibp26s0          UP             fe80::a288:c203:a:4404/64   <-------
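With many interfaces on a node, the check above can be scripted. The following is a minimal sketch, not an established tool: the function name ib_missing_ipv4 is made up for illustration, and it assumes IB interface names start with "ib" as they do in the listing above.

```shell
# Hypothetical helper: print the name of every InfiniBand interface that has
# no IPv4 address. Reads `ip -br a` style output on stdin, where each line is
# the interface name, its state, then zero or more addresses.
# Assumption: IB interface names start with "ib".
ib_missing_ipv4() {
    awk '
        $1 ~ /^ib/ {
            has_v4 = 0
            # Fields 3..NF are addresses; an IPv4 address looks like a.b.c.d/len.
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\//) has_v4 = 1
            }
            if (!has_v4) print $1
        }
    '
}
```

Usage: ip -br a | ib_missing_ipv4. No output means every IB interface has an IPv4 address.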

If the Weka agent has been restarted at any point while the IPv4 address was missing, you'll likely see errors like this in the weka-agent logs (systemctl status weka-agent):

Sep 11 13:48:03 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
Sep 11 13:48:08 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
Sep 11 13:48:13 c3b5bb36-02.cloud.together.ai wekanode[3161586]: Failed to bind on 172.16.64.1:14000, errno=99
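On Linux, errno 99 is EADDRNOTAVAIL ("Cannot assign requested address"), which fits an address that is no longer configured on any interface. To see which addresses the agent is failing to bind, the log lines can be reduced to a unique list. This is a rough sketch: the function name failed_bind_addresses is hypothetical, and it assumes the log lines have exactly the format shown above.

```shell
# Hypothetical helper: print each distinct address:port found in
# "Failed to bind on <addr>:<port>, errno=99" log lines read from stdin.
# Assumption: the message format matches the sample log lines above.
failed_bind_addresses() {
    sed -n 's/.*Failed to bind on \([0-9.]*:[0-9]*\), errno=99.*/\1/p' | sort -u
}
```

Usage, assuming the agent logs to the systemd journal under the weka-agent unit: journalctl -u weka-agent --no-pager | failed_bind_addresses. The address printed should match the one missing from the interface list.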

To restore the IPv4 address to the interface, run sudo netplan apply. The gateway4 deprecation warnings in the output are warnings, not errors:

sbroomhead@c3b5bb36-02:~$ sudo netplan apply

** (generate:3239692): WARNING **: 17:13:34.451: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
WARNING:root:Cannot call Open vSwitch: ovsdb-server.service is not running.

** (process:3239689): WARNING **: 17:15:04.680: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.

** (process:3239689): WARNING **: 17:15:05.145: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.

** (process:3239689): WARNING **: 17:15:05.145: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.

If df commands still hang and/or weka local ps says initializing after running sudo netplan apply, run sudo weka local restart:

sbroomhead@c3b5bb36-02:~$ sudo weka local restart
Restarting weka on this host
Container "client" is RUNNING (pid = 3240684)
client: Allocated network device "ibp26s0" (with identifier "0000:1a:00.0") to slots [1] on "cabb7117-54.cloud.together.ai":"client" (1/8)
client: Allocated network device "ibp44s0" (with identifier "0000:2c:00.0") to slots [2] on "cabb7117-54.cloud.together.ai":"client" (2/8)
client: Allocated network device "ibp64s0" (with identifier "0000:40:00.0") to slots [3] on "cabb7117-54.cloud.together.ai":"client" (3/8)
client: Allocated network device "ibp101s0" (with identifier "0000:65:00.0") to slots [4] on "cabb7117-54.cloud.together.ai":"client" (4/8)
client: Allocated network device "ibp156s0" (with identifier "0000:9c:00.0") to slots [5] on "cabb7117-54.cloud.together.ai":"client" (5/8)
client: Allocated network device "ibp173s0" (with identifier "0000:ad:00.0") to slots [6] on "cabb7117-54.cloud.together.ai":"client" (6/8)
client: Allocated network device "ibp192s0" (with identifier "0000:c0:00.0") to slots [7] on "cabb7117-54.cloud.together.ai":"client" (7/8)
client: Allocated network device "ibp227s0" (with identifier "0000:e3:00.0") to slots [8] on "cabb7117-54.cloud.together.ai":"client" (8/8)
client: Allocated core 1 to slot 1 on "cabb7117-54.cloud.together.ai":"client" (1/8)
client: Allocated core 2 to slot 3 on "cabb7117-54.cloud.together.ai":"client" (2/8)
client: Allocated core 3 to slot 5 on "cabb7117-54.cloud.together.ai":"client" (3/8)
client: Allocated core 4 to slot 7 on "cabb7117-54.cloud.together.ai":"client" (4/8)
client: Allocated core 32 to slot 2 on "cabb7117-54.cloud.together.ai":"client" (5/8)
client: Allocated core 35 to slot 8 on "cabb7117-54.cloud.together.ai":"client" (6/8)
client: Allocated core 34 to slot 6 on "cabb7117-54.cloud.together.ai":"client" (7/8)
client: Allocated core 33 to slot 4 on "cabb7117-54.cloud.together.ai":"client" (8/8)
client: Starting hugepages allocation for "cabb7117-54.cloud.together.ai":"client"
client: Allocated 11264MB hugepages memory from 2 NUMA nodes for "cabb7117-54.cloud.together.ai":"client"
client: Bandwidth of "cabb7117-54.cloud.together.ai":"client" set to unlimited
client: WekaFS driver attached by "NodeId<34980>" on "cabb7117-54.cloud.together.ai":"client"
Container "client" is ready (pid = 3240684)

Confirm the IPv4 address is back and commands like df don't hang:

sbroomhead@c3b5bb36-02:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
enp93s0f0        DOWN
enp90s0f0np0     UP             10.56.4.64/24 fe80::a288:c2ff:fe09:8044/64
enp93s0f1        DOWN
enp90s0f1np1     DOWN
ibp26s0          UP             172.16.64.1/16 fe80::a288:c203:a:4404/64
ibp44s0          UP             172.16.64.2/16 fe80::a288:c203:a:47bc/64
ibp64s0          UP             172.16.64.3/16 fe80::a288:c203:d:a30/64
ibp101s0         UP             172.16.64.4/16 fe80::a288:c203:a:435c/64
ibp156s0         UP             172.16.64.5/16 fe80::a288:c203:d:938/64
ibp173s0         UP             172.16.64.6/16 fe80::a288:c203:a:443c/64
ibp192s0         UP             172.16.64.7/16 fe80::a288:c203:a:2ba4/64
ibp227s0         UP             172.16.64.8/16 fe80::a288:c203:d:918/64
docker0          DOWN           172.17.0.1/16 fe80::42:cff:fe71:fb0/64
tailscale0       UNKNOWN        100.64.39.12/32 fd7a:115c:a1e0::e301:270e/128 fe80::eec6:f5f8:7c3:e7a4/64

sbroomhead@c3b5bb36-02:~$ df -h /data
Filesystem                   Size  Used Avail Use% Mounted on
172.16.201.19/c3b5bb36_data   91T   80T   12T  88% /data
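Running df directly on a node that is still broken can hang your shell. One way to avoid that is to run the check in the background and poll it for a few seconds. The sketch below is an illustration only: the function name finishes_within is made up, and a command that never finishes is left running in the background rather than killed, since a process blocked on a hung mount may not respond to signals.

```shell
# Hypothetical helper: run a command in the background and report whether it
# finished within a time limit.
# Usage: finishes_within SECONDS COMMAND [ARGS...]
# Returns 0 if the command exited within SECONDS, 1 if it is still running.
finishes_within() {
    local limit="$1"
    shift
    # Discard output; only whether the command returns matters here.
    "$@" >/dev/null 2>&1 &
    local pid=$!
    local waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Usage: finishes_within 10 df -h /data || echo "df is still hanging". The function reports only whether the command returned in time, not its exit status.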