Redundant load balancing for outgoing traffic on OpenBSD with pf and ifstated

Most people use ifstated for failover between two firewalls; this isn’t about that. I have a network connected to two different ISPs: two external interfaces, two WAN connections. I do some simple load balancing of the outgoing traffic between those interfaces, using ifstated to “disable” one of them when its ISP goes down and enable it again when it’s back up. There is also some NAT and redirection of incoming traffic, like mail and HTTP, towards the proper server on the inside.

I’ve set up ifstated to periodically ping the gateway and a far-away-but-always-online control host on each of the interfaces. If either the gateway or the control host doesn’t respond, ifstated deletes the routes for that interface, so packets no longer travel through it unless specifically instructed to. When both the gateway and the control host respond to pings again, the routes are put back.

Load balancing is handled by the OS, through equal-cost multipath routing, explained in the OpenBSD FAQ. Short version, add two default routes:

route add -mpath default 10.100.100.193
route add -mpath default 10.200.200.65

Then delete /etc/mygate and enable the net.inet.ip.multipath sysctl. Check the FAQ for details.
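As a rough sketch of those steps, the function below only prints the commands for review instead of running them. The helper name mpath_cmds is made up for this example, and appending the sysctl to /etc/sysctl.conf to keep it across reboots is an assumption about the usual approach; the FAQ remains the reference.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands for a two-gateway
# multipath setup, so they can be reviewed before pasting into a root shell.
mpath_cmds() {
    gw1=$1
    gw2=$2
    # no single default gateway any more
    echo "rm /etc/mygate"
    # enable equal-cost multipath now ...
    echo "sysctl net.inet.ip.multipath=1"
    # ... and (assumption) keep it enabled across reboots
    echo "echo 'net.inet.ip.multipath=1' >> /etc/sysctl.conf"
    # one default route per ISP gateway
    echo "route add -mpath default $gw1"
    echo "route add -mpath default $gw2"
}

mpath_cmds 10.100.100.193 10.200.200.65
```

Printing rather than executing keeps the example harmless on a machine that is not the firewall.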

First, the config files. I’ve put all of them in a directory that I created, /etc/netconf; /etc/pf.conf and /etc/ifstated.conf are symlinks to the files in that directory. That has no other purpose than to keep everything in one place. One file, ifdef.conf, contains the macros common to pf.conf and ifstated.conf. It’s included in pf.conf, but including it in ifstated.conf doesn’t work, so ifstated runs shell scripts that execute the necessary commands, and every one of those scripts sources that file. This way, if I need to change something, a NIC for example, that one file is all I need to edit.
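The sharing trick works because a line like name="value" is valid both as a pf macro and as a shell assignment. A minimal sketch, using a temporary file as a stand-in for /etc/netconf/ifdef.conf so it can run anywhere:

```shell
#!/bin/sh
# Sketch: a file of plain name="value" lines can be read by pf (as macros)
# and by sh (as variable assignments). The temp file stands in for
# /etc/netconf/ifdef.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ext1_if="re0"
ext1_gw="10.100.100.193"
EOF

# "." (dot) reads the file into the current shell, defining the variables
. "$conf"
rm -f "$conf"

echo "interface $ext1_if uses gateway $ext1_gw"
```

One thing to keep in mind: sh does not accept spaces around the "=", so the shared file has to stick to the name="value" form even though pf itself is more forgiving.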

Here they are:

ifdef.conf:

# ifdef.conf
#
# this file contains macro definitions common between
# pf.conf and ifstated.conf&co

ext1_if="re0"
ext1_nat="10.100.100.207"
ext1_gw="10.100.100.193"
ext2_if="re1"
ext2_nat="10.200.200.86"
ext2_gw="10.200.200.65"
int_if="bge0"
dmz_if="sk0"
# used by ifstated to check for connectivity beyond gateway
control_host="`dig +short google.com | grep "^[0-9]" | head -1`"
# this is where the check command will log to, see ext*_chk.sh
chk1_log="/tmp/ext1_chk.log"
chk2_log="/tmp/ext2_chk.log"

Note the control_host line. That variable is only used in shell scripts executed by ifstated, where shell commands work. It also assumes a working DNS server is available; if the lookup fails, the check fails and the interface is set to inactive. If the variable ever gets used in pf.conf (for example in a rule that makes sure the path to control_host is always available), or if relying on a DNS server that might fail doesn’t sound like a good idea, change it to the static IP of something that will likely always be up, like a root DNS server or www.isc.org. Even if the control host goes down, only one interface will be deactivated. The other one stays up, because it isn’t checked again until the deactivated interface is marked as active.
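A middle ground between the DNS lookup and a hard-coded address is to fall back to a static IP only when the lookup returns nothing. The sketch below is not part of the original setup: pick_control_host is a made-up helper, and 192.0.2.1 is a documentation placeholder, not a real control host.

```shell
#!/bin/sh
# Hypothetical helper: read resolver output on stdin and print the first
# line that starts with a digit (same filter as in ifdef.conf), or the
# fallback address given as $1 if there is no such line.
pick_control_host() {
    fallback=$1
    ip=$(grep '^[0-9]' | head -1)
    if [ -n "$ip" ]; then
        echo "$ip"
    else
        echo "$fallback"
    fi
}

# In a check script this could be fed from dig, for example:
#   control_host=$(dig +short google.com | pick_control_host 192.0.2.1)
# Simulated resolver output, so the example runs without DNS:
printf 'example.net.\n198.51.100.7\n' | pick_control_host 192.0.2.1
# Simulated DNS failure (empty output) falls back to the static address:
printf '' | pick_control_host 192.0.2.1
```

The function would have to live in the check scripts rather than in ifdef.conf, since pf.conf includes that file and would not understand a shell function.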

pf.conf:

# pf.conf
#
# See pf.conf(5) for syntax and examples.
# Remember to set net.inet.ip.forwarding=1 and/or net.inet6.ip6.forwarding=1
# in /etc/sysctl.conf if packets are to be forwarded between interfaces.

### MACROS

# INTERFACES
include "/etc/netconf/ifdef.conf"

# ports open to the outside on this host (the firewall)
fw_tcp_ports = "{ 22 }"
# the range is for traceroute
fw_udp_ports = "{ 33433:33626 }"
# what kind of ICMP is allowed in. everything is allowed out.
# echoreq - RFC 792
# unreach - RFC 1122
# trace - might as well
icmp_types = "{ echoreq, unreach, trace }"

# SIGNIFICANT HOSTS
# webserver
www_in=172.16.1.2
www_out=10.100.100.200
www_tcp_ports="{ http, https }"
#www_udp_ports=" { } "
# mail
email_in=172.16.1.3
email_out=10.100.100.201
email_tcp_ports=" { domain, smtp, smtps, pop3, pop3s, imap, imaps, https } "
email_udp_ports=" { domain } "

### TABLES
table <rfc1918> const { 192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8 }
table <NoRouteIPs> const { 127.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12, \
                          10.0.0.0/8, 169.254.0.0/16, 192.0.2.0/24, \
                          0.0.0.0/8, 240.0.0.0/4 }
# hosts that are allowed unrestricted access to the internet
table <nated> { \
    192.168.1.10    \ # inside_server1
    192.168.1.20    \ # inside_server2
    192.168.1.30    \ # IP_phone
    192.168.1.35    \ # some_guy
    192.168.10.0/24 \ # these guys all get NAT
    172.16.1.3      \ # mail
    172.16.1.2      \ # www
    } 

# IPs in this table have attempted bruteforce and will be blocked
# from accessing anything
table <brutes> persist

### OPTIONS
set skip on lo
# should it drop by default, or be polite about it?
set block-policy return

### NAT
#
match out log on $ext1_if from <nated> nat-to $ext1_nat
match out log on $ext2_if from <nated> nat-to $ext2_nat
#
# redirects
# www
match in log on $ext1_if proto tcp to $www_out port $www_tcp_ports rdr-to $www_in
# email
match in log on $ext1_if proto tcp to $email_out port $email_tcp_ports rdr-to $email_in
match in log on $ext1_if proto udp to $email_out port $email_udp_ports rdr-to $email_in

### FILTER RULES
#
# block all IPV6, we don't have that.
# rules will need to be written for it, if ever
block log quick inet6
# these should never be routed to the internet, whatever the following rules
block drop in log quick on egress from <NoRouteIPs> to any
block drop out log quick on egress from any to <NoRouteIPs>
# bruteforcers. entries expire in 1 hour, see root's crontab
# alternative: expiretables
block drop log quick from <brutes>

# block incoming by default
block in log on egress from any to any

# we allow everything out on egress, but
# the label on these rules will be used by ifstated 
# to clear states on a certain interface when it goes down
pass out log on $ext1_if label "out_$if"
pass out log on $ext2_if label "out_$if"

# allow traffic to this host
pass in log on egress proto tcp from any to (egress) port $fw_tcp_ports
pass in log on egress proto udp from any to (egress) port $fw_udp_ports
# allow certain ICMP
pass in log inet proto icmp all icmp-type $icmp_types

# redirects. the second rule is needed if traffic on inside interfaces is blocked by default
# web
pass in on $ext1_if proto tcp to $www_in port $www_tcp_ports reply-to ($ext1_if $ext1_gw)
##pass in on $dmz_if proto tcp to $www_in port $www_tcp_ports
# email
pass in on $ext1_if proto tcp to $email_in port $email_tcp_ports reply-to ($ext1_if $ext1_gw)
##pass in on $dmz_if proto tcp to $email_in port $email_tcp_ports
pass in on $ext1_if proto udp to $email_in port $email_udp_ports reply-to ($ext1_if $ext1_gw)
##pass in on $dmz_if proto udp to $email_in port $email_udp_ports

# SSH bruteforce protection
# maximum 15 concurrent connections
# maximum 10 connections in 5 seconds from the same IP
pass on egress inet proto tcp from any to (egress) port 22 \
        flags S/SA keep state \
    (max-src-conn 15, max-src-conn-rate 10/5, \
         overload <brutes> flush global)

# By default, do not permit remote connections to X11
block in log on ! lo0 proto tcp to port 6000:6010

Although one of the interfaces is named $dmz_if, no rule here actually enforces a DMZ. One would need to block all traffic from the DMZ to the internal network and then allow only what is really, really necessary.

As an example, the mail server is also a DNS server, which is why its port lists include domain.

Traceroute uses UDP packets destined for the port range in fw_udp_ports (except on Windows). I opened it up just because I don’t like it when a trace ends in oblivion even though ping works, and I’m not aware of any good reason to block it. ICMP type unreach can be further restricted to a smaller set of codes; see ‘man icmp’.

Egress is the interface or group of interfaces that have a default route assigned to them. Basically, the external interfaces, in our case both $ext1_if and $ext2_if.

There are two lines that do the NAT-ing, one for each interface. I used to have a table called <natTo> and this line instead:

match out log on egress from <nated> to any nat-to <natTo>

My idea was that when an interface goes down I would delete its route and remove its IP from <natTo> so that packets don’t get translated to it. That was a mistake. First, NAT is done after routing, and since the OS, not pf, does the load balancing, pf would change the IP of packets destined for $ext1_if into $ext2_nat. Connections would send a query on one interface and get the answer on the other, so I needed extra rules to route those properly. Second, it didn’t work with ifstated: packets would end up getting a “no destination to host” error when an interface went down. With two separate rules, each interface only ever translates to its own address, and since nothing is routed to the inactive interface, its address is never used. All is good.

The <brutes> table is used for hosts that try to bruteforce ssh. It can obviously be used for other connections too, but with different parameters: a limit of ten connections in five seconds might not be such a good idea for some protocols. To expire the entries in this table I use this line in root’s crontab:

*/10    *       *       *       *       pfctl -t brutes -T expire 3600 >/dev/null 2>&1

Every ten minutes it will delete entries older than one hour. Adjust accordingly.

Note these lines:

pass out log on $ext1_if label "out_$if"
pass out log on $ext2_if label "out_$if"

These two rules aren’t needed for filtering, since all outgoing traffic is allowed anyway, but they mark each connection with a label so that ifstated can clear its states when the interface is disabled. pf expands $if to the name of the interface the rule applies to, so the label for $ext1_if is out_re0.
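The label has to be rebuilt by hand on the shell side, because $if only means something inside pf.conf. A small sketch of how the two ends line up; label_for and kill_states_cmd are made-up helper names, and the pfctl command is printed rather than run:

```shell
#!/bin/sh
# Sketch: build the label that pf derives from "out_$if" for a given
# interface, and print the matching state-kill command without running it.
label_for() {
    echo "out_$1"
}

kill_states_cmd() {
    echo "pfctl -k label -k \"$(label_for "$1")\""
}

kill_states_cmd re0
```

In the real scripts the interface name comes from $ext1_if or $ext2_if in ifdef.conf, so the label follows automatically if a NIC is swapped.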

One important statement is reply-to. At first I used something like this to manage redirects to inside hosts:

pass in on $ext1_if proto tcp to $www_out port $www_tcp_ports rdr-to $www_in

The result was that connections would start on one interface, $ext1_if, and end up going out on the other, with the IP of $ext1_if. When this happened, performance on such connections was so bad that OpenVPN was pretty much unusable.

The pf.conf man page has this to say about reply-to:

reply-to - The reply-to option is similar to route-to, but routes packets that pass in the opposite direction (replies) to the specified interface. Opposite direction is only defined in the context of a state entry, and reply-to is useful only in rules that create state. It can be used on systems with multiple external connections to route all outgoing packets of a connection through the interface the incoming connection arrived through (symmetric routing enforcement).

which is exactly what I needed. So the above line became:

match in on $ext1_if proto tcp to $www_out port $www_tcp_ports rdr-to $www_in

[... snip ...]

pass in on $ext1_if proto tcp to $www_in port $www_tcp_ports reply-to ($ext1_if $ext1_gw)

Notice that in this case the “match” rule changes packets before they get to the “pass” rule, so “pass” contains the internal IP of the server, $www_in, which is 172.16.1.2. That can also be written as

pass in on $ext1_if proto tcp to $www_out port $www_tcp_ports rdr-to $www_in reply-to ($ext1_if $ext1_gw)

ifstated.conf:

# ifstated.conf

### Global Configuration
init-state initiate

### Macros
ext1_up = "re0.link.up"
ext2_up = "re1.link.up"

ext1_chk = "'/etc/netconf/ext1_chk.sh' every 60"
ext2_chk = "'/etc/netconf/ext2_chk.sh' every 60"

### State Definitions

state initiate {
    init {
        if $ext1_up {
            if $ext1_chk {
                run "/etc/netconf/ext1_ok.sh"
            }
            if !$ext1_chk {
                run "/etc/netconf/ext1_nok.sh"
            }
        }
        if !$ext1_up {
            run "/etc/netconf/ext1_nok.sh"
        }

        if $ext2_up {
            if $ext2_chk {
                run "/etc/netconf/ext2_ok.sh"
            }
            if !$ext2_chk {
                run "/etc/netconf/ext2_nok.sh"
            }
        }
        if !$ext2_up {
            run "/etc/netconf/ext2_nok.sh"
        }
    }
    set-state main_loop
}

state main_loop {
    if $ext1_up {
        if !$ext1_chk {
            set-state ext1_nok
        }
    }
    if !$ext1_up {
        set-state ext1_nok
    }

    if $ext2_up {
        if !$ext2_chk {
            set-state ext2_nok
        }
    }
    if !$ext2_up {
        set-state ext2_nok
    }

}

state ext1_nok {
    init {
        run "/etc/netconf/ext1_nok.sh"
    }
    if $ext1_chk {
        run "/etc/netconf/ext1_ok.sh"
        set-state main_loop
    }
}
        
state ext2_nok {
    init {
        run "/etc/netconf/ext2_nok.sh"
    }
    if $ext2_chk {
        run "/etc/netconf/ext2_ok.sh"
        set-state main_loop
    }
}

ext1_chk.sh and ext2_chk.sh are the scripts that test whether the connection is up. They run every 60 seconds. ext1_nok.sh or ext2_nok.sh runs when the ISP on the respective interface fails in some way; ifstated then keeps checking every minute whether that ISP is back, ignoring the other interface meanwhile. ext1_ok.sh or ext2_ok.sh runs when the ISP works again, then ifstated goes back to main_loop, running both test scripts. The firewall will never be left with no default route, since only one interface is brought down at a time. If both go down, it deletes one of the routes, while new connections to the outside try to go through the other one and simply fail as expected, so it shouldn’t break anything in strange ways.

Use ifstated -dvv to see it in action.
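To make the transitions easier to follow, here is a toy model of the state machine as a pure shell function. It is only a sketch: next_state is a made-up name, link state and ping check are folded into a single ok/fail value per link, and when both checks fail in main_loop the sketch arbitrarily picks ext1_nok, which is an assumption and may not match ifstated’s actual evaluation order.

```shell
#!/bin/sh
# Toy model of the ifstated.conf transitions.
# Arguments: current state, then "ok" or "fail" for link 1 and link 2.
# Prints the next state.
next_state() {
    state=$1 chk1=$2 chk2=$3
    case $state in
    main_loop)
        # assumption: link 1 wins if both checks fail at once
        if [ "$chk1" != ok ]; then echo ext1_nok
        elif [ "$chk2" != ok ]; then echo ext2_nok
        else echo main_loop
        fi ;;
    ext1_nok)
        # only link 1 is re-tested here; link 2's result is ignored
        if [ "$chk1" = ok ]; then echo main_loop; else echo ext1_nok; fi ;;
    ext2_nok)
        # only link 2 is re-tested here; link 1's result is ignored
        if [ "$chk2" = ok ]; then echo main_loop; else echo ext2_nok; fi ;;
    esac
}

# Walk through a sequence of check results ("link1 link2").
s=main_loop
for obs in "ok ok" "fail ok" "fail fail" "ok fail" "ok fail" "ok ok"; do
    # $obs is left unquoted on purpose so it splits into two arguments
    s=$(next_state "$s" $obs)
    echo "$obs -> $s"
done
```

The walk shows the property described above: while the machine sits in ext1_nok, a failure on link 2 goes unnoticed until link 1 recovers and main_loop runs both checks again.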

ext*_chk.sh:

#!/bin/sh
# ext1_chk.sh

. /etc/netconf/ifdef.conf

# ping params:
# c - how many pings
# i - seconds between pings
# w - seconds to wait for answer
# I - make sure we use the proper interface

# test and log to $chk1_log, will be read by up/down scripts
rm -f $chk1_log ;\
    umask 077 ;\
    touch $chk1_log ;\
    ping -c 3 -i 3 -w 3 -I $ext1_nat $ext1_gw  >> $chk1_log \
 && ping -c 3 -i 3 -w 3 -I $ext1_nat $control_host >> $chk1_log

It pings the gateway and then the control host, three times each. The -I parameter makes sure the proper interface is used. Each ping sends 3 packets and waits 3 seconds for a reply, so the control host test alone takes about 9 seconds when the host can’t be reached and ping times out, and the gateway ping before it adds a few seconds more. If the check is still running when ifstated wants to start it again, it gets killed before it can report a failure, so the test never fails. Do not use an interval shorter than that in ifstated.conf, and allow more if the script also needs to resolve the control host’s IP. Basically, make sure checks never overlap.
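The timing can be worked out with a little arithmetic. The sketch below assumes that -w counts from the last request sent, so one ping run takes at most (count - 1) * interval + wait seconds; ping_worst is a made-up helper name. Doubling that figure for the two chained pings gives a deliberately conservative upper bound.

```shell
#!/bin/sh
# Worst-case run time of one ping invocation, assuming -w counts from the
# last request sent: (count - 1) * interval + wait.
# Arguments: count, interval, wait (all in seconds).
ping_worst() {
    echo $(( ($1 - 1) * $2 + $3 ))
}

one=$(ping_worst 3 3 3)      # a single "ping -c 3 -i 3 -w 3"
both=$(( one * 2 ))          # gateway ping followed by control host ping
echo "one ping: ${one}s, both in sequence: at most ${both}s"
```

With the values from the script this stays well below the 60 seconds configured in ifstated.conf, which leaves room for a slow DNS lookup.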

On FreeBSD there is no ‘-w’ option as such; the closest one is ‘-W’, so check its man page and adjust if needed.

ext*_nok.sh:

#!/bin/sh
# ext1_nok.sh

. /etc/netconf/ifdef.conf

route flush -iface $ext1_if
pfctl -k label -k "out_$ext1_if"

(echo "interface state test result:" ; \
 cat $chk1_log ; \
 echo ; \
 echo "IPV4 route table:" ; \
 netstat -rn -f inet ) \
| mail -s "$ext1_if is OFFLINE" root

Runs when the interface loses connectivity. Flushes all routes on the interface and all pf states related to it. This is where the label in pf.conf is used. Then it sends a mail to root warning that the interface has been disabled. That part could be more informative.

ext*_ok.sh:

#!/bin/sh
# ext1_ok.sh

. /etc/netconf/ifdef.conf

sh /etc/netstart $ext1_if

(echo "interface state test result:" ; \
 cat $chk1_log ; \
 echo ; \
 echo "IPV4 route table:" ; \
 netstat -rn -f inet ) \
| mail -s "$ext1_if is online" root

Restarts the interface and sends email telling the admin that all is good with the world again. Since ext*_nok.sh is flushing all routes on that interface, /etc/netstart will set up any additional routes in /etc/hostname.$ext1_if.
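The ext1_* and ext2_* scripts differ only in the link number. This is not part of the setup described here, but if keeping two copies in sync becomes tedious, one option is a single script that takes the link number as an argument and looks up the matching variables. A minimal sketch; ext_var is a made-up helper, and the assignments stand in for a sourced ifdef.conf:

```shell
#!/bin/sh
# Sketch: look up the per-link variable for link number $1 and suffix $2,
# e.g. "ext_var 2 gw" prints the value of $ext2_gw.
# These values stand in for ". /etc/netconf/ifdef.conf".
ext1_if="re0"
ext2_if="re1"
ext1_gw="10.100.100.193"
ext2_gw="10.200.200.65"

ext_var() {
    # eval builds the variable name at run time; pass trusted arguments only
    eval "echo \"\$ext${1}_${2}\""
}

n=2    # a shared script could take this from its first argument
echo "link $n: interface $(ext_var "$n" if), gateway $(ext_var "$n" gw)"
```

Since the arguments go through eval, this only makes sense when they come from ifstated.conf itself and never from untrusted input.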

To enable ifstated on boot:

echo 'ifstated_flags=""' >> /etc/rc.conf.local