blog.vyos.net

VyOS 1.2.0-rc2 is available for download, with fixes to wireguard and PBR

The second release candiate is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc2

We are happy to see so many people test the release candidates! Some bugs were already found and fixed, and we are working on some more bugs found since the release of 1.2.0-rc1. To make already completed fixed available, we are making the second release candidate.

Resolved issues

Wireguard module not loading (T881).
PBR routes leaking into other tables (T882).
Unhandled exception in the wireguard op mode (T883).

Known issues

Fail to add an OpenVPN to a bridge group if cost is not specified (T884, let us know if you also see it).
commit-confirm doesn't cancel reboot properly (T870).
The GPG key for release builds is not included in the image

Stay tuned for the rc3!

First VyOS 1.2.0 release candidate is available for download

This month, the VyOS project turns five years old. In these five years, VyOS has been through highs and lows, up to speculation that the project is dead. Past year has been full of good focused work by the core team and community contributors, but the only way to make use of that work was to use nightly builds, and nightly builds are like a ~~chocolate box~~ a box of WWI era shells—you never know if it blows up when handled or not. Now the codebase has stabilized, and we are ready to present a release candidate. While it has some rough edges, a number of people, including us, are already using recent builds of VyOS 1.2.0 in production, and now it's time to make it public.

VyOS 1.2.0-rc1 is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc1

VyOS 1.2.0 (Crux) is the feature expansion release based on Debian Jessie. The release candidate will be the basis for the future long term support release. You can read the full release notes at https://wiki.vyos.net/wiki/1.2.0/release_notes

New features include:

Wireguard support
PPPoE server
mDNS repeater and broadcast relay
Support for IPv6 VRRP and unicast VRRP operation
NPTv6
Standards-compliant QinQ ethertype option
Python APIs for accessing the running config and writing migration scripts (replacements of the Perl Vyatta::Config and XorpConfigParser)
New XML-based command definitions
New build system that makes it easy to create custom builds with additional repositories and packages
SR-IOV support for Intel and Mellanox cards

The following features have been removed:

Telnet server
p2p filtering

While the base system if Debian Jessie, multiple packages have been updated to much newer versions, for example, the 4.14.65 kernel, StrongSWAN 5.6, and keepalived 2.0.5.

Additionally, our old Quagga has been replaced with FRR, which opens a way to adding support for many more protocols, including multicast routing.

Known issues

Some people reported issues with DMVPN in hub mode (T848).

Some people report an issue with routers responding to all ARP requests when VTI is enabled (T852).

If you use DMVPN or VTI, you may either help with testing and debugging those issues, or wait until the issues are confirmed to be resolved.

What's next

VyOS 1.2.0 will become the LTS release after one or more release candidates.

We are preparing a release model change that will involve splitting VyOS into an LTS branch a (roughly) monthly rolling release made from the latest code from the current branch. Both branches will be entirely open source, but while the rolling release builds will be available free of charge to everyone, the LTS ISO image builds will be only available to those who either contribute to VyOS (code, documentation, and community activities all count) or purchase a subscription. There will always be an option to build the LTS image entirely from source or using package repositories at dev.packages.vyos.net, though commercial support will only be provided for official builds, or by special arrangement.

We are also working on new commercial support plans and pricing models.

The current branch will now be used for developing 1.3.0. Top priorities for 1.3.0 include migration to the next Debian release and rewriting more legacy code to enable better testing and easier addition of new features.

In a sense, VyOS 1.2.0 was a test whether the project can exist independently or not. While 1.1.x was an incremental expansion of the last Vyatta Core release, development of 1.2.0 coincided with mainstream Linux distributions switching to systemd, many packages such as StrongSWAN making big incompatible changes, and parts of VyOS itself reaching the point when bugs could no longer be fixed without a complete rewrite. The build system also had to be rewritten from scratch.

A lot of work went into developing the new infrastructure for Python rewrites, including the new system of command definitions and required libraries. By now a a few components including SSH, SNMP, cron, and DNS forwarding have been rewritten in the new way, and the rewrite movement is gaining momentum.

Let's test and polish the 1.2.0 release, and keep working on making VyOS a better, more easily maintainable platform in the future 1.3.0 release.

VyOS development news in August and September

Most importantly: all but one blockers for the 1.2.0 release candidate are now resolved. Quite obviously, for the release candidate, we want all features that worked in 1.1.8 to work fully.

New release naming scheme

While we are at it, I'd like to announce a small cosmetic change. Until now, our release branches were named after chemical elements. This naming scheme is getting a bit too common though (OpenDayLight is a well known example, but there are more), we decided to change it to something else to avoid confusion and be a bit more original.

The new branch theme is constellations sorted by area (in square degrees), from the smallest to the largest. The 1.2.0 release will be named Crux. Crux, also known as the Southern Cross, is a small but bright and iconic constellation that is depicted on flags of many countries of the southern hemisphere, such as Australia and New Zealand.

The 1.3.0 release will be named Equuleus, which is the latin for little horse (no relation to My Little Pony).

Migration to FRR from Quagga

We have resolved most of the migration problems and latest nightly builds already use FRR instead of our aged Quagga.

It will open a path to implementing many new protocols and features, such as BFD, PIM-SM, and more. What kept us from migrating was lack of support for multiple routing tables, which we need for PBR. FRR added it recently, and by now the last known issue that blocked migration (routes from the default table unintentionally leaking into non-default tables) has been resolved, so we finally can migrate without losing any features.

While I do feel somewhat uneasy about licensing of certain daemons, that are included in the source tree but use a permissive open source license even though they are linked against GPL libraries, we do not believe there's a GPL violation in it as long as the license of the binary package is GPL. Not sharing a modified source code of those daemons with users of the binary package would be a GPL violation, but we keep all source code of every VyOS component public.

New BGP address-family syntax

This is still in the works, but it will make it to the nightly builds soon.

Originally, VyOS used to have IPv6-specific BGP options under "address-family ipv6-unicast", but IPv4 options were directly under neighbor. The historical reason is that originally IPv6 BGP was not supported at all. This syntax was rather inconsistent, and made it hard to quickly see which options are address family specific. We used to stick with that inconsistent syntax just because it was always done that way.

One behaviour change in FRR made us reconsider that. As you may know, in BGP, routing information exchange is completely orthogonal to the session transport: IPv4 routes can be exchanged over a TCP connection established between IPv4 addresses and vice versa. The default behaviour of most, if not all, BGP implementation is to enable both address families regardless of the session transport.

That behaviour can be changed by an option, in VyOS, that's "set protocols bgp ... parameters default no-ipv4-unicast". The old behaviour of Quagga was to apply that only to sessions whose transport is IPv6, which is just as inconsistent. FRR takes that option literally and disabled IPv4 route advertisments for all peers if it's active, unless peers are explicitly activated for the IPv4 address family.

Making VyOS play well with that development requires an option to do that, and "address-family ipv4-unicast" is an obvious candidate, but introducing a special case doesn't feel write. I think moving original options to that subtree is a cleaner solution. Yes, it does require reprogramming your fingers, but when we start adding support for more address families, the original syntax will only start looking even more like an atavism.

This is what the new syntax will look like:

dmbaturin@vyos# show protocols bgp 
 bgp 64444 {
     address-family {
         ipv4-unicast {
             network 192.168.2.0/24 {
             }
         }
     }
     neighbor 10.91.19.1 {
         address-family {
             ipv4-unicast {
                 allowas-in {
                     number 3
                 }
                 as-override
                 default-originate {
                     route-map Foo
                 }
                 maximum-prefix 50
                 route-map {
                     export Bar
                     import Baz
                 }
                 weight 10
             }
         }
         ebgp-multihop 255
         remote-as 64793
     }
}

Node renaming in migration scripts

Renaming nodes is a very common task in config syntax migration, but until now it could only be done very indirectly. The old XorpConfigParser simply could not separate names from values and renaming nodes was usually done by regex replace. In the new vyos.configtree you'd need to delete the old node and recreate it from scratch.

Until now. Lately we introduced a function that does it one step. If you, for whatever reason, wanted to rename "service ssh" subtree to "service secure-shell", you could do it like this:

with open("/config/config.boot") as f:
    config_text = f.read()
config = vyos.configtree.ConfigTree(config.text)

config.rename(["service", "ssh"], "secure-shell")

print(config.to_string())

One of the reason for introducing it is to make it easier to clean up the DHCP server syntax.

DHCP server rewrite

While we are waiting for the FRR fixes, we (Christian Poessinger and I mainly) decided to eliminate one more bit of the legacy code and give DHCP server scripts a rewrite. We also decided to clean up its syntax.

One of the things that always annoyed me was nested nodes for address ranges: "subnet 192.0.2.0/24 start 192.0.2.100 stop 192.0.2.100". Now start and stop will be different nodes, so that they are easy to change independently: "subnet 192.0.2.0/24 range Foo start 192.0.2.100; ... stop 192.0.2.200".

We will also rename the unwieldy "shared-network-name" to "pool". Operational mode commands always used the "pool" terminology, so it will also improve command consistency.

Wireguard support

Thanks to our contributor who goes by hagbard, VyOS now supports wireguard. The work on it is nearly complete, and will be covered in a separate post.

TFTP server support

Thanks to Christian Poessinger, VyOS now has TFTP server. It was a frequently requested feature, and I think it makes sense for people who keep DHCP on the router and do not want to setup another machine for provisioning phones, think clients and so on.

This is an example of TFTP server with all options set:

service {
 tftp-server {
     allow-upload
     directory /config/tftp
     listen-address 192.0.2.10
     port 69
 }
}

DMVPN works again

Thanks to our contributor Runar Borge, we have identified the cause and fixed the issues that broke DMVPN after upgrading to the latest upstream StrongSWAN. It should now work as expected.

L2TP/IPsec works again

One of the blockers introduced by upgrade to StrongSWAN 5.6 was broken L2TP/IPsec. We've adjusted the config to use the new syntax and now it works again.

More to come

We are actively working on getting the codebase ready for the release candidate. Stay tuned for new updates!

Build server maintenance

We are doing some maintenance on ci.vyos.net and the build hosts. They may be inaccessible for an hour or two. We will notify you if it takes longer than expected.

VyOS 1.2.0 development news in July

Despite the slow news season and the RAID incident that luckily slowed us down only for a couple of days, I think we've made good progress in July.

First, Kim Hagen got cloud-init to work, even though it didn't make it to the mainline image, and WAAgent required for Azure is not working yet. Some more work, and VyOS will get a much wider cloud platform support. He's also working on Wireguard integration and it's expected to be merged into current soon.

The new VRRP CLI and IPv6 support is another big change, but it's got its own blog post, so I won't stop there and cover things that did not get their own blog posts instead.

IPsec and VTI

While I regard VTI as the most leaky abstraction ever created and always suggest using honest GRE/IPsec instead, I know many people don't really have any choice because their partners or service providers are using it. In older StrongSWAN versions it used to just work.

Updating StrongSWAN to the latest version had an unforeseen and very unpleasant side effect: VTI tunnels stopped working. A workaround in form of "install_routes = no" in /etc/strongswan.d/charon.conf was discovered, but it has an equally bad side effect: site to site tunnels stop working when it's applied.

The root cause of the problem is that for VTI tunnels to work, their traffic selectors have to be set to 0.0.0.0/0 for traffic to match the tunnel, even though actual routing decision is made according to netfilter marks. Unless route insertion is disabled entirely, StrongSWAN thus mistakenly inserts a default route through the VTI peer address, which makes all traffic routed to nowhere.

This is a hard problem without a workaround that is easy and effective. It's an architectural problem in the new StrongSWAN, according to our investigation of its source code and its developer responses, there is simply no way to control route insertion per peer. One developer responded to it with "why, site to site and VTI tunnels are never used on the same machine anyway" — yeah, people are reporting bugs just out of curiosity.

While there is no clean solution within StrongSWAN, this definitely has been a blocker for the release candidate. Reimplementing route insertion with an up/down script proved to be a hard problem since there are lots of cases to handle and complete information about the intended SA may not always be available to scripts. Switching to another IKE implementation seems like an attractive option, but needs a serious evaluation of the alternatives, and a complete rewrite of the IPsec config scripts — which is planned, but will take a while because the legacy scripts is an unmaintainable mess.

I think I've found a workable (even if far from perfect workaround) — instead of inserting missing routes, delete the bad routes. I've made a test setup and it seems to work reasonably well. The obvious issue is that it doesn't prevent bad things from happening, but rather undoes the damage, so there may still be a brief traffic disruption when VTI tunnels go up. Another problem is a possible race condition between StrongSWAN inserting routes and the script deleting them, though I haven't seen it in practice yet and I hope it doesn't exist. But, at least you can now use both VTI and site to site tunnels on the same machine.

For people who want to use VTI exclusively, there is now "set vpn ipsec options disable-route-autoinstall" option that disables route insertion globally, thus removing the possible disruption, at cost of making site to site tunnels impossible to use. That option is disabled by default.

I hope it will be good enough until we find a better solution. Your testing is needed to confirm that it is!

On "run reset vrrp master"

I've just exorcised a ghost of the old VRRP CLI implementation — the "reset vrrp master" command. I thought it would go away with the vyatta-vrrp package, but in fact it was in vyatta-op. It made me remember that I was going to write about it in the original blog post, but somehow I forgot about it.

Here is why that command was not reimplemented. First, it never worked with preempt to begin with, and with preempt being the default, its usefulness was already limited.

A more serious reason, however, is that it was a rather horrible (even if ingenious) kludge. This is how it worked: first it tried to locate the VRRP group in keepalived.conf, then it would remove it from the config, restart keepalived, insert it again, and restart keepalived again. It sort of worked, but you can see how fragile this approach is. If anything at any stage would go wrong, it would leave VRRP in an inconsistent state.

A much cleaner and general way to do it is to just disable the VRRP group in conf mode (set high-availability vrrp group Foo disable) and commit the change.

New VRRP CLI is here (with IPv6 support)

Ever since I started with Vyatta, I've had a problem with commands for features unrelated to interfaces being defined inside interfaces. I'm sure the person who came up with that arrangement meant well and thought it would be familiar for Cisco and Juniper users, but the more I lived with it, the more I thought it creates more problems than it solves.

From the user perspective, it's hard to easily view the complete configuration of those features. It's also much harder to clone a feature config to another machine. And if you ever want to move some connection to a different NIC, things get even more fun.

For developers, however, it's even worse. First, it means commands for those features needs to be duplicated for every interface type, which makes adding new interfaces much harder. Second, configuration scripts end up more complex due to paths that can be nested quite deeply. Third, with the current config backend specifically, lack of nested end nodes can lead to very interesting tracking of the state to avoid repeated service restarts.

Until recently there was a token excuse for leaving unfortunate UI decisions alone — the difficulty of writing migration scripts. Luckily, it's no longer the case, so we can start cleaning it up. Ok, it is hard and you need to take care of many details, but at least you are not wrestling with a library that is simply inadequate for the task. Now we can go on a quest to remove excessive nesting and redesign the UI is an easier to use, more logical fashion.

VRRP looked a good feature to start the clean up with — we need to get IPv6 VRRP support to work in the end, its scripts have accumulated quite some cruft, and, well, it really has nothing to do with interface settings since it's a protocol of its own implemented by a userspace daemon.

Today I've rolled out the new implementation and it is already in the latest rolling release image, ready for your testing. Let's walk through the changes.

Making first boot scripts just got easier (but building vyos-1x got a bit harder)

As you probably know already, we are working on integrating cloud-init into VyOS, which will allow us to support multiple cloud platforms, and get rid of the custom script for EC2. The hard part of this project is that just allowing cloud-init to do what it normally does in Debian would not produce desired results, we need to make it modify the config.

This raises a question when this should occur and how it should be done. Since modifying running config with scripts has its difficulties in the current backend, and even if it didn't, it still could potentially clash with user's commits, we thought we may want to modify the config.boot file before it's loaded instead.

One advantage is that once we have common functionality implemented, it can be reused not only in cloud-init, but also in the installer, and in custom first boot scripts if someone wants them.

To test this concept, I've added a library names vyos.initialsetup that includes a collection of functions for common settings such as user passwords and keys, host name, default route, name servers, and interface addresses.

Here's an example of a script you can run on your system for demonstration (adjust user name and do ssh-keygen if necessary):

#!/usr/bin/env python3

import vyos.configtree as vct
import vyos.initialsetup as vis

with open('/opt/vyatta/etc/config.boot.default') as f:
    config_string = f.read()

with open('/home/dmbaturin/.ssh/id_rsa.pub') as f:
    key_string = f.read()

config = vct.ConfigTree(config_string)

vis.set_user_password(config, 'vyos', 'qwerty')
vis.set_user_ssh_key(config, 'vyos', key_string)

# Default level is admin
vis.create_user(config, 'dmbaturin', password=None, key=key_string)

# Default type is ethernet
vis.set_interface_address(config, 'eth0', '192.0.2.10/24')

vis.set_default_gateway(config, '192.0.2.1')

vis.set_name_servers(config, ['203.0.113.10', '203.0.113.20'])

vis.set_host_name(config, 'vyos-test')

print(str(config))

The script will print a customized config based on the default config.

Building vyos-1x

This is the good thing. The bad, or rather somewhat inconvenient thing is that vyos-1x package build now depends on the libvyosconfig0 package that provides the library behind the vyos.configtree module, and it's essential for running unit tests for those modules.

You should add the "deb http://dev.packages.vyos.net/repositories/current/vyos/ current main" repository to the sources.list on your build machine and install libvyosconfig0 with APT, or simply take the file from the repo and install it by hand with dpkg.

I hope the increased reliability we gain from those unit test outweighs the inconvenience of additional setup.

Infrastructure failure resolved, downloads.vyos.io is back now

We have managed to resolve the infrastructure problems and bring the affected machines back intact. Now downloads.vyos.io and ci.vyos.net are back online, and our build hosts with all the build dependencies setup and uncommited code are back online too, so luckily we can resume the work from the point where it was stopped by the RAID issue.

We are not ready to give a complete post mortem on the actual issue yet (may not even be able to at all) since we have limited data and the service provider support was not exactly helpful. What we can say now is that what was thought to be RAID5 actually was a RAID1 with an additional hot spare drive, and the failure mode included a RAID controller glitch apart from drive failure. The hot spare drive is what allowed us to bring the data back intact.

While this issue was resolved without any data loss, it definitely prompts us to reconsider many things about our infrastructure, including backup strategy, deployment mechanisms, and service provider choice.

We are going to write new posts as we decide upon the options and roll out improvements. Infrastructure deployment is also a good area for contributions, and some people already offered help.

Infrastructure failure

Hi everyone,

We've had a catastrophic failure on one of our hosts: two drives in a RAID5 failed simultaneously. A number of VMs related to the development infrastructure are permanently lost and need to be restored since we didn't have backups of them.

The only piece public infrastructure affected by it is downloads.vyos.io. You can use the old packages.vyos.net server or one of the mirrors to download the stable (1.1.8) or historical VyOS releases.

The other pieces of infrastructure that need to be restored are the Jenkins server, the build machines, and the repositories server. We'll have to restore them before we can continue development. I hope we'll get it done by Monday.