VyOS 2.0 development digest #4: simple tasks for beginners, and the reasons to learn (and use) OCaml

Look, I lied again. This post is still not about the config and reference tree internals. People in the chat and elsewhere started showing some interest in learning OCaml and contributing to VyConf, and Hiroyuki Satou even already made a small pull request (it's regarding build docs rather than the code itself, but that's a good start and will help people with setting up the environment), so I decided to make a post to explain some details and address common concerns.

The easy tasks

There are a couple of tasks that can be done completely by analogy, so they are good for getting familiar with the code and making sure your build environment actually works.

The first one is about new properties of config tree nodes, "inactive" and "ephemeral", that will be used for JunOS-like activate/deactivate functionality, and for nodes that are meant to be temporary and won't make it to the saved config respectively.

The other one is about adding "hidden" and "secret" properties to the command definition schema and the reference tree, "hidden" is meant for hiding commands from completion (for feature toggles, or easter eggs ;), and "secret" is meant to hide sensitive data from unpriviliged users or for making public pastes.

Make sure you reference the task number in your commit description, as in "T225: add this and that" so that phabricator can automatically connect the commits with the task.

If you want to take them up and need any assistance, feel free to ask me in phabricator or elsewhere.

The reasons to learn (and use)

Main reasons are type system expressiveness and type safety. There is also convenience of the toolchain.

First fun things, is that OCaml is one of the few languages that can be compiled to native code and can be used interactively. It can also be compiled to bytecode. Interactive read-eval-print loop uses the bytecode implementation. Bytecode is also used to run OCaml on platforms that compiler doesn't target. It's also used by the js_of_ocaml tool that compiles OCaml to JS (and it's powerful enough to compile OCaml itself to JS, here you can try it out in your browser entirely on the client side: https://try.ocamlpro.com/

The default REPL (the "ocaml" executable) is pretty spartan (no history even), so I suggest that you install utop immediately, it makes a great REPL. First install opam, then run "opam install utop".

Now, some teasers about the type system. First, OCaml can umabiguously infer the types of expressions, including functions, through what's known as Hindley-Milner type inference, at compile time. You never have to enter type annotations (other than in module signatures), even though you can. It knows that 42 is an int, "true" is a bool, and "fun x → x mod 2 = 0" is a function of type "int → bool". It does come at cost, in particular, there is no ad-hoc polymorphism (aka function overloading). There are tools that let you view the inferred types of expressions in your editor: https://opam.ocaml.org/blog/turn-your-editor-into-an-ocaml-ide/

I'll give a seemingly trivial but demonstrative example, formatted output. We all know (or should know) that this C program segfaults:

#include <stdio.h>
int main(void) {
    printf("%d bottles of %s on the wall\n", "foo", 99);
}

In Python or Go this is a runtime error. OCaml can prevent this at compile time:

utop # Printf.printf "%d bottles of %s on the wall" "beer" 99;;
Error: This expression has type string but an expression was expected of type int

By partial application we can see the inferred type: from the format string it knows that the arguments should be int and string respectively (the mysterious "unit" type, you can think of it like "void" if you want, for now):

utop # Printf.printf "%d bottles of %s on the wall";;
- : int -> string -> unit 

While the implementation of familiar-looking printf with format strings is quite involved, strange-looking but equally type safe printf could be implemented in just a few lines, it's not some compiler magic.

Another serious reason is the way it handles the data. By default, all data is immutable. This means that when you bind say a list to a new name and start modifying it, the compiler will rearrange the pointers or copy just enough data for you, without you having to worry about making sure to copy the right things. The guarantee that nothing can be modified by other parts of the program allows the compiler to do this safely and efficiently. Only if you declare something as mutable, then you can destroy its original state.

Learn more

There are a bunch of resources you can use to learn the language:

Tutorials on ocaml.org are informative, but the structure may be a bit confusing for a newcomer.

This book is great, but not free (DRM-free PDF for $15). This would be my "editor's choice" though.

Real World OCaml is also great, but with one caveat: it uses an alternative standard library written by its authors (Core.Std) throughout, and that library is not compatible with the standard library that comes with the compiler. It's a bit like C++ with Boost imported into the main namespace: you are reading a book about C++ with Boost in the main namespace, rather than a book about C++/STL itself. I'm not saying Core.Std is bad, but I do think that one should learn about the standard library before using non-standard ones, and then decide if they want to use it. VyConf doesn't use that library.

Possible concerns

Difficulty of imperative programming

There is none. If your introduction to functional programming was in Haskell, you may remember that it's not easy to say insert a debug print or logging in the middle of the program. Well, OCaml doesn't have this issue, as it doesn't try to limit side effects in any way: if you want to modify a mutable reference, or add log or debug print, you can.

While the code people normally write is functional when they can and imperative only when they have to, the imperative "escape hatch" is always there. Also, OCaml uses strict rather than non-strict evaluation as opposed to Haskell, so you never have to use monads just to make sure things are executed in order. While a lot of libraries are monads, you rarely need more than one monad at a time (let's admit it, IO is the most used monad in Haskell programs), so you hardly ever end up with a pile of monad transformers.

Performance

Performance is a common concern, especially when it comes to unfamiliar languages, so I made up a microbenchmark that hopefully addesses it. For something more realistic, as you already know if you used "opam switch", OCaml compiler can bootstrap itself in a few minutes, and the binary it uses for it isn't even native code.

We all know that microbenchmarks are not the best performance indicators, but no performance discussion is complete without one. I selected a really dumb task solved in a really dumb way for this one: build a 10000000 long list of integers and sum it up. In real life anyone would use a stream for it, but the point is to demonstrate how it can handle large datastructures, rather than to sum up a bunch of number effectively. While the task itself is contrived, VyConf processes datastructure in pretty similar ways, so it may be a valid approach here.

The languages compared are OCaml, C++, and Python (one person suggested that I also add Go, so I did). In OCaml I used the single linked list type from the standard library, but reimplemented list folding myself because late at night I thought the List.fold_list is not tail recursive and would cause a stack overflow (actually, it is and I've done extra work). In C++ I used the std::list template and iterator. In Python, I used the standard list class. In Go, I used the "container/list" package from the standard library.

You can view the source code and the testing session protocol here. Here are aggregated resulsts:

Wall time (average of three runs), measured with the usual UNIX time utility:

OCaml: 0.76s
C++ (gcc default): 1.45s
C++ (gcc -O2): 0.62s
Python: 2.18s Go: 3.3s

Memory consumption measured with valgrind/massif (except for Go, which doesn't work with valgrind and required code modification to use its native pprof tool):

OCaml: 146MiB
C++: 230MiB
Python: 347MiB
Go: 190MiB

It should be noted that sum types are not objects, and a more "fair" comparison should have been done with a linked list implemented through OCaml objects, but they are expressive enough to let you have a linked list without objects, so...

Libraries

There aren't as many of them as for more popular languages, though not as bad as some make it to look. You can browse libraries available from the official OPAM repository here, the list is always growing.

You can also interface with C libraries without writing C code, through the ctypes library.

Limitations

There are some very genuine issues with OCaml (just like with any other language).

First, the standard library is quite... spartan. This is what spawned a few alternative "standard libraries" that haven't become really standard because they would break the old code. It's not like Python where you can find bindings for SQLite and a JSON parser right in the standard library. Installing libraries from OPAM is trivial so this isn't much of an issue, but you do need to make some decisions what to use.

Second, the garbage collector in mainline OCaml is single-threaded, and the runtime library has a global lock (Python users are familiar with GIL, this is the same thing). This means for effective parallel computations one needs to use subprocesses rather than threads (concurrency is a different thing). Runtimes without global locks do exist (for Python, OCaml, and Haskell alike), the problem is how to make them as fast as globcal lock runtimes for single-threaded applications, so they are not the default in any of those.

As I said, it also lacks ad-hoc polymorphism. It was sacrificed to make type inference decidable, but it can be annoying sometimes, for example, you cannot mix integers and floats in the same expression without explicit conversion. The alternative to is is "functors": modules that can be parameterized by other modules, which are powerful but not very concise or easy to write.

There is also no (easy) reflection, since it's normally compiled to native code and doesn't keep any type information in executables (apart from a special case called "polymorphic variants" that is used for implementing a form dynamic typing within a statically typed program). You can't address record fields or object members by names converted from strings, or iterate through them. Any metaprogramming can only happen at compile time (there are syntax extensions).