April 10, 2018
Prior to adding Python performance monitoring, we'd written monitoring agents for Ruby and Elixir. Our Ruby and Elixir agents duplicated much of their code, and we didn't want to add a third copy of the agent-plumbing code. The overlapping code included things like JSON payload format, SQL statement parsing, temporary data storage and compaction, and a number of internal business logic components.
This plumbing code is about 80% of the agent code! Only 20% is the actual instrumentation of application code.
So, starting with Python, our goal became "how do we prevent more duplication?" To do that, we decided to split the agent into two components: a language agent and a core agent. The language agent is the Python component, and the core agent is a standalone executable that contains most of the shared logic.
The executable we distribute will be running on servers we have no access to. It should never crash or affect other processes on the system. It's also more difficult to upgrade, since upgrading requires customer action.
The executable we wanted to build needed to be standalone. But if the language we pick allows linking as a Python, Ruby, or Elixir native extension, then we could shift even more of the common logic into shared code outside of the language agent. Most notably, we could reuse data-type definitions between the core agent and the language agent processes for communication.
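As a rough sketch, linking the core agent as a native extension would mean exposing parts of it over a C ABI that the language agents can call into. A minimal, hypothetical example follows; the function name and the `cdylib` setup are illustrative, not our actual interface:

```rust
// Hypothetical sketch: exposing a piece of core-agent logic over a C ABI.
// In a real build this crate would be compiled as a `cdylib`, and a Python,
// Ruby, or Elixir language agent would load it as a native extension.
// The name `core_agent_version_major` is made up for illustration.

/// Returns the core agent's major version to the embedding language agent.
#[no_mangle]
pub extern "C" fn core_agent_version_major() -> u32 {
    1
}

fn main() {
    // `extern "C"` functions are still callable from Rust directly.
    println!("core agent v{}", core_agent_version_major());
}
```

The `#[no_mangle]` attribute keeps the symbol name stable so a foreign runtime's FFI layer (e.g. Python's `ctypes`) can find it.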
This isn't something we needed for the first version, but we wanted to leave our options open.
Overhead in our agents is something we always care about. If the language we pick allows tight control over memory and CPU usage, so much the better. Every compiled language in the list gave enough flexibility on this point, so it wasn't a useful differentiator to help us pick.
Since the executable needed to be pre-built and downloadable as a binary on many host environments, we were immediately limited to compiled languages. That requirement quickly ruled out languages we were already familiar with: Ruby, Elixir, and Java. Their need for a runtime made distribution a difficult proposition.
When we investigated Crystal, it still felt very new. Basic libraries like networking and JSON were under active development. We didn't want to take on a language whose ecosystem was still relatively weak.
We look forward to what Crystal will become, but it didn't fit for our immediate use.
C is the most portable of the languages. It compiles on practically every system, and compatibility between operating systems is well documented (even if you have to handle it manually). But C is very hard to write reliably, at least for newcomers to the language. With manual memory management forced into the app, we'd likely be hunting segfaults for a long time.
C++ is a similar situation, but the language is famously more complex (for better and worse).
C or C++ would be a strong choice if we were primarily doing bit-wise manipulation of data, or work that is extremely performance-sensitive. Their low-level view of the system lets such code run with nothing between it and the machine.
Since we're not experts at C, nor do we want to learn to be for this project, we didn't follow this further.
Only one member of our team had any Haskell experience (me), and the language is a bit too esoteric to be easily picked up by the rest of the team to fix any issues that cropped up.
Go already has a foothold in our company, running most of the ingestion pipeline. It's been rock solid once it was deployed, even with our relative inexperience with the language.
The cross-compilation story for Go is amazing: it "just works" with a few compiler flags.
Idiomatic Go appears to be rather error resistant, but I found learning it to be tricky. The compiler isn't very aggressive at pushing me away from bad habits. Instead, I kept finding myself writing if err != nil blocks after many calls, and if I missed one, I'd be introducing a potentially app-crashing bug. This was due to my inexperience with the language rather than anything inherent in it, but I found it off-putting.
Packaging and vendoring third-party packages is still weak in Go, which I personally find surprising for such a popular language. We could certainly work around it with tools, but the lack of an officially blessed approach made me hesitate. From what I understand, there are proposals to fix this soon.
Rust has many of the benefits of the languages above, combined into a single language.
It isn't without its downsides though. The borrow checker took a while to get used to when writing it. Without the head-start I had with Haskell's type system, many of the idioms and structures of Rust code would have required more learning up front. Similarly, Rust's strictness forced me to re-learn the low level systems knowledge (stack vs heap, pointers) that I hadn't thought about since college.
After considering the above, we narrowed it down to Go and Rust. They were the best supported of the options, and both have a good community to lean on for questions.
We went with Rust (if you look closely, I gave it away in the title of this post). The biggest tie-breaker for me was the type system and the lack of nil. By avoiding nil, I'd avoid my most commonly written bug. The rest of the type system supported writing safer code as well. For instance, the code makes heavy use of newtypes around common data to prevent confusion.
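A minimal sketch of that newtype pattern, with illustrative names rather than our real types:

```rust
// Illustrative newtypes: wrapping raw types so the compiler stops us from
// mixing up values that happen to share a representation.
#[derive(Debug, Clone, PartialEq)]
struct RequestId(String);

#[derive(Debug, Clone, Copy, PartialEq)]
struct DurationMs(u64);

fn record_span(id: &RequestId, duration: DurationMs) -> String {
    // With plain `String`/`u64` arguments, swapping them would compile;
    // with newtypes, it is a type error caught before the code ever runs.
    format!("span {} took {}ms", id.0, duration.0)
}

fn main() {
    let id = RequestId("req-42".to_string());
    let out = record_span(&id, DurationMs(125));
    println!("{}", out); // → span req-42 took 125ms
}
```

The wrappers cost nothing at runtime; they only exist to make the type checker reject confused call sites.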
I think that Rust has been an excellent language to use.
The first version of our Python agent is in tech preview right now. It comprises about 5,000 lines of Rust and 1,600 lines of Python, written by me over the course of about three months, part-time, in between other small-company interruptions.
A few initial thoughts follow; I'll write another post diving deeper into how the language changes the style of the code.
Other than the annoyance of switching languages back and forth and hitting syntax differences, it was fairly easy to use both Python and Rust in parallel.
Both have solid collection manipulation libraries, easy to use JSON, and strong module systems. Keeping things organized was easy in both.
Python is more of a pure OO system, but Rust's structs + impl blocks end up feeling pretty close, with the code naturally ending up next to the data structures that it needs.
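A small sketch of that struct-plus-impl style; the `Payload` type and its methods are illustrative, not our actual agent code:

```rust
// Sketch of how a struct plus an impl block keeps behavior next to data,
// much like a class in an OO language. `Payload` is a made-up example.
#[derive(Debug)]
struct Payload {
    spans: Vec<String>,
}

impl Payload {
    fn new() -> Self {
        Payload { spans: Vec::new() }
    }

    // Methods live in the impl block, right beside the struct they act on.
    fn add_span(&mut self, name: &str) {
        self.spans.push(name.to_string());
    }

    fn len(&self) -> usize {
        self.spans.len()
    }
}

fn main() {
    let mut payload = Payload::new();
    payload.add_span("controller/users");
    payload.add_span("sql/query");
    println!("{} spans queued", payload.len()); // → 2 spans queued
}
```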
Rust's stricter type system helped me write code quickly, with the help of the compiler. By setting up guard rails of newtypes, I could be sure I wasn't accidentally passing the wrong values to a function. Python's unit tests provided similar help in several cases, but required more manual effort.
The instrumentation in language agent code is always trickier than the bookkeeping code to build a payload and send it up to our web service. Injecting instrumentation into existing code is hard, and relies on learning metaprogramming skills. That part of Python was by far the hardest and most fiddly, and has no counterpart in Rust.
The threading code in both languages was approximately the same. The Python code made use of the built-in Queue class to send data between threads in a safe manner. Similarly, Rust's insistence on safety forced me to use the built-in Mutex type everywhere necessary to make data transfer between threads safe.
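On the Rust side, the standard library also offers channels (`std::sync::mpsc`), which are the closest analog to Python's `Queue`. A small sketch using a channel rather than a `Mutex`, with illustrative values:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of cross-thread data transfer, analogous to Python's Queue: a
// worker thread sends measurements over a channel, and ownership rules
// make the hand-off safe without manual locking.
fn sum_from_worker(values: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel();

    let worker = thread::spawn(move || {
        for v in values {
            tx.send(v).expect("receiver alive");
        }
        // `tx` is dropped here, which closes the channel.
    });

    // `rx.iter()` ends cleanly once every sender has been dropped.
    let total: u64 = rx.iter().sum();
    worker.join().unwrap();
    total
}

fn main() {
    println!("total: {}ms", sum_from_worker(vec![12, 7, 30])); // → total: 49ms
}
```

The compiler enforces that `values` and `tx` are moved into the worker, so there is no shared mutable state to guard.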
I found libraries for nearly every use case.
Rust has fit our use-case of a fast, crash-resistant, embeddable core performance agent well.