Telemetry as a Centerpiece of Your Software
Posted on: 2018-08-14
Since my arrival at Netflix, I have been spending all my time on the new Partner Portal for Netflix Open Connect. The website is private, so do not worry if you cannot find a way to access its content. I built the new portal with a few key architectural concepts as the foundation, and one of them is telemetry. In this article, I will explain what it consists of, why it plays a crucial role in the maintainability of the system, and how it allows you to iterate smartly.
Telemetry is about gathering insight into your system. The most basic telemetry is a simple log that adds an entry to a system when an error occurs. However, a good telemetry strategy reaches way beyond capturing faulty operations. Telemetry is about collecting the behaviors of the users, the behaviors of the system, misbehaviors along the expected code paths, and performance, by defining scenarios. The goal of investing time in a telemetry system is to raise awareness of what is going on on the client machine, as if you were looking over the user's shoulder. Once the telemetry system is in place, you must be able to know what the user did. You can think of telemetry as having someone dropping breadcrumbs everywhere.
Most systems collect handled and unhandled errors. Logging errors is crucial to identify which ones occur so they can be fixed. However, without a good telemetry system, it can be challenging to know how to reproduce them. It is important to record which pages the user visited with an accurate timestamp, along with which query string, on which browser, and from which link. If you are using a framework like React and Redux, knowing which action was called, which middleware executed code and fetched data, as well as the timing of each of these steps, is necessary. Once the data is in your system, you can extract different views. You can extract all errors by time and break them down by category, and you can see error trends going up and down when releasing a new piece of code.
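To make this concrete, here is a minimal sketch of a Redux middleware that records every dispatched action with its timing. The `trackEvent` function is a hypothetical stand-in for the telemetry library discussed later, not the actual implementation.

```typescript
import { Middleware } from "redux";

// Hypothetical stand-in for the real telemetry client
const trackEvent = (name: string, data: object): void => {
  console.log(name, data);
};

export const telemetryMiddleware: Middleware = () => (next) => (action) => {
  const start = performance.now();
  const result = next(action); // Let the action flow through the chain
  trackEvent("reduxAction", {
    type: (action as { type: string }).type,
    // Time spent in the reducers and middleware below this one
    durationMs: performance.now() - start,
  });
  return result;
};
```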
Handling errors is one perspective, but knowing how long a user waited for data to be fetched is just as important. Knowing the key percentiles (5th, 25th, 50th, 75th, 95th, 99th) of your scenarios indicates how users perceive your software. Decisions about which parts need improvement can be made with confidence because they are backed by real data from the users who consume your system. It is easier to justify engineering time to improve code that hinders the experience of your customers when you have hard data. Collecting scenario data is a source of feature popularity as well. Aggregating the count of users for a specific scenario can indicate whether a feature is worth keeping in the system or should be promoted to be easier to discover. How to interpret the telemetry values is subjective most of the time, but it is less opinionated than a raw gut feeling. Always keep in mind that a value may hide an undiscovered reality. For example, a feature may be popular but users hate using it -- they just do not have any other alternative.
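As an illustration, here is a small sketch of how scenario durations could be reduced to those key percentiles, assuming the durations (in milliseconds) have already been pulled out of the telemetry store.

```typescript
// Nearest-rank percentile over a sorted sample of scenario durations
function percentile(sortedMs: number[], p: number): number {
  const index = Math.min(
    sortedMs.length - 1,
    Math.ceil((p / 100) * sortedMs.length) - 1
  );
  return sortedMs[Math.max(0, index)];
}

// Example durations for a single scenario, e.g. "loadDashboard"
const durations = [120, 135, 150, 180, 240, 310, 980].sort((a, b) => a - b);
for (const p of [5, 25, 50, 75, 95, 99]) {
  console.log(`p${p}: ${percentile(durations, p)} ms`);
}
```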
There are many types of telemetry, and when I unfolded my plan to collect them, I created a thin client-side TypeScript library with four access points. The first one is named "trackError". Its specialty is tracking errors and exceptions. It is as simple as providing an error name that allows grouping errors easily (this is possible with handled errors caught in a try-catch block) and the stack trace. The second one is "trackScenario", which starts collecting time from the start to the end of a scenario. This function returns a "Scenario" object which can be ended but also has the capability of adding markers. Each marker lives within the scenario and allows fine-grained sub-steps. The goal is to easily identify what inside a scenario causes slowness. The third access point is "trackEvent", which takes an event name and a second parameter containing an unstructured object. It allows collecting information about a user's behavior. For example, when a user sorts a list, there is an event "sortGrid" with a data object that indicates which grid, the direction of the sort, which field is being sorted, etc. With the event data, we can generate many reports about how users use each grid or, more generically, each field. Finally, it is possible to "trackTrace", which allows specifying information about the system at several trace levels (error, warning, info, verbose). The library is thin, simple to use, and has basic functionality like always sending the Git hash of the code, always sending the navigation information (browser info), attaching the user's unique identifier, etc. It does not do much more. In fact, it does one more thing: it batches the telemetry messages and sends them periodically to avoid hammering the backend. The backend is a simple REST API that takes a collection of telemetry messages and stores them in Elasticsearch.
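To give an idea of the shape of such a library, here is a sketch of what the four access points and the batching could look like. Only the four function names come from the description above; every type, parameter, and the "/api/telemetry" endpoint are assumptions for illustration, not the actual Netflix library.

```typescript
type TraceLevel = "error" | "warning" | "info" | "verbose";

interface Scenario {
  addMarker(name: string): void; // Fine-grained sub-step inside the scenario
  end(): void;                   // Stops the timer and queues the telemetry
}

interface TelemetryClient {
  trackError(name: string, error: Error): void;
  trackScenario(name: string): Scenario;
  trackEvent(name: string, data: Record<string, unknown>): void;
  trackTrace(level: TraceLevel, message: string): void;
}

// Sketch of the batching behavior: messages accumulate in memory and are
// flushed periodically to a REST endpoint instead of hammering the backend.
const queue: object[] = [];
const FLUSH_INTERVAL_MS = 10_000; // Assumed flush interval

setInterval(() => {
  if (queue.length === 0) return;
  const batch = queue.splice(0, queue.length);
  void fetch("/api/telemetry", { // Placeholder endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(batch),
  });
}, FLUSH_INTERVAL_MS);
```

In this shape, a caller would write something like `const s = telemetry.trackScenario("loadDashboard"); s.addMarker("fetchStarted"); s.end();`, and every queued message would carry the Git hash, browser info, and user identifier automatically.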
A key aspect, as with many software architecture and process decisions, is to start right from the beginning. There are hundreds of telemetry calls in the system at the moment, and adding them was not a burden because they were added continually during the creation of the website. Similar to writing unit tests, it is not a chore when you do not have to write them all at once. While coding the features, I had doubts about a few decisions, and I also had some ideas that were not unanimous.
The aftereffect of having all this data about the users soothes many hot topics by providing a reality check about how users are really using the system. Even when performing a thorough user session to ask how they use the system, there is nothing like real data. For example, I was able to conclude that some users try to sort an empty grid of data. While this might not be the discovery of the century, I believe it is a great example of a behavior that no user would have raised. Another beneficial aspect is monitoring the errors and exceptions and fixing them before users report them. In the last month, I have fixed many (minor) errors, and less than 15% of them were raised or identified by a user. When an error occurs, there is no need to consult the user -- which is often hard since they can be anywhere around the world. My daily routine is to sort all errors by count per day and see which ones are rising. I take the top ones, search for a user who had the issue, and look at that user's breadcrumbs to see how to reproduce it locally on my development machine. I fix the bug and push it to production. The next day, I look back at the telemetry to see if the count is decreasing. A proactive bug-fixing approach is a great sensation: you feel far less like you are putting out fires, which allows you to fix issues properly. Finally, with the telemetry system in place, when my day-to-day job gets tedious or I hit a slump in productivity, I take the opportunity to query the telemetry data and break it into several dimensions, with the goal of shedding light on how the system is really used and how it can be improved to provide the best user experience possible.
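For readers curious about that routine, here is a sketch of the kind of Elasticsearch aggregation that could support it: error counts per day, broken down by error name. The index name ("telemetry") and the field names ("timestamp", "errorName", "telemetryType") are assumptions about the schema, and "calendar_interval" assumes a recent Elasticsearch version.

```typescript
async function topErrorsByDay(): Promise<void> {
  const query = {
    size: 0, // We only want the aggregations, not the documents themselves
    query: { term: { telemetryType: "error" } }, // Hypothetical type field
    aggs: {
      perDay: {
        date_histogram: { field: "timestamp", calendar_interval: "day" },
        aggs: {
          // Top 10 error names within each day
          byError: { terms: { field: "errorName.keyword", size: 10 } },
        },
      },
    },
  };
  const response = await fetch("http://localhost:9200/telemetry/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  console.log(JSON.stringify(await response.json(), null, 2));
}
```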
This article stayed mostly at the conceptual level. I will follow up in a few days with more detail about how to implement telemetry with TypeScript in a React and Redux system.