Burning the last Git commit into your telemetry/log

I enjoy knowing exactly what happens in the systems that I am actively working on and that I need to maintain. One way to ease the process is to know precisely the version of the system when an error occurs. There are many ways to proceed, like having a sequential number that increases, or having a version number (major, minor, patch). I found that the easiest way is to leverage the Git hash. The reason is that not only does it point me to a unique place in the life of the code, it also removes all the manual incrementation that a version number requires, or the need to use or build something that increments a number for me.

The problem with the Git hash is that you cannot inject it reliably on a local build. The reason is that every change you make must be committed and pushed, hence the hash would always be at least one commit behind. The idea is to inject the hash at build time in the continuous integration (CI) pipeline. This way, the CI is always running on the latest code (or a specific branch), knows exactly which code is being compiled, and thus can inject the hash without having to save anything.

At the moment, I am working with Jenkins and React using react-scripts-ts. I only had to change the build command to inject the output of a Git command into a React environment variable.

"build": "REACT_APP_VERSION=$(git rev-parse --short HEAD) react-scripts-ts build",

In the code, I can get the version by using the process environment.

const applicationVersion = process.env.REACT_APP_VERSION;

The code is minimal and leverages the Git system and an environment variable that can easily be read inside the React application. There is no mechanism to maintain, and the hash is a source of truth. When a bug occurs, it is easy to set up the development environment at the exact commit and to use the rest of the logs to find out how the user reached the exception.
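
For illustration, here is a minimal sketch of how the injected hash could be attached to every log entry; the logError helper and its format are hypothetical, not the actual portal code.

// Hypothetical example: the hash injected at build time travels with every log entry.
const applicationVersion: string = process.env.REACT_APP_VERSION || "unknown";

function logError(message: string): void {
    // A log line can now be traced back to the exact commit that produced it.
    console.error(`[version ${applicationVersion}] ${message}`);
}

logError("Unable to fetch the BGP configuration");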

Google Analytics with React and Redux

I had to integrate Google Analytics into one of our websites at Netflix. It had been a while since I had used Google Analytics, and the last time was simply copy-pasting the code snippet provided by Google when creating the Analytics “provider” account. That was a few years ago, and the website was not a single-page application (SPA). Furthermore, this application is using Create React App (TypeScript version) with Redux. I took a quick look and found a few examples on the web, but I wasn’t satisfied. The reason is that all the examples I found were hooking Google Analytics at the component level. I despise having anything in the user interface (UI), in React, that is not related to the UI.

The first step is to use a library instead of dropping the JavaScript directly into the code.

npm install --save react-ga

The next step is to configure the library with the unique identifier provided by Google. I am using the create-react-app scaffolding, and I found the best place to initialize Google Analytics to be the constructor of the App.tsx file. It is a single call that needs to be executed once for the life of the SPA.

import * as React from "react";
import * as ReactGA from "react-ga";
// Provider, ConnectedRouter, store, history, AppRouted and Environment come from
// the application's own setup; their imports are omitted here.

class App extends React.Component {

  public constructor(props: {}) {
    super(props);
    // One tracking identifier for production, another for development.
    ReactGA.initialize(process.env.NODE_ENV === Environment.Production ? "UA-1111111-3" : "UA-1111111-4");
  }
  public render(): JSX.Element {
    return <Provider store={store}>
      <ConnectedRouter history={history}>
        <AppRouted />
      </ConnectedRouter>
    </Provider>;
  }
}

export default App;

The last step is to notify Google Analytics of a page change when the route changes. React-router is mainly configured in React, but I didn’t want any more ReactGA code in React. The application I am working on uses Redux, and I have a middleware that handles routing. At the moment, it checks whether the route changed and analyzes the URL to start fetching data from the backend.

return (api: MiddlewareAPI<AppReduxState>) =>
    (next: Dispatch<AppReduxState>) =>
        <A extends Action>(action: A): A => {
            // Logic here that checks for action.type === LOCATION_CHANGE to fetch the proper data
            // ...
            // If action.type === LOCATION_CHANGE, we also call the following line
            // (actionTyped is the action cast to the location-change action type):
            ReactGA.pageview(actionTyped.payload.pathname);
            return next(action);
        };

The previous code is clean. Indeed, I would rather not have anything inside React, but App.tsx is the entry point and the initialize function injects Google’s code into the DOM. The Redux solution works well because react-router-redux provides the pathname, which is the URL. By calling the “pageview” function, we manually send a page change to Google Analytics.

Improving the User Experience of a Complex Form with Animation

I am working at Netflix on one of our websites dedicated to our partners around the world, which lets them get information as well as perform actions on their caches (Netflix’s CDN). One form allows configuring the BGP settings. It was present in the legacy portal. I found it complex to understand, and while many users are well aware of how BGP works, some other partners have less knowledge. I am a strong believer that a user interface must guide the user, avoiding bad inputs rather than alarming the user afterward with a red error message. My philosophy is to guide the user during data input and to make the experience enjoyable, without fear of a bad action or a wrong input.

While I was learning how Netflix’s caches worked and how BGP was supposed to be configured, most people were drawing a diagram. Since the natural “human to human” explanation was to use simple geometrical shapes to explain the concept, I decided not to fight that natural behavior and to embrace it.

The first step was to produce a simplified version of the different kinds of sketches I received and to generalize the idea across the several states in which BGP can be configured. The configuration in the old system was two forms, one for IPv4 and one for IPv6, which required the user to keep a mental picture of a configuration that was not displayed. I decided to combine the two to avoid forcing the user to open two browser windows (or tabs). I also wanted to avoid a completely new form per use case. For example, a BGP configuration requires hops between the gateway and the peer when the gateway is not on the same subnet as the peer. The peer IP is configurable, hence can change, which may or may not require the hops count at any given moment.

BGP Configuration with IPv4 and IPv6

The screenshot above shows a configuration. On the left is an OCA, for “Open Connect Appliance”, which is the cache that has all the movies. On the right is the peer. The gateway for both IPv4 and IPv6 is on the same subnet, hence there are no additional inputs to fill. This diagram has all the IPv4 inputs at the top and the IPv6 inputs at the bottom. After one or two usages, it becomes easy to see which inputs are for which internet protocol as well as for which machine (cache or peer).

Another detail is the highlight of which input or information belongs to which part of the graphic. When the mouse hovers over any portion of the user interface, you can see the corresponding internet protocol being highlighted. It guides the user into knowing which element is getting changed.

Hovering highlights the impact of a change

Another detail shown in the previous image is that when you activate a section, inline help, a circle with a question mark, appears. Help that does not clutter the interface and appears at the right moment allows the user to get additional information about the values. After gathering telemetry on these inline helps, I can confirm that the idea is a success. People are reading them!

Inline help that appears only when relevant

You may notice, on the first screenshot, that the reset and submit buttons are disabled. The inline help next to the buttons explains the reason to the user. When the user interacts with the form, the buttons change state and so does the inline help. The help is dynamic everywhere. It means a lot more work in terms of development, because the messages need to be smarter, but it also means that the user does not fall into the trap of generic messages that do not help; every message is right for the user’s scenario.
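
As a rough illustration of what “dynamic help” means in code, here is a minimal sketch; the field names and messages are hypothetical, not the actual Partner Portal logic.

// Hypothetical sketch: derive the submit button's inline-help message from the form state.
interface BgpFormState {
    peerIpV4: string;
    hopsCount?: number;
    isPeerOnSameSubnet: boolean;
}

function getSubmitHelpMessage(state: BgpFormState): string | undefined {
    if (state.peerIpV4 === "") {
        return "Enter the IPv4 peer address to enable submission.";
    }
    if (!state.isPeerOnSameSubnet && state.hopsCount === undefined) {
        return "The peer is on a different subnet: a hops count between 1 and 16 is required.";
    }
    return undefined; // No message means the submit button is enabled.
}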

The gateway appears when the IPs are not on the same subnet; the gateways merge when IPv4 and IPv6 are in the same situation

In the last animation, we can see a more advanced scenario where the user interface guides the user by showing an additional field that must be entered: the hops count. The field has some business logic indicating that it must be between 1 and 16, and the interface adapts to show the input as well as where the information belongs. The hops are between the gateway and the peer. Also, you can see the gateway IP moving to the gateway, which is no longer the same as the peer IP. Suddenly, anyone using the form sees that the gateway is an entity apart from the peer and that it has an IP that cannot be changed, while the peer IP can.

You may wonder how it was built. It was built using React, TypeScript, and plain HTML. No SVG, no canvas. Working with SVG is trendy, but it is overly complex even for basic styling, for example adding a drop shadow. It is also more complex to mix with input fields. Using DIVs and inputs did the job perfectly.

I already wrote about how to handle animation with React, and the exact same technique is used here. CSS3 animations drive the movement. Many scenarios require parts to move, and every dance is orchestrated by the main form component and children components that add styles and CSS classes depending on the properties that describe the business logic rules.
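
Here is a minimal sketch of that approach; the GatewayBox component, its props, and the class names are illustrative only, and the actual movement comes from a CSS transition attached to those classes.

// Hypothetical child component: the parent form passes a business-logic prop,
// the child only translates it into CSS classes; a CSS transition animates the change.
import * as React from "react";

interface GatewayBoxProps {
    isOnSameSubnet: boolean;
}

const GatewayBox: React.SFC<GatewayBoxProps> = (props) => {
    const cssClass = props.isOnSameSubnet
        ? "gateway-box gateway-box--merged"    // sits with the peer, no hops input
        : "gateway-box gateway-box--detached"; // slides away, revealing the hops input
    return <div className={cssClass}>Gateway</div>;
};

export default GatewayBox;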

The graphic is wide and could be a problem on small resolutions. I decided to fall back to a simple, basic form with labels on top of the inputs. Nothing flashy, but enough for someone on a small tablet or a phone to be able to configure the BGP.
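
One way such a fallback can be implemented is sketched below; the breakpoint, component names, and props are assumptions for illustration, not the actual Partner Portal code (which could equally rely on CSS media queries).

// Hypothetical: show the animated diagram only on wide screens, otherwise a plain form.
import * as React from "react";

interface BgpFormProps { /* form values omitted for brevity */ }

// Placeholders standing in for the real diagram form and the simple fallback form.
const BgpDiagramForm: React.SFC<BgpFormProps> = () => <div>Diagram form</div>;
const BgpSimpleForm: React.SFC<BgpFormProps> = () => <div>Simple form</div>;

const WIDE_SCREEN_QUERY = "(min-width: 1000px)"; // breakpoint is an assumption

const BgpConfiguration: React.SFC<BgpFormProps> = (props) => {
    // A real implementation would also react to viewport changes (resize/media query listener).
    const isWideScreen = window.matchMedia(WIDE_SCREEN_QUERY).matches;
    return isWideScreen ? <BgpDiagramForm {...props} /> : <BgpSimpleForm {...props} />;
};

export default BgpConfiguration;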

To conclude, the animation is a step toward a more complex and fancy vision I had. As I wrote in the performance article, I never had a dedicated task (or time) reserved to build this feature with animation; I had to find the time. The more I work in software engineering, the more I realize that it is very rare that you have time for extras. This unfortunate rule applies to user interface extras, to performance tweaks, and even to automated tests. You must find the time, squeeze these little extras between bigger initiatives, and always keep some buffer for issues, which can be spent on polishing if everything goes smoothly. In the next bug fix, I plan to improve the colors and how smoothly the animation runs on slower computers, but that is a story for another time.

Increasing Performance of Thousands of LI with Expandable Rows using React-Virtualized

Like most applications, everything is smooth with testing data. After a while of usage, the application gathers more and more bytes, which changes the whole picture. In some situations, the data does not expire, hence the collection of data grows out of proportion, way beyond the normal testing scenarios that the developer might anticipate at the inception of the code. The problem is the browser and its lack of understanding of the purpose of so many elements. Using HTML with overflow set to scroll (or auto) and relying on the scrolling always being smooth is a fallacy: the browser still needs to render and to move all these elements.

There are a few options. The easiest one is to only display a subset of the whole collection of data. However, that solution penalizes users who may want to reach that information again in the future. A second popular option is to render only what is visible. It means that we need to remove from the screen elements that are hidden, out of range of the user. Everything above and under the viewport is removed. The challenge is to keep the scrollbar position and size right. Using a virtualization library that calculates the number of elements outside the viewport and simulates the space, while avoiding generating too many elements, is a valid solution. It works well with fixed row heights. However, in my use case, clicking a row expands it with many inputs, graphs, and text. I also didn’t use a table, so I wasn’t sure how it would work. To my big surprise, not only are variable row heights supported by the most popular virtualization library, it also supports collections of elements that are not a table (for example UL>LI). I am talking about the React-Virtualized library.

In this article, I’ll show how to transform a list built with a single UL and many LI. Clicking an LI adds another React component inside it, which increases the height of the row. The full source code comes from the Data Access Gateway Chrome Extension. In the following image, you can see that we have many rows of 15px height followed by one that got expanded to 300px. The idea is that the code is smart enough to take the different heights into account in its calculation.

Before explaining, note that the official library has many examples, which are great. However, they are not all written in JSX, which can be cumbersome to read. I hope that this article might help some people understand how you can do it with JSX as well as with your own React components. To explain my scenario, here is how it was before and how we will modify the structure in terms of React component hierarchy. The following diagram shows that I am simulating a header for the grid, outside the range of what can be scrolled, with the first UL. The second UL contains a collection of “ConsoleMessagesLine” components, each of which is an LI with many DIVs: one DIV per column, positioned with a flex display. When one row is clicked, it toggles between rendering an additional component, the “ConsoleMessagesLineDetails”, or rendering undefined, which does not display anything.

The division of the components satisfied my needs, and I wanted to alter the composition as little as possible. It ended up that I only had to add two different components from the React-Virtualized library, without having to change the UL>LI>DIV composition.

The first step is to get the library. This is straightforward if you are using NPM. Indeed, I am using TypeScript and the library has a @types package.

npm install react-virtualized --save
npm install @types/react-virtualized --save-dev

However, with TypeScript 3.0rc, I had an issue with the definition file. It has some conflicting types, which required me to tell TypeScript to forgo analyzing the definition file. It doesn’t remove the strength of having types, but TypeScript won’t check the definition file when compiling. In your tsconfig.json file, you should add and enable “skipLibCheck”.

"skipLibCheck": true

The next step is to add the AutoSizer and the List components from the library to the existing code. A few important details. First, you must have a “ref” to the list. The reference gives us a pointer to force a redraw when we know the user changed the height of an existing component. We will see the code later. Second, we must provide a function that returns the row height. The “rowHeight” property is a must in the situation where the height can change depending on some characteristic of the row. In my case, the row will be either 15px or 300px depending on whether it is collapsed or expanded.

<ul className="ConsoleMessage-items">
    <AutoSizer>
        {({ width, height }) => (
            <List
                ref={r => (this.list = r)}
                className="ConsoleMessage-virtual-list"
                height={height}
                rowCount={this.state.filteredData.length}
                rowHeight={p => this.calculateRowHeight(p)}
                rowRenderer={p => this.renderRow(p)}
                width={width}
            />
        )}
    </AutoSizer>
</ul>

The AutoSizer component has an unusual way of using its children: it passes them a width and a height. You can see an extract of the definition file under this paragraph. The Size type has a width and a height determined by the AutoSizer.

children: (props: Size) => React.ReactNode;

The List component wraps our rows and will generate one row for each index up to the rowCount provided in the properties. It means that I had to modify my code: I was mapping my collection to render a ConsoleMessagesLine, and now the library does the looping and I have to indicate by index which element to render.

<ul className="ConsoleMessage-items">
    {this.props.listMessages
        .filter((m) => this.logics.filterConsoleMessages(m, this.state.consoleMessageOptions.performance, this.state.consoleMessageOptions.size))
        .map((m: MessageClient) => <ConsoleMessagesLine
            key={this.logics.getMessageKey(m)}
            message={m}
            listMessages={this.props.listMessages}
            demoModeEnabled={this.props.demoModeEnabled}
        />)
    }
</ul>

The first modification from the code above was that I was filtering in the render. Instead, I have to keep this list somewhere so I can refer to it by index. I leveraged the static React function getDerivedStateFromProps to derive a filtered collection from the props into the component’s state.

public static getDerivedStateFromProps(
    props: ConsoleMessagesProps,
    state: ConsoleMessagesState
): ConsoleMessagesState {
    const allData = props.listMessages.filter(m =>
        ConsoleMessages.logics.filterConsoleMessages(
            m,
            state.consoleMessageOptions.performance,
            state.consoleMessageOptions.size
        )
    );
    return {
        ...state,
        filteredData: allData
    };
}

The next step was to create two functions: one for the dynamic height determination of each row, and one for the actual render of a single line, which uses the data we just stored in the component’s state. The dynamic height is a one-line function that in turn uses the filtered list as well.

private calculateRowHeight(params: Index): number {
    return this.openMessages[this.state.filteredData[params.index].uuid] === undefined ? 15 : 300;
}

As you can see, it goes into the filtered list, uses the parameter that has an index, and determines whether the unique identifier of the object is in a map that we populate when the user clicks the row. I have a uuid field, but it could be anything that fits your logic and code. For the curious about the “openMessages” map, it is a private member of the component to which elements are added and removed as each row is toggled. A small but important detail: because we are modifying the height of an element of the virtualization, we must explicitly and manually invoke a recalculation of the rows. This is possible by calling “recomputeRowHeights” on the list. The reference to the list is handy because we can invoke it in the function that toggles the height.

private lineOnClick(msg: MessageClient, isOpen: boolean): void {
    const unique = msg.uuid;
    if (isOpen) {
        this.openMessages[unique] = unique;
    } else {
        delete this.openMessages[unique];
    }
    if (this.list !== null) {
        this.list.recomputeRowHeights();
    }
}

The last crucial piece is the render of the element. Similar to the function that determines the height of the row, the parameter gives the index of the element to render.

private renderRow(props: ListRowProps): React.ReactNode {
    const index = props.index;
    const m = this.state.filteredData[index];
    return (
        <ConsoleMessagesLine
            key={m.uuid}
            style={props.style}
            message={m}
            listMessages={this.props.listMessages}
            demoModeEnabled={this.props.demoModeEnabled}
            onClick={(msg, o) => this.lineOnClick(msg, o)}
            isOpen={this.openMessages[m.uuid] !== undefined}
            charTrimmedFromUrl={this.state.consoleMessageOptions.charTrimmedFromUrl}
        />
    );
}

The code is very similar to my code before the virtualization modification. However, it has to fetch the data from the filtered list by index.

Finally, the code is all set up, without major refactoring or constraints in terms of how the components must be created. Two wrappers in place, a little bit of code moved around to comply with the contract of having rows rendered by a function, and we are all set. The performance is now night and day when the number of rows increases.

Top 5 Improvements that Boost Netflix Partner Portal Website Performance

Netflix is all about speed. Netflix strives to give the best experience to all its customers, and no one likes to wait. I am working in the Open Connect division, which ensures that movies are streamed efficiently to everyone around the world. Many pieces of the puzzle are essential for a smooth streaming experience, but at its core, Netflix’s caches act like a smart and tailored CDN (content delivery network). At Netflix, my first role was to create a new Partner Portal for all ISPs (Internet service providers) to monitor their caches as well as perform other administrative tasks. There is public documentation about the Partner Portal available here if you are interested to know more about it. In this blog post, I’ll talk about how I was able to take a specific user scenario that required many clicks and an average of 2 minutes 49 seconds down to under 50 seconds (cold start), and under 19 seconds once the user has visited the website more than once. An 88% reduction of waiting time is more than just an engineering feat; it is a delight for our users.

#1: Tech Stack

The framework you are using has an initial impact. The former Partner Portal was made in AngularJS. That is right, the first version of Angular. No migration had been made for years. There were the typical problems in many areas with the digest of events, as well as code that was getting harder to maintain. The maintenance aspect is out of scope for this article, but AngularJS has always been hard to follow without types, and with the possibility of adding values in a variety of places and having many functions and values in scope, it slowly becomes a nightmare. Overall, Netflix is moving toward React and TypeScript (while it is not a rule). I saw the same trend in my years at Microsoft, and I was glad to take this direction as well.

React allows fine-grained control over optimization, which I’ll discuss in further points. Other than React, I selected Redux. It is not only a very popular framework but also very flexible in how you can configure it and tweak its performance. Finally, I created the Data Access Gateway library to handle client-side request optimization with two levels of cache.

The summary of the tech stack point is that you can have a performant application with Angular or any other framework. However, you need to keep watering your code and libraries. By that I mean you must upgrade and make sure to use best practices. We could have gone with Angular 6 and achieved a very similar result, in my opinion. I will not go into detail about why I prefer the proximity to JavaScript of React over AngularJS’s templating engine. Let’s just say that being as close to the browser as possible and avoiding layers of indirection are appealing to me.

#2: More clicks, less content per page

The greatest fallacy of web UI is optimizing for the smallest number of clicks. This is driven by research on shopping websites, where the easier and quicker the user can press “buy”, the more likely the sale. Great, but not every website’s goal is to bring someone to one particular action in the fewest clicks. Most websites’ goal is to have the user enjoy the experience and fulfill his or her goal in a fast and pleasant way. For example, you may have a user interface that requires 3 clicks, but each click takes 5 seconds. Or, you could have a user interface that requires 4 clicks at 2 seconds each. In the end, the experience is 15 seconds versus 8 seconds. Indeed, the user clicked one more time but got the result much faster. Not only that, the user had the impression of a much faster experience because he or she was interacting instead of waiting.

Let’s be clear, the goal is not to have the user click a lot more, but to be smart about the user interface. Instead of showing a very long page with 20 different pieces of information, I broke the interface into separate tabs or different pages. It reduced some pages that required a dozen HTTP calls down to 1-2 calls. Furthermore, clicks in a sequence of actions can reuse previously fetched data, giving fast steps. That gain came automatically with the Data Access Gateway library, which caches HTTP responses. Not only was it better in terms of performance, in terms of telemetry it is heaven. It is now possible to know very accurately what the user is looking at. Before, we had a lot of information on a page and it was hard to know what was really consulted. Now we have a way, since we can collect information about which pages, tabs, and sections are opened or drilled into.

#3: Collect Telemetry

I created a small private library that we now share across the websites in our division at Netflix. It allows collecting telemetry. I wrote a small article in the past about the principle, where you can find what is collected. In short, you have to know how users are interacting with your website as well as gather performance data about their reality. Then, you can optimize. Not only do I know which features are used, but I can also establish patterns that allow preloading or positioning elements on the interface in a smart way. For example, in the past, we were fetching graphs on a page for every specific entity. It was heavy in terms of HTTP calls, in terms of rendering, and in terms of “spinners”. By separating into “metric” pages with one type of graph per tab, we were not only able to establish which graphs are really viewed but also which options, etc. We removed the auto-loading of graphs, letting the user load the graph he or she wants to see. Not having to wait for something you do not want seems to be a winning (and obvious) strategy.

To summarize, not only is data a keystone of knowing what to optimize, it is crucial for the developer to always have the information in front of them. The library I wrote for telemetry outputs a lot of information in the console, with different colors and font sizes, to clearly give insight into the situation. It also injects itself into the Google Chrome Performance tooling (like React does), which allows seeing the different “scenarios” and “markers”. There is no excuse, at the development phase or in production, for not knowing what is going on.
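
The injection into Chrome’s Performance tooling can be done with the standard User Timing API; the following is only a sketch of that mechanism with made-up scenario names, not the private library itself.

// Sketch: marks and measures show up under "User Timing" in Chrome's Performance tab.
function startScenario(name: string): void {
    performance.mark(`${name}-start`);
}

function endScenario(name: string): void {
    performance.mark(`${name}-end`);
    performance.measure(name, `${name}-start`, `${name}-end`);
}

startScenario("fetchCacheList");
// ... HTTP call, parsing, rendering ...
endScenario("fetchCacheList");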

#4: Rendering Smartly

In a single-page application that optimizes for speed, not clicks, rendering smartly is crucial. React is built around a virtual DOM, but it still requires some patterns to be efficient. Several months ago I wrote about 4 patterns to boost your React and Redux performance. These patterns are still very relevant. Avoiding rendering helps the whole experience. In short, you can batch your Redux actions to avoid having several notifications that each trigger a potential view update. You can optimize the mapping of your normalized objects into denormalized objects by using a function in Redux’s connect to cancel the mapping. You can also avoid denormalizing by “selecting” the data if the normalized data in your reducers has not changed. Finally, you need to use React to leverage immutable data and only render when data changes, without having to compute intense logic.
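
As one concrete illustration of the “cancel the mapping” pattern, here is a minimal sketch using react-redux’s connect options; the state shape, the denormalizeCaches helper, and the CacheList component are assumptions for the example.

import { connect } from "react-redux";

// Hypothetical: an expensive denormalization that we only want to run when needed.
const mapStateToProps = (state: AppReduxState) => ({
    caches: denormalizeCaches(state.entities.caches)
});

export default connect(
    mapStateToProps,
    undefined,
    undefined,
    {
        // Skip re-running mapStateToProps when the slice we care about has not changed.
        areStatesEqual: (next: AppReduxState, prev: AppReduxState) =>
            next.entities.caches === prev.entities.caches
    }
)(CacheList);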

#5: Get only what you need

We had two issues in terms of communication with the backend. First, we were making a lot of calls. Second, we were performing the same call over and over again in a short period of time. I open-sourced a library that we use intensively for all our data needs, called the Data Access Gateway library. It fixes the second issue right away by never performing two identical calls at the same time. When a request is being performed and a second one wants the same information, the latter subscribes to the first request. It means that all subsequent requesters get the information from the former requester, and they receive it pretty fast. The problem of making many calls could, in theory, be handled better by having less generic REST endpoints. However, I had little control over the APIs. The Data Access Gateway library offers a memory cache and a persisted cache with IndexedDB for free. It means that calls are cached and, depending on the strategy selected in the library, you can get the data almost instantly. For example, the library offers a “fetchFast” function that always returns the data as fast as possible, even if it is expired. It then performs the HTTP call to get fresh data, which will be ready for the next request. The default expiration is 5 minutes, and our data does not change that fast. However, we have scenarios where the data must be very fresh. It is possible to tailor the caching for these cases. Also, it is possible to cache for a longer time. For example, a chart that displays information over a year could be cached for almost a full day. Here is a screenshot of the Data Access Gateway Chrome extension, which shows that for a particular session, most of the data came from the cache.
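
To give an idea of how such a call looks from the consumer’s side, here is a rough sketch; the exact function signature and option names of the Data Access Gateway library may differ, so treat everything below as an assumption rather than the real API.

// Rough sketch (assumed API shape): ask for the fastest available copy of the data;
// an expired copy is returned immediately while a fresh one is fetched in the background.
const response = await gateway.fetchFast<YearlyTrafficChart>({
    request: { url: "https://api.example.com/caches/123/traffic?period=1y" },
    memoryCache: { lifespanInSeconds: 5 * 60 },           // the default-like 5 minutes
    persistentCache: { lifespanInSeconds: 23 * 60 * 60 }  // a yearly chart can live almost a day
});
renderChart(response.result);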

The persisted cache is also great for returning users. Returning users get an experience that is nearly instant as well. The data might be old, but the next click updates everything.

The experience and numbers vary a lot depending on how the user interacts with the system. However, it is not rare to see that above 92% of requests for information are delivered by the cache, meaning returned to the user from the memory cache or the persisted cache rather than from an HTTP request. The other way to see it is that when a user clicks on the interface, only 8% of the data is requested via HTTP (slow). However, if the user stays within the same set of features, the number can easily climb to 98%. Not only is the data delivered at a speed that feels fast to the user, it is also very efficient in terms of data moved across the network. Again, the numbers vary greatly depending on how the user interacts with the Netflix Partner Portal. But it’s not rare to see that only 10% of the bytes used by the application actually come from HTTP requests, while 90% are already cached by the library. This means that in a session where a user performed many actions, instead of having to download about 150 megs, less than 15 megs of data were downloaded. A great gain in terms of user experience, a great gain in terms of relieving our backend, and also a gain for our users, who save bandwidth. Here is a screenshot of a session recorded by the Data Access Gateway Chrome extension.

What next?

Like many of you, my main task is delivering new features and maintaining the existing code. I do not have specific time allotted for improving performance, but I do it anyway. I find that it is our duty, as web developers, to ensure that the user gets the requested features with quality. The non-functional requirement of performance is a must. I often take the liberty of adding a bullet point giving a performance goal before starting to develop a feature. Every little optimization along the journey accumulates. I have been working for 13 months on the system and keep adding, once in a while, a new piece of code that boosts the performance. Like unit tests, polishing the user interface, or adding telemetry code to get more insight, performance is something that must be worked on daily, and when we step back and look at the big picture, we can see that it was worth it.

Telemetry as a Centerpiece of Your Software

Since my arrival at Netflix, I have been spending all my time working on the new Partner Portal of Netflix Open Connect. The website is private, so do not worry if you cannot find a way to access its content. I built the new portal with a few key architectural concepts as the foundation, and one of them is telemetry. In this article, I will explain what it consists of and why it plays a crucial role in the maintainability of the system as well as in how to iterate smartly.

Telemetry is about gathering insight into your system. The most basic telemetry is a simple log that adds an entry to a system when an error occurs. However, a good telemetry strategy reaches way beyond capturing faulty operations. Telemetry is about collecting behaviors of the users, behaviors of the system, misbehaviors of the correctly programmed path, and performance by defining scenarios. The goal of investing time in a telemetry system is to raise awareness of what is going on on the client machine, as if you were standing behind the user’s back. Once the telemetry system is in place, you must be able to know what the user did. You can see telemetry as having someone dropping breadcrumbs everywhere.

A majority of systems collect errors and unhandled exceptions. Logging errors is crucial to clarify which ones occur in order to fix them. However, without a good telemetry system, it can be challenging to know how to reproduce them. Recording which pages the user visited, with a very accurate timestamp, as well as with which query string, on which browser, and from which link, is important. If you are using a framework like React and Redux, knowing which action was called, which middleware executed code and fetched data, as well as the timing of each of these steps, is necessary. Once the data is in your system, you can extract different views. You can extract all errors by time and divide them by category of error, and you can see error trends going up and down when releasing a new piece of code.
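
For example, a Redux middleware can record every action as a breadcrumb; this is only a sketch of the idea, where AppReduxState and trackEvent are placeholders rather than the portal’s actual code.

import { Action, Dispatch, MiddlewareAPI } from "redux";

// Sketch: every dispatched action becomes a timed breadcrumb in the telemetry stream.
const telemetryMiddleware = (api: MiddlewareAPI<AppReduxState>) =>
    (next: Dispatch<AppReduxState>) =>
        <A extends Action>(action: A): A => {
            const startMs = performance.now();
            const result = next(action); // let the reducers and other middlewares run first
            trackEvent("reduxAction", {
                type: action.type,
                durationMs: performance.now() - startMs
            });
            return result;
        };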

Handling errors is one perspective, but knowing how long a user waited to fetch data is just as important. Knowing the key percentiles (5th, 25th, 50th, 75th, 95th, 99th) of your scenarios indicates how the user perceives your software. Decisions about which parts need improvement can be made with certainty because they are backed by real data from users who consume your system. It is easier to justify engineering time to improve code that hinders your customers’ experience when you have hard data. Collecting scenario data is a source of feature popularity as well. The aggregation of the count of a specific scenario by user can indicate whether a feature is worth keeping in the system or should be promoted to be easier to discover. The conclusions about how to interpret telemetry values are subjective most of the time, but they are less opinionated than a raw gut feeling. Always keep in mind that a value may hide an undiscovered reality. For example, a feature may be popular but users may hate using it; they just do not have any other alternative.

There are many kinds of telemetry, and when I unfolded my plan to collect them, I created a very thin TypeScript (client-side) library with 4 access points. The first one is named “trackError”. Its specialty is to track errors and exceptions. It is as simple as having an error name that allows easily grouping errors (this is possible with handled errors caught in try-catch blocks) and it contains the stack trace. The second one is “trackScenario”, which starts collecting the time from start to end. This function returns a “Scenario” object which can be ended but also has the capability of adding markers. Each marker lives within the scenario and allows fine-grained sub-steps. The goal is to easily identify what inside a scenario involves slowness. The third access point is “trackEvent”, which takes an event name and a second parameter that contains an unstructured object. It allows collecting information about a user’s behavior. For example, when a user sorts a list, there is an event “sortGrid” with a data object that has fields indicating which grid, the direction of the sort, which field is being sorted, etc. With the data of the event, we can generate many reports of how the user is using each grid or, more generically, which fields, etc. Finally, it is possible to “trackTrace”, which allows specifying information about the system at several trace levels (error, warning, info, verbose). The library is thin, simple to use, and has basic functionality like always sending the Git hash of the code, always sending the navigation information (browser info), attaching the user’s unique identifier, etc. It does not do much more. In fact, one more thing: it batches the telemetry and sends it periodically to avoid hammering the backend. The backend is a simple REST API that takes a collection of telemetry messages and stores them in Elasticsearch.
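
Since the library is private, here is only a hypothetical usage sketch of the four access points described above; the names follow the article, but the exact signatures are assumptions.

// Hypothetical usage of the four access points (signatures are assumptions).
const scenario = telemetry.trackScenario("loadCacheDetails");
scenario.addMarker("httpResponseReceived");
scenario.addMarker("chartRendered");
scenario.end();

telemetry.trackEvent("sortGrid", { grid: "cacheList", field: "hostname", direction: "asc" });

try {
    riskyOperation();
} catch (e) {
    telemetry.trackError("cacheListSortFailure", e); // a named error is easy to group
}

telemetry.trackTrace("verbose", "User opened the BGP configuration form");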

A key aspect, as with many software architecture and process decisions, is to start right from the beginning. There are hundreds of usages of telemetry in the system at the moment, and it was not a burden to add them. The reason is that they were added continually during the creation of the website. Similar to writing unit tests, it is not a chore if you do not have to write them all at once. While coding the features, I had some reluctance about a few decisions, and I also had some ideas that were not unanimous.

The aftereffect of having all this data about the users soothes many hot topics by providing a reality check about how users are really using the system. Even when performing a thorough user session to ask how they are using the system, there is nothing like real data. For example, I was able to conclude that some users try to sort an empty grid of data. While this might not be the discovery of the century, I believe it is a great example of a behavior that no user would have raised. Another beneficial aspect is monitoring errors and exceptions and fixing them before users report them. In the last month, I have fixed many (minor) errors, and less than 15% were raised or identified by a user. When an error occurs, there is no need to consult the user, which is often hard since they can be remote, around the world. My daily routine is to sort all errors by count per day and see which ones are rising. I take the top ones, search for a user who had the issue, and look at the user’s breadcrumbs to see how to reproduce it locally on my development machine. I fix the bug and push it to production. The next day, I look back at the telemetry to see if the count is decreasing. A proactive bug-fixing approach is a great sensation. You feel much less like you are putting out fires, which allows you to fix the issue properly. Finally, with the telemetry system in place, when my day-to-day job is getting dull or I have a slump in productivity, I take the opportunity to query the telemetry data and break it into several dimensions, with the goal of shedding some light on how the system is really used and how it can be improved to provide the best user experience possible.

This article was not technical. I will follow up in a few days with more detail about how to implement telemetry with TypeScript in a React and Redux system.

Using TypeScript and React to create a Chrome Extension Developer Tool

I recently have a side project that I am dogfooding at Netflix. It is a library that handles HTTP requests by acting as a proxy that caches requests. The goal is to avoid redundant parallel calls and to avoid requesting data that is still valid, while being as simple as possible. It has few configurations, just enough to be able to customize the level of caching per request if required. The goal of this article is not to sell the library but to share how I created a Chrome extension that listens to every action and collects insight to help the developer understand what is going on underneath the UI. This article will explain how I used TypeScript and React to build the UI. For information about how the communication is executed behind the scenes, from the library to Chrome, you should refer to this previous article.


Here is an image of the extension at the time I wrote this article.

A Chrome extension for the developer tools requires a panel, which is an HTML file. React with Create React App generates a static HTML file that bootstraps React. There is a flavor of create-react-app with TypeScript that works similarly, but with TypeScript. In both cases, it generates a build in a folder that can be published as a Chrome extension.

The content of the build folder can be copied and pasted into your distribution folder along with the manifest.json, contentScript.js, and background.js files that are discussed in the article about communication between your code and the Chrome extension.

What is very interesting is that you can develop your Chrome extension without being inside the developer tools. By staying outside, you increase your development velocity because you do not need to run the build, which uses Webpack and is slow. Building also requires closing and reopening the extension, which in the end consumes time for every little change. Instead, you can mock the data and leverage the hot-reload mechanism of create-react-app by starting the server (npm run start) and running the Chrome extension as an independent website, until you are ready to test the full-fledged extension with communication coming from outside your extension.

Running the website with create-react-app is a matter of running a single command (start). However, you need to indicate to the panel’s code that you do not expect to receive messages from Chrome’s runtime. I handle the two modes by passing an environment variable on the command line. In the package.json file, I added the following code:

"start": "REACT_APP_RUNENV=web react-scripts-ts start",

Inside the React app.tsx file, I added a check that decides whether to subscribe to Chrome’s runtime message listener or to inject fake data for the purpose of web development.

if (process.env.REACT_APP_RUNENV === "web") {
    // Inject fake data in the state
} else {
    // Subscribe to postMessage event
}

Finally, using TypeScript and React is a great combination. It clarifies the message that is expected at every point in the communication, and it also simplifies the code by removing any potential confusion about what is required. Also, React is great in terms of simplifying the UI and the state. While the Data Access Gateway Chrome extension is small and does not use Redux or another state management library, it can leverage React’s state at the app.tsx level. It means that saving and loading the user’s data is a matter of simply dumping the state into Chrome’s local storage and restoring it; that is it. Nothing more.

public persistState(): void {
  const state = this.state;
  chrome.storage.local.set({ [SAVE_KEY]: state });
}
public loadState(): void {
  chrome.storage.local.get([SAVE_KEY], (result) => {
    if (result !== undefined && result[SAVE_KEY] !== undefined) {
      const state = result[SAVE_KEY] as AppState;
      this.setState(state);
    }
  });
}

To summarize, a Chrome extension can be developed with any front-end framework. The key is to bring the build result along with the required files and to make sure the generated index is referenced in the manifest.json. React works well, not least because it generates the entry point for you as a simple HTML file, which is the format required by a Chrome extension. TypeScript is not a hurdle, because the files generated by the build are JavaScript, hence no difference. React and TypeScript are a great combination. With the ability to develop the extension outside Chrome’s extension environment, you gain velocity and can rapidly have a product in a shape that can be used by your users.

How to communicate from your website to a Chrome Extension

Passing a message from a website to a Chrome extension is not a routine job. Not only is the communication between a specific piece of code in the browser and a specific extension unusual, it is also made confusing by the several potential types of extension. In this article, I’ll focus on an extension that goes into the Chrome developer tools. Similar to the “Elements” or “Network” tab, the extension will have its own tab that is populated by the website. To be more accurate, it could be any website using a specific library.

The illustration shows the concept of what is happening. The reality is a little bit more complicated. There are more communication boundaries required, which can be confusing at first. The documentation is great, but it lacks guidance for a first-time user. The following illustration shows what happens in terms of communication, and with that in mind, the flow should be easier to understand.

The part of your library that sends information to your extension is very simple. It consists of using “window.postMessage” to send an object. The extension will read and parse your payload depending on the source. For my library and extension, named Data Access Gateway, I decided to use the source name “dataaccessgateway-agent”. The name could be anything. Keep in mind that later, you will reuse the name in the extension code to verify that the message is coming from your source.

window.postMessage({
    source: "dataaccessgateway-agent",
    payload: requestInfo
}, "*");

The payload may be anything you want, but make sure it remains a plain object that is not constructed (with “new”). For example, if your payload contains a date, make sure it is not an actual Date object but a more primitive form (a string or a number). Otherwise, you will receive an exception.
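
Here is a small illustration of that constraint; the requestInfo shape is made up for the example.

// Hypothetical payload: keep every value serializable (no Date, no class instances).
const requestInfo = {
    id: "req-123",
    url: "https://api.example.com/caches",
    fetchedAtMs: Date.now() // a number, not new Date()
};

window.postMessage({
    source: "dataaccessgateway-agent",
    payload: requestInfo
}, "*");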

The next step is to configure the manifest file for the extension. The critical detail is to specify two JavaScript files: the background script and the content script. The former runs regardless of which website is active; it runs in the background of the Chrome extension from when the extension is loaded until it is unloaded. The latter is a script that the extension injects into the webpage. The injection can be targeted at a specific page or run on all webpages. In my case, the extension must receive a message from a library, hence I do not know which website might use it, so I allow the injection on every page. Because of this requirement to be available on every page, the security and the communication are more involved than most of the information you can find in the basic documentation.

{
    "name": "Data Access Gateway Developer Tool",
    "version": "1.0",
    "description": "Data Access Gateway Developer Tool that allows getting insight about how the data is retrieved",
    "manifest_version": 2,
    "permissions": [
        "storage",
        "http://*/*",
        "https://*/*",
        "<all_urls>"
    ],
    "background": {
        "scripts": [
            "background.js"
        ],
        "persistent": false
    },
    "icons": {
        "16": "images/dagdl16.png",
        "32": "images/dagdl32.png"
    },
    "minimum_chrome_version": "50.0",
    "devtools_page": "index.html",
    "content_security_policy": "script-src 'self' 'unsafe-eval'; object-src 'self'",
    "content_scripts": [
        {
            "matches": [
                "<all_urls>"
            ],
            "js": [
                "contentScript.js"
            ],
            "run_at": "document_start"
        }
    ]
}

The manifest file asks for permissions and specifies “index.html”, which is the file loaded when the Chrome developer tools panel is opened. We will come back to the HTML file later. The important parts are background.js and contentScript.js. Both files can be renamed as you wish. Before moving on, it is important to understand that the communication flows in this particular order: postMessage -> contentScript.js -> background.js -> dev tools HTML page. The core of the code will be in the HTML page, and the rest is just a recipe that must be followed to comply with Chrome’s security.

The contentScript.js is the file injected into the webpage. The sole purpose of this file is to listen for messages passed by “window.postMessage”, check the payload to make sure it is the one we are interested in, and pass it along to Chrome’s runtime. The following code registers a “message” listener when the webpage loads. The script captures “postMessage” calls and checks the source. When it is the agent name defined in the previous step, we invoke sendMessage from Chrome’s runtime. The invocation passes the message to the background.js file.

window.addEventListener("message", (event) => {
    // Only accept messages from the same frame
    if (event.source !== window) {
        return;
    }

    var message = event.data;

    // Only accept messages that we know are ours
    if (typeof message !== "object" || message === null || !!message.source && message.source !== "dataaccessgateway-agent") {
        return;
    }
    chrome.runtime.sendMessage(message);
});

The next step is to listen to Chrome’s runtime messages. More code is required. There is a collection of tabs that handles the multiple-tabs situation, to know where the message comes from. There are two listeners: one handles incoming messages and one handles new connections from the developer tools. The first dispatches the message to the proper tab; the second subscribes and unsubscribes the tab.

let tabPorts: { [tabId: string]: chrome.runtime.Port } = {};
chrome.runtime.onMessage.addListener((message, sender) => {
    const port = sender.tab && sender.tab.id !== undefined && tabPorts[sender.tab.id];
    if (port) {
        port.postMessage(message);
    }
    return true;
});

chrome.runtime.onConnect.addListener((port: chrome.runtime.Port) => {
    let tabId: any;
    port.onMessage.addListener(message => {
        if (message.name === "init") { // set in devtools.ts
            if (!tabId) {
                // this is a first message from devtools so let's set the tabId-port mapping
                tabId = message.tabId;
                tabPorts[tabId] = port;
            }
        }
    });
    port.onDisconnect.addListener(() => {
        delete tabPorts[tabId];
    });
});

The port.postMessage sends the payload for the last time. This time, it will be within reach of your Chrome developer tools extension. You may remember that in the manifest file we also specified an HTML file. This file can specify a JavaScript file that will listen to the messages from the background.js script. I am developing the Data Access Gateway Chrome extension with React, so the index.html starts the index.tsx, which attaches the app.tsx, which has the listener in its constructor.

this.port = chrome.runtime.connect({
        name: "panel"
});

this.port.postMessage({
        name: "init",
        tabId: chrome.devtools.inspectedWindow.tabId
});

this.port.onMessage.addListener((message: Message) => {
        if (message.source === "dataaccessgateway-agent") {
          // Do what you want with the message object
          // E.g. this.setState(newState);
        }
});

chrome.devtools.panels.create(
        "Data Access Gateway",
        "images/dagdl32.png",
        "index.html"
);

Still, quite a few lines of code are needed before actually doing something in the Chrome extension. The first step is to connect to the tab (port). Then, we initialize the communication by sending a post message. Next, we start listening to incoming messages on the connected port. Finally, we invoke the creation of the panel. As you might have seen, the “addListener” is strongly typed with the object I sent from the initial library call; that is right, TypeScript is supported in each of these steps. You can see all the details, in TypeScript, in the GitHub repository of the Data Access Gateway Chrome Extension.

To conclude, the transportation of your object from your library (or website) to the Chrome developer panel is not straightforward. It requires multiple steps, which can fail in several places. A trick I learned while developing the extension is that “console.warn” is supported. You can trace and ensure that the data is passing through as expected. Also, another debugging trick: you should undock the extension (to have the developer tools in a separate window), which allows you to press Cmd+Opt+I on Mac or F12 on Windows to debug the Chrome developer tools themselves. This is the only way not only to see your tracing but also to set a breakpoint in your code.