How to investigate Azure Redis timeout errors

There are many official resources about troubleshooting Azure Redis Cache, such as the troubleshooting guide and the Azure Redis performance FAQ, but I couldn't pinpoint my problem with them. So I'll share the steps I took to figure out mine. Your problem might not be like mine, but here is my situation: I didn't have any problem locally; the problems started to surge once I deployed to production on Azure. My setup was pretty basic, with a Standard Web App (S1) and the smallest Azure Redis instance (C0). Even after upgrading to a C1 instance, which is considerably more powerful, I was still getting the Redis "timeout performing" error.

Before continuing, note that I read that the C0 instance is for testing purposes only, not for production. The C1 is 4x bigger in size; this wasn't an issue for me at first, but I realized I was hitting the C0's 250 MB limit, which was causing Redis to evict items from the cache. This wasn't a major problem in my case. The biggest difference between C0 and C1 is the bandwidth, which goes from 0.625 MB/s to 12.5 MB/s, 20x faster. The C1 also handles 12,000 requests per second, twice as many as the C0. In my case, I kept the upgrade even though it didn't fix the issue.

The second modification was to change the number of threads. By default, the minimum number of thread-pool threads matches the number of processors on the web server. An S1 instance has a single core, so requests to Azure Redis were effectively sent from a single thread; requests queue up and time out. To change the minimum, set the value in Global.asax.cs in the Application_Start method. The right value is not clear in the documentation; from my reading, it should stay under 100 threads before performance issues appear. I decided to go with 50, which is 50 times the default configuration.

ThreadPool.SetMinThreads(50, 50);
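For context, here is a minimal sketch of where that call could live, assuming a standard ASP.NET Global.asax.cs (the two arguments set the minimum worker threads and I/O completion port threads):

using System.Threading;

public class MvcApplication : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        // Raise the thread pool minimums so a burst of Redis calls does not
        // wait for the pool to grow one thread at a time (default minimum = core count).
        ThreadPool.SetMinThreads(workerThreads: 50, completionPortThreads: 50);
    }
}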

Since timeouts were the problem, I also changed my connection string to extend the time allowed before timing out. I set something much larger than the default value of 1 second: 15 seconds, which I will probably dial back down in the future.

xxx.redis.cache.windows.net:6380,password=xxxx,ssl=True,abortConnect=False,connectRetry=5,connectTimeout=15000,syncTimeout=15000
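If you prefer to keep these settings in code, StackExchange.Redis exposes the same knobs through ConfigurationOptions; a sketch (host and password are placeholders):

using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "xxx.redis.cache.windows.net:6380" },
    Password = "xxxx",            // placeholder
    Ssl = true,
    AbortOnConnectFail = false,   // abortConnect=False
    ConnectRetry = 5,
    ConnectTimeout = 15000,       // milliseconds
    SyncTimeout = 15000
};
var connection = ConnectionMultiplexer.Connect(options);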

I was already using a lazily loaded ConnectionMultiplexer, as described in the official documentation. To allow parallel calls while a connection is busy, I created a pool of lazy ConnectionMultiplexer instances and cycled between them round-robin style.

The code looks like:

//Fields in my Redis Cache class:
private const int POOL_SIZE = 10;
private static int connectionIndex = -1; //Incremented before use, so the first call lands on 0.
private static readonly Object lockPoolRoundRobin = new Object();
private static readonly Lazy<ConnectionMultiplexer>[] lazyConnection;

private static ConnectionMultiplexer Connection
{
	get {
		lock (lockPoolRoundRobin)
		{
			//Move to the next connection in the pool, wrapping around at the end.
			connectionIndex++;
			if (connectionIndex >= POOL_SIZE)
			{
				connectionIndex = 0;
			}
			return lazyConnection[connectionIndex].Value;
		}
	}
}

//Static constructor, executed once on the first use of the class
static RedisCache()
{
	lock (lockPoolRoundRobin)
	{
		lazyConnection = new Lazy<ConnectionMultiplexer>[POOL_SIZE];
		var connectionStringCache = System.Configuration.ConfigurationManager.AppSettings["CacheConnectionString"];
		for (int i = 0; i < POOL_SIZE; i++)
		{
			//Each slot connects lazily on its first access.
			lazyConnection[i] = new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(connectionStringCache));
		}
	}
}

I have several methods that use Connection, which returns the next connection in the pool on each call.
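For example, one of those methods might look like this (a sketch; GetDatabase, StringGet and StringSet are the StackExchange.Redis calls, the method shapes are mine):

//Inside the same RedisCache class:
public static string GetValue(string key)
{
    // Connection hands out the next multiplexer in the pool (round-robin).
    IDatabase database = Connection.GetDatabase();
    return database.StringGet(key);
}

public static void SetValue(string key, string value, TimeSpan expiration)
{
    IDatabase database = Connection.GetDatabase();
    database.StringSet(key, value, expiration);
}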

At this point, I was in a better situation, but some small requests were still failing; by small, I mean requests returning a few kilobytes of data. I decided to look at the portal and realized that I was almost reaching the 1 GB limit of Redis! I was quite surprised, because with the production data I was around 200 MB most of the time. This is where I learned about the --bigkeys option of redis-cli. This command reports which keys are taking a lot of space. From my manual inspection, I knew I had a few hundred keys in the 700 KB to 2 MB range. However, with that tool, I found dozens and dozens of entries around 18 MB. The command looks like this:

"D:\Program Files\Redis\redis-cli.exe" -h xxxx.redis.cache.windows.net -p 6379 -a PASSWORDXXXXX --bigkeys

[Screenshot: redis-cli --bigkeys output]

From there, I was able to do some optimizations and reduce my biggest keys to less than 1 MB.

At that point, I thought I was in a better situation, but a few hours later the problem was still there, and I noticed a lot of logs with the Redis exception. I finally found the cause: since I am using Azure deployment slots, webjobs were executing on the real website as well as on the staging slot. That meant twice as much traffic coming from the webjobs; instead of about 18 jobs, I had 36 jobs doing everything twice at the same time, creating a lot of traffic on Redis. To stop your website slot from executing webjobs, you need to add a setting in the portal: WEBJOBS_STOPPED with the value 1. Be sure to mark it as a slot setting so it won't get transferred when you swap from staging to production.
[Screenshot: the WEBJOBS_STOPPED slot setting]

I was still intrigued to see, on the Azure portal, that after a few hours the Redis size was reaching the 1 GB limit; in dev, it never went above 100 MB. I ran --bigkeys again and saw some huge entries. The culprit was Entity Framework. Over time, users were adding a lot of data, and some entities referenced each other. For example, A->B->A[], and A also referenced C and D, where D->A[], and so on. The serialized tree was huge even though the object graph itself wasn't. With some modifications, I got back to around 150 MB. I was still getting errors, and this is when I decided to capture, with Application Insights, the size of every value I set in Redis.
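Capturing those sizes is a one-liner with the Application Insights SDK; here is a sketch of the idea (the method shape and metric name are mine):

using Microsoft.ApplicationInsights;

private static readonly TelemetryClient telemetry = new TelemetryClient();

public static void SetValue(string key, string serializedValue, TimeSpan expiration)
{
    // Track the payload size per key prefix to spot oversized cache entries.
    telemetry.TrackMetric("RedisSetSizeBytes:" + GetKeyPrefix(key), serializedValue.Length);
    Connection.GetDatabase().StringSet(key, serializedValue, expiration);
}

private static string GetKeyPrefix(string key)
{
    int separatorIndex = key.IndexOf(':');
    return separatorIndex > 0 ? key.Substring(0, separatorIndex) : key;
}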

Here is a snapshot covering a few hours:
[Screenshot: Application Insights, Redis cache key value sizes]

We can see the key prefixes with the percentiles and the counts. Two of them look problematic even though their volume is low (16 and 3). It is still odd that, with 50 threads and a pool of multiplexers, these could cause timeouts, but it was a good indication that some keys were carrying too much data. Two actions fixed this problem. The first was to serialize with a maximum depth; I'll write a future article about how to make Json.Net serialize only to a limited depth. This is critical if you are using Entity Framework, because references can go very deep. The second was to compress the data sent to Azure Redis; I'll also write a future article on how this can be implemented. In short, I compress every object sent to Redis with LZ4 encoding, which is a very fast compression that cuts around 75% of the size. Here is the result, in kilobytes, for the same data with these two changes. The result is stunning: a lot of the entries are under 1 KB, and most of the data is under the suggested limit of 100 KB per key.

[Screenshot: cache entry sizes after compression]
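In the meantime, here is the general shape of the compression, a sketch assuming the K4os.Compression.LZ4 NuGet package (any LZ4 binding would work the same way; the method names are mine):

using System.Text;
using K4os.Compression.LZ4;
using Newtonsoft.Json;

public static byte[] CompressForCache<T>(T objectToCache)
{
    // Serialize to JSON, then LZ4-compress the UTF-8 bytes before the Redis set.
    string json = JsonConvert.SerializeObject(objectToCache);
    return LZ4Pickler.Pickle(Encoding.UTF8.GetBytes(json));
}

public static T DecompressFromCache<T>(byte[] cachedBytes)
{
    string json = Encoding.UTF8.GetString(LZ4Pickler.Unpickle(cachedBytes));
    return JsonConvert.DeserializeObject<T>(json);
}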

The next step was to look at the number of gets and sets. In some webjobs, the call count was very high for small values, and those values were used within the same webjob. To get around 500 values from the cache, about 10,000 calls were made. The problem is that with that many calls, even small ones, the latency between the web server and the cache server adds up. Two code changes helped: using Redis's MGET to fetch several keys in a single call, and storing the values in a memory cache for a short time (I chose 5 minutes) with a small memory cap (I chose 100 MB). The architecture stayed Controller -> Service Layer -> Accessor Layer -> Repository Layer, but instead of the accessor layer containing only the Redis logic, it now also uses System.Runtime.Caching.MemoryCache. The first round trip to Redis still takes some time, but after that it's blazing fast.
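Here is a sketch of that combination: StringGet with an array of keys issues a single MGET, and a System.Runtime.Caching.MemoryCache capped at 100 MB fronts it with a 5-minute expiration (the method name is mine):

using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Runtime.Caching;
using StackExchange.Redis;

private static readonly MemoryCache localCache = new MemoryCache("RedisFrontCache",
    new NameValueCollection { { "CacheMemoryLimitMegabytes", "100" } });

public static RedisValue[] GetMany(string[] keys)
{
    var results = new RedisValue[keys.Length];
    var missingIndexes = new List<int>();

    // First, serve whatever the in-process cache already has.
    for (int i = 0; i < keys.Length; i++)
    {
        object cached = localCache.Get(keys[i]);
        if (cached != null) { results[i] = (RedisValue)cached; }
        else { missingIndexes.Add(i); }
    }

    if (missingIndexes.Count > 0)
    {
        // One MGET for every key the memory cache did not have.
        RedisKey[] redisKeys = missingIndexes.Select(i => (RedisKey)keys[i]).ToArray();
        RedisValue[] values = Connection.GetDatabase().StringGet(redisKeys);
        for (int j = 0; j < missingIndexes.Count; j++)
        {
            results[missingIndexes[j]] = values[j];
            localCache.Set(keys[missingIndexes[j]], values[j], DateTimeOffset.Now.AddMinutes(5));
        }
    }
    return results;
}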

I still had some jobs taking more time than I wanted. The longer a job runs, the more it hits the cache. I added a lot of logging and realized that serialization was still taking significant time before each Redis access. Two solutions were possible: serialize less deeply, or parallelize the sets. To limit the serialization depth, you need to trick Json.Net (which is what I use). This can be done with a ContractResolver that counts how many levels deep the serialization goes. This is crucial if you are using Microsoft Entity Framework, because references go very deep. I found a sweet spot at 5 levels deep; performance improved by 70% on average. The parallel approach was quick to implement, but has the drawback of using the same thread pool as the web requests, so it impacts the front-end more. Microsoft recommends running heavy webjobs on a separate web instance; for the moment, I have not. The code looks like this:

Task.Factory.StartNew(() => {
    // Serialization occurs here + set to Redis
}, CancellationToken.None
 , TaskCreationOptions.DenyChildAttach
 , TaskScheduler.Default);

It's fire and forget: we store and continue. One of my biggest jobs, which was taking 12 minutes, went down to 4 minutes. 3x faster! This particular part takes about 2 minutes to run in full, which is not that bad. The second drawback is the possibility of bringing back some Redis timeouts, since we issue a lot of calls to Redis. The third drawback, which made me roll back this change, is the one mentioned right before the code snippet: the web was getting slow. All threads were taken by this job, leaving nothing for web requests. I put the idea on ice; I think that with a restrained pool of threads we could have something interesting here.
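For reference, one possible shape for that restrained pool: a sketch using a SemaphoreSlim to cap how many background cache writes run at once (the cap of 4 and the method are purely illustrative):

using System;
using System.Threading;
using System.Threading.Tasks;

private static readonly SemaphoreSlim setThrottle = new SemaphoreSlim(4); // at most 4 concurrent sets

public static Task SetValueInBackground(Action serializeAndSet)
{
    return Task.Run(async () =>
    {
        await setThrottle.WaitAsync();
        try
        {
            serializeAndSet(); // serialization + Redis set happen here
        }
        finally
        {
            setThrottle.Release(); // free a slot for the next background write
        }
    });
}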

With all these modifications, I saw a major improvement in Redis timeouts. Compression was a key element in reducing the overall load as well as the size of each serialized element. In fact, memory usage went from 160 MB, climbing up to 1 GB at the worst, down to 120 MB, and finally to 16 MB.
[Screenshot: Azure Redis memory usage over time]

Most Azure webjobs' performance improved significantly. Here is an example of a task that went from around 50 minutes to a stable 13 minutes. Even if the performance is far from the target, the reduction is noticeable, and the job is now under the trigger threshold of 20 minutes, which is a first step in the right direction.

[Screenshot: webjob duration improvement]

What's next? The idea of threaded/asynchronous sets is really appealing, since setting a value into the cache is not something we should need to wait for. Limiting the number of items to cache is another possible avenue, as is a faster serialization process.
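On the serialization side, here is the shape of the Json.Net depth trick mentioned earlier, ahead of the full article. It is a sketch: the class names are mine, and the idea is a writer that tracks the current depth combined with a ContractResolver that stops serializing properties past the maximum:

using System;
using System.IO;
using System.Reflection;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

public class DepthTrackingJsonWriter : JsonTextWriter
{
    public DepthTrackingJsonWriter(TextWriter writer) : base(writer) { }
    public int CurrentDepth { get; private set; }
    public override void WriteStartObject() { CurrentDepth++; base.WriteStartObject(); }
    public override void WriteEndObject() { CurrentDepth--; base.WriteEndObject(); }
}

public class MaxDepthContractResolver : DefaultContractResolver
{
    private readonly Func<bool> includeCurrentLevel;

    public MaxDepthContractResolver(Func<bool> includeCurrentLevel)
    {
        this.includeCurrentLevel = includeCurrentLevel;
    }

    protected override JsonProperty CreateProperty(MemberInfo member, MemberSerialization memberSerialization)
    {
        // Skip any property once the writer is past the maximum depth.
        JsonProperty property = base.CreateProperty(member, memberSerialization);
        Predicate<object> existing = property.ShouldSerialize;
        property.ShouldSerialize = o => includeCurrentLevel() && (existing == null || existing(o));
        return property;
    }
}

public static string SerializeWithMaxDepth(object value, int maxDepth)
{
    using (var stringWriter = new StringWriter())
    using (var jsonWriter = new DepthTrackingJsonWriter(stringWriter))
    {
        var serializer = JsonSerializer.Create(new JsonSerializerSettings
        {
            ContractResolver = new MaxDepthContractResolver(() => jsonWriter.CurrentDepth <= maxDepth)
        });
        serializer.Serialize(jsonWriter, value);
        return stringWriter.ToString();
    }
}

With the maximum depth set to 5, as mentioned above, the Entity Framework reference chains stop inflating the serialized payload.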

Publishing your website from VSTS

Connecting VSTS to Azure is not well documented. You will find a few articles about configuring a step in a VSTS build, and some others about steps in the old Azure portal. After a few hours, I realized that deploying VSTS code to Azure is easier than I initially thought. First, you need to create a service endpoint in VSTS: in the VSTS settings, select "New service endpoint". Multiple options are available; the one needed is "Azure Classic". The name of the connection doesn't matter; what is important is the subscription id and name. I chose to authenticate with the publish settings file.

[Screenshot: creating the Azure Classic service endpoint in VSTS]

Once done, go to Azure and search for "team" in the Browse panel. Select Visual Studio Team Services and connect your MSDN subscription.

[Screenshot: connecting Visual Studio Team Services in the Azure portal]

The last step is to select the deployment source under your website's settings. You can choose from multiple sources, but only one at a time can be active. Choose VSTS.

[Screenshot: choosing VSTS as the deployment source]

From there, if you push something, Azure will get the code, build it, and deploy it.

Publishing to Azure Website Slots

Azure lets you create slots for a website once the website exists. First of all, what is an Azure website slot? Slots are a way to create a clone of your website on a temporary site. Having this clone lets you test your code without affecting production. A slot can have custom application settings, like pointing to a second database if you want. It also has the advantage of warming up without affecting your users: the first call to a website is always slower, and with a slot, it's slow on the clone instead. You can configure the slot to warm up completely before auto-swapping into production, where your users see it, or you can swap manually when ready, which lets you do some manual testing on the slot.

The first step is to create the slot, which takes less than a minute. You can do it directly in the portal.

[Screenshot: Azure deployment slots]

The second step is to deploy to this slot instead of the main production one. This was the step that wasn't clear to me: you have to publish to the slot and then swap. To do so, you need the slot's publish settings, which can be downloaded from the Azure portal. Go to the main website, click slots, select the one you want to deploy to, and click Get Publish Profile.

[Screenshot: downloading the slot's publish profile]

That is how to deploy a website on Azure with slots: publish to a specific slot, then handle the swap manually or automatically (auto-swap). A special note: if you want to use Azure deployment slots, you need to go into the slot itself and configure the deployment directly on the slot, not on the main website.

Learnings from my First Enterprise Project with Facebook React

I played with React in 2015 during a Microsoft hackathon, without any application architecture like Flux. That first experience didn't impress me. The page I created was a Visual Studio Team Services hub extension with some textboxes, comboboxes, and graphics, with information pushed through SignalR. The hackathon was 4 days of development, and like anything new, some ramp-up was needed. Nevertheless, it was working at the end, though it left a sour taste in my mouth. I couldn't really pinpoint why I wasn't in love with React when everybody seemed to love it so much. The argument about a clear separation between components and logic didn't convince me, since we have been able to do that for many years with some rigor; it is also a weak argument because you can create messy code with React too and put logic in the components. In any case, I didn't touch React again until this spring, with a new project in Visual Studio Team Services where I would design a whole page in Facebook's React. This time, I wanted to leverage it as much as possible and went with an application architecture that is a mix of Flux and Reflux.

Application Architecture a.k.a. Flux Cycle

This was the first big decision: do we use Flux, Reflux, Redux, or any other variant out there? The community seems to prefer Redux, but it comes with heavier requirements: everything needs to be immutable, and there are more restrictions about where to change the information. Today I understand its advantages better, but it is still more work. Also, be aware that we are using TypeScript, so it's more than a single NPM command to get the boilerplate for a new project. I decided to take a simpler approach that doesn't need much and that could be done in TypeScript with just the React and React DOM libraries. It ended up being a mix of Flux and Reflux: Flux because we use action creators, and Reflux because we do not use a dispatcher. Why action creators? Because we wanted all business logic and Ajax calls to the server to be made neither by the UI components nor by the store. Why no dispatcher? Because we wanted a direct, explicit dependency between actions and the store.

FluxCycle

The main concepts are:

  • The store knows where to set new values. It's the only one that can change store values.
  • Components are dumb. No logic; they just read from the store and display.
  • Action creators are the ones that call other TypeScript classes for logic.
  • Actions are nothing more than an observer pattern. Action creators are the only ones that invoke them; the store is the only one listening to them.
  • Almost a single store. Some shared concepts are possible, but we try to avoid them; every Flux cycle has one main store.
  • The page can have as many Flux cycles as required. We try to split them by business domain.

Store

The store did not start out immutable. We rapidly found that, even with good rigor, it was easy for the action creators to change values. Since the data was passed to the components by reference, data handed from a component to an action creator could be modified and thus changed silently in the store. This caused problems with the shouldComponentUpdate method: the current properties always looked identical to the next properties, making them impossible to compare for performance optimization. We had to refactor the code. We couldn't simply pass cloned data ($.extend()) because it's expensive, and also because the classes stored in the store had methods (more detail on that later).

The store has multiple methods to set information in the tree of data. In Redux, this is the responsibility of the reducer. In our architecture, we have one method per "reducer", or put more plainly: one method per data manipulation.

We started with one big store for all the pages. But our page has a wizard to create data, plus multiple views, and it was heavy to handle all the information for all cases in one place. We ended up with one main store that handles the application views, one store per view, and one store for the creation wizard. The Flux cycle handling the creation of data minds its own business: every keystroke runs a full cycle, doing validation, getting data with Ajax (from the action creator), and so on; when the data is ready to save, it passes the information to the main page's Flux cycle. Communication between Flux cycles is done via shared actions that the main Flux cycle listens to. This design has been good since day one: clean separation between domains.

Components

At first, some UI logic lived in the components. That went away very fast. It's harder to unit test, and even when logic seems to belong to a single component, most of the time that turns out not to be true. It's also easier to optimize logic that lives outside the component: you can cache data, do some batch processing, and so on.

I was, and still am, a strong believer in not putting layout styles in separate CSS files when working with Flux. From my experience with rich web components, a lot of position and dimension logic needs to be calculated dynamically in TypeScript, and having it in CSS makes that impossible. So our rule is to place all left, top, right, bottom, margin, and padding values in the component as inline styles. The CSS only handles theme information, like colors, and the main structure of the page.

We are not using any mixins. Facebook has officially said this is not a good pattern, and even before that announcement I found it added complexity for nothing. Shared logic goes into TypeScript classes where appropriate.

Action Creators

We use the action creator like a "controller" in the MVC pattern. It's the one that receives calls from the components: you click something, and the component calls the action creator. The action creator is easily testable, since it takes some input values and calls actions. Like an MVC controller, it is not heavy in code; it delegates to business logic classes.

Action creators handle synchronous and asynchronous calls. In fact, any call from a component that needs information from the server (an Ajax call) always sends two actions to the store: one that is handled immediately, which usually makes the UI show a progress bar, and one when the data comes back from the server, which removes the progress bar and shows the data.

Actions

We have one file per Flux cycle that holds all the actions for that cycle. It usually has a dozen actions, which are really just events. Nothing to test here; a pretty dry file.

Immutable

Immutability is discussed a lot, and I didn't take it as seriously as many articles say to. However, I found the store harder to handle without it. Even if the Facebook documentation says you can leverage shouldComponentUpdate without being immutable, it's not straightforward: comparing values doesn't work. The only way we found was to add properties to the classes in our store that indicate when to update or not, with the logic defined in the action creators. We are still on the road to making everything more and more immutable, since our UI changes intensively.

Object Oriented vs Functional

This is the mistake I made. I am a strong object-oriented advocate, and I designed our model classes (used in the store) to have properties, but also methods and computed properties. Plain properties are fine; the other two are wrong with React Flux. Computed properties shouldn't be in the class, because their logic can end up modifying the object in some situations (as with dates). Methods are bad for two reasons. First, no one other than the action creators should call them; you can hide them behind an interface, but that only delays the problem. It's delayed because if you need to clone the object, the methods won't be copied, and you end up with plain data without the prototype methods. In the end, you want your stores to hold plain properties only, with all the logic in classes that only the action creators can invoke. That is simple to test, prevents the store or the components from invoking any logic, and can easily be made immutable.

Tests

I am a strong believer in automated tests. React helps, but like any framework, it can become a mess too. Since we respect the architecture and all logic lives in action creators and business logic classes, we can unit test easily. It still requires some skill to write maintainable tests and to keep classes and methods small so there isn't too much to stub, but that is true of any framework, not just React. Our strategy is to have over 90% coverage on action creators and business logic. We have good coverage on the classes mapping the server contracts to our store/model classes. The store is tested to be sure the information sent by actions is stored in the right place. The priority is business logic classes, then action creators, stores, and components. Components are really dumb and take longer to test; since we can easily see the end result in the browser, we do minimal testing there. Of course, we would like tests everywhere, but the reality is that we need to ship software at some point.

Summary

The experience with React can be frustrating. It adds a lot of overhead: typing in a text box triggers a full cycle, and there is a lot of boilerplate for each complex new section (a new Flux cycle). Many things could be done more easily by just changing the UI with jQuery; however, that often leads to more spaghetti code too. That said, we often forget that most websites are not generating billions like Facebook. React is great, but not for small applications or small intranet websites. It's good for intensive UIs that change a lot from different kinds of sources. I also find the time to get started with React is higher than with plain TypeScript and jQuery, which most people know. The integration of React into this new page went well, even though the rest of the application isn't React; this is great news if you have legacy software and want to modernize it slowly. React and its various application architectures (the Flux models) encourage the good practice of dividing your software, but any good architect can divide an application with any technology. I understand why React is popular with its virtual DOM, but as I said, most websites don't need high-frequency updates.

How to redirect HTTP to HTTPS only in production

You are working locally without an SSL certificate, and in production with one. The simplest way to handle both cases is a configuration that switches depending on whether you are on your production server or your local dev box. Here are two solutions. The first one is well known on the Internet but requires IIS with the URL Rewrite module. That is not a problem on Azure, and even locally it's not a big deal because the module can be downloaded from the IIS manager console, under Web Platform; but you won't even need to do that. The first solution is to change the web.release.config to add the redirection to the deployed files only. This is done like this:

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.webServer>
    <rewrite xdt:Transform="Insert">
      <rules>
        <rule name="Redirect HTTP to HTTPS">
          <match url="(.*)" />
          <conditions>
            <add input="{HTTPS}" pattern="off" ignoreCase="true" />
          </conditions>
          <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent"/>
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>

The second solution is simpler because it's just a change in code. However, the request then has to go through the ASP.NET pipeline, which is more demanding for your web server; you should redirect as early as you can, and the IIS level is the best place. Nevertheless, it's always good to have both solutions on hand.

In global.asax.cs:

// Inside RegisterGlobalFilters(GlobalFilterCollection filters):
if (!HttpContext.Current.IsDebuggingEnabled)
{
    // Only force HTTPS when not running a local debug build.
    filters.Add(new RequireHttpsAttribute());
}

TypeScript comparison of null and undefined

Checking whether something is null or undefined looks like a trivial task, but it takes on different colors depending on whom you talk to. This article tries to make it simple to understand. Since TypeScript is built on top of JavaScript, it inherits the quirks of JavaScript comparison. We will test four types, checking whether a value is null, undefined, or really the value intended. All tests are built with this simple class and utility function:

class TestClass {
    public propString: string;
    public propNumber: number;
    public propBoolean: boolean;
    public propObject: TestClass;
}

function show(propertyName: string, operator: string)
{
    var output = document.getElementById("output");
    output.innerHTML = output.innerHTML + ("<p class='one-result'>" + propertyName + " is null true with " + operator + "</p>");
}

var trr = new TestClass();

First, let's test the boolean property when nothing is assigned to it. We expect the value to be undefined, which we validate with ==, ===, truthiness (with and without !!), and typeof:

if(trr.propBoolean === null)
{
    show("propBoolean", "=== null");
}
if(trr.propBoolean == null)
{
    show("propBoolean", "== null");
}
if(trr.propBoolean)
{
    show("propBoolean", "no operator");
}

if(!!trr.propBoolean)
{
    show("propBoolean", "!!");
}

if(trr.propBoolean === undefined)
{
    show("propBoolean", "=== undefined");
}
if(trr.propBoolean == undefined)
{
    show("propBoolean", "== undefined");
}

if(typeof(trr.propBoolean) === "undefined")
{
    show("propBoolean", "type of === undefined");
}

The output is as expected, except that the truthiness checks (no operator and !!) do not detect undefined for a boolean value:

propBoolean is null true with == null
propBoolean is null true with === undefined
propBoolean is null true with == undefined
propBoolean is null true with type of === undefined

If we redo the boolean test, this time setting the value to null (trr.propBoolean = null;), we get this result:

propBoolean is null true with === null
propBoolean is null true with == null
propBoolean is null true with == undefined

This result is more surprising. As expected, the typeof check no longer triggers, since the value is now null rather than undefined. However, the == undefined comparison is true, because loose equality treats null and undefined as equal. The direct check and the !! still do not trigger, since null is falsy.

Setting the value to true, the no-operator and !! checks let it through; setting it to false, nothing is printed. So, considering only the boolean case, if you want to be sure that true or false is actually set in the variable, you must use 1) == null or 2) == undefined.

Number

Running the test with a number gives the same results as with a boolean, both when comparing to nothing (undefined) and to null.

propNumber is null true with == null
propNumber is null true with === undefined
propNumber is null true with == undefined
propNumber is null true with type of === undefined

and :

propNumber is null true with === null
propNumber is null true with == null
propNumber is null true with == undefined

The problem comes when we set the number to a value like 1: it passes the truthiness test with if(numberOfValue1) and also with if(!!numberOfValue1), while a legitimate value of 0 would fail both.
So the only reliable way to check whether a number is null or undefined is, again, 1) == null or 2) == undefined.

String

Comparing a string produces the same results for undefined and null as booleans and numbers. The problem is that if the string's value is "0" or "true", it passes into if(stringValueWithZero) or if(stringValueOfTrue):

propString is null true with no operator
propString is null true with !!

It means that, once again, the only way to truly verify that a string is null or undefined is the equality operator.

Object

The last test compares a custom object. As expected, the results for undefined and null are the same as before, but here you can also use if(trr.propObject) and if(!!trr.propObject) without problem. For objects, you have more options.

Summary

From all the scenarios covered in this article, it's obvious that the only safe way to check whether a value has been set, or whether it is set to null, is to compare with double equals against null or undefined. I prefer == null because, in JavaScript, undefined can be overwritten with a value, causing false positives. You can always use triple equals to check null and undefined separately, but that is just more typing for the same result. You can play around with the tests at this link: http://typescript.io/aEeZGxas0wg. If you are interested in all the other cases where comparing with if(yourVariable) can be dangerous, look at https://dorey.github.io/JavaScript-Equality-Table/, which shows multiple scenarios where a true value occurs when not expected. Here is a glimpse:
[Image: JavaScript equality table]

You can find many discussions on StackOverflow about checking values in TypeScript or JavaScript, and they lead to the same conclusion: don't rely on truthiness shortcuts; compare with double ==. TypeScript supports this in strict mode, as you can see in this pull request involving Anders Hejlsberg (core developer of TypeScript).