Functionize Presents at UCAAT
An explanation of Canary Testing, recently presented by Functionize CEO Tamas Cser at UCAAT, and how it can be integrated into your CI/CD pipeline to gain valuable insights into whether a particular user journey is problematic for your application.
This past week, Functionize CEO Tamas Cser was proud to have the opportunity to present at UCAAT – the User Conference on Advanced Automated Testing. Now in its 6th year, UCAAT is one of the biggest conferences in this field. It is organized by ETSI and attracts sponsorship from global firms including Sogeti (a part of Capgemini).
UCAAT aims to connect the users of test automation tools with suppliers and with academia, thereby creating new synergies and opportunities for knowledge sharing. This year, topics covered an impressively broad spectrum, from model-based testing (MBT) in the automotive industry to approaches for automating the testing of virtual reality systems.
Tamas last presented at UCAAT two years ago. Back then, he was amazed to find that no other company was talking about applying AI to solving the issues in test automation. Now, two years later, people are starting to wake up to the power of AI, but many still seem stuck in the automation dark ages.
Tamas’s talk was titled “Advanced Anomaly Detection in Canary Testing: Experimental Methods, LSTM Models and Hybrid Approaches.” This blog will summarize Tamas’s talk for those of you who were unable to attend the conference.
What is Canary Testing?
We are all familiar with the image of the caged canary that miners used to detect pockets of dangerous gas before the gas could kill them. This was the inspiration behind Canary Testing. Rather than exposing your entire user base to new code, in Canary Testing a small percentage of users (maybe 5%) is moved to the new code. This allows you to compare their experience with that of users still on the old version. In effect, you are using your real users to do your regression testing for you.
The benefits of this are clear – your new code is being tested on a far wider range of devices, running on your production servers, under real-world conditions and being exposed to all the “unexpected” behaviors that only real users can come up with! Also, by comparing the new code with the existing code you can both spot any unexpected behaviors and record any changes/improvements in performance. It is important to note that you still need to do basic testing to ensure the code you are going to release isn’t completely broken, but canary testing adds a new dimension of testing at scale.
How do you do traditional Canary Testing?
All the big tech companies such as Google, Amazon and Facebook routinely use Canary Testing when they are launching new features. Indeed, they often do a progressive version of this, where they launch a feature to a small group, then a bigger group, then everyone served from one data center, then an entire region, before repeating the process in the next region.
Canary Testing is easy to implement when you are using load balancers (something most people will be doing once they are operating at any sort of scale). You just need to configure things so that a given percentage of your traffic is diverted to the servers/containers running the new code. You then use instrumentation to compare the relative performance. Typically you are looking for things like response times, server/database load, compute resources consumed and, of course, any errors. Hopefully, before you launch code for Canary Testing you are pretty sure there are no catastrophic bugs, but you clearly need to be prepared to rapidly switch users back to the old code if you do find a serious issue.
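In practice the split usually lives in your load balancer or service mesh configuration, but the underlying idea is simple enough to sketch. Here is a minimal, purely illustrative Python example of hash-based bucketing, assuming you want each user to stay on the same version for their whole session (the 5% figure and function name are just placeholders):

```python
import hashlib

CANARY_PERCENT = 5  # share of users routed to the new code

def route_for(user_id: str) -> str:
    """Deterministically assign a user to the canary or stable pool.

    Hashing the user ID (rather than choosing randomly per request) keeps
    each user on the same version across all of their requests.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

# Roughly 5% of user IDs will hash into the canary pool
print(route_for("user-12345"))
```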
How can you automate Canary Testing?
Automating Canary Testing may sound like a simple task. Surely all that is needed is some way to automatically compare the experience of existing users with that of users on the new code? Well, yes, creating an AI model that learns how your app performs normally and can therefore identify when it is behaving differently is, as AI modelling goes, simple.
But the problem is that the output of such a model is just a flag saying there is a problem. Given the way Canary Testing is usually run, that's not much of an advance over automatic alerts that trigger when the performance of the new code goes outside certain bounds. It might streamline your process a bit, but it isn't getting you much further than standard Canary Testing.
So, you may ask, why is Functionize bothering with automated Canary Testing? Well, here at Functionize we had a different view of what Canary Testing should be able to achieve. Rather than simply monitoring the bulk performance of all users on the new code and comparing it to those on the old code, our vision was to have a system that identified each type of user journey in the application and thus compared like-for-like. By doing this, you not only get much finer-grained data, you also get valuable insights into whether a particular user journey is a problem or not.
Tracking user journeys
The first requirement for realizing our vision was to work out how to track specific user journeys through your app. Fortunately, the required tools are now part of the standard Functionize offering. We use a simple JavaScript tag in the header of your code (much like Google Analytics). By doing this, we are able to record every interaction taken by every user in your system. The following picture shows an example of the output of this, highlighting the forgot password flow.
Now that we have the data we need, we can start trying to identify the specific user journeys within the system. NB, our aim is to do this automatically – clearly it would be possible to take the user stories from your product team and construct the user journeys that way. But the problem is that users seldom behave exactly as predicted! So rather than prime the system to look for certain journeys, we instead use standard experimental methods to identify clusters of user interactions. We use the Akaike Information Criterion (AIC) to assess the optimum number of clusters. The following pictures show the result of doing this process for 10,000 users in a sample application, who between them generated 150,000 actions.
As you can see, the AIC graph suggests the optimum number of clusters is 5. These 5 clusters are shown in the graph below.
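Functionize hasn't published the details of its clustering pipeline, but the general AIC-based approach can be sketched with scikit-learn's Gaussian mixture models. In the sketch below, the feature matrix is stand-in data (one vector of action counts per user session) and the candidate cluster counts are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: one feature vector per user session, e.g. counts of each
# action type. A real pipeline would build these vectors from the recorded
# user interactions.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(1000, 12)).astype(float)

# Fit mixtures of increasing size and score each with the AIC;
# the best cluster count is the one that minimises the AIC.
aic_scores = {}
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic_scores[k] = gmm.aic(X)

best_k = min(aic_scores, key=aic_scores.get)
print(f"Optimum number of clusters by AIC: {best_k}")

# Assign each session to its journey cluster
journeys = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)
```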
Predicting user journeys
Now that we can identify specific user journeys, the next step is to try to predict what action a given user will take next. To do this we have turned to a special form of recurrent neural network called a Long Short-Term Memory (LSTM) network. This slightly confusing name means that the network is created from a long chain of short-term memory cells. The special thing about this sort of neural network is that it retains some memory of previous inputs, which makes it ideally suited to predicting future actions. This is also why LSTM networks are often used in applications such as natural language processing (NLP).
To take a very simple example, as a human you know that if someone tells you “I am a pilot, I fly …”, then the last word is likely to be “planes” or something very similar. LSTM networks allow an AI to make the same sort of prediction, based on its knowledge of what pilots do coupled with its ability to follow how the sentence was constructed.
We took the dataset mentioned above and used it to train an LSTM model, with 80% of the data used for training and 20% for testing. We found that we were able to correctly predict the next step in 85% of cases. Given the relative complexity of the application, this is a very good result and shows that we are accurately identifying user journeys in most cases.
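The actual model Functionize uses isn't public, but a minimal Keras sketch of this kind of next-action LSTM looks roughly like the following. The action vocabulary size, sequence length, layer sizes and the randomly generated stand-in data are all illustrative:

```python
import numpy as np
import tensorflow as tf

# Stand-in data: sequences of action IDs (0..n_actions-1), padded/truncated
# to a fixed length, with the "next action" as the label to predict.
n_actions, seq_len = 40, 20
X = np.random.randint(0, n_actions, size=(10_000, seq_len))
y = np.random.randint(0, n_actions, size=(10_000,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_actions, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(n_actions, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 80/20 train/test split, as described above
split = int(0.8 * len(X))
model.fit(X[:split], y[:split],
          validation_data=(X[split:], y[split:]),
          epochs=5, batch_size=128)
```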
Identifying anomalies
The final piece of the puzzle is how to detect whether the new code is behaving anomalously. Because we can compare similar user journeys with each other, we can do anomaly detection at a very fine scale. For instance, we can clearly identify users going through the login flow. If the new code causes the database load to increase, you may also see that login times increase. This will be picked up automatically as an anomaly. Equally, the new code may have cleaned up the login flow, in which case you’d expect to see some improvement to login times and server load. Again, this will be flagged, but this time that is the desired response, rather than indicating a potential issue.
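To make the per-journey comparison concrete, here is a hedged sketch (not Functionize's actual detector) that compares login response times on the old and new code using a simple non-parametric test. The timing numbers are stand-ins:

```python
import numpy as np
from scipy import stats

# Stand-in measurements (milliseconds) for the same "login" journey cluster,
# collected from users on the old code (control) and the new code (canary).
rng = np.random.default_rng(1)
control_login_ms = rng.normal(loc=420, scale=60, size=500)
canary_login_ms = rng.normal(loc=510, scale=70, size=40)

# A non-parametric test avoids assuming response times are normally distributed.
stat, p_value = stats.mannwhitneyu(canary_login_ms, control_login_ms,
                                   alternative="two-sided")

if p_value < 0.01:
    direction = "slower" if canary_login_ms.mean() > control_login_ms.mean() else "faster"
    print(f"Login journey on the canary is significantly {direction} (p={p_value:.4f})")
else:
    print("No significant change detected for the login journey")
```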
Putting it all together
Once you have all these pieces, you have the basis for a very powerful tool to add to your CI/CD pipeline. Once code is ready for release it can be automatically rolled out, first to a few canaries, then to more of your user base and finally to all users. At every stage, the system will spot any anomalies and either flag them or, if the problem is bad enough, automatically roll back the release.
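A rough sketch of how that progressive rollout might be wired together is shown below. Every function name here is a hypothetical placeholder for your own deployment and monitoring hooks, not a real API:

```python
import time

# Hypothetical hooks into your deployment tooling and anomaly detector.
def set_canary_percentage(percent: int) -> None:
    ...  # e.g. update load balancer weights

def journeys_with_anomalies() -> list:
    ...  # e.g. ask the anomaly detector which journeys look wrong
    return []

def roll_back() -> None:
    ...  # e.g. redirect all traffic back to the old version

ROLLOUT_STAGES = [5, 25, 50, 100]   # percent of users on the new code
SOAK_MINUTES = 30                   # how long to watch each stage

for stage in ROLLOUT_STAGES:
    set_canary_percentage(stage)
    time.sleep(SOAK_MINUTES * 60)
    anomalies = journeys_with_anomalies()
    if anomalies:
        print(f"Anomalous journeys at {stage}%: {anomalies} – rolling back")
        roll_back()
        break
else:
    print("Rollout completed with no anomalies detected")
```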
Furthermore, you also gain invaluable insights into how users really use your app. This is useful both for your product team and for making sure your tests cover what they need to before you release code.
Conclusions
As we explained at the beginning, canary testing isn’t the same as normal testing – you still need to test your app before you launch it. But canary testing is a cool approach that allows you to avoid nasty surprises when you release new code. As such, it has been used by the big tech companies for years. Here at Functionize we have taken the idea of canary testing and have added some of our AI magic to make a system that is both powerful and able to automatically identify issues in your released code. As a result, we are now able to streamline your CI/CD pipeline and give you real insights into how customers are interacting with your system.