Automation for the people

Wednesday, January 13th, 2021

Having the number one rated fiber backbone in the world is of course great, but speed and capacity are not the only things that are important to customers – or to us, for that matter. Keeping that network going takes a lot of work because, no matter how hard you try, components fail and unexpected incidents occur.

For us, that is one of the reasons we invest so much in our customer service and are always looking for ways to improve our Mean Time to Repair. That way, when problems do arise, we are on top of them as quickly as possible.

When thinking about the problem, there are two main types of issues to consider. Event Management covers problems that occur on the network: how quickly can you identify the problem, find the root cause, and get it fixed? An event might go unnoticed by customers, whereas Incident Management covers those cases where a customer has identified the problem and reported it to us. We want to minimize both, but the latter means one or more of our customers is being impacted, and we want to minimize those cases as much as possible!

It’s something we are improving all the time. Where a decade ago we focused on monitoring just the backbone, there is now much more understanding amongst our customers of how small changes on our network or their on-premises equipment impact the service they receive.

This has driven us to become more granular in the way we analyze and monitor connections, not just on our network but through customer premises equipment such as routers and switches. It was very clear to us from the outset that this information was useful to both us and our customers and could be used to improve our service through better understanding of Mean Time to Failure for components.

Of course, on a network the size of ours, that means a huge number of data points being generated across a range of parameters, and that is where automation has played such a crucial role. For some time now Telia Carrier has automatically generated service tickets and notified customers when a possible issue is identified. Sometimes a configuration change at the customer end is responsible for the alarm, but in many cases network events are identified before they turn into incidents and directly harm the customer experience.

Collecting this data has allowed us to learn even more about our own network and to spot patterns. A sequence of events and data points can reveal that equipment is approaching a point of failure, or give early warning of a wider issue. It means that we can take one customer’s experience and use it to assess the action that we need to take when we see the same pattern somewhere else.

This all means that we understand Mean Time to Failure with much more granularity than before and our Mean Time to Repair has been steadily decreasing. We make these granular data points accessible to our customers, so they too can better understand the status and performance of network links.
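For readers who want to see what these two metrics look like in practice, here is a minimal sketch in Python. It is not our production tooling: the incident records, the function names, and the simplification of treating the uptime between one repair and the next failure as "time to failure" are all assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records for one link: (detected_at, resolved_at).
incidents = [
    (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 11, 0)),   # 2 h outage
    (datetime(2021, 1, 6, 14, 0), datetime(2021, 1, 6, 15, 0)),  # 1 h outage
    (datetime(2021, 1, 9, 8, 0), datetime(2021, 1, 9, 11, 0)),   # 3 h outage
]


def mean_time_to_repair(records):
    """Average time from detection to resolution across all records."""
    if not records:
        raise ValueError("need at least one incident record")
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


def mean_time_to_failure(records):
    """Average uptime between one repair and the next failure.

    Assumption: records describe the same component and do not overlap.
    """
    ordered = sorted(records)
    # Uptime gap = next detection time minus previous resolution time.
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two incident records")
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mean_time_to_repair(incidents))
print("MTTF:", mean_time_to_failure(incidents))
```

With the sample records above, the repair times are 2, 1 and 3 hours, so the Mean Time to Repair is 2 hours; the two uptime gaps are 51 and 65 hours, so the Mean Time to Failure is 58 hours. More granular data simply means more, and more precise, records feeding calculations like these.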

All of these learnings have also helped our first line support. Our engineers understand better than ever the cases they might encounter, and the appropriate resolutions, when helping customers. It means we resolve more customer queries, faster than ever, at the first level.

Automation has played a significant role in making this a reality. Our analysis of the network is getting ever closer to our customers’ networks and is providing far greater granularity. We’re constantly improving, and we’ll always be striving for an even better customer experience, with more data insights and improved customer support.

Watch this space!

Hasibul Islam, Head of Customer Support