Routing Security: RPKI Update Q2/20

Wednesday, May 6th, 2020

The Current State of RPKI

It’s been three months since we announced that AS1299, as the first Tier-1 transit network in the world, is successfully filtering RPKI invalid announcements on all external BGP sessions. As this is a fast-evolving topic of great interest to the industry at large, we feel this is a good time to share updates on the state of RPKI in general, what Telia’s current implementation looks like, and some of the next steps being worked on.

We are thrilled by the current momentum around RPKI adoption in combination with key networks starting to create ROAs for their IP resources, thus making the Internet a more secure place. Recent academic efforts have focused on measuring and monitoring the impact of these deployments, highlighting both the overall progress and the benefits already being seen when it comes to limiting propagation of invalid announcements. Using publicly available data and tools with the methodology proposed by C. Testart et al. [1] makes it straightforward to monitor the RPKI invalid count per ASN as seen 2020-05-05 in Figure 1:

It also becomes abundantly clear how broad filtering across all eBGP sessions is required to get the invalid count down, especially in Tier-1 transit providers where it happens to matter the most. We are very happy with the progress and commitments being made by other operators in this category, making the next 12 months look very promising for routing security.
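For readers less familiar with how an announcement ends up counted as invalid, the following is a minimal Python sketch of route origin validation in the sense of RFC 6811, followed by a simplified per-ASN invalid count. It is not the methodology of [1]; the tuple layouts, the function names and the choice to count every ASN on the path ahead of the origin are assumptions made for illustration only.

```python
import ipaddress
from collections import Counter

# A VRP (validated ROA payload) is modeled as (prefix, max_length, origin_asn).
# A route is modeled as (prefix, as_path), with the origin as the last path element.

def origin_validation(prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found' (RFC 6811 terms)."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # A VRP covers the route if the route is equal to, or more specific than, the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if vrp_asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    # Covered by at least one VRP but matched by none: invalid.
    return "invalid" if covered else "not-found"


def invalid_count_per_asn(routes, vrps):
    """Count invalid routes per ASN seen on the path ahead of the origin."""
    counts = Counter()
    for prefix, as_path in routes:
        if origin_validation(prefix, as_path[-1], vrps) == "invalid":
            counts.update(set(as_path[:-1]))
    return counts


# Documentation prefixes and ASNs only.
vrps = [("192.0.2.0/24", 24, 64500)]
routes = [
    ("192.0.2.0/24", [64496, 64497, 64500]),  # valid
    ("192.0.2.0/25", [64496, 64500]),         # invalid: more specific than max_length
    ("192.0.2.0/24", [64497, 64501]),         # invalid: wrong origin
    ("198.51.100.0/24", [64496, 64502]),      # not-found: no covering VRP
]
counts = invalid_count_per_asn(routes, vrps)
```

An ASN that rejects invalids on every external session would, in this simplified view, stop appearing on the paths of invalid routes, which is why its count trends towards zero.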

Some of that new-found interest in deploying RPKI can be attributed to recent high-profile BGP hijacks, many of which were seemingly a result of traffic engineering configuration mistakes during this challenging time where everyone is scrambling for capacity. This yet again highlights the fragility of the current trust model and the devastating impact intended or unintended actions can have on Internet stability. We recently wrote a blog post on the changing traffic trends and impacts as seen from AS1299 during this pandemic, which can be found here.

Concerns and Feedback

What is paramount to understand in this context is that RPKI on its own is by no means complete protection, and only really brings value when deployed alongside IRR-based mechanisms and features such as prefix filters, peer-locking and semi-dynamic prefix-limits. While it is our belief that the industry would benefit from every single entity operating in the default-free zone deploying RPKI, there are a couple of things to take into consideration:

The primary focus should be ensuring it is deployed where the impact is the highest, starting with Tier-1 transit providers such as ourselves taking our responsibility and not waiting for customers to apply enough commercial pressure to force a rushed deployment – and especially not during a pandemic
In the spirit of this being a collaborative initiative, it is incumbent on those of us who have deployed RPKI to dare to share the experiences, mistakes, lessons and pitfalls in order to make the journey shorter and less painful for others. Hopefully this post brings value in that context as well
In large-scale networks, it is sometimes taken for granted that origin validation capabilities are available everywhere, even though getting there has taken years of pressure on vendors. This is certainly not the case, and sometimes these capabilities are only stable in very recent software releases
For all intents and purposes, RPKI is an opt-in solution
Control, in a way, has moved away from anyone with a RADb account to the five RIRs

Focusing on the last point of having to put more trust in the RIRs, few would argue that it isn’t a legitimate concern. It is naive to suggest that because RIRs have stayed relatively clear of compromise in the past, they won’t be affected in the future by political, jurisdictional or technical/operational issues, or by attacks with malicious intent. To a certain degree, they may have become an even more attractive target now.

Based on the feedback of a few customers, however, most concerns aren’t even related to RPKI itself. In many cases it has merely highlighted and increased the awareness of situations that could already occur. For example, IRR data can be manipulated with harmful intent to achieve the same result that is theoretically achievable if a malicious party somehow managed to compromise an RIR and remark legitimate announcements as invalid. From our perspective, however, these entities are already trusted with other fundamental resources such as DNS delegation and IP resources.

The fact that compromising an RIR could be achieved in the same way in a pre-RPKI world cannot, however, be an excuse to forgo making focused and industry-wide efforts to monitor, audit and place demands on the RIRs from an operational, governance, transparency, process and security perspective. Every company is entitled to its own risk analysis and due diligence, but the last thing we need is another decade of network operators, for whatever valid reason or otherwise, sitting on the sidelines waiting for a perfect solution to appear. Therefore, and while we recognize these concerns, it’s important to reiterate:

RPKI is by no means the perfect solution, but it is a first and constructive step in moving away from the trust-based model, albeit with a long way still to go in terms of robustness
Although very little is new or unique to RPKI when it comes to having to trust the integrity of RIRs, concerns should be discussed openly
In retrospect, Telia Carrier should have done a better job of explicitly informing customers that had no previous record of invalid announcements, and of having a dialogue with them, about our intent to reject such announcements starting in Q3/19

RPKI has now gained enough momentum for there to be no turning back. There will be multiple important announcements over the next couple of months to highlight this fact and in turn bring what we believe is enough critical mass to take the next steps in addressing many of the aforementioned concerns, should they still exist. In the meantime, and since starting to reject customer invalid prefixes back in Q3/19, we’ve been working on establishing capabilities to monitor and sanity-check ROA data, as well as a swift method to revert, in case of emergency and per RIR Trust Anchor, to a pre-RPKI state where all prefixes from that RIR are temporarily remarked as unknown.
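To make the idea of a per-Trust-Anchor revert more concrete, here is a minimal Python sketch. It assumes each validated ROA payload carries a label for the Trust Anchor it was validated under; the dictionary keys (`prefix`, `maxLength`, `asn`, `ta`) and both function names are hypothetical, and the sketch says nothing about how this is actually wired into validators or routers in AS1299.

```python
import ipaddress

def rpki_state(prefix, origin_asn, vrps):
    """Return 'valid', 'invalid' or 'unknown' for a route against a list of VRPs."""
    route = ipaddress.ip_network(prefix)
    covering = []
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp["prefix"])
        if vrp_net.version == route.version and route.subnet_of(vrp_net):
            covering.append(vrp)
    if not covering:
        return "unknown"
    matched = any(
        vrp["asn"] == origin_asn and route.prefixlen <= vrp["maxLength"]
        for vrp in covering
    )
    return "valid" if matched else "invalid"


def revert_trust_anchor(vrps, ta):
    """Drop every VRP that was validated under the given Trust Anchor."""
    return [vrp for vrp in vrps if vrp["ta"] != ta]


# Hypothetical scenario: a bad ROA under one Trust Anchor marks a legitimate route invalid.
vrps = [
    {"prefix": "192.0.2.0/24", "maxLength": 24, "asn": 64511, "ta": "ripe"},
    {"prefix": "198.51.100.0/24", "maxLength": 24, "asn": 64501, "ta": "arin"},
]
before = rpki_state("192.0.2.0/24", 64500, vrps)   # "invalid"
reverted = revert_trust_anchor(vrps, "ripe")
after = rpki_state("192.0.2.0/24", 64500, reverted)  # "unknown"
```

The trade-off is visible in the sketch: while a Trust Anchor is reverted, routes it would have marked valid also fall back to unknown, so the protection for that RIR's address space is temporarily gone, which is why this is an emergency measure only.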

We’ve also started to track FAQs on our implementation and hope the following section can prompt additional questions, for which we’ll be posting answers on the BGP & Routing page at teliacarrier.com shortly.

Telia Carrier’s Implementation

Testing of various RPKI validators started back in mid-2018 and there have been RTR sessions established to all routers in AS1299 since October the same year, at which point the first non-intrusive step was simply displaying the RPKI status for all prefixes in the Telia Carrier Looking Glass.

Figure 2 – Telia Carrier Looking Glass with prefix RPKI status

This provided ample time to monitor and harden the performance, devise workarounds and request fixes to the few but serious bugs found in mainly router software but also in the various validators. In the months following, effort was put into:

Deploying geographically diverse infrastructure globally to run the RPKI software stack on
Building sanity checks and code logic required to run parallel validators, and ensuring and enhancing the ability to safeguard against, and recover from, the most likely failure scenarios
Continuous risk mitigation to work around severe router vendor bugs associated with RPKI
Pushing RPKI related feature requests to the router vendors deployed in AS1299
Fully supporting, and working within, the MANRS program
Collaborating with, and contributing to, RPKI validator developments
Participating in community RPKI sessions, workshops and standardization efforts
Developing automated reporting tools to provide data on invalids per router and origin – enriched with flow data to be able to map (invalid) prefixes to traffic volumes
Tracking which invalid routes were being announced to AS1299 and, for customer reports, which of them had a covering aggregate. For all customers and peers it applied to, this data was shared prior to starting to reject invalids
Building tools to analyze the effect of temporarily rejecting invalid announcements from certain peers, tracking how and where traffic shifts

… and perhaps most importantly having very fruitful exchanges with key contributors; with enough of a critical mass now committed to lead the way into an industry-wide deployment of RPKI.
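The covering-aggregate check mentioned in the list above lends itself to a short sketch. The Python below assumes a flattened view of the routing table as a mapping from prefix to RPKI status; that data layout and the function names are hypothetical and only meant to show the idea that an invalid more-specific with a non-invalid less-specific behind it would remain reachable once invalids are rejected.

```python
import ipaddress

def has_covering_route(invalid_prefix, rib):
    """True if a less-specific route that is not itself invalid covers the prefix.

    rib maps prefix strings to an RPKI status ('valid', 'invalid' or 'not-found').
    """
    target = ipaddress.ip_network(invalid_prefix)
    for prefix, status in rib.items():
        candidate = ipaddress.ip_network(prefix)
        if status == "invalid" or candidate.version != target.version:
            continue
        if candidate.prefixlen < target.prefixlen and target.subnet_of(candidate):
            return True
    return False


def covering_report(rib):
    """Map each invalid prefix to whether traffic could fall back to a covering route."""
    return {
        prefix: has_covering_route(prefix, rib)
        for prefix, status in rib.items()
        if status == "invalid"
    }


rib = {
    "203.0.113.0/24": "valid",
    "203.0.113.128/25": "invalid",   # covered by the valid /24 above
    "198.51.100.0/24": "invalid",    # nothing less specific in the table
}
report = covering_report(rib)
# report == {"203.0.113.128/25": True, "198.51.100.0/24": False}
```

Prefixes reported as `False` are the ones worth raising first with the announcing party, since rejecting them would leave no alternative route in this view of the table.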

Towards the end of the second quarter of 2019, Telia Carrier announced to all peers the intent to start rejecting RPKI invalids with a grace period during which only the local preference was reduced. If unable to model the traffic impact properly, this is highly recommended as a general approach for a smoother transition. A perhaps clunky, yet simple, method of modelling the impact is using prefix RPKI status married with historic flow sampling data.
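Both ideas in the paragraph above, a grace period with reduced local preference and a rough impact model built from RPKI status plus flow data, can be sketched in a few lines of Python. The local preference values, the status labels and the dictionary-based inputs are assumptions for illustration; real flow data would first need to be aggregated per prefix.

```python
from collections import defaultdict

DEFAULT_LOCAL_PREF = 100  # hypothetical value
GRACE_LOCAL_PREF = 50     # hypothetical value, lower than any normal preference

def import_action(status, grace_period):
    """Decide what to do with a route given its RPKI status and the rollout phase."""
    if status != "invalid":
        return ("accept", DEFAULT_LOCAL_PREF)
    if grace_period:
        # Still accepted, but any alternative route will be preferred.
        return ("accept", GRACE_LOCAL_PREF)
    return ("reject", None)


def traffic_by_status(rpki_status, flow_bytes):
    """Sum historic traffic volume per RPKI status of the destination prefix."""
    totals = defaultdict(int)
    for prefix, nbytes in flow_bytes.items():
        totals[rpki_status.get(prefix, "unclassified")] += nbytes
    return dict(totals)


def top_invalid(rpki_status, flow_bytes, n=10):
    """Invalid prefixes ordered by the traffic that would be affected."""
    invalid = [(p, b) for p, b in flow_bytes.items() if rpki_status.get(p) == "invalid"]
    return sorted(invalid, key=lambda item: item[1], reverse=True)[:n]


rpki_status = {
    "192.0.2.0/24": "valid",
    "198.51.100.0/24": "invalid",
    "203.0.113.0/24": "invalid",
}
flow_bytes = {
    "192.0.2.0/24": 9000,
    "198.51.100.0/24": 700,
    "203.0.113.0/24": 300,
    "2001:db8::/32": 50,
}
totals = traffic_by_status(rpki_status, flow_bytes)
# totals == {"valid": 9000, "invalid": 1000, "unclassified": 50}
```

The invalid total is an upper bound on the traffic affected rather than the traffic lost, since part of it may simply shift to a covering or alternative route.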

In parallel, invalid announcements were rejected from a group of friendly customers as well as all new ones. Customers that had at any point announced invalids since starting to monitor this in October 2018 were contacted to proactively create awareness and address any concerns around our intent to also start rejecting such announcements for the remaining eBGP sessions for complete coverage without exceptions or compromise.

Figure 3 depicts the progress over time, where there is a clear plateau at around the 80% mark in Q4/19 while we supported customers (and their customers’ customers, and so on) in the transition. Around midday CET on February 4, 2020, the switch was made for all invalid announcements coming in over any external BGP session to be rejected. The area in Figure 3 below represents the invalid count as seen from public resources such as routeviews and RIS, and the lines on the secondary axis represent the percentage of ASNs from which Telia was rejecting invalid prefixes:

Figure 3 – Telia Carrier RPKI implementation timeline

In the current setup and at the time of this writing, two different validators are deployed and spread out on diverse hardware globally, with all routers having active sessions to an instance of each validator. The validators currently deployed are Routinator and OpenBSD’s rpki-client, with the aforementioned sanity checking between them. Our routers use the combined output of ROAs from all validators as a base for validation. This means that one, or even two, validator instances can be lost without any impact on performance. The TA updates/refreshes are staggered across geographies and with different update frequencies per validator. Half of the validators are restricted to only using rsync for synchronization, with the other half preferring RRDP when available. We still see occasional diffs in validator output due to this, when delegated repositories don’t keep the two different distribution protocols in sync.
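As a rough illustration of what combining validator output and sanity checking between validators can mean, the Python sketch below treats each validator's output as a set of (prefix, max length, ASN) tuples. The function names and the 1% threshold are hypothetical, and in practice the combination happens on the routers through their sessions to each validator rather than in a script like this.

```python
def combined_vrps(outputs):
    """Union of the VRP sets produced by all validators.

    outputs maps a validator name to a set of (prefix, max_length, asn) tuples.
    """
    merged = set()
    for vrps in outputs.values():
        merged |= vrps
    return merged


def sanity_check(outputs, max_diff_ratio=0.01):
    """Flag the run if validators disagree on too large a share of VRPs.

    Returns (ok, disputed), where disputed holds the VRPs not seen by every validator.
    """
    merged = combined_vrps(outputs)
    common = set.intersection(*outputs.values()) if outputs else set()
    disputed = merged - common
    ratio = len(disputed) / len(merged) if merged else 0.0
    return ratio <= max_diff_ratio, disputed


a = ("192.0.2.0/24", 24, 64500)
b = ("198.51.100.0/24", 24, 64501)
c = ("203.0.113.0/24", 24, 64502)
outputs = {
    "routinator": {a, b},
    "rpki-client": {a, b, c},   # e.g. one instance fetched a repository the other missed
}
ok, disputed = sanity_check(outputs)
# ok is False and disputed == {c}: one of three VRPs is seen by only one validator
```

Small, short-lived differences of this kind are expected when rsync and RRDP publication points drift apart, so the threshold would be tuned to alert only on differences large enough to suggest a validator or repository problem.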

Another fundamental piece to our BGP Routing setup working alongside the validators is the homegrown BGP Filter Server. This server automatically sweeps through more than 10,000 filters twice a day and corrects any inconsistencies across the more than 2,000 Telia IP Transit customers. Besides generating all IPv4 and IPv6 prefix filters, this server also acts as the central nervous system for prefix-limits, AS filters, RTBH enablement, iBGP load-sharing features and now RPKI – using ROAs as well as data from IRR databases as sources. An open-source tool used extensively is pmacct, in this context for handling netflow and RIB data from the network. This makes the flow of automated data querying, reporting/visualization and analytics fairly straightforward.

Figure 5 – Holistic RPKI data setup for reporting and analytics
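The internals of the BGP Filter Server are not described in this post, but the general pattern of sweeping filters and correcting inconsistencies can be sketched as a reconciliation between an expected state (derived from sources such as IRR data and ROAs) and the deployed state. Everything below, including the function names and the dictionary layout keyed by a filter name, is a hypothetical Python illustration of that pattern.

```python
def reconcile(expected, deployed):
    """Return the changes needed to make a deployed prefix filter match the expected one."""
    expected, deployed = set(expected), set(deployed)
    return {
        "add": sorted(expected - deployed),
        "remove": sorted(deployed - expected),
    }


def sweep(expected_filters, deployed_filters):
    """Compare every filter and return only those that need correcting."""
    changes = {}
    for name, expected in expected_filters.items():
        diff = reconcile(expected, deployed_filters.get(name, []))
        if diff["add"] or diff["remove"]:
            changes[name] = diff
    return changes


expected_filters = {
    "customer-a-v4": ["192.0.2.0/24", "198.51.100.0/24"],
    "customer-b-v4": ["203.0.113.0/24"],
}
deployed_filters = {
    "customer-a-v4": ["192.0.2.0/24"],      # missing a newly registered prefix
    "customer-b-v4": ["203.0.113.0/24"],    # already consistent
}
changes = sweep(expected_filters, deployed_filters)
# changes == {"customer-a-v4": {"add": ["198.51.100.0/24"], "remove": []}}
```

Running such a sweep on a schedule keeps filters converging on the expected state even when individual changes are missed in between runs.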

Separate from RPKI, another tool we recommend deploying is ARTEMIS. It provides monitoring and detection of BGP hijacks, as well as simple alarm integration and correlation. A recent example showing the usefulness of both RPKI and ARTEMIS is a well-known incident in early April 2020, where 8,877 prefixes were hijacked. More than three-quarters of those prefixes had signed ROAs, and were rejected from both downstreams and peers. Because in ARTEMIS we subscribe to changes for Telia-owned prefixes, alerts were received within a few seconds of the announcements:

Figure 6 – ARTEMIS alerts during April 1, 2020 BGP hijack
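The kind of check behind such alerts can be illustrated with a small Python sketch: watch a set of owned prefixes and flag any announcement of them, or of a more-specific, from an unexpected origin. This is not ARTEMIS code or its API; the function name, the input shapes and the alert labels are assumptions for illustration.

```python
import ipaddress

def detect_hijacks(announcements, monitored):
    """Flag announcements of monitored address space from unexpected origins.

    monitored maps an owned prefix to the set of legitimate origin ASNs.
    announcements is an iterable of (prefix, origin_asn) pairs from a BGP feed.
    """
    alerts = []
    for prefix, origin in announcements:
        announced = ipaddress.ip_network(prefix)
        for own_prefix, legit_origins in monitored.items():
            own = ipaddress.ip_network(own_prefix)
            if announced.version != own.version or not announced.subnet_of(own):
                continue
            if origin not in legit_origins:
                kind = "exact-prefix" if announced == own else "sub-prefix"
                alerts.append((prefix, origin, kind))
    return alerts


monitored = {"192.0.2.0/24": {64500}}
announcements = [
    ("192.0.2.0/24", 64500),     # legitimate
    ("192.0.2.0/24", 64511),     # same prefix, wrong origin
    ("192.0.2.128/25", 64511),   # more-specific, wrong origin
    ("198.51.100.0/24", 64511),  # not monitored
]
alerts = detect_hijacks(announcements, monitored)
# alerts == [("192.0.2.0/24", 64511, "exact-prefix"), ("192.0.2.128/25", 64511, "sub-prefix")]
```

Detection of this kind complements origin validation: it also covers prefixes without ROAs, but it only alerts and does nothing on its own to stop propagation.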

Summary & Next Steps

In summary, we firmly believe global Internet routing will benefit from widespread adoption of RPKI, but there’s still much work to be done with RPKI itself and other security related mechanisms.
2020 looks to be the year when Tier-1 transit providers embrace and implement this technology in conjunction with key networks starting to register their ROAs.

In the next RPKI post, we’ll address how RTBH (Remote Triggered Blackholing) works in combination with RPKI. We’ll also look at how BMP (BGP Monitoring Protocol) can offer better insight into rejected prefixes in a pre-policy/Adj-RIB-in view for internal analysis, troubleshooting and customer visibility.

Written by
Carl Fredrik Lagerfeldt
Global Peering Manager at Telia Carrier
&
Johan Gustawsson
Head of Network Engineering & Architecture at Telia Carrier