In this article, I’ll explore interrupt latency of a Cortex-A9 under various scenarios — and yes, it’s still on the Zynq-7000, since I still have that board on my desk from the last two articles. An upcoming follow-up article will describe methods of improving worst case latency.

Embedded Systems and Application Processors

With the ever-increasing requirements of modern embedded applications, more and more device designs are making the jump from smaller embedded chips to more powerful application-class processors. This isn’t very surprising; even their names could have helped anyone foretell the trend. MCUs, or microcontrollers, are designed to perform control tasks, while microprocessors are better suited to processing data. Modern applications are all about data.

Often the first question that arises when starting or migrating a design to an application processor is about interrupt latency. Interrupt response time is often stated as a single number, which at the end of the day doesn’t mean much. If I wanted to give a sales pitch, I could say that I can get you to the first instruction of your ISR in fewer than 100 cycles, which on a typical MPU would be less than 200 ns. That’s right, nanoseconds. But that number is a best case, and if an application is latency-sensitive, what matters is the worst-case scenario. Worst-case interrupt response time is application-dependent, with a mix of variables affecting the results, including memory type, CPU configuration, and application behaviour.

Interrupt Latency on the Cortex-A

Interrupt latency is mostly affected by how much time it takes for the CPU to fetch the necessary instructions and data in order to process the interrupt entry sequence. On a higher-end platform, this is complicated by the more complex memory hierarchy and memory management. For example, in a Cortex-M, an instruction or data fetch can be done in a well-bounded amount of time, while a Cortex-A requires many more steps when fetching from memory.

Getting an instruction or data first requires a TLB lookup. On a miss, this triggers a page table walk, and the page table entries themselves may or may not be cached. At this point, the access itself still hasn’t been done. A memory access must then cross the entire memory hierarchy in search of either a cache hit or the underlying memory. Most modern Cortex-A processors are implemented with a four-level memory hierarchy. The L1 and L2 caches are usually named as such, while the central interconnect would be L3 and external memory would be located at L4.
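The sequence described above can be sketched as a small model. This is only an illustration of the order in which the stages are visited for a cacheable access, not a description of the actual Cortex-A9 hardware; the function name `access_path` and its flags are hypothetical, and no timing values are implied.

```python
def access_path(tlb_hit, walk_cached, l1_hit, l2_hit):
    """Return the ordered list of stages one cacheable memory access visits."""
    stages = ["TLB lookup"]
    if not tlb_hit:
        # A TLB miss forces a page table walk; the walk is itself a memory
        # access that may or may not be served from the cache.
        stages.append("page table walk (cached)" if walk_cached
                      else "page table walk (uncached)")
    stages.append("L1 cache")
    if not l1_hit:
        stages.append("L2 cache")
        if not l2_hit:
            # Nothing cached: cross the interconnect to external memory.
            stages.append("central interconnect (L3)")
            stages.append("external memory (L4)")
    return stages


# Warm system: TLB and L1 both hit.
print(" -> ".join(access_path(True, True, True, True)))
# Cold system: every level misses.
print(" -> ".join(access_path(False, False, False, False)))
```

The warm case stops after two stages, while the cold case visits all six, which is why the state of the TLB and caches matters so much for interrupt entry.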

Rather than belabouring the point with a long theoretical discussion, I’ve simply measured interrupt latency under a few configurations with varying cache states. Measurements were done using a software-generated interrupt on a Zynq-7000 running at 667 MHz. The values are quoted in CPU cycles, and include the measurement and interrupt generation overhead.
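As a rough sketch of how raw timestamps could be reduced to the minimum, maximum and average figures reported here, the snippet below assumes the timestamps come from a timer ticking at half the CPU frequency (as the author mentions in the comments for the global timer). The helper names and the sample readings are hypothetical and are not measured data.

```python
CPU_CYCLES_PER_TIMER_TICK = 2  # assumption: the timer runs at half the CPU clock


def latency_cycles(start_tick, isr_tick):
    """Convert a pair of timer readings into a latency in CPU cycles."""
    return (isr_tick - start_tick) * CPU_CYCLES_PER_TIMER_TICK


def summarize(samples):
    """Return (min, max, avg) in CPU cycles for (start_tick, isr_tick) pairs."""
    cycles = [latency_cycles(start, end) for start, end in samples]
    return min(cycles), max(cycles), round(sum(cycles) / len(cycles))


# Hypothetical readings, for illustration only: the first value is taken just
# before triggering the interrupt, the second at the start of the ISR.
samples = [(1000, 1121), (5000, 5121), (9000, 9124)]
print(summarize(samples))  # (242, 248, 244)
```

With a timer at half the CPU clock, every result is an even number of CPU cycles, so the resolution of such a measurement is two cycles.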

The table contains measurements for four different configurations. Two of those configurations have code and data stored in external DDR memory, and two have code and data in the on-chip RAM (OCRAM). Both memory locations were tested with memory marked as cacheable and non-cacheable.

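| Configuration | Scenario | Min (cycles) | Max (cycles) | Avg (cycles) |
|---|---|---|---|---|
| DDR Cached | Normal | 242 | 242 | 242 |
| DDR Cached | TLB Invalidate | 326 | 496 | 328 |
| DDR Cached | BP Invalidate | 298 | 298 | 298 |
| DDR Cached | TLB+BP Invalidate | 384 | 478 | 396 |
| DDR Cached | L1I Clean | 400 | 414 | 404 |
| DDR Cached | L1D Clean | 396 | 420 | 408 |
| DDR Cached | L1 D+I Clean | 532 | 580 | 546 |
| DDR Cached | L2 Clean | 242 | 242 | 242 |
| DDR Cached | L1+L2 Clean | 1582 | 2978 | 1844 |
| DDR Cached | L1+L2+BP+TLB Clean | 1746 | 2896 | 1892 |
| OCRAM Cached | Normal | 240 | 240 | 240 |
| OCRAM Cached | TLB Invalidate | 242 | 242 | 242 |
| OCRAM Cached | BP Invalidate | 298 | 298 | 298 |
| OCRAM Cached | TLB+BP Invalidate | 298 | 298 | 298 |
| OCRAM Cached | L1I Clean | 402 | 424 | 410 |
| OCRAM Cached | L1D Clean | 424 | 462 | 448 |
| OCRAM Cached | L1 D+I Clean | 588 | 620 | 600 |
| OCRAM Cached | L2 Clean | 240 | 240 | 240 |
| OCRAM Cached | L1+L2 Clean | 592 | 620 | 594 |
| OCRAM Cached | L1+L2+BP+TLB Clean | 580 | 626 | 596 |
| DDR Uncached | Normal | 2076 | 3182 | 2186 |
| DDR Uncached | TLB Invalidate | 2100 | 3204 | 2210 |
| DDR Uncached | BP Invalidate | 2846 | 4010 | 2984 |
| DDR Uncached | TLB+BP Invalidate | 2874 | 4052 | 2996 |
| DDR Uncached | L1I Clean | 2818 | 4062 | 2980 |
| DDR Uncached | L1I+BP+TLB Clean | 2880 | 3996 | 2986 |
| OCRAM Uncached | Normal | 686 | 700 | 690 |
| OCRAM Uncached | TLB Invalidate | 692 | 694 | 692 |
| OCRAM Uncached | BP Invalidate | 822 | 822 | 822 |
| OCRAM Uncached | TLB+BP Invalidate | 822 | 822 | 822 |
| OCRAM Uncached | L1I Clean | 822 | 822 | 822 |
| OCRAM Uncached | L1I+BP+TLB Clean | 822 | 822 | 822 |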
Interrupt latency results

Analysis

Let’s look first at the normal configuration, which would be code and data stored in cached external DDR. The best-case interrupt latency is 242 cycles, which for a 667 MHz Zynq is about 0.36 µs. Not bad. The worst case in the measured scenarios is with clean L1 and L2 caches, as well as an empty TLB and empty branch predictors. In that latter case, measurements show a worst case of 3996 cycles, or nearly 6 µs. Moving code and data into on-chip memory gives a very similar best case; this makes sense, since this is with a warm L1 cache, so memory accesses never reach L2. The worst case is improved to 626 cycles, or 0.9 µs. A point of note is that on the Zynq, the OCRAM is located at the same level as the L2 cache, so accesses to the OCRAM from the CPU do not have to cross the central interconnect.
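The cycle-to-time conversions quoted above are simple arithmetic and can be checked with a few lines. The sketch below assumes a CPU clock of exactly 667 MHz, as quoted in this article; the helper name `cycles_to_us` is only illustrative.

```python
CPU_HZ = 667_000_000  # CPU clock as quoted in the article


def cycles_to_us(cycles, cpu_hz=CPU_HZ):
    """Convert a CPU cycle count into microseconds."""
    return cycles / cpu_hz * 1e6


for cycles in (242, 626, 3996):
    print(f"{cycles} cycles = {cycles_to_us(cycles):.2f} us")
# 242 cycles = 0.36 us
# 626 cycles = 0.94 us
# 3996 cycles = 5.99 us
```

The same helper can be applied to any row of the results table to express it in microseconds for a given clock frequency.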

There are also a few oddities in the results. One of them is the impact of a cold TLB, which doesn’t count for much, or in some cases for anything at all, if we trust the measurements. For example, when running from on-chip RAM, clearing the TLB seems to have no effect. This is mostly because the code and data fit in a single page, so only one fetch from the page table is required. In those cases, even if the TLB is flushed before measuring, the page table access is done prior to the start of the measurement, and isn’t captured in the data. The memory layout for the configuration in external memory is a little different, and requires two table lookups instead of one.
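To illustrate why the memory layout changes the number of page table lookups, the sketch below counts how many distinct pages a set of memory regions touches. The article does not state the page size or the addresses used, so the 1 MB section size and both layouts are assumptions chosen purely for illustration.

```python
def pages_touched(regions, page_size):
    """Count distinct pages covered by (start_address, size_in_bytes) regions."""
    pages = set()
    for start, size in regions:
        first = start // page_size
        last = (start + size - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages)


SECTION_SIZE = 1 << 20  # assumption: memory mapped with 1 MB sections

# Hypothetical layouts, listed as (start, size) for code then data.
same_section = [(0x0000_0000, 0x8000), (0x0000_8000, 0x4000)]
two_sections = [(0x0010_0000, 0x8000), (0x0020_0000, 0x4000)]

print(pages_touched(same_section, SECTION_SIZE))  # 1
print(pages_touched(two_sections, SECTION_SIZE))  # 2
```

Each distinct page needs its own TLB entry, so after a TLB flush the first layout costs one page table lookup and the second costs two.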

Parting Words

The measurements, however, do not reflect the absolute worst-case scenario, as we are considering clean caches with a mostly idle SoC. When qualifying a design for worst-case interrupt response time, it’s important to use the complete application. If possible, I’ll try to simulate a mostly dirty cache with some interconnect contention before the next post, which will be about how to mitigate interrupt latency in a Cortex-A based system.

Questions or comments?

Do not hesitate to contact us at blog@jblopen.com. Your questions, comments, and suggestions are appreciated.

6 thoughts on “ARM Cortex-A Interrupt Latency”

  1. The results are very interesting!

    What exactly did you use to measure the latency times in clock-cycles?
    One of the internal system timers or debug access?

    1. In this case, the global timer running at half the CPU frequency. I also tried the cycle counter of the performance monitor, but found it could take a considerable time to read.

      1. But were your measurements of internal CPU-generated interrupts, or were you driving an external interrupt pin through the PL?

        If so, how were you getting the timestamp for the rising edge of your interrupt?

        It would be interesting to know how different interrupt types respond with this same analysis.

        1. I was using the software-generated interrupt feature of the ARM GIC. Since I was mostly interested in the software response time, the exact interrupt source was somewhat irrelevant. Most interrupt sources, even external GPIO, have a very short latency compared to the software latency required to process the interrupt.

          As for how it was measured, it’s basically reading the global timer value, triggering the interrupt, then reading the global timer again in the ISR. It does add a small overhead to all measurements, but it’s small compared to the interrupt processing itself.

    1. At this moment we do not have anything we can readily share for measuring ISR latency using the PMU. You may want to be careful when using the PMU, however, as it only counts execution cycles, which is useful to measure the actual CPU load but may differ from the real execution time.
