Macro for generating delays in machine cycles

Registered by asier

Proposal to add a very useful macro (not a function) that generates a delay of a specified number of CPU machine cycles.
Some other compilers already provide it under the name __delay_cycles().
The compiler should translate it to assembler code as follows:
__delay_cycles(1) -> NOP
__delay_cycles(2) -> 2xNOP
__delay_cycles(..) -> calculated assembler loop
__delay_cycles(...) -> calculated assembler loop in loop
I know that it isn't accurate while interrupts are active, but the delay is never shorter than requested.

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: asier
Direction: Needs approval
Assignee: None
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

Let me propose my solution:

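#include <stdint.h>

// nop() is used below but not defined in this post; assumed here to be a single NOP instruction
#define nop() __asm__ __volatile__("nop")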

static __inline__ __attribute__((__always_inline__)) void delay_4cycles(uint32_t cy) // +1 cycle
{
 #if ARCH_PIPELINE_RELOAD_CYCLES<2
 # define EXTRA_NOP_CYCLES "nop"
 #else
 # define EXTRA_NOP_CYCLES ""
 #endif

 __asm__ __volatile__
 (
  ".syntax unified" "\n\t" // prevents non-unified syntax on CM0, CM1
  "loop%=:" "\n\t"
  " subs %[cnt],#1" "\n\t"
          EXTRA_NOP_CYCLES "\n\t"
  " bne loop%=" "\n\t"
  : [cnt]"+r"(cy) // output: +r means input+output
  : // input:
  : "cc" // clobbers:
 );
}

static __inline__ __attribute__((__always_inline__)) void delay_cycles(uint32_t x)
{
 #define MAXNOPS 4

 if (x<=MAXNOPS)
 {
  if (x==1) {nop();}
  else if (x==2) {nop(); nop();}
  else if (x==3) {nop(); nop(); nop();}
  else if (x==4) {nop(); nop(); nop(); nop();}
 }
 else // because of +1 cycle inside delay_4cycles
 {
  uint32_t rem = (x-1)%MAXNOPS;

  if (rem==1) {nop();}
  else if (rem==2) {nop(); nop();}
  else if (rem==3) {nop(); nop(); nop();}

  if ((x=(x-1)/MAXNOPS)) delay_4cycles(x); // for more than 4 NOPs a loop is more efficient
 }
}
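
To make the arithmetic in delay_cycles easier to check, here is a host-side sketch that mirrors its branch logic and reports how a request is split into NOPs and loop iterations. It assumes the cost model implied by the comments above (4 cycles per iteration of delay_4cycles plus 1 cycle of overhead). The names delay_plan, plan_delay and planned_cycles are hypothetical, and nothing here measures real timing.

```c
#include <stdint.h>

#define MAXNOPS 4u

/* Hypothetical helper type: how a request of x cycles is split up. */
struct delay_plan {
    uint32_t nops;   /* individual NOP instructions emitted */
    uint32_t loops;  /* iteration count passed to delay_4cycles() */
};

/* Mirrors the branches of delay_cycles() above without executing any delay. */
static struct delay_plan plan_delay(uint32_t x)
{
    struct delay_plan p = {0u, 0u};

    if (x <= MAXNOPS) {
        p.nops = x;                      /* 0..4 cycles: plain NOPs only */
    } else {
        p.nops  = (x - 1u) % MAXNOPS;    /* remainder after the +1 overhead */
        p.loops = (x - 1u) / MAXNOPS;    /* each iteration assumed to cost 4 cycles */
    }
    return p;
}

/* Total cycles under the assumed model: NOPs + 4 per iteration + 1 overhead. */
static uint32_t planned_cycles(struct delay_plan p)
{
    return p.nops + (p.loops ? MAXNOPS * p.loops + 1u : 0u);
}
```

For example, plan_delay(10) yields 1 NOP and 2 loop iterations, which this model counts as 1 + 4*2 + 1 = 10 cycles.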

By @Traumflug

For a calibrated delay loop with microseconds as parameter, see https://github.com/Traumflug/Teacup_Firmware/blob/master/delay-arm.c

Besides interrupts, the prefetch engine is another source of unexpected additional delays. To deal with this, one can add __ASM (".balign 16"). The assembler then pads with NOPs so that the code always starts at a 16-byte boundary, giving consistent behavior in the loop. If such a loop ends up crossing a 16-byte boundary, it runs a few clocks slower without executing any additional instructions; the CPU just stalls for a clock tick or two.
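
A minimal sketch of the alignment idea, assuming GCC-style inline assembly: the directive sits in the same asm statement as the loop, directly before the loop label, so the padding lands right in front of the loop. The function name delay_4cycles_aligned is hypothetical, and the non-ARM fallback exists only so the file still compiles on a host; it is not cycle-accurate.

```c
#include <stdint.h>

/* Hypothetical variant of delay_4cycles() with the loop start aligned.
 * cy must be nonzero: with cy == 0 the ARM loop would wrap around. */
static inline void delay_4cycles_aligned(uint32_t cy)
{
#if defined(__arm__) || defined(__thumb__)
    __asm__ __volatile__
    (
        ".syntax unified"   "\n\t"
        ".balign 16"        "\n\t"  /* pad so the loop starts on a 16-byte boundary */
        "1:"                "\n\t"
        " subs %[cnt], #1"  "\n\t"
        " nop"              "\n\t"
        " bne 1b"           "\n\t"
        : [cnt] "+r" (cy)           /* input and output */
        :
        : "cc"                      /* subs changes the flags */
    );
#else
    /* Host fallback: a plain countdown the optimizer cannot remove. */
    while (cy != 0u) {
        __asm__ __volatile__("" : "+r" (cy));
        cy--;
    }
#endif
}
```

Note that the padding NOPs are executed once on entry, so the fixed overhead of the call can vary with where the function is inlined, even though the per-iteration cost of the loop becomes consistent.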

Also worth considering: more feature-rich Cortex cores may simply ignore NOPs. The NOPs still enter the CPU pipeline, but are discarded before they consume time. This information comes from one of the Cortex-M user manuals.

By David Brown

The delay_cycles function should check that the parameter x is constant:

static __inline__ __attribute__((__always_inline__)) void delay_cycles(uint32_t x)
{
  if (__builtin_constant_p(x)) {
    ... // same as above
  } else {
    delay_4cycles(x / 4);
  }
}

It would be a bit fiddly to get the dynamic version cycle-perfect rather than rounded to a multiple of 4 cycles, and I doubt it is worth the effort. But this version would still be more accurate than the first version.
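
One detail of the dynamic branch is worth a sketch. Because x / 4 rounds down, a run-time request below 4 passes 0 to delay_4cycles, whose subs/bne loop would then wrap around and run for roughly 2^32 iterations. A hypothetical helper (dynamic_iterations, not part of the code above) that rounds up instead keeps the delay from being shorter than requested and lets the caller skip the loop for a zero request:

```c
#include <stdint.h>

/* Hypothetical helper: loop count for a run-time x, rounded up to whole
 * 4-cycle iterations. Returns 0 only for x == 0. Written as divide plus
 * remainder check so it cannot overflow for x close to UINT32_MAX. */
static inline uint32_t dynamic_iterations(uint32_t x)
{
    return x / 4u + ((x % 4u) != 0u ? 1u : 0u);
}
```

In the else branch this could be used as: uint32_t n = dynamic_iterations(x); if (n) delay_4cycles(n);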

And could another instruction be used instead of NOP? Like:

  asm volatile(" add %[x], #0 " : [x] "+r" (x) : )

(?)
