Question

I am writing a C++ template wrapper for GPIO. For STM32 I am using the HAL and LL code as a basis. GPIO initialization boils down to a series of: read register to temp variable -> mask pin-specific bits in temp -> shift and write pin-specific bits in temp -> write temp back to register. The registers are declared volatile.
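The read -> mask -> shift -> write-back sequence described above can be sketched as a small helper. This is only an illustration: `reg_write_field` and its parameters are hypothetical names, not part of HAL or LL, and the register is passed as a plain pointer so the snippet can also be tried on a host machine.

```c
#include <stdint.h>

/* Hypothetical helper (not from HAL/LL): update one bit field of a
 * memory-mapped register with exactly one read and one write. */
static inline void reg_write_field(volatile uint32_t *reg,
                                   uint32_t field_mask, /* unshifted, e.g. 0x3 */
                                   unsigned shift,
                                   uint32_t value)
{
    uint32_t temp = *reg;                   /* read register to temp        */
    temp &= ~(field_mask << shift);         /* mask the pin-specific bits   */
    temp |= (value & field_mask) << shift;  /* shift in the new field value */
    *reg = temp;                            /* write temp back to register  */
}
```

On a real target the first argument would be the address of a peripheral register, for example `&GPIOA->MODER`.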

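Would it make any difference to performance if I first did all the reads from the volatiles, then all the updates, and then all the writes to the volatiles, instead of handling the registers one after another as is done now (in ST's code, for example)? The writes would still be ordered, of course.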

So, going from scenario A:

uint32_t temp;
temp = struct->reg1;
temp |= ...
temp &= ...
struct->reg1 = temp;
temp = struct->reg2;
temp |= ...
temp &= ...
struct->reg2 = temp;

to scenario B:

uint32_t temp1, temp2;
temp1 = struct->reg1;
temp2 = struct->reg2;
temp1 |= ...
temp1 &= ...
temp2 |= ...
temp2 &= ...
struct->reg1 = temp1;
struct->reg2 = temp2;

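Scenario B might use a little (or 4 bytes) more memory, but I would expect it not to interrupt the main program flow as often. Can the code in scenario B be optimized further, for example by combining reads or writes?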

Answer 1

It will not make any difference. Both versions of the code are exactly as efficient:

void zoo(uint32_t val1, uint32_t val2)
{
    uint32_t moder = GPIOA -> MODER;
    uint32_t otyper = GPIOA -> OTYPER;
    moder &= val1;
    moder |= val2;
    otyper &= val1;
    otyper |= val2;
    GPIOA -> MODER = moder;
    GPIOA -> OTYPER = otyper;
}

void boo(uint32_t val1, uint32_t val2)
{
    uint32_t val = GPIOA -> MODER;
    val &= val1;
    val |= val2;
    GPIOA -> MODER = val;
    val = GPIOA -> OTYPER;
    val &= val1;
    val |= val2;
    GPIOA -> OTYPER = val;
}

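It is also a non-issue, because you only access several GPIO registers during initialization. Pin configuration is usually set only at program startup, and sometimes when entering and leaving low-power modes (for example, we set the pins to analog mode to draw as little current as possible). At that stage performance is not the top priority.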

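Normally you access only one register:

BSRR - sets the pins (but this register is write-only)
ODR - sets the outputs and reads back what we set
IDR - the actual pin levels (read-only)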

On some STM microcontrollers BSRR is split into two registers, BRR and BSR, but those are write-only as well.
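As an illustration of why a write-only set/reset register needs no read-modify-write, here is a minimal sketch. It assumes the combined 32-bit BSRR layout, where bits 0..15 set pins and bits 16..31 reset them; the helper names are hypothetical and the register is passed as a plain pointer.

```c
#include <stdint.h>

/* Value to write to BSRR to drive `pin` (0..15) high: bits 0..15 set pins. */
static inline uint32_t bsrr_set_value(unsigned pin)
{
    return (uint32_t)1u << pin;
}

/* Value to write to BSRR to drive `pin` (0..15) low: bits 16..31 reset pins. */
static inline uint32_t bsrr_reset_value(unsigned pin)
{
    return (uint32_t)1u << (pin + 16u);
}

/* A single store and no read: zero bits in the written value leave the
 * other pins untouched on the real peripheral. */
static inline void pin_set(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = bsrr_set_value(pin);
}
```

On a target this would be called as `pin_set(&GPIOA->BSRR, 5)`. Compare that with ODR, where changing one pin requires reading the register, modifying the bit and writing it back.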

IMO you are trying to micro-optimize something that does not need it at all.

https://godbolt.org/z/xWqWo9

Answer 2

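"Would it make any difference to performance if I first did all the reads from the volatiles, then all the updates, and then all the writes to the volatiles, instead of handling the registers one after another as is done now (in ST's code, for example)?"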

So there is no other way than to check! I compiled the following code on godbolt with gcc ARM 8.2 -O3 -mlittle-endian -mthumb -mcpu=cortex-m3:

// based on code from https://github.com/ARM-software/CMSIS
#include <stdint.h>
#define __IO volatile
typedef struct
{
  __IO uint32_t CR;
  __IO uint32_t CSR;
} PWR_TypeDef;
#define PERIPH_BASE           ((uint32_t)0x40000000) /*!< Peripheral base address in the alias region */
#define APB1PERIPH_BASE       PERIPH_BASE
#define PWR_BASE              (APB1PERIPH_BASE + 0x7000)
#define PWR                 ((PWR_TypeDef *) PWR_BASE)

#define  PWR_CR_LPDS                         ((uint16_t)0x0001)     /*!< Low-Power Deepsleep */
#define  PWR_CR_PDDS                         ((uint16_t)0x0002)     /*!< Power Down Deepsleep */
#define  PWR_CR_CWUF                         ((uint16_t)0x0004)     /*!< Clear Wakeup Flag */
#define  PWR_CR_CSBF                         ((uint16_t)0x0008)     /*!< Clear Standby Flag */
#define  PWR_CR_PVDE                         ((uint16_t)0x0010)     /*!< Power Voltage Detector Enable */

#define  PWR_CSR_WUF                         ((uint16_t)0x0001)     /*!< Wakeup Flag */
#define  PWR_CSR_SBF                         ((uint16_t)0x0002)     /*!< Standby Flag */
#define  PWR_CSR_PVDO                        ((uint16_t)0x0004)     /*!< PVD Output */
#define  PWR_CSR_EWUP                        ((uint16_t)0x0100)     /*!< Enable WKUP pin */

void func_separate() {
    // just a meaningless example for testing
    uint32_t temp;
    temp = PWR->CR;
    temp &= PWR_CR_LPDS | PWR_CR_PDDS | PWR_CR_CWUF;
    temp |= PWR_CR_CWUF;
    PWR->CR = temp;
    temp = PWR->CSR;
    temp &= PWR_CSR_WUF | PWR_CSR_SBF;
    temp |= PWR_CSR_PVDO | PWR_CSR_EWUP;
    PWR->CSR = temp;
}

void func_together() {
    uint32_t temp1, temp2;
    temp1 = PWR->CR;
    temp2 = PWR->CSR;
    temp1 &= PWR_CR_LPDS | PWR_CR_PDDS | PWR_CR_CWUF;
    temp1 |= PWR_CR_CWUF;
    temp2 &= PWR_CSR_WUF | PWR_CSR_SBF;
    temp2 |= PWR_CSR_PVDO | PWR_CSR_EWUP;
    PWR->CR = temp1;
    PWR->CSR = temp2;
}

The generated assembly is:

func_separate:
        ldr     r2, .L3
        ldr     r3, [r2]
        and     r3, r3, #7
        orr     r3, r3, #4
        str     r3, [r2]
        ldr     r3, [r2, #4]
        and     r3, r3, #3
        orr     r3, r3, #260
        str     r3, [r2, #4]
        bx      lr
.L3:
        .word   1073770496
func_together:
        ldr     r1, .L6
        ldr     r2, [r1]
        ldr     r3, [r1, #4]
        and     r2, r2, #7
        and     r3, r3, #3
        orr     r2, r2, #4
        orr     r3, r3, #260
        str     r2, [r1]
        str     r3, [r1, #4]
        bx      lr
.L6:
        .word   1073770496

The only difference is the order of the instructions. There is no difference in performance. So: no.

In terms of readability, though, it makes sense to prefer the first version.

Answer 3

In this specific case, it does not matter.

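In general, though, it is advisable not to access a single hardware register across multiple lines when you can avoid it. It is better to build the complete value in a temporary RAM variable and perform only one read and one write on the register.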

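This has less to do with execution time than with the fact that reading and writing hardware registers can have many side effects, such as clearing flags or affecting real-time behavior.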

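In addition, operations such as temp1 |= ... and temp1 &= ... on a temporary variable are easy for the compiler to optimize. It will most likely keep such a variable in a CPU register rather than allocating it on the stack.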

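Another thing worth mentioning is that reads and writes of hardware registers cannot be optimized away or reordered relative to each other, because the registers are volatile-qualified. For that reason you want to keep register accesses to a minimum, both to save a little execution time and to let the compiler optimize the surrounding code more effectively.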
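The advice above can be sketched as follows: collect every change in ordinary local variables first, then touch the register exactly once for reading and once for writing. This is a minimal sketch under assumptions: a MODER-style register with a 2-bit field per pin, and the hypothetical names `pin_mode_t` and `set_pin_modes`. Each pin is expected to appear at most once in the list.

```c
#include <stdint.h>

/* Hypothetical description of one pin's 2-bit mode field. */
typedef struct {
    unsigned pin;   /* 0..15 */
    uint32_t mode;  /* 0..3  */
} pin_mode_t;

static void set_pin_modes(volatile uint32_t *moder,
                          const pin_mode_t *cfg, unsigned count)
{
    uint32_t clear_mask = 0;
    uint32_t set_mask = 0;

    /* Plain arithmetic on locals: the compiler is free to keep these in
     * CPU registers and to optimize the loop as it sees fit. */
    for (unsigned i = 0; i < count; ++i) {
        unsigned shift = cfg[i].pin * 2u;
        clear_mask |= 0x3u << shift;
        set_mask   |= (cfg[i].mode & 0x3u) << shift;
    }

    /* Exactly one volatile read and one volatile write. */
    *moder = (*moder & ~clear_mask) | set_mask;
}
```

Keeping the volatile accesses in a single statement also makes it easy to see in review how often the hardware is actually touched.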

Interleaved updates of volatile variables (registers)?
