Skip to content

Conversation

@Jaromc
Copy link

@Jaromc Jaromc commented Oct 28, 2025

Multi-frame DroneCAN messages with a periph in the loop is experiencing CRC and wrong toggle errors. I believe this is related to #28545. According to DroneCAN and STM32 documentation we should be enabling tx FIFO to maintain the correct order of multi-frame messages.

I can see in the bootloader the FIFO flag is being set. The equivalent is also being set for CANFD. FIFO is not being set though for standard CAN, in my case with an STM32F4xx periph.

From the DroneCAN documentation:

https://dronecan.github.io/Specification/4.1_CAN_bus_transport_layer/

Multi-frame transfers use identical CAN ID for all frames of the transfer, and DroneCAN requires that all frames of a multi-frame transfer should be transmitted in order. Therefore, the CAN controller driver software must ensure that CAN frames with identical CAN ID must be transmitted in their order of appearance in the TX queue. Some hardware will not meet this requirement by default, so the designer must take special care to ensure correct behavior, and apply workarounds if necessary.

stm32f4xx 34.7.1 docs:

Transmit priority
• By identifier
When more than one transmit mailbox is pending, the transmission order is given
by the identifier of the message stored in the mailbox. The message with the
lowest identifier value has the highest priority according to the arbitration of the
CAN protocol. If the identifier values are equal, the lower mailbox number is
scheduled first.

• By transmit request order
The transmit mailboxes can be configured as a transmit FIFO by setting the TXFP
bit in the CAN_MCR register. In this mode, the priority order is given by the
transmit request order. This mode is very useful for segmented transmission.

So we must set the TXFP flag to transmit by request order.

In my testing watching the stats panel in DroneCAN GUI I can see a drop in "Rx Error Wrong Toggle" and "RX Error Bad CRC" on the Autopilot Node when I've applied this patch.

@tpwrules
Copy link
Contributor

I'm not sure the central thesis of this PR is correct. We have logic here to ensure that we don't put more than one message with the same ID into the hardware buffers, thereby avoiding the reordering issue while more closely preserving CAN priority semantics.

I don't doubt that enabling the FIFO changes things up and may smooth out bus traffic, perhaps even because priority is hurting in your application (maybe timeouts are not set up correctly for some messages). There's also some chance the software logic has broken. But in any case it's not a complete fix because the software logic will still avoid putting more than one part of a multi-part message in the hardware buffers at once, thereby negating much of the benefit of FIFO mode.

On the other hand, FDCAN using FIFO mode successfully for a long time is an indication that priority might not matter as much as the bxCAN driver author thought, so we could just turn on FIFO mode and remove the priority preserving logic. It's worth remembering that on average FDCAN is used on the FC side and bxCAN is used on the periph side, and the usage patterns are generally different.

This will require careful thought and testing. I would love to know more about your bus setup and testing conditions to replicate the failure and success cases here, it would be good info to understand what's really going on at the bus level and get data points to make the decision of which CAN driver is right.

@joshanne
Copy link
Contributor

Adding the link to the discord discussion that is also ongoing: https://discord.com/channels/674039678562861068/800819275069390920/1432584750920503429

Copy link
Contributor

@andyp1per andyp1per left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't see any downsides to this. We already use FIFO in other areas of serial comms. I agree it may not fully solve the problem, but seems like a step in the right direction.

@tpwrules
Copy link
Contributor

tpwrules commented Oct 28, 2025

The downside is that it breaks CAN priority semantics. A high priority message can get stuck if the next message to transmit is low priority and the bus is busy with medium priority messages.

Just turning on the hardware FIFO will also not allow any more slots to be used due to the priority-based scheduling logic. If we want to go FIFO we should remove that too.

@tpwrules
Copy link
Contributor

tpwrules commented Oct 28, 2025

From Discord:

So a setup of:

  • AutoPilot (H7)
  • Periph (F4xx)
    • MAVLink Forwarding (Targetted Tunnel is a Lower Priority)
    • Inertial Sensor (INS is higher priority in Periph)
  • Companion Computer (with MAVLink Capability)

We're monitoring some inertial data in the AP and passing MAVLink traffic in and out of the companion computer.
We've got Debug stats enabled on Periph and the AutoPilot and we observe "Rx Error Wrong Toggle" and "RX Error Bad CRC" in the autopilot.

When the patch is applied, these Wrong Toggle and Bad CRCs are gone completely.

The best I can tell is the INS is running in a separate thread, and so the FIFO enabling is at least ensuring that all the packets are sent in the order they are intended.

The MAVLink isn't high rate data, we're only talking a handful of HEARTBEATS, MCU_STATUS, etc. It doesn't appear to be timeouts occurring on the Tx side (We've got Tx Timeouts on the AP side we've observed when sending amounts of data that end up causing a timeout to occur).

@tpwrules
Copy link
Contributor

Have you enabled and checked the can_stats.txt on the autopilot? What does that show?

@Jaromc
Copy link
Author

Jaromc commented Oct 29, 2025

Thanks @tpwrules, I've dug into this further today. The priority logic you mentioned doesn't get called in the processTx periph logic. The select function used here isn't added to processTx for for the periph. I'm not sure if there was a reason for omitting it.

I did enable can_stats.txt and it didn't show anything more than the results I saw through the DroneCan GUI stats panel so there's nothing significant to add there.

Today I've experimented adding the select function in and removing this PR. I need to investigate this option further.

@tpwrules
Copy link
Contributor

If it's true the priority logic never ran on periph then it's probably the correct move to just delete the logic and turn on the FIFO in hardware.

However, this will affect bxCAN-based flight controllers. I have had trouble sourcing one to test with. What F4 Periph-with-IMU are you using?

@Jaromc
Copy link
Author

Jaromc commented Oct 29, 2025

I'm using a custom F4 solution but anything that's not a H7 or G4 should replicate it. CAN periph with an IMU attached or flight controller with a modified hwdef. This periph would be ideal but it's EOL.

@Jaromc
Copy link
Author

Jaromc commented Oct 30, 2025

I've done a bit more experimenting with adding in the select function just out of curiosity. Yes it does work however throughput is reduced because we end up using only one mailbox mostly. I know I've hit the throughput limit because my tx_errors and rx_error_oom climb due to hitting the CAN pool limits as the Tx queue grows.

In contrast, enabling FIFO in hardware I use all 3 mailboxes allowing the queue to processed faster.

@tpwrules
Copy link
Contributor

I'm working again on getting hardware to test and going through the code myself. I'm more on the side of just using FIFO mode now but this patch should be augmented to pull out the unused priority logic. Thanks for digging deeper!

I also still need to talk to @tridge .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants